Two points to note when handling encoding with Python urllib2

Original

2018-09-03 02:16

urllib2 can fetch web pages. To imitate a browser, add the following headers:

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh,en-US;q=0.8,en;q=0.6
Connection:keep-alive
Host:www.baidu.com
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36

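Pass the headers as a dict argument. Because the request asks for gzip, the response has to be decompressed; the alternative is not to request HTTP gzip compression at all.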
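As a minimal sketch, the headers listed above could be collected into a dict like this. The helper name `build_headers` and its `accept_gzip` flag are hypothetical, introduced only for illustration. The sketch advertises only `gzip` rather than the full `gzip,deflate,sdch`, on the assumption that a client should only ask for encodings it is prepared to decode; passing `accept_gzip=False` corresponds to not requesting gzip at all.

```python
def build_headers(accept_gzip=True):
    """Return browser-like request headers as a dict (hypothetical helper)."""
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;'
                  'q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'zh,en-US;q=0.8,en;q=0.6',
        'Connection': 'keep-alive',
        'Host': 'www.baidu.com',
        'User-Agent': ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 '
                       'Chrome/37.0.2062.120 Safari/537.36'),
    }
    if accept_gzip:
        # Assumption: advertise only gzip, the one encoding that gets decompressed.
        headers['Accept-Encoding'] = 'gzip'
    return headers


headers = build_headers()                      # response may be gzip-compressed
plain_headers = build_headers(accept_gzip=False)  # response should arrive uncompressed
```

The resulting dict is what gets passed as `headers=` to `urllib2.Request`.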

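import gzip
import urllib2
from StringIO import StringIO

# url is the page to fetch; headers is the dict of request headers listed above.
req = urllib2.Request(url, headers=headers)
resp = urllib2.urlopen(req)

# Handle gzip compression.
# Because the request imitates Chrome, the server may return a gzip-compressed
# body, and urllib2 does not decompress it automatically. StringIO and gzip are
# used to obtain the decompressed string. Otherwise decoding fails with:
# UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte
if resp.info().get('Content-Encoding') == 'gzip':
    buf = StringIO(resp.read())
    f = gzip.GzipFile(fileobj=buf)
    content = f.read()
else:
    content = resp.read()

# Decode to unicode using the charset the page actually declares.
content_type = resp.headers.get('content-type', '')
if 'charset=' in content_type:
    encoding = content_type.split('charset=')[-1]
else:
    encoding = 'utf-8'  # assumption: fall back to UTF-8 when no charset is declared
ucontent = unicode(content, encoding)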
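The two steps (decompress, then decode with the declared charset) can also be tried without a network connection. The sketch below is an illustration rather than the post's own code: the helper names `decompress_body` and `charset_from_content_type` are hypothetical, the sample body is made up, and it uses `io.BytesIO` and `bytes.decode` in place of `StringIO` and `unicode` so that it also runs on Python 3. The UTF-8 default is an assumption for responses that declare no charset.

```python
import gzip
import io


def decompress_body(raw, content_encoding):
    """Return the decompressed body if it was gzip-encoded, else return it unchanged."""
    if content_encoding == 'gzip':
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw


def charset_from_content_type(content_type, default='utf-8'):
    """Extract the charset parameter from a Content-Type value (hypothetical helper)."""
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.lower() == 'charset' and value:
            return value.strip('"\'')
    return default  # assumption: fall back to UTF-8 when no charset is declared


# Build a made-up gzip-compressed body to stand in for resp.read().
text = u'<html>\u4f60\u597d</html>'
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(text.encode('utf-8'))
raw = buf.getvalue()

# A gzip stream starts with the bytes 0x1f 0x8b; the 0x8b at position 1 is
# the byte named in the UnicodeDecodeError quoted above.
assert raw[:2] == b'\x1f\x8b'

content = decompress_body(raw, 'gzip')
encoding = charset_from_content_type('text/html; charset=utf-8')
ucontent = content.decode(encoding)
```

Parsing the `charset` parameter by splitting on `;` avoids returning the whole header value when the server sends a Content-Type such as `text/html` with no charset.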

Reference:

http://stackoverflow.com/questions/3947120/does-python-urllib2-automatically-uncompress-gzip-data-fetched-from-webpage
