[Repost] A Code-Level Analysis of Chinese Encoding Problems in the Python requests Library

Reposted from: http://xiaorui.cc/2016/02/19/%E4%BB%A3%E7%A0%81%E5%88%86%E6%9E%90python-requests%E5%BA%93%E4%B8%AD%E6%96%87%E7%BC%96%E7%A0%81%E9%97%AE%E9%A2%98/

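This is a summary of problems I ran into while using Python requests on proxy crawler nodes that fetch sites in many different character sets. In short: garbled Chinese text. If you only crawl Weibo, WeChat, or e-commerce sites, the charset is easy to determine, and you can even hard-code the encoding. A public-opinion monitoring system, however, fetches sensitive data from hundreds of thousands of different websites every day, so we need a more reliable way to determine the character encoding and avoid garbled Chinese.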

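This article is a bit disorganized, so criticism is welcome! It is also updated from time to time; please check the original post for the latest version: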

http://xiaorui.cc/2016/02/19/%E4%BB%A3%E7%A0%81%E5%88%86%E6%9E%90python-requests%E5%BA%93%E4%B8%AD%E6%96%87%E7%BC%96%E7%A0%81%E9%97%AE%E9%A2%98/

Let's start with this example. You will notice a few interesting things.

#blog: xiaorui.cc
In [9]: r = requests.get('http://cn.python-requests.org/en/latest/')

In [10]: r.encoding
Out[10]: 'ISO-8859-1'

In [11]: type(r.text)
Out[11]: unicode

In [12]: type(r.content)
Out[12]: str

In [13]: r.apparent_encoding
Out[13]: 'utf-8'

In [14]: chardet.detect(r.content)
Out[14]: {'confidence': 0.99, 'encoding': 'utf-8'}

The first question: why does an encoding like ISO-8859-1 show up at all?

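What is ISO-8859-1? It is also known as Latin-1, the "Western European" character set. To me this is a bug in requests, and the issues on the requests GitHub repository show that Chinese users are not the only ones who have reported it. The maintainers' reply, however, is that the behavior follows the HTTP RFC by design.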
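To make the symptom concrete, here is a minimal Python 3 sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1. The sample string is made up for illustration and no network access is involved. Because ISO-8859-1 assigns a character to every byte value, the wrong decode never raises an error; it silently produces mojibake.

```python
original = "中文编码"

# Bytes as a server using UTF-8 would send them.
raw = original.encode("utf-8")

# What a client sees if it assumes ISO-8859-1: each byte becomes one character.
garbled = raw.decode("iso-8859-1")
print(len(original), len(raw), len(garbled))

# ISO-8859-1 maps all 256 byte values, so the wrong decode is lossless
# and the original text can still be recovered from the garbled string.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == original)
```

This is also why the problem is easy to miss: nothing fails until a human looks at the text.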

Let's read the requests source code to see how this comes about!

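requests takes the character encoding from the Content-Type response header returned by the server. Only when Content-Type carries a charset parameter can requests identify the encoding correctly; otherwise, for text content types, it falls back to the default ISO-8859-1. Non-conforming pages often have exactly this problem.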

In [52]: r.headers

Out[52]: {'content-length': '16907', 'via': 'BJ-H-NX-116(EXPIRED), http/1.1 BJ-UNI-1-JCS-116 ( [cHs f ])', 'ser': '3.81', 'content-encoding': 'gzip', 'age': '23', 'expires': 'Fri, 19 Feb 2016 07:36:25 GMT', 'vary': 'Accept-Encoding', 'server': 'JDWS', 'last-modified': 'Fri, 19 Feb 2016 07:35:25 GMT', 'connection': 'keep-alive', 'cache-control': 'max-age=60', 'date': 'Fri, 19 Feb 2016 07:35:31 GMT', 'content-type': 'text/html;'}

File: requests/utils.py

#blog: xiaorui.cc
def get_encoding_from_headers(headers):
    """Get the encoding from the given HTTP headers dict."""
    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
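The rule in that function can be tried out without requests or a network connection. The following is a rough Python 3 sketch that mirrors the same three branches using only the standard library's email.message module; the function name encoding_from_content_type is made up for this example, and it is not how requests itself is implemented.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Mirror the header rule shown above: charset, else ISO-8859-1 for text."""
    if not content_type:
        return None

    # Let the standard library split the media type from its parameters.
    msg = Message()
    msg["Content-Type"] = content_type

    charset = msg.get_param("charset")
    if charset:
        return str(charset).strip("'\"")

    # No charset parameter: text/* media types get the Latin-1 default.
    if "text" in msg.get_content_type():
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html; charset=gbk"))
print(encoding_from_content_type("text/html;"))
print(encoding_from_content_type("application/octet-stream"))
```

The second call corresponds to the 'text/html;' header in the captured response above: a text type with no charset, which is exactly the case that ends up as ISO-8859-1.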

The second question: how do we get the correct encoding?

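The response object returned by requests has an apparent_encoding property, which identifies the text encoding by calling chardet.detect(). Be aware, though, that this costs a fair amount of CPU.
To see why, take a look at chardet's source code.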

#blog: xiaorui.cc
@property
def apparent_encoding(self):
    """Use chardet to work out the encoding."""
    return chardet.detect(self.content)['encoding']
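Since detection is expensive, one possible way to run it less often is to try a strict UTF-8 decode first and only fall back to a detector when that fails. This is not something the original article or requests does; it is a sketch under the assumption that text in other multibyte encodings rarely happens to be valid UTF-8. The helper name quick_utf8_check is hypothetical.

```python
def quick_utf8_check(raw):
    """Return 'utf-8' if raw is valid UTF-8, else None (caller should detect)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


utf8_body = "中文".encode("utf-8")
gbk_body = "中文".encode("gbk")

print(quick_utf8_check(utf8_body))
print(quick_utf8_check(gbk_body))
```

A caller would use the returned name when it is not None and hand the bytes to chardet only in the remaining cases.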

The third question: what is the difference between r.text and r.content in requests?

Once requests has fetched a resource, we can read the body in two ways: r.text and r.content. So what is the difference between them?

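Reading the requests source shows that r.text returns the data decoded to Unicode, while r.content returns the raw bytes. In other words, r.content is cheaper than r.text: r.content hands back the bytes as they are, whereas r.text decodes them to Unicode. When r.encoding is None, meaning the headers gave requests nothing to derive an encoding from, r.text also calls chardet to guess the encoding, which costs CPU again.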

The difference between text and content is clear from the requests code.

File: requests/models.py

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library"""
    return chardet.detect(self.content)['encoding']

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        try:
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')

            if self.status_code == 0:
                self._content = None
            else:
                self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

        except AttributeError:
            self._content = None

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # When the encoding is None, use chardet to guess it.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        # A TypeError can be raised if encoding is None
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
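The relationship between the two properties can be illustrated with a toy class. The sketch below (Python 3) is a deliberately simplified stand-in, not the real requests Response: MiniResponse and its UTF-8 placeholder for the chardet guess are assumptions made for this example. It shows that the bytes never change, while the decoded text depends on whatever the encoding attribute holds at the moment text is read.

```python
class MiniResponse:
    """Toy stand-in for a response object; not the real requests class."""

    def __init__(self, content, encoding=None):
        self.content = content      # raw bytes, never modified
        self.encoding = encoding    # what the headers claimed, may be None

    @property
    def text(self):
        # Decoding happens on every access, using the current encoding.
        # UTF-8 here is only a placeholder for the chardet guess.
        encoding = self.encoding or "utf-8"
        return self.content.decode(encoding, errors="replace")


r = MiniResponse("中文".encode("gbk"), encoding="ISO-8859-1")
wrong = r.text          # GBK bytes read as Latin-1: mojibake
r.encoding = "gbk"      # correct the encoding, then read text again
right = r.text
print(repr(wrong), repr(right))
```

The same pattern applies to the real object: fixing the encoding attribute before reading the text is enough, and the raw bytes stay available for any other decoding strategy.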

There are several ways to fix garbled Chinese in requests.

Method 1:

Because content is the raw byte string of the HTTP response, you can decode it to Unicode using the charset from the headers, provided that charset is not ISO-8859-1.

In [96]: r.encoding
Out[96]: 'gbk'

In [98]: print r.content.decode(r.encoding)[200:300]
="keywords" content="Python數據分析與挖掘實戰,,機械工業出版社,9787111521235,,在線購買,折扣,打折"/>

There is also a particularly crude approach: decode with whatever chardet reports and encode the result as UTF-8.

#http://xiaorui.cc
In [22]: r = requests.get('http://item.jd.com/1012551875.html')

In [23]: print r.content
KeyboardInterrupt

In [23]: r.apparent_encoding
Out[23]: 'GB2312'

In [24]: r.encoding
Out[24]: 'gbk'

In [25]: r.content.decode(r.apparent_encoding).encode('utf-8')
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-25-918324cdc053> in <module>()
----> 1 r.content.decode(r.apparent_encoding).encode('utf-8')

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 49882-49883: illegal multibyte sequence

In [27]: r.content.decode(r.apparent_encoding,'replace').encode('utf-8')
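The failure in that session is typical: the detector reports GB2312, but the page contains characters that only exist in the larger GBK set. Besides decoding with 'replace' as shown, another option is to widen the codec to a superset before giving up. The sketch below (Python 3) assumes that treating GB2312 and GBK as GB18030 is acceptable for your data; the SUPERSETS table and the decode_lenient helper are made up for this example and are not part of requests or chardet.

```python
# Assumed superset table: an illustrative choice, not something requests does.
SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}


def decode_lenient(raw, encoding):
    """Decode raw bytes, widening GB2312/GBK to GB18030 before giving up."""
    name = SUPERSETS.get((encoding or "").lower(), encoding or "utf-8")
    try:
        return raw.decode(name)
    except LookupError:
        # Unknown codec name: fall back to UTF-8 with replacement characters.
        return raw.decode("utf-8", errors="replace")
    except UnicodeDecodeError:
        # Last resort, as in the session above: mark the bad bytes and go on.
        return raw.decode(name, errors="replace")


raw = "朱镕基".encode("gbk")   # the middle character is in GBK but not GB2312
try:
    raw.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(strict_ok)
print(decode_lenient(raw, "GB2312"))
```

Widening keeps the characters that 'replace' would turn into U+FFFD, at the cost of trusting the detector's guess about the encoding family.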

If you intend to use text and already know the site's encoding, set r.encoding = 'xxx'. Once the encoding is specified, requests decodes text with the encoding you set.

>>> import requests
>>> r = requests.get('https://up.xiaorui.cc')
>>> r.text
>>> r.encoding
'gbk'
>>> r.encoding = 'utf-8'

Method 2:

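In my experience crawling hundreds of thousands of websites, most sites are fairly well-behaved: if the headers carry no charset, extract it from the meta tag in the HTML.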

In [78]: s
Out[78]: ' <meta http-equiv="Content-Type" content="text/html; charset=gbk"'

In [79]: b = re.compile("<meta.*content=.*charset=(?P<charset>[^;\s]+)", flags=re.I)

In [80]: b.search(s).group(1)
Out[80]: 'gbk"'

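The quick regex above is rough (it also captures the trailing quote), but requests' utils.py already ships a fairly complete function for extracting the meta charset from HTML. In the end it is just a handful of regular expressions.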

In [32]: requests.utils.get_encodings_from_content(r.content)
Out[32]: ['gbk']

File: requests/utils.py

def get_encodings_from_content(content):
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
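For readers who want to experiment with meta sniffing on raw bytes, here is a small standalone Python 3 sketch. The pattern is a simplified one written for this example and is not one of the requests patterns above; sniff_meta_charset and its 4096-byte limit are likewise assumptions. Searching the bytes directly avoids having to pick an encoding before the encoding is known.

```python
import re

# Simplified pattern written for this example; requests' own patterns differ.
META_CHARSET_RE = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.I)


def sniff_meta_charset(raw, limit=4096):
    """Return the charset declared in a <meta> tag near the top of raw, or None."""
    match = META_CHARSET_RE.search(raw[:limit])
    if match is None:
        return None
    # Charset names are ASCII, so decoding the captured group is safe.
    return match.group(1).decode("ascii").lower()


old_style = b'<meta http-equiv="Content-Type" content="text/html; charset=gbk">'
html5_style = b'<meta charset="UTF-8">'

print(sniff_meta_charset(old_style))
print(sniff_meta_charset(html5_style))
print(sniff_meta_charset(b"<p>no meta tag here</p>"))
```

Limiting the search to the first few kilobytes keeps the cost low on large pages, on the assumption that the declaration appears in the document head.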

Finally, a summary of the garbled-Chinese problem in requests:

Use one encoding throughout: either convert everything to UTF-8, or use Unicode as the intermediate representation!

Sites in China generally use utf-8, gbk, or gb2312. When requests reports one of these as the encoding, the content can be decoded to Unicode directly.

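But when the encoding turns out to be ISO-8859-1, you can combine a regex match with chardet to work out the real encoding. This logic can be wrapped up and applied as a monkey patch.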

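import chardet
import requests

def monkey_patch():
    prop = requests.models.Response.content

    def content(self):
        _content = prop.fget(self)
        # Only step in when requests fell back to its ISO-8859-1 default.
        if _content and self.encoding == 'ISO-8859-1':
            # Latin-1 decoding never fails, so the regexes get text to search.
            encodings = requests.utils.get_encodings_from_content(
                _content.decode('ISO-8859-1'))
            if encodings:
                encoding = encodings[0]
            else:
                # Call chardet directly: self.apparent_encoding reads
                # self.content and would recurse into this patched property.
                encoding = chardet.detect(_content)['encoding'] or 'utf-8'
            _content = _content.decode(encoding, 'replace').encode('utf8', 'replace')
            self._content = _content
            # The body is now UTF-8, so r.text must decode it as UTF-8.
            self.encoding = 'utf-8'
        return _content

    requests.models.Response.content = property(content)

monkey_patch()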
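If patching the library globally feels too invasive, the same decision order (header charset, then the meta tag, then a detector) can live in a plain helper. The Python 3 sketch below is only an illustration under stated assumptions: body_to_text, the simplified regex, and the optional detect callable are all made up for this example, and the detector is passed in so the code runs without chardet installed.

```python
import re

# Simplified <meta> pattern for this sketch; not taken from requests.
_META_RE = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.I)


def body_to_text(raw, header_encoding, detect=None):
    """Decode raw using the header charset, then <meta>, then a detector."""
    encoding = header_encoding
    # Distrust a missing charset and the ISO-8859-1 fallback alike.
    if not encoding or encoding.upper() == "ISO-8859-1":
        match = _META_RE.search(raw[:4096])
        if match:
            encoding = match.group(1).decode("ascii")
        elif detect is not None:
            encoding = detect(raw) or encoding
    try:
        return raw.decode(encoding or "utf-8", errors="replace")
    except LookupError:
        # The page declared a codec name Python does not know.
        return raw.decode("utf-8", errors="replace")


page = '<meta charset="gbk"><p>中文</p>'.encode("gbk")
print(body_to_text(page, "ISO-8859-1"))
```

With a real response this would be called as body_to_text(r.content, r.encoding, detect=lambda b: chardet.detect(b)['encoding']), which leaves the Response class itself untouched.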

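Python 3.x removes much of the str/unicode confusion, but the ISO-8859-1 fallback comes from requests itself rather than from the interpreter, so it is still worth checking r.encoding there. If you are still on Python 2.6 or 2.7, you need the methods above to fix garbled Chinese.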

END.
