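Background: an earlier post covered how to fix garbled Chinese in requests responses, but a few things still puzzled me and I wanted answers. Whether I sent the request with a content-type: text/html;charset=utf-8 header, or the returned page source itself declared charset=utf-8, my experiments showed the same thing: the Response object's headers did not contain the charset I had set, and response.encoding was not the charset declared in the page either. That seemed odd, so let's dig into the source code.
First, the source of the content property on requests' Response: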
@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content
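As the code shows, the content property never deals with encoding at all; it only returns the raw bytes. My guess was therefore that requests works out the encoding with chardet, but the actual result did not match that expectation!
The encoding on a response is an attribute of the Response object. Its source comment reads #: Encoding to decode with when accessing r.text., so it exists for decoding the text property. That is why, in most cases, you can take the page source from the content property and decode it once yourself to get readable Chinese.
Next, the source of the text property: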
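To make the difference concrete, here is a minimal sketch that uses only the standard library. The byte string is a made-up sample standing in for r.content, not a real response, and the variable names are purely illustrative.

```python
# Made-up sample standing in for r.content: UTF-8 encoded page text.
raw = "中文亂碼".encode("utf-8")

# Decoding with the ISO-8859-1 (Latin-1) default never fails,
# but it yields mojibake instead of the original characters.
garbled = raw.decode("ISO-8859-1")

# Decoding the bytes yourself with the page's real charset gives the text.
decoded = raw.decode("utf-8")

# Garbled text can be repaired by re-encoding as Latin-1 and decoding again,
# because Latin-1 maps every byte to exactly one code point.
repaired = garbled.encode("ISO-8859-1").decode("utf-8")

print(garbled == decoded)   # False
print(repaired == decoded)  # True
```

The last step is the "encode, then decode" workaround; decoding content directly avoids the round trip.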
@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content
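In the middle there is encoding = self.encoding, which reads the response's encoding attribute, followed by a check for whether that attribute is None. Debugging shows otherwise: printing self.encoding just before the if gives a value, ISO-8859-1, so the code below it that computes the encoding never runs. Setting that aside for now, let's step into apparent_encoding, which is also a property. Its source follows, with a debug line added to print what it is about to return: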
@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library."""
    # Debug line added for this post: show the full detection result.
    print("What is this: {}".format(chardet.detect(self.content)))
    return chardet.detect(self.content)['encoding']
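The argument passed in is the value of the content property (the received response body), and the output is {'encoding': 'utf-8', 'language': '', 'confidence': 0.99}. The encoding key of that dict is exactly utf-8, so, all being well, the encoding should be utf-8, and the content returned at the bottom of the text property would be the response text decoded as UTF-8.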
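If I change the if condition in the text property's source so that the apparent_encoding branch always runs, I get the correct encoding, utf-8. Whatever encoding the page uses, this basically always produces correct Chinese output!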
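The branch behaviour described above can be sketched without requests at all. In this rough approximation, decode_body and guess_encoding are hypothetical names; guess_encoding is only a crude stand-in for chardet (it tries UTF-8 and otherwise falls back to ISO-8859-1), not how chardet actually works.

```python
def guess_encoding(body):
    """Hypothetical stand-in for chardet: try UTF-8, else fall back to Latin-1."""
    try:
        body.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "ISO-8859-1"


def decode_body(body, encoding):
    """Mirror the branches of the text property for a given declared encoding."""
    if not body:
        return ""
    # Detection is only consulted when no encoding was set at all.
    if encoding is None:
        encoding = guess_encoding(body)
    try:
        return str(body, encoding, errors="replace")
    except (LookupError, TypeError):
        # Unknown encoding name: decode with Python's default (UTF-8).
        return str(body, errors="replace")


body = "你好".encode("utf-8")  # made-up UTF-8 response body

print(decode_body(body, "ISO-8859-1") == "你好")  # False: the default wins
print(decode_body(body, None) == "你好")          # True: detection is used
```

With encoding already set to ISO-8859-1 the detection branch is skipped, which matches what the debugging showed.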
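So at this point I strongly suspected a bug. Then again, everyone uses requests without trouble, so how could it be a bug? Let's keep digging...
That leaves one question: where does the value of the encoding attribute come from after the response arrives? To find out, here are a few more pieces of source: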
def get_encodings_from_content(content):
    """Returns encodings from given content string.

    :param content: bytestring to extract encodings from.
    """
    warnings.warn((
        'In requests 3.0, get_encodings_from_content will be removed. For '
        'more information, please see the discussion on issue #2266. (This'
        ' warning should only appear once.)'),
        DeprecationWarning)

    # print("encoding from content:", content)
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))

def _parse_content_type_header(header):
    """Returns content type and parameters from given header

    :param header: string
    :return: tuple containing content type and dictionary of
         parameters
    """

    tokens = header.split(';')
    # print("split header:", tokens)
    content_type, params = tokens[0].strip(), tokens[1:]
    params_dict = {}
    items_to_strip = "\"' "

    for param in params:
        param = param.strip()
        if param:
            key, value = param, True
            index_of_equals = param.find("=")
            if index_of_equals != -1:
                key = param[:index_of_equals].strip(items_to_strip)
                value = param[index_of_equals + 1:].strip(items_to_strip)
            params_dict[key.lower()] = value
    return content_type, params_dict

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    # print("encoding from headers:", headers)
    # headers={"content-type":"text/html;charset=utf-9"}
    content_type = headers.get('content-type')
    # print(content_type)

    if not content_type:
        return None

    content_type, params = _parse_content_type_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
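
The header rule in get_encoding_from_headers above can be tried out with the standard library alone. The sketch below is an approximation, not the requests implementation: encoding_from_content_type is a hypothetical helper that parses the header with email.message.Message, and it checks the main type for "text" rather than doing the substring test used in the source.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Approximate the header rule: explicit charset, else Latin-1 for text."""
    if not content_type:
        return None
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_param("charset")  # None when no charset parameter
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html; charset=utf-8"))  # utf-8
print(encoding_from_content_type("text/html"))                 # ISO-8859-1
print(encoding_from_content_type("application/octet-stream"))  # None
print(encoding_from_content_type(None))                        # None
```

The second call is the case seen in the debugging: a text content type with no charset parameter yields ISO-8859-1 regardless of what the page body declares.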
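So it is not a bug. The encoding attribute ultimately comes from the last if in get_encoding_from_headers in utils.py: a text content type with no charset yields ISO-8859-1. As for why the response headers carry no ;charset=utf-8 even though the request was sent with content-type: text/html;charset=utf-8, I still need to read more of the source to be sure; note, though, that a request's Content-Type describes the request body, while the response's Content-Type is set by the server. In the end I modified the source: I changed the if condition in the text property to is not None so that the default encoding is not used, and text no longer needs the encode-then-decode step from the previous post to produce correct Chinese. Here is a link to someone else's analysis of the same garbled-Chinese problem with requests responses: https://www.cnblogs.com/mswei/p/9835370.html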
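
As a possible alternative to editing the installed library, the text docstring itself suggests setting r.encoding before reading r.text. The sketch below assumes only that the object has headers, encoding and apparent_encoding attributes, as a requests Response does. prefer_detected_encoding is a hypothetical helper, and the SimpleNamespace object is a stand-in so the example runs without a network call.

```python
from types import SimpleNamespace


def prefer_detected_encoding(response):
    """Use the detected encoding when the server declared no charset."""
    content_type = response.headers.get("content-type", "")
    if "charset" not in content_type.lower():
        # No charset in the header, so the ISO-8859-1 default is only a guess.
        response.encoding = response.apparent_encoding
    return response


# Stand-in for a real response (hypothetical values for illustration).
fake = SimpleNamespace(
    headers={"content-type": "text/html"},
    encoding="ISO-8859-1",
    apparent_encoding="utf-8",
)
prefer_detected_encoding(fake)
print(fake.encoding)  # utf-8

# With requests installed, usage would look roughly like:
#   r = requests.get(url)
#   prefer_detected_encoding(r)
#   print(r.text)
```

This keeps the library untouched, so the behaviour survives an upgrade of requests.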