requests source code analysis: what does the text property of the Response class actually do, and why does Chinese come out garbled?

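Background: an earlier post covered a fix for garbled Chinese in requests responses, but I still had some nagging questions and wanted answers. Whether I sent the request with a content-type: text/html;charset=utf-8 header, or the returned page source itself declared charset=utf-8, my experiments showed that the response headers never contained the charset I had defined, and response.encoding was not the charset declared in the page either. That seemed odd, so let's dig into the source: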

First, let's look at the source of the content property on the requests Response class:

    @property
    def content(self):
        """Content of the response, in bytes."""

        if self._content is False:
            # Read the contents.
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')

            if self.status_code == 0 or self.raw is None:
                self._content = None
            else:
                self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

        self._content_consumed = True
        # don't need to release the connection; that's been handled by urllib3
        # since we exhausted the data.
        return self._content

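As shown above, the content property never touches the encoding at all: it simply returns the raw bytes. My first guess was therefore that requests works out the encoding with chardet, but the actual behaviour did not match that expectation!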

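The encoding of a response is an attribute on the Response object, and its source comment reads #: Encoding to decode with when accessing r.text. In other words, it exists for decoding in the text property. That is why, more often than not, people take the page source from the content property and decode it themselves to get proper Chinese.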
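
To make the difference concrete, here is a minimal sketch that needs no network access. It assumes the server sent UTF-8 bytes while the encoding attribute ended up as ISO-8859-1; the variable names (raw, garbled, fixed) are made up for illustration, and raw stands in for what r.content would hold.

```python
# Simulated response body: UTF-8 bytes, standing in for r.content.
raw = "中文亂碼".encode("utf-8")

# Roughly what r.text does when r.encoding is 'ISO-8859-1':
# every byte is mapped to one Latin-1 character, so the text is garbled.
garbled = raw.decode("ISO-8859-1", errors="replace")

# Decoding the raw bytes with the charset the page really uses.
fixed = raw.decode("utf-8")

# The "encode, then decode again" trick: undo the wrong decoding first.
recovered = garbled.encode("ISO-8859-1").decode("utf-8")

print(repr(garbled))
print(fixed)
print(recovered == fixed)
```

The round trip works because ISO-8859-1 maps every byte value to exactly one character, so encoding the garbled string back gives the original bytes.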

Next, let's look at the source of the text property:

    @property
    def text(self):
        """Content of the response, in unicode.

        If Response.encoding is None, encoding will be guessed using
        ``chardet``.

        The encoding of the response content is determined based solely on HTTP
        headers, following RFC 2616 to the letter. If you can take advantage of
        non-HTTP knowledge to make a better guess at the encoding, you should
        set ``r.encoding`` appropriately before accessing this property.
        """

        # Try charset from content-type
        content = None
        encoding = self.encoding

        if not self.content:
            return str('')

        # Fallback to auto-detected encoding.
        if self.encoding is None:
            encoding = self.apparent_encoding

        # Decode unicode from given encoding.
        try:
            content = str(self.content, encoding, errors='replace')
        except (LookupError, TypeError):
            # A LookupError is raised if the encoding was not found which could
            # indicate a misspelling or similar mistake.
            #
            # A TypeError can be raised if encoding is None
            #
            # So we try blindly encoding.
            content = str(self.content, errors='replace')

        return content

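In the middle, encoding is assigned from self.encoding, an attribute of the response, and then the code checks whether that attribute is None. Debugging says otherwise: printing self.encoding just before the if shows that it does have a value, ISO-8859-1, so the fallback below that computes the encoding never runs. Setting that aside for now, let's step into apparent_encoding, which is also a property. Its source is below, with a debug print added before the return: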

    @property
    def apparent_encoding(self):
        """The apparent encoding, provided by the chardet library."""
        # Debug output added for this article: show the full detection result.
        print("chardet result: {}".format(chardet.detect(self.content)))
        return chardet.detect(self.content)['encoding']

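The argument is the value of the content property (the received response body), and the output is {'encoding': 'utf-8', 'language': '', 'confidence': 0.99}. The property returns the encoding entry of that dict, utf-8. Had this branch run, encoding would have been utf-8, and the content returned at the bottom of the text property would be the response text decoded as UTF-8.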

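If I edit the if condition in the text property so that the fallback branch runs even though self.encoding is set, apparent_encoding is evaluated and the result is the correct encoding, utf-8. Whatever encoding the page is actually served in, this basically always yields correct Chinese output!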
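
The decision order of the text property can be sketched on plain bytes. This is only an illustration under stated assumptions, not the library's code: decode_like_text is a hypothetical helper, declared plays the role of r.encoding, and detected plays the role of r.apparent_encoding (hard-coded here instead of calling chardet).

```python
def decode_like_text(body, declared=None, detected="utf-8"):
    """Mimic the order of checks in Response.text (hypothetical helper).

    declared: stands in for r.encoding (may be None).
    detected: stands in for r.apparent_encoding; assumed, not computed.
    """
    if not body:
        return ""
    # Use the declared encoding if there is one, else the detected one.
    encoding = declared if declared is not None else detected
    try:
        return str(body, encoding, errors="replace")
    except (LookupError, TypeError):
        # Unknown encoding name: fall back to Python's default (UTF-8).
        return str(body, errors="replace")


body = "你好".encode("utf-8")
wrong = decode_like_text(body, declared="ISO-8859-1")  # declared wins: garbled
right = decode_like_text(body, declared=None)          # falls back to detected
typo = decode_like_text(body, declared="utf-9")        # LookupError branch
print(repr(wrong), right, typo)
```

Going by the source shown above, the same effect should be reachable without patching the library: assigning r.encoding = r.apparent_encoding (or r.encoding = None) before reading r.text makes the text property take the detected encoding.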

At this point I strongly suspected a bug. But then, everyone uses requests without trouble, so how could it be a bug? Let's keep digging...

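That leaves one question: where does the value of the encoding attribute come from once the response arrives? To find out, here are a few more pieces of source: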

def get_encodings_from_content(content):
    """Returns encodings from given content string.

    :param content: bytestring to extract encodings from.
    """
    warnings.warn((
        'In requests 3.0, get_encodings_from_content will be removed. For '
        'more information, please see the discussion on issue #2266. (This'
        ' warning should only appear once.)'),
        DeprecationWarning)

    # print("encoding from content:", content)
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))

def _parse_content_type_header(header):
    """Returns content type and parameters from given header

    :param header: string
    :return: tuple containing content type and dictionary of
         parameters
    """

    tokens = header.split(';')
    # print("split header:", tokens)
    content_type, params = tokens[0].strip(), tokens[1:]
    params_dict = {}
    items_to_strip = "\"' "

    for param in params:
        param = param.strip()
        if param:
            key, value = param, True
            index_of_equals = param.find("=")
            if index_of_equals != -1:
                key = param[:index_of_equals].strip(items_to_strip)
                value = param[index_of_equals + 1:].strip(items_to_strip)
            params_dict[key.lower()] = value
    return content_type, params_dict


def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """
    # print("encoding from headers:", headers)
    # headers={"content-type":"text/html;charset=utf-9"}
    content_type = headers.get('content-type')
    # print(content_type)
    if not content_type:
        return None

    content_type, params = _parse_content_type_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

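It is not a bug. The encoding attribute ultimately comes from the final if branch of get_encoding_from_headers in utils.py: when the response Content-Type is a text type with no charset parameter, it falls back to ISO-8859-1. As for why the response headers contain no ;charset=utf-8 even though I sent the request with content-type: text/html;charset=utf-8, I still need to read more of the source; note, though, that a request's Content-Type header describes the request body, while the response headers are set by the server. In the end I patched the source: I changed the if condition in the text property to is not None so that the default encoding is not used, and text no longer needs the encode-then-decode trick from the previous post to produce correct Chinese. Here is a link to someone else's analysis of the same garbled-Chinese problem in requests responses: https://www.cnblogs.com/mswei/p/9835370.html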
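
The header rule can be tried out in isolation. The function below is a simplified stand-in written for this article, not the library function itself; encoding_from_content_type is a hypothetical name, and it only reproduces the three outcomes seen in get_encoding_from_headers: an explicit charset, the ISO-8859-1 default for text types, and None otherwise.

```python
def encoding_from_content_type(content_type):
    """Simplified stand-in for get_encoding_from_headers (hypothetical)."""
    if not content_type:
        return None
    # First token is the media type, the rest are parameters.
    media_type, *params = content_type.split(";")
    for param in params:
        key, sep, value = param.strip().partition("=")
        if sep and key.strip().lower() == "charset":
            # Drop surrounding quotes and spaces, as the original does.
            return value.strip("\"' ")
    # No charset given: text types get the ISO-8859-1 default.
    if "text" in media_type:
        return "ISO-8859-1"
    return None


print(encoding_from_content_type("text/html;charset=utf-8"))
print(encoding_from_content_type("text/html"))
print(encoding_from_content_type("application/json"))
```

The second call is the case from this article: a server that answers with a bare text/html leaves encoding at ISO-8859-1, which is what garbles UTF-8 pages in the text property.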
