requests source code analysis: what does the Response class's text property actually do, and why does Chinese come out garbled?

Background: a previous post covered a workaround for garbled Chinese in requests responses, but some doubts remained and I wanted a real answer. Whether or not the request is sent with a content-type:text/html;charset=utf-8 header, and even though the response page itself declares charset=utf-8, experiments show that the headers on the Response object contain no charset at all, and response.encoding is not the charset declared in the page either. That is odd, so let's dig into the source code.
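A minimal sketch of the symptom (the URL below is only a placeholder, not a real test target):

import requests

r = requests.get("http://example.com/some-chinese-page")   # placeholder URL
print(r.headers.get("content-type"))   # e.g. 'text/html' -- no charset echoed back
print(r.encoding)                      # 'ISO-8859-1'
print(r.apparent_encoding)             # 'utf-8', guessed by chardet
print(r.text[:100])                    # garbled Chinese when decoded as ISO-8859-1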

First, let's look at the source of the content property on requests' Response class:

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        if self._content_consumed:
            raise RuntimeError(
                'The content for this response was already consumed')

        if self.status_code == 0 or self.raw is None:
            self._content = None
        else:
            self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

From the above you can see that the content property never deals with encoding at all, so one might guess that requests uses chardet to detect the encoding for us. What actually happens does not match that expectation!

Response.encoding is an attribute of the response object; the comment in the source, `#: Encoding to decode with when accessing r.text.`, says it is the encoding used when decoding text. So in most cases it is safer to take the raw page source from the content attribute and decode it once yourself to get correct Chinese.
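As a minimal sketch of that workaround (placeholder URL again):

import requests

r = requests.get("http://example.com/some-chinese-page")   # placeholder URL
html = r.content.decode("utf-8")                 # decode the raw bytes yourself
# or, if the charset is not known up front:
# html = r.content.decode(r.apparent_encoding)
print(html[:100])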

Next, the source of the text property:

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content

In the middle, encoding is taken from the attribute self.encoding, and the code then checks whether that attribute is None. Debugging shows that if you print self.encoding right before the if, it already has a value: ISO-8859-1, so the fallback code that computes the encoding never runs. Setting that aside for the moment, let's step into apparent_encoding, which is also a property. Its source is below, with a debug print added just before the return:

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the chardet library."""
    print("chardet result: {}".format(chardet.detect(self.content)))
    return chardet.detect(self.content)['encoding']

It is given the value of the content attribute (i.e. the raw response body), and the output is: {'encoding': 'utf-8', 'language': '', 'confidence': 0.99}. The 'encoding' key of that dict is utf-8, so if nothing went wrong self.encoding would be utf-8, and the content returned at the end of the text property would be the response text correctly decoded as UTF-8.
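For comparison, calling chardet directly on some UTF-8 bytes gives the same kind of result (the exact confidence depends on the input):

import chardet

data = ("requests 响应中文乱码分析," * 10).encode("utf-8")
print(chardet.detect(data))
# typically something like {'encoding': 'utf-8', 'language': '', 'confidence': 0.99}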

If, in the text property, I change the if condition so that the fallback branch always runs (i.e. apparent_encoding is always used), the result is the correct encoding utf-8, and whatever encoding the response page actually uses, the Chinese output basically always comes out right!
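Instead of patching the library, the same effect can be achieved from the calling side by overriding encoding before reading text; a minimal sketch (placeholder URL):

import requests

r = requests.get("http://example.com/some-chinese-page")   # placeholder URL
r.encoding = r.apparent_encoding   # replace the header-derived ISO-8859-1 with chardet's guess
print(r.text[:100])                # now decoded with the detected encoding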

At this point I seriously suspected a bug. Then again, everyone uses requests just fine, so how could it be a bug? Let's keep digging...

That leaves just one question: after the response comes back, where does the value of the encoding attribute come from? To get to the bottom of it, look at a few more places in the source (these live in requests' utils.py):

def get_encodings_from_content(content):
    """Returns encodings from given content string.

    :param content: bytestring to extract encodings from.
    """
    warnings.warn((
        'In requests 3.0, get_encodings_from_content will be removed. For '
        'more information, please see the discussion on issue #2266. (This'
        ' warning should only appear once.)'),
        DeprecationWarning)

    # print("content获取encoding:",content)
    charset_re = re.compile(r'<meta.*?charset=["\']*(.+?)["\'>]', flags=re.I)
    pragma_re = re.compile(r'<meta.*?content=["\']*;?charset=(.+?)["\'>]', flags=re.I)
    xml_re = re.compile(r'^<\?xml.*?encoding=["\']*(.+?)["\'>]')

    return (charset_re.findall(content) +
            pragma_re.findall(content) +
            xml_re.findall(content))
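A quick illustration of what this helper pulls out of a page (it takes a str here, and may emit the DeprecationWarning shown above):

from requests.utils import get_encodings_from_content

html = '<html><head><meta charset="utf-8"></head><body>你好</body></html>'
print(get_encodings_from_content(html))   # ['utf-8']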

def _parse_content_type_header(header):
    """Returns content type and parameters from given header

    :param header: string
    :return: tuple containing content type and dictionary of
         parameters
    """

    tokens = header.split(';')
    # print("拆分请求头:",tokens)
    content_type, params = tokens[0].strip(), tokens[1:]
    params_dict = {}
    items_to_strip = "\"' "

    for param in params:
        param = param.strip()
        if param:
            key, value = param, True
            index_of_equals = param.find("=")
            if index_of_equals != -1:
                key = param[:index_of_equals].strip(items_to_strip)
                value = param[index_of_equals + 1:].strip(items_to_strip)
            params_dict[key.lower()] = value
    return content_type, params_dict
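Assuming a requests 2.x version recent enough to contain this private helper, it splits a Content-Type header like this:

from requests.utils import _parse_content_type_header

print(_parse_content_type_header("text/html; charset=utf-8"))
# ('text/html', {'charset': 'utf-8'})
print(_parse_content_type_header("text/html"))
# ('text/html', {})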


def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """
    # print("从请求头获取encoding:",headers)
    # headers={"content-type":"text/html;charset=utf-9"}
    content_type = headers.get('content-type')
    # print(content_type)
    if not content_type:
        return None

    content_type, params = _parse_content_type_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
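Feeding it two hand-made header dicts shows both branches (in the requests 2.x version analysed here):

from requests.utils import get_encoding_from_headers

# An explicit charset wins:
print(get_encoding_from_headers({"content-type": "text/html; charset=utf-8"}))  # 'utf-8'
# A text/* type without a charset falls back to ISO-8859-1, per RFC 2616:
print(get_encoding_from_headers({"content-type": "text/html"}))                 # 'ISO-8859-1'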

So it is not a bug: the encoding attribute ultimately comes from the last if branch of get_encoding_from_headers in utils.py. As for why the response headers carry no ;charset=utf-8 even though the request was sent with content-type:text/html;charset=utf-8, the likely reason is that the request's Content-Type header only describes the request body, while response.headers contains whatever the server sends back. In the end I modified the source: in the text property I changed the if condition to is not None so it no longer trusts the default header-derived encoding, and text no longer needs the encode-then-decode trick from the previous post to produce correct Chinese. Here is another write-up on the same garbled-Chinese problem with requests responses: https://www.cnblogs.com/mswei/p/9835370.html
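If you would rather not touch the library at all, a small wrapper on the calling side does the same job; fetch_text below is my own hypothetical helper, not part of requests:

import requests

def fetch_text(url):
    """Fetch a page and decode it sensibly (a sketch, not requests' own API):
    trust an explicit charset in the response headers, otherwise fall back to
    chardet's guess via apparent_encoding."""
    r = requests.get(url)
    if "charset" not in r.headers.get("content-type", "").lower():
        r.encoding = r.apparent_encoding
    return r.text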
