Character encoding problems when scraping Chinese websites with requests

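Over the past couple of days, while scraping some portal sites with requests, I noticed that Chinese text printed to the console or saved to a file came out as garbled characters (what looked like raw Unicode codes) instead of readable Chinese. That was annoying enough that I went looking for an explanation online. Once you understand the cause, the fix is simple.
For example, take scraping ifeng.com (鳳凰網):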

response = requests.get("http://www.ifeng.com/")

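As we know, a response object has two properties, text and content. Both hold the response body, but they are not the same thing. The docstrings make the difference clear: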

The docstring for text reads:

Content of the response, in unicode. If Response.encoding is None, encoding will be guessed using ``chardet``. The encoding of the response content is determined based solely on HTTP headers, following RFC 2616 to the letter. If you can take advantage of non-HTTP knowledge to make a better guess at the encoding, you should set ``r.encoding`` appropriately before accessing this property.

And the docstring for content reads:

Content of the response, in bytes.

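So text is a decoded Unicode string, while content is the raw bytes. When it decodes text, requests determines the encoding solely from the HTTP response headers, that is, from the charset parameter of the Content-Type header. If that header has a text/* type but no charset, requests falls back to ISO-8859-1, which turns Chinese text into garbage. Note that this is not the same as the charset declared in the <meta> tag under <head> that you see when viewing the page source. requests does not read that tag, although it usually tells you which encoding the page really uses, for example: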

<meta http-equiv="content-type" content="text/html;charset=utf-8">
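To make the failure mode concrete, here is a small offline sketch with no network access. It assumes a page body encoded as UTF-8 and shows what decoding those bytes with ISO-8859-1 (the fallback requests uses for text/* responses without a charset) would produce. The sample string and variable names are only illustrative.

```python
# Simulated response body: the UTF-8 bytes a server might send for a Chinese title.
raw = "鳳凰網".encode("utf-8")

# Decoding with the wrong charset does not raise an error, because ISO-8859-1
# maps every byte to some character. It silently yields unreadable text.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the page really uses gives the original text back.
right = raw.decode("utf-8")
print(wrong == right)

# The damage is reversible as long as the wrongly decoded string is intact:
# re-encode with the wrong charset, then decode with the right one.
repaired = wrong.encode("iso-8859-1").decode("utf-8")
```

This is roughly the relationship between content (the bytes) and text (the bytes decoded with response.encoding).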

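So when the text property returns garbled HTML, set response.encoding to the encoding the page actually uses (for instance the one declared in its source) before reading text. The printed output then looks right.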

import requests

response = requests.get("http://www.ifeng.com/")
response.encoding = "utf-8"  # manually set the character encoding to utf-8
print(response.text)
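Hard-coding "utf-8" only works when you already know the page's encoding. As a rough sketch of one alternative, the hypothetical helper below (sniff_meta_charset is my own name, not part of requests) looks for a charset declaration in the first few kilobytes of the raw body. It uses a simple regular expression rather than a real HTML parser, so treat it as an approximation.

```python
import re
from typing import Optional

# Matches both <meta charset="..."> and the older
# <meta http-equiv="content-type" content="text/html;charset=..."> form.
_META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_meta_charset(body: bytes, limit: int = 4096) -> Optional[str]:
    """Return the charset declared in a <meta> tag, lowercased, or None.

    Only the first `limit` bytes are searched, on the assumption that the
    declaration appears early in <head>.
    """
    match = _META_CHARSET.search(body[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

With a real response, one could then write `response.encoding = sniff_meta_charset(response.content) or response.apparent_encoding` before reading `response.text`. Here apparent_encoding is the guess requests itself makes from the body bytes.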

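If you are curious, try leaving the encoding unset, or setting a different one, and compare the results. Questions are welcome in the comments!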

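In addition, when Python's built-in open() opens a text file (as opposed to a binary file; mind the difference), it encodes and decodes with a platform-dependent default encoding. On Windows this is typically the system's ANSI code page (for example cp936/GBK on a Simplified Chinese system) rather than UTF-8, while on Linux it is usually UTF-8. You can, of course, pass the encoding argument to open() to choose one explicitly, for example: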

open(fileName, "r", encoding="utf-8")
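As a minimal sketch of why the explicit argument matters, the snippet below checks the platform default and then writes and reads a file with encoding="utf-8" on both sides. The file name page.txt and the sample text are placeholders I made up for illustration.

```python
import locale
import os
import tempfile

text = "鳳凰網 首頁"

# The encoding open() typically falls back to when none is given;
# the value depends on the operating system and locale settings.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

# Write and read with an explicit encoding so the result does not
# depend on the platform default.
path = os.path.join(tempfile.mkdtemp(), "page.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

Using the same explicit encoding for both writing and reading is what keeps the saved page readable when the file moves between Windows and Linux.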

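Those are the small issues I ran into with Python when scraping Chinese web pages. I am noting them down in the hope that they help me and others later.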
