Here are several ways to fix garbled Chinese text when writing a web scraper. The test code is as follows:
import requests
headers ={
"Accept": "text/plain, */*; q=0.01" ,
"Accept-Encoding": "gzip, deflate, br,",
"Accept-Language": "zh-CN,zh;q=0.9",
"Connection": "keep-alive",
"Host": "www.douban.com",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
res = requests.get("https://www.douban.com/group/topic/157797102",headers=headers)
res.encoding ="UTF-8"
print(res.text)
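Development environment: PyCharm
Python version: 3.7.2
Operating system: Windows 10, 64-bit
Recorded: June 12, 2020, 21:36
Reproducing the problem: running the test code above prints the page with garbled Chinese text.
Attempted fixes:
Method 1: the wrong encoding is set after the page is requested
This case is usually easy to fix: specify the encoding manually, or use the chardet module to detect it, as shown below. (This method does not fix the test code in this post.)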
import requests
import chardet
headers ={
"Accept": "text/plain, */*; q=0.01" ,
"Accept-Encoding": "gzip, deflate, br,",
"Accept-Language": "zh-CN,zh;q=0.9",
"Connection": "keep-alive",
"Host": "www.douban.com",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
res = requests.get("https://www.douban.com/group/topic/157797102",headers=headers)
res.encoding = chardet.detect(res.content)["encoding"]  # use the encoding that chardet detects
print(res.text)
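To make the encoding mismatch in Method 1 concrete, here is a small offline sketch. It does not contact douban.com: the sample string and the choice of ISO-8859-1 as the wrong charset are assumptions made for illustration, not values taken from the real response.

```python
# Illustrative sample text, not taken from the real page.
sample = "豆瓣小组"
raw = sample.encode("utf-8")        # bytes as a UTF-8 server would send them

wrong = raw.decode("iso-8859-1")    # wrong charset: every byte becomes one odd character
right = raw.decode("utf-8")         # right charset: the original text comes back

print(wrong == sample)  # False
print(right == sample)  # True
```

Setting res.encoding, by hand or through chardet, only changes this last decoding step. It cannot help when the bytes themselves are still compressed.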
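Method 2: remove "Accept-Encoding": "gzip, deflate, br," from the request headers
Accept-Encoding is sent by the browser to the server to declare which content encodings (compression formats) the browser supports, typically gzip, deflate, br and so on.
In Python 3, the requests package exposes both response.text and response.content:
response.content  # the response body as bytes; gzip and deflate compression are decoded automatically. Type: bytes
response.text  # the response body as a string, decoded using the character encoding from the response headers. Type: str
However, br is not decoded by default; decoding it requires an additional Brotli package.
So you can remove "Accept-Encoding": "gzip, deflate, br," and the response is decoded correctly: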
import requests
headers ={
"Accept": "text/plain, */*; q=0.01" ,
"Accept-Language": "zh-CN,zh;q=0.9",
"Connection": "keep-alive",
"Host": "www.douban.com",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
res = requests.get("https://www.douban.com/group/topic/157797102",headers=headers)
print(res.text)
Test result: the Chinese text is displayed correctly.
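The garbling in this post comes from a body that is still compressed, not from a wrong charset. The sketch below illustrates that idea offline. It uses gzip from the standard library as a stand-in for Brotli (which is not in the standard library), and the HTML snippet is a made-up example.

```python
import gzip

html = "<title>豆瓣小组</title>"              # illustrative page content
body = gzip.compress(html.encode("utf-8"))    # what travels over the wire when compressed

# Treating the still-compressed bytes as text yields unreadable output.
garbled = body.decode("utf-8", errors="replace")

# Decompressing first, then decoding, recovers the page.
recovered = gzip.decompress(body).decode("utf-8")

print(garbled == html)    # False
print(recovered == html)  # True
```

When the client does not advertise br, the server has no reason to send a Brotli body, so requests only ever sees formats it can decompress on its own.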
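Method 3: remove only br from "Accept-Encoding" in the request headers
br stands for Brotli, a relatively new lossless compression format with a very high compression ratio (higher than gzip).
With br removed, the server returns a page that is either uncompressed or compressed in a format requests can decode by default, so the text decodes normally. Still, since br is a legitimate encoding, it should be possible to solve the problem without removing it.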
import requests
import chardet
headers = {
"Accept": "text/plain, */*; q=0.01",
"Accept-Encoding": "gzip, deflate",  # br has been removed
"Accept-Language": "zh-CN,zh;q=0.9",
"Connection": "keep-alive",
"Host": "www.douban.com",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
res = requests.get("https://www.douban.com/group/topic/157797102", headers=headers)
res.encoding = chardet.detect(res.content)["encoding"]  # use the encoding that chardet detects
print(res.text)
Test result: the Chinese text is displayed correctly.
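If the headers are copied from a browser, br can be stripped programmatically instead of by hand. The helper below is a minimal sketch; drop_encoding is a hypothetical name introduced here, not part of requests.

```python
def drop_encoding(header_value, unwanted="br"):
    """Return an Accept-Encoding value without the unwanted coding."""
    kept = []
    for item in header_value.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries, e.g. from a trailing comma
        name = item.split(";", 1)[0].strip().lower()  # ignore any ;q= weight
        if name != unwanted:
            kept.append(item)
    return ", ".join(kept)


print(drop_encoding("gzip, deflate, br,"))  # gzip, deflate
```

The result can be assigned back with headers["Accept-Encoding"] = drop_encoding(headers["Accept-Encoding"]) before calling requests.get.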
Method 4: install the brotli module
pip install brotli
After that, the br-compressed response is decoded correctly:
import requests
headers ={
"Accept": "text/plain, */*; q=0.01" ,
"Accept-Encoding": "gzip, deflate, br,",
"Accept-Language": "zh-CN,zh;q=0.9",
"Connection": "keep-alive",
"Host": "www.douban.com",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
res = requests.get("https://www.douban.com/group/topic/157797102",headers=headers)
print(res.text)
Test result: the Chinese text is displayed correctly.
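A scraper that may run on machines with or without the module installed could advertise br only when a decoder is present. The following is a sketch under that assumption: brotli_available and accept_encoding are hypothetical helper names, and checking for a module named brotlicffi as an alternative provider is also an assumption about the environment.

```python
import importlib.util


def brotli_available():
    """Return True if a module named brotli or brotlicffi can be found."""
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("brotli", "brotlicffi")
    )


def accept_encoding():
    """Build an Accept-Encoding value that lists br only when it can be decoded."""
    base = "gzip, deflate"
    return base + ", br" if brotli_available() else base


print(accept_encoding())
```

The returned string can be used as the "Accept-Encoding" value in the headers dictionary, which combines Method 3 and Method 4 in one script.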