Web Crawler (Spider) Python Study (2): Web Page Encoding and Parsing

Original

2018-09-03 12:51

Web pages come in many encodings, such as UTF-8 and GB2312, so we need to convert them to one common format before parsing the text.

import urllib2

headers = {
    'X-Requested-With': 'XMLHttpRequest',
    'Accept-Language': 'zh-cn',
    'Accept-Encoding': 'gzip, deflate',  # tell the server we accept compressed responses
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}
req = urllib2.Request(url='http://www.baidu.com/', headers=headers)
opener = urllib2.build_opener()  # default opener; extra handlers (cookies, proxies) can be passed here

html = None
try:
    response = opener.open(req)
    html = response.read()
except urllib2.HTTPError as e:  # HTTPError is a subclass of URLError, so catch it first
    print "error code:", e.code
except urllib2.URLError as e:
    print "reason:", e.reason
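For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same request can be sketched roughly as follows. The helper names `build_request` and `fetch` and the `DEFAULT_HEADERS` constant are hypothetical, introduced only for this illustration; the header values are copied from the snippet above.

```python
import urllib.error
import urllib.request

# Header values copied from the Python 2 snippet above (the constant name is an assumption).
DEFAULT_HEADERS = {
    "Accept-Language": "zh-cn",
    "Accept-Encoding": "gzip, deflate",
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
                  "Gecko/20091201 Firefox/3.5.6",
}


def build_request(url, headers=None):
    """Return a urllib Request that carries the crawler's headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})
    return urllib.request.Request(url=url, headers=merged)


def fetch(url, timeout=10):
    """Download a page; return (raw_bytes, response_headers), or (None, None) on failure."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read(), response.headers
    except urllib.error.HTTPError as e:  # subclass of URLError, so it is caught first
        print("error code:", e.code)
    except urllib.error.URLError as e:
        print("reason:", e.reason)
    return None, None
```

Note that `response.read()` returns bytes in Python 3, so the decompression and decoding steps still apply to the raw body.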

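Once the page text has been downloaded from the URL, there are a few things to do:
1. Check whether it is compressed (gzip)
2. Determine the character set (utf-8, gb2312, ...)
3. Parse the text (re, HTMLParser, ...)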

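I. Page compression
Some sites compress their pages to save bandwidth. In the request we can list the compression formats we accept in the Accept-Encoding header ('gzip, deflate'); most pages come back as gzip. Decompress gzip with the gzip module and deflate with the zlib module.
There are two ways to tell whether the downloaded page is compressed:
(1) From the response headers: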

import gzip
import StringIO

encoding = response.info().get('Content-Encoding')  # compression reported by the server
if encoding == 'gzip':
    html = gzip.GzipFile(fileobj=StringIO.StringIO(html)).read()

(2) From the page content itself:

# A gzip stream always starts with the magic bytes 1f 8b; the flag and timestamp bytes after them vary
if html[:2] == '\x1f\x8b':
    html = gzip.GzipFile(fileobj=StringIO.StringIO(html)).read()
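The two checks can be combined into one helper that also covers deflate. The following is a minimal sketch in Python 3 (the snippets above use Python 2); the function name `decompress_body` is hypothetical. It assumes that a server sending "deflate" may send either a zlib-wrapped stream or a raw deflate stream, so it tries both.

```python
import gzip
import io
import zlib


def decompress_body(raw, content_encoding=None):
    """Decompress a response body, trusting the header first and the gzip magic bytes second."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip" or raw[:2] == b"\x1f\x8b":
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)  # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate, no zlib header
    return raw  # not compressed, or an encoding this sketch does not handle


# Usage: round-trip a small page through gzip
page = "<html>你好</html>".encode("utf-8")
restored = decompress_body(gzip.compress(page), "gzip")
```

A body with no Content-Encoding header and no gzip magic bytes is returned unchanged.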

ref：http://www.jianshu.com/p/2c2781462902

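II. Detecting the character set
A page's character set can be read from the charset in its meta tag, but that value is often unreliable, since many pages declare it carelessly. Detecting it from the raw bytes is more dependable; the chardet library is used here.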

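import chardet

# chardet guesses the encoding from the raw bytes; the guess may be None, so fall back to gbk
charset = (chardet.detect(html)['encoding'] or 'gbk').lower()
if charset == 'gb2312':
    charset = 'gbk'  # GBK is a superset of GB2312 and decodes more real-world pages
if charset not in ('utf-8', 'ascii'):
    # 'ignore' drops undecodable bytes instead of raising UnicodeDecodeError
    html = html.decode(charset, 'ignore').encode('utf8')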

Passing 'ignore' to decode skips bytes that cannot be decoded. Without it, some gb2312 pages raise:
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 23426-23427: illegal multibyte sequence
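When chardet is not available, a rough alternative is to try a short list of likely encodings strictly and only then fall back to lossy decoding. This is a sketch in Python 3, not the approach used above: the function name `decode_html` and the candidate list are assumptions, and gb18030 is chosen here because it is a superset of GBK and GB2312.

```python
def decode_html(raw, candidates=("utf-8", "gb18030")):
    """Try each candidate encoding strictly; fall back to lossy decoding with the last one."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: replace the bad bytes rather than raising
    return raw.decode(candidates[-1], errors="replace"), candidates[-1]


# Usage: GBK bytes are not valid UTF-8, so the second candidate is used
text, used = decode_html("网页编码".encode("gbk"))
```

Using errors="replace" instead of "ignore" leaves a visible U+FFFD marker where bytes were lost, which makes damaged pages easier to spot.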

ref：http://python-china.org/t/146

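III. Parsing the text
There are many tools for parsing, and it is an important part of a web crawler: plain regular expressions (re), HTMLParser, XPath, BeautifulSoup, and so on.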

For example, a regular expression that finds every href= value:

import re
# The lookbehind and lookahead keep only the text between href=" and the closing quote
pattern = re.compile('(?<=href=").*?(?=")')
links = pattern.findall(html)
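The regular expression only matches double-quoted href values. As a comparison, here is a minimal sketch of the same task with the standard library's HTML parser, written for Python 3 (where the module is html.parser). The class name `LinkCollector` and the function `extract_links` are hypothetical names for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with names already lower-cased
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html_text):
    """Return all href values found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


# Usage: single-quoted attributes are handled too
sample = '<p><a href="http://a.example/">A</a> <a class="x" href=\'/b\'>B</a> <link href="s.css"></p>'
found = extract_links(sample)
```

The parser expects text rather than bytes, so the page should be decompressed and decoded before it is fed in.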
