The page I crawled is encoded as "gb2312", so my first instinct was to encode and decode with that same encoding.
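import re
from lxml import etree
import html

# Read the saved page as gbk, a superset of gb2312
with open('test.html', 'r', encoding='gbk') as f:
    c = f.read()

s = re.sub(r'\n', ' ', c)  # newlines replaced by spaces; not used below
tree = etree.HTML(c)
rows = tree.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
for row in rows:
    boards = {}  # placeholder for per-row data; not filled in here
    # tostring() returns bytes, so decode them back to str
    s1 = etree.tostring(row).decode('gbk')
    # turn numeric character references (&#...;) back into characters
    s1 = html.unescape(s1)
    print(s1)
    break  # only look at the first row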
Since "gbk" is a superset of "gb2312", I used "gbk"; the result is the same either way.
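The superset relationship can be checked with a small sketch. The helper names `roundtrip_via_gbk` and `can_encode` and the sample strings are my own, chosen only for illustration: bytes written as gb2312 should decode unchanged under gbk, while a character such as "镕" can be encoded in gbk but not in gb2312.

```python
def roundtrip_via_gbk(text: str) -> str:
    """Encode with gb2312, then decode the same bytes with gbk."""
    return text.encode("gb2312").decode("gbk")


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    sample = "编码"  # hypothetical sample text
    print(roundtrip_via_gbk(sample) == sample)
    print(can_encode("镕", "gbk"), can_encode("镕", "gb2312"))
```

This is why opening a gb2312 page with `encoding='gbk'` is safe, while the reverse may fail on characters that only gbk covers.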
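After reading through a lot of blog posts, I found this advice:
whatever encoding a crawled page uses, convert it to utf-8 before storing it.
I am not yet clear on the exact reason, so I will leave that to be filled in later.
But when it comes to storing pages encoded as gbk or gb2312, converting them to utf-8 is the right thing to do.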
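A minimal sketch of that conversion step, assuming the crawled page was first saved to disk as raw gbk/gb2312 bytes. The function name `convert_to_utf8` and its parameters are hypothetical, not taken from the original post.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "gbk") -> str:
    """Re-save a gbk/gb2312 file as utf-8 and return the decoded text.

    source_encoding is an assumption about the input file; gbk is used
    as the default because it also covers gb2312.
    """
    raw = Path(src).read_bytes()        # bytes exactly as downloaded
    text = raw.decode(source_encoding)  # bytes -> str
    # Write bytes directly so line endings are preserved as-is
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

For example, `convert_to_utf8('page_gbk.html', 'test.html')` (file names are placeholders) would produce a file that can be opened with `encoding='utf-8'`. Note that any `<meta charset>` declaration inside the HTML still names the old encoding unless it is edited separately.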
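import re
from lxml import etree
import html

# Read the page that was stored as utf-8
with open('test.html', 'r', encoding='utf-8') as f:
    c = f.read()

s = re.sub(r'\n', ' ', c)  # newlines replaced by spaces; not used below
tree = etree.HTML(c)
rows = tree.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
for row in rows:
    boards = {}  # placeholder for per-row data; not filled in here
    # tostring() returns bytes, so decode them back to str
    s1 = etree.tostring(row).decode('utf-8')
    # turn numeric character references (&#...;) back into characters
    s1 = html.unescape(s1)
    print(s1)
    break  # only look at the first row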
The output displays correctly.
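As a possible alternative to the decode-then-`html.unescape` step used above, the sketch below hands lxml the raw bytes together with an explicit parser encoding, then serializes straight to `str`. By default `etree.tostring` returns ASCII bytes with non-ASCII characters written as numeric character references, which is why the listings need `html.unescape`; passing `encoding='unicode'` avoids that. The helper name `first_item_markup` and its parameters are my own, and this is only one way to do it under those assumptions.

```python
from lxml import etree


def first_item_markup(raw: bytes, encoding: str, xpath: str):
    """Parse raw HTML bytes and return the first matching node as str.

    Returns None when the XPath expression matches nothing.
    """
    # Tell the parser how the bytes are encoded instead of decoding first
    parser = etree.HTMLParser(encoding=encoding)
    tree = etree.HTML(raw, parser=parser)
    rows = tree.xpath(xpath)
    if not rows:
        return None
    # encoding='unicode' yields str directly, with no character references
    return etree.tostring(rows[0], encoding="unicode", method="html")
```

With the file from the post, this might be called as `first_item_markup(open('test.html', 'rb').read(), 'utf-8', "//ul[@class='bang_list clearfix bang_list_mode']/li")`.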