Crawling GB2312-encoded web pages with Python 3.x

The following code, which reads the page with urlopen, works well.

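from urllib.request import urlopen
import bs4

# urlopen returns the raw bytes, so BeautifulSoup can decode them as GB2312 itself.
doc = urlopen("http://www.w3school.com.cn/html/html_tables.asp")
# bs4 spells the argument from_encoding (fromEncoding is the old BeautifulSoup 3 name).
soup = bs4.BeautifulSoup(doc, "html.parser", from_encoding="GB2312")
titles = soup.find_all("title")
print(soup.prettify())

# Save the decoded page as UTF-8.
with open("C:\\Users\\yfeng14\\Desktop\\betting\\contents.txt", 'w', encoding="UTF-8") as output:
    output.write(soup.prettify())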


If we use the "requests" package in the same way, it fails.

import requests
import bs4

output = open("C:\\Users\\yfeng14\\Desktop\\betting\\contents.txt", 'w', encoding="UTF-8")
request_link = "http://www.songtaste.com/"
response = requests.get(request_link)
# response.text is a str that requests has already decoded, so from_encoding is ignored here.
soup = bs4.BeautifulSoup(response.text,"html.parser",from_encoding="GB2312")

output.write(soup.prettify())
output.close()
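
A likely cause, sketched with the standard library only: when the HTTP headers of a text page declare no charset, requests falls back to ISO-8859-1 to build response.text, and GB2312 bytes decoded that way turn into mojibake. The sample markup below is made up for illustration and stands in for the bytes a GB2312 server would send.

```python
# Bytes as a GB2312 server might send them (sample text is an assumption).
raw = "<title>中文网页</title>".encode("gb2312")

# Decoding with the wrong codec: each byte becomes one Latin-1 character.
wrong = raw.decode("iso-8859-1")

# Decoding with the right codec gives back the original text.
right = raw.decode("gb2312")

# ISO-8859-1 maps every byte to exactly one character, so the damage is reversible.
recovered = wrong.encode("iso-8859-1").decode("gb2312")

print(right)
print(wrong == right)      # the mis-decoded string differs from the real text
print(recovered == right)  # re-encoding and decoding again restores it
```

Whether this is what happens for a particular site depends on the headers that site sends, so treat it as an explanation to check rather than a diagnosis.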


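So be careful when using the "requests" package: give BeautifulSoup the raw bytes in response.content (or set response.encoding = "GB2312" before reading response.text) rather than the already-decoded response.text.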
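
A minimal sketch of one way to make the requests version work, assuming the page really is GB2312 encoded. The helper names parse_gb2312 and fetch_soup are hypothetical, chosen only for this example; the timeout value is likewise an arbitrary choice.

```python
import bs4
import requests


def parse_gb2312(raw: bytes) -> bs4.BeautifulSoup:
    """Parse undecoded page bytes, telling BeautifulSoup they are GB2312."""
    # from_encoding only takes effect for bytes input; for str input it is ignored.
    return bs4.BeautifulSoup(raw, "html.parser", from_encoding="GB2312")


def fetch_soup(url: str) -> bs4.BeautifulSoup:
    """Download a page with requests and parse its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # response.content is the undecoded body; response.text would already be a str.
    return parse_gb2312(response.content)


# Offline check with made-up sample bytes, so no network access is needed.
sample = "<title>中文</title>".encode("gb2312")
print(parse_gb2312(sample).title.string)
```

Calling fetch_soup(request_link) and writing soup.prettify() to the UTF-8 file as before should then keep the Chinese text intact, provided the site is reachable and does serve GB2312.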
