【Python3 Web Crawler】1 - Using the urllib Library
Introduction to the Built-in Modules
urllib is Python's built-in HTTP request library. It consists of four modules (request, error, parse, robotparser); a fifth, response, is used internally:

- error: the exception-handling module. If a request fails, we can catch the exception and then retry or take other action so the program does not terminate unexpectedly.
- parse: a utility module providing many URL-handling methods, such as splitting, parsing, and joining.
- request: the most basic HTTP request module, used to simulate sending a request. Just as you type a URL into a browser and hit Enter, you pass a URL plus any extra parameters to the library's methods to simulate that process.
- response: the most basic HTTP response module (used internally by request).
- robotparser: mainly used to parse a site's robots.txt file and decide which pages may or may not be crawled; in practice it is rarely used.
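As a quick taste of the parse module's URL-handling helpers mentioned above (the URLs here are arbitrary examples), splitting, joining, and encoding look like this:

```python
from urllib.parse import urlparse, urljoin, urlencode

# Split a URL into its components
parts = urlparse("http://www.baidu.com/index.html?user=test#frag")
print(parts.scheme)   # http
print(parts.netloc)   # www.baidu.com
print(parts.query)    # user=test

# Join a base URL with a relative link
print(urljoin("http://www.baidu.com/a/b.html", "c.html"))  # http://www.baidu.com/a/c.html

# Encode a dict into a query string (useful for GET parameters or POST bodies)
print(urlencode({"wd": "python", "page": 1}))  # wd=python&page=1
```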
Setting Request Headers
from urllib.request import urlopen
from urllib.request import Request
url = "http://www.baidu.com"
# Without a User-Agent header, the request is easily identified as a bot
headers = {
'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/78.0.3904.97 Mobile Safari/537.36 '
}
request = Request(url, headers=headers)
response = urlopen(request)
info = response.read()
# Note: Request capitalizes stored header names, so it must be 'User-agent' here
print(request.get_header('User-agent'))
print(info)
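The example above sends a GET request. Building on the same Request class, a POST request only needs a bytes body; a minimal sketch (httpbin.org is just a common test endpoint, and the form fields are made up):

```python
from urllib.request import Request, urlopen
from urllib.parse import urlencode

# POST data must be bytes; urlencode builds the form-encoded body
url = "http://httpbin.org/post"
data = urlencode({"user": "test", "pwd": "123456"}).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0"}
request = Request(url, data=data, headers=headers, method="POST")
print(request.get_method())  # POST
# response = urlopen(request)  # uncomment to actually send (needs network access)
# print(response.read().decode())
```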
Request headers: obtaining a User-Agent with fake_useragent
from fake_useragent import UserAgent
ua = UserAgent()
print(ua.chrome)
print(ua.opera)
print(ua.firefox)
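fake_useragent fetches its User-Agent database over the network, so it can occasionally fail or be unavailable. A minimal stdlib fallback, assuming a small hand-maintained pool (the UA strings below are ordinary browser strings chosen for illustration):

```python
import random

# A small hand-maintained UA pool as a fallback when fake_useragent is unavailable
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0 Safari/605.1.15",
]

def random_ua():
    # Pick a different browser fingerprint on each call
    return random.choice(UA_POOL)

print(random_ua())
```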
Proxy Settings
Free proxy list sites:
https://www.kuaidaili.com/free/
https://www.xicidaili.com/nt/
from urllib.request import Request
from urllib.request import build_opener
from urllib.request import ProxyHandler
from fake_useragent import UserAgent
url = "http://httpbin.org/get"
headers = {
'User-Agent': UserAgent().chrome
}
request = Request(url, headers=headers)
handler = ProxyHandler({
"http": "112.95.23.90:8888"
})
opener = build_opener(handler)
response = opener.open(request)
print(response.read().decode())
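If every request should go through the proxy, the opener can be installed as the global default so that plain urlopen() calls use it too. A sketch reusing the proxy address from above (free proxies like this one expire quickly, so treat it as a placeholder):

```python
from urllib.request import ProxyHandler, build_opener, install_opener

# install_opener makes the proxy-aware opener the process-wide default,
# so subsequent urlopen() calls are routed through the proxy
handler = ProxyHandler({"http": "112.95.23.90:8888"})
opener = build_opener(handler)
install_opener(opener)
# from urllib.request import urlopen
# print(urlopen("http://httpbin.org/get").read().decode())  # now uses the proxy
```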
Setting Cookies
from urllib.request import HTTPCookieProcessor
from urllib.request import build_opener
from urllib.request import Request
from http.cookiejar import MozillaCookieJar
from fake_useragent import UserAgent
# Save cookies to a file
def get_cookie():
    url = "http://baidu.com"
    headers = {
        'User-Agent': UserAgent().chrome
    }
    request = Request(url, headers=headers)
    cookie_jar = MozillaCookieJar()
    handler = HTTPCookieProcessor(cookie_jar)
    opener = build_opener(handler)
    response = opener.open(request)
    cookie_jar.save("cookie.txt", ignore_expires=True, ignore_discard=True)

if __name__ == '__main__':
    get_cookie()
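Saving cookies is only half the story; a later session can load them back and attach them to a new opener. A sketch of the loading side, assuming "cookie.txt" was written by get_cookie() above (the plain Mozilla UA string is just a stand-in for fake_useragent):

```python
from urllib.request import HTTPCookieProcessor, build_opener, Request
from http.cookiejar import MozillaCookieJar

# Load previously saved cookies and attach them to a new opener,
# so the next request carries the same session state
def use_cookie(path="cookie.txt"):
    cookie_jar = MozillaCookieJar()
    cookie_jar.load(path, ignore_expires=True, ignore_discard=True)
    handler = HTTPCookieProcessor(cookie_jar)
    opener = build_opener(handler)
    request = Request("http://baidu.com",
                      headers={'User-Agent': 'Mozilla/5.0'})
    return opener.open(request)
```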
Exception Handling: URLError
from urllib.request import Request, urlopen
from urllib.error import URLError
from fake_useragent import UserAgent

url = "https://missj.top/adda"
headers = {
    'User-Agent': UserAgent().chrome
}
try:
    req = Request(url, headers=headers)
    resp = urlopen(req)
    print(resp.read().decode())
except URLError as e:
    if e.args == ():
        # An HTTPError: the server responded with an error status code
        print(e.code)
    else:
        # A lower-level failure (DNS lookup, connection refused, etc.)
        print(e.args[0].errno)
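Because HTTPError is a subclass of URLError, a more idiomatic version catches the two cases in separate except clauses instead of inspecting e.args; a sketch (the fetch helper name and plain UA string are illustrative):

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def fetch(url):
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        return urlopen(req).read().decode()
    except HTTPError as e:
        # Server responded, but with an error status (404, 500, ...)
        print("HTTP error, status code:", e.code)
    except URLError as e:
        # Could not reach the server at all
        print("Failed to reach the server:", e.reason)

# HTTPError must come first: it is a subclass of URLError,
# so a URLError clause listed first would swallow it
print(issubclass(HTTPError, URLError))  # True
```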