The Urllib Library in Detail
urllib: a request library that provides powerful request-handling functions; it is Python's built-in HTTP request library
urllib.request      # request module
urllib.error        # exception handling module
urllib.parse        # URL parsing module
urllib.robotparser  # robots.txt parsing module
The focus is on the first three modules; the fourth is rarely used.
urlopen
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
e.g. 1
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
.read() fetches the response body as bytes; decode() decodes it into text. The content retrieved should match the first request captured in the browser's Network panel.
e.g. 2: making a POST request
import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
e.g. 3: constraining timeout (the maximum wait time)
import urllib.request
import urllib.error
import socket
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # the response must arrive within 0.1 s
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
Output: TIME OUT
Response
The response type:
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))
Output: <class 'http.client.HTTPResponse'>
Useful information: the status code and the response headers (which tell you whether the request succeeded).
To inspect them:
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)
print(response.getheaders())
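getheaders() returns all headers as a list of tuples; to read a single header there is also getheader(name), which accepts an optional default value. A small sketch (the URL and header names are just examples):

```python
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# getheader(name) returns the value of one header, or the default if absent
print(response.getheader('Server'))
print(response.getheader('X-Does-Not-Exist', 'n/a'))
```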
Alternatively, use read(); it returns a byte stream, which must be decoded:
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
Request
Constructing a Request object lets you control the details of the request:
import urllib.request
request = urllib.request.Request('http://www.baidu.com')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
As you can see, a Request object also yields the response normally,
but constructing a Request allows the request to be customized:
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Host': 'httpbin.org'
}
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Or alternatively:
from urllib import request, parse
url = 'http://httpbin.org/post'
form = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Both produce the same result:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
  },
  "json": null,
  "origin": "182.116.195.55",
  "url": "http://httpbin.org/post"
}
Handler (for proxies, etc.) is skipped here since these notes do not need it.
Cookies: user information that maintains login state.
You can see them in the browser under F12 → Application → Cookies; cookies keep us logged in / authenticated on a site.
Handling cookies:
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
Output:
BAIDUID=F8954DF46819CDEA7151D26EC87BAB92:FG=1
BIDUPSID=F8954DF46819CDEA7151D26EC87BAB92
H_PS_PSSID=26523_1437_21090_28329_28413_22072
PSTM=1548302451
delPer=0
BDSVRTM=0
BD_HOME=0
To save cookies to a file, use the MozillaCookieJar or LWPCookieJar subclass, which provides a save() method:
import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename) # or LWPCookieJar
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
If the cookie has not expired, it can still be used. After saving, load it back with load() (using the same CookieJar subclass that saved the file):
import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar()  # must match the class used to save
cookie.load(filename, ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf8'))
Exception handling: requesting nonexistent/erroneous pages.
The request's status code: e.g. 404 Not Found.
When an error occurs:
from urllib import request, error
try:
    response = request.urlopen('http://shaonian.com/index.htm')
except error.URLError as e:
    print(e.reason)
If the exception is not caught, the program may abort.
The specific exception types to catch:
URLError - only has reason, so it can only print a message;
HTTPError - a subclass of URLError with three attributes: code / headers / reason
Usage (catch HTTPError first, since it is a subclass of URLError):
from urllib import request, error
try:
    response = request.urlopen('http://shaonian.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
Output:
Not Found
404
Date: Thu, 24 Jan 2019 04:18:12 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 5045
Connection: close
Cache-Control: private
X-Powered-By: ASP.NET
Server: wts/1.2
e.reason may itself be an object rather than a plain string; we can print its type and use isinstance to determine the cause:
import socket
from urllib import request, error
try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
# Simply printing the reason also works:
# except error.URLError as e:
#     print(e.reason)
URL parsing: urlparse
Splits a URL into its components - scheme / domain / path / ...
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True), e.g.:
from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/s?ie=UTF-8&wd=%E4%BD%A0%E5%A5%BD')
print(type(result), result)
Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='ie=UTF-8&wd=%E4%BD%A0%E5%A5%BD', fragment='')
Note that these parameters can be changed, but the scheme default only takes effect when the first argument does not already contain one:
from urllib.parse import urlparse
result = urlparse('www.baidu.com/s?ie=UTF-8&wd=%E4%BD%A0%E5%A5%BD', scheme='https')
print(result)
Here the scheme in the result becomes 'https'; note, however, that since the URL lacks the '//' prefix, the domain is parsed into path rather than netloc:
ParseResult(scheme='https', netloc='', path='www.baidu.com/s', params='', query='ie=UTF-8&wd=%E4%BD%A0%E5%A5%BD', fragment='')
Setting allow_fragments=False disables fragment parsing: the fragment text is merged into the query/params, and if query/params are empty it is merged into the path (it attaches to the preceding component):
from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD#comment', allow_fragments=False)
print(result)
Output:
ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='wd=%E4%BD%A0%E5%A5%BD#comment', fragment='')
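The path-merge case (when no query is present) can be checked the same way; this sketch uses an illustrative URL:

```python
from urllib.parse import urlparse

# With no query string, the un-parsed fragment ends up in path
result = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
# ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment',
#             params='', query='', fragment='')
```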
urlunparse: assembles a URL from its parts:
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
Output: http://www.baidu.com/index.html;user?a=6#comment
urljoin: joins URLs; components of the base URL are overridden by any present in the second argument (the base only fills in what is missing).
urljoin(base, url)
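A brief sketch of that override behavior (the URLs here are just examples):

```python
from urllib.parse import urljoin

# The base fills in the missing scheme and domain...
print(urljoin('http://www.baidu.com', 'FAQ.html'))
# -> http://www.baidu.com/FAQ.html

# ...but components present in the second URL win
print(urljoin('http://www.baidu.com/about.html', 'https://example.com/FAQ.html'))
# -> https://example.com/FAQ.html
```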
urlencode: converts a dict into URL query parameters:
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
Output: http://www.baidu.com?name=germey&age=22
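urlencode can also handle sequence values: with doseq=True each element becomes its own parameter. A small sketch (the keys here are illustrative):

```python
from urllib.parse import urlencode

params = {'name': 'germey', 'hobby': ['reading', 'coding']}
# doseq=True expands the list into repeated parameters
print(urlencode(params, doseq=True))
# -> name=germey&hobby=reading&hobby=coding
```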
urllib is a very handy toolkit module. Note that we have now covered the three modules mentioned at the start:
urllib.request      # request module
urllib.error        # exception handling module
urllib.parse        # URL parsing module
For more advanced operations, consult the official documentation - but these are the most commonly used ones!