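1 Implementation with urllib
The differences among urllib, urllib2, and urllib3 are covered elsewhere. In Python 3, urllib is organized as a package containing the following modules:
Name | Purpose |
---|---|
urllib.request | Opens and reads URLs |
urllib.error | Defines the exceptions raised by urllib.request |
urllib.parse | Parses URLs |
urllib.robotparser | Parses robots.txt files |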
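As a quick offline illustration of two of these modules, the sketch below splits a URL with urllib.parse and evaluates a small rule set with urllib.robotparser. The robots.txt rules and the www.example.com URLs are invented for the example and are not taken from any real site.

```python
from urllib import parse, robotparser

# Split a URL into its components and decode the query string
parts = parse.urlsplit('https://www.zhihu.com/signin?next=%2F')
query = parse.parse_qs(parts.query)
print(parts.netloc, parts.path, query)

# Feed robots.txt rules directly instead of downloading them
# (these rules are made up for illustration)
rules = robotparser.RobotFileParser()
rules.parse(['User-agent: *', 'Disallow: /private/'])
allowed = rules.can_fetch('*', 'https://www.example.com/index.html')
blocked = rules.can_fetch('*', 'https://www.example.com/private/data')
print(allowed, blocked)
```

Passing the rules to parse() keeps the example independent of the network; against a real site you would normally call set_url() and read() instead.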
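1.1 Implementing the full request and response model
urllib.request provides a basic function, urlopen, which fetches data by sending a request to the given URL. Its simplest form is: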
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
# Response
res = request.urlopen('http://www.zhihu.com')  # a timeout can be set, e.g. timeout=2
html = res.read()
print(html)
Output:
b'<!doctype html>\n<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react...'
The code above can be split into two steps, building the request and then fetching the response:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
# Request
req = request.Request('http://www.zhihu.com')
# Response
res = request.urlopen(req)
html = res.read()
print(html)
Both approaches above send GET requests. A POST request looks like this:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
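from urllib import parse, request
url = 'https://www.xxx.com//login'
# Request data must be bytes, so URL-encode the form fields first
postdata = parse.urlencode({'username': 'miao',
                            'password': '123456'}).encode('utf-8')
# Request
req = request.Request(url, postdata)
# Response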
res = request.urlopen(req)
html = res.read()
print(html)
The URL above is a placeholder, so try this against a login endpoint of your own.
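Because a placeholder URL cannot be run as is, here is a minimal offline sketch of how a POST request is assembled. urllib.request expects the body as bytes, and supplying data is what switches the method from GET to POST. The www.example.com endpoint is an assumption made for illustration, and nothing is sent over the network.

```python
from urllib import parse, request

# Hypothetical endpoint, used only for illustration
login_url = 'https://www.example.com/login'
form = {'username': 'miao', 'password': '123456'}

# urlencode produces a str, which must be encoded to bytes for the body
body = parse.urlencode(form).encode('utf-8')

get_req = request.Request(login_url)
post_req = request.Request(login_url, data=body)

print(get_req.get_method())   # no data, so this is a GET
print(post_req.get_method())  # data supplied, so this is a POST
print(post_req.data)
```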
1.2 Handling request headers
The following example shows how to add request header information, here User-Agent and Referer:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import parse, request
url = 'https://www.xxx.com//login'
# Request data must be bytes, so URL-encode the form fields first
postdata = parse.urlencode({'username': 'xxx',
                            'password': '******'}).encode('utf-8')
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
headers = {'User-Agent': user_agent, 'Referer': referer}
# Request
req = request.Request(url, postdata, headers)
# Response
res = request.urlopen(req)
html = res.read()
print(html)
Request headers can also be added with add_header:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
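from urllib import parse, request
url = 'https://www.xxxxxx.com//login'
# Request data must be bytes, so URL-encode the form fields first
postdata = parse.urlencode({'username': 'xxx',
                            'password': '******'}).encode('utf-8')
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
req = request.Request(url, postdata)
# Add the headers after the request has been created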
req.add_header('User-Agent', user_agent)
req.add_header('Referer', referer)
res = request.urlopen(req)
html = res.read()
print(html)
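One detail worth knowing when reading headers back from a Request object: urllib stores each header name after passing it through str.capitalize(), so the lookup is case-sensitive. The sketch below builds a request without sending it; the www.example.com URL is an assumption for illustration.

```python
from urllib import request

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = request.Request('https://www.example.com/login',
                      headers={'User-Agent': user_agent})
req.add_header('Referer', 'https://www.github.com')

# Names are stored capitalized, so 'User-Agent' becomes 'User-agent'
stored = req.get_header('User-agent')
missed = req.get_header('User-Agent')  # None: the lookup is case-sensitive
print(stored)
print(missed)
print(sorted(req.header_items()))
```

This only affects how you query the Request object locally; HTTP header names themselves are case-insensitive on the wire.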
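Note:
Some headers deserve particular attention because servers check them, for example:
- User-Agent: some servers or proxies use this value to decide whether the request comes from a browser.
- Content-Type: when a REST interface is used, the server checks this value to decide how to parse the HTTP body. With RESTful or SOAP services, a wrong value can cause the server to reject the request. Common values are:
  - application/xml (used for XML RPC calls, such as RESTful/SOAP calls)
  - application/json (used for JSON RPC calls)
  - application/x-www-form-urlencoded (used when a browser submits a web form)
- Referer: servers sometimes check this value to prevent hotlinking.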
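To make the Content-Type point concrete, here is a small sketch that prepares a JSON body and labels it explicitly. If no Content-Type is set on a request that carries data, urllib falls back to application/x-www-form-urlencoded when the request is sent, which would be wrong for a JSON payload. The www.example.com/api endpoint is hypothetical, and the request is only built, not sent.

```python
import json
from urllib import request

# Serialize the payload and encode it to bytes for the request body
payload = json.dumps({'key': 'value'}).encode('utf-8')

req = request.Request('https://www.example.com/api',  # hypothetical endpoint
                      data=payload,
                      headers={'Content-Type': 'application/json'})

# Header names are stored capitalized by urllib, hence 'Content-type'
content_type = req.get_header('Content-type')
print(req.get_method(), content_type, req.data)
```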
1.3 Cookie handling
To read the value of a cookie, you can do the following:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
from http import cookiejar
cookie = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
# Response
res = opener.open('http://www.zhihu.com')
for item in cookie:
    print(item.name + ": " + item.value)
Output:
_xsrf: 467z...
_zap: 4f91...
KLBRSID: ed2a...
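The same mechanism can be tried without an external site. The sketch below is an assumption-laden demo rather than the setup used above: it starts a throwaway local server (the CookieEcho handler and the session=abc123 cookie are invented) that sets a cookie and echoes back whatever Cookie header it receives, so you can see the CookieJar storing the cookie and sending it on the next request.

```python
import threading
from http import cookiejar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class CookieEcho(BaseHTTPRequestHandler):
    """Hypothetical test server: sets a cookie and echoes the Cookie header."""

    def do_GET(self):
        body = (self.headers.get('Cookie') or 'no cookie').encode('utf-8')
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default request logging


server = HTTPServer(('127.0.0.1', 0), CookieEcho)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

jar = cookiejar.CookieJar()
# ProxyHandler({}) bypasses any system proxy for this local request
opener = request.build_opener(request.ProxyHandler({}),
                              request.HTTPCookieProcessor(jar))

with opener.open(url, timeout=5) as res:
    first = res.read().decode('utf-8')   # the jar is still empty here
with opener.open(url, timeout=5) as res:
    second = res.read().decode('utf-8')  # the stored cookie is sent back

server.shutdown()
server.server_close()

stored = {c.name: c.value for c in jar}
print(first)
print(second)
print(stored)
```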
You can of course also add cookie content manually as needed:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
cookie = ('Cookie', 'email=' + '[email protected]')
opener = request.build_opener()
opener.addheaders = [cookie]
# Request
req = request.Request('http://www.zhihu.com')
# Response
res = opener.open(req)
print(res.headers)
retdata = res.read()
Output:
Date: Tue, 09 Jun 2020 06:45:54 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 49014
Connection: close
Server: CLOUD ELB 1.0.0...
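1.4 Getting the HTTP response code
For 200 OK, calling getcode() on the object returned by urlopen is enough to get the HTTP response code. For error codes such as 4XX and 5XX, urlopen raises an HTTPError instead, and the code is available on the exception: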
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import error, request
try:
    # Response
    res = request.urlopen('http://www.zhihu.com')
    print(res.getcode())
except error.HTTPError as e:
    # HTTPError carries the status code of the failed request
    print("Error code: ", e.code)
Output:
200
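To see both branches in action, the following sketch runs against a throwaway local server instead of a real site. The Handler class, the /ok and /nope paths, and the fetch_status helper are all hypothetical names introduced for this demo; the point is only that a 200 comes back through getcode() while a 404 arrives as an HTTPError.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class Handler(BaseHTTPRequestHandler):
    """Hypothetical local server: 200 for /ok, 404 for anything else."""

    def do_GET(self):
        status = 200 if self.path == '/ok' else 404
        body = b'hello' if status == 200 else b'missing'
        self.send_response(status)
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default request logging


def fetch_status(opener, url):
    """Return the HTTP status code whether or not urllib raises."""
    try:
        with opener.open(url, timeout=5) as res:
            return res.getcode()
    except error.HTTPError as e:
        return e.code


server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

# ProxyHandler({}) bypasses any system proxy for this local request
opener = request.build_opener(request.ProxyHandler({}))
codes = [fetch_status(opener, base + '/ok'),
         fetch_status(opener, base + '/nope')]

server.shutdown()
server.server_close()
print(codes)
```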
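1.5 Redirects
The following code prints the final URL of the response, which differs from the requested URL when a redirect took place: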
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import error, request
try:
    # Response
    res = request.urlopen('http://www.zhihu.com')
    print(res.geturl())
except error.HTTPError as e:
    # HTTPError carries the status code of the failed request
    print("Error code: ", e.code)
Output:
https://www.zhihu.com/signin?next=%2F
If you do not want the default redirect behavior, you can define your own HTTPRedirectHandler subclass:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
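class RedirectHandler(request.HTTPRedirectHandler):
    # Returning None leaves a 301 unhandled here, so the opener falls back to raising HTTPError
    def http_error_301(self, req, fp, code, msg, headers):
        pass
    # For a 302, the parent class still processes the redirect; the original code and the final URL are then recorded on the result
    def http_error_302(self, req, fp, code, msg, headers):
        result = request.HTTPRedirectHandler.http_error_301(self, req, fp, code, msg, headers)
        result.status = code
        result.newurl = result.geturl()
        return result
opener = request.build_opener(RedirectHandler)
res = opener.open('http://www.zhihu.cn')
print(res)
Output: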
<http.client.HTTPResponse object at 0x000001BEAC776160>
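A different way to stop redirects altogether, shown here only as an alternative sketch and not as the approach used above, is to override redirect_request so that it returns None. urllib then surfaces the 3XX response as an HTTPError whose headers still contain Location. The local server, the NoRedirect class, and the /target path are assumptions made for this demo.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class Handler(BaseHTTPRequestHandler):
    """Hypothetical local server: / redirects to /target with a 302."""

    def do_GET(self):
        if self.path == '/target':
            body = b'arrived'
            self.send_response(200)
        else:
            body = b''
            self.send_response(302)
            self.send_header('Location', '/target')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default request logging


class NoRedirect(request.HTTPRedirectHandler):
    """Refuse every redirect by declining to build a follow-up request."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = 'http://127.0.0.1:%d' % server.server_port

# ProxyHandler({}) bypasses any system proxy for these local requests
follow = request.build_opener(request.ProxyHandler({}))
with follow.open(base + '/', timeout=5) as res:
    final_url = res.geturl()  # the default opener follows the 302

no_follow = request.build_opener(request.ProxyHandler({}), NoRedirect)
try:
    no_follow.open(base + '/', timeout=5)
    blocked_code, location = None, None
except error.HTTPError as e:
    blocked_code, location = e.code, e.headers.get('Location')

server.shutdown()
server.server_close()
print(final_url)
print(blocked_code, location)
```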
1.6 Proxy settings
An example:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
proxy = request.ProxyHandler({'http': '127.0.0.1:8087'})
opener = request.build_opener(proxy)
res = opener.open('http://www.zhihu.com/')
print(res.read())
No output is shown here because the result depends on a proxy listening at 127.0.0.1:8087.
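Two related options may be useful, sketched here without sending any request: install_opener makes later urlopen calls in the same process go through the proxy-aware opener, and an empty mapping passed to ProxyHandler turns proxies off, including any picked up from environment variables. The proxy address is the one from the example above.

```python
from urllib import request

# Map each URL scheme to the proxy that should handle it
proxy = request.ProxyHandler({'http': '127.0.0.1:8087'})
opener = request.build_opener(proxy)

# After this call, plain request.urlopen(...) uses the opener above
request.install_opener(opener)

# An empty mapping disables proxies for requests made through this opener
direct = request.build_opener(request.ProxyHandler({}))

print(proxy.proxies)
```

Keeping a separate opener per proxy setting, rather than installing one globally, avoids surprising other code in the same process.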
2 Implementation with requests
2.1 Implementing the full request and response model
1) GET request:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
res = requests.get('http://www.zhihu.com')
print(res.content)
2) POST request:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
postdata = {'key' : 'value'}
res = requests.post('http://www.zhihu.com', data=postdata)
print(res.content)
Other HTTP methods are used in the same way:
- requests.put('http://www.xxxxxx.com/put', data={'key': 'value'})
- requests.delete('http://www.xxxxxx.com/delete')
- requests.head('http://www.xxxxxx.com/get')
- requests.options('http://www.xxxxxx.com/get')
3) Complex URLs: instead of writing out the complete URL, you can pass the query parameters to requests as a dictionary:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
payload = {'Keywords': 'bolg:qiyeboy', 'pageindex': 1}
# A timeout can also be set here, e.g. timeout=2
res = requests.get('http://www.zhihu.com', params=payload)
print(res.url)
Output:
https://www.zhihu.com/?Keywords=bolg%3Aqiyeboy&pageindex=1
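The query string in this output can be reproduced with the standard library, which gives a rough picture of the encoding step that params performs. This is a sketch using urllib.parse.urlencode, not a description of how requests is implemented internally.

```python
from urllib import parse

payload = {'Keywords': 'bolg:qiyeboy', 'pageindex': 1}

# urlencode percent-encodes reserved characters such as ':' and joins pairs with '&'
query = parse.urlencode(payload)
url = 'https://www.zhihu.com/?' + query
print(url)
```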
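2.2 Responses and encoding
Taking res = requests.get('http://www.zhihu.com') as an example, the returned object provides:
- res.content: the body as bytes
- res.text: the body as text
- res.encoding: the page encoding guessed from the HTTP headers
Here the third-party library chardet is used to detect the encoding of a string or file: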
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
import chardet
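res = requests.get('http://www.zhihu.com')
# detect returns a dictionary with:
# - 'encoding': the detected encoding
# - 'confidence': how confident the detection is
# - 'language': the detected natural language, possibly empty
ret_dic = chardet.detect(res.content)
# Decode the response text with the detected encoding
res.encoding = ret_dic['encoding']
print(ret_dic)
print(res.text)
Output: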
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
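Why the encoding matters can be shown with plain bytes and no network access. In the sketch below, raw stands in for res.content and the decoded strings stand in for res.text under a correct and an incorrect encoding; the sample text is an assumption chosen for illustration.

```python
# Two Chinese characters (the site name Zhihu), written as escapes
original = '\u77e5\u4e4e'

raw = original.encode('utf-8')      # bytes, like res.content
text = raw.decode('utf-8')          # right encoding: the text round-trips
garbled = raw.decode('latin-1')     # wrong encoding: no error, but unreadable

print(len(raw), len(text), len(garbled))
print(text == original, garbled == original)
```

A wrong guess often fails silently like this, which is why detecting the encoding from the bytes themselves can be more reliable than trusting a missing or incorrect header.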
2.3 Handling request headers
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
res = requests.get('http://www.zhihu.com', headers=headers)
print(res.content)
2.4 Handling the response code and response headers
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
res = requests.get('http://www.baidu.com')
# res.status_code: the response code
# res.status_code == requests.codes.ok: checks the response code
if res.status_code == requests.codes.ok:
    print("Status code:", res.status_code)
    print("Response headers:", res.headers)
    print("Field lookup:", res.headers.get('content-type'))
else:
    # raise_for_status() raises an exception when the code is 4XX or 5XX
    # and returns None when the code is 200
    res.raise_for_status()
Output:
Status code: 200
Response headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 09 Jun 2020 13:42:42 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:52 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
Field lookup: text/html
2.5 Cookie handling
1) Cookies returned by the server:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers={'User-Agent':user_agent}
res = requests.get('http://www.baidu.com', headers=headers)
for cookie in res.cookies.keys():
    print(cookie + ": " + res.cookies.get(cookie))
Output:
BAIDUID: D285BF54C9CC968744699A9B4F843D60:FG=1
BIDUPSID: D285BF54C9CC9687F9E45D28DB4C9F33
H_PS_PSSID: 1456_31326_21100_31069_31765_31673_30823
PSTM: 1591710519
BDSVRTM: 0
BD_HOME: 1
2) Custom cookies:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
# Custom cookies to send with the request
cookies = dict(name='guangtouqiang', age='18')
res = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)
print(res.text)
3) Automatic cookie handling with a Session, which carries cookies from one request to the next:
# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
login_url = 'http://www.zhihu.com/login'
s = requests.Session()
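datas = {'name': 'guangtouqiang', 'passwd': '123456'}
# Visit as a guest first so that the server assigns a cookie; without this step the site treats the client as an illegitimate user
# allow_redirects=True permits redirects; if one happens, res.history shows the intermediate responses
s.get(login_url, allow_redirects=True)
# Once the login is verified, the session is upgraded to member permissions
res = s.post(login_url, data=datas, allow_redirects=True)
print(res.text)
Output: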
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>