Table of Contents
1. Introduction
requests is a simple HTTP library for Python that makes cookies, login authentication, and proxy setup easy to handle; it is more convenient than the urllib library.
2. Basic Usage
2.1 Fetching a page's source code
Print the Response object's type, status code, response body type, cookies, and content:
import requests
r = requests.get('https://csdn.net')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.cookies)
print(r.text)
2.2 GET requests
2.2.1 Basic GET request
import requests
r = requests.get('http://httpbin.org/get')
print(r.text)
2.2.2 GET with parameters
import requests
data = {
    'name': 'Kevin',
    'age': 25
}
r = requests.get('http://httpbin.org/get', params=data)
print(type(r.text))
# parse the result into a dictionary
print(type(r.json()))
print(r.json())
2.2.3 Fetching binary data
Used for grabbing images, audio, and video files.
(r.text is of type str while r.content is of type bytes; since an image is binary data, printing r.text decodes the bytes as text and produces garbled output.)
(Use the open() function to save the file locally.)
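As a sketch of the above (the original screenshots are not reproduced here, so httpbin.org's sample image endpoint stands in for the image URL): r.text decodes the body as a string, while r.content holds the raw bytes, which can be written straight to a file.

```python
import requests

# fetch a binary resource; httpbin.org/image/png serves a small sample PNG
r = requests.get('http://httpbin.org/image/png')
print(type(r.text))     # <class 'str'>   -- garbled when decoded as text
print(type(r.content))  # <class 'bytes'> -- the raw image data

# save the raw bytes to a local file
with open('sample.png', 'wb') as f:
    f.write(r.content)
```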
2.2.4 Adding headers (forging the request headers)
Mostly used to forge HTTP request header fields in order to bypass certain restrictions.
For example, requesting zhihu.com without a User-Agent produces an error like the following:
>> After adding a User-Agent, the request succeeds.
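A minimal sketch of the idea, pointed at httpbin.org/get (which echoes the request headers back) rather than zhihu.com; the User-Agent string is just an example browser signature:

```python
import requests

# impersonate a browser by supplying a User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
r = requests.get('http://httpbin.org/get', headers=headers)
# httpbin echoes the headers it received, so we can verify what was sent
print(r.json()['headers']['User-Agent'])
```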
2.3 POST requests
2.3.1 Basic POST request
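A basic POST request sends form data in the request body; a minimal sketch against httpbin.org/post, which echoes the submitted fields back:

```python
import requests

data = {'name': 'Kevin', 'age': 25}
r = requests.post('http://httpbin.org/post', data=data)
# the submitted fields come back under the "form" key (values as strings)
print(r.json()['form'])
```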
2.3.2 Other response information
Other response attributes: status code, URL, response headers, cookies, and redirect history
import requests
data = {'name':'Kevin', 'age':25}
r = requests.post('http://httpbin.org/post', data=data)
# status code
print(r.status_code)
# URL
print(r.url)
# response headers
print(r.headers, '\n')
# cookies
print(r.cookies, type(r.cookies))
# redirect history
print(r.history)
2.3.3 Using the status code to check for success
requests provides a built-in status code lookup object, requests.codes.&lt;name&gt;
import requests
r = requests.get('https://csdn.net')
# print the returned status code
print(r.status_code)
# print a built-in status code
print(requests.codes.not_found)
# compare the returned status code against the built-in one to check whether the request succeeded
if r.status_code == requests.codes.ok:
    print('Request Successfully')
else:
    exit()
(To be more user-friendly, have the failure branch report the error status code.)
2.3.4 requests.codes status code table
The lookup names built into requests.codes and the status code each one maps to are listed below.
Informational status codes:
100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),

Success status codes:
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

Redirection status codes:
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',),  # These 2 to be removed in 3.0

Client error status codes:
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

Server error status codes:
500: ('internal_server_error', 'server_error', '/o\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),
3. Advanced Usage
3.1 File upload
(Upload a local file to the target site.)
import requests
files = {'file':open('C:\\Users\\Administrator\\Desktop\\pdx.jpg','rb')}
r = requests.post('http://httpbin.org/post',files=files)
print(r.text)
3.2 Getting cookies
import requests
r = requests.get('https://www.baidu.com')
print(r.cookies)
# call items() to convert the cookies to a list of tuples, then iterate over each cookie's name and value
for key, value in r.cookies.items():
    print(key + '=' + value)
3.3 Using cookies (staying logged in)
Without cookies, the response looks like this:
(Access the target site with cookies attached.)
>> First log in to zhihu.com and grab the cookie string from the browser.
(When accessing the target site with the cookie attached, tailor the request headers to the site's anti-crawler policy. For zhihu.com this means setting a User-Agent (to impersonate a browser) and a Host header specifying the target server's domain name.)
import requests
# define the request headers
headers = {
    'Cookie': '_zap=16cf0be0-c819-4a9a-b47d-39996b373309; d_c0="AMDer3BeMBGPTjKRqCCcRuDnj8XFy8gip2w=|1588083758"; _ga=GA1.2.306769125.1588083761; _xsrf=uWRzHjKxIjur8ONElXJVAMSOFXlgfmTK; _gid=GA1.2.1735844765.1590308731; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1590367465,1590416837,1590417004,1590454186; SESSIONID=f7wYWMdoktA8VCEeO7XPyQdBvN6c6UBN7Go7dZZHBQF; JOID=UlsUAEpgVZrgVZvkZGP3xmOyKLV0BWDvuwPVggkKItnaAdmiA8syVr1XmuRknp9WtWzxOaADg_2lvkh72OVU_LE=; osd=UlwQAE9gUp7gUJvjYGPyxmS2KLB0AmTvvgPShgkPIt7eAdyiBM8yU71QnuRhnphStWnxPqQDhv2iukh-2OJQ_LQ=; capsion_ticket="2|1:0|10:1590454210|14:capsion_ticket|44:NDRmYzc1NWE0MjU5NDRmNGE5MmYzYjc1NGUxMDlkNmE=|60f5cabd6cf4e69a08ad51e186c9ebe12f079cdaae7976e438c2227e9193cc1b"; z_c0="2|1:0|10:1590454242|4:z_c0|92:Mi4xNm4xU0FnQUFBQUFBd042dmNGNHdFU1lBQUFCZ0FsVk40clc1WHdEajQ4OFpCRVFJU2NkZ3lYV18xdmZKUnJfZFlB|a36a5e8c90afe725e7e2280499cc54ebed142e63ee6678abe2e446f5f2d44b89"; unlock_ticket="ABBK_W4JDwkmAAAAYAJVTepuzF58FGbki7G97tkvhnH4U13W5z0FZQ=="; tst=r; KLBRSID=76ae5fb4fba0f519d97e594f1cef9fab|1590454304|1590454184; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1590454304',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'Host': 'www.zhihu.com'
}
r = requests.get('https://www.zhihu.com', headers=headers)
# save the fetched content to a local file
with open('c_zhihu.html', 'w', encoding='utf-8') as f:
    f.write(r.text)
3.4 Session persistence
Two separate requests behave like two different browsers, each with its own cookies. To conveniently send the same cookies on every request (like opening a new tab in the same browser), maintain a session.
>> The following first requests a test URL that sets a cookie with the value 123456789, then requests the cookies endpoint: without a session, the returned cookies are empty.
(Use Session to carry out follow-up actions after a simulated login, like opening different pages of the same site in one browser.)
import requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456')
r = s.get('http://httpbin.org/cookies')
print(r.text)
3.5 Setting proxies
For large-scale, frequent requests, proxies help prevent your IP from being banned.
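A sketch of the proxies parameter; the proxy addresses below are placeholders and must be replaced with a real proxy, so the request here is expected to fail:

```python
import requests

# placeholder proxy addresses -- substitute a working proxy before use
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
try:
    r = requests.get('http://httpbin.org/get', proxies=proxies, timeout=3)
    print(r.status_code)
except requests.exceptions.RequestException as e:
    # the placeholder proxy is unreachable, so an error is expected here
    print('proxy request failed:', type(e).__name__)
```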
3.6 Proxies with HTTP Basic Auth
Used when the proxy itself requires a username and password before it will relay requests.
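If the proxy requires HTTP Basic Auth, the credentials can be embedded in the proxy URL; the user, password, and address below are placeholders, so the request is expected to fail until real values are substituted:

```python
import requests

# placeholder credentials and proxy address -- substitute real values
proxies = {
    'http': 'http://user:password@10.10.1.10:3128',
}
try:
    r = requests.get('http://httpbin.org/get', proxies=proxies, timeout=3)
    print(r.status_code)
except requests.exceptions.RequestException as e:
    print('proxy request failed:', type(e).__name__)
```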
3.7 Timeout settings
When the network is slow and the server takes too long to respond, set a timeout to keep the crawler effective.
(If there is no response within 0.1 s, raise an exception.)
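The timeout parameter implements this; a minimal sketch (0.1 s is aggressive, so the request may well time out and land in the except branch):

```python
import requests

try:
    # raise requests.exceptions.Timeout if no response arrives within 0.1 seconds
    r = requests.get('http://httpbin.org/get', timeout=0.1)
    print(r.status_code)
except requests.exceptions.Timeout:
    print('request timed out')
```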
3.8 Basic Auth authentication
For authentication prompts like the following:
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://192.168.226.133/')
print(r.status_code,'\n')
# the same request with credentials attached
r = requests.get('http://192.168.226.133/', auth=HTTPBasicAuth('Kevin','123456'))
print(r.status_code)
(Without authentication the server returns 401 Unauthorized; with credentials the request succeeds with 200.)
3.9 Prepared Request
Each request is represented as an independent Request object carrying its own parameters, which is very convenient for queue scheduling.
from requests import Request, Session
# define the target request
url = 'http://httpbin.org/post'
data = {
    'name': 'Kevin'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
# maintain a session
s = Session()
# build the Request object
req = Request('POST', url, data=data, headers=headers)
# call prepare_request() to turn it into a PreparedRequest object
prepped = s.prepare_request(req)
# send it with send()
r = s.send(prepped)
print(r.text)