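Preface:
- Starting with this part, I will work through some of the libraries mentioned in Tutorial 2, following their official documentation.
- A crawler begins by connecting to a web page. We can connect with either a GET or a POST request, using either the urllib library (in Python 3 you import urllib; in Python 2 it is urllib and urllib2) or the requests library. In terms of code complexity and readability, requests clearly serves programmers better, but I could not find a detailed Chinese-language explanation of it, which is why I wrote this article.
- All reference material comes from the official documentation: http://docs.python-requests.org/en/master/user/quickstart/#make-a-request
- The text includes some extra background here and there; feel free to skim it if it does not interest you.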
How to use the requests library
- First, import the requests package:
import requests
- Then request a page with get or post (the two differ, so choose the one that fits your needs):
req_1 = requests.get('https://m.weibo.cn/status/4278783500356969')
req_2 = requests.post('https://m.weibo.cn/status/4278783500356969')
- A quick aside: what do these two calls actually return?
- Now, we have a Response object called req_1/req_2. We can get all the information we need from this object.
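This is how the official documentation puts it. What we get back is an object that holds the source of the requested page plus related information. Printing the object itself only shows something like <Response [200]>; print req_1.text to see the page source. The object's fields are accessed with the '.' operator, and I will summarize them at the end of the article [Note 1].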
- Going a bit further: what other operations can we perform on a URL?
req = requests.put('http://httpbin.org/put', data={'key': 'value'})
req = requests.delete('http://httpbin.org/delete')
req = requests.head('http://httpbin.org/get')
req = requests.options('http://httpbin.org/get')
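- Most of the time a request also needs some parameters. If you have used urllib, you will appreciate how convenient requests is:
- First, how to attach parameters, form data, or other information to a request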
- get:
payload = {'key1': 'value1', 'key2': 'value2'}  # a value may also be a list
req = requests.get('http://httpbin.org/get', params=payload)
- post:
yourData = {'key': 'value'}
req = requests.post('http://httpbin.org/post', data=yourData)
- The next example shows that a single form key can carry several values
payload_tuples = [('key1', 'value1'), ('key1', 'value2')]
r1 = requests.post('http://httpbin.org/post', data=payload_tuples)
payload_dict = {'key1': ['value1', 'value2']}
r2 = requests.post('http://httpbin.org/post', data=payload_dict)
print(r1.text)
# {
#   ...
#   "form": {
#     "key1": [
#       "value1",
#       "value2"
#     ]
#   },
#   ...
# }
r1.text == r2.text
# True
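To see why the two payloads come out the same without sending anything, the request can be prepared locally and its body inspected. This is only a sketch: it assumes that looking at requests.Request(...).prepare() is a fair stand-in for the real call, and the helper name form_body is made up for illustration.

```python
import requests

def form_body(data):
    """Return the form-encoded body requests would send for `data` (no network access)."""
    prepared = requests.Request('POST', 'http://httpbin.org/post', data=data).prepare()
    return prepared.body

# A list of tuples and a dict with a list value, as in the example above.
body_from_tuples = form_body([('key1', 'value1'), ('key1', 'value2')])
body_from_dict = form_body({'key1': ['value1', 'value2']})

print(body_from_tuples)
print(body_from_tuples == body_from_dict)
```

Both payloads repeat the key once per value in the encoded body, which is why httpbin reports them as the same form.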
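- This example shows that the body can be encoded in other ways too, for example sent as JSON
# Option 1: serialise the payload yourself
import json
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
req = requests.post(url, data=json.dumps(payload))
# Option 2: let requests serialise it (this also sets Content-Type to application/json)
url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
req = requests.post(url, json=payload)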
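The difference between the two options shows up in the headers rather than in the body. The sketch below prepares both requests without sending them (the GitHub endpoint is the same placeholder as above), so treat it as an illustration under that assumption rather than a full client.

```python
import json
import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

# Option 1: serialise by hand; requests sends the string as-is.
manual = requests.Request('POST', url, data=json.dumps(payload)).prepare()
# Option 2: let requests serialise the payload and label the body.
automatic = requests.Request('POST', url, json=payload).prepare()

print(manual.headers.get('Content-Type'))     # no Content-Type is added for a plain string body
print(automatic.headers.get('Content-Type'))  # set to application/json by the json= parameter
print(json.loads(manual.body) == json.loads(automatic.body))
```

If the server insists on a JSON content type, option 1 needs an explicit headers={'Content-Type': 'application/json'}, while option 2 takes care of it.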
- If you want to send headers
get:
headers = {'user-agent': 'my-app/0.0.1'}
req = requests.get(url, headers=headers)
post:
header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
# xsrf must be defined beforehand (presumably the anti-CSRF token taken from the login page)
data = {'_xsrf': xsrf, 'email': 'your email', 'password': 'your password', 'remember_me': True}
session = requests.Session()
result = session.post('https://www.zhihu.com/login/email', headers=header, data=data)
# result.text is a JSON-formatted string that contains the login result
- If you want to send cookies
get:
url = 'http://httpbin.org/cookies'
req = requests.get(url, cookies=dict(cookies_are='working'))
post:
import requests
r = requests.get(url1)  # the URL of your first request
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Cache-Control': 'no-cache',
    'Content-Length': '6',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Host': 'www.mm131.com',
    'Pragma': 'no-cache',
    'Origin': 'http://www.mm131.com/xinggan/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'
}  # example headers; copy the ones your own POST request uses
# Build the Cookie header from the cookies returned by the first response
headers['Cookie'] = '; '.join('='.join(i) for i in r.cookies.items())
r = requests.post(url2, headers=headers, data=data)  # the URL of your second request
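A Session is often a simpler way to carry cookies from one request to the next, because it stores the cookies it receives and attaches them to later requests. The sketch below fakes the "first response" by putting a cookie into the session by hand (the name token and the value abc are made up) and then prepares a POST without sending it.

```python
import requests

session = requests.Session()
# Stand-in for a cookie that a first response would normally have set.
session.cookies.set('token', 'abc')

# Prepare (but do not send) a follow-up POST through the same session.
request = requests.Request('POST', 'http://httpbin.org/post', data={'key': 'value'})
prepared = session.prepare_request(request)

print(prepared.headers['Cookie'])
```

In real code session.get(url1) followed by session.post(url2, data=data) would do the same thing without any manual header building.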
- If you want to upload a file
post:
# Basic version:
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
req = requests.post(url, files=files)
req.text
# {
#   ...
#   "files": {
#     "file": "<censored...binary...data>"
#   },
#   ...
# }
# Advanced version (explicit filename, content type and extra headers):
url = 'http://httpbin.org/post'
files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': '0'})}
req = requests.post(url, files=files)
req.text
# {
#   ...
#   "files": {
#     "file": "<censored...binary...data>"
#   },
#   ...
# }
- A string can be uploaded as a file as well:
url = 'http://httpbin.org/post'
files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
req = requests.post(url, files=files)
req.text
# {
#   ...
#   "files": {
#     "file": "some,data,to,send\\nanother,row,to,send\\n"
#   },
#   ...
# }
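For a look at what the files argument turns into, the upload can be prepared locally instead of being sent. This is a sketch under the assumption that inspecting the prepared request is enough; it reuses the report.csv string from above.

```python
import requests

files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# The body is multipart/form-data; requests generates the boundary for us.
print(prepared.headers['Content-Type'])
print(b'filename="report.csv"' in prepared.body)
```

This is also a reason not to set Content-Type by hand when uploading files: the boundary in the header has to match the one used in the body.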
- As further background, here are the function definitions of get and post, which give a fuller picture of the parameters:
get:
def get(url, params=None, **kwargs):
    r"""Sends a GET request.

    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    kwargs.setdefault('allow_redirects', True)
    return request('get', url, params=params, **kwargs)
post:
def post(url, data=None, json=None, **kwargs):
    r"""Sends a POST request.

    :param url: URL for the new :class:`Request` object.
    :param data: (optional) Dictionary (will be form-encoded), bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
    """

    return request('post', url, data=data, json=json, **kwargs)
- Next, a way to print the URL after the parameters have been added:
print(req.url)
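The same URL can be inspected before anything is sent by preparing the request locally. A small sketch, using the httpbin address from above and a list value to show how repeated keys are encoded:

```python
import requests

payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# The query string is built from the params dict; list values repeat the key.
print(prepared.url)
```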
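- Another thing to watch out for is encoding:
- When you use print(req.text), requests decodes the body for you automatically (the response itself arrives as bytes, and with urllib you have to decode it by hand). Changing the encoding is just as easy: req.encoding = 'ISO-8859-1'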
- And if you want the result as raw bytes:
req.content  # an attribute, not a method, so no parentheses
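The relationship between req.content, req.encoding and req.text can be imitated with plain Python, which makes it easier to see what a wrong encoding does. This sketch does not touch requests at all; the sample string is arbitrary.

```python
body = '爬蟲'.encode('utf-8')          # plays the role of req.content: raw bytes

as_utf8 = body.decode('utf-8')         # like req.text when req.encoding is 'utf-8'
as_latin1 = body.decode('ISO-8859-1')  # like req.text when req.encoding is 'ISO-8859-1'

print(as_utf8 == '爬蟲')    # the right encoding recovers the original text
print(as_latin1 == '爬蟲')  # the wrong one produces garbled characters instead
```

If req.text looks garbled, checking req.encoding and setting it to the page's real encoding before reading req.text again is usually the first thing to try.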
- And if you want the result parsed as JSON:
req.json()
- Important: always handle exceptions here. The response body may well not be valid JSON, or the request itself may have failed.
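One way to handle both failure modes is sketched below. The function names parse_json_or_none and fetch_json, the 5-second timeout and the choice to return None on failure are all my own assumptions; adapt them to whatever your crawler needs.

```python
import requests

def parse_json_or_none(response):
    """Return the decoded JSON body, or None when the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # invalid JSON is reported as a ValueError subclass
        return None

def fetch_json(url, timeout=5):
    """Fetch `url` and return its JSON body, or None on any request or decoding failure."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException:
        return None
    return parse_json_or_none(response)
```

A call such as fetch_json('https://api.github.com/events') then returns either the parsed data or None, so the caller only has one case to check.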
- If you want the raw, unprocessed response:
req = requests.get('https://api.github.com/events', stream=True)
req.raw
# <urllib3.response.HTTPResponse object at 0x101194810>
req.raw.read(10)
# '\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
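- In general, though, a streamed response should be saved to a file with a pattern like this (filename is a path of your choosing):
with open(filename, 'wb') as fd:
    for chunk in req.iter_content(chunk_size=128):
        fd.write(chunk)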
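The same pattern can be wrapped in a small helper and tried out without a network connection. In this sketch save_chunks is a made-up name, and a list of byte strings stands in for what iter_content(chunk_size=128) would yield.

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path` and return the number of bytes written."""
    written = 0
    with open(path, 'wb') as fd:
        for chunk in chunks:
            fd.write(chunk)
            written += len(chunk)
    return written

# Stand-in for the chunks of a streamed response; a real call would read from the network.
fake_chunks = [b'first chunk, ', b'second chunk']
target = os.path.join(tempfile.mkdtemp(), 'download.bin')
print(save_chunks(fake_chunks, target))
```

With a real streamed response the call would be save_chunks(req.iter_content(chunk_size=128), filename).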
- If you need information about the response, such as its headers:
req.headers
# {
#     'content-encoding': 'gzip',
#     'transfer-encoding': 'chunked',
#     'connection': 'close',
#     'server': 'nginx/1.0.4',
#     'x-runtime': '148ms',
#     'etag': '"e1ca502697e5c9317743dc078f67693f"',
#     'content-type': 'application/json'
# }
req.headers['Content-Type']
# 'application/json'
req.headers.get('content-type')
# 'application/json'
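Both spellings of the header name work above because response headers are stored in a case-insensitive dictionary. The sketch below builds one directly from requests.structures to show the behaviour in isolation; the header value is just sample data.

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({'content-type': 'application/json'})

# Lookups ignore the case of the key.
print(headers['Content-Type'])
print(headers.get('CONTENT-TYPE'))
print('Content-Type' in headers)
```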
- How to read cookies and how to send them:
# Reading a cookie from a response
>>> url = 'http://example.com/some/cookie/setting/url'
>>> r = requests.get(url)
>>> r.cookies['example_cookie_name']
'example_cookie_value'
# Sending cookies with a request
>>> url = 'http://httpbin.org/cookies'
>>> cookies = dict(cookies_are='working')
>>> r = requests.get(url, cookies=cookies)
>>> r.text
'{"cookies": {"cookies_are": "working"}}'
# Using a cookie jar (RequestsCookieJar) for both steps
>>> jar = requests.cookies.RequestsCookieJar()
>>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
>>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')
>>> url = 'http://httpbin.org/cookies'
>>> r = requests.get(url, cookies=jar)
>>> r.text
'{"cookies": {"tasty_cookie": "yum"}}'
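Why only tasty_cookie comes back can be checked locally: the jar sends a cookie only when the request's domain and path match the ones the cookie was set with. The sketch below prepares the same request without sending it and prints the Cookie header that would go out.

```python
import requests

jar = requests.cookies.RequestsCookieJar()
jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')
jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

# Only cookies whose path matches /cookies end up in the header.
prepared = requests.Request('GET', 'http://httpbin.org/cookies', cookies=jar).prepare()
print(prepared.headers.get('Cookie'))
```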
6. Other topics (placeholders, to be filled in later):
- Status codes
- Timeouts
- Exception and error handling