Python 3 Web Crawler Study Notes: Using the Request Libraries (Part 2)

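When writing a crawler in Python, we need to simulate network requests. The two main libraries for this are requests and Python's built-in urllib. requests is generally recommended; it is a higher-level wrapper built on urllib3 (a third-party library, separate from the standard-library urllib). The main differences in use are:

requests can build and send the common GET and POST requests in a single call, whereas with urllib you usually construct the GET or POST request object first and then send it.

GET request: the request data is placed directly in the URL as a query string.

POST request: the data is placed in the request body (the data argument) rather than in the URL. Parameters appended to the URL are not part of the form data, so a server that reads its input from the body will typically not see them.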
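The difference between the two request types can be checked without sending anything, by building both with urllib.request.Request and inspecting them. This is only a minimal sketch: the httpbin.org URLs and the form fields are illustrative values, not something the steps above depend on.

```python
import urllib.parse
import urllib.request

params = {"name": "germey", "age": 22}  # illustrative values
query = urllib.parse.urlencode(params)

# GET: the parameters travel in the URL's query string; there is no body.
get_request = urllib.request.Request("http://httpbin.org/get?" + query)

# POST: the same parameters travel in the request body, which must be bytes.
post_request = urllib.request.Request(
    "http://httpbin.org/post", data=query.encode("utf-8")
)

for request in (get_request, post_request):
    # get_method() reports which HTTP verb urllib would use for this request.
    print(request.get_method(), request.full_url, request.data)
```

Nothing is sent here; a request only goes out when the object is passed to urlopen().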

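Using urllib

Python 2 had two libraries for sending requests, urllib and urllib2; Python 3 unifies them into urllib. urllib is Python's built-in HTTP request library, so nothing extra needs to be installed. It supports headers, cookies, IP proxies and so on, but it is somewhat tedious to use because you have to handle encoding and decoding yourself on every request.

GET requests:

1. The simplest GET request for a web page, with no headers, cookies or proxies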

import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

Running this prints the HTML source of the page.

2. A basic request to Baidu looks like this:

import urllib.request
# Send a User-Agent header so the request looks like it comes from a browser
header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}
request = urllib.request.Request('http://www.baidu.com', headers=header)
response = urllib.request.urlopen(request)
html = response.read().decode()
print(html)

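3. For an address whose URL is assembled from query parameters:

In this case you add the parameters as data (or concatenate the URL yourself). To request such a page, put the query fields into a dict and URL-encode it into a string that is valid in a URL. Encoding is done with urlencode() from urllib.parse, which turns key:value pairs into "key=value" strings joined by "&"; decoding can be done with urllib.parse.unquote(). (Note that in Python 3 the function is urllib.parse.urlencode(), not urllib.urlencode() as in Python 2.)

The code is: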

import urllib.request as urllib2
import urllib.parse
url = "http://tieba.baidu.com/f"
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
formdata = {
 "ie":"utf-8",
 "kw":"江蘇科技大學",
 "fr":"search"
}
data = urllib.parse.urlencode(formdata)  # URL-encode the query parameters
newurl = url + '?' + data
print(data)
request = urllib2.Request(newurl, headers=headers)
response = urllib2.urlopen(request)
html = response.read().decode("utf-8")
print(html)
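The encoding and decoding step can be tried on its own, without any network access. The sketch below reuses the same formdata and adds parse_qs(), which is not used in the original notes, only to show that the encoded string carries the same fields.

```python
from urllib.parse import parse_qs, unquote, urlencode

formdata = {"ie": "utf-8", "kw": "江蘇科技大學", "fr": "search"}

encoded = urlencode(formdata)  # non-ASCII text becomes %XX escapes of its UTF-8 bytes
decoded = unquote(encoded)     # reverses the percent-encoding
fields = parse_qs(encoded)     # parses the query string back into a dict of lists

print(encoded)
print(decoded)
print(fields)
```

The encoded form is plain ASCII, which is why it can be appended to the URL after the "?".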

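POST requests:

A POST request differs from a GET request in how the parameters are passed: GET appends them to the URL, while POST places the data in the request body, which is how a form submission is simulated.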

import urllib.request
import urllib.parse
url = "http://tieba.baidu.com/f"
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"}
formdata = {
 "ie":"utf-8",
 "kw":"江蘇科技大學",
 "fr":"search"
}
data = urllib.parse.urlencode(formdata).encode('utf-8')  # URL-encode, then convert to bytes for the request body
print(data)
request = urllib.request.Request(url, data = data, headers = headers)
response = urllib.request.urlopen(request)
html=response.read().decode("utf-8")
print( html)

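You can print the type of the response with type(): print(type(response)) gives <class 'http.client.HTTPResponse'>. The HTTPResponse object provides methods such as read(), readinto(), getheader(name), getheaders() and fileno(), and attributes such as msg, version, status, reason, debuglevel and closed.

read() returns the body of the page as bytes.

The status attribute holds the status code of the response, for example 200 for success. For error codes such as 404 (page not found), urlopen() raises urllib.error.HTTPError instead of returning a response.

Setting a proxy: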
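To see these attributes without depending on an external site, the following sketch starts a throwaway HTTP server on the local machine and queries it. The handler class DemoHandler, the helper inspect_responses() and the /ok and /missing paths are all hypothetical names made up for this illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one page at /ok and 404 for everything else."""

    def do_GET(self):
        if self.path == "/ok":
            body = b"<html>hello</html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def inspect_responses():
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_address[1]
    # An empty ProxyHandler keeps the request off any system-wide proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base + "/ok") as response:
            ok = (response.status, response.reason,
                  response.getheader("Content-Type"),
                  response.read().decode("utf-8"))
        try:
            opener.open(base + "/missing")
            missing_code = None
        except urllib.error.HTTPError as e:
            missing_code = e.code  # error statuses are raised, not returned
    finally:
        server.shutdown()
        server.server_close()
    return ok, missing_code


ok, missing_code = inspect_responses()
print(ok)
print(missing_code)
```

A successful response exposes its code through status, while an error code arrives as the code attribute of the raised HTTPError.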

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
proxy_handler = ProxyHandler({
        'http':'http://127.0.0.1:9743',
        'https':'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
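An opener can combine several handlers, which is how urllib covers proxies, cookies and default headers together. The sketch below only builds such an opener and inspects it; no request is sent. The User-Agent string is a made-up placeholder, and the proxy address is the same placeholder used in the proxy example.

```python
import http.cookiejar
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

# Placeholder proxy address; nothing is contacted in this sketch.
proxy_handler = ProxyHandler({"http": "http://127.0.0.1:9743"})

# A CookieJar collects cookies from responses and replays them on later requests.
cookie_jar = http.cookiejar.CookieJar()

opener = build_opener(proxy_handler, HTTPCookieProcessor(cookie_jar))
# Default headers sent with every request made through this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0 (illustrative)")]

print([type(handler).__name__ for handler in opener.handlers])
print(len(cookie_jar))  # no request has been made yet, so the jar is empty
```

Every call to opener.open() would then go through the proxy, carry the default header, and update the cookie jar.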

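Using requests

urllib works, as shown above, but it is inconvenient in many ways. Here we look at a more powerful library, requests. requests is a third-party package (installed with pip install requests) built on top of urllib3, and it covers most of what urllib can do, such as cookie management and proxies.

Example: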

import requests
r = requests.get('https://www.baidu.com/')
print(type(r))
print(r.status_code)
print(type(r.text))
print(r.text)
print(r.cookies)

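Calling get() here plays the same role as urlopen(): it returns a Response object. The script then prints, in order, the type of the Response, the status code, the type of the response body, the body itself, and the cookies.

requests also provides the other HTTP methods: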

import requests

requests.post('http://httpbin.org/post')

requests.put('http://httpbin.org/put')

requests.delete('http://httpbin.org/delete')

requests.head('http://httpbin.org/get')

requests.options('http://httpbin.org/get')

GET requests

1. Basic GET request

import requests
response = requests.get('http://httpbin.org/get')
print(response.text)

2. GET request with parameters (passing in name and age)

import requests
response = requests.get("http://httpbin.org/get?name=germey&age=22")
print(response.text)

Or use the params argument:

import requests
data = {
'name': 'germey',
'age': 22
}
response = requests.get("http://httpbin.org/get", params=data)
print(response.text)
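To confirm that the params form produces the same URL as the hand-written one, the request can be built and prepared without being sent. This is a small sketch that assumes requests is installed; Request(...).prepare() is used here only for inspection and is not part of the original notes.

```python
import requests

data = {"name": "germey", "age": 22}

# Build the request without sending it, then inspect what would go on the wire.
prepared = requests.Request("GET", "http://httpbin.org/get", params=data).prepare()

print(prepared.method)
print(prepared.url)
```

The printed URL contains the query string that requests assembled from the dict, so there is no need to concatenate it by hand.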

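3. Parsing JSON

response.json() parses the response body as JSON and returns a Python object; it gives the same result as json.loads(response.text).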

import requests
import json
response = requests.get("http://httpbin.org/get")
print(type(response.text))
print(response.json())
print(json.loads(response.text))
print(type(response.json()))

Return value (the parsed dict):

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.18.4'}, 'origin': '183.64.61.29', 'url': 'http://httpbin.org/get'}
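The conversion from text to a dict can be reproduced offline. In this sketch the variable text is a stand-in for response.text, a shortened JSON string shaped like the httpbin output above, and json.loads() does roughly what response.json() does with the body.

```python
import json

# Stand-in for response.text: a JSON string shaped like the httpbin output above.
text = '{"args": {}, "headers": {"Host": "httpbin.org"}, "url": "http://httpbin.org/get"}'

parsed = json.loads(text)  # roughly what response.json() does with the body

print(type(text).__name__, type(parsed).__name__)
print(parsed["headers"]["Host"])
print(parsed["url"])
```

Once parsed, individual fields are reached with ordinary dict indexing instead of string searching.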

4. Fetching binary data

import requests
response = requests.get("https://github.com/favicon.ico")
print(type(response.text), type(response.content))
print(response.text)
print(response.content)

response.content is returned as raw binary data (bytes), so its output is not shown here.

5. Adding headers
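Binary content is usually saved to a file rather than printed. The sketch below shows that step with stand-in bytes in place of a real download: content plays the role of response.content, and the temporary directory and file name are arbitrary choices for the illustration.

```python
import os
import tempfile

# Stand-in for response.content: the first bytes of an .ico file header.
content = b"\x00\x00\x01\x00"

# Binary data is written in "wb" mode, with no decoding step.
path = os.path.join(tempfile.mkdtemp(), "favicon.ico")
with open(path, "wb") as f:
    f.write(content)

with open(path, "rb") as f:
    saved = f.read()

print(type(content).__name__, len(saved))
```

With a real response, the same pattern applies with f.write(response.content).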

import requests
headers = {
 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

POST requests

import requests
data = {'name': 'germey', 'age': '22'}
response = requests.post("http://httpbin.org/post", data=data)
print(response.text)
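Where the form data ends up can be inspected by preparing the POST request without sending it. As with the earlier inspection example, this is a sketch that assumes requests is installed and uses Request(...).prepare() purely to look at the result.

```python
import requests

data = {"name": "germey", "age": "22"}

prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()

print(prepared.url)                      # the URL carries no query string
print(prepared.body)                     # the form fields are in the body
print(prepared.headers["Content-Type"])  # set automatically for dict data
```

The dict passed as data is form-encoded into the body, and the matching Content-Type header is added automatically.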

Proxy settings:

import requests
proxies = {
 "http": "http://127.0.0.1:9743",
 "https": "https://127.0.0.1:9743",
}
response = requests.get("https://www.taobao.com", proxies=proxies)
print(response.status_code)

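Summary: this covers the basic use of Python's two commonly used request libraries, requests and urllib. For more advanced usage, see the documentation and other material available online.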
