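The urllib3 library

Introduction

urllib3 is a powerful, user-friendly HTTP client for Python 3. More and more Python applications are adopting urllib3, and it provides many important features that the Python standard library lacks.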

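1. Constructing a request

Import the urllib3 library

import urllib3

Next, instantiate a PoolManager object to make requests. This object handles all the details of connection pooling and thread safety, so we do not have to manage them ourselves.

http = urllib3.PoolManager()

Send a request with the request() method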

r = http.request('GET', 'http://httpbin.org/robots.txt')
print(r.data)
# b'User-agent: *\nDisallow: /deny\n'

The request() method can send any kind of HTTP request. Here is a POST request:

r = http.request(
    'POST',
    'http://httpbin.org/post',
    fields={'hello': 'world'}
)

2. Response content

The HTTP response object provides attributes such as status, data, and headers.

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/ip')
print(r.status)
print(r.data)
print(r.headers)

'''
Output
200
b'{\n  "origin": "221.178.125.122"\n}\n'
HTTPHeaderDict({'Date': 'Tue, 25 Feb 2020 12:00:23 GMT', 'Content-Type': 'application/json', 'Content-Length': '34', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'})
'''
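
In practice it is usually worth checking r.status before using r.data. The following is a minimal sketch of that check, not part of urllib3: fetch_text and HTTPStatusError are hypothetical names chosen for illustration, and http can be any object with a urllib3-style request() method, such as urllib3.PoolManager().

```python
class HTTPStatusError(Exception):
    """Raised for an unexpected status code (hypothetical helper type)."""


def fetch_text(http, url, encoding="utf-8"):
    """GET `url` and return the decoded body, or raise on a non-200 status.

    `http` is assumed to expose request(method, url) and to return an
    object with .status and .data, as urllib3.PoolManager does.
    """
    r = http.request("GET", url)
    if r.status != 200:
        raise HTTPStatusError(f"{url} returned status {r.status}")
    return r.data.decode(encoding)
```

With a real pool this would be called as fetch_text(urllib3.PoolManager(), 'http://httpbin.org/ip').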

3. JSON content

JSON data in a response can be loaded into a dictionary with the json module.

import urllib3
import json

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/ip')
print(json.loads(r.data.decode('utf-8')))

'''
Output
{'origin': '221.178.125.122'}
'''
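
The example above assumes the body is UTF-8. As a hedged sketch, the charset can instead be read from the Content-Type header, falling back to UTF-8 when none is given. charset_of and decode_json are hypothetical helper names; the header parsing relies on the standard library's email.message.Message.

```python
import json
from email.message import Message


def charset_of(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_json(data, content_type=None):
    """Decode a bytes body using the header's charset, then parse it as JSON."""
    return json.loads(data.decode(charset_of(content_type)))


body = b'{\n  "origin": "221.178.125.122"\n}\n'
print(decode_json(body, "application/json"))
# {'origin': '221.178.125.122'}
```

With a urllib3 response r, the call would be decode_json(r.data, r.headers.get('Content-Type')).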

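4. Binary content

Response data is always returned as bytes. For large responses it is better to read the body in chunks with stream(), passing preload_content=False so that the body is not loaded into memory up front:

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/bytes/1024', preload_content=False)

for chunk in r.stream(32):
    print(chunk)

# Return the connection to the pool once the body has been consumed
r.release_conn()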

The response can also be treated as a file object

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/bytes/1024', preload_content=False)

for line in r:
    print(line)
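
Rather than printing each chunk, a streamed body is usually written to disk. Here is a minimal sketch under that assumption; save_chunks is a hypothetical helper that accepts any iterable of bytes, such as the generator returned by r.stream(32).

```python
def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to `path` and return the byte count."""
    total = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
            total += len(chunk)
    return total


# Intended use with a streamed urllib3 response (not executed here):
#   r = http.request('GET', url, preload_content=False)
#   save_chunks(r.stream(32), 'download.bin')
#   r.release_conn()
```

Because only one chunk is held at a time, memory use stays roughly constant regardless of the size of the download.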

5. Proxies

ProxyManager can be used to send requests through an HTTP proxy

proxy = urllib3.ProxyManager('http://180.76.111.69:3128')
res = proxy.request('GET', 'http://httpbin.org/ip')
print(res.data)

6. Request data

a. Headers

Pass a dictionary as the headers argument of request() to set the request headers

http = urllib3.PoolManager()
r = http.request('GET', 'http://httpbin.org/headers', headers={'key': 'value'})

print(json.loads(r.data.decode('utf-8'))['headers'])

b. Query parameters

For GET, HEAD, and DELETE requests, query parameters can be added by passing a dictionary as the fields argument.

http = urllib3.PoolManager()
r = http.request(
    'GET',
    'http://httpbin.org/get',
    fields={'arg': 'value'}
)
print(json.loads(r.data.decode('utf-8'))['args'])

For POST and PUT requests, query parameters must be URL-encoded into the correct format and then appended to the URL by hand

import urllib3
import json
from urllib.parse import urlencode
http = urllib3.PoolManager()
encoded_args = urlencode({'args': 'value'})
url = 'http://httpbin.org/post?' + encoded_args
r = http.request('POST', url)
print(json.loads(r.data.decode('utf-8'))['args'])
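
Plain string concatenation goes wrong when the URL already carries a query string. The sketch below picks the separator accordingly; with_query is a hypothetical helper built only on urllib.parse.urlencode, and it does not try to handle URLs with fragments.

```python
from urllib.parse import urlencode


def with_query(url, params):
    """Append URL-encoded `params` to `url`, using '&' if a query already exists."""
    if not params:
        return url
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)


print(with_query("http://httpbin.org/post", {"arg": "value", "q": "a b"}))
# http://httpbin.org/post?arg=value&q=a+b
```

The result can be passed straight to http.request('POST', ...). Note that urlencode encodes the space as "+".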

c. Form data

For PUT and POST requests, form data is passed as a dictionary through the fields argument.

r = http.request('POST', 'http://httpbin.org/post', fields={'field': 'value'})
print(json.loads(r.data.decode('utf-8'))['form'])

d. JSON

To send JSON data, pass the encoded bytes as the body argument of request() and set the Content-Type request header

http = urllib3.PoolManager()
data = {'attribute': 'value'}
encoded_data = json.dumps(data).encode('utf-8')
r = http.request(
    'POST',
    'http://httpbin.org/post',
    body=encoded_data,
    headers={'Content-Type': 'application/json'}
)
print(json.loads(r.data.decode('utf-8'))['json'])
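
Since the body and the header always travel together, they can be built in one place. This is a minimal sketch only: json_request_kwargs is a hypothetical helper name, and it returns keyword arguments meant to be unpacked into request().

```python
import json


def json_request_kwargs(payload, headers=None):
    """Return body/headers keyword arguments for sending `payload` as JSON."""
    merged = dict(headers or {})  # copy so the caller's dict is not modified
    merged["Content-Type"] = "application/json"
    return {"body": json.dumps(payload).encode("utf-8"), "headers": merged}


kwargs = json_request_kwargs({"attribute": "value"})
print(kwargs["body"])
# b'{"attribute": "value"}'
```

The call then becomes http.request('POST', 'http://httpbin.org/post', **kwargs).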

e. Files & binary data

For file uploads, we can mimic the way a browser submits a form

with open('example.txt') as f:
    file_data = f.read()
r = http.request(
    'POST',
    'http://httpbin.org/post',
    fields={
        'filefield': ('example.txt', file_data),
    }
)
print(json.loads(r.data.decode('utf-8'))['files'])

To upload binary data, pass it as the body argument and set the Content-Type request header

http = urllib3.PoolManager()
with open('example.jpg', 'rb') as f:
    binary_data = f.read()
r = http.request(
    'POST',
    'http://httpbin.org/post',
    body=binary_data,
    headers={'Content-Type': 'image/jpeg'}
)
print(json.loads(r.data.decode('utf-8')))

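7. General crawler development workflow

	a. Find the target data
	b. Analyze the request flow
	c. Construct the HTTP requests
	d. Extract and clean the data
	e. Persist the data

Example 1: Use urllib3 to download all the images on the Baidu Images front page and save them to an imgs folder inside the current directory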

import os
import re
import urllib3

'''
Download all the images on the Baidu Images front page
'''
# 1. Find the target data
page_url = 'https://image.baidu.com/search/index?tn=baiduimage&ct=201326592&lm=-1&cl=2&ie=gb18030&word=%CD%BC%C6%AC&fr=ala&ala=1&alatpl=adress&pos=0&hs=2&xthttps=111111'
# 2. Analyze the request flow
# The images are downloaded by the browser
# The image URLs arrive before the images themselves
# Download the HTML
http = urllib3.PoolManager()
res = http.request('GET', page_url)
html = res.data.decode('utf-8')     # search the page source for "charset" to confirm the encoding
# Extract the image URLs
img_urls = re.findall(r'"thumbURL":"(.*?)"', html)
# Add a Referer header so the requests look like they come from the page
header = {
    "Referer": page_url
}
# Create the imgs folder, then loop over the URLs and download each image
os.makedirs('imgs', exist_ok=True)
for index, img_url in enumerate(img_urls):
    img_res = http.request('GET', img_url, headers=header)
    # Build the file name dynamically: <index>.<extension taken from the URL>
    img_name = os.path.join('imgs', '%s.%s' % (index, img_url.split('.')[-1]))
    with open(img_name, 'wb') as f:
        f.write(img_res.data)
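
Taking the extension from the text after the last dot breaks when the URL has a query string or no extension at all. Below is a hedged sketch of the extract-and-name steps that avoids this. extract_thumb_urls and image_path are hypothetical helpers, the sample HTML is invented for illustration, and the .jpg fallback is an assumption rather than something the page guarantees.

```python
import os
import re
from urllib.parse import urlsplit

THUMB_URL_RE = re.compile(r'"thumbURL":"(.*?)"')


def extract_thumb_urls(html):
    """Return thumbURL values in page order, skipping empties and duplicates."""
    seen = set()
    urls = []
    for url in THUMB_URL_RE.findall(html):
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def image_path(index, url, out_dir="imgs", default_ext=".jpg"):
    """Build out_dir/<index><ext>, reading ext from the URL path only."""
    ext = os.path.splitext(urlsplit(url).path)[1] or default_ext
    return os.path.join(out_dir, f"{index}{ext}")


# Invented sample data: a duplicate entry and a URL without an extension
sample_html = (
    '{"thumbURL":"https://img.example.com/a.png?v=1.2"},'
    '{"thumbURL":"https://img.example.com/a.png?v=1.2"},'
    '{"thumbURL":"https://img.example.com/b"}'
)
for i, u in enumerate(extract_thumb_urls(sample_html)):
    print(image_path(i, u))
```

In the download loop, image_path(index, img_url) would replace the hand-built file name.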
