Implementing HTTP Requests in Python 3

1 Using urllib

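  urllib, urllib2, and urllib3 are easy to confuse: in Python 3 the old urllib2 was merged into urllib, while urllib3 is a separate third-party package. In Python 3, urllib is organized as a package with the following modules: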

Module              Purpose
urllib.request      Open and read URLs
urllib.error        Exceptions raised by urllib.request
urllib.parse        Parse URLs
urllib.robotparser  Parse robots.txt files
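
  As a quick illustration of the urllib.parse module listed above, the sketch below splits a URL into its components and builds a query string. The URL and parameter names are made up for the example.

```python
from urllib import parse

# Hypothetical URL, used only for illustration.
url = 'https://www.example.com/search?q=python&page=2'

parts = parse.urlparse(url)
print(parts.scheme)   # https
print(parts.netloc)   # www.example.com
print(parts.path)     # /search

# parse_qs returns every query value as a list of strings.
query = parse.parse_qs(parts.query)
print(query)          # {'q': ['python'], 'page': ['2']}

# urlencode goes the other way: dict -> percent-encoded query string.
encoded = parse.urlencode({'q': 'http request', 'page': 1})
print(encoded)        # q=http+request&page=1
```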

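1.1 A complete request/response cycle

  urllib.request provides a basic function, urlopen, which fetches data by sending a request to the given URL. The simplest form looks like this: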

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

# Response
res = request.urlopen('http://www.zhihu.com')  # a timeout can be set, e.g. timeout=2
html = res.read()
print(html)

  Output:

b'<!doctype html>\n<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react...'

  The code above can be split into two steps, building the request and then opening it:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

# Request
req = request.Request('http://www.zhihu.com')
# Response
res = request.urlopen(req)
html = res.read()
print(html)

  Both approaches above send GET requests. A POST request looks like this:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
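from urllib import request, parse

url = 'https://www.xxx.com/login'
# urlopen needs the body as bytes: urlencode the form, then encode it
postdata = parse.urlencode({'username': 'miao', 'password': '123456'}).encode('utf-8')
# Request (supplying data makes this a POST)
req = request.Request(url, postdata)
# Response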
res = request.urlopen(req)
html = res.read()
print(html)

  The URL here is a placeholder, so try this against a login endpoint of your own.
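
  urllib.request expects the POST body as bytes, so a form dictionary is normally passed through urllib.parse.urlencode and then encoded. The sketch below builds such a request without sending it and checks which HTTP method urllib would use. The URL and credentials are placeholders.

```python
from urllib import parse, request

# Placeholder endpoint and form fields, for illustration only.
url = 'https://www.example.com/login'
form = {'username': 'miao', 'password': '123456'}

# urlencode produces 'username=miao&password=123456'; encode() turns it into bytes.
body = parse.urlencode(form).encode('utf-8')

get_req = request.Request(url)              # no data -> GET
post_req = request.Request(url, data=body)  # data present -> POST

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'username=miao&password=123456'
```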

1.2 Handling request headers

  The following example shows how to add request headers, here User-Agent and Referer:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, parse

url = 'https://www.xxx.com/login'
# urlopen needs the body as bytes: urlencode the form, then encode it
postdata = parse.urlencode({'username': 'xxx', 'password': '******'}).encode('utf-8')
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
headers = {'User-Agent': user_agent, 'Referer': referer}
# Request
req = request.Request(url, postdata, headers)
# Response
res = request.urlopen(req)
html = res.read()
print(html)

  Request headers can also be added with add_header:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
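from urllib import request, parse

url = 'https://www.xxxxxx.com/login'
# urlopen needs the body as bytes: urlencode the form, then encode it
postdata = parse.urlencode({'username': 'xxx', 'password': '******'}).encode('utf-8')
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
referer = 'https://www.github.com'
req = request.Request(url, postdata)

# Add the headers after the Request has been created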
req.add_header('User-Agent', user_agent)
req.add_header('Referer', referer)

res = request.urlopen(req)
html = res.read()
print(html)

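  Note:
  Some headers deserve particular attention because servers check them, for example:

  • User-Agent: some servers or proxies use this value to decide whether the request comes from a browser.
  • Content-Type: with a REST interface, the server checks this value to decide how to parse the HTTP body. When calling a RESTful or SOAP service, a wrong value can cause the server to reject the request. Common values are:
      application/xml (used for XML RPC calls, such as RESTful/SOAP)
      application/json (used for JSON RPC calls)
      application/x-www-form-urlencoded (used when a browser submits a web form)
  • Referer: servers sometimes check this value for hotlink protection.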
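
  To make the Content-Type point concrete, here is a minimal sketch of a JSON POST request built with urllib. The endpoint, payload, and User-Agent string are hypothetical, and the request is only constructed, not sent.

```python
import json
from urllib import request

url = 'https://www.example.com/api/items'   # hypothetical endpoint
payload = {'name': 'demo', 'count': 3}      # hypothetical payload

req = request.Request(
    url,
    data=json.dumps(payload).encode('utf-8'),  # JSON text as bytes
    headers={
        'Content-Type': 'application/json',    # tells the server how to parse the body
        'User-Agent': 'Mozilla/5.0 (compatible; example-client)',
    },
    method='POST',
)

# urllib stores header names capitalized, e.g. 'Content-type'.
print(req.get_header('Content-type'))  # application/json
print(req.get_method())                # POST
print(json.loads(req.data))            # {'name': 'demo', 'count': 3}
```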

1.3 Handling cookies

  To read the value of a cookie set by the server, use a CookieJar:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request
from http import cookiejar

cookie = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(cookie))
# Response
res = opener.open('http://www.zhihu.com')
for item in cookie:
    print(item.name + ": " + item.value)

  Output:

_xsrf: 467z...
_zap: 4f91...
KLBRSID: ed2a...

  Cookies can also be added by hand when needed:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

cookie = ('Cookie', 'email=' + '[email protected]')
opener = request.build_opener()
opener.addheaders = [cookie]
# Request
req = request.Request('http://www.zhihu.com')
# Response
res = opener.open(req)
print(res.headers)
retdata = res.read()

  Output:

Date: Tue, 09 Jun 2020 06:45:54 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 49014
Connection: close
Server: CLOUD ELB 1.0.0...
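
  When several cookies have to be sent by hand, they go into a single Cookie header as name=value pairs separated by "; ". The helper below builds that header and attaches it to a request without sending it. The name cookie_header and the cookie values are made up for this sketch.

```python
from urllib import request

def cookie_header(cookies):
    """Join a dict of cookies into one Cookie header value."""
    return '; '.join(f'{name}={value}' for name, value in cookies.items())

cookies = {'email': 'user@example.com', 'theme': 'light'}  # made-up values
req = request.Request('http://www.zhihu.com')
req.add_header('Cookie', cookie_header(cookies))

print(req.get_header('Cookie'))  # email=user@example.com; theme=light
```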

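1.4 Getting the HTTP status code

  For a 200 OK response, calling getcode() on the object returned by urlopen is enough to get the HTTP status code. For error responses such as 4XX and 5XX, urlopen raises an HTTPError instead, and the status code is available as the exception's code attribute: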

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, error

try:
    # Response
    res = request.urlopen('http://www.zhihu.com')
    print(res.getcode())
except error.HTTPError as e:
    # HTTPError carries the status code of the failed request
    print("Error code: ", e.code)

  Output:

200
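
  The error path is hard to see against a live site, so the sketch below runs a throwaway local server that answers 200 for "/" and 404 for anything else. Handler and fetch_status are names invented for this example; the empty ProxyHandler is only there so that proxy settings in the environment do not interfere with the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request

class Handler(BaseHTTPRequestHandler):
    """Tiny local test server: '/' returns 200, everything else 404."""
    def do_GET(self):
        if self.path == '/':
            body = b'ok'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the example output quiet

def fetch_status(url):
    """Return the HTTP status code for url, whether or not urllib raises."""
    opener = request.build_opener(request.ProxyHandler({}))  # ignore env proxies
    try:
        with opener.open(url, timeout=5) as res:
            return res.getcode()
    except error.HTTPError as e:
        return e.code

server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

ok_status = fetch_status(base + '/')
missing_status = fetch_status(base + '/missing')
print(ok_status, missing_status)  # 200 404

server.shutdown()
server.server_close()
```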

1.5 Redirects

  The following code checks whether a redirect took place by printing the final URL:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request, error

try:
    # Response
    res = request.urlopen('http://www.zhihu.com')
    print(res.geturl())
except error.HTTPError as e:
    # HTTPError carries the status code of the failed request
    print("Error code: ", e.code)

  Output:

https://www.zhihu.com/signin?next=%2F

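  To control how redirects are handled, subclass HTTPRedirectHandler. In the example below, 301 responses are not followed, while 302 responses are followed and the redirect status code is recorded on the result: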

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

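class RedirectHandler(request.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        # Returning None means this handler does not follow the 301;
        # the opener then falls back to its default error handler, which raises HTTPError.
        pass

    def http_error_302(self, req, fp, code, msg, headers):
        # Let the base class follow the 302, then record the original
        # redirect status code and the final URL on the response.
        result = request.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
        result.status = code
        result.newurl = result.geturl()
        return result

# build_opener swaps the default HTTPRedirectHandler for this subclass
opener = request.build_opener(RedirectHandler)
res = opener.open('http://www.zhihu.cn')
print(res)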

  Output:

<http.client.HTTPResponse object at 0x000001BEAC776160>
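
  Another way to refuse redirects altogether is to override redirect_request so that it returns None; urllib then reports the 3XX response as an HTTPError. The sketch below tries this against a throwaway local server. Handler and NoRedirect are names invented for the example, and the empty ProxyHandler only keeps environment proxy settings out of the local test.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request

class Handler(BaseHTTPRequestHandler):
    """Local test server: '/old' answers 302 -> '/new', other paths answer 200."""
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)
            self.send_header('Location', '/new')
            self.send_header('Content-Length', '0')
            self.end_headers()
        else:
            body = b'new page'
            self.send_response(200)
            self.send_header('Content-Length', str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet

class NoRedirect(request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # refuse every redirect

server = HTTPServer(('127.0.0.1', 0), Handler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f'http://127.0.0.1:{server.server_port}'

no_proxy = request.ProxyHandler({})
following = request.build_opener(no_proxy)             # default: follows redirects
blocking = request.build_opener(no_proxy, NoRedirect)  # redirects refused

with following.open(base + '/old', timeout=5) as res:
    followed_url = res.geturl()

blocked_code, location = None, None
try:
    blocking.open(base + '/old', timeout=5)
except error.HTTPError as e:
    blocked_code = e.code                  # the 3XX status itself
    location = e.headers.get('Location')   # where the server wanted to send us

print(followed_url.endswith('/new'), blocked_code, location)

server.shutdown()
server.server_close()
```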

1.6 Configuring a proxy

  For example:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
from urllib import request

proxy = request.ProxyHandler({'http': '127.0.0.1:8087'})
opener = request.build_opener(proxy)
res = opener.open('http://www.zhihu.com/')
print(res.read())


2 Using requests

2.1 A complete request/response cycle

  1) GET request:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

res = requests.get('http://www.zhihu.com')
print(res.content)

  2) POST request:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

postdata = {'key' : 'value'}
res = requests.post('http://www.zhihu.com', data=postdata)
print(res.content)

  Other HTTP methods follow the same pattern:

  • requests.put('http://www.xxxxxx.com/put', data={'key': 'value'})
  • requests.delete('http://www.xxxxxx.com/delete')
  • requests.head('http://www.xxxxxx.com/get')
  • requests.options('http://www.xxxxxx.com/get')

  3) Complex URLs: instead of writing out the full URL, requests can build the query string from a dict passed as params:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

payload = {'Keywords': 'bolg:qiyeboy', 'pageindex': 1}
# A timeout can also be passed here, e.g. timeout=2
res = requests.get('http://www.zhihu.com', params=payload)
print(res.url)

  Output:

https://www.zhihu.com/?Keywords=bolg%3Aqiyeboy&pageindex=1
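
  The query-string encoding can also be inspected without any network traffic by preparing the request locally. This is only a sketch of that idea. Because nothing is sent, the URL keeps the http scheme; res.url after a real request reflects the final URL once any redirects have been followed.

```python
import requests

payload = {'Keywords': 'bolg:qiyeboy', 'pageindex': 1}

# Build and prepare the request locally; nothing is sent over the network.
prepared = requests.Request('GET', 'http://www.zhihu.com', params=payload).prepare()
print(prepared.url)  # http://www.zhihu.com/?Keywords=bolg%3Aqiyeboy&pageindex=1
```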

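2.2 Responses and encoding

  Taking res = requests.get('http://www.zhihu.com') as an example, the returned object provides:

  • res.content: the body as bytes
  • res.text: the body as text
  • res.encoding: the page encoding guessed from the HTTP headers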
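
  The guess behind res.encoding comes from the Content-Type header. As a small sketch, the helper requests.utils.get_encoding_from_headers can be called directly on hand-written header dicts (the header values below are made up) to see what requests would pick:

```python
from requests.utils import get_encoding_from_headers

# The charset parameter is used when the server supplies one.
with_charset = get_encoding_from_headers({'content-type': 'text/html; charset=utf-8'})
# Without a charset, text/* content falls back to ISO-8859-1.
without_charset = get_encoding_from_headers({'content-type': 'text/html'})

print(with_charset)     # utf-8
print(without_charset)  # ISO-8859-1
```

  The ISO-8859-1 fallback is often wrong for pages that omit the charset, which is one reason to detect the encoding from the content itself.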

  Here the third-party library chardet is used to detect the encoding of a string or file:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests
import chardet

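res = requests.get('http://www.zhihu.com')
# detect returns a dict containing:
#   - 'encoding': the detected encoding
#   - 'confidence': how confident the detection is (0 to 1)
#   - 'language': the detected natural language, if any
ret_dic = chardet.detect(res.content)
# Decode using the detected encoding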
res.encoding = ret_dic['encoding']
print(ret_dic)
print(res.text)

  Output:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
<html>

<head><title>400 Bad Request</title></head>

<body bgcolor="white">

<center><h1>400 Bad Request</h1></center>

<hr><center>openresty</center>

</body>

</html>

2.3 Handling request headers

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
res = requests.get('http://www.zhihu.com', headers=headers)
print(res.content)

2.4 Status codes and response headers

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

res = requests.get('http://www.baidu.com')

# res.status_code holds the status code;
# comparing it with requests.codes.ok checks for 200
if res.status_code == requests.codes.ok:
    print("Status code:", res.status_code)
    print("Response headers:", res.headers)
    print("Header field:", res.headers.get('content-type'))
else:
    # raise_for_status() raises an exception for 4XX or 5XX status codes
    # and returns None for 200
    res.raise_for_status()

  Output:

Status code: 200
Response headers: {'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'keep-alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Tue, 09 Jun 2020 13:42:42 GMT', 'Last-Modified': 'Mon, 23 Jan 2017 13:27:52 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}
Header field: text/html
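
  The behaviour of raise_for_status() can be checked without a network call by filling in a Response object by hand. This is a sketch for illustration only; real Response objects come from requests.get and friends, and check is a name invented for the example.

```python
import requests

def check(status_code):
    """Build a bare Response with the given status and call raise_for_status()."""
    res = requests.Response()      # constructed by hand, no network involved
    res.status_code = status_code
    try:
        res.raise_for_status()     # returns None when there is no error
        return 'ok'
    except requests.HTTPError as e:
        return f'error {e.response.status_code}'

results = [check(code) for code in (200, 404, 500)]
print(results)  # ['ok', 'error 404', 'error 500']
```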

2.5 Handling cookies

  1) Reading cookies set by the server:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers={'User-Agent':user_agent}
res = requests.get('http://www.baidu.com', headers=headers)

for cookie in res.cookies.keys():
    print(cookie + ": " + res.cookies.get(cookie))

  Output:

BAIDUID: D285BF54C9CC968744699A9B4F843D60:FG=1
BIDUPSID: D285BF54C9CC9687F9E45D28DB4C9F33
H_PS_PSSID: 1456_31326_21100_31069_31765_31673_30823
PSTM: 1591710519
BDSVRTM: 0
BD_HOME: 1

  2) Sending custom cookies:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
# Custom cookies
cookies = dict(name='guangtouqiang', age='18')
res = requests.get('http://www.baidu.com', headers=headers, cookies=cookies)

print(res.text)

  3) Automatic cookie handling with a Session:

# coding: utf-8
import warnings
warnings.filterwarnings('ignore')
import requests

login_url = 'http://www.zhihu.com/login'
s = requests.Session()
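datas = {'name': 'guangtouqiang', 'passwd': '123456'}
# As a guest, first let the server assign a cookie; without this step the
# server would treat the client as an illegitimate user.
# allow_redirects=True permits redirects; if one occurs, res.history shows the redirect history.
s.get(login_url, allow_redirects=True)
# After successful verification, the session is upgraded to member privileges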
res = s.post(login_url, data=datas, allow_redirects=True)
print(res.text)

  Output:

<html>

<head><title>400 Bad Request</title></head>

<body bgcolor="white">

<center><h1>400 Bad Request</h1></center>

<hr><center>openresty</center>

</body>

</html>

2.6 Redirects and history
