Python Web Scraping: The urllib Library

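1. Introduction to urllib

urllib is Python's built-in HTTP request library and is enough for simple page fetching. Note that Python 2 had two libraries for sending requests, urllib and urllib2, whereas Python 3 has only urllib. Since Python 3 is now the norm, learning urllib is sufficient. Looking at the Python source, the urllib package contains five modules: request, error, parse, robotparser, and response. After reading around, I found that robotparser and response are rarely mentioned, so I only cover the other three here.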

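2. The request module

As the name suggests, request is used to send requests, and by setting its parameters we can imitate a browser. Note that request here is a submodule of urllib and should not be confused with the separate third-party library requests. Before writing this post I wanted to read the source of the request module closely, but it turned out to be 2700+ lines, so I gave up.

The request module mainly uses urlopen() and Request() to send requests, along with a number of Handler classes. The code below demonstrates the first two; usage details are in the comments.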

urlopen() demo:

from urllib import request
from urllib import parse
from urllib import error
import socket

if __name__ == '__main__':
    '''
    def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                *, cafile=None, capath=None, cadefault=False, context=None):
    Parameters:
    url: the URL to request
    data: optional; if given, dict-style data must be converted to bytes,
          and the request method changes from GET to POST
    timeout: optional; timeout in seconds; an exception is raised on timeout
    The remaining parameters configure certificates and SSL; the defaults are fine
    '''
    # A simple request
    response_1 = request.urlopen(url="http://www.baidu.com")  # returns an HTTPResponse object
    print(response_1.read().decode("utf-8"))  # that completes a simple request
    print("Status code:", response_1.status)
    print("Response headers:", response_1.getheaders())
    print("----------------------------------divider-----------------------------------------------")

    # A more complex request
    form = {"name": "Tom"}  # renamed from dict to avoid shadowing the built-in
    data = bytes(parse.urlencode(form), encoding="utf-8")
    try:
        response_2 = request.urlopen(url="http://www.httpbin.org/post", data=data, timeout=10)
        # Read the body inside the try block so response_2 is never used undefined
        print(response_2.read().decode('utf-8'))
    except error.URLError as e:
        if isinstance(e.reason, socket.timeout):
            print("The request timed out")
        else:
            print("The request failed:", e.reason)
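To see what the data argument actually holds before anything is sent, the encoding step can be run on its own. This is only a minimal sketch: the extra "city" field and the variable names are made up for illustration, and no request is sent.

```python
from urllib import parse

# Hypothetical form fields; "city" is included only to show how a space is escaped.
form = {"name": "Tom", "city": "New York"}

# urlencode() turns the dict into a query string ...
encoded = parse.urlencode(form)
# ... and urlopen() expects that string as bytes.
data = encoded.encode("utf-8")

print(encoded)
print(type(data))
```

Passing these bytes as data is what switches urlopen() from GET to POST.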

Building a request with Request

from urllib import request, parse

if __name__ == '__main__':
    """
    Request is a class; its constructor takes the values below, and it is
    used to build a more capable request
    def __init__(self, url,
                 data=None, headers={},
                 origin_req_host=None,
                 unverifiable=False,
                 method=None):
    url: the URL to request
    data: optional; if given, dict-style data must be converted to bytes
    headers: optional; a dict. Setting User-Agent to imitate a browser
             helps get past simple anti-scraping checks
    origin_req_host: optional; the host name or IP address of the origin request
    unverifiable: optional; whether the request is unverifiable
    method: optional; the HTTP method, e.g. GET, POST, PUT
    """
    form = {"name": "Tom"}  # renamed from dict to avoid shadowing the built-in
    data = bytes(parse.urlencode(form), encoding="utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
    }  # imitate the Chrome browser
    req = request.Request(url="http://www.httpbin.org/post", data=data, headers=headers, method="POST")
    response = request.urlopen(req)
    print(response.read().decode("utf-8"))
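A Request object can also be inspected before it is passed to urlopen(), which is a handy way to check what will be sent. The sketch below sends nothing; the shortened User-Agent value is a made-up placeholder.

```python
from urllib import request, parse

data = parse.urlencode({"name": "Tom"}).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0 (example)"}  # shortened, made-up value

# No method is given here: with data present, Request falls back to POST.
req = request.Request(url="http://www.httpbin.org/post", data=data, headers=headers)

print(req.get_method())
print(req.full_url)
print(req.data)
# Header names are stored capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))

# Without data the default method is GET.
plain = request.Request(url="http://www.httpbin.org/get")
print(plain.get_method())
```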

3. The error module

The two main exception classes in the error module are URLError and HTTPError; HTTPError is a subclass of URLError.

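from urllib import request, error

if __name__ == '__main__':
    try:
        # Try to open a site that does not exist.
        # The URL was cut off in the original post; this one is a placeholder.
        response_1 = request.urlopen("http://nonexistent.invalid/")
    except error.URLError as e:
        print(e.reason)

    try:
        # The server answers with an HTTP error status
        response_2 = request.urlopen("http://www.baidu.com/aaa.html")
    except error.HTTPError as e:
        print(e.reason)
        # 404 means the page does not exist; 500 means a server error
        print(e.code)
        print(e.headers)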
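Because HTTPError derives from URLError, the order of the checks matters when one piece of code handles both. Here is a rough sketch of that ordering. The describe() helper is hypothetical, and the exceptions are built by hand so that it runs without network access.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Return a short description of a urllib exception (hypothetical helper)."""
    # HTTPError is tested first because it is a subclass of URLError.
    if isinstance(exc, error.HTTPError):
        return "HTTP error {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return "URL error: {}".format(exc.reason)
    return "other error: {}".format(exc)


# Hand-built exceptions standing in for what urlopen() would raise.
not_found = error.HTTPError(
    url="http://www.baidu.com/aaa.html", code=404, msg="Not Found",
    hdrs=Message(), fp=io.BytesIO(b""),
)
unreachable = error.URLError("name resolution failed")

print(issubclass(error.HTTPError, error.URLError))
print(describe(not_found))
print(describe(unreachable))
```

The same rule applies to except clauses: put except error.HTTPError before except error.URLError, otherwise the second branch never runs.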

4. The parse module

urlparse(): parses a URL string into its components

from urllib import parse

if __name__ == '__main__':
    # The fragment 錨點 means "anchor"
    url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#錨點"
    result = parse.urlparse(url=url)
    print(result)
    # Output:
    # ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='param1', query='ie=UTF-8&wd=python', fragment='錨點')
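The result of urlparse() is a named tuple, so its fields can be read by name or by position. The following sketch shows that, and also splits the query string with parse_qs(), a parse function the post does not otherwise cover. The fragment is written as "anchor" here to keep the example ASCII-only.

```python
from urllib import parse

url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#anchor"
result = parse.urlparse(url)

# Fields are available by name or by index.
print(result.scheme, result[0])
print(result.path, result.params)

# hostname and port are derived from netloc; port is None when the URL has none.
print(result.hostname, result.port)

# parse_qs() turns the query string into a dict of lists.
query = parse.parse_qs(result.query)
print(query)
```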

urlunparse(): the inverse of urlparse(). Pass it a sequence of six items, in the same order as the fields of the urlparse() result.
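As a small illustration of that six-item form, the sketch below rebuilds the URL from the earlier example and checks that a parse/unparse round trip leaves it unchanged. The fragment is again written as "anchor" for simplicity.

```python
from urllib import parse

# Order: scheme, netloc, path, params, query, fragment
parts = ["https", "www.baidu.com", "/s", "param1", "ie=UTF-8&wd=python", "anchor"]
rebuilt = parse.urlunparse(parts)
print(rebuilt)

# Parsing the rebuilt URL and unparsing it again gives the same string.
round_trip = parse.urlunparse(parse.urlparse(rebuilt))
print(round_trip == rebuilt)
```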

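urlsplit() and urlunsplit(): essentially the same as the two functions above, except that params is not split out and stays part of path, so the result has five fields instead of six.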

from urllib import parse

if __name__ == '__main__':
    # The fragment 錨點 means "anchor"
    url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#錨點"
    result = parse.urlsplit(url=url)
    print(result)
    # Output:
    # SplitResult(scheme='https', netloc='www.baidu.com', path='/s;param1', query='ie=UTF-8&wd=python', fragment='錨點')

The other functions in the module work along the same lines: they all parse or assemble URLs.
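Two of those other functions come up often when scraping: urljoin() for resolving relative links, and quote()/unquote() for percent-encoding. A short sketch follows; the base URL, link, and keyword are made-up examples.

```python
from urllib import parse

# urljoin() resolves a relative link against the page it was found on.
base = "https://www.baidu.com/s/index.html"
joined = parse.urljoin(base, "about.html")
print(joined)

# quote() percent-encodes characters that are not safe in a URL;
# unquote() reverses it.
keyword = "python 爬蟲"
quoted = parse.quote(keyword)
print(quoted)
print(parse.unquote(quoted) == keyword)
```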
