Basic usage of the Python urllib module

urllib: the URL handling module

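urllib is a package that collects several modules for working with URLs.
It includes:
urllib.request, for opening and reading URLs
urllib.error, containing the exceptions raised by urllib.request
urllib.parse, for parsing URLs
urllib.robotparser, for parsing robots.txt files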

urllib.request defines functions and classes that help with opening URLs, mostly over HTTP

Main functions of urllib.request

urllib.request.urlopen

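urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: either a string or a Request object
- data: an object specifying additional data to send to the server, or None. HTTP requests are currently the only ones that use data. Supported types are bytes, file-like objects, and iterables of bytes-like objects
- timeout: the timeout in seconds for blocking operations such as the connection attempt. If it is not given, the global default timeout is used. It only applies to HTTP, HTTPS, and FTP connections
- cafile, capath: specify a set of trusted CA certificates for HTTPS requests. Deprecated since version 3.6 and removed in Python 3.13. Use ssl.SSLContext.load_cert_chain() instead, or let ssl.create_default_context() select the system's trusted CA certificates
- cadefault: ignored
- context: if given, it must be an ssl.SSLContext instance describing the various SSL options
The function returns an object that can be used as a context manager and has methods such as:
- geturl(): returns the URL of the resource retrieved
- info(): returns the meta-information of the page, such as headers, in the form of an email.message_from_string() instance
- getcode(): returns the HTTP status code of the response
For HTTP and HTTPS URLs, the function returns a slightly modified http.client.HTTPResponse object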

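from urllib.request import urlopen

# The response object is a context manager, so the connection is closed on exit
with urlopen("http://python.org/") as response:
    print(response.geturl())   # URL of the resource actually retrieved
    print(response.info())     # response headers
    print(response.getcode())  # HTTP status code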
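
Because cafile and capath are deprecated, the context parameter is the usual way to control certificate checks. The following is a minimal sketch, assuming the system trust store is what you want. The fetch helper and its 10-second default timeout are illustrative choices, not part of urllib, and the helper is only defined here, not called.

```python
import ssl
from urllib.request import urlopen

# Default context: loads the system's trusted CA certificates and
# enables both certificate and host name verification.
context = ssl.create_default_context()


def fetch(url, timeout=10):
    """Hypothetical helper: return the body of url as bytes."""
    with urlopen(url, timeout=timeout, context=context) as response:
        return response.read()


if __name__ == "__main__":
    print(context.verify_mode == ssl.CERT_REQUIRED)  # True
    print(context.check_hostname)                    # True
```

Passing timeout explicitly avoids depending on the global default, which normally means no timeout at all.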

urllib.request also provides the following main classes

Request: this class is an abstraction of a URL request

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
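- url: a string containing a valid URL
- data: an object specifying additional data to send to the server, or None
- headers: the request headers to send, given as a dictionary
- origin_req_host: the host name or IP address of the original request initiated by the user
- unverifiable: whether the request is unverifiable as defined by RFC 2965, that is, whether the user had no option to approve the URL being fetched. Defaults to False
- method: the HTTP request method as a string, such as GET, POST, or PUT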

from urllib import request

url = "http://python.org/"
# Send browser-like headers instead of the default Python-urllib User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Connection': 'keep-alive',
}
req = request.Request(url, headers=headers)
# Read the raw bytes, then decode them (assumes the page is UTF-8)
with request.urlopen(req) as response:
    page = response.read().decode('utf-8')
print(page)
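
The data and method parameters can be checked without sending anything, because a Request object is only a description of the request. A small sketch under that assumption follows; the example.com URLs and the demo-client/0.1 agent string are placeholders made up for illustration.

```python
from urllib import parse, request

# Form fields must be URL-encoded and converted to bytes before sending.
form = parse.urlencode({"q": "urllib", "page": 1}).encode("ascii")

post_req = request.Request(
    "http://example.com/search",                 # placeholder URL
    data=form,
    headers={"User-Agent": "demo-client/0.1"},   # made-up client name
)
put_req = request.Request("http://example.com/item/1", data=form, method="PUT")

print(post_req.get_method())   # POST, because data is not None
print(put_req.get_method())    # PUT, because method overrides the default
print(post_req.data)           # b'q=urllib&page=1'
# Header names are stored capitalized, so look them up as "User-agent"
print(post_req.get_header("User-agent"))   # demo-client/0.1
```

Nothing goes over the network until one of these objects is passed to urlopen.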

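urllib.error defines the exception classes raised by urllib.request

Exception classes

urllib.error.URLError

The base exception raised when urllib.request runs into a problem
reason attribute: the reason for the error, either a message string or another exception instance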

from urllib import request
from urllib import error

# A host name that does not resolve, so the DNS lookup fails
url = "http://zheshigeshenmewangzhan.org/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Connection': 'keep-alive',
}
try:
    req = request.Request(url, headers=headers)
    request.urlopen(req)
except error.URLError as e:
    # reason holds the underlying cause, here the failed name lookup
    print(e.reason)

Output: [Errno 8] nodename nor servname provided, or not known

urllib.error.HTTPError

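A subclass of URLError. When both are caught in the same try statement, the HTTPError clause must come first; otherwise the URLError clause catches HTTP errors as well
It is better suited to handling HTTP-specific errors
code attribute: the HTTP status code
reason attribute: the error message
headers attribute: the HTTP response headers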

from urllib import request
from urllib import error

if __name__ == "__main__":
    url = "http://www.sina.com/asd"  # a path that does not exist on the server
    try:
        response = request.urlopen(url)
    except error.HTTPError as e:
        print(e.reason)   # error message
        print(e.code)     # HTTP status code
        print(e.headers)  # response headers

Output:
Not Found
404
Server: nginx
Date: Wed, 08 Aug 2018 14:44:57 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Age: 0
Via: http/1.1 ctc.ningbo.ha2ts4.74 (ApacheTrafficServer/6.2.1 [cMsSf ]), http/1.1 ctc.xiamen.ha2ts4.34 (ApacheTrafficServer/6.2.1 [cMsSf ])
X-Via-Edge: 15337394975367cf664713cd64cde3e31ef1b
X-Cache: MISS.34
X-Via-CDN: f=edge,s=ctc.xiamen.ha2ts4.35.nb.sinaedge.com,c=113.100.246.124;f=Edge,s=ctc.xiamen.ha2ts4.34,c=222.76.214.35;f=edge,s=ctc.ningbo.ha2ts4.71.nb.sinaedge.com,c=222.76.214.34;f=Edge,s=ctc.ningbo.ha2ts4.74,c=115.238.190.71
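
The ordering rule for the two except clauses can be tried without network access. This is a sketch only: fetch_status and fake_404 are hypothetical helpers written for this illustration, and fake_404 raises a hand-built HTTPError in place of a real server response. The foo:// scheme is used because urllib has no handler for it and reports that as a URLError.

```python
from email.message import Message
from urllib import error, request


def fetch_status(url, opener=request.urlopen):
    """Hypothetical helper: describe what happened when opening url."""
    try:
        with opener(url) as response:
            return "ok {}".format(response.getcode())
    except error.HTTPError as e:   # subclass first
        return "http error {}: {}".format(e.code, e.reason)
    except error.URLError as e:    # base class last
        return "url error: {}".format(e.reason)


def fake_404(url):
    """Stand-in opener that always fails the way a 404 response would."""
    raise error.HTTPError(url, 404, "Not Found", Message(), None)


if __name__ == "__main__":
    print(fetch_status("http://example.invalid/missing", opener=fake_404))
    print(fetch_status("foo://example.invalid/"))
```

If the two except clauses were swapped, the first call would be reported as a plain url error and the status code would be lost.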

urllib.parse breaks URLs into components

It covers two main areas: URL parsing and URL quoting

Main functions

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

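Parses a URL into six components and returns a 6-item named tuple. This corresponds to the general structure of a URL: scheme://netloc/path;parameters?query#fragment
urlstring: the URL to parse
scheme: the default scheme (http and so on), used only when the URL does not specify one
allow_fragments: if false, fragment identifiers are not recognized. They are parsed as part of the path, parameters, or query component instead, and fragment is empty in the return value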

>>> from urllib.parse import urlparse
>>> o = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')
>>> o   
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
            params='', query='', fragment='')
>>> o.scheme
'http'
>>> o.port
80
>>> o.geturl()
'http://www.cwi.nl:80/%7Eguido/Python.html'
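
The scheme and allow_fragments parameters are easier to follow with concrete values. A short sketch follows; www.example.com is a placeholder host chosen for illustration.

```python
from urllib.parse import urlparse

# Without a leading "//" the host is not recognized as netloc and ends up
# in path. The scheme argument applies because the URL carries no scheme.
a = urlparse("www.example.com/index.html", scheme="https")
print(a.scheme, repr(a.netloc), a.path)   # https '' www.example.com/index.html

# Default behaviour: the fragment is split off from the query.
b = urlparse("http://www.example.com/doc?x=1#top")
print(repr(b.query), repr(b.fragment))    # 'x=1' 'top'

# With allow_fragments=False the "#top" part stays in the query.
c = urlparse("http://www.example.com/doc?x=1#top", allow_fragments=False)
print(repr(c.query), repr(c.fragment))    # 'x=1#top' ''
```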

urllib.parse.quote(string, safe='/', encoding=None, errors=None)

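By the standard, a URL may contain only a subset of ASCII characters (letters, digits, and some symbols). Other characters, such as Chinese characters, are not valid in a URL and must be encoded
quote encodes every character except letters, digits, the characters _.-~ and whatever is listed in safe (by default /)
quote_plus (another function) goes a step further: it replaces spaces with + and, because its safe defaults to the empty string, it also encodes /
safe: additional characters that should not be encoded; defaults to /
encoding: the encoding to use; defaults to utf-8
errors: defaults to 'strict', meaning unsupported characters raise a UnicodeEncodeError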

from urllib.parse import quote

if __name__ == "__main__":
    # safe="*" replaces the default safe="/", so * is kept while ? and = are encoded
    print(quote("www.baidu.com?a=*", safe="*"))

Output:
www.baidu.com%3Fa%3D*
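
The difference between quote and quote_plus shows up on a string that contains both a space and a slash. A minimal comparison follows; the sample strings are arbitrary.

```python
from urllib.parse import quote, quote_plus

text = "a b/c"
print(quote(text))            # a%20b/c    (space encoded, / kept)
print(quote_plus(text))       # a+b%2Fc    (space becomes +, / encoded)
print(quote(text, safe=""))   # a%20b%2Fc  (nothing extra is safe)

# Non-ASCII text is encoded to UTF-8 bytes first, then percent-escaped.
print(quote("中文"))           # %E4%B8%AD%E6%96%87
```

quote is the usual choice for path segments, while quote_plus matches the encoding of HTML form values in a query string.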

urllib.parse.unquote(string, encoding='utf-8', errors='replace')

Where there is encoding, there is decoding
errors: defaults to 'replace', meaning invalid sequences are replaced by a placeholder character

from urllib.parse import unquote,quote

if __name__ == "__main__":
    print(quote("www.baidu.com?a=*"))
    print(unquote("www.baidu.com%3Fa%3D%2A"))

Output:
www.baidu.com%3Fa%3D%2A
www.baidu.com?a=*
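
Two details of decoding are worth trying out: unquote leaves + alone while unquote_plus turns it into a space, and the default errors='replace' substitutes U+FFFD for bytes that are not valid in the chosen encoding. A small sketch with arbitrary sample strings:

```python
from urllib.parse import quote, unquote, unquote_plus

original = "中文 search/路径"
encoded = quote(original)
round_trip = unquote(encoded)
print(round_trip == original)      # True

print(unquote("a+b%2Fc"))          # a+b/c  (+ is left untouched)
print(unquote_plus("a+b%2Fc"))     # a b/c  (+ decoded as a space)

# %E4%B8 is a truncated UTF-8 sequence, so it cannot be decoded as text.
broken = unquote("%E4%B8")
print("\ufffd" in broken)          # True: replaced by the placeholder
```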

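urllib.robotparser parses robots.txt files

This module provides a single class, RobotFileParser, which answers questions about whether a particular user agent can fetch a URL on the website that published the robots.txt file

class urllib.robotparser.RobotFileParser(url='')

set_url(url): sets the URL referring to a robots.txt file
read(): reads the robots.txt URL and feeds it to the parser
can_fetch(useragent, url): returns True if useragent is allowed to fetch url according to the rules in the parsed robots.txt file, and False otherwise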

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.douban.com/robots.txt')
rp.read()
url = 'https://www.douban.com'
user_agent = 'Wandoujia Spider'
can = rp.can_fetch(user_agent, url)
print(rp)
print(can)

Output:
User-agent: Wandoujia Spider
Disallow: /
False
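
The same check can be run without network access by handing the rules to parse(), which takes the lines of a robots.txt file. In this sketch the rules, the bot names, and the example.com URLs are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: one bot is banned entirely, everyone else only
# from the /private/ directory.
rules = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())   # feed the lines directly instead of read()

print(rp.can_fetch("BadBot", "https://example.com/index.html"))       # False
print(rp.can_fetch("GoodBot", "https://example.com/index.html"))      # True
print(rp.can_fetch("GoodBot", "https://example.com/private/a.html"))  # False
```

An agent without an entry of its own falls back to the * entry, which is why GoodBot is blocked only under /private/.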
