In this lesson we'll look at urllib, the URL-handling package in Python's standard library.

What it does: given a URL, it fetches the page's data (hand it a page address and it can crawl it).

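The raw page data it returns still needs further processing:

1. Parse it into a tree (this needs BeautifulSoup; see my other blog post for how)

2. Roughly extract the information to scrape (use the tree's nodes to narrow down to the data you need; this step can be skipped for simple pages)

3. Pinpoint the exact values (match them precisely with regular expressions; see my other blog post for how)

4. Save the useful information (to a file, Excel, a database, and so on)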
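The four steps can be sketched end to end on a tiny page. This is only an illustration under stated assumptions: the HTML snippet, the class name ItemText and the column names are made up, the standard library's html.parser is swapped in for BeautifulSoup so the sketch has no third-party dependency, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical page content; in practice this would be the decoded urlopen() result.
SAMPLE_HTML = """
<html><body>
  <div class="item"><a href="/movie/1">Movie A</a><span>Rating: 9.7</span></div>
  <div class="item"><a href="/movie/2">Movie B</a><span>Rating: 9.6</span></div>
</body></html>
"""


class ItemText(HTMLParser):
    """Steps 1-2: walk the tags and collect the text inside each <div class="item">."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "item") in attrs:
            self._inside = True
            self.items.append([])

    def handle_endtag(self, tag):
        if tag == "div":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.items[-1].append(data.strip())


parser = ItemText()
parser.feed(SAMPLE_HTML)

# Step 3: a regular expression pins down the exact fields.
pattern = re.compile(r"^(.+?) Rating: ([\d.]+)$")
rows = []
for parts in parser.items:
    match = pattern.match(" ".join(parts))
    if match:
        rows.append(match.groups())

# Step 4: save the result; an in-memory buffer stands in for a real file here.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "rating"])
writer.writerows(rows)
print(buffer.getvalue())
```

To write a real file instead, open it with open("movies.csv", "w", newline="", encoding="utf-8") (the file name is again just an example) and pass that to csv.writer.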

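#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Testing urllib

import urllib.request

# Send a GET request (no data needs to be supplied)
response = urllib.request.urlopen("http://www.baidu.com")   # response object; it needs further handling
print(response)
raw = response.read()  # read the body once: raw bytes, not yet rendered by a browser
print(raw)
# A second response.read() would return b'' because the stream has already been consumed
print(raw.decode('utf-8'))  # decoded text (newlines and Chinese display correctly): readable, well-structured HTML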


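# Send a POST request (submit a form to the server)
import urllib.parse

# Pass parameters (user name, password, etc.) as bytes
# urlencode turns the key-value pairs into a query string; encode("utf-8") turns that string into bytes
data = urllib.parse.urlencode({"hello": "world"}).encode("utf-8")  # encode() already returns bytes
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode("utf-8"))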
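To see what actually gets sent as the form body, the encoding step can be run on its own, with no network access. The extra key "q" and its value are made up for illustration.

```python
import urllib.parse

form = {"hello": "world", "q": "a b"}  # "q" is an illustrative extra field

query = urllib.parse.urlencode(form)   # key=value pairs joined with "&"
body = query.encode("utf-8")           # urlopen(data=...) expects bytes

print(query)                           # hello=world&q=a+b
print(body)                            # b'hello=world&q=a+b'
print(urllib.parse.parse_qs(query))    # decodes back into a dict of lists
```

Spaces become "+" and other special characters are percent-escaped, which is why the dictionary should go through urlencode rather than being joined by hand.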


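# Timeouts
# timeout=<seconds>: if no response arrives within the given time, the request is abandoned
import socket
import urllib.error

try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    print(response.read().decode("utf-8"))
except (urllib.error.URLError, socket.timeout) as e:  # timeout detection; a timeout while waiting for the reply is raised as socket.timeout
    print('time out!', e)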
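The timeout behaviour can be tried without depending on an external site. The sketch below is only an illustration: SlowHandler and fetch are hypothetical names, and a throwaway local server that waits half a second stands in for a slow website. It catches both URLError and socket.timeout, since a timeout may surface as either depending on where it happens.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET after a short delay."""

    def do_GET(self):
        time.sleep(0.5)  # simulate a slow server
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"finally")
        except OSError:
            pass  # the client may already have given up

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(url, timeout):
    """Return the decoded body, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout) as exc:
        print("request failed:", exc)
        return None


server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

too_impatient = fetch(url, timeout=0.1)  # gives up before the server answers
patient = fetch(url, timeout=5)          # waits long enough to get the body
server.shutdown()
```

Returning None on failure is just one design choice; a crawler might instead retry a few times or log the URL for later.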


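# Inspecting the response (it carries much more than the page body)
response = urllib.request.urlopen("http://www.baidu.com")  # everything the server returned
# Status code: 200 (OK). For 404 (not found) or 418 (used by some sites when they detect a crawler),
# urlopen raises urllib.error.HTTPError instead of returning a response
print(response.status)
print(response.getheaders())    # all response headers
print(response.getheader('Server'))    # the value of one specific header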
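One detail about status codes is worth a sketch: urlopen only returns a response for successful requests, while 4xx and 5xx codes arrive as urllib.error.HTTPError, whose code attribute holds the status. The example below assumes a hypothetical local server (TeapotHandler) and helper (status_of) so it can run without touching a real site.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class TeapotHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /ok answers 200, every other path answers 418."""

    def do_GET(self):
        code = 200 if self.path == "/ok" else 418
        self.send_response(code)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def status_of(url):
    """Return the HTTP status code, whether or not urlopen raises."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses arrive here as exceptions


server = HTTPServer(("127.0.0.1", 0), TeapotHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

ok_status = status_of(base + "/ok")
blocked_status = status_of(base + "/anything-else")
server.shutdown()
print(ok_status, blocked_status)
```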


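# Pretending to be a browser
# Set up the disguise first, then submit the request to the server
# A bare urlopen(url) call, as used above, cannot carry the disguised headers,
# so wrap everything in a Request object
url = "http://httpbin.org/post"
# data: the form data to submit (same as above)
data = urllib.parse.urlencode({"hello": "world"}).encode("utf-8")
# headers: make the request look like it comes from a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 "
                  "Safari/537.36"
}
# Build the request object (URL, form data, headers, HTTP method, and so on)
req = urllib.request.Request(url=url, data=data, headers=headers, method='POST')    # method: the HTTP method to use
response = urllib.request.urlopen(req)  # send it
print(response.read().decode('utf-8'))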
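Before sending anything, the Request object can be inspected to check what it will carry. A small offline sketch follows; the shortened User-Agent string is a placeholder, not a real browser signature.

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"
data = urllib.parse.urlencode({"hello": "world"}).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0 (illustrative value)"}  # placeholder UA

req = urllib.request.Request(url=url, data=data, headers=headers, method="POST")

print(req.full_url)        # the target URL
print(req.get_method())    # the HTTP method that will be used
print(req.data)            # the encoded form body
print(req.header_items())  # headers as stored (urllib normalizes the key to "User-agent")

# Without an explicit method, urllib picks GET, or POST when data is present.
default_get = urllib.request.Request(url).get_method()
default_post = urllib.request.Request(url, data=data).get_method()

# The User-Agent urllib sends when no header is supplied.
default_agent = dict(urllib.request.build_opener().addheaders)["User-agent"]
print(default_get, default_post, default_agent)
```

The default agent identifies the client as Python-urllib, which is why sites that filter crawlers tend to reject requests sent without a custom header.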


# Visiting Douban while pretending to be a browser
url = "http://www.douban.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 "
                  "Safari/537.36"
}
req = urllib.request.Request(url=url, headers=headers)  # no data and no method, so this is a GET request
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))

Did you get all that?
