Python Web Crawler Simple (2): Installing and Using requests and beautifulsoup4

Original

2020-06-24 21:59

1 requests

1.1 Introduction to the requests package

The requests library is built on top of urllib3.

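Main methods of requests:
requests.request() builds and sends a request; it underlies all of the methods below
requests.get() the main method for fetching an HTML page (HTTP GET)
requests.head() fetches only the response headers of a page (HTTP HEAD)
requests.post() submits a POST request to a URL
requests.put() submits a PUT request to a URL
requests.patch() submits a partial-modification (PATCH) request to a URL
requests.delete() submits a DELETE request to a URL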
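
To make the mapping above concrete, here is a minimal sketch showing that each helper corresponds to an HTTP verb handed to requests.request(). It only builds the requests with requests.Request(...).prepare() and never sends them, so it runs without network access. The loop and the variable names are illustrative and not part of the original code.

```python
import requests

url = "http://www.baidu.com"

# requests.get(url) is shorthand for requests.request("GET", url);
# the other helpers map to their HTTP verbs in the same way.
for method in ("GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"):
    # prepare() builds the request object without sending it
    prepared = requests.Request(method, url).prepare()
    print(prepared.method, prepared.url)
```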

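Main attributes of the response:
r.status_code  # response status code
r.raw  # the raw response body, i.e. the underlying urllib3 response object; pass stream=True to the request and read it with r.raw.read()
r.content  # response body as bytes; gzip and deflate compression is decoded automatically
r.text  # response body as a string, decoded automatically using the character encoding from the response headers
r.headers  # server response headers stored in a dictionary-like object with case-insensitive keys; r.headers.get() returns None for a missing key
r.json()  # the JSON decoder built into Requests
r.raise_for_status()  # raises an exception for failed requests (4xx or 5xx status codes)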
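
The attributes above are often combined in one small helper. The following is a sketch, not a definitive implementation: the function name fetch_text and the 10-second timeout are assumptions made for illustration.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails.

    `fetch_text` and the default timeout are hypothetical choices.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    except requests.RequestException as exc:
        # RequestException is the base class of the exceptions requests raises
        print("request failed:", exc)
        return None
    # Header lookup is case-insensitive; .get() returns None for a missing key
    print(r.status_code, r.headers.get("content-type"))
    return r.text
```

Calling fetch_text('http://www.baidu.com') would return the page text when the request succeeds. A malformed URL or a connection error yields None instead of an uncaught exception.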

1.2 Installing requests

Run: pip install requests

After installation, the installed package appears in the Python package directory.

1.3 Basic usage of requests

First, import the requests library:

import requests

Send a GET request:

response1 = requests.get(url='http://www.baidu.com')

Then print the returned status code:

print(response1.status_code)

We can also set request headers, for example a browser User-Agent (this is a header, not a proxy):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
response2 = requests.get(url='http://www.baidu.com', headers=headers)

Then print the response text:

print(response2.text)
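
When several requests share the same headers, a requests.Session can carry them. The sketch below is one possible approach, not the author's code: it prepares a request without sending it, so the header that would go out can be inspected offline. The names UA, session, and prepared are illustrative.

```python
import requests

UA = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36")

session = requests.Session()
session.headers.update({"User-Agent": UA})  # applied to every request of this session

# Build the request without sending it, to inspect the outgoing headers
prepared = session.prepare_request(requests.Request("GET", "http://www.baidu.com"))
print(prepared.headers["user-agent"])  # header names are case-insensitive

# To actually send it (needs network access):
# response = session.send(prepared, timeout=10)
```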

For the code, see chapter2-1.py in chapter2 of https://github.com/alifeidao/python-spider-simple

2 beautifulsoup4

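2.1 Introduction to beautifulsoup4

Chinese version of the official Beautiful Soup 4 documentation:
https://beautifulsoup.readthedocs.io/zh_CN/latest/

The beautifulsoup library parses, traverses, and maintains a "tag tree".
It has 5 basic elements:

Tag: a tag
Name: the name of a tag, tag.name
Attribute: an attribute of a tag, read with tag['attribute']
NavigableString: the text inside a tag, tag.string
Comment: a comment in HTML or XML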
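
The five elements can be seen on a tiny document. This is a minimal sketch; the HTML string is made up for illustration and is not taken from any real page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical HTML used only to illustrate the basic elements
html = '<p class="title"><b>Hello</b></p><p><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p            # Tag: the first <p> in the document
print(tag.name)         # Name: p
print(tag["class"])     # Attribute: ['title'] (class can hold several values, so it is a list)
print(tag.string)       # NavigableString: Hello

comment = soup.find_all("p")[1].string   # the second <p> contains only a comment
print(isinstance(comment, Comment))      # True
```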

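The commonly used find_all function:
find_all( name , attrs , recursive , string , **kwargs )
name parameter: finds all tags whose name is name.
attrs parameter: filters on the attributes of a tag.
string parameter: searches the string content of the document.
recursive parameter: when find_all() is called on a tag, Beautiful Soup searches all of that tag's descendants. To search only the tag's direct children, pass recursive=False.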
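
The parameters above can be compared side by side on a small example. This is a sketch under stated assumptions: the HTML fragment and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link directly inside the div, one nested in a <p>
html = """
<div id="box">
  <a href="/a" class="nav">first</a>
  <p><a href="/b">second</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", id="box")

all_links = box.find_all("a")                          # name: every <a> descendant
nav_links = box.find_all("a", attrs={"class": "nav"})  # attrs: filter on attributes
direct = box.find_all("a", recursive=False)            # only direct children of the div
by_text = box.find_all(string="second")                # string: match text content

print(len(all_links), len(nav_links), len(direct), by_text)
```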

2.2 Installing beautifulsoup4

First install lxml, an XML and HTML parsing library:
pip install lxml

Then install beautifulsoup4:
pip install beautifulsoup4
When the installation succeeds, pip prints a confirmation message.

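2.3 Basic usage of beautifulsoup

First, create a BeautifulSoup instance named bsoup:

from bs4 import BeautifulSoup

response_text = response1.text  # response1 is the result of requests.get() in section 1.3
bsoup = BeautifulSoup(response_text, 'html.parser')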

Then print the first p tag:

p = bsoup.p
print("First p tag:")
print(p)

For the code, see chapter2-2.py in chapter2 of https://github.com/alifeidao/python-spider-simple

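3 Using requests and beautifulsoup4 together

Hands-on project: fetching the weather forecast
1 Use the requests library to fetch images
2 Use the BeautifulSoup library to parse the fetched page content.
3 Use the os library to create folders and list the file names in a folder

Step 1: Open http://www.weather.com.cn/weather/101010100.shtml in Chrome.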

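Step 2: Open the developer tools, find the weather element in the page source, right-click it, and choose "Copy" ➔ "Copy selector" from the context menu to copy its path automatically.
Paste the path into your document:
#\37 d > ul > li.sky.skyid.lv2.on
Simplify the path slightly:
ul > li.sky.skyid.lv2.on
Then pass it directly to Beautiful Soup's select():
datas = bsoup.select('ul > li.sky.skyid.lv2.on')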
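
In the copied path, \37 is the CSS escape for the character 7, so #\37 d refers to an element whose id is 7d. The sketch below tries the simplified selector on a hand-written fragment. The HTML is an assumption that only imitates the structure implied by the selector; it is not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure implied by the selector
html = """
<div id="7d">
  <ul>
    <li class="sky skyid lv2 on"><h1>24th (today)</h1><p class="wea">Sunny</p></li>
    <li class="sky skyid lv1"><h1>25th (tomorrow)</h1><p class="wea">Cloudy</p></li>
  </ul>
</div>
"""
bsoup = BeautifulSoup(html, "html.parser")

# li.sky.skyid.lv2.on requires all four classes on the same <li>
today = bsoup.select("ul > li.sky.skyid.lv2.on")
# A shorter selector matches every forecast item
all_days = bsoup.select("ul > li.sky")
# select_one() returns the first match, or None if nothing matches
first = bsoup.select_one("ul > li.sky.skyid.lv2.on")

print(len(today), len(all_days))
print(first.h1.get_text())
```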

Step 3: Clean and organize the data

weathers = bsoup.find_all('li', class_='sky skyid lv2 on')

…

date = weather.find('h1').text
print("Date:", end=" ")
print(date)

wea = weather.find('p', class_='wea').text
print("Weather:", end=" ")
print(wea)
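Step 4: Run the script to see the results.

For the code, see chapter2-3.py in chapter2 of https://github.com/alifeidao/python-spider-simple
Note:
The tag for today's weather may change, so the code needs to be adjusted accordingly.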
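
One way to reduce the need for such adjustments is to match on fewer classes. Passing the whole string 'sky skyid lv2 on' to class_ matches only that exact class attribute value, so any change in the classes breaks the search. The sketch below is not the repository's code: the function parse_forecast, the sample HTML, and the choice of the li.sky selector are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the forecast list; not copied from the live page
SAMPLE_HTML = """
<ul>
  <li class="sky skyid lv2 on"><h1>24th (today)</h1><p class="wea">Sunny</p></li>
  <li class="sky skyid lv1"><h1>25th (tomorrow)</h1><p class="wea">Cloudy</p></li>
</ul>
"""


def parse_forecast(html):
    """Return a list of {'date', 'weather'} dicts, one per forecast item."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Match on the single class 'sky' so items still match if the other classes change
    for item in soup.select("ul > li.sky"):
        date = item.find("h1")
        wea = item.find("p", class_="wea")
        if date is None or wea is None:
            continue  # skip items that lack the expected children
        rows.append({
            "date": date.get_text(strip=True),
            "weather": wea.get_text(strip=True),
        })
    return rows


forecast = parse_forecast(SAMPLE_HTML)
for row in forecast:
    print("Date:", row["date"], "Weather:", row["weather"])
```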
