3.3 Example: Scraping the Baidu Homepage with XPath
import requests
from lxml import etree

# headers: request header information (the User-Agent makes the request look like a normal browser)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
response = requests.get('https://www.baidu.com', headers=headers)
response.encoding = 'utf-8'
selector = etree.HTML(response.text)
print(response.text)
# Text content of the first <a> tag under the node with id="s-top-left"
news_text = selector.xpath('//*[@id="s-top-left"]/a[1]/text()')[0]
print(news_text)
# href attribute of the same <a> tag
news_url = selector.xpath('//*[@id="s-top-left"]/a[1]/@href')[0]
print(news_url)
1. Read the page content
2. Get the text content of a tag
3. Get an attribute of a tag
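The three steps above can be sketched offline on a small inline HTML snippet, so the pattern can be tested without fetching the live page. The snippet below is fabricated for illustration (the real Baidu layout, including the id "s-top-left", may change at any time); it also guards against `xpath()` returning an empty list instead of indexing `[0]` blindly.

```python
from lxml import etree

# A small inline HTML snippet standing in for the Baidu homepage;
# the id "s-top-left" is copied from the example above but the
# content here is made up for demonstration.
html = '''
<div id="s-top-left">
  <a href="http://news.baidu.com">News</a>
  <a href="http://tieba.baidu.com">Tieba</a>
</div>
'''

selector = etree.HTML(html)

# xpath() returns a list; it is empty when the node is missing,
# so check before indexing instead of assuming [0] exists.
texts = selector.xpath('//*[@id="s-top-left"]/a[1]/text()')
news_text = texts[0] if texts else None

urls = selector.xpath('//*[@id="s-top-left"]/a[1]/@href')
news_url = urls[0] if urls else None

print(news_text, news_url)
```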
3.4 The Beautiful Soup Library
import requests
from bs4 import BeautifulSoup

# headers: request header information
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
response = requests.get('https://www.baidu.com', headers=headers)
response.encoding = 'utf-8'
html = response.text
print(html)
# Pass the HTML into the BeautifulSoup constructor to get a document object, soup
# (the 'lxml' argument selects the lxml parser; the library must be installed but need not be imported)
soup = BeautifulSoup(html, 'lxml')
soup
# Get the first <a> tag
soup.a
# Get the text content of the first <a> tag
soup.a.string
# Nested access
soup.div.div.li.a.string
# Get an attribute value
soup.a['class']
# Get an attribute value with the get method
soup.a.get('class')
# Store all direct children of a tag in a list
soup.ul.contents
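These accessors are easiest to see on a small, self-contained document. The HTML, tag names, and class values below are invented for the example (they are not from the Baidu page), but the access patterns are exactly the ones listed above.

```python
from bs4 import BeautifulSoup

# A tiny fabricated HTML document used only to illustrate the accessors.
html = '''
<div>
  <ul class="menu">
    <li><a class="link" href="http://news.baidu.com">News</a></li>
    <li><a class="link" href="http://tieba.baidu.com">Tieba</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

first_a = soup.a              # first <a> in document order
text = soup.a.string          # its text content
cls = soup.a['class']         # attribute access; class is multi-valued, so a list
same = soup.a.get('class')    # get() returns None instead of raising if the attribute is missing
children = soup.ul.contents   # direct children of <ul>, including whitespace text nodes
```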
# find_all method: get all matching tags
all_a = soup.find_all('a')
all_a
# Shorthand for find_all
soup('a')
# Get the second <a> tag (indexing starts at 0)
soup.find_all('a')[1]
Find by attribute value
soup.find_all(id="s-top-left")
soup.find_all(href="http://tieba.baidu.com")[0]
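Attribute filtering can also be checked offline. The snippet below reuses the id and href values from the example above inside a fabricated HTML fragment, showing that keyword arguments to find_all match on attribute values.

```python
from bs4 import BeautifulSoup

# Fabricated fragment; the id/href values mirror the Baidu example
# but the HTML itself is made up for demonstration.
html = '''
<div id="s-top-left">
  <a href="http://news.baidu.com">News</a>
  <a href="http://tieba.baidu.com">Tieba</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

# Keyword arguments filter on attribute values.
top_left = soup.find_all(id="s-top-left")                  # matches the <div>
tieba = soup.find_all(href="http://tieba.baidu.com")[0]    # matches the second <a>

print(tieba.string)
```

Note that `class` clashes with the Python keyword, so BeautifulSoup accepts `class_="..."` (or `attrs={'class': '...'}`) for that attribute.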