3.3 Example: Scraping the Baidu Homepage with XPath
import requests
from lxml import etree

# headers: request header information (the User-Agent makes the request look like a normal browser)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
response = requests.get('https://www.baidu.com', headers=headers)
response.encoding = 'utf-8'
selector = etree.HTML(response.text)
print(response.text)
# Text content of the first <a> tag under the node with id="s-top-left"
news_text = selector.xpath('//*[@id="s-top-left"]/a[1]/text()')[0]
print(news_text)
# href attribute of the same <a> tag
news_url = selector.xpath('//*[@id="s-top-left"]/a[1]/@href')[0]
print(news_url)
1. Read the page content
2. Get the text content of a tag
3. Get an attribute of a tag
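The three steps above can be sketched offline on a small inline HTML snippet, so the pattern can be tested without fetching the live page. The snippet below is fabricated for illustration (the real Baidu layout, including the id "s-top-left", may change at any time); it also guards against `xpath()` returning an empty list instead of indexing `[0]` blindly.

```python
from lxml import etree

# A small inline HTML snippet standing in for the Baidu homepage;
# the id "s-top-left" is copied from the example above but the
# content here is made up for demonstration.
html = '''
<div id="s-top-left">
  <a href="http://news.baidu.com">News</a>
  <a href="http://tieba.baidu.com">Tieba</a>
</div>
'''

selector = etree.HTML(html)

# xpath() returns a list; it is empty when the node is missing,
# so check before indexing instead of assuming [0] exists.
texts = selector.xpath('//*[@id="s-top-left"]/a[1]/text()')
news_text = texts[0] if texts else None

urls = selector.xpath('//*[@id="s-top-left"]/a[1]/@href')
news_url = urls[0] if urls else None

print(news_text, news_url)
```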
3.4 The Beautiful Soup Library
import requests
from bs4 import BeautifulSoup

# headers: request header information
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
response = requests.get('https://www.baidu.com', headers=headers)
response.encoding = 'utf-8'
html = response.text
print(html)
# Pass the HTML into the BeautifulSoup constructor to get a document object, soup
# (the 'lxml' argument selects the lxml parser; the library must be installed but need not be imported)
soup = BeautifulSoup(html, 'lxml')
soup
# Get the first <a> tag
soup.a
# Get the text content of the first <a> tag
soup.a.string
# Nested access
soup.div.div.li.a.string
# Get an attribute value
soup.a['class']
# Get an attribute value with the get method
soup.a.get('class')
# Store all direct children of a tag in a list
soup.ul.contents
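These accessors are easiest to see on a small, self-contained document. The HTML, tag names, and class values below are invented for the example (they are not from the Baidu page), but the access patterns are exactly the ones listed above.

```python
from bs4 import BeautifulSoup

# A tiny fabricated HTML document used only to illustrate the accessors.
html = '''
<div>
  <ul class="menu">
    <li><a class="link" href="http://news.baidu.com">News</a></li>
    <li><a class="link" href="http://tieba.baidu.com">Tieba</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

first_a = soup.a              # first <a> in document order
text = soup.a.string          # its text content
cls = soup.a['class']         # attribute access; class is multi-valued, so a list
same = soup.a.get('class')    # get() returns None instead of raising if the attribute is missing
children = soup.ul.contents   # direct children of <ul>, including whitespace text nodes
```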
# find_all method: get all matching tags
all_a = soup.find_all('a')
all_a
# Shorthand for find_all
soup('a')
# Get the second <a> tag (indexing starts at 0)
soup.find_all('a')[1]
Find by attribute value
soup.find_all(id="s-top-left")
soup.find_all(href="http://tieba.baidu.com")[0]
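Attribute filtering can also be checked offline. The snippet below reuses the id and href values from the example above inside a fabricated HTML fragment, showing that keyword arguments to find_all match on attribute values.

```python
from bs4 import BeautifulSoup

# Fabricated fragment; the id/href values mirror the Baidu example
# but the HTML itself is made up for demonstration.
html = '''
<div id="s-top-left">
  <a href="http://news.baidu.com">News</a>
  <a href="http://tieba.baidu.com">Tieba</a>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

# Keyword arguments filter on attribute values.
top_left = soup.find_all(id="s-top-left")                  # matches the <div>
tieba = soup.find_all(href="http://tieba.baidu.com")[0]    # matches the second <a>

print(tieba.string)
```

Note that `class` clashes with the Python keyword, so BeautifulSoup accepts `class_="..."` (or `attrs={'class': '...'}`) for that attribute.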