Python: Scraping Web Page Information Ⅰ

07. Installing lxml

01. Fetching a page and saving it locally

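from urllib import request

# Load a single page and return its HTML as text
def loadPage(url):
    # Build the request
    req = request.Request(url)
    # Open the URL; the with-block closes the response when done
    with request.urlopen(req) as response:
        # Read the raw response body (bytes)
        html = response.read()
    # Decode the bytes into a string
    content = html.decode('utf-8')
    return content

# Save the downloaded content to a local file
def writePage(html, filename):
    print('Saving to: ' + filename)
    # The with-block closes the file even if the write fails
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

url = 'https://tieba.baidu.com/f?kw=%E6%9F%AF%E5%8D%97&ie=utf-8'
content = loadPage(url)
print(content)
filename = 'tieba.html'
writePage(content, filename)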
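
loadPage sends the request with urllib's default headers. Some sites reject requests that carry the default User-Agent; whether Tieba does is not stated here, so treat that as an assumption. A minimal sketch of attaching a header, where buildRequest and the User-Agent string are hypothetical names and values chosen for illustration:

```python
from urllib import request

# Hypothetical helper (not part of the original script): build a Request
# that carries an explicit User-Agent header.
def buildRequest(url, userAgent='Mozilla/5.0 (example)'):
    return request.Request(url, headers={'User-Agent': userAgent})

req = buildRequest('https://tieba.baidu.com/f?kw=%E6%9F%AF%E5%8D%97&ie=utf-8')
# urllib stores header names capitalized, so the key is 'User-agent'
print(req.get_header('User-agent'))
```

The returned object can be passed to request.urlopen in place of the plain Request built inside loadPage.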

02. Setting the start and end pages

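from urllib import request

# Load a single page and return its HTML as text
def loadPage(url):
    # Build the request
    req = request.Request(url)
    # Open the URL; the with-block closes the response when done
    with request.urlopen(req) as response:
        # Read the raw response body (bytes)
        html = response.read()
    # Decode the bytes into a string
    content = html.decode('utf-8')
    return content

# Save the downloaded content to a local file
def writePage(html, filename):
    print('Saving to: ' + filename)
    # The with-block closes the file even if the write fails
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

# Crawl every page from beginPage to endPage (inclusive)
def tiebaSpider(url, beginPage, endPage):
    for page in range(beginPage, endPage + 1):
        # pn is the offset of the page; it grows by 50 per page
        pn = 50 * (page - 1)
        fullurl = url + '&pn=' + str(pn)
        content = loadPage(fullurl)
        filename = 'page_' + str(page) + '.html'
        writePage(content, filename)

url = 'https://tieba.baidu.com/f?kw=%E6%9F%AF%E5%8D%97&ie=utf-8'
tiebaSpider(url, 1, 4)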
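
To check the paging arithmetic without downloading anything, the URL construction from tiebaSpider can be pulled out on its own. This is only a sketch: buildPageUrls and its pageSize parameter are hypothetical names introduced here, and the step of 50 is taken from the code above.

```python
# Hypothetical helper: list the URLs that tiebaSpider would request.
def buildPageUrls(url, beginPage, endPage, pageSize=50):
    urls = []
    for page in range(beginPage, endPage + 1):
        # Page 1 has offset 0, page 2 has offset 50, and so on
        pn = pageSize * (page - 1)
        urls.append(url + '&pn=' + str(pn))
    return urls

base = 'https://tieba.baidu.com/f?kw=%E6%9F%AF%E5%8D%97&ie=utf-8'
for pageUrl in buildPageUrls(base, 1, 4):
    print(pageUrl)
```

For pages 1 to 4 the offsets are 0, 50, 100 and 150.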

03. Taking parameters from user input

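from urllib import request, parse

# Load a single page and return its HTML as text
def loadPage(url):
    # Build the request
    req = request.Request(url)
    # Open the URL; the with-block closes the response when done
    with request.urlopen(req) as response:
        # Read the raw response body (bytes)
        html = response.read()
    # Decode the bytes into a string
    content = html.decode('utf-8')
    return content

# Save the downloaded content to a local file
def writePage(html, filename):
    print('Saving to: ' + filename)
    # The with-block closes the file even if the write fails
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(html)

# Crawl every page from beginPage to endPage (inclusive).
# kw is passed in explicitly instead of being read from a global variable.
def tiebaSpider(url, kw, beginPage, endPage):
    for page in range(beginPage, endPage + 1):
        # pn is the offset of the page; it grows by 50 per page
        pn = 50 * (page - 1)
        fullurl = url + '&pn=' + str(pn)
        content = loadPage(fullurl)
        filename = kw + '_page_' + str(page) + '.html'
        writePage(content, filename)

if __name__ == '__main__':
    kw = input('Enter the name of the Tieba forum to crawl: ')
    beginPage = int(input('Enter the start page: '))
    endPage = int(input('Enter the end page: '))
    # urlencode percent-encodes the keyword so it is safe in a query string
    key = parse.urlencode({'kw': kw})
    url = 'https://tieba.baidu.com/f?' + key
    tiebaSpider(url, kw, beginPage, endPage)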
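
The call to parse.urlencode is what turns the typed forum name into the percent-encoded form seen in the hard-coded URLs of the earlier sections. A short sketch that can be run offline; the extra ie and pn parameters are included only to show how several fields are joined:

```python
from urllib import parse

kw = '柯南'

# A single parameter: non-ASCII characters are UTF-8 percent-encoded
key = parse.urlencode({'kw': kw})
print(key)

# Several parameters are joined with '&' in dictionary order
query = parse.urlencode({'kw': kw, 'ie': 'utf-8', 'pn': 50})
print('https://tieba.baidu.com/f?' + query)

# unquote reverses the percent-encoding
print(parse.unquote('%E6%9F%AF%E5%8D%97'))
```

The first line printed is kw=%E6%9F%AF%E5%8D%97, the same query string used in sections 01 and 02.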

04. Finding the image links in a thread

from urllib import request
from lxml import etree

# Print the image links found in a thread
def loadImage(url):
    # Build the request
    req = request.Request(url)
    # Open the URL; the with-block closes the response when done
    with request.urlopen(req) as response:
        # Read the raw response body (bytes)
        html = response.read()
    # Decode the bytes into a string
    content = html.decode('utf-8')
    # Parse the HTML into an element tree
    tree = etree.HTML(content)
    # Select the src attribute of every <img class="BDE_Image">
    link_list = tree.xpath('//img[@class="BDE_Image"]/@src')
    for link in link_list:
        print(link)

url = 'https://tieba.baidu.com/p/6243133196'
loadImage(url)

05. Saving the images to files

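import os
from urllib import request
from lxml import etree

# Find the image links in a thread and download each one
def loadImage(url):
    # Build the request
    req = request.Request(url)
    # Open the URL; the with-block closes the response when done
    with request.urlopen(req) as response:
        # Read the raw response body (bytes)
        html = response.read()
    # Decode the bytes into a string
    content = html.decode('utf-8')
    # Parse the HTML into an element tree
    tree = etree.HTML(content)
    # Select the src attribute of every <img class="BDE_Image">
    link_list = tree.xpath('//img[@class="BDE_Image"]/@src')
    for link in link_list:
        print(link)
        writeImage(link)

# Download one image and save it to a file
def writeImage(url):
    # Build the request
    req = request.Request(url)
    # Open the URL and read the image bytes
    with request.urlopen(req) as response:
        image = response.read()
    # Name the file after the last 15 characters of the URL
    filename = url[-15:]
    # Create the img folder if it does not exist yet
    os.makedirs('img', exist_ok=True)
    # Write in binary mode because the content is image data
    with open(os.path.join('img', filename), 'wb') as f:
        f.write(image)

if __name__ == '__main__':
    url = 'https://tieba.baidu.com/p/6243133196'
    loadImage(url)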
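
Naming each file after the last 15 characters of its URL works only while every image URL ends in a plain file name of a suitable length. One possible alternative, offered as a sketch rather than as the original method, is to take the last path segment of the URL. The helper name imageFilename, the fallback name and the example URL are all made up for illustration.

```python
import os
from urllib import parse

# Hypothetical helper: derive a local file path from an image URL.
def imageFilename(url, folder='img'):
    # urlparse separates the path from any ?query or #fragment
    path = parse.urlparse(url).path
    # The last path segment is the file name
    name = os.path.basename(path)
    if not name:
        # Fallback for URLs that end in '/', chosen arbitrarily
        name = 'unnamed'
    return os.path.join(folder, name)

# Made-up URL, used only to show the behavior
print(imageFilename('https://example.com/forum/pic/item/abc123.jpg?from=thread'))
```

The result could be passed to open in writeImage in place of the hand-built 'img/' + filename path.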

06. XPath

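Installing the XPath Helper extension

Enable developer mode on Chrome's extensions page.
Change the file extension of xpath_helper_2_0_2.crx to rar (that is, xpath_helper_2_0_2.rar) and extract the archive.
Load the extracted XPath Helper folder as an unpacked extension.
On a Tieba page, click the XPath Helper button.
A panel with QUERY and RESULT boxes appears. Type a simple XPath rule (for example //div) into QUERY, and RESULT shows the matching content.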

XPath syntax

Finding tags

Search the whole document (starts with //)	Search relative to the current node (starts with ./)	Search along a child path
//div //span //a	./div ./span ./a	//div/span //div/a

Finding by attribute

tag[@attribute="value"]
//span[@class="threadlist_rep_num center_text"] //div/a[@class="j_th_tit"]

Reading an attribute

tag/@attribute
//div/a[@class="j_th_tit"]/@href //img[@class="card_head_img"]/@src

Reading the content

tag/text()
//div/a[@class="j_th_tit"]/text()

Note: without text() the result is the element object; with text() the result is the text itself (a string).
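
The rules above can be tried in Python with lxml, the library used in sections 04 and 05. The HTML below is made up for illustration and only borrows the class names from the examples; real Tieba markup may differ.

```python
from lxml import etree

# Made-up HTML that reuses the class names from the rules above
html = '''
<div>
  <a class="j_th_tit" href="/p/1">First thread</a>
  <span class="threadlist_rep_num center_text">12</span>
</div>
<div>
  <a class="j_th_tit" href="/p/2">Second thread</a>
  <span class="threadlist_rep_num center_text">3</span>
</div>
'''
tree = etree.HTML(html)

# Without text() or @attribute the result is a list of element objects
elements = tree.xpath('//div/a[@class="j_th_tit"]')
print(elements[0].tag)

# /@href reads the attribute value of each match
hrefs = tree.xpath('//div/a[@class="j_th_tit"]/@href')
print(hrefs)

# /text() reads the text of each match
titles = tree.xpath('//div/a[@class="j_th_tit"]/text()')
print(titles)

# ./ searches relative to the current node, here each <div> in turn
counts = [div.xpath('./span/text()') for div in tree.xpath('//div')]
print(counts)
```

Elements returned by the first query can be queried again with a ./ rule, which is how the last step collects one result list per div.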
