Scraping images from Baidu Tieba with Python urllib.request and lxml etree and saving them locally

Original

2020-06-16 02:27

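This script uses Python's urllib.request and lxml's etree to scrape the images on a Baidu Tieba page and save them to a local folder. The source code is as follows: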

import re
import time
import urllib.request
from lxml import etree

# ------ Fetch the raw HTML of a page ------
def getHtml(url):
    # page = urllib.request.urlopen(url)
    # html = page.read()

    # Send a browser-like User-Agent; the header value must not repeat the header name
    headers = {'User-Agent': 'Mozilla/5.0'}
    request = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(request).read()
    return html

# ------ Pass the URL of any post to getHtml() ------
html = getHtml('https://tieba.baidu.com/index.html')
# ------ Optionally decode the html bytes as UTF-8 (etree.HTML also accepts bytes) ------
# html = html.decode('UTF-8')

# ------ Collect every <img> element in the post ------
def getImg(html):
    # ------ Parse the page and select the image elements with XPath ------
    tree = etree.HTML(html)
    imglist = tree.xpath('//img')
    return imglist
    # reg = r'src="([.*\S]*\.jpg)"'
    # imgname = r'alt="*"'
    # imgre = re.compile(reg);
    # imgnamelist = re.findall(imgname,html)
    # imglist = re.findall(imgre, html)
    # return imglist,imgnamelist

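imgList = getImg(html)
imgNamenum = 0  # number of images attempted
for one in imgList:
    # ------ Exception handling is used here; multithreading would also help on large pages ------
    imgPath = one.get('src')
    if not imgPath:
        # Skip <img> tags that have no src attribute
        continue
    try:
        if imgPath.startswith('//'):
            # Protocol-relative URL: only the scheme is missing
            imgPath = 'https:' + imgPath
        elif imgPath[:4] != 'http':
            imgPath = 'https://tieba.baidu.com/' + imgPath.lstrip('/')
        imgName = one.get('alt')
        if not imgName:
            # Fall back to a timestamp when the image has no alt text
            imgName = str(time.time())
        # Replace characters that are not allowed in Windows file names
        imgName = re.sub(r'[\\/:*?"<>|]', '_', imgName)
        data = urllib.request.urlopen(imgPath).read()
        # The with-block closes the file even if the write fails
        with open('D:\\Temp\\' + imgName + ".jpg", 'wb') as f:
            f.write(data)
        print(imgPath)
        time.sleep(0.1)
    except Exception as e:
        print(imgPath + " error:", e)
    imgNamenum += 1

print("All Done!")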
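
The comment in the download loop recommends multithreading. Below is a minimal sketch of one way that could look, using only the standard library. The helper names (resolve_url, safe_name, fetch, download_all), the worker count, the timeout, and the image_N fallback name are all assumptions made for illustration and are not part of the original script. It uses urllib.parse.urljoin to turn relative src values into absolute URLs instead of concatenating strings by hand.

```python
import os
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

# Assumption: the page the <img> tags were taken from
BASE_URL = 'https://tieba.baidu.com/index.html'


def resolve_url(src, base=BASE_URL):
    """Resolve a relative, root-relative, or protocol-relative src against the page URL."""
    return urljoin(base, src)


def safe_name(alt, index):
    """Build a file name from the alt text; fall back to image_<index> (hypothetical scheme)."""
    name = re.sub(r'[\\/:*?"<>|]', '_', alt or '').strip()
    return name or 'image_%d' % index


def fetch(url):
    """Download one URL and return its bytes."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()


def download_all(items, out_dir, fetch=fetch, workers=4):
    """Download (src, alt) pairs concurrently.

    Returns a list of (url, saved_path) in input order; saved_path is None on failure.
    """
    def job(indexed_item):
        index, (src, alt) = indexed_item
        url = resolve_url(src)
        path = os.path.join(out_dir, safe_name(alt, index) + '.jpg')
        try:
            data = fetch(url)
            with open(path, 'wb') as f:
                f.write(data)
            return url, path
        except Exception as e:
            # One failed image should not stop the other downloads
            print(url, 'error:', e)
            return url, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, enumerate(items)))
```

With the script above, this could be called as download_all([(img.get('src'), img.get('alt')) for img in imgList if img.get('src')], 'D:\\Temp'). The fetch parameter is injectable so the function can be tried without network access. Two images with the same alt text would still overwrite each other in this sketch.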

The result is as follows:

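Note: This article is intended for technical exchange only and must not be used for commercial purposes. The author accepts no responsibility for anyone who disregards this.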
