Scraping JavaScript-Rendered Web Pages with Python

Original

李奇峰1998

2020-06-20 19:36

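A while ago I scraped some images from Baidu Tieba and saved them to disk. I then tried the same approach on the 淘女郎-美人庫 model gallery, but no matter how I wrote the regular expression, I could not extract the image links that were visible in the browser's "Elements" panel. After some searching I found the reason: the content I wanted is generated by JavaScript and only exists after the browser has rendered the page. The request.urlopen() call I had been using returns only the raw HTML, so it no longer works here. Instead, the page has to be rendered with PhantomJS, and the rendered source can then be filtered as before. Without further ado, here is the code.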
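Before the full script, a minimal sketch of the failure described above may help. Both HTML snippets are made up for illustration (they are not the real Taobao markup); the pattern is the image pattern used later in this post. The raw snippet stands in for what request.urlopen() would return, and the rendered snippet for what a browser holds after its scripts have run.

```python
import re

# Hypothetical markup: what the server sends before any JavaScript runs.
RAW_HTML = """
<html><body>
  <div id="list"></div>
  <script src="/js/render-list.js"></script>
</body></html>
"""

# Hypothetical markup: the same page after the browser has executed the script.
RENDERED_HTML = """
<html><body>
  <div id="list">
    <div class="img"><img src="//img.example.com/a.jpg"></div>
    <div class="img"><img src="//img.example.com/b.jpg"></div>
  </div>
</body></html>
"""

# Same image pattern as in the script below.
IMG_PATTERN = re.compile(r'<div class="img"><img src="//(.*?)"></div>')


def find_image_links(html):
    """Return every image link the pattern can see in the given source."""
    return IMG_PATTERN.findall(html)


print(find_image_links(RAW_HTML))       # []
print(find_image_links(RENDERED_HTML))  # ['img.example.com/a.jpg', 'img.example.com/b.jpg']
```

The regular expression itself is fine; it simply has nothing to match until the page has been rendered.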

from selenium import webdriver
import re
from urllib import request
class heiheihei:
    # Constructor; url is the address of the page to scrape
    def __init__(self,url):
        self.url=url

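    # Return the page source after JavaScript has run (the rendered page)
    def getPage(self):
        # Start PhantomJS to render the page
        # (webdriver.PhantomJS needs Selenium 3 or earlier; it was removed in Selenium 4)
        driver = webdriver.PhantomJS(executable_path=r'C:\Users\liqifeng\AppData\Local\Programs\Python\Python36\Scripts\phantomjs')
        try:
            # Load the page
            driver.get(self.url)
            # Return the rendered page source
            return driver.page_source
        finally:
            # Always shut down the PhantomJS process, even if loading fails
            driver.quit()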

    # Get the name of every model on the page
    def getName(self):
        # Fetch the rendered page source via getPage()
        content=self.getPage()
        # Capture the text inside each <span class="name"> element
        pattern=re.compile('<span class="name">(.*?)</span>',re.S)
        # findall() already returns a list of every match, so return it directly
        return re.findall(pattern,content)
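    # Get the links to all model images on the page
    def getJpg(self):
        # Avoid naming the variable "list", which would shadow the built-in
        links=[]
        content=self.getPage()
        pattern=re.compile('<div class="img"><img src="//(.*?)"></div>')
        # Second pattern keeps the part of each link starting at 'gtd'; compiled once, outside the loop
        pattern2=re.compile('gtd.*',re.S)
        for item in re.findall(pattern, content):
            result2=re.search(pattern2,item)
            # Fall back to the whole link when 'gtd' is absent, so the list stays aligned with the names
            links.append(result2.group() if result2 else item)
        return links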
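    # Save the images to disk
    # url is the list of image links
    # filename is the list of image names
    def saveJpg(self,url,filename):
        # Walk through both lists in step, pairing each link with its name
        for (jpg,name) in zip(url,filename):
            # Complete the URL
            jpgurl='http://'+jpg
            # Open the URL and read the image bytes
            with request.urlopen(jpgurl) as req:
                u=req.read()
            # Build the file path (the E:/jpg/ directory must already exist)
            files='E:/jpg/'+name+'.jpg'
            # Write the image bytes to the file
            with open(files,'wb') as f:
                f.write(u)
            print('Saved image '+name+'.jpg')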

# Create an instance and pass in the URL
test=heiheihei('https://mm.taobao.com/search_tstar_model.htm?spm=5679.126488.640745.2.b17c0adHu3H5A')
# Get the names
lalala=test.getName()
# Get the image links
hahaha=test.getJpg()
# Save the images
test.saveJpg(hahaha,lalala)

The above is the result of running the program.
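PhantomJS is no longer maintained, and Selenium 4 no longer ships webdriver.PhantomJS, so the script as written needs an older Selenium. The following is only a sketch of one possible variant, not the method used in this post: it swaps in headless Chrome, renders the page once, and pairs names with image links from that single source. The names render_page, extract_models and SAMPLE_HTML are my own, the sample markup is invented, and the sketch assumes Chrome and a matching driver are available to Selenium. It also skips the 'gtd' filtering step from getJpg. A real page may need an explicit wait before page_source is read.

```python
import re

NAME_PATTERN = re.compile(r'<span class="name">(.*?)</span>', re.S)
IMG_PATTERN = re.compile(r'<div class="img"><img src="//(.*?)"></div>')


def render_page(url):
    """Return the page source after JavaScript has run (assumes Selenium 4 and Chrome)."""
    # Imported here so extract_models() below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def extract_models(html):
    """Pair each model name with its image URL, taken from one rendered source."""
    names = [name.strip() for name in NAME_PATTERN.findall(html)]
    links = ["http://" + link for link in IMG_PATTERN.findall(html)]
    return list(zip(names, links))


# Invented markup, only to show the parsing step without launching a browser.
SAMPLE_HTML = """
<div class="item">
  <div class="img"><img src="//img.example.com/a.jpg"></div>
  <span class="name">Alice</span>
</div>
<div class="item">
  <div class="img"><img src="//img.example.com/b.jpg"></div>
  <span class="name">Bob</span>
</div>
"""

print(extract_models(SAMPLE_HTML))
# [('Alice', 'http://img.example.com/a.jpg'), ('Bob', 'http://img.example.com/b.jpg')]
```

In real use the call would be extract_models(render_page(url)). Rendering once matters because getName() and getJpg() above each start their own browser and load the page separately.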
