Python Crawler, Part 1: Scraping Image Packs from 愛鬥圖 (adoutu.com)

Preface

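I've had some free time lately and have been digging into Python crawlers again. Over the past few days I've written several of them and also tried crawling pages with pyspider. I'm building up experience bit by bit, and today I'd like to share a crawler for meme packs.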

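I imagine everyone enjoys a good meme battle (鬥圖). Today's crawler scrapes the image packs from the 愛鬥圖 site, which has a large and varied collection.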

This is the result scraped from the site:

Steps

The site consists mostly of static pages with a simple structure, so the crawl proceeds as follows:

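1. Send a request and receive the response.
2. Parse the first list page to get the detail-page links and the next-page link.
3. Visit each detail page and scrape the image links and other information.
4. Move to the next page and repeat steps 1 to 3.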
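The four steps above can be sketched as a small loop that is independent of any particular site. This is only an illustration, not the crawler from this post: the `crawl` function and its `fetch`, `parse_list`, and `parse_detail` parameters are hypothetical names, and the in-memory `FAKE_SITE` dictionary stands in for real HTTP requests and HTML parsing so the sketch runs without a network connection.

```python
def crawl(start_url, fetch, parse_list, parse_detail):
    """Yield (detail_url, detail_data) for every entry reachable from start_url.

    fetch(url) returns a page or None on failure,
    parse_list(page) returns (detail_urls, next_url),
    parse_detail(page) returns whatever was scraped from a detail page.
    """
    seen = set()  # list pages already visited, so a self-link cannot loop forever
    url = start_url
    while url and url not in seen:
        seen.add(url)
        page = fetch(url)                         # step 1: request and response
        if page is None:
            break
        detail_urls, next_url = parse_list(page)  # step 2: detail links + next page
        for detail_url in detail_urls:
            detail_page = fetch(detail_url)       # step 3: visit each detail page
            if detail_page is not None:
                yield detail_url, parse_detail(detail_page)
        url = next_url                            # step 4: turn the page and repeat


# A tiny in-memory "site" (hypothetical data) used instead of real requests.
FAKE_SITE = {
    "/list/1": (["/detail/a", "/detail/b"], "/list/2"),
    "/list/2": (["/detail/c"], "/list/2"),  # the last page links back to itself
    "/detail/a": ["a1.gif", "a2.gif"],
    "/detail/b": ["b1.gif"],
    "/detail/c": ["c1.gif"],
}

# The fake pages are already structured, so both parsers just return them as-is.
results = list(crawl("/list/1", FAKE_SITE.get, lambda p: p, lambda p: p))
for detail_url, pics in results:
    print(detail_url, pics)
```

Keeping a set of visited list pages is one simple way to make the loop terminate when the last page's "next" link points back to itself; whether 愛鬥圖 behaves that way is not something this sketch assumes.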

Detailed Code

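Below is the Python code. It does not download any images; it only saves the scraped data (including the image links) to a local file. To download the images as well, you can define a separate download function. That's all, thanks.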

import requests
from bs4 import BeautifulSoup as bs
import json


class Doutu:
    def __init__(self):
        self.start_url = 'http://www.adoutu.com/article/list/1'
        self.part_url = 'http://www.adoutu.com'
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
        
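    def get_page(self, url):
        # Fetch a page and return its decoded HTML, or None if the request fails
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status()  # treat HTTP error statuses as failures
            return response.content.decode()
        except (requests.RequestException, UnicodeDecodeError):
            return None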
    
    def parse_page(self, html):
        # Parse one list page; return the next-page URL and the entries on the page
        soup = bs(html, 'lxml')
        contents = soup.select('div.article-part-list > div.list-group')
        # The last pagination link is taken to be the "next page" link
        page_links = soup.select('li.page-item > a.page-link')
        next_url = self.part_url + page_links[-1]['href'] if page_links else None
        content_list = []
        for content in contents:
            # The three property fields are upload date, image count, and popularity
            props = content.select('.title-property')
            result = {
                'title': content.select_one('.title-content').get_text(),
                'detail_url': self.part_url + content.select_one('div.list-group-item > a')['href'],
                'date': props[0].get_text().replace('上傳時間：', ''),
                'nums': props[1].get_text().replace('數量：', ''),
                'hot': props[2].get_text().replace('熱度：', ''),
                'keywords': [i.get_text() for i in content.select('.detail-keyword-item')]
            }
            content_list.append(result)
        return next_url, content_list
    
    def get_detail_page(self, url):
        # Return the image URLs on one detail page (empty list if the fetch failed)
        content = self.get_page(url)
        if content is None:
            return []
        s = bs(content, 'lxml')
        results = s.select('div.detail-content > div.detail-picture')
        pics_list = [i.find('img')['src'] for i in results]
#        pic_title=[i.find('img')['title'] for i in results]
        return pics_list#, pic_title
    
    def run(self):
        # Walk the list pages from start_url, fetching each page only once
        visited = set()  # guards against a "next" link that points to a seen page
        url = self.start_url
        while url and url not in visited:
            visited.add(url)
            html = self.get_page(url)
            if not html:
                break
            next_url, content_list = self.parse_page(html)
            for i in content_list:
                i['pics_list'] = self.get_detail_page(i['detail_url'])
                self.on_save(i)
            url = next_url
                
    def on_save(self, content):
        if content:
            with open('E:/spiders/doutu/doutu.txt', 'a', encoding='utf-8') as f:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write('\n')
            print(content['title'], 'Done')


if __name__ == '__main__':
    dt=Doutu()
    dt.run()
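The post mentions that a download function could be defined separately. One possible shape for it is sketched below, assuming the output file contains one JSON object per line with `title` and `pics_list` keys, as written by `on_save`. The names `iter_records`, `local_name`, `fetch_bytes`, and `download_pack` are hypothetical, as are the `.jpg` fallback extension and the assumption that every entry in `pics_list` is an absolute URL. The sketch uses the standard library's `urllib` so it has no extra dependencies; a `requests.get(url).content` call would serve the same purpose.

```python
import json
import os
import re
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

HEADERS = {"User-Agent": "Mozilla/5.0"}  # assumption: a generic UA is enough


def iter_records(path):
    """Yield one dict per non-empty line of the JSON Lines file written by on_save."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def local_name(url, index):
    """Name a file by its position in the pack, keeping the URL's extension."""
    ext = os.path.splitext(urlsplit(url).path)[1] or ".jpg"  # fallback is a guess
    return f"{index:03d}{ext}"


def fetch_bytes(url):
    """Download one URL and return the raw bytes (raises on network errors)."""
    with urlopen(Request(url, headers=HEADERS), timeout=10) as resp:
        return resp.read()


def download_pack(record, out_dir, fetch=fetch_bytes):
    """Save every image of one record under out_dir/<title>/; return the paths."""
    # Replace characters that are not allowed in Windows file names
    folder = re.sub(r'[\\/:*?"<>|]', "_", record["title"]).strip() or "untitled"
    pack_dir = os.path.join(out_dir, folder)
    os.makedirs(pack_dir, exist_ok=True)
    paths = []
    for index, url in enumerate(record.get("pics_list", []), start=1):
        target = os.path.join(pack_dir, local_name(url, index))
        with open(target, "wb") as f:
            f.write(fetch(url))
        paths.append(target)
    return paths
```

To use it, loop over `iter_records('E:/spiders/doutu/doutu.txt')` and call `download_pack(record, out_dir)` for each record, where `out_dir` is any folder of your choice. Passing the fetch function as a parameter makes it easy to swap in `requests`, add retries, or test the file handling without touching the network.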
