Python Crawler 5.11: Using the Scrapy Framework with selenium + chromedriver

Overview

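This series of documents is a simple tutorial on Python web-crawling techniques. It mainly serves to consolidate my own knowledge, and if it happens to be useful to you as well, so much the better.
The Python version used is 3.7.4.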

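The previous article covered the concept of downloader middleware and how to use it to set random request headers and proxy IPs dynamically. This article covers a slightly more advanced middleware technique: combining Scrapy with selenium and chromedriver.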

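Why combine them? Many websites today build their pages through dynamic loading, so requesting the URL directly does not return the information shown on the page, and working out the rules of their ajax endpoints is often too complicated. Rendering the page in a real browser lets us crawl that content anyway.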

How It Works

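From our earlier study of Scrapy we know that its architecture consists of five main modules (spider, engine, downloader, scheduler, item pipelines) plus two kinds of middleware (spider middleware and downloader middleware). Scrapy's execution flow is repeated here for convenience: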

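1. The engine gets the initial requests (Requests) to crawl from the Spiders;
2. The engine schedules the requests in the scheduler and asks the scheduler for the next request to crawl;
3. The scheduler returns the next request to crawl to the engine;
4. The engine sends that request to the downloader (Downloader) through the downloader middlewares (Downloader Middlewares); during this step the process_request() function of each downloader middleware is called;
5. Once the page has finished downloading, the downloader generates a Response for it and sends it to the engine through the downloader middlewares; during this step the process_response() function of each downloader middleware is called;
6. The engine receives the Response from the downloader and sends it to the Spider through the spider middlewares (Spider Middlewares); during this step the process_spider_input() function of each spider middleware is called;
7. The Spider processes the Response and returns the scraped Items and any new (follow-up) Requests to the engine through the spider middlewares; during this step the process_spider_output() function of each spider middleware is called;
8. The engine sends the scraped Items to the item pipelines (Pipeline), sends the new Requests to the scheduler, and asks the scheduler for the next request to crawl, if there is one;
9. The process repeats (from step 2) until the scheduler has no more requests.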

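The previous chapter also covered how middleware is used. The idea behind combining Scrapy with selenium and chromedriver is to put the selenium + chromedriver logic into a downloader middleware: at the fourth step of the flow above, process_request() returns a response object directly, so the request never reaches the downloader. In other words, we bypass Scrapy's downloader module, complete the request inside the downloader middleware, and hand the resulting response back to the engine.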
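The short-circuit described above can be illustrated with a toy model. This is only a sketch in plain Python, not Scrapy's actual implementation: the names Response, FakeDownloader, PassThroughMiddleware, BrowserMiddleware and download are all made up for the illustration. It shows that once a middleware's process_request() returns a response, the downloader is never called.

```python
# Toy model of the request path; NOT Scrapy's real code.
# All class and function names here are hypothetical.

class Response:
    def __init__(self, url, body, source):
        self.url = url
        self.body = body
        self.source = source  # records who produced this response


class FakeDownloader:
    """Stands in for the downloader and counts how often it is used."""

    def __init__(self):
        self.calls = 0

    def fetch(self, url):
        self.calls += 1
        return Response(url, "<html>static</html>", "downloader")


class PassThroughMiddleware:
    def process_request(self, url):
        # Returning None means "carry on to the next middleware or the downloader"
        return None


class BrowserMiddleware:
    """Plays the role of the selenium middleware: it answers the request itself."""

    def process_request(self, url):
        return Response(url, "<html>rendered</html>", "middleware")


def download(url, middlewares, downloader):
    for mw in middlewares:
        result = mw.process_request(url)
        if result is not None:
            # Short-circuit: the downloader never sees this request
            return result
    return downloader.fetch(url)


downloader = FakeDownloader()
plain = download("https://example.com/a", [PassThroughMiddleware()], downloader)
rendered = download(
    "https://example.com/b",
    [PassThroughMiddleware(), BrowserMiddleware()],
    downloader,
)
print(plain.source, rendered.source, downloader.calls)
```

Only the first request reaches the fake downloader. In real Scrapy the returned response still passes through the process_response() methods of the installed downloader middlewares before it reaches the engine.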

Development Example

Below we build an example that crawls data from Jianshu (jianshu.com).

Create the project with the command scrapy startproject jianshu_spider;
Create the spider with the command scrapy genspider -t crawl jianshu jianshu.com;

The spider file jianshu.py in the spiders directory is written as follows:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu_spider.items import JianshuSpiderItem


class JianshuSpider(CrawlSpider):
    name = 'jianshu'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/']

    # Jianshu article links have the structure: domain + /p/ + a 12-character string of digits and lowercase letters
    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[0-9a-z]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        """
        Parse the crawled article page
        :param response:
        :return:
        """
        title = response.xpath('//h1[@class="title"]/text()').get()
        avatar = response.xpath('//a[@class="avatar"]/img/@src').get()
        author = response.xpath('//span[@class="name"]/text()').get()
        pub_time = response.xpath('//span[@class="publish-time"]/text()').get()
        url = response.url
        url1 = url.split('?')[0]
        article_id = url1.split('/')[-1]
        content = response.xpath('//div[@class="show-content"]').get()

        word_count = response.xpath('//span[@class="wordage"]/text()').get()
        comment_count = response.xpath('//span[@class="comments-count"]/text()').get()
        like_count = response.xpath('//span[@class="likes-count"]/text()').get()
        read_count = response.xpath('//span[@class="views-count"]/text()').get()
        subjects = ','.join(response.xpath('//div[@class="include-collection"]/a/div/text()').getall())

        item = JianshuSpiderItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=url1,
            article_id=article_id,
            content=content,
            word_count=word_count,
            comment_count=comment_count,
            like_count=like_count,
            read_count=read_count,
            subjects=subjects
        )
        yield item
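To make the link rule and the article_id extraction easier to follow, here is a small standalone sketch that uses only the re module. The URLs are hypothetical examples rather than real articles, and the helper name article_id_from_url is introduced only for this illustration; it repeats the two split() calls from parse_detail.

```python
import re

# Same pattern as the `allow` argument of the LinkExtractor in the spider
ARTICLE_LINK = re.compile(r'.*/p/[0-9a-z]{12}.*')


def article_id_from_url(url):
    """Drop the query string, then take the last path segment."""
    return url.split('?')[0].split('/')[-1]


# Hypothetical URLs, used only to exercise the pattern
urls = [
    'https://www.jianshu.com/p/0123456789ab',           # article link
    'https://www.jianshu.com/p/0123456789ab?utm=feed',  # same article with a query string
    'https://www.jianshu.com/u/0123456789ab',           # no /p/ segment, so not an article
]

for url in urls:
    is_article = ARTICLE_LINK.search(url) is not None
    print(is_article, article_id_from_url(url))
```

Because the pattern ends with .*, a link that carries extra characters after the 12-character id still matches, which is why parse_detail strips the query string before taking the id.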

items.py is written as follows:

import scrapy


class JianshuSpiderItem(scrapy.Item):
    """
    Define the required fields
    """
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
    read_count = scrapy.Field()
    like_count = scrapy.Field()
    word_count = scrapy.Field()
    subjects = scrapy.Field()
    comment_count = scrapy.Field()

The middleware code in middlewares.py is as follows:

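import time

from scrapy.http.response.html import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By


class SeleniumDownloadMiddleware(object):
    """
    Selenium downloader middleware
    """

    def __init__(self):
        # Path to the local chromedriver executable; adjust it for your machine.
        # executable_path is the Selenium 3 way of passing this path; newer
        # Selenium releases pass it through a Service object instead.
        self.driver = webdriver.Chrome(executable_path=r'E:\Python_Code\s1\chromedriver_win32\chromedriver.exe')

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Give the page a moment to render
        time.sleep(1)
        # Keep clicking the "show more" button until it can no longer be found or clicked
        while True:
            try:
                show_more = self.driver.find_element(By.CLASS_NAME, 'show-more')
                show_more.click()
            except WebDriverException:
                # Covers NoSuchElementException and the "not clickable" errors
                break
            # Wait for the newly loaded content
            time.sleep(3)
        # Get the rendered page source
        source = self.driver.page_source
        # Build the response object and return it, which skips the downloader
        response = HtmlResponse(url=self.driver.current_url, body=source, request=request, encoding='utf-8')
        return response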
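The click loop in the middleware has no upper bound, so a button that never disappears would keep one request busy forever. Below is a minimal sketch of a bounded variant. The helper name click_until_gone, its parameters, and the FakePage stand-in are assumptions made for this illustration; the element lookup is passed in as a callable so the logic can be tried without a browser.

```python
import time


def click_until_gone(find_button, max_clicks=20, pause=0.0):
    """Click the button returned by find_button() until it disappears.

    find_button is any callable that returns an object with a click()
    method, or raises an exception once the button is gone.
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
            button.click()
        except Exception:
            # The button is gone or can no longer be clicked
            break
        clicks += 1
        time.sleep(pause)
    return clicks


class FakePage:
    """Stand-in for a browser page whose button vanishes after a few clicks."""

    def __init__(self, remaining):
        self.remaining = remaining

    def find_button(self):
        if self.remaining <= 0:
            raise LookupError("no show-more button")
        return self

    def click(self):
        self.remaining -= 1


page = FakePage(remaining=3)
clicks_needed = click_until_gone(page.find_button)

stubborn = FakePage(remaining=1000)
clicks_capped = click_until_gone(stubborn.find_button, max_clicks=5)

print(clicks_needed, clicks_capped)
```

Inside the middleware, find_button could be a lambda that wraps the driver's element lookup for the show-more class, and pause would take the place of the fixed sleep between clicks.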

pipelines.py is written as follows:

import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi


# Process the scraped data
class JianshuSpiderPipeline(object):
    """
    This approach can only save to the database synchronously
    """

    def __init__(self):
        db_parames = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': 'root',
            'database': 'jianshu_spider',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**db_parames)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        self.cursor.execute(self.sql, (
            item['title'], item['content'], item['author'], item['avatar'], item['pub_time'], item['article_id'],
            item['origin_url'],))
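        # Commit on the connection (a pymysql cursor has no commit() method);
        # without this the inserted rows are never persisted
        self.conn.commit()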
        print(item)
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
            INSERT INTO article(id,title,content,author,avatar,pub_time,article_id,origin_url)
            VALUE(null,%s,%s,%s,%s,%s,%s,%s)
            """
            return self._sql
        return self._sql


class JianshuTwistedPipeline(object):
    """
    Save to the database asynchronously using Twisted
    """

    def __init__(self):
        db_parames = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': 'root',
            'database': 'jianshu_spider',
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }

        # Create the database connection pool
        self.dbpool = adbapi.ConnectionPool('pymysql', **db_parames)
        self._sql = None

    @property
    def sql(self):
        if not self._sql:
            self._sql = """
                INSERT INTO article(id,title,content,author,avatar,pub_time,article_id,origin_url,word_count)
                VALUE(null,%s,%s,%s,%s,%s,%s,%s,%s)
                """
            return self._sql
        return self._sql

    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)

        print(item)
        return item

    def insert_item(self, cursor, item):
        cursor.execute(self.sql, (
            item['title'], item['content'], item['author'], item['avatar'], item['pub_time'], item['article_id'],
            item['origin_url'], item['word_count']))

    def handle_error(self, error, item, spider):
        print('=' * 15 + 'error' + '=' * 15)
        print(error)
        print('=' * 15 + 'error' + '=' * 15)

Update the relevant configuration in settings.py as follows:

# Enable the item pipelines
ITEM_PIPELINES = {
    'jianshu_spider.pipelines.JianshuTwistedPipeline': 300,
    # 'jianshu_spider.pipelines.JianshuSpiderPipeline': 300,
}

# Enable the downloader middleware
DOWNLOADER_MIDDLEWARES = {
   'jianshu_spider.middlewares.SeleniumDownloadMiddleware': 543,
}

Design the relevant database table, then run the code (for example with scrapy crawl jianshu).
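The article does not show the table definition, so the following is only a sketch of what it might look like. The column names are taken from the INSERT statements in pipelines.py, while every column type is an assumption. To keep the example runnable without a MySQL server it uses Python's built-in sqlite3 module as a stand-in, which means the placeholders are ? rather than pymysql's %s; the sample item values are made up.

```python
import sqlite3

# Column names follow the INSERT statements in pipelines.py;
# the types are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE article (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT,
    content TEXT,
    author TEXT,
    avatar TEXT,
    pub_time TEXT,
    article_id TEXT,
    origin_url TEXT,
    word_count TEXT
)
"""

conn = sqlite3.connect(':memory:')
conn.execute(SCHEMA)

# Made-up item, shaped like the one the spider yields
item = {
    'title': 'Sample title',
    'content': '<div class="show-content">...</div>',
    'author': 'someone',
    'avatar': 'https://example.com/avatar.png',
    'pub_time': '2019.08.01 12:00',
    'article_id': '0123456789ab',
    'origin_url': 'https://www.jianshu.com/p/0123456789ab',
    'word_count': '1200',
}

conn.execute(
    "INSERT INTO article(id,title,content,author,avatar,pub_time,article_id,origin_url,word_count) "
    "VALUES(null,?,?,?,?,?,?,?,?)",
    (
        item['title'], item['content'], item['author'], item['avatar'],
        item['pub_time'], item['article_id'], item['origin_url'], item['word_count'],
    ),
)
conn.commit()

row = conn.execute("SELECT id, title, article_id FROM article").fetchone()
print(row)
```

For the MySQL database used in this article, the table would be created with MySQL's own column types and an auto-increment primary key, and the INSERT statements from pipelines.py would be used unchanged.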
