[Crawler] Using Scrapy + Selenium to Scrape Content from Dynamically Loaded Pages

In the previous article, we used Python Scrapy to scrape all the text from a static web page: https://blog.csdn.net/sinat_40431164/article/details/81102476

There is a problem, though: if we change the target URL to http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2, we find that the scraped content is missing the "車型論壇" (model forums) and "主題論壇" (topic forums) sections.

Sometimes, when we innocently download an HTML page with urllib or Scrapy, we find that the elements we want to extract are missing from the HTML we received, even though they are plainly visible in the browser.

This means the elements we want are generated dynamically by JavaScript, often in response to our own actions. For example, when you keep scrolling through QQzone or Weibo comments, the page grows longer and longer and more content keeps appearing; that is dynamic loading, loved by users and dreaded by scrapers. There are currently two ways to crawl a dynamic page:

  1. Analyze the page's network requests and call the data endpoint directly (see the sketch after this list)
  2. Use Selenium to simulate browser behavior
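For completeness, here is a minimal sketch of the first approach. It assumes, hypothetically, that you have already found a JSON endpoint in the browser's Network tab; the real endpoint, parameters, and response shape must be discovered per site:

# Sketch of approach 1: call the page's data endpoint directly.
# API_URL and the response fields below are hypothetical examples.
import requests

API_URL = 'http://club.haval.com.cn/api/forum/boards'  # hypothetical endpoint
resp = requests.get(API_URL, params={'page': 1},
                    headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()                      # dynamic pages usually return JSON
for board in data.get('boards', []):    # hypothetical response shape
    print(board.get('name'))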

This article covers the second approach: using Selenium to simulate browser behavior.

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject URLCrawler
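This creates a project skeleton like the following; the files we will touch in this article are the spiders directory, middlewares.py, and settings.py:

URLCrawler/
    scrapy.cfg            # deploy configuration
    URLCrawler/
        __init__.py
        items.py
        middlewares.py    # we will add a Selenium middleware here
        pipelines.py
        settings.py       # we will enable the middleware here
        spiders/
            __init__.py   # our spider goes in this directory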

Our first Spider

This is the code for our first Spider. Save it in a file named my_spider.py under the URLCrawler/spiders directory in your project:

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 17:55:45 2018

@author: Administrator
"""

from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        # Use a raw string so the backslashes in the Windows path are not
        # treated as escape sequences
        self.browser = webdriver.Firefox(executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        self.browser.set_page_load_timeout(30)

    def closed(self, spider):
        # Called when the spider finishes; quit() ends the whole browser
        # session rather than just closing the current window
        print("spider closed")
        self.browser.quit()

    def start_requests(self):
        # A single start URL; add more entries here to crawl additional pages
        start_urls = ['http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2']
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('---------------------------------------------------')
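A side note: as written, this opens a visible Firefox window for every crawl. Here is a minimal sketch of running Firefox headless instead, assuming Selenium 3.8 or newer and a reasonably recent Firefox:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True   # render pages without opening a window
browser = webdriver.Firefox(
    options=options,
    executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')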

middlewares.py

Add the following. When process_request in a downloader middleware returns a Response object, Scrapy skips its own downloader and hands that response straight to the spider, which is how the Selenium-rendered page reaches parse:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'my_spider':
            try:
                spider.browser.get(request.url)
                # Scroll to the bottom to trigger lazy-loaded content
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            # Crude wait for the JavaScript to finish rendering; see the
            # WebDriverWait sketch below for a more robust alternative
            time.sleep(2)
            # Returning a Response short-circuits Scrapy's own downloader
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
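The fixed time.sleep(2) is fragile: too short on slow pages, wasted time on fast ones. A minimal sketch of an explicit wait instead, assuming (hypothetically) that the content we need lives in an element with class 'forum-list'; substitute a locator that actually exists on your target page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the (hypothetical) element appears,
# then continue immediately
WebDriverWait(spider.browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'forum-list')))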

settings.py

Add the following:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'URLCrawler.middlewares.SeleniumMiddleware': 543,
}
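One more thing worth checking in settings.py: projects generated by scrapy startproject obey robots.txt by default, which can filter out the very pages you want. If that happens, and you accept the implications, disable it:

ROBOTSTXT_OBEY = False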

How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl my_spider

You will find that the downloaded page now matches what the browser shows!

If you only need the text content, change the parse method in the spider to:

    def parse(self, response):
        # Every non-empty text node, including the contents of <script>
        # and <style> tags:
        textlist_with_scripts = response.selector.xpath('//text()[normalize-space(.)]').extract()
        # To keep only the visible text, use this expression instead:
        #textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
        with open('filename_with_scripts', 'w', encoding='utf-8') as f:
            for text in textlist_with_scripts:
                f.write(text.strip() + '\n')
        print('---------------------------------------------------')
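To see the difference between the two XPath expressions, here is a small self-contained check using Scrapy's Selector directly (the HTML snippet is made up for illustration):

from scrapy import Selector

html = '<div><script>var x = 1;</script><p> hello </p></div>'
# Every non-empty text node, including the JS source inside <script>:
print(Selector(text=html).xpath('//text()[normalize-space(.)]').extract())
# ['var x = 1;', ' hello ']
# Only text whose parent is neither <script> nor <style>:
print(Selector(text=html).xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract())
# [' hello ']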

The End.
