[Crawler] Using Scrapy + Selenium to Scrape Content from Dynamically Loaded Pages

In the previous article, we used Python Scrapy to scrape all the text from a static web page: https://blog.csdn.net/sinat_40431164/article/details/81102476

There is a problem, though: if we change the target URL to http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2, we find that the scraped content is missing the "車型論壇" (model forums) and "主題論壇" (topic forums) sections.

Sometimes, when we innocently download an HTML page with urllib or Scrapy, we find that the elements we want to extract are missing from the HTML we received, even though they are plainly visible in the browser.

This means the elements we want are generated dynamically by JavaScript, often in response to our own actions. For example, when you keep scrolling through QQzone or Weibo comments, the page grows longer and longer and more content keeps appearing; that is dynamic loading, loved by users and dreaded by scrapers. There are currently two ways to crawl a dynamic page:

  1. Analyze the page's network requests and call the data endpoint directly (see the sketch after this list)
  2. Use Selenium to simulate browser behavior
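For completeness, here is a minimal sketch of the first approach. It assumes, hypothetically, that you have already found a JSON endpoint in the browser's Network tab; the real endpoint, parameters, and response shape must be discovered per site:

# Sketch of approach 1: call the page's data endpoint directly.
# API_URL and the response fields below are hypothetical examples.
import requests

API_URL = 'http://club.haval.com.cn/api/forum/boards'  # hypothetical endpoint
resp = requests.get(API_URL, params={'page': 1},
                    headers={'User-Agent': 'Mozilla/5.0'})
data = resp.json()                      # dynamic pages usually return JSON
for board in data.get('boards', []):    # hypothetical response shape
    print(board.get('name'))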

This article covers the second approach: using Selenium to simulate browser behavior.

Creating a project

Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject URLCrawler
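This creates a project skeleton like the following; the files we will touch in this article are the spiders directory, middlewares.py, and settings.py:

URLCrawler/
    scrapy.cfg            # deploy configuration
    URLCrawler/
        __init__.py
        items.py
        middlewares.py    # we will add a Selenium middleware here
        pipelines.py
        settings.py       # we will enable the middleware here
        spiders/
            __init__.py   # our spider goes in this directory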

Our first Spider

This is the code for our first Spider. Save it in a file named my_spider.py under the URLCrawler/spiders directory in your project:

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 17:55:45 2018

@author: Administrator
"""

from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        # Use a raw string so the backslashes in the Windows path are not
        # treated as escape sequences
        self.browser = webdriver.Firefox(executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        self.browser.set_page_load_timeout(30)

    def closed(self, spider):
        # Called when the spider finishes; quit() ends the whole browser
        # session rather than just closing the current window
        print("spider closed")
        self.browser.quit()

    def start_requests(self):
        # A single start URL; add more entries here to crawl additional pages
        start_urls = ['http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2']
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('---------------------------------------------------')
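A side note: as written, this opens a visible Firefox window for every crawl. Here is a minimal sketch of running Firefox headless instead, assuming Selenium 3.8 or newer and a reasonably recent Firefox:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True   # render pages without opening a window
browser = webdriver.Firefox(
    options=options,
    executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')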

middlewares.py

Add the following. When process_request in a downloader middleware returns a Response object, Scrapy skips its own downloader and hands that response straight to the spider, which is how the Selenium-rendered page reaches parse:

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'my_spider':
            try:
                spider.browser.get(request.url)
                # Scroll to the bottom to trigger lazy-loaded content
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            # Crude wait for the JavaScript to finish rendering; see the
            # WebDriverWait sketch below for a more robust alternative
            time.sleep(2)
            # Returning a Response short-circuits Scrapy's own downloader
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
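The fixed time.sleep(2) is fragile: too short on slow pages, wasted time on fast ones. A minimal sketch of an explicit wait instead, assuming (hypothetically) that the content we need lives in an element with class 'forum-list'; substitute a locator that actually exists on your target page:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the (hypothetical) element appears,
# then continue immediately
WebDriverWait(spider.browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'forum-list')))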

settings.py

Add the following:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'URLCrawler.middlewares.SeleniumMiddleware': 543,
}
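One more thing worth checking in settings.py: projects generated by scrapy startproject obey robots.txt by default, which can filter out the very pages you want. If that happens, and you accept the implications, disable it:

ROBOTSTXT_OBEY = False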

How to run our spider

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl my_spider

You will find that the downloaded page now matches what the browser shows!

If you only need the text content, change the parse method in the spider to:

    def parse(self, response):
        # Every non-empty text node, including the contents of <script>
        # and <style> tags:
        textlist_with_scripts = response.selector.xpath('//text()[normalize-space(.)]').extract()
        # To keep only the visible text, use this expression instead:
        #textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
        with open('filename_with_scripts', 'w', encoding='utf-8') as f:
            for text in textlist_with_scripts:
                f.write(text.strip() + '\n')
        print('---------------------------------------------------')
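To see the difference between the two XPath expressions, here is a small self-contained check using Scrapy's Selector directly (the HTML snippet is made up for illustration):

from scrapy import Selector

html = '<div><script>var x = 1;</script><p> hello </p></div>'
# Every non-empty text node, including the JS source inside <script>:
print(Selector(text=html).xpath('//text()[normalize-space(.)]').extract())
# ['var x = 1;', ' hello ']
# Only text whose parent is neither <script> nor <style>:
print(Selector(text=html).xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract())
# [' hello ']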

The End.
