In the previous article we used Python Scrapy to crawl all the text of a static web page: https://blog.csdn.net/sinat_40431164/article/details/81102476
But there is a problem. When we change the URL to http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2, we find that the crawled content is missing the "車型論壇" (model forums) and "主題論壇" (topic forums) sections.
Sometimes, when we innocently download an HTML page with urllib or Scrapy, we discover that the elements we want to extract are not in the HTML we downloaded, even though they look within easy reach in the browser.
This means the elements we want are generated dynamically by JavaScript in response to certain actions. For example, when you keep scrolling through QQ Zone or Weibo comments, the page grows longer and more content keeps appearing; that is dynamic loading, the feature we love and hate. There are currently two ways to crawl dynamic pages:
- analyze the page's requests and call the underlying API directly
- use Selenium to simulate browser behavior
Below we walk through how to use Selenium to simulate browser behavior.
Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:
scrapy startproject URLCrawler
Our first Spider
This is the code for our first Spider. Save it in a file named my_spider.py under the URLCrawler/spiders directory in your project:
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 17:55:45 2018
@author: Administrator
"""
from scrapy import Spider, Request
from selenium import webdriver


class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        # Use a raw string so the backslashes in the Windows path are not
        # treated as escape sequences
        self.browser = webdriver.Firefox(
            executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        self.browser.set_page_load_timeout(30)

    def closed(self, spider):
        # Called when the spider closes; quit() ends the whole browser session
        print("spider closed")
        self.browser.quit()

    def start_requests(self):
        # Only one page to crawl here; turn this back into a list
        # comprehension if you need to template paginated URLs later
        start_urls = ['http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2']
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # e.g. 'club.haval.com.cn' for the URL above
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('---------------------------------------------------')
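A quick sanity check of the filename derivation in parse above, in plain Python with no Scrapy required:

```python
# Splitting the URL on "/" and taking the second-to-last piece yields the
# host name -- but only for URLs of this shape, where the path has a
# single segment; deeper paths would give a path component instead
url = 'http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2'
domain = url.split("/")[-2]
filename = '%s.html' % domain
print(filename)  # club.haval.com.cn.html
```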
middlewares.py
Add the following:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'my_spider':
            try:
                spider.browser.get(request.url)
                # Scroll to the bottom to trigger lazy-loaded content
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException as e:
                print('page load timed out')
                spider.browser.execute_script('window.stop()')
            time.sleep(2)
            # Returning a response here short-circuits Scrapy's own
            # downloader: the page Selenium rendered is what parse() sees
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
settings.py
Add the following:
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'URLCrawler.middlewares.SeleniumMiddleware': 543,
}
How to run our spider
To put our spider to work, go to the project's top-level directory and run:
scrapy crawl my_spider
The downloaded page now matches what a browser shows for that URL!
If you only need the text content, change the parse method of the spider to:
def parse(self, response):
    '''domain = response.url.split("/")[-2]
    filename = '%s.html' % domain
    with open(filename, 'wb') as f:
        f.write(response.body)'''
    # This variant skips text inside <script> and <style> tags:
    #textlist_no_scripts = response.selector.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
    textlist_with_scripts = response.selector.xpath('//text()[normalize-space(.)]').extract()
    #with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    with open('filename_with_scripts', 'w', encoding='utf-8') as f:
        for i in range(0, len(textlist_with_scripts)):
            text = textlist_with_scripts[i].strip()
            f.write(text + '\n')
    print('---------------------------------------------------')
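The normalize-space(.) predicate in the XPath above keeps only text nodes that contain something besides whitespace. Roughly the same filtering in plain Python, run on a sample list standing in for the nodes a selector would return:

```python
# Sample text nodes as a selector might extract them: real content mixed
# with whitespace-only nodes produced by the markup's indentation
nodes = ['\n    ', '車型論壇', '  ', '\t主題論壇\n', '\n']

# Keep only nodes with non-whitespace content, stripped -- this mirrors
# what '//text()[normalize-space(.)]' plus .strip() achieves in parse()
texts = [n.strip() for n in nodes if n.strip()]
print(texts)
```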
The End.