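I've recently been working on a crawler project and ran into pages whose HTML is generated dynamically by JavaScript. Because that content only exists after the scripts run in a browser, Scrapy on its own cannot fetch it. Few sites today are written entirely as static HTML, so scraping JS-generated HTML is a hurdle that can't be avoided, and it is worth looking at how this is usually handled. For how Scrapy scrapes page content in general, see the earlier Scrapy getting-started article.
The most common approach is to run Splash as a service: the spider sends its requests through Splash, which renders the page and returns the JS-generated content. In this setup Splash acts somewhat like a proxy.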
Installation
Install Splash through Docker. Installing Docker and Docker-compose is not covered here; see the earlier Docker article.
Pull the Splash image
docker pull scrapinghub/splash
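The Splash image is fairly large. If you install it inside a virtual machine you may run out of disk space and need to expand the disk, and after expanding it Ubuntu often fails to recognize the new space. See the article on expanding an Ubuntu virtual machine's disk.
Start the Splash service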
docker run -p 8050:8050 scrapinghub/splash
I prefer to put this in a docker-compose file, which is more convenient to reuse later
version: '3'
services:
  splash:
    restart: always
    image: scrapinghub/splash
    container_name: splash
    ports:
      - 8050:8050
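Open http://192.168.25.145:8050/ in a browser (replace the IP with the address of your own Docker host). If the Splash page loads, Splash has been installed successfully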
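The same check can be done without a browser. The sketch below builds a request to Splash's render.html HTTP endpoint, which returns the rendered HTML of a page. The helper names build_render_url and fetch_rendered_html and the default wait of 2 seconds are my own choices for illustration; the host is the one used above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_render_url(splash_base, target_url, wait=2.0):
    """Return the render.html URL that asks Splash to render target_url."""
    query = urlencode({"url": target_url, "wait": wait})
    return splash_base.rstrip("/") + "/render.html?" + query


def fetch_rendered_html(splash_base, target_url, wait=2.0, timeout=30):
    """Fetch the JS-rendered HTML of target_url through a running Splash."""
    request_url = build_render_url(splash_base, target_url, wait)
    with urlopen(request_url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling fetch_rendered_html("http://192.168.25.145:8050/", "https://item.jd.com/5089239.html") against a running Splash should return the page's HTML as a string; a connection error means the service is not reachable at that address.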
The scrapy_splash package also needs to be installed on the Python side
pip3 install scrapy_splash
Scrapy spider setup
Create a new project to experiment with
scrapy startproject jdproject
Open jdproject/spiders/jd.py and change its contents to:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash.request import SplashRequest


class JdSpider(scrapy.Spider):
    name = "jd"

    def start_requests(self):
        # Lua script run by Splash's `run` endpoint: load the page, wait for
        # the JavaScript to finish, then return the rendered HTML.
        splash_args = {"lua_source": """
            --splash.response_body_enabled = true
            splash.private_mode_enabled = false
            splash:set_user_agent("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36")
            assert(splash:go("https://item.jd.com/5089239.html"))
            splash:wait(3)
            return {html = splash:html()}
            """}
        yield SplashRequest("https://item.jd.com/5089239.html", endpoint='run', args=splash_args, callback=self.onSave)

    def onSave(self, response):
        # Collect the text nodes under the price element of the rendered page.
        value = response.xpath('//span[@class="p-price"]//text()').extract()
        print(value)
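The spider above writes the page URL twice, once in the Lua source and once in the SplashRequest. A possible variation, sketched here, reads the URL and the wait time from the `args` table that the `run` endpoint passes to the script. The helper name make_splash_args and the `wait` key are hypothetical, not part of the original spider.

```python
# Lua body for Splash's `run` endpoint; `args` holds the request arguments.
LUA_RENDER_SCRIPT = """
splash.private_mode_enabled = false
assert(splash:go(args.url))
splash:wait(args.wait)
return {html = splash:html()}
"""


def make_splash_args(wait_seconds=3):
    """Build the args dict for SplashRequest, with the wait time as a parameter."""
    return {"lua_source": LUA_RENDER_SCRIPT, "wait": wait_seconds}
```

Inside start_requests this would be used as SplashRequest(url, endpoint='run', args=make_splash_args(), callback=self.onSave); scrapy_splash fills in args.url from the request URL.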
Open jdproject/settings.py and change:
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,  # without this entry no data comes back
}

HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'

SPLASH_URL = "http://192.168.99.100:8050/"  # address of Splash in your own Docker installation
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
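The Splash address differs from machine to machine (the browser check earlier used a different IP than the one in SPLASH_URL here), so one option is to read it from an environment variable instead of hard-coding it. This is only a sketch and not part of the original settings: the environment variable name SPLASH_URL, the helper resolve_splash_url, and the localhost fallback are all assumptions.

```python
import os


def resolve_splash_url(environ=os.environ, default="http://localhost:8050/"):
    """Return the Splash address from the environment, or the default."""
    return environ.get("SPLASH_URL", default)


# In settings.py this line would replace the hard-coded SPLASH_URL.
SPLASH_URL = resolve_splash_url()
```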
Test
For this test I use https://item.jd.com/5089239.html; the goal is to extract the product price
Run the spider
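From the project root the spider is normally started with `scrapy crawl jd`, where jd is the spider's `name`. For completeness, here is a small Python sketch that launches the same command. The helper names are mine, and it assumes Scrapy is installed for the interpreter running the script.

```python
import subprocess
import sys


def crawl_command(spider_name):
    """Build the command line equivalent to `scrapy crawl <spider_name>`."""
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def run_spider(spider_name, project_dir="."):
    """Run the spider from project_dir; raise if Scrapy exits with an error."""
    subprocess.run(crawl_command(spider_name), cwd=project_dir, check=True)
```

Calling run_spider("jd") from the directory containing scrapy.cfg should start the crawl, and onSave prints whatever text nodes were found under the price element.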