Fetching AJAX content with scrapy + spynner (WeChat public account articles as an example)

More and more websites now load their data dynamically with AJAX. Scrapy only sees the static HTML, so dynamically loaded content is out of its reach.

spynner is a tool that simulates a browser: it can render the AJAX-loaded page in the background, and the rendered HTML can then be crawled with scrapy.

The idea is to hook a spynner-based module into scrapy's downloader middleware.

In a WeChat public account article, the text is present in the static HTML, but the images are filled in with AJAX/lazy loading. If we can get the images' src attributes, we have successfully scraped the dynamically rendered page.


Environment: Ubuntu 16, Python 2.7

1. Install spynner

pip install spynner

The install pulls in a number of native dependencies and will fail if any are missing; apt-file can be used to locate the packages that provide them: http://blog.csdn.net/lcyong_/article/details/72904275
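
Once pip finishes, a quick standalone check helps confirm that spynner can actually render a page. This is a minimal sketch of my own (not from the original post); any URL will do:

# -*- coding: utf-8 -*-
# Minimal spynner smoke test: load a page, let scripts run, report the HTML size.
import spynner
import pyquery

browser = spynner.Browser()
browser.create_webview()
browser.set_html_parser(pyquery.PyQuery)
browser.load("https://example.com", 20)   # placeholder URL
try:
    browser.wait_load(10)                 # raises on timeout; fine for a smoke test
except Exception:
    pass
print "rendered %d bytes of HTML" % len(browser.html.encode('utf-8'))
browser.close()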


2. Create the scrapy project, write the middleware, and add the spynner module

scrapy startproject testSpynner

cd testSpynner

scrapy genspider weixin qq.com

A scrapy spider named weixin has now been created.
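
The generated project looks roughly like this (Scrapy 1.x layout; the exact files may differ slightly between versions):

testSpynner/
    scrapy.cfg
    testSpynner/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            weixin.py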

Write the middleware in middlewares.py. I also added a middleware that picks a random User-Agent for each request. The code is as follows:


# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
import random
import spynner
import pyquery

from scrapy.http import HtmlResponse
from scrapy import signals

class WebkitDownloaderTest(object):
    """Render the requested page with spynner (QtWebKit) so that
    AJAX-loaded content ends up in the response body."""

    def process_request(self, request, spider):
        print "creating spynner browser"
        browser = spynner.Browser()
        browser.create_webview()
        browser.set_html_parser(pyquery.PyQuery)
        # Load the page; allow up to 20 seconds for the initial load
        browser.load(request.url, 20)
        print "page loaded"
        try:
            # Give scripts up to 10 more seconds; wait_load raises on timeout,
            # in which case we simply use whatever has been rendered so far
            browser.wait_load(10)
        except:
            pass
        rendered_body = browser.html.encode('utf-8')
        browser.close()
        print "rendered data: " + rendered_body
        # Returning an HtmlResponse short-circuits the normal download and
        # hands the rendered HTML straight to the spider
        return HtmlResponse(request.url, body=rendered_body, encoding='utf-8')



class UserAgentMiddleware(object):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            print "********Current UserAgent: %s************" % ua

    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
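
As written, WebkitDownloaderTest starts a fresh browser for every single request, which is slow. A possible refinement (my own sketch, not part of the original post; the use_spynner meta key is a made-up flag) is to render only the requests that opt in and let everything else go through Scrapy's normal downloader. It would live in the same middlewares.py, reusing the imports above:

class SelectiveWebkitDownloader(object):
    """Sketch: only render requests flagged with request.meta['use_spynner']."""

    def process_request(self, request, spider):
        if not request.meta.get('use_spynner'):   # hypothetical opt-in flag
            return None   # None -> fall through to the normal downloader
        browser = spynner.Browser()
        browser.create_webview()
        browser.load(request.url, 20)
        try:
            browser.wait_load(10)
        except Exception:
            pass
        body = browser.html.encode('utf-8')
        browser.close()
        return HtmlResponse(request.url, body=body, encoding='utf-8')

The spider would then issue scrapy.Request(url, meta={'use_spynner': True}) for the pages that need JavaScript rendering.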


3. Enable the middlewares in settings.py

DOWNLOADER_MIDDLEWARES = {
   'testSpynner.middlewares.UserAgentMiddleware': 400,
   'testSpynner.middlewares.WebkitDownloaderTest': 401,
   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}


Note that WebkitDownloaderTest (the spynner middleware) gets a larger value than our UserAgentMiddleware: Scrapy calls process_request in ascending order of these values, so the random User-Agent is set first and spynner then fetches the page with it.


4. Write the spider

# -*- coding: utf-8 -*-
import scrapy


class WeixinSpider(scrapy.Spider):
    name = "weixin"
    allowed_domains = ["qq.com"]
    start_urls = [
        'https://mp.weixin.qq.com/s?src=3&timestamp=1496839659&ver=1&signature=IrQ2oi0qMCOa0-*lbf7OCdgKjBnbKqAYOumviodVwtgeWWkt-fvA1kcd63*u0Z4uQY4kJVn*jS8rbRwd9Hg4FLj9hxw*sKA7rVYTMpWKXaemALgabVrrAeOBCPBFmtLUQx3zSoapN7i1ZBhPw*2eQ2*gbTwQVTUvTDaBRhCKePg=']

    def parse(self, response):
        # The second <p> in the article body holds the image; after rendering,
        # both the lazy-load attributes and the real src are present
        img = response.xpath('//*[@id="js_content"]/p[2]')
        data = img[0].xpath('img/@data-s')
        src = img[0].xpath('img/@src')
        self.log("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")
        self.log(img.extract()[0])
        self.log("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")
        self.log("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")
        self.log(data.extract()[0])
        self.log("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")
        self.log("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")
        self.log(src.extract()[0])
        self.log("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")




5. Test run

Run scrapy crawl weixin from the command line.

In the log output we can see:

2017-06-07 22:16:49 [scrapy] DEBUG: Crawled (200) <GET https://mp.weixin.qq.com/s?src=3&timestamp=1496839659&ver=1&signature=IrQ2oi0qMCOa0-*lbf7OCdgKjBnbKqAYOumviodVwtgeWWkt-fvA1kcd63*u0Z4uQY4kJVn*jS8rbRwd9Hg4FLj9hxw*sKA7rVYTMpWKXaemALgabVrrAeOBCPBFmtLUQx3zSoapN7i1ZBhPw*2eQ2*gbTwQVTUvTDaBRhCKePg=> (referer: None)
2017-06-07 22:16:49 [jingdong] DEBUG: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2017-06-07 22:16:49 [jingdong] DEBUG: <p><img data-s="300,640" data-type="jpeg" data-src="http://mmbiz.qpic.cn/mmbiz_jpg/xZe7vUPPTJSvnfLFLhuCyib4ZiclZleCA56AXdAunYSPWR3CLEUciatU40n2lhicycpw8IiadvicKkdSaBYl0BVzoT9Q/0?wx_fmt=jpeg" style="width: auto !important; height: auto !important; visibility: visible !important;" data-ratio="0.625" data-w="1000" class=" " src="http://mmbiz.qpic.cn/mmbiz_jpg/xZe7vUPPTJSvnfLFLhuCyib4ZiclZleCA56AXdAunYSPWR3CLEUciatU40n2lhicycpw8IiadvicKkdSaBYl0BVzoT9Q/640?wx_fmt=jpeg&wxfrom=5&wx_lazy=1" data-fail="0"></p>
2017-06-07 22:16:49 [jingdong] DEBUG: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2017-06-07 22:16:49 [jingdong] DEBUG: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2017-06-07 22:16:49 [jingdong] DEBUG: 300,640
2017-06-07 22:16:49 [jingdong] DEBUG: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2017-06-07 22:16:49 [jingdong] DEBUG: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2017-06-07 22:16:49 [jingdong] DEBUG: http://mmbiz.qpic.cn/mmbiz_jpg/xZe7vUPPTJSvnfLFLhuCyib4ZiclZleCA56AXdAunYSPWR3CLEUciatU40n2lhicycpw8IiadvicKkdSaBYl0BVzoT9Q/640?wx_fmt=jpeg&wxfrom=5&wx_lazy=1
2017-06-07 22:16:49 [jingdong] DEBUG: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Both the data-s value and the real image src come through, which means the dynamically loaded content was crawled successfully.

