(Parsing JS with Python) Crawling JS-generated pages with scrapy and Ghost, and parsing JS variables

More and more pages rely on AJAX these days, and a lot of content only appears after JS has executed. For example, on http://news.sohu.com/scroll/ (Sohu scrolling news), the backend renders the data into the front-end JS variables newsJason and arrNews in one shot at request time, and JS then generates the div and li elements from them; to get at the results you have to execute and parse that JS. So the scrapy crawl needs a middleware that executes the page's JS.
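
To see the problem concretely, a quick standalone check (a sketch, not part of the crawler) shows that the raw HTML embeds the data only as JS source, while the news list itself is generated later by JS in the browser:

import urllib2

raw = urllib2.urlopen('http://news.sohu.com/scroll/').read()
# the data is present, but only inside JS variables, not as rendered <li> elements
print 'newsJason' in raw   # expected: True
print 'arrNews' in raw     # expected: True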

scrapy itself cannot act as a JS engine, so the data on many JS-generated pages is out of reach. The common workaround is to use WebKit or a WebKit-based library.

Ghost is a Python WebKit client built on the WebKit core, using the PyQt or PySide WebKit implementation.
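
In its simplest form, Ghost drives a headless WebKit page and lets you run JS against it. A minimal sketch, using the same session API as the middleware below:

from ghost import Ghost

ghost = Ghost()
session = ghost.start()
session.open('http://news.sohu.com/scroll/')
# evaluate() executes JS in the page and returns a (result, resources) tuple
result, resources = session.evaluate('document.title')
print result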

Installation:

1. Install sip (a PyQt dependency):
wget http://sourceforge.net/projects/pyqt/files/sip/sip-4.14.6/sip-4.14.6.tar.gz
tar zxvf sip-4.14.6.tar.gz
cd sip-4.14.6
python configure.py
make
sudo make install
2. Install PyQt:
wget http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.10.1/PyQt-x11-gpl-4.10.1.tar.gz 
tar zxvf PyQt-x11-gpl-4.10.1.tar.gz 
cd PyQt-x11-gpl-4.10.1
python configure.py
make
sudo make install
3. Install Ghost:
git clone git://github.com/carrerasrodrigo/Ghost.py.git
cd Ghost.py
sudo python setup.py install 
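
A quick sanity check that all three pieces are importable (assuming a PyQt4 build):

python -c "import sip; import PyQt4; from ghost import Ghost; print 'ok'"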

Using Ghost with scrapy:

1. Write the downloader middleware (webkit_js.py):

from scrapy.http import Request, FormRequest, HtmlResponse
from ghost import Ghost
import JsSpider.settings

class WebkitDownloader(object):
    def process_request(self, request, spider):
        if spider.name in JsSpider.settings.WEBKIT_DOWNLOADER:
            if type(request) is not FormRequest:
                ghost = Ghost()
                session = ghost.start()
                session.open(request.url)
                # evaluate() returns a (result, resources) tuple
                result, resources = session.evaluate('document.documentElement.innerHTML')
                # keep the session on the spider so it can evaluate JS variables later
                spider.webkit_session = session
                renderedBody = str(result.toUtf8())
                # renderedBody is the page HTML after the JS has been executed
                return HtmlResponse(request.url, body=renderedBody, encoding='utf-8')
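
Because process_request returns an HtmlResponse here, scrapy skips its own downloader for these requests and hands the rendered page straight down the response chain; FormRequest is excluded so that form submissions still go through the normal downloader.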

2. scrapy configuration

Add the following to scrapy's settings.py:

# which spiders should use the webkit downloader
WEBKIT_DOWNLOADER = ['spider_name']
DOWNLOADER_MIDDLEWARES = {
    #'JsSpider.middleware.rotate_useragent.RotateUserAgentMiddleware': 533,
    'JsSpider.middleware.webkit_js.WebkitDownloader': 543,
}

Here, JsSpider.middleware.rotate_useragent.RotateUserAgentMiddleware is a user-agent pool middleware (commented out above; uncomment its line in DOWNLOADER_MIDDLEWARES to enable it):

# -*- coding: utf-8 -*-
"""One strategy for avoiding bans: use a user-agent pool.
Note: the corresponding settings must be added to settings.py.
"""
import random
from scrapy import log
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a random user agent for each request
        ua = random.choice(self.user_agent_list)
        if ua:
            # log the user agent currently in use
            log.msg('Current UserAgent: ' + ua, level=log.INFO)
            request.headers.setdefault('User-Agent', ua)

    # the default user_agent_list contains Chrome, IE, Firefox, Mozilla,
    # Opera and Netscape strings; more user-agent strings can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
        "(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
        "(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
        "(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
        "(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
        "(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
        "(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
       ]
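
Two details worth noting: random.choice picks a fresh user agent for every request, and headers.setdefault only sets the User-Agent header when the request does not already carry one, so a header set explicitly in the spider wins over the pool.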

3. Parse the JS variables in the spider:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from JsSpider.items import JsSpiderItem

#import sys
#reload(sys)
#sys.setdefaultencoding('utf-8')

class JsSpider(Spider):
    name = "js"
    #download_delay = 3
    allowed_domains = ["news.sohu.com"]
    start_urls = [
        "http://news.sohu.com/scroll/"
        ]

    def __init__(self):
        super(JsSpider, self).__init__()
        # filled in by WebkitDownloader.process_request
        self.webkit_session = None

    def parse(self, response):
        items = []
        # read the JS variables; evaluate() returns a (result, resources) tuple
        newsJason = self.webkit_session.evaluate('newsJason')
        arrNews = self.webkit_session.evaluate('arrNews')
        print type(newsJason)
        print type(arrNews)

        # keep only the result part of each tuple
        newsJason = newsJason[0]
        arrNews = arrNews[0]

        # newsJason is a JS object; take its first value (the category list)
        category = [v for k, v in newsJason.iteritems()][0]

        for i in range(len(category)):
            for j in range(len(category[i])):
                category[i][j] = str(category[i][j])

        for i in range(len(arrNews)):
            for j in range(len(arrNews[i])):
                arrNews[i][j] = str(arrNews[i][j])
        # Ghost returns JS evaluation results as QString objects, which are
        # awkward to convert: every level of a nested structure is a QString,
        # and there are encoding issues on top. Proper formatting of the data
        # is left for later.
        print category
        print arrNews

        return items
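
As a workaround for the QString problem noted in the comment above, a small recursive converter can normalize the nested results. This is a minimal sketch assuming PyQt4 (under PySide there is no QString and results already come back as plain unicode):

# -*- coding: utf-8 -*-
# sketch: recursively turn Ghost/PyQt4 evaluate() results into plain Python
from PyQt4.QtCore import QString

def qt_to_py(value):
    if isinstance(value, QString):
        return unicode(value)   # QString -> unicode, avoids the ASCII-only str()
    if isinstance(value, dict):
        return dict((qt_to_py(k), qt_to_py(v)) for k, v in value.iteritems())
    if isinstance(value, (list, tuple)):
        return [qt_to_py(item) for item in value]
    return value

# usage inside parse():
#   arrNews = qt_to_py(self.webkit_session.evaluate('arrNews')[0])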

Notes:

The page requested by scrapy is run through the middleware, which executes the JS and returns the response to the spider; at that point the JS variables in the page hold the data we need. The spider then uses the Ghost session stored on it during initialization (webkit_session) to parse the JS variables via the evaluate(javascript) function: to get the value of the JS variable arrNews, it only has to call self.webkit_session.evaluate('arrNews').
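
For example (following the spider code above):

# the value of the JS variable arrNews (first element of the returned tuple)
arrNews = self.webkit_session.evaluate('arrNews')[0]
# arbitrary expressions also work, e.g. the number of news entries
count = self.webkit_session.evaluate('arrNews.length')[0]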
