Web scraping with Ghost.py, a WebKit-based client [crawler example]

http://rfyiamcool.blog.51cto.com/1030776/1287810

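For those of us who have to scrape pages every now and then, this is painful~

Modern web development leans heavily on AJAX, JavaScript and CSS, so important data on some sites is generated dynamically by AJAX or JavaScript and cannot be obtained just by parsing the HTML source (for example with urllib2, mechanize, lxml or Beautiful Soup). To scrape such pages, the crawler has to support JavaScript execution, the DOM and HTML parsing.

For example, data on a monitoring dashboard cannot be fetched with plain curl or urllib...

The same goes for pages rendered through AJAX: urllib2 alone cannot parse them.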
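To make the problem concrete, here is a minimal sketch that uses only the Python 3 standard library, with html.parser standing in for the parsers named above. The HTML snippet and the element id `price` are made up for illustration. A static parser only sees the empty placeholder, because the value is written by a script that never runs.

```python
from html.parser import HTMLParser

# Hypothetical page: the <span> is empty in the raw HTML and is only
# filled in by JavaScript once a real browser engine runs the script.
RAW_HTML = """
<html><body>
<span id="price"></span>
<script>document.getElementById('price').textContent = '42';</script>
</body></html>
"""


class SpanText(HTMLParser):
    """Collects the text found inside <span id="price">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("id") == "price":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


def static_price(html):
    """Return whatever text a static parse finds in the price span."""
    parser = SpanText()
    parser.feed(html)
    return parser.text


# The script is never executed, so the span stays empty.
print(repr(static_price(RAW_HTML)))
```

A client with a real JavaScript engine would run the script first and then see the filled-in value.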


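Common ways of scraping data:

urllib2+urlparse+re

The most primitive approach: urllib2 is Python's HTTP library, urlparse handles URLs, and re is the regular expression module. It is tedious to write, but also fairly "solid".

urllib2+beautifulsoup

The workhorse here is BeautifulSoup, which parses HTML pages very effectively and saves you from writing fiddly regular expressions by hand.

Mechanize+BeautifulSoup

Mechanize replaces part of urllib2's functionality, so that URLs other than plain http can be opened as well, and it is more dynamic and configurable.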
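As a rough idea of what the first, "primitive" approach looks like, here is a small sketch that pulls links out of HTML with a regular expression and resolves them against a base URL. It assumes Python 3, where urlparse's functions live in urllib.parse. The sample HTML and the helper name `extract_links` are invented for the example, and the download step is left out so it runs offline.

```python
import re
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

# Matches the href value of an <a ...> tag. Good enough for a demo,
# but fragile on real-world HTML, which is why parsers are preferred.
HREF_RE = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> found by the regex."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]


sample = (
    '<a href="/about">About</a> '
    '<a class="x" href="http://example.com/a?b=1">A</a>'
)
print(extract_links(sample, "http://example.com/index.html"))
```

The regex is exactly the fiddly part that BeautifulSoup takes off your hands.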


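In fact, for pages like the ones above, if you do not mind the trouble you can dig through the page for the underlying data endpoints. What comes back is mostly XML, which you then have to parse the hard way... that really is too much hassle.

This is where a WebKit-based web client comes in. It parses the page the way a real browser does.

WebKit: Safari, Google Chrome, Maxthon 3, 360 Browser and others are all built on WebKit or a fork of it.

We usually fetch values from the terminal, and there are quite a few ready-made tools for that:

PyV8, PythonWebKit, Selenium, PhantomJS, Ghost.py and so on.

I recommend Ghost.py here, because it is direct and practical.

I found very little WebKit material in Chinese and even less on Ghost.py, so I will simply translate a bit from the official documentation~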
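For comparison, the "find the endpoint and parse the XML yourself" route might look roughly like this. The XML payload, its element names and the helper `parse_points` are all hypothetical; a real endpoint will have its own schema that you have to work out by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body from a data endpoint found by inspecting
# the page's network requests. The element names are invented.
SAMPLE_XML = """<metrics>
  <point time="2013-09-03T09:50:00" value="0.42"/>
  <point time="2013-09-03T09:51:00" value="0.57"/>
</metrics>"""


def parse_points(xml_text):
    """Return a list of (timestamp, value) tuples from the XML payload."""
    root = ET.fromstring(xml_text)
    return [(p.get("time"), float(p.get("value"))) for p in root.iter("point")]


for timestamp, value in parse_points(SAMPLE_XML):
    print(timestamp, value)
```

It works, but every site needs its own endpoint hunting and its own parsing code.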


A small example to get a feel for Ghost~

from ghost import Ghost

ghost = Ghost()

page, extra_resources = ghost.open("http://xiaorui.cc")

assert page.http_status==200 and 'xiaorui' in ghost.content

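Installing Ghost.py and its dependencies~

To use WebKit we need either PyQt or PySide.

Once those are installed, carry on.

If you are lucky, a plain pip install Ghost.py is enough.

If you are not:

You will run into plenty of annoying problems along the way, so search around~

If you still cannot solve them, leave a comment~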

wget http://sourceforge.net/projects/pyqt/files/sip/sip-4.14.6/sip-4.14.6.tar.gz
tar zxvf sip-4.14.6.tar.gz
cd sip-4.14.6
python configure.py
make
sudo make install

wget http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.10.1/PyQt-mac-gpl-4.10.1.tar.gz
tar zxvf PyQt-mac-gpl-4.10.1.tar.gz
cd PyQt-mac-gpl-4.10.1
python configure.py
make
sudo make install

wget http://pyside.markus-ullmann.de/pyside-1.1.1-qt48-py27apple.pkg
open pyside-1.1.1-qt48-py27apple.pkg

git clone https://github.com/mitsuhiko/flask.git
cd flask
sudo python setup.py install

git clone git://github.com/carrerasrodrigo/Ghost.py.git
cd Ghost.py
sudo python setup.py install
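Before fighting with Ghost.py itself, it can help to check whether Python can see a Qt binding at all. The sketch below is a small helper of my own (the name `find_qt_binding` is not part of Ghost.py) that tries to import PySide and PyQt4 in turn and reports the first one that loads.

```python
import importlib


def find_qt_binding(candidates=("PySide", "PyQt4")):
    """Return the name of the first importable module, or None."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            # Not installed (or not built for this Python): try the next one.
            continue
        return name
    return None


print(find_qt_binding() or "no Qt binding found")
```

If this prints "no Qt binding found", the PyQt or PySide installation did not land in the Python you are running.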

Create an instance:

from ghost import Ghost

ghost = Ghost()

Open a page:

page, resources = ghost.open('http://my.web.page')

Execute JavaScript in the page:

result, resources = ghost.evaluate(
    "document.getElementById('my-input').getAttribute('value');")

Simulate a click event:

page, resources = ghost.evaluate(
    "document.getElementById('link').click();", expect_loading=True)

Fill in the value of a form field, set_field_value(selector, value, blur=True, expect_loading=False):

result, resources = ghost.set_field_value("input[name=username]", "jeanphix")

If you set the optional parameter `blur` to False, the focus stays on the field (useful for autocomplete tests).

To fill a file input field, simply pass the file path as `value`.

You can also fill a whole form at once, Ghost.fill(selector, values, expect_loading=False):

result, resources = ghost.fill("form", {
    "username": "jeanphix",
    "password": "mypassword"
})

Submit the form~

page, resources = ghost.fire_on("form", "submit", expect_loading=True)
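Put together, the open, fill and submit calls above make a typical login flow. Here is a sketch that wraps them in one function. The function name `login` and the `FakeGhost` class are my own: `FakeGhost` is only a stand-in that records calls so the example runs without Qt installed, and with a real installation you would pass a `Ghost()` instance instead. The field names "username" and "password" are assumed to match the form.

```python
def login(ghost, url, username, password, form_selector="form"):
    """Open `url`, fill the login form, submit it, and return the new page."""
    ghost.open(url)
    ghost.fill(form_selector, {"username": username, "password": password})
    # Submitting triggers a page load, hence expect_loading=True.
    page, resources = ghost.fire_on(form_selector, "submit", expect_loading=True)
    return page


class FakeGhost:
    """Stand-in that records calls; NOT the real Ghost class."""

    def __init__(self):
        self.calls = []

    def open(self, url):
        self.calls.append(("open", url))
        return "page", []

    def fill(self, selector, values):
        self.calls.append(("fill", selector, values))
        return True, []

    def fire_on(self, selector, event, expect_loading=False):
        self.calls.append(("fire_on", selector, event, expect_loading))
        return "next-page", []


fake = FakeGhost()
login(fake, "http://my.web.page", "jeanphix", "mypassword")
for call in fake.calls:
    print(call)
```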

Some of the more advanced helpers:

There are many handy methods here.

wait_for_page_loaded()

Waits until a new page is loaded.

page, resources = ghost.wait_for_page_loaded()

This waits until the page has finished loading, similar to jQuery's

$(document).ready(function() { ... })

wait_for_selector(selector)

Waits until an element matches the given selector.

result, resources = ghost.wait_for_selector("ul.results")

Waits for the DOM element you specify to appear.

wait_for_text(text)

Waits until the given text exists inside the frame.

result, resources = ghost.wait_for_text("My result")

Waits for the text we want to appear.
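Conceptually, all of these wait helpers do the same thing: keep checking a condition until it holds or a timeout expires. The sketch below shows that general polling pattern in plain Python. It is not Ghost.py's actual implementation, and the name `wait_for` and its parameters are made up for illustration.

```python
import time


def wait_for(condition, timeout=5.0, interval=0.05):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Toy condition: becomes true on the third check, like an element that
# shows up only after an XHR has finished.
state = {"checks": 0}


def element_appeared():
    state["checks"] += 1
    return state["checks"] >= 3


print(wait_for(element_appeared, timeout=1.0, interval=0.01))
```

The timeout matters: without it, a selector that never appears would hang the crawler forever.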

The official site has a Flask example:

Ghost.py combined with unittest can be used to unit test an application:

import unittest
from flask import Flask
from ghost import GhostTestCase

app = Flask(__name__)

@app.route('/')
def home():
    return 'hello world'

class MyTest(GhostTestCase):
    port = 5000

    @classmethod
    def create_app(cls):
        return app

    def test_open_home(self):
        self.ghost.open("http://localhost:%s/" % self.port)
        self.assertEqual(self.ghost.content, 'hello world')

if __name__ == '__main__':
    unittest.main()

~~~A complete small demo~~~

# Opens the web page
ghost.open('http://www.openstreetmap.org/')

# Waits for form search field
ghost.wait_for_selector('input[name=query]')

# Fills the form
ghost.fill("#search_form", {'query': 'France'})

# Submits the form
ghost.fire_on("#search_form", "submit")

# Waits for results (an XHR has been called here)
ghost.wait_for_selector(
    '#search_osm_nominatim .search_results_entry a')

# Clicks first result link
ghost.click(
    '#search_osm_nominatim .search_results_entry:first-child a')

# Checks if map has moved to expected latitude
lat, resources = ghost.evaluate("map.center.lat")
assert float(lat.toString()) == 5860090.806537
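One note on the last assertion: comparing floats with == is brittle, since a tiny rounding difference makes the check fail. A more forgiving variant, as a sketch, compares within a tolerance using math.isclose from the Python 3 standard library. The helper name `same_coordinate` and the tolerance are my own choices, not something from the Ghost.py docs.

```python
import math


def same_coordinate(actual, expected, tolerance=1e-6):
    """True if `actual` is within `tolerance` (absolute) of `expected`.

    `actual` may be a number or a numeric string, such as the text
    returned for an evaluated JavaScript value.
    """
    return math.isclose(float(actual), expected, rel_tol=0.0, abs_tol=tolerance)


print(same_coordinate("5860090.806537", 5860090.806537))
print(same_coordinate(5860090.9, 5860090.806537))
```

In the demo this would replace the == check with `assert same_coordinate(lat.toString(), 5860090.806537)`.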

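aha, now for a real example~

Let's simply simulate a browser going to Baidu to search for xiaorui.cc, then look at the content and the HTTP headers:

Operations in the terminal:

What we get is

http://www.baidu.com/s?wd=xiaorui.cc&rsv_bp=0&ch=&tn=baidu&bar=&rsv_spt=3&ie=utf-8

Let's visit it

and look at its HTTP headers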

In [10]: print page.headers
{u'BDQID': u'0xf594a31a03344b4f',
 u'Content-Encoding': u'gzip',
 u'Set-Cookie': u'BDSVRTM=381; path=/\nH_PS_PSSID=2976_2981_3091; path=/; domain=.baidu.com',
 u'BDUSERID': u'0',
 u'Server': u'BWS/1.0',
 u'Connection': u'Keep-Alive',
 u'Cache-Control': u'private',
 u'Date': u'Tue, 03 Sep 2013 09:53:56 GMT',
 u'Content-Type': u'text/html;charset=utf-8',
 u'BDPAGETYPE': u'3'}
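Notice that the Set-Cookie value in this output packs two cookies into one string, separated by a newline. If you need them individually, something like the following sketch can split them, using http.cookies from the Python 3 standard library. The helper name `split_cookies` is mine, and the sample string is copied from the output above.

```python
from http.cookies import SimpleCookie

# The Set-Cookie value exactly as it appears in page.headers above.
SET_COOKIE = (
    "BDSVRTM=381; path=/\n"
    "H_PS_PSSID=2976_2981_3091; path=/; domain=.baidu.com"
)


def split_cookies(header_value):
    """Map cookie name -> value for a newline-joined Set-Cookie string."""
    cookies = {}
    for line in header_value.split("\n"):
        jar = SimpleCookie()
        jar.load(line)  # parses "name=value; attr=..." into morsels
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


print(split_cookies(SET_COOKIE))
```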

Its content:

That's all for now~ See the official site for the more detailed features~
