Implementing a Scrapy Demo
Reference: the Scrapy documentation.
1. Preparation
Install Python and OpenSSL (both ship with Linux by default).
Install pip: apt-get install python-pip
Install Scrapy: pip install Scrapy
After installation you may run into four errors; here is how to fix each one:
1). fatal error: Python.h: No such file or directory
Install the python-dev package:
sudo apt-get install python-dev
2). gcc not found
Install the gcc compiler:
sudo apt-get install build-essential
3). ImportError: No module named w3lib.http
Install w3lib. Download it from:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Then install it as follows:
tar -xvzf w3lib-1.0.tar.gz
cd w3lib-1.0
python setup.py install
4). fatal error: libxml/xmlversion.h: No such file or directory
sudo apt-get install libxml2-dev libxslt-dev
2. Creating the project
scrapy startproject tutorial
This generates the project directory; running tree tutorial shows the following structure:
scrapy.cfg
tutorial/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        ...
1). scrapy.cfg: the project configuration file
2). tutorial/: the project's Python module
3). tutorial/spiders/: the directory that holds your spiders
4). tutorial/items.py: the project's items file
5). tutorial/pipelines.py: the project's pipelines file
6). tutorial/settings.py: the project's settings file
3. Defining the Item
Items are containers that will hold the scraped data. They work like Python dictionaries, but provide additional protection, such as raising errors for undefined fields to guard against typos. An Item is declared by creating a subclass of scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).
We model the items we need in order to control the data obtained from dmoz.org. For example, we want the site's name, URL, and description, so we define fields for these three attributes. To do this, edit the items.py file in the tutorial directory:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
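The "protection against typos" mentioned above can be illustrated without Scrapy. The following is a minimal, hypothetical sketch (not Scrapy's actual implementation) of a dict-like container that only accepts keys declared as Field attributes:

```python
# Minimal sketch (NOT Scrapy's real implementation): a dict subclass
# that rejects any key not declared as a Field on the class.
class Field(object):
    """Placeholder for per-field metadata, as in scrapy.item.Field."""
    pass

class Item(dict):
    def __setitem__(self, key, value):
        # Only allow keys that were declared as Field attributes --
        # this is the typo protection the tutorial refers to.
        if not isinstance(getattr(type(self), key, None), Field):
            raise KeyError("%s is not a declared field" % key)
        dict.__setitem__(self, key, value)

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

item = DmozItem()
item['title'] = 'Dive Into Python 3'   # declared field: accepted
try:
    item['titel'] = 'oops'             # typo: raises KeyError
except KeyError as e:
    print('rejected:', e)
```

A misspelled field name fails immediately at assignment time instead of silently producing an empty column in the output.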
4. The first spider
Spiders are user-written classes used to scrape information from a domain (or group of domains). They define an initial list of URLs to download, how to follow links, and how to parse page contents to extract items. To create a spider, you must subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:
name: the spider's identifier. It must be unique; you must give each spider a different name.
start_urls: the list of URLs the spider starts crawling from. The first pages downloaded will be these URLs; other child URLs are generated from these starting URLs.
parse(): a method of the spider. When called, it receives the Response object returned for each URL as its single argument. This method is responsible for parsing the returned data, extracting the scraped data (as items), and following further URLs.
Here is the code of our first spider; name it dmoz_spider.py and save it in the tutorial/spiders directory.
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
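The filename expression in parse() takes the second-to-last path segment of the URL. A quick standalone check (plain Python, no Scrapy needed) shows why:

```python
# The URL ends with a slash, so the last element of split("/") is the
# empty string; index -2 therefore picks the final path segment.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
filename = url.split("/")[-2]
print(filename)  # -> Books
```

For the two start_urls this yields the filenames Books and Resources, which is exactly what the crawl below creates.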
5. Running the spider
The crawl dmoz command starts the spider against the dmoz.org domain:
root@qing-ubuntu:/home/qing/tutorial/tutorial# scrapy crawl
dmoz
2013-07-06 14:33:39-0400 [scrapy] INFO: Scrapy 0.16.5 started
(bot: tutorial)
2013-07-06 14:33:40-0400 [scrapy] DEBUG: Enabled extensions:
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats,
SpiderState
2013-07-06 14:33:40-0400 [scrapy] DEBUG: Enabled downloader
middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware,
UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware,
RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Enabled spider
middlewares: HttpErrorMiddleware, OffsiteMiddleware,
RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Enabled item
pipelines:
2013-07-06 14:33:41-0400 [dmoz] INFO: Spider opened
2013-07-06 14:33:41-0400 [dmoz] INFO: Crawled 0 pages (at 0
pages/min), scraped 0 items (at 0 items/min)
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Telnet console
listening on 0.0.0.0:6023
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Web service listening
on 0.0.0.0:6080
2013-07-06 14:33:43-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
(referer: None)
2013-07-06 14:33:44-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
(referer: None)
2013-07-06 14:33:44-0400 [dmoz] INFO: Closing spider
(finished)
2013-07-06 14:33:44-0400 [dmoz] INFO: Dumping Scrapy
stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13062,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 6, 18, 33, 44,
149862),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 7, 6, 18, 33, 41,
21208)}
2013-07-06 14:33:44-0400 [dmoz] INFO: Spider closed
(finished)
Note the lines containing [dmoz]: they correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting pages, they have no referrers, so at the end of each line you will see (referer: None).
More interestingly, thanks to our parse method, two files were created, Books and Resources, containing the page content of the corresponding URLs.
6. Extracting data
Scrapy uses an XPath-selector mechanism based on XPath expressions. XPath resources:
Now let's try to extract data from the pages. You can enter response.body in the console to check whether the XPaths in the source match your expectations. However, inspecting raw HTML source is tedious; Google Chrome's Inspect Element feature makes this easier and can extract an element's XPath for you. After inspecting the source, you will find that the data we need is in a <ul> element, specifically the second <ul>.
We can select every <li> element on the page with the following code; modify dmoz_spider.py under spiders:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//fieldset/ul/li')
        #sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            #print title, link, desc
            print title, link
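The same selection logic can be tried outside Scrapy on a small fragment. The sketch below uses the standard library's xml.etree.ElementTree (whose XPath support is limited, so the path is written in relative form) on hypothetical markup mirroring the fieldset/ul/li structure the spider targets; the book titles and links are taken from the crawl output below:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure the spider selects:
# <li> items inside a <ul> inside a <fieldset>.
html = """
<html><body>
  <fieldset>
    <ul>
      <li><a href="http://www.diveintopython.net/">Dive Into Python 3</a> - a book</li>
      <li><a href="http://gnosis.cx/TPiP/">Text Processing in Python</a> - a book</li>
    </ul>
  </fieldset>
</body></html>
"""

root = ET.fromstring(html)
# ElementTree supports only a subset of XPath, so we use the relative
# form './/fieldset/ul/li' instead of Scrapy's '//fieldset/ul/li'.
for site in root.findall('.//fieldset/ul/li'):
    a = site.find('a')
    title = a.text          # like site.select('a/text()')
    link = a.get('href')    # like site.select('a/@href')
    desc = a.tail.strip()   # roughly like site.select('text()')
    print(title, link, desc)
```

Scrapy's HtmlXPathSelector is far more tolerant of real-world HTML; this is only a way to reason about what each sub-expression selects.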
Crawl dmoz.org again:
root@qing-ubuntu:/home/qing/tutorial/tutorial# scrapy crawl
dmoz
2013-07-06 15:44:04-0400 [scrapy] INFO: Scrapy 0.16.5 started
(bot: tutorial)
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled extensions:
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats,
SpiderState
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled downloader
middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware,
UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware,
RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled spider
middlewares: HttpErrorMiddleware, OffsiteMiddleware,
RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled item
pipelines:
2013-07-06 15:44:04-0400 [dmoz] INFO: Spider opened
2013-07-06 15:44:04-0400 [dmoz] INFO: Crawled 0 pages (at 0
pages/min), scraped 0 items (at 0 items/min)
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Telnet console
listening on 0.0.0.0:6023
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Web service listening
on 0.0.0.0:6080
2013-07-06 15:44:06-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
(referer: None)
[u'Computers: Programming: Resources']
[u'/Computers/Programming/Resources/']
[u'Free Python and Zope Hosting Directory']
[u'http://www.oinko.net/freepython/']
[u"O'Reilly Python Center"]
[u'http://oreilly.com/python/']
[u"Python Developer's Guide"]
[u'http://www.python.org/dev/']
[u'Social Bug'] [u'http://win32com.goermezer.de/']
[u"eff-bot's Daily Python URL"]
[u'http://www.pythonware.com/daily/']
2013-07-06 15:44:07-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
(referer: None)
[u'Computers: Programming: Languages: Python: Resources']
[u'/Computers/Programming/Languages/Python/Resources/']
[u'Computers: Programming: Languages: Ruby: Books']
[u'/Computers/Programming/Languages/Ruby/Books/']
[u'German']
[u'/World/Deutsch/Computer/Programmieren/Sprachen/Python/Bücher/']
[u'An Introduction to Python']
[u'http://www.network-theory.co.uk/python/intro/']
[u'Core Python Programming']
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00+en-USS_01DBC.html']
[u'Data Structures and Algorithms with Object-Oriented Design
Patterns in Python']
[u'http://www.brpreiss.com/books/opus7/html/book.html']
[u'Dive Into Python 3']
[u'http://www.diveintopython.net/']
[u'Foundations of Python Network Programming']
[u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/']
[u'Free Python books']
[u'http://www.techbooksforfree.com/perlpython.shtml']
[u'FreeTechBooks: Python Scripting Language']
[u'http://www.freetechbooks.com/python-f6.html']
[u'How to Think Like a Computer Scientist: Learning with
Python'] [u'http://greenteapress.com/thinkpython/']
[u'Learn to Program Using Python']
[u'http://www.freenetpages.co.uk/hp/alan.gauld/']
[u'Making Use of Python']
[u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html']
[u'Practical Python']
[u'http://hetland.org/writing/practical-python/']
[u'Pro Python System Administration']
[u'http://www.sysadminpy.com/']
[u'Programming in Python 3 (Second Edition)']
[u'http://www.qtrac.eu/py3book.html']
[u'Python 2.1 Bible']
[u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html']
[u'Python 3 Object Oriented Programming']
[u'https://www.packtpub.com/python-3-object-oriented-programming/book']
[u'Python Language Reference Manual']
[u'http://www.network-theory.co.uk/python/language/']
[u'Python Programming Patterns']
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00+en-USS_01DBC.html']
[u'Python Programming with the Java Class Libraries: A
Tutorial for Building Web and Enterprise Applications with Jython']
[u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1']
[u'Python: Visual QuickStart Guide']
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00+en-USS_01DBC.html']
[u'Sams Teach Yourself Python in 24 Hours']
[u'http://www.informit.com/store/product.aspx?isbn=0672317354']
[u'Text Processing in Python']
[u'http://gnosis.cx/TPiP/']
[u'XML Processing with Python']
[u'http://www.informit.com/store/product.aspx?isbn=0130211192']
2013-07-06 15:44:07-0400 [dmoz] INFO: Closing spider
(finished)
2013-07-06 15:44:07-0400 [dmoz] INFO: Dumping Scrapy
stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13062,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 6, 19, 44, 7,
262026),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 7, 6, 19, 44, 4,
558782)}
2013-07-06 15:44:07-0400 [dmoz] INFO: Spider closed
(finished)
7. Saving the scraped data
scrapy crawl dmoz -o items.json -t json
This exports the scraped items to items.json in JSON format.
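Because extract() returns lists, each exported item serializes its fields as lists of strings. A plain-Python sketch of the shape you can expect in items.json (field values taken from the crawl output above; the desc value is illustrative):

```python
import json

# Shape sketch of the exported file: a JSON array of objects, one per
# scraped DmozItem, with list-valued fields (extract() returns lists).
items = [
    {"title": ["Dive Into Python 3"],
     "link": ["http://www.diveintopython.net/"],
     "desc": []},
]
print(json.dumps(items, indent=2))
```

To feed the exporter, parse() would yield populated DmozItem objects instead of printing them.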