Implementing a Scrapy Demo
Reference: the Scrapy documentation.
1. Preparation
Install Python and OpenSSL (both ship with Linux by default).
Install pip: apt-get install python-pip
Install Scrapy: pip install Scrapy
After installation you may run into four errors; here is how to fix each one:
1). fatal error: Python.h: No such file or directory
Install the python-dev package:
sudo apt-get install python-dev
2). gcc not found
Install the gcc compiler:
sudo apt-get install build-essential
3). ImportError: No module named w3lib.http
Install w3lib. Download it from:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Then install it as follows:
tar -xvzf w3lib-1.0.tar.gz
cd w3lib-1.0
python setup.py install
4). fatal error: libxml/xmlversion.h: No such file or directory
sudo apt-get install libxml2-dev libxslt-dev
2. Creating the project
scrapy startproject tutorial
This generates the project directory; running tree tutorial shows the following structure:
scrapy.cfg
tutorial/
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        ...
1). scrapy.cfg: the project configuration file
2). tutorial/: the project's Python module
3). tutorial/spiders/: the directory that holds your spiders
4). tutorial/items.py: the project's items file
5). tutorial/pipelines.py: the project's pipelines file
6). tutorial/settings.py: the project's settings file
3. Defining the Item
Items are containers that will hold the scraped data. They work like Python dictionaries, but provide additional protection, such as raising errors for undefined fields to guard against typos. An Item is declared by creating a subclass of scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).
We model the items we need in order to control the data obtained from dmoz.org. For example, we want the site's name, URL, and description, so we define fields for these three attributes. To do this, edit the items.py file in the tutorial directory:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
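The "protection against typos" mentioned above can be illustrated without Scrapy. The following is a minimal, hypothetical sketch (not Scrapy's actual implementation) of a dict-like container that only accepts keys declared as Field attributes:

```python
# Minimal sketch (NOT Scrapy's real implementation): a dict subclass
# that rejects any key not declared as a Field on the class.
class Field(object):
    """Placeholder for per-field metadata, as in scrapy.item.Field."""
    pass

class Item(dict):
    def __setitem__(self, key, value):
        # Only allow keys that were declared as Field attributes --
        # this is the typo protection the tutorial refers to.
        if not isinstance(getattr(type(self), key, None), Field):
            raise KeyError("%s is not a declared field" % key)
        dict.__setitem__(self, key, value)

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

item = DmozItem()
item['title'] = 'Dive Into Python 3'   # declared field: accepted
try:
    item['titel'] = 'oops'             # typo: raises KeyError
except KeyError as e:
    print('rejected:', e)
```

A misspelled field name fails immediately at assignment time instead of silently producing an empty column in the output.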
4. The first spider
Spiders are user-written classes used to scrape information from a domain (or group of domains). They define an initial list of URLs to download, how to follow links, and how to parse page contents to extract items. To create a spider, you must subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:
name: the spider's identifier. It must be unique; you must give each spider a different name.
start_urls: the list of URLs the spider starts crawling from. The first pages downloaded will be these URLs; other child URLs are generated from these starting URLs.
parse(): a method of the spider. When called, it receives the Response object returned for each URL as its single argument. This method is responsible for parsing the returned data, extracting the scraped data (as items), and following further URLs.
Here is the code of our first spider; name it dmoz_spider.py and save it in the tutorial/spiders directory.
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
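The filename expression in parse() takes the second-to-last path segment of the URL. A quick standalone check (plain Python, no Scrapy needed) shows why:

```python
# The URL ends with a slash, so the last element of split("/") is the
# empty string; index -2 therefore picks the final path segment.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
filename = url.split("/")[-2]
print(filename)  # -> Books
```

For the two start_urls this yields the filenames Books and Resources, which is exactly what the crawl below creates.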
5. Running the spider
The crawl dmoz command starts the spider against the dmoz.org domain:
root@qing-ubuntu:/home/qing/tutorial/tutorial# scrapy crawl
dmoz
2013-07-06 14:33:39-0400 [scrapy] INFO: Scrapy 0.16.5 started
(bot: tutorial)
2013-07-06 14:33:40-0400 [scrapy] DEBUG: Enabled extensions:
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats,
SpiderState
2013-07-06 14:33:40-0400 [scrapy] DEBUG: Enabled downloader
middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware,
UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware,
RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Enabled spider
middlewares: HttpErrorMiddleware, OffsiteMiddleware,
RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Enabled item
pipelines:
2013-07-06 14:33:41-0400 [dmoz] INFO: Spider opened
2013-07-06 14:33:41-0400 [dmoz] INFO: Crawled 0 pages (at 0
pages/min), scraped 0 items (at 0 items/min)
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Telnet console
listening on 0.0.0.0:6023
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Web service listening
on 0.0.0.0:6080
2013-07-06 14:33:43-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
(referer: None)
2013-07-06 14:33:44-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
(referer: None)
2013-07-06 14:33:44-0400 [dmoz] INFO: Closing spider
(finished)
2013-07-06 14:33:44-0400 [dmoz] INFO: Dumping Scrapy
stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13062,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 6, 18, 33, 44,
149862),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 7, 6, 18, 33, 41,
21208)}
2013-07-06 14:33:44-0400 [dmoz] INFO: Spider closed
(finished)
Note the lines containing [dmoz]: they correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting pages, they have no referrers, so at the end of each line you will see (referer: None).
More interestingly, thanks to our parse method, two files were created, Books and Resources, containing the page content of the corresponding URLs.
6. Extracting data
Scrapy uses an XPath-selector mechanism based on XPath expressions. XPath resources:
Now let's try to extract data from the pages. You can enter response.body in the console to check whether the XPaths in the source match your expectations. However, inspecting raw HTML source is tedious; Google Chrome's Inspect Element feature makes this easier and can extract an element's XPath for you. After inspecting the source, you will find that the data we need is in a <ul> element, specifically the second <ul>.
We can select every <li> element on the page with the following code; modify dmoz_spider.py under spiders:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//fieldset/ul/li')
        #sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            #print title, link, desc
            print title, link
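The same selection logic can be tried outside Scrapy on a small fragment. The sketch below uses the standard library's xml.etree.ElementTree (whose XPath support is limited, so the path is written in relative form) on hypothetical markup mirroring the fieldset/ul/li structure the spider targets; the book titles and links are taken from the crawl output below:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment mirroring the structure the spider selects:
# <li> items inside a <ul> inside a <fieldset>.
html = """
<html><body>
  <fieldset>
    <ul>
      <li><a href="http://www.diveintopython.net/">Dive Into Python 3</a> - a book</li>
      <li><a href="http://gnosis.cx/TPiP/">Text Processing in Python</a> - a book</li>
    </ul>
  </fieldset>
</body></html>
"""

root = ET.fromstring(html)
# ElementTree supports only a subset of XPath, so we use the relative
# form './/fieldset/ul/li' instead of Scrapy's '//fieldset/ul/li'.
for site in root.findall('.//fieldset/ul/li'):
    a = site.find('a')
    title = a.text          # like site.select('a/text()')
    link = a.get('href')    # like site.select('a/@href')
    desc = a.tail.strip()   # roughly like site.select('text()')
    print(title, link, desc)
```

Scrapy's HtmlXPathSelector is far more tolerant of real-world HTML; this is only a way to reason about what each sub-expression selects.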
Crawl dmoz.org again:
root@qing-ubuntu:/home/qing/tutorial/tutorial# scrapy crawl
dmoz
2013-07-06 15:44:04-0400 [scrapy] INFO: Scrapy 0.16.5 started
(bot: tutorial)
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled extensions:
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats,
SpiderState
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled downloader
middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware,
UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware,
RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware,
ChunkedTransferMiddleware, DownloaderStats
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled spider
middlewares: HttpErrorMiddleware, OffsiteMiddleware,
RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled item
pipelines:
2013-07-06 15:44:04-0400 [dmoz] INFO: Spider opened
2013-07-06 15:44:04-0400 [dmoz] INFO: Crawled 0 pages (at 0
pages/min), scraped 0 items (at 0 items/min)
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Telnet console
listening on 0.0.0.0:6023
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Web service listening
on 0.0.0.0:6080
2013-07-06 15:44:06-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
(referer: None)
[u'Computers: Programming: Resources']
[u'/Computers/Programming/Resources/']
[u'Free Python and Zope Hosting Directory']
[u'http://www.oinko.net/freepython/']
[u"O'Reilly Python Center"]
[u'http://oreilly.com/python/']
[u"Python Developer's Guide"]
[u'http://www.python.org/dev/']
[u'Social Bug'] [u'http://win32com.goermezer.de/']
[u"eff-bot's Daily Python URL"]
[u'http://www.pythonware.com/daily/']
2013-07-06 15:44:07-0400 [dmoz] DEBUG: Crawled (200)
<GET
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
(referer: None)
[u'Computers: Programming: Languages: Python: Resources']
[u'/Computers/Programming/Languages/Python/Resources/']
[u'Computers: Programming: Languages: Ruby: Books']
[u'/Computers/Programming/Languages/Ruby/Books/']
[u'German']
[u'/World/Deutsch/Computer/Programmieren/Sprachen/Python/Bücher/']
[u'An Introduction to Python']
[u'http://www.network-theory.co.uk/python/intro/']
[u'Core Python Programming']
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00+en-USS_01DBC.html']
[u'Data Structures and Algorithms with Object-Oriented Design
Patterns in Python']
[u'http://www.brpreiss.com/books/opus7/html/book.html']
[u'Dive Into Python 3']
[u'http://www.diveintopython.net/']
[u'Foundations of Python Network Programming']
[u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/']
[u'Free Python books']
[u'http://www.techbooksforfree.com/perlpython.shtml']
[u'FreeTechBooks: Python Scripting Language']
[u'http://www.freetechbooks.com/python-f6.html']
[u'How to Think Like a Computer Scientist: Learning with
Python'] [u'http://greenteapress.com/thinkpython/']
[u'Learn to Program Using Python']
[u'http://www.freenetpages.co.uk/hp/alan.gauld/']
[u'Making Use of Python']
[u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html']
[u'Practical Python']
[u'http://hetland.org/writing/practical-python/']
[u'Pro Python System Administration']
[u'http://www.sysadminpy.com/']
[u'Programming in Python 3 (Second Edition)']
[u'http://www.qtrac.eu/py3book.html']
[u'Python 2.1 Bible']
[u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html']
[u'Python 3 Object Oriented Programming']
[u'https://www.packtpub.com/python-3-object-oriented-programming/book']
[u'Python Language Reference Manual']
[u'http://www.network-theory.co.uk/python/language/']
[u'Python Programming Patterns']
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00+en-USS_01DBC.html']
[u'Python Programming with the Java Class Libraries: A
Tutorial for Building Web and Enterprise Applications with Jython']
[u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1']
[u'Python: Visual QuickStart Guide']
[u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00+en-USS_01DBC.html']
[u'Sams Teach Yourself Python in 24 Hours']
[u'http://www.informit.com/store/product.aspx?isbn=0672317354']
[u'Text Processing in Python']
[u'http://gnosis.cx/TPiP/']
[u'XML Processing with Python']
[u'http://www.informit.com/store/product.aspx?isbn=0130211192']
2013-07-06 15:44:07-0400 [dmoz] INFO: Closing spider
(finished)
2013-07-06 15:44:07-0400 [dmoz] INFO: Dumping Scrapy
stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13062,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 6, 19, 44, 7,
262026),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 7, 6, 19, 44, 4,
558782)}
2013-07-06 15:44:07-0400 [dmoz] INFO: Spider closed
(finished)
7. Saving the scraped data
scrapy crawl dmoz -o items.json -t json
This exports the scraped items to items.json in JSON format.
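Because extract() returns lists, each exported item serializes its fields as lists of strings. A plain-Python sketch of the shape you can expect in items.json (field values taken from the crawl output above; the desc value is illustrative):

```python
import json

# Shape sketch of the exported file: a JSON array of objects, one per
# scraped DmozItem, with list-valued fields (extract() returns lists).
items = [
    {"title": ["Dive Into Python 3"],
     "link": ["http://www.diveintopython.net/"],
     "desc": []},
]
print(json.dumps(items, indent=2))
```

To feed the exporter, parse() would yield populated DmozItem objects instead of printing them.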