Implementing a Scrapy Demo

Reference documentation:

1. Preparation
Install Python and OpenSSL (both ship with most Linux distributions).
Install pip: apt-get install python-pip
Install Scrapy: pip install Scrapy

Four errors may come up after installation; fix them as follows:
1). fatal error: Python.h: No such file or directory
Install the python-dev package:
sudo apt-get install python-dev

2). gcc not found
Install the gcc compiler:
sudo apt-get install build-essential

3). ImportError: No module named w3lib.http
Install w3lib. Download it from:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Then install it:
tar -xvzf w3lib-1.0.tar.gz  
cd w3lib-1.0  
python setup.py install  

4). fatal error: libxml/xmlversion.h: No such file or directory
sudo apt-get install libxml2-dev libxslt-dev 

2. Creating the project
scrapy startproject tutorial
This creates a tutorial directory; running tree tutorial shows the following structure:
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
1). scrapy.cfg: the project configuration file
2). tutorial/: the project's Python module
3). tutorial/spiders/: the directory where spiders are placed
4). tutorial/items.py: the project's items file
5). tutorial/pipelines.py: the project's pipelines file
6). tutorial/settings.py: the project's settings file

3. Defining the Item

Items are containers for the data that will be scraped. They work like Python dictionaries but provide additional protection, such as raising an error when you fill an undeclared field, which guards against typos. An Item is declared by creating a subclass of scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an ORM. We model the data we want from dmoz.org as an item: we want each site's name, URL, and description, so we define a field for each of these three attributes. To do this, edit the items.py file in the tutorial directory:
from scrapy.item import Item, Field 
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
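
To see the extra protection mentioned above, you can play with the item in a Python shell inside the project; a minimal sketch (the field values here are made up for illustration):
from tutorial.items import DmozItem

item = DmozItem(title='Example site', link='http://example.com/')
print item['title']              # dict-style access: prints 'Example site'
item['desc'] = 'a description'   # declared fields can be assigned freely
item['author'] = 'nobody'        # raises KeyError: 'author' was never declared as a Field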

4. The first spider
A Spider is a user-written class used to scrape information from a domain (or group of domains). It defines an initial list of URLs to download, how to follow links, and how to parse page content to extract items. To create a Spider you must subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:
name: the spider's identifier. It must be unique; every spider must have a different name.
start_urls: the list of URLs the spider starts crawling from. The first pages downloaded will be these, and further URLs are generated from the data they contain.
parse(): the spider's callback method. It is called with the Response object downloaded for each URL as its only argument, and it is responsible for parsing the response data, extracting the scraped data (as items), and following further URLs.
Here is the code of our first spider; save it as dmoz_spider.py in the tutorial/spiders directory.
from scrapy.spider import BaseSpider
class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

5. Running the spider
The crawl dmoz command starts the spider for the dmoz.org domain:
root@qing-ubuntu:/home/qing/tutorial/tutorial# scrapy crawl dmoz
2013-07-06 14:33:39-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: tutorial)
2013-07-06 14:33:40-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-06 14:33:40-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-06 14:33:41-0400 [dmoz] INFO: Spider opened
2013-07-06 14:33:41-0400 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-06 14:33:41-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-06 14:33:43-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2013-07-06 14:33:44-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2013-07-06 14:33:44-0400 [dmoz] INFO: Closing spider (finished)
2013-07-06 14:33:44-0400 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13062,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 6, 18, 33, 44, 149862),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 7, 6, 18, 33, 41, 21208)}
2013-07-06 14:33:44-0400 [dmoz] INFO: Spider closed (finished)

Note the lines containing [dmoz], which correspond to our spider. There is a log line for each URL defined in start_urls. Because these URLs are the starting pages, they have no referrers, so at the end of each line you see (referer: None).
More interestingly, because of our parse method, two files have been created, Books and Resources, containing the content of the corresponding pages.

6. Extracting data
Scrapy uses a mechanism based on XPath expressions, called XPath selectors. XPath resources:
Now let's try to extract data from the pages. You can type response.body in the console to check whether the XPaths in the source are what you expect. Inspecting raw HTML source is tedious, however; Chrome's Inspect Element feature makes it easier and can also extract an element's XPath. After inspecting the source you will find that the data we need lives in a <ul> element, in fact the second <ul> on the page. We can select every <li> element on the site with the following code; modify dmoz_spider.py under the spiders directory:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//fieldset/ul/li')
        #sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            #print title, link, desc
            print title, link
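
Before re-running the spider, you can also test these XPath expressions interactively with the Scrapy shell. A minimal sketch of such a session (assuming this Scrapy 0.16-era setup, where the shell exposes the downloaded page as an HtmlXPathSelector named hxs):
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> hxs.select('//fieldset/ul/li/a/text()').extract()   # list of site titles
>>> hxs.select('//fieldset/ul/li/a/@href').extract()    # list of site links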

Crawl dmoz.org again:
root@qing-ubuntu:/home/qing/tutorial/tutorial# scrapy crawl dmoz
2013-07-06 15:44:04-0400 [scrapy] INFO: Scrapy 0.16.5 started (bot: tutorial)
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-06 15:44:04-0400 [dmoz] INFO: Spider opened
2013-07-06 15:44:04-0400 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-06 15:44:04-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-06 15:44:06-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
[u'Computers: Programming: Resources'] [u'/Computers/Programming/Resources/']
[u'Free Python and Zope Hosting Directory'] [u'http://www.oinko.net/freepython/']
[u"O'Reilly Python Center"] [u'http://oreilly.com/python/']
[u"Python Developer's Guide"] [u'http://www.python.org/dev/']
[u'Social Bug'] [u'http://win32com.goermezer.de/']
[u"eff-bot's Daily Python URL"] [u'http://www.pythonware.com/daily/']
2013-07-06 15:44:07-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[u'Computers: Programming: Languages: Python: Resources'] [u'/Computers/Programming/Languages/Python/Resources/']
[u'Computers: Programming: Languages: Ruby: Books'] [u'/Computers/Programming/Languages/Ruby/Books/']
[u'German'] [u'/World/Deutsch/Computer/Programmieren/Sprachen/Python/Bücher/']
[u'An Introduction to Python'] [u'http://www.network-theory.co.uk/python/intro/']
[u'Core Python Programming'] [u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00+en-USS_01DBC.html']
[u'Data Structures and Algorithms with Object-Oriented Design Patterns in Python'] [u'http://www.brpreiss.com/books/opus7/html/book.html']
[u'Dive Into Python 3'] [u'http://www.diveintopython.net/']
[u'Foundations of Python Network Programming'] [u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/']
[u'Free Python books'] [u'http://www.techbooksforfree.com/perlpython.shtml']
[u'FreeTechBooks: Python Scripting Language'] [u'http://www.freetechbooks.com/python-f6.html']
[u'How to Think Like a Computer Scientist: Learning with Python'] [u'http://greenteapress.com/thinkpython/']
[u'Learn to Program Using Python'] [u'http://www.freenetpages.co.uk/hp/alan.gauld/']
[u'Making Use of Python'] [u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html']
[u'Practical Python'] [u'http://hetland.org/writing/practical-python/']
[u'Pro Python System Administration'] [u'http://www.sysadminpy.com/']
[u'Programming in Python 3 (Second Edition)'] [u'http://www.qtrac.eu/py3book.html']
[u'Python 2.1 Bible'] [u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html']
[u'Python 3 Object Oriented Programming'] [u'https://www.packtpub.com/python-3-object-oriented-programming/book']
[u'Python Language Reference Manual'] [u'http://www.network-theory.co.uk/python/language/']
[u'Python Programming Patterns'] [u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00+en-USS_01DBC.html']
[u'Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython'] [u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1']
[u'Python: Visual QuickStart Guide'] [u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00+en-USS_01DBC.html']
[u'Sams Teach Yourself Python in 24 Hours'] [u'http://www.informit.com/store/product.aspx?isbn=0672317354']
[u'Text Processing in Python'] [u'http://gnosis.cx/TPiP/']
[u'XML Processing with Python'] [u'http://www.informit.com/store/product.aspx?isbn=0130211192']
2013-07-06 15:44:07-0400 [dmoz] INFO: Closing spider (finished)
2013-07-06 15:44:07-0400 [dmoz] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 530,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 13062,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 6, 19, 44, 7, 262026),
'log_count/DEBUG': 8,
'log_count/INFO': 4,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2013, 7, 6, 19, 44, 4, 558782)}
2013-07-06 15:44:07-0400 [dmoz] INFO: Spider closed (finished)

7. Saving the scraped data
The simplest way to export the scraped data is with a feed export:
scrapy crawl dmoz -o items.json -t json
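
Note that the parse method above only prints the extracted values, so items.json would come out empty. For the feed export to capture the data, parse has to return the scraped data as DmozItem objects. A minimal sketch of the adjusted spider, reusing the same XPath expressions (this item-returning version is not shown earlier in this post):
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//fieldset/ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items   # Scrapy collects the returned items; -o writes them to the output file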