First steps
- Create a project
- Define your own spider
- Define your own Item
- Define an Item pipeline to store the data
Create a crawler project:
scrapy startproject tutorial
This generates the following files:
scrapy_test/
├── __init__.py
├── __init__.pyc
├── items.py
├── pipelines.py
├── settings.py
├── settings.pyc
└── spiders
    ├── dmoz_spider.py
    ├── dmoz_spider.pyc
    ├── __init__.py
    └── __init__.pyc

1 directory, 10 files
What each file does:
- scrapy.cfg: project configuration file
- tutorial/: the project's Python module; you will import your code from here later
- tutorial/items.py: the project's items file
- tutorial/pipelines.py: the project's pipelines file
- tutorial/settings.py: the project's settings file
- tutorial/spiders/: directory where spiders are placed
Define the item:
from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
Define your own spider:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
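The `parse` callback above names each output file after the second-to-last segment of the response URL. That string expression can be checked with plain Python, no Scrapy needed:

```python
# The trailing "/" in the URL means split("/") ends with an empty
# string, so index -2 picks the last real path segment.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
filename = url.split("/")[-2]
print(filename)  # → Books
```

So the two start URLs produce files named `Books` and `Resources` in the directory where you run the crawl.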
Launch the spider to start crawling:
scrapy crawl dmoz
You should see some startup log output:
2017-03-21 05:16:21 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: scrapy_test)
2017-03-21 05:16:21 [scrapy] INFO: Optional features available: ssl, http11
2017-03-21 05:16:21 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'scrapy_test.spiders', 'SPIDER_MODULES': ['scrapy_test.spiders'], 'BOT_NAME': 'scrapy_test'}
2017-03-21 05:16:22 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memdebug.MemoryDebugger',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.spiderstate.SpiderState',
'scrapy.extensions.throttle.AutoThrottle']
2017-03-21 05:16:23 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
Define your own item to store the scraped data