- IPython is an enhanced Python interactive shell with syntax highlighting, auto-completion, built-in helper functions, and more.
pip install ipython
- XPath indices start at 1, not 0, e.g. …[1] selects the first match.
- Limit the number of items scraped (a settings.py alternative is sketched after this list):
scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90
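The same limit can also be configured in the project's settings.py instead of on the command line. A minimal sketch, using the standard CLOSESPIDER_ITEMCOUNT setting:
# settings.py: close the spider after roughly 90 items have been scraped
CLOSESPIDER_ITEMCOUNT = 90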
UR2IM process
Basic scraping workflow: UR2IM (URL, Request, Response, Items, More URLs).
- URL
scrapy shell is Scrapy's interactive console, used to quickly test Scrapy requests and selectors.
Start it with scrapy shell 'http://scrapy.org'
It exposes a set of objects that you can work with through IPython:
$ scrapy shell 'http://scrapy.org' --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x101ade4a8>
[s] item {}
[s] request <GET http://scrapy.org>
[s] response <200 https://scrapy.org/>
[s] settings <scrapy.settings.Settings object at 0x1028b09e8>
[s] spider <DefaultSpider 'default' at 0x102b531d0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
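A quick sketch of a session using these shortcuts (output omitted; the URL is just the one from above):
>>> fetch('http://scrapy.org')                  # download the page, updating request/response
>>> response.status                             # HTTP status code of the last response
>>> response.xpath('//h1/text()').extract()     # try an XPath expression interactively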
request and response
You can then work with the response object directly. For example, print the first 50 characters of the response body:
>>> response.body[:50]
The item
Extract the data from the response and put it into the corresponding item fields, using XPath.
A typical page contains a logo, search boxes, buttons, and other elements, but what we need is the specific information, such as a name or a phone number.
Locate the element, then extract it (copy the XPath from the browser and simplify it).
Use response.xpath('//h1/text()').extract() to extract the h1 elements on the page; the //h1/text() expression returns only their text content.
Here we assume there is a single h1 element; ideally a site has only one h1 per page, for SEO (Search Engine Optimization) reasons.
If the element is <h1 itemprop="name" class="space-mbs">...</h1>, it can also be extracted with //*[@itemprop="name"][1]/text()
Remember that XPath indexing starts at 1, not 0.
CSS selectors work as well:
response.css('.ad-price')
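CSS selectors can also target text and attributes via Scrapy's ::text and ::attr() extensions. A small sketch (the .ad-price class is just the example above):
response.css('.ad-price::text').extract()     # text content of elements with class ad-price
response.css('img::attr(src)').extract()      # src attribute of every img element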
Typical field selections:
| Primary fields | XPath expression |
| --- | --- |
| title | //*[@itemprop="name"][1]/text() |
| price | //*[@itemprop="price"][1]/text() |
| description | //*[@itemprop="description"][1]/text() |
| address | //*[@itemtype="http://schema.org/Place"][1]/text() |
| image_urls | //*[@itemprop="image"][1]/@src |
A Scrapy Project
scrapy startproject properties
Directory structure:
├── properties
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Defining the item
Plan the data you need; not all of it has to be used, so feel free to add fields.
from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

    # Calculated fields
    images = Field()
    location = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
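Items behave like dictionaries with a fixed set of allowed keys. A quick sketch of using PropertiesItem directly (the values are illustrative):
item = PropertiesItem()
item['title'] = ['Sample title']   # field values are usually lists of extracted strings
item['price'] = [5000.0]
print(item)                        # prints only the fields that have been set
# item['bedrooms'] = 3             # would raise KeyError: 'bedrooms' is not a declared field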
Writing the spider
Create a new spider with scrapy genspider mydomain mydomain.com
The default template:
import scrapy

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['http://web/']

    def parse(self, response):
        pass
After editing, it looks like the code below: start_urls points at the target URL, and the spider's built-in log() method is used to print everything that was extracted, e.g.
self.log(response.xpath('//@src').extract())
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        self.log("title: %s" % response.xpath(
            '//*[@itemprop="name"][1]/text()').extract())
        self.log("price: %s" % response.xpath(
            '//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
        self.log("description: %s" % response.xpath(
            '//*[@itemprop="description"][1]/text()').extract())
        self.log("address: %s" % response.xpath(
            '//*[@itemtype="http://schema.org/'
            'Place"][1]/text()').extract())
        self.log("image_urls: %s" % response.xpath(
            '//*[@itemprop="image"][1]/@src').extract())
Run the spider from the project directory with scrapy crawl basic
Alternatively, scrapy parse --spider=basic <URL> fetches the given URL and processes it with the specified spider.
Populating the item
In the spider basic.py, import the item class:
from properties.items import PropertiesItem
Then assign each item field from the corresponding extraction:
item = PropertiesItem()
item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
The complete version:
import scrapy
from helloworld.items import PropertiesItem

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['https://www.iana.org/domains/reserved']

    def parse(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
        # return the populated item so Scrapy can collect and export it
        return item
Saving the output
When running the spider, use -o to specify the output path and format:
scrapy crawl basic -o items.json
JSON format
scrapy crawl basic -o items.xml
XML format
scrapy crawl basic -o items.csv
CSV format
scrapy crawl basic -o "ftp://user:[email protected]/items.jl"
JSON Lines format, uploaded over FTP
scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"
JSON format, uploaded to S3
ItemLoader to simplify parse()
ItemLoader(item=..., response=...) takes the item and the response; add_xpath() then maps a field name to an XPath expression:
def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()')
There are also various processors:

| Processor | What it does |
| --- | --- |
| Join() | Joins multiple matches into one value. |
| MapCompose() | Chains Python functions over each extracted value. |
| MapCompose(unicode.strip) | Removes leading and trailing whitespace characters. |
| MapCompose(unicode.strip, unicode.title) | Same as above, but also title-cases the result. |
| MapCompose(float) | Converts strings to numbers. |
| MapCompose(lambda i: i.replace(',', ''), float) | Converts strings to numbers, ignoring any ',' characters. |
| MapCompose(lambda i: urlparse.urljoin(response.url, i)) | Converts relative URLs to absolute URLs. |
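A quick standalone sketch of how these processors behave (the values are illustrative):
from scrapy.loader.processors import MapCompose, Join

# MapCompose applies each function in turn to every value in the input list
to_number = MapCompose(lambda i: i.replace(',', ''), float)
print(to_number(['5,000.00']))                 # [5000.0]

# Join concatenates the values into one string (default separator is a space)
print(Join()(['first line', 'second line']))   # 'first line second line'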
add_value()
Sets a single concrete value on an item field directly, without any XPath:
def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"][1]/text()',
                MapCompose(unicode.strip), Join())
    l.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()',
                MapCompose(unicode.strip))
    l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                MapCompose(lambda i: urlparse.urljoin(response.url, i)))
    l.add_value('url', response.url)
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())
The complete spider:
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        """ This function parses a property page.

        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """
        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"][1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"][1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                    MapCompose(lambda i: urlparse.urljoin(response.url, i)))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()
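The @url, @returns, and @scrapes lines in the docstring above are spider contracts; they can be verified with
scrapy check basic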
Multiple URLs
When the listing spans several pages, you can enter the URLs by hand, one at a time:
start_urls = (
    'http://web:9312/properties/property_000000.html',
    'http://web:9312/properties/property_000001.html',
    'http://web:9312/properties/property_000002.html',
)
Or put the URLs in a file and read them from it:
start_urls = [i.strip() for i in
              open('todo.urls.txt').readlines()]
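An equivalent approach is to read the file lazily in start_requests(), so it is only opened when the crawl actually starts (a sketch, using the same todo.urls.txt as above):
def start_requests(self):
    # yield one Request per non-empty line of the URL list file
    with open('todo.urls.txt') as f:
        for line in f:
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse)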
A crawl can proceed in two directions:
- Horizontal: from an index page to the next index page, where the layout stays basically the same.
- Vertical: from an index page into a specific item page, where the layout changes, e.g. from a listing page to a product detail page.
urlparse.urljoin(base, URL)
is the standard Python way to join two URLs.
Find the set of URLs for the horizontal crawl:
urls = response.xpath('//*[@itemprop="url"]/@href').extract()
# [u'property_000000.html', ... u'property_000029.html']
Then combine each of them with the response URL via urljoin:
[urlparse.urljoin(response.url, i) for i in urls]
# [u'http://..._000000.html', ... /property_000029.html']
Horizontal plus vertical crawling
Get both the pagination URLs and the product URLs; it is just a matter of extracting the different URLs and combining the two:
import urlparse
from scrapy.http import Request

def parse(self, response):
    # Get the URLs of the next index pages (horizontal crawl)
    next_selector = response.xpath('//*[contains(@class,'
                                   '"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get the product page URLs (vertical crawl)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
scrapy genspider -t crawl webname web.org
generates a spider from the CrawlSpider template:
...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...
In rules, a Rule without a callback simply follows the extracted links, while a Rule with callback='parse_item' skips further link extraction for those URLs and hands each matched page to parse_item:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item'),
)
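Putting it together, a minimal sketch of the full CrawlSpider with the required imports (the start URL is an assumed index page; adjust it to the actual site):
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://web:9312/properties/index_00000.html']  # assumed index page

    rules = (
        # horizontal crawl: follow "next page" links, no callback
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        # vertical crawl: hand each property page to parse_item
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # extraction logic goes here (e.g. the ItemLoader code from parse() above)
        pass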