- IPython is an enhanced Python shell with syntax highlighting, tab completion, built-in helper functions, and more.
pip install ipython
- XPath indexes start at 1, not 0: …[1]
- Limit the number of items scraped:
scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90
UR2IM process
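Basic crawling steps: UR2IM (URL, Request, Response, Items, More URLs)
- URL
scrapy shell is a Scrapy command-line tool for quickly testing scraping code.
Start it with scrapy shell 'http://scrapy.org'.
It returns a set of objects that you can then work with from IPython.
$ scrapy shell 'http://scrapy.org' --nolog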
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x101ade4a8>
[s] item {}
[s] request <GET http://scrapy.org>
[s] response <200 https://scrapy.org/>
[s] settings <scrapy.settings.Settings object at 0x1028b09e8>
[s] spider <DefaultSpider 'default' at 0x102b531d0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
request and response
Work with the response object. For example, print the first 50 bytes of the response body:
>>> response.body[:50]
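The slice above works on bytes, since response.body holds the raw body. As a minimal sketch of the difference between slicing bytes and slicing decoded text, the snippet below uses a made-up HTML string in place of a real response (the content and the UTF-8 encoding are assumptions for illustration only):

```python
# Hypothetical page content standing in for a downloaded response body.
body = "<!DOCTYPE html><html><head><title>Scrapy</title></head></html>".encode("utf-8")

# Slicing the raw body gives the first 50 bytes.
first_bytes = body[:50]

# Decoding first gives a str, so the slice counts characters instead.
first_chars = body.decode("utf-8")[:50]

print(type(first_bytes).__name__, len(first_bytes))
print(type(first_chars).__name__, len(first_chars))
```

For pages that contain non-ASCII characters the two slices can differ, because one character may occupy several bytes.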
The item
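Extract the data from the response and put it into the corresponding item fields. XPath is used for the extraction.
A typical page contains many elements: a logo, search boxes, buttons, and so on.
What we need is the specific information, such as a name or a phone number.
Locate the element, then extract it (copy its XPath from the browser, then simplify it).
Use
response.xpath('//h1/text()').extract()
to extract every h1 element on the page; the //h1/text() expression selects only the text content.
This assumes the page has a single h1 element. For SEO (Search Engine Optimization), a page should ideally have only one h1.
If the element is <h1 itemprop="name" class="space-mbs">...</h1>, it can also be extracted with //*[@itemprop="name"][1]/text()
XPath indexes start at 1, not 0.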
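To try the attribute match and the 1-based indexing described above without a running Scrapy shell, here is a small sketch using Python's standard xml.etree.ElementTree. This is a stand-in, not Scrapy's selector: ElementTree supports only a subset of XPath and has no text() step, so the text is read from the .text attribute instead. The page fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for this illustration.
PAGE = """
<html>
  <body>
    <h1 itemprop="name" class="space-mbs">Sample title</h1>
    <ul>
      <li>first</li>
      <li>second</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Match any element by attribute, similar to //*[@itemprop="name"].
title = [el.text for el in root.findall(".//*[@itemprop='name']")]

# Positions are 1-based: li[1] selects the first <li>, not the second.
first_li = [el.text for el in root.findall(".//li[1]")]

print(title)
print(first_li)
```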
CSS selectors
response.css('.ad-price')
Commonly needed fields
Primary fields | XPath expression
title | //*[@itemprop="name"][1]/text()
price | //*[@itemprop="price"][1]/text()
description | //*[@itemprop="description"][1]/text()
address | //*[@itemtype="http://schema.org/Place"][1]/text()
image_urls | //*[@itemprop="image"][1]/@src
A Scrapy Project
scrapy startproject properties
Directory structure:
├── properties
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
Planning the item
Plan the data you need. Not every field has to be used, and feel free to add fields.
from scrapy.item import Item, Field
class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()
    # Calculated fields
    images = Field()
    location = Field()
    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
Writing the spider
Create a new spider: scrapy genspider mydomain mydomain.com
Default template:
import scrapy
class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['http://web/']
    def parse(self, response):
        pass
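start_urls
The target URLs to crawl.
self
Gives access to the spider's built-in methods, such as log(), which writes a message to the log. For example, to log every src attribute on the page:
self.log("src: %s" % response.xpath('//@src').extract())
The modified spider: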
import scrapy
class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )
    def parse(self, response):
        self.log("title: %s" % response.xpath(
            '//*[@itemprop="name"][1]/text()').extract())
        self.log("price: %s" % response.xpath(
            '//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
        self.log("description: %s" % response.xpath(
            '//*[@itemprop="description"][1]/text()').extract())
        self.log("address: %s" % response.xpath(
            '//*[@itemtype="http://schema.org/'
            'Place"][1]/text()').extract())
        self.log("image_urls: %s" % response.xpath(
            '//*[@itemprop="image"][1]/@src').extract())
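Run it from the project directory with scrapy crawl.
Alternatively, use scrapy parse, which fetches the given URL and processes it with the matching spider.
Populating the item
In the spider file basic.py, import the item class:
from properties.items import PropertiesItem
Then assign each extracted value to the corresponding item field: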
item = PropertiesItem()
item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
The complete spider:
import scrapy
from properties.items import PropertiesItem
class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['https://www.iana.org/domains/reserved']
    def parse(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
        # Return the populated item so Scrapy can export it
        return item
Saving to a file
When running the spider, save the output by specifying the format and path:
scrapy crawl basic -o items.json
JSON format
scrapy crawl basic -o items.xml
XML format
scrapy crawl basic -o items.csv
CSV format
scrapy crawl basic -o "ftp://user:[email protected]/items.jl"
JSON Lines (.jl) format, uploaded over FTP
scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"
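The difference between the .json and .jl outputs listed above is easy to miss, so here is a rough sketch using only the standard json module. It does not call Scrapy's exporters; the two items are invented, and the point is only the shape of each format: a single JSON array versus one JSON object per line.

```python
import json

# Hypothetical scraped items, invented for this illustration.
items = [
    {"title": "Flat A", "price": 1200.0},
    {"title": "Flat B", "price": 950.0},
]

# .json style: one JSON array holding every item.
as_json = json.dumps(items)

# .jl style (JSON Lines): one JSON object per line.
as_jl = "\n".join(json.dumps(item) for item in items)

# A JSON Lines file can be read back one line at a time.
restored = [json.loads(line) for line in as_jl.splitlines()]

print(as_json)
print(as_jl)
```

Because each line is a complete object, a JSON Lines file can be appended to and processed incrementally, which tends to suit large crawls better than a single array.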
Using ItemLoader to simplify parse
ItemLoader(item, response) takes an item and a response; each field is then filled from an XPath expression.
def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()')
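Several processors are also available. The examples use Python 2 names; on Python 3, use str.strip, str.title and urllib.parse.urljoin instead.
Join
Joins multiple values into one.
MapCompose
Applies Python functions to each value.
MapCompose(unicode.strip)
Removes leading and trailing whitespace characters.
MapCompose(unicode.strip, unicode.title)
Same as above, but also title-cases the results.
MapCompose(float)
Converts strings to numbers.
MapCompose(lambda i: i.replace(',',''), float)
Converts strings to numbers, ignoring any ',' characters.
MapCompose(lambda i: urlparse.urljoin(response.url, i))
Converts a relative URL into an absolute URL, using response.url as the base.
add_value
Adds a single literal value to an item field.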
def parse(self, response):
    # Create the loader from the response before adding fields
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"]'
                '[1]/text()', MapCompose(unicode.strip), Join())
    l.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()',
                MapCompose(unicode.strip))
    l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                MapCompose(lambda i: urlparse.urljoin(response.url, i)))
    l.add_value('url', response.url)
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())
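To make the processor chains used above more concrete, the following is a simplified pure-Python stand-in for MapCompose and Join, written for Python 3 (so str replaces unicode). The helper names map_compose and join are hypothetical and this is not Scrapy's implementation; it only mirrors the basic idea that each function is applied to every extracted value in turn and that None results are dropped.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose (not Scrapy's own code)."""
    def process(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                if result is not None:  # None results are dropped
                    next_values.append(result)
            values = next_values
        return values
    return process


def join(separator=" "):
    """Simplified stand-in for Join: merge all values into one string."""
    def process(values):
        return separator.join(values)
    return process


# Chains equivalent in spirit to the ones used for title and price.
clean_title = map_compose(str.strip, str.title)
to_price = map_compose(lambda s: s.replace(",", ""), float)

print(clean_title(["  hello world "]))
print(to_price(["1,234.50"]))
print(join()(["two", "parts"]))
```

Each processor receives a list and returns a list (or, for the join helper, a single string), which is why the chains can be stacked in one add_xpath call.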
The complete spider:
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy
class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    # Start on a property page
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )
    def parse(self, response):
        """ This function parses a property page.
        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """
        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)
        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''),
                               float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"]'
                    '[1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"]'
                    '[1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"]'
                    '[1]/@src', MapCompose(
                        lambda i: urlparse.urljoin(response.url, i)))
        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())
        return l.load_item()
Multiple URLs
When a listing spans several pages, there are multiple URLs to crawl.
They can be entered by hand, one by one:
start_urls = (
    'http://web:9312/properties/property_000000.html',
    'http://web:9312/properties/property_000001.html',
    'http://web:9312/properties/property_000002.html',
)
Alternatively, put the URLs in a file and read them:
start_urls = [i.strip() for i in
              open('todo.urls.txt').readlines()]
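The one-liner above leaves the file handle open and keeps blank lines as empty URLs. A slightly more defensive sketch is shown below; the helper name read_start_urls is hypothetical, and the temporary file only exists so the example can run on its own.

```python
import os
import tempfile


def read_start_urls(path):
    """Read one URL per line, skipping blank lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a temporary file standing in for todo.urls.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "todo.urls.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("http://web:9312/properties/property_000000.html\n"
                     "\n"
                     "http://web:9312/properties/property_000001.html\n")
    start_urls = read_start_urls(path)

print(start_urls)
```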
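A crawl can move in two directions:
- Horizontal: from one index page to the next; the page layout stays basically the same.
- Vertical: from an index page to a specific item page; the layout changes, for example from a listing page to a product detail page.
urlparse.urljoin(base, URL)
Python function that joins a base URL and another URL.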
Extract the item URLs from the index page:
urls = response.xpath('//*[@itemprop="url"]/@href').extract()
# [u'property_000000.html', ... u'property_000029.html']
Combine them with the page URL using urljoin:
[urlparse.urljoin(response.url, i) for i in urls]
# [u'http://..._000000.html', ... /property_000029.html']
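On Python 3 the same function lives in urllib.parse. The sketch below shows how relative links resolve against a page URL; the base URL is an assumed index page address chosen for illustration, not one taken from a real crawl.

```python
from urllib.parse import urljoin

# Assumed URL of the index page the links were found on.
base = "http://web:9312/properties/index_00000.html"

# Relative links as they might appear in href attributes.
hrefs = ["property_000000.html", "property_000029.html"]

# Relative names replace the last path segment of the base URL.
absolute = [urljoin(base, href) for href in hrefs]

# A link starting with "/" is resolved from the site root instead.
rooted = urljoin(base, "/images/i01.jpg")

print(absolute)
print(rooted)
```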
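Crawling horizontally and vertically
Collect both the pagination URLs and the product URLs.
The two cases differ only in which URLs are extracted and how they are combined.
# Requires: from scrapy.http import Request
def parse(self, response):
    # Get the URLs of the next index pages (horizontal)
    next_selector = response.xpath('//*[contains(@class,'
                                   '"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))
    # Get the product URLs (vertical)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)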
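The control flow of the two-direction parse method can be hard to picture, so here is a toy model that runs without Scrapy. The SITE dictionary is an invented in-memory site, and the queue is a plain breadth-first loop; Scrapy schedules requests itself and may visit pages in a different order, so treat this only as a sketch of which links lead where.

```python
from collections import deque

# Hypothetical site: each index page has "next" links and item links.
SITE = {
    "index_0": {"next": ["index_1"], "items": ["item_a", "item_b"]},
    "index_1": {"next": [], "items": ["item_c"]},
}


def crawl(start):
    """Return the item pages reached from the starting index page."""
    queue = deque([start])
    seen = {start}
    items = []
    while queue:
        page = queue.popleft()
        links = SITE[page]
        # Vertical: index page -> item pages (handled by parse_item in a spider).
        for item in links["items"]:
            if item not in seen:
                seen.add(item)
                items.append(item)
        # Horizontal: index page -> next index page (handled by parse again).
        for nxt in links["next"]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return items


scraped = crawl("index_0")
print(scraped)
```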
scrapy genspider -t crawl webname web.org
generates a spider from the crawl template:
...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']
    rules = (
        Rule(LinkExtractor(allow=r'Items/'),
             callback='parse_item', follow=True),
    )
    def parse_item(self, response):
        ...
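Within rules, a Rule that sets callback passes each matching page to parse_item; once a callback is set, links on those pages are no longer followed unless follow=True is also given. A Rule without a callback only follows the links it extracts:
rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item')
)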