Learning Scrapy 1

  • IPython is an enhanced Python shell with syntax highlighting, auto-completion, built-in helper functions, and more.
    pip install ipython
  • XPath indexing starts at 1, not 0, …[1]
  • Limit how many items a crawl collects: scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90

UR2IM process

The basic scraping workflow: UR2IM (URL, Request, Response, Items, More URLs)

  • URL
    scrapy shell is Scrapy's interactive console, useful for quickly testing scraping code.
    Start it with scrapy shell 'http://scrapy.org'
    It returns the objects listed below, which you can work with through IPython.
$ scrapy shell 'http://scrapy.org' --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x101ade4a8>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x1028b09e8>
[s]   spider     <DefaultSpider 'default' at 0x102b531d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
  • Request and Response
    You can then operate on the response object.
    Print the first 50 characters of the response body:
    >>> response.body[:50]
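    A few more quick checks in the same shell session (a minimal sketch; the exact values depend on the page fetched):
    >>> response.status                               # HTTP status code, e.g. 200
    >>> response.headers['Content-Type']              # a raw response header
    >>> response.xpath('//title/text()').extract()    # the page title text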

  • The item
    Extract the data you need from the response and put it into the corresponding item, using XPath.

A typical page contains elements such as a logo, search boxes, buttons, and so on.
What you actually need is the specific information: name, phone number, etc.
Locate it and extract it (copy the XPath from the browser, then simplify it).

Using
response.xpath('//h1/text()').extract()
extracts the text of every h1 element on the page.

Using //h1/text() means only the text node is extracted, not the element itself.
Here we assume there is a single h1 element; ideally a page has only one h1 anyway, for SEO (Search Engine Optimization).

If the element on the page is <h1 itemprop="name" class="space-mbs">...</h1>
it can also be extracted with //*[@itemprop="name"][1]/text()
(again, XPath indexing starts at 1, not 0).

CSS selectors

response.css('.ad-price')
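The CSS selector can be chained with Scrapy's ::text pseudo-element and with extract()/re() (a sketch; .ad-price is just the class used in this example):

>>> response.css('.ad-price::text').extract()
>>> response.css('.ad-price::text').re('[.0-9]+')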

Typical primary fields and the XPath expressions that select them:

Primary fields | XPath expression
title | //*[@itemprop="name"][1]/text()
price | //*[@itemprop="price"][1]/text()
description | //*[@itemprop="description"][1]/text()
address | //*[@itemtype="http://schema.org/Place"][1]/text()
image_urls | //*[@itemprop="image"][1]/@src
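Each expression can be checked in scrapy shell before it goes into a spider (a sketch; no output shown since it depends on the page):

>>> response.xpath('//*[@itemprop="name"][1]/text()').extract()
>>> response.xpath('//*[@itemprop="price"][1]/text()').re('[.0-9]+')
>>> response.xpath('//*[@itemprop="image"][1]/@src').extract()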

A Scrapy Project

scrapy startproject properties
Directory layout:

├── properties
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Planning the item

Plan the data you need; you do not have to use every field, and feel free to add fields later.

from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

    # Calculated fields
    images = Field()
    location = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
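Item instances behave like dictionaries, but only declared fields may be set; assigning an undeclared key raises a KeyError. A quick sketch in a Python session (the value is hypothetical):

>>> from properties.items import PropertiesItem
>>> item = PropertiesItem()
>>> item['title'] = u'Sample title'   # hypothetical value
>>> item
{'title': u'Sample title'}
>>> item['colour'] = u'red'           # not declared above, raises KeyError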

Writing the spider

Create a new spider with scrapy genspider mydomain mydomain.com
The default template:

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['http://web/']

    def parse(self, response):
        pass

Modified version below:
start_urls holds the target URLs.
self gives access to the spider's built-in helpers; the log() method prints its argument to the crawl log, e.g.
self.log(response.xpath('//@src').extract())

import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        self.log("title: %s" % response.xpath(
            '//*[@itemprop="name"][1]/text()').extract())
        self.log("price: %s" % response.xpath(
            '//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
        self.log("description: %s" % response.xpath(
            '//*[@itemprop="description"][1]/text()').extract())
        self.log("address: %s" % response.xpath(
            '//*[@itemtype="http://schema.org/'
            'Place"][1]/text()').extract())
        self.log("image_urls: %s" % response.xpath(
            '//*[@itemprop="image"][1]/@src').extract())

Run it from the project directory with scrapy crawl basic.
Alternatively, use scrapy parse,
which fetches the given URL and processes it with the corresponding spider.
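For example (a sketch, using the demo URL from above):

scrapy crawl basic
scrapy parse --spider=basic http://web:9312/properties/property_000000.html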

Populating the item

In the spider basic.py, import the item class:
from properties.items import PropertiesItem
then assign each field from the corresponding selector result:

item = PropertiesItem()
item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()

Complete version:

import scrapy
from properties.items import PropertiesItem

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['https://www.iana.org/domains/reserved']

    def parse(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
        return item

Saving to a file

When running the spider, you can save the scraped items, specifying the format and path:
scrapy crawl basic -o items.json   (JSON)
scrapy crawl basic -o items.xml    (XML)
scrapy crawl basic -o items.csv    (CSV)
scrapy crawl basic -o "ftp://user:[email protected]/items.j1" j1格式
scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"

Using an ItemLoader to simplify parse

ItemLoader(item=..., response=...) takes an item and the response; fields are then added with XPath expressions:

    def parse(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)

        l.add_xpath('title', '//*[@itemprop="name"][1]/text()')

There are also several built-in processors (see the sketch after this list):
Join()  merges multiple extracted values into one
MapCompose(f)  applies a Python function to each value
MapCompose(unicode.strip)  removes leading and trailing whitespace characters
MapCompose(unicode.strip, unicode.title)  same as above, but also title-cases the result
MapCompose(float)  converts strings to numbers
MapCompose(lambda i: i.replace(',', ''), float)  converts strings to numbers, ignoring any ',' characters
MapCompose(lambda i: urlparse.urljoin(response.url, i))  turns relative paths into absolute URLs
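These processors are plain callables, so you can experiment with them on a list of values directly (a minimal sketch, Python 2 as in the book):

>>> from scrapy.loader.processors import MapCompose, Join
>>> Join()([u'hello', u'world'])
u'hello world'
>>> MapCompose(unicode.strip)([u'  hello ', u' world '])
[u'hello', u'world']
>>> MapCompose(lambda i: i.replace(',', ''), float)([u'1,400.00'])
[1400.0]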

add_value adds a single literal value to the item:

def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)

    l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"]'
                '[1]/text()', MapCompose(unicode.strip), Join())
    l.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()',
                MapCompose(unicode.strip))
    l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                MapCompose(lambda i: urlparse.urljoin(response.url, i)))

    l.add_value('url', response.url)
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())

The full spider now looks like this:

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        """ This function parses a property page.
        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """
        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"]'
                    '[1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"]'
                    '[1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"]'
                    '[1]/@src',
                    MapCompose(lambda i: urlparse.urljoin(response.url, i)))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()
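The @url, @returns, and @scrapes lines in the docstring are Scrapy contracts; they let you smoke-test the spider against that sample URL with:

scrapy check basic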

Multiple URLs

When the listing spans several pages,
the URLs can simply be entered one by one:

 start_urls = (
       'http://web:9312/properties/property_000000.html',
       'http://web:9312/properties/property_000001.html',
       'http://web:9312/properties/property_000002.html',
)

Or put the URLs in a file and read them in:

start_urls = [i.strip() for i in
   open('todo.urls.txt').readlines()]

A crawl can proceed in two directions:
- horizontally: from one index page to the next, where the page layout stays the same
- vertically: from an index page to a specific item page, where the layout changes, e.g. from a listing page to a product detail page.

urlparse.urljoin(base, URL) is the Python way to join two URLs.
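For example (a sketch; the index filename is just an illustration):

>>> import urlparse
>>> urlparse.urljoin('http://web:9312/properties/index_00000.html',
...                  'property_000000.html')
'http://web:9312/properties/property_000000.html'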

Find the set of URLs for the horizontal crawl:

urls = response.xpath('//*[@itemprop="url"]/@href').extract()
# [u'property_000000.html', ... u'property_000029.html']

Combine them with urljoin:

[urlparse.urljoin(response.url, i) for i in urls]
# [u'http://..._000000.html', ... /property_000029.html']

Crawling both horizontally and vertically

Get both the next-page URLs and the item URLs;
it is just a matter of extracting the two kinds of URL and issuing the corresponding requests.

def parse(self, response):
    # requires: from scrapy.http import Request
    # Get the next index-page URLs and yield Requests for them (horizontal crawl)
    next_selector = response.xpath('//*[contains(@class,'
                                   '"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get item URLs and hand them to parse_item (vertical crawl)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)

scrapy genspider -t crawl webname web.org
generates a CrawlSpider:

...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...

Within rules, a Rule without a callback simply follows the links it extracts; giving a Rule callback='parse_item' makes each extracted URL be processed by parse_item instead of being followed further:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item')
)
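Putting it together, a minimal CrawlSpider version of the property spider could look like the sketch below (assuming the same PropertiesItem and the same field XPaths as BasicSpider; the start URL is assumed to be the book's demo index page):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader.processors import MapCompose
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    # Assumed start page: the demo server's index page
    start_urls = ['http://web:9312/properties/index_00000.html']

    rules = (
        # Horizontal crawl: follow pagination links (no callback, just follow)
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        # Vertical crawl: hand each item link to parse_item
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Same field loading as BasicSpider.parse(), abbreviated to two fields here
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        return l.load_item()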