Learning Scrapy 1

  • ipython is an enhanced Python interactive shell with syntax highlighting, auto-completion, built-in helpers, and more.
    pip install ipython
  • XPath indexing starts at 1, not 0 (see the sketch below), e.g. …[1]
  • Limit the number of items scraped: scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90
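A quick way to see the 1-based indexing (a minimal sketch using scrapy.Selector on an inline HTML string, not from the book):

from scrapy import Selector

sel = Selector(text='<ul><li>first</li><li>second</li></ul>')
# XPath positions are 1-based: [1] selects the first <li>
print(sel.xpath('//li[1]/text()').extract())   # [u'first']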

UR2IM process

Basic crawling steps: UR2IM (URL, Request, Response, Items, More URLs)

  • URL
    scrapy shell is Scrapy's interactive console, used to quickly test Scrapy against a page.
    Start it with scrapy shell 'http://scrapy.org'
    It returns a set of objects that can be worked with through ipython:
$ scrapy shell 'http://scrapy.org' --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x101ade4a8>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x1028b09e8>
[s]   spider     <DefaultSpider 'default' at 0x102b531d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
  • Request and response
    Operate on the response object,
    e.g. print the first 50 characters of the response body:
    >>> response.body[:50]

  • The item
    Extract the data from the response and put it into the corresponding item fields, using XPath.

A typical page contains a logo, search boxes, buttons, and other elements.
What we need is the specific information: name, phone number, and so on.
Locate it, then extract it (copy the XPath from the browser, then simplify it).

Using
response.xpath('//h1/text()').extract()
extracts the text of every h1 element on the page.

Using //h1/text() extracts only the text nodes, not the tags.
Here we assume there is only one h1 element; a page should ideally have a single h1, for SEO (Search Engine Optimization) reasons.
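To see the difference in the shell (a quick sketch, assuming the page has an h1):

>>> response.xpath('//h1').extract()         # full <h1 ...>...</h1> elements as strings
>>> response.xpath('//h1/text()').extract()  # only the text inside them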

If the page element is <h1 itemprop="name" class="space-mbs">...</h1>,
it can also be extracted with //*[@itemprop="name"][1]/text()
(again, XPath indexing starts at 1, not 0).

CSS selectors

response.css('.ad-price')
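CSS selectors can also extract text directly with Scrapy's ::text extension; a rough comparison (assuming an element with class ad-price):

>>> response.css('.ad-price').extract()        # matching elements as HTML strings
>>> response.css('.ad-price::text').extract()  # text only
>>> response.xpath('//*[@class="ad-price"]/text()').extract()  # a roughly equivalent XPath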

Commonly needed selections


Primary fields | XPath expression
title | //*[@itemprop="name"][1]/text()
price | //*[@itemprop="price"][1]/text()
description | //*[@itemprop="description"][1]/text()
address | //*[@itemtype="http://schema.org/Place"][1]/text()
image_urls | //*[@itemprop="image"][1]/@src

A Scrapy Project

scrapy startproject properties
Directory structure:

├── properties
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Planning the item

Plan the data you need; not all of it has to be used in the end, so feel free to add fields.

from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

    # Calculated fields
    images = Field()
    location = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
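Items behave like dicts, but only declared fields can be assigned; a quick sketch (hypothetical value):

item = PropertiesItem()
item['title'] = [u'Flat in London']
print(item)                 # prints only the populated fields
# item['colour'] = u'red'   # would raise KeyError: field not declared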

Writing the spider

Create a new spider: scrapy genspider mydomain mydomain.com
The default template:

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['http://web/']

    def parse(self, response):
        pass

After modification:
start_urls holds the target URLs.
self gives access to the spider's built-in helpers, e.g. the log() method to print output:
self.log(response.xpath('//@src').extract())

import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        self.log("title: %s" % response.xpath(
            '//*[@itemprop="name"][1]/text()').extract())
        self.log("price: %s" % response.xpath(
            '//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
        self.log("description: %s" % response.xpath(
            '//*[@itemprop="description"][1]/text()').extract())
        self.log("address: %s" % response.xpath(
            '//*[@itemtype="http://schema.org/'
            'Place"][1]/text()').extract())
        self.log("image_urls: %s" % response.xpath(
            '//*[@itemprop="image"][1]/@src').extract())

Start the spider from the project directory with scrapy crawl.
Alternatively, scrapy parse can be used:
it fetches the given URL and processes it with the corresponding spider.
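For example (the URL assumes the book's sample server):

scrapy crawl basic
scrapy parse --spider=basic http://web:9312/properties/property_000000.html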

Populating the item

In the spider basic.py, import the item:
from properties.items import PropertiesItem
then assign each item field from the corresponding response data:

item = PropertiesItem()
item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()

The complete spider:

import scrapy
from helloworld.items import PropertiesItem

class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['https://www.iana.org/domains/reserved']

    def parse(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
        return item

Saving the output

When running the spider, save the scraped items to a file by specifying format and path:
scrapy crawl basic -o items.json   (JSON)
scrapy crawl basic -o items.xml    (XML)
scrapy crawl basic -o items.csv    (CSV)
scrapy crawl basic -o "ftp://user:[email protected]/items.jl"   (JSON Lines, to FTP)
scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"   (to S3)

ItemLoader simplifies parse

ItemLoader(item=..., response=...) takes the item and the response; fields are then added with XPath expressions:

    def parse(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)

        l.add_xpath('title', '//*[@itemprop="name"][1]/text()')

There are also various processors:
Join() merges multiple values into one.
MapCompose() applies Python functions to each value.
MapCompose(unicode.strip)                                Removes leading and trailing whitespace.
MapCompose(unicode.strip, unicode.title)                 Same as above, but also title-cases the result.
MapCompose(float)                                        Converts strings to numbers.
MapCompose(lambda i: i.replace(',', ''), float)          Converts strings to numbers, ignoring any ',' characters.
MapCompose(lambda i: urlparse.urljoin(response.url, i))  Converts relative URLs to absolute URLs.
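A small standalone sketch of how these processors behave (Python 2, matching the book's examples):

from scrapy.loader.processors import MapCompose, Join

print(MapCompose(unicode.strip)([u'  hello ', u' world ']))           # [u'hello', u'world']
print(MapCompose(lambda i: i.replace(',', ''), float)([u'1,200.5']))  # [1200.5]
print(Join()([u'a', u'b', u'c']))                                     # u'a b c'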

add_value adds a single literal value to the item:

def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)

    l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"][1]/text()',
                MapCompose(unicode.strip), Join())
    l.add_xpath('address', '//*[@itemtype="http://schema.org/Place"][1]/text()',
                MapCompose(unicode.strip))
    l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                MapCompose(lambda i: urlparse.urljoin(response.url, i)))

    l.add_value('url', response.url)
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())

The complete spider:

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    # Start on a property page
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        """ This function parses a property page.
        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """
        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)
        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"][1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"][1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                    MapCompose(lambda i: urlparse.urljoin(response.url, i)))
        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())
        return l.load_item()
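The docstring annotations (@url, @returns, @scrapes) are spider contracts; they can be verified from the command line with:

scrapy check basic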

Multiple URLs

When a listing spans multiple pages,
the URLs can be entered manually one by one:

 start_urls = (
       'http://web:9312/properties/property_000000.html',
       'http://web:9312/properties/property_000001.html',
       'http://web:9312/properties/property_000002.html',
)

Or the URLs can be put in a file and read from there:

start_urls = [i.strip() for i in
   open('todo.urls.txt').readlines()]
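A slightly safer variant (a sketch, not from the book) that closes the file and skips blank lines:

with open('todo.urls.txt') as f:
    start_urls = [line.strip() for line in f if line.strip()]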

Crawling can proceed in two directions:
- Horizontal: from one index page to the next; the page layout stays basically the same.
- Vertical: from an index page to a specific item page; the layout changes, e.g. from a listing page to a product page.

urlparse.urljoin(base, URL) is the Python way to join two URLs.
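For example (Python 2 urlparse; in Python 3 the same function lives in urllib.parse; the paths are illustrative):

import urlparse

urlparse.urljoin('http://web:9312/properties/index_00000.html',
                 'property_000000.html')
# 'http://web:9312/properties/property_000000.html'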

Find the set of URLs for horizontal crawling:

urls = response.xpath('//*[@itemprop="url"]/@href').extract()
# [u'property_000000.html', ... u'property_000029.html']

Combine them with urljoin:

[urlparse.urljoin(response.url, i) for i in urls]
# [u'http://..._000000.html', ... u'.../property_000029.html']

Horizontal and vertical crawling

Get both the next-page URLs and the item-page URLs;
it is just a matter of extracting the different URLs and yielding the two kinds of requests:

# Requires: from scrapy.http import Request, plus import urlparse
def parse(self, response):
    # Get the next index page URLs (horizontal crawling)
    next_selector = response.xpath('//*[contains(@class,'
                                   '"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get the item page URLs (vertical crawling)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)

scrapy genspider -t crawl webname web.org
generates a CrawlSpider:

...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']
    rules = (
        Rule(LinkExtractor(allow=r'Items/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...

In the rules, a Rule with a callback hands its pages to parse_item instead of the default parse (and those pages are not followed further unless follow=True), while a Rule without a callback simply follows the extracted links:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item')
)