[Web Scraping] Scrapy Item Pipeline

Original

2018-08-26 15:59

[Original source] https://doc.scrapy.org/en/latest/topics/item-pipeline.html

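After a spider has scraped an item, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.

Each item pipeline component (sometimes referred to as just an “Item Pipeline”) is a Python class that implements a simple method. It receives an item, performs an action on it, and decides whether the item should continue through the pipeline or be dropped and no longer processed.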

Typical uses of item pipelines are:

cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database

Writing your own item pipeline

Each item pipeline component is a Python class that must implement the following method:

process_item(self, item, spider)

This method is called for every item pipeline component. process_item() must either return a dict with data, return an Item (or any descendant class) object, return a Twisted Deferred, or raise a DropItem exception. Dropped items are no longer processed by further pipeline components.

Parameters:	item (`Item` object or a dict) – the item scraped
	spider (`Spider` object) – the spider which scraped the item
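
As a rough illustration of the contract described above, here is a minimal sketch of a component that either returns the item or raises DropItem. The class name TitleCleanerPipeline and the 'title' field are hypothetical and do not come from the Scrapy docs. The fallback DropItem definition is an assumption added only so the snippet runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed (assumption).
    class DropItem(Exception):
        pass


class TitleCleanerPipeline(object):
    """Hypothetical pipeline: normalizes a 'title' field, drops items without one."""

    def process_item(self, item, spider):
        title = (item.get('title') or '').strip()
        if not title:
            # Raising DropItem stops later pipeline components from seeing this item.
            raise DropItem("Missing title in %s" % item)
        item['title'] = title
        # Returning the item passes it on to the next component.
        return item
```

The spider argument is unused here, but it still has to be accepted because process_item() is called with it.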

Additionally, they may also implement the following methods:

open_spider(self, spider)

This method is called when the spider is opened.

Parameters:	spider (`Spider` object) – the spider which was opened

close_spider(self, spider)

This method is called when the spider is closed.

Parameters:	spider (`Spider` object) – the spider which was closed
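
To make the open/close lifecycle concrete, the following sketch sets up per-run state in open_spider() and summarizes it in close_spider(). ItemCounterPipeline and its count and summary attributes are hypothetical names chosen for illustration, not part of Scrapy.

```python
class ItemCounterPipeline(object):
    """Hypothetical pipeline that counts items between open_spider and close_spider."""

    def open_spider(self, spider):
        # Per-run state is set up when the spider opens.
        self.count = 0

    def process_item(self, item, spider):
        # Called once per item while the spider is running.
        self.count += 1
        return item

    def close_spider(self, spider):
        # Built once, when the spider closes.
        self.summary = "scraped %d items" % self.count
```

The same pattern applies to files or database connections: acquire the resource in open_spider() and release it in close_spider().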

from_crawler(cls, crawler)

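If present, this classmethod is called to create a pipeline instance from a Crawler. It must return a new instance of the pipeline. The Crawler object provides access to all Scrapy core components, such as settings and signals; it is a way for the pipeline to access them and hook its functionality into Scrapy.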

Parameters:	crawler (`Crawler` object) – crawler that uses this pipeline
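
A small sketch of how from_crawler() might pass a setting into the constructor. PrefixPipeline, the ITEM_NAME_PREFIX setting, and the 'name' field are made up for this illustration. Only crawler.settings.get(), which the MongoDB example in this post also uses, is assumed to exist.

```python
class PrefixPipeline(object):
    """Hypothetical pipeline configured from a made-up ITEM_NAME_PREFIX setting."""

    def __init__(self, prefix):
        self.prefix = prefix

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration from the crawler's settings and return a new instance.
        return cls(prefix=crawler.settings.get('ITEM_NAME_PREFIX', ''))

    def process_item(self, item, spider):
        # Prepend the configured prefix to the (hypothetical) 'name' field.
        item['name'] = self.prefix + item.get('name', '')
        return item
```

Keeping __init__ free of Scrapy-specific arguments means the pipeline can also be constructed directly, for example in a unit test.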

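Item pipeline examples

Price validation and dropping items with no prices

Let's take a look at the following pipeline, which adjusts the price attribute for those items that do not include VAT (price_excludes_vat attribute), and drops those items which don't contain a price: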

from scrapy.exceptions import DropItem

class PricePipeline(object):

    # Multiplier applied to prices that do not yet include VAT.
    vat_factor = 1.15

    def process_item(self, item, spider):
        # .get() avoids a KeyError when the field is missing entirely.
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

Write items to a JSON file

The following pipeline stores all scraped items (from all spiders) into a single items.jl file, containing one item per line serialized in JSON format:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

Note: The purpose of JsonWriterPipeline is just to introduce how to write item pipelines. If you really want to store all scraped items into a JSON file you should use the Feed exports.

Write items to MongoDB

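In this example we'll write items to MongoDB using pymongo. The MongoDB address and database name are specified in Scrapy settings; the MongoDB collection name is set by the collection_name class attribute.

The main point of this example is to show how to use the from_crawler() method and how to clean up the resources properly: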

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

Take screenshot of item (omitted)

Duplicates filter (omitted)

Activating an Item Pipeline component

To activate an Item Pipeline component you must add its class to the ITEM_PIPELINES setting, like in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It’s customary to define these numbers in the 0-1000 range.
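
The ordering rule can be illustrated with a few lines of plain Python. This is only a sketch of the rule as stated above, not Scrapy's internal implementation, and pipeline_order is a hypothetical helper.

```python
ITEM_PIPELINES = {
    'myproject.pipelines.JsonWriterPipeline': 800,
    'myproject.pipelines.PricePipeline': 300,
}


def pipeline_order(item_pipelines):
    """Return class paths sorted by their integer value, lowest first."""
    return [path for path, _ in sorted(item_pipelines.items(), key=lambda kv: kv[1])]


# PricePipeline (300) comes before JsonWriterPipeline (800),
# regardless of the order in which the dict entries were written.
order = pipeline_order(ITEM_PIPELINES)
```

With these values, an item dropped by PricePipeline never reaches JsonWriterPipeline, so items without a price are not written to items.jl.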
