[Web Scraping] Scrapy Item

[Original link] https://doc.scrapy.org/en/latest/topics/items.html

 

Items

The main goal in scraping is to extract structured data from unstructured sources, typically web pages. Scrapy spiders can return the extracted data as Python dicts. While convenient and familiar, Python dicts lack structure: it is easy to make a typo in a field name or return inconsistent data, especially in a larger project with many spiders.

To define a common output data format, Scrapy provides the Item class. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.

Various Scrapy components use the extra information provided by Items: exporters look at the declared fields to figure out which columns to export, serialization can be customized using Item field metadata, trackref tracks Item instances to help find memory leaks (see Debugging memory leaks with trackref), etc.

Declaring Items

Items are declared using a simple class definition syntax and Field objects. Here is an example:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

Note

Those familiar with Django will notice that Scrapy Items are declared similarly to Django Models, except that Scrapy Items are much simpler as there is no concept of different field types.

Item Fields

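Field objects are used to specify metadata for each field. For example, the serializer function for the last_updated field in the code above.

You can specify any kind of metadata for each field. There is no restriction on the values accepted by Field objects. For this same reason, there is no reference list of all available metadata keys. Each key defined in Field objects could be used by a different component, and only those components know about it. You can also define and use any other Field key in your project, for your own needs. The main goal of Field objects is to provide a way to define all field metadata in one place. Typically, those components whose behaviour depends on each field use certain field keys to configure that behaviour. You must refer to their documentation to see which metadata keys are used by each component.

It's important to note that the Field objects used to declare the item do not stay assigned as class attributes. Instead, they can be accessed through the Item.fields attribute.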

Working with Items

Here are some examples of common tasks performed with items, using the Product item declared above. You will notice the API is very similar to the dict API.

Creating items

>>> product = Product(name='Desktop PC', price=1000)
>>> print(product)
Product(name='Desktop PC', price=1000)

Getting field values

>>> product['name']
'Desktop PC'
>>> product.get('name')
'Desktop PC'

>>> product['price']
1000

>>> product['last_updated']
Traceback (most recent call last):
    ...
KeyError: 'last_updated'

>>> product.get('last_updated', 'not set')
'not set'

>>> product['lala'] # getting unknown field
Traceback (most recent call last):
    ...
KeyError: 'lala'

>>> product.get('lala', 'unknown field')
'unknown field'

>>> 'name' in product  # is name field populated?
True

>>> 'last_updated' in product  # is last_updated populated?
False

>>> 'last_updated' in product.fields  # is last_updated a declared field?
True

>>> 'lala' in product.fields  # is lala a declared field?
False

Setting field values

>>> product['last_updated'] = 'today'
>>> product['last_updated']
'today'

>>> product['lala'] = 'test' # setting unknown field
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

Accessing all populated values

To access all populated values, just use the typical dict API:

>>> product.keys()
['price', 'name']

>>> product.items()
[('price', 1000), ('name', 'Desktop PC')]

Other common tasks

Copying items:

>>> product2 = Product(product)
>>> print(product2)
Product(name='Desktop PC', price=1000)

>>> product3 = product2.copy()
>>> print(product3)
Product(name='Desktop PC', price=1000)

Creating dicts from items:

>>> dict(product) # create a dict from all populated values
{'price': 1000, 'name': 'Desktop PC'}

Creating items from dicts:

>>> Product({'name': 'Laptop PC', 'price': 1500})
Product(price=1500, name='Laptop PC')

>>> Product({'name': 'Laptop PC', 'lala': 1500}) # warning: unknown field in dict
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

Extending Items

You can extend Items (to add more fields or to change some metadata for some fields) by declaring a subclass of your original Item.

For example:

class DiscountedProduct(Product):
    discount_percent = scrapy.Field(serializer=str)
    discount_expiration_date = scrapy.Field()

You can also extend field metadata by using the previous field metadata and appending more values, or changing existing values, like this:

class SpecificProduct(Product):
    name = scrapy.Field(Product.fields['name'], serializer=my_serializer)

That adds (or replaces) the serializer metadata key for the name field, keeping all the previously existing metadata values.
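Because Field objects behave like plain dicts (see Field objects below), the merge performed by scrapy.Field(Product.fields['name'], serializer=my_serializer) can be sketched with dict alone. The "required" key and the body of my_serializer are hypothetical stand-ins chosen for this illustration.

```python
def my_serializer(value):
    # Hypothetical serializer used only for this illustration.
    return str(value).upper()


# Stand-in for Product.fields['name'] with two existing metadata keys.
base_meta = {"serializer": str, "required": True}

# Same call shape as scrapy.Field(base_meta, serializer=my_serializer):
# the positional mapping comes first, then keyword overrides.
extended_meta = dict(base_meta, serializer=my_serializer)

print(extended_meta["serializer"] is my_serializer)  # the key was replaced
print(extended_meta["required"])                     # the old key survived
print(base_meta["serializer"] is str)                # the original is untouched
```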

Item objects

class scrapy.item.Item([arg])

Return a new Item optionally initialized from the given argument. Items replicate the standard dict API, including its constructor. The only additional attribute provided by Items is:

fields

A dictionary containing all declared fields for this Item, not only those populated. The keys are the field names and the values are the Field objects used in the Item declaration.

Field objects

class scrapy.item.Field([arg])

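The Field class is just an alias to the built-in dict class and doesn't provide any extra functionality or attributes. In other words, Field objects are plain old Python dicts. A separate class is used to support the item declaration syntax based on class attributes.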
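To make the idea of a declaration syntax based on class attributes more concrete, here is a simplified, standard-library-only sketch of how such a mechanism could work. It is not Scrapy's actual implementation: the Field, ItemMeta, and MiniItem classes below are stand-ins written for this illustration.

```python
class Field(dict):
    """Plain dict holding per-field metadata (stand-in for scrapy's Field)."""


class ItemMeta(type):
    """Collect Field class attributes into a `fields` dict."""

    def __new__(mcs, name, bases, attrs):
        fields = {}
        # Inherit fields declared on parent classes first.
        for base in bases:
            fields.update(getattr(base, "fields", {}))
        # Move Field attributes out of the class namespace into `fields`.
        for key in [k for k, v in attrs.items() if isinstance(v, Field)]:
            fields[key] = attrs.pop(key)
        attrs["fields"] = fields
        return super().__new__(mcs, name, bases, attrs)


class MiniItem(metaclass=ItemMeta):
    """Tiny dict-like container that only accepts declared fields."""

    def __init__(self, **values):
        self._values = {}
        for key, value in values.items():
            self[key] = value

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]


class Product(MiniItem):
    name = Field()
    price = Field(serializer=str)


product = Product(name="Desktop PC", price=1000)
print(sorted(Product.fields))
print(hasattr(Product, "name"))
```

The metaclass runs once per class definition, which is why the Field objects end up in fields rather than staying as class attributes, and why subclasses inherit their parents' declared fields.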
