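Scrapy Starter Project

Project Goals

Create a Scrapy project.
Create a Spider to crawl a site and process the data.
Export the scraped content from the command line.
Save the scraped content to a MongoDB database.

Tools

Scrapy framework
MongoDB
PyMongo library

Creating the Project

Create a Scrapy project. The project files can be generated directly with the scrapy command: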

scrapy startproject tutorial

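This command creates a folder named tutorial with the following structure:

scrapy.cfg            # configuration file used when deploying Scrapy
tutorial              # the project's module; import your code from here
    __init__.py
    items.py          # Item definitions: the structure of the scraped data
    middlewares.py    # Middleware definitions: middlewares used during crawling
    pipelines.py      # Pipeline definitions: the item pipelines
    settings.py       # project settings
    spiders           # folder that holds the Spiders
        __init__.py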

Creating a Spider

Create a .py spider file in the spiders folder, in the following form:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

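There are three attributes here (name, allowed_domains, and start_urls) and one method, parse.

name: the Spider's name. It must be unique within the project and is used to tell Spiders apart.
allowed_domains: the domains the Spider is allowed to crawl. Initial or follow-up requests whose links fall outside these domains are filtered out.
start_urls: the list of URLs the Spider crawls at startup; it defines the initial requests.
parse: a method of the Spider. By default, once a request built from a link in start_urls has been downloaded, the response is passed to this method as its only argument. The method parses the response, extracts data, or generates further requests to process.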
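The effect of allowed_domains can be approximated in a few lines. The sketch below is a simplified stand-in, not Scrapy's actual filtering code: is_allowed is a hypothetical helper that accepts a URL when its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["quotes.toscrape.com"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


print(is_allowed("http://quotes.toscrape.com/page/2/"))
print(is_allowed("http://example.com/"))
```

Under this rule a link to the next page of the same site passes, while a link to an unrelated site is dropped before it is ever requested.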

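Creating an Item

An Item is a container for scraped data and is used much like a dictionary. Compared with a dictionary, though, an Item adds a protection mechanism that guards against misspelled or undeclared field names.
To create an Item, subclass scrapy.Item and define fields of type scrapy.Field, for example: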

import scrapy

class QuoteItem(scrapy.Item):

    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Parsing the Response

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

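The response argument of parse() is the result of crawling the links in start_urls. Inside parse() we can therefore parse the contents of response directly: inspect the page source of the result, analyze that source further, or find links in it to build the next request.
Data can be extracted with either CSS selectors or XPath selectors.

CSS Selectors

Example:
Source: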

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world"> 
            <a class="tag" href="/tag/change/page/1/">change</a>
            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
            <a class="tag" href="/tag/thinking/page/1/">thinking</a>
            <a class="tag" href="/tag/world/page/1/">world</a>
        </div>
    </div>

The different CSS selectors return the following results:
quote.css('.text')

[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“The '>]

quote.css('.text::text')

[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a pr'>]

quote.css('.text').extract()

['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']

quote.css('.text::text').extract()

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

quote.css('.text::text').extract_first()

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”

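Using the Item

The Item was defined above; now it is time to use it. An Item can be thought of as a dictionary, except that it has to be instantiated first. We then assign each field of the Item from the results parsed above, and finally yield the Item.
QuotesSpider is rewritten as follows: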

import scrapy
from tutorial.items import QuoteItem  # import the Item class defined earlier

class QuotesSpider(scrapy.Spider):  # custom spider class, inherits from scrapy.Spider
    name = "quotes"  # spider name
    allowed_domains = ["quotes.toscrape.com"]  # domain the spider may crawl
    start_urls = ['http://quotes.toscrape.com/']  # URLs the crawl starts from

    def parse(self, response):  # parsing/extraction rules
        '''
        <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
            <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
            <span>by <small class="author" itemprop="author">Albert Einstein</small>
            <a href="/author/Albert-Einstein">(about)</a>
            </span>
            <div class="tags">
                Tags:
                <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
                <a class="tag" href="/tag/change/page/1/">change</a>
                <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
                <a class="tag" href="/tag/thinking/page/1/">thinking</a>
                <a class="tag" href="/tag/world/page/1/">world</a>
            </div>
        </div>
        '''
        quotes = response.css('.quote')  # all quote div elements on the current page
        for quote in quotes:
            item = QuoteItem()  # instantiate the Item
            # .text is the CSS selector and ::text selects the node's text content;
            # the selection is a list, so extract_first() returns its first element
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()  # the whole list
            yield item

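Follow-up Requests

Once this page has been scraped, we need the link to the next page. Building the request requires scrapy.Request, to which we pass two arguments, url and callback:

url: the link to request.
callback: the callback function. When the request that specifies this callback completes and a response is received, the engine passes the response to the callback as an argument. The callback parses the response or generates the next request, as parse() does above.

Use a selector to get the next-page link and generate the request by appending the following code to the end of the parse() method: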

next_page = response.css('.pager .next a::attr(href)').extract_first()
if next_page is not None:  # the last page has no "next" link
    url = response.urljoin(next_page)
    yield scrapy.Request(url=url, callback=self.parse)

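The first statement uses a CSS selector to get the link to the next page, that is, the href attribute of the a element. It uses the ::attr(href) pseudo-element and then calls extract_first() to get the value.

The call to urljoin() turns a relative URL into an absolute one. For example, if the next-page address obtained is /page/2/, urljoin() returns http://quotes.toscrape.com/page/2/.

The scrapy.Request call builds a new request from url and callback, with parse() again as the callback. When this request completes, its response goes through parse() once more, which yields the items of the second page and then generates the request for the page after it, the third page. The spider keeps looping this way until it reaches the last page.

With just a few lines of code we have implemented a crawl loop that scrapes the results of every page.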
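The URL resolution step can be tried without running a spider. As a sketch, and assuming the page has no <base> tag, response.urljoin(url) gives the same result as the standard library's urljoin() called with the response's own URL as the base; page_url below stands in for response.url.

```python
from urllib.parse import urljoin

# Stand-in for response.url, the page currently being parsed
page_url = "http://quotes.toscrape.com/page/1/"

# A root-relative href, as found in the "next" link
next_url = urljoin(page_url, "/page/2/")

# An href that is already absolute is returned unchanged
absolute_url = urljoin(page_url, "http://quotes.toscrape.com/tag/change/")

print(next_url)
print(absolute_url)
```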

The complete Spider class after these changes looks like this:

import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
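        # URL of the next page to crawl, taken from markup such as:
        #   <li class="next">
        #       <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        #   </li>
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:  # the last page has no "next" link
            url = response.urljoin(next_page)  # build the absolute URL
            # once the request completes, the engine passes the response to the callback for further parsing
            yield scrapy.Request(url=url, callback=self.parse)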

Running

Change into the project directory and run the following command:

scrapy crawl quotes

Saving to a File

Scrapy's Feed Exports make it easy to output the scraped results. For example, to save the results above as a JSON file, run:

scrapy crawl quotes -o quotes.json

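After the command runs, the project contains a new quotes.json file holding everything that was just scraped, in JSON format.
We can also write one JSON object per line, one line per Item. The output file takes the extension jl, short for JSON Lines: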

scrapy crawl quotes -o quotes.jl

or

scrapy crawl quotes -o quotes.jsonlines
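The difference between the two formats is easy to see with the standard json module. This sketch does not call Scrapy's exporters; it only reproduces the two layouts by hand, using two made-up items.

```python
import json

# Two made-up items with the same fields as QuoteItem
items = [
    {"text": "First example quote", "author": "Author A", "tags": ["a", "b"]},
    {"text": "Second example quote", "author": "Author B", "tags": []},
]

# JSON: the whole result is a single array
as_json = json.dumps(items)

# JSON Lines: one object per line, no enclosing array
as_jsonlines = "\n".join(json.dumps(item) for item in items)

# A .jl file can be read back one line at a time
parsed = [json.loads(line) for line in as_jsonlines.splitlines()]

print(as_json)
print(as_jsonlines)
```

Because every line is a complete JSON document, a JSON Lines file stays valid when new lines are appended and can be processed without loading the whole file, which is not true of a single JSON array.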

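Many other output formats are supported, such as csv, xml, pickle, and marshal, as are remote targets such as ftp and s3. Other outputs can be implemented with a custom ItemExporter.

For example, the following commands produce csv, xml, pickle, and marshal output and an ftp remote output, respectively: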

scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv

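For ftp output, the username, password, host, and output path must all be configured correctly, otherwise the command fails with an error.

Scrapy's Feed Exports make it easy to write scraped results to a file, which is probably enough for small projects. For more complex output, such as writing to a database, we can use an Item Pipeline.

Using Item Pipeline

For more complex operations, such as saving results to a MongoDB database or keeping only certain useful Items, we can define an Item Pipeline.

Once a Spider yields an Item, the Item is automatically sent to the Item Pipeline for processing. Item Pipelines are commonly used to:

Clean HTML data
Validate scraped data and check the scraped fields
Detect and drop duplicates
Store the scraped results in a database

Implementing an Item Pipeline is simple: define a class and implement its process_item() method. Once the Item Pipeline is enabled, Scrapy calls this method automatically. process_item() must either return a dict or Item object containing the data, or raise a DropItem exception.

process_item() takes two arguments. The first is item: every Item the Spider yields is passed in here. The second is spider, the Spider instance.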

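Implement one Item Pipeline that truncates the text of any Item whose text is longer than 50 characters, and a second one that saves the results to MongoDB. The code is as follows: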

import pymongo
from scrapy.exceptions import DropItem

class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item.get('text'):
            # the item has text: check whether it is longer than 50 characters
            if len(item['text']) > self.limit:
                # longer than 50: truncate and append an ellipsis
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # the item has no text: raise DropItem to discard it
            raise DropItem('Missing Text')

# A second Pipeline, MongoPipeline, stores the processed item in MongoDB
class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    # Read the MongoDB connection address (MONGO_URI) and database name (MONGO_DB)
    # from settings.py; both must be defined there
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    # Connect and open the database
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    # This method is required and must accept item and spider; the other methods are optional
    def process_item(self, item, spider):
        name = item.__class__.__name__
        # insert the item into the collection as a dict; insert_one() replaces
        # Collection.insert(), which was removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    # Close the connection
    def close_spider(self, spider):
        self.client.close()
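The truncation rule used by TextPipeline can be checked in isolation. The helper below, truncate, is a hypothetical standalone function that applies the same slice, rstrip(), and ellipsis steps to a plain string.

```python
def truncate(text, limit=50):
    """Cut text to `limit` characters and append '...' if it was longer."""
    if len(text) > limit:
        # rstrip() removes whitespace left at the cut point before the ellipsis
        return text[0:limit].rstrip() + "..."
    return text


short = truncate("A short quote.")
long_text = truncate("x" * 60)

print(short)
print(long_text)
```

Note that the result can be up to three characters longer than the limit, because the ellipsis is added after the cut.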

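MongoPipeline implements several other methods defined by the pipeline API.

from_crawler: a class method, marked with @classmethod, that works as a form of dependency injection. Its argument is crawler, through which we can read every global setting. In settings.py we can define MONGO_URI and MONGO_DB to specify the address and database name for the MongoDB connection; after reading them, the method returns an instance of the class. Its main purpose is therefore to fetch configuration from settings.py.
open_spider: called when the Spider is opened. Here it performs initialization by opening the database connection.
close_spider: called when the Spider is closed. Here it closes the database connection.

The most important method, process_item(), performs the insert.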
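The dependency-injection role of from_crawler can be shown without Scrapy or MongoDB. In this sketch FakeCrawler and ConfiguredPipeline are hypothetical stand-ins: the fake crawler only exposes a settings dict, which is enough because from_crawler reads values with settings.get().

```python
class FakeCrawler:
    """Hypothetical stand-in for Scrapy's crawler; it only exposes .settings."""

    def __init__(self, settings):
        self.settings = settings  # a plain dict, read here with .get()


class ConfiguredPipeline:
    """Hypothetical pipeline that keeps only the configuration part of MongoPipeline."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the values from the settings and hand them to __init__
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DB"),
        )


crawler = FakeCrawler({"MONGO_URI": "localhost", "MONGO_DB": "tutorial"})
pipeline = ConfiguredPipeline.from_crawler(crawler)

print(pipeline.mongo_uri, pipeline.mongo_db)
```

Keeping the settings lookup in from_crawler means __init__ takes ordinary arguments, so the class can also be constructed directly in a test with any values.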

With TextPipeline and MongoPipeline defined, we need to enable them in settings.py and also define the MongoDB connection settings.
Add the following to settings.py:

# ITEM_PIPELINES is a dict: each key is the import path of a Pipeline class and each
# value is its priority, a number; the lower the number, the earlier the Pipeline runs.
ITEM_PIPELINES = {
   'tutorial.pipelines.TextPipeline': 300,
   'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial'
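The ordering rule can be illustrated by sorting the dict by its values. This only demonstrates the rule that lower numbers run first; it is not how Scrapy builds its pipeline chain internally, and the entries are deliberately listed out of order.

```python
ITEM_PIPELINES = {
    "tutorial.pipelines.MongoPipeline": 400,
    "tutorial.pipelines.TextPipeline": 300,
}

# Sort by priority value, lowest first, regardless of the order in the dict
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda entry: entry[1])
]

print(run_order)
```

With these values TextPipeline runs before MongoPipeline, so the text is truncated before the item is written to the database.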

Then run the crawl again:

scrapy crawl quotes

When the crawl finishes, MongoDB contains a database named tutorial with a collection named QuoteItem.

