Scrapy Framework: Creating a New Scrapy Project, Explained

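Preface

Starting with this post, I'll walk you through writing crawlers with the Scrapy framework. Compared with the standalone crawler scripts from earlier posts, using Scrapy feels much more like building a real crawler project.

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.

Scrapy uses the asynchronous networking framework Twisted (/ˈtwɪstɪd/) to handle network communication. This speeds up downloads without requiring us to implement an asynchronous framework ourselves, and it includes various middleware interfaces so that all kinds of requirements can be met flexibly.

Put simply, Scrapy is a framework that makes building crawlers convenient. So let's look at our first Scrapy project. (An earlier post on crawling Duitang already used it; here we go through the whole process from start to finish.)

Today's focus is how to create a Scrapy project, plus Scrapy's basic structure and configuration.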

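Main Text

First, installing Scrapy. You can install it from within PyCharm, download it manually, or install it from the command line. Here I'll show how to install it directly in PyCharm.

In Settings, click Project Interpreter. The right-hand side lists the packages already installed for the Python interpreter you are currently using. I had already installed Scrapy, and PyCharm was offering an upgrade to Scrapy 1.6.0.

Then click the green plus sign on the far right and search for Scrapy in the box at the top. scrapy appears in the list below, with its latest version shown on the right. Click Install Package at the bottom and it will be installed after a few minutes.

Next, let's see how to create a Scrapy project. Open the cmd command line. What? You don't know how to open it...

Press Win + R to open the Run dialog, type cmd, and press Enter.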

Then change into the folder where you usually keep your code and type

scrapy startproject myscrapy

and press Enter. myscrapy is a project name of your own choosing. The reason my command was

python3 -m scrapy startproject myscrapy

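is that I had both Python 2 and Python 3 installed at the time, so I had to distinguish between them when running commands. If you don't have both versions installed, you don't need to add python3 -m the way I did.

At this point the Scrapy project has been created, but we still need a .py file to hold the main crawler code. After creating the project, Scrapy also prompts us to enter the project directory and create a spider file: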

cd myscrapy
scrapy genspider first "bilibili.com"

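first is the spider name, and "bilibili.com" is the domain of the site to crawl, which you can set to whatever you need.

With that, the Scrapy project is fully set up. The complete project structure is as follows.

Here:

scrapy.cfg: the project's configuration file
myscrapy/: the project's Python module; you will add your code here later
myscrapy/spiders/: the directory where spider code goes
myscrapy/items.py: the project's item file
myscrapy/pipelines.py: the project's pipelines file
myscrapy/middlewares.py: the project's middlewares file
myscrapy/settings.py: the project's settings file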
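
As a quick sanity check on the layout listed above, the following sketch reports which of those paths are missing from a project directory. The helper name missing_paths and the EXPECTED_PATHS constant are made up for this illustration; they are not part of Scrapy.

```python
from pathlib import Path

# Paths taken from the project layout described above.
EXPECTED_PATHS = [
    "scrapy.cfg",
    "myscrapy/spiders",
    "myscrapy/items.py",
    "myscrapy/pipelines.py",
    "myscrapy/middlewares.py",
    "myscrapy/settings.py",
]


def missing_paths(project_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling missing_paths("myscrapy") from the folder where startproject was run gives an empty list when every listed file and directory is present.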

One more thing: Scrapy has more commands than just startproject and genspider. To see the rest, just type

scrapy

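and press Enter. Note that the commands available outside a project directory differ slightly from those available inside one.

As for what each command does, you can look it up in a search engine, so I won't go into detail. (Who says I don't know? You remember these little things better when you've searched for them yourself.)

That's it for the project structure and for creating a Scrapy project. Next, let's look at what Scrapy is actually made of.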

first.py

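First, the first.py file. This is where we will write the main body of the crawler. The generated file contains a FirstSpider class that inherits from scrapy.Spider.

name is the spider name; you need it later when running the spider (for example, scrapy crawl first)
allowed_domains is the list of domains the spider is allowed to crawl
start_urls is the tuple or list of initial URLs
parse is the default callback for Request objects, used when a request does not specify its own callback. It processes the response returned for the page and generates Item or Request objects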

# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    allowed_domains = ['bilibili.com']
    start_urls = ['http://bilibili.com/']

    def parse(self, response):
        pass

items.py

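An Item defines structured data fields and is used to hold the scraped data.

You define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field.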

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    name = scrapy.Field()

middlewares.py

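The middlewares file defines two middleware classes, MyscrapySpiderMiddleware and MyscrapyDownloaderMiddleware.

Downloader middleware is a framework of hooks into Scrapy's request/response processing. It is a lightweight, low-level system for globally altering Scrapy's requests and responses.

To use a downloader middleware, you have to activate it. To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting in settings.py. This setting is a dict whose keys are the middleware class paths and whose values are the middleware orders.

Of course, you can also write your own middleware.

To enable the generated classes, just uncomment these lines in settings.py: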

#SPIDER_MIDDLEWARES = {
#    'myscrapy.middlewares.MyscrapySpiderMiddleware': 543,
#}

#DOWNLOADER_MIDDLEWARES = {
#    'myscrapy.middlewares.MyscrapyDownloaderMiddleware': 543,
#}

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class MyscrapySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MyscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

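After an Item has been collected in the Spider, it is passed to the Item Pipeline, where several components process it in a defined order.

Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item and performs some action on it, and it also decides whether the Item continues through the pipeline or is dropped and no longer processed.

Typical uses of item pipelines include:

Cleaning HTML data
Validating scraped data (checking that the item contains certain fields)
Checking for duplicates (and dropping them)
Storing the scraped results in a database

Likewise, to use a pipeline you need to enable it in settings.py: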

#ITEM_PIPELINES = {
#    'myscrapy.pipelines.MyscrapyPipeline': 300,
#}

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MyscrapyPipeline(object):
    def process_item(self, item, spider):
        return item

settings.py

Scrapy settings provide a way to customize Scrapy components. They can control the core, extensions, pipelines, and spiders; for example, this is where you would enable a JSON pipeline or set LOG_LEVEL.
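
For orientation, here is a short excerpt of what a settings.py for this project might contain. The setting names are real Scrapy settings, but the values chosen here (log level, delay) are only illustrative assumptions.

```python
# myscrapy/settings.py (excerpt, illustrative values)

BOT_NAME = "myscrapy"
SPIDER_MODULES = ["myscrapy.spiders"]
NEWSPIDER_MODULE = "myscrapy.spiders"

LOG_LEVEL = "INFO"       # Scrapy's default is DEBUG, which is much noisier
ROBOTSTXT_OBEY = True    # respect robots.txt rules
DOWNLOAD_DELAY = 1       # seconds to wait between requests

# Enable the generated pipeline; lower numbers run earlier.
ITEM_PIPELINES = {
    "myscrapy.pipelines.MyscrapyPipeline": 300,
}
```

Since settings.py is an ordinary Python module, each setting is just a module-level variable.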

Conclusion

That covers the basics of a Scrapy project.

