Python Web Crawlers (18): Basic Scrapy Usage

Original

止步听风

2020-07-04 17:05

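Scrapy is a fairly powerful crawling tool, so it is covered here across several articles.

Creating a project

A Scrapy project is created from the terminal. In the directory where the project should live, run: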

scrapy startproject projectname

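This creates a project named projectname under the terminal's current directory.

Project structure

Creating the project generates a set of files that together make up the crawler project. The main ones are:

items.py: holds the data models for the scraped data, that is, the data structure is defined here in advance
middlewares.py: holds the middlewares, both Downloader middlewares and Spider middlewares
pipelines.py: stores the items whose models are defined in items.py
settings.py: the crawler's configuration (delay, pipelines, headers, etc.)
scrapy.cfg: the project configuration file
spiders folder: holds the spider files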

Creating a spider

You can create a spider file by hand under spiders, but Scrapy also provides a command that generates one automatically:

scrapy genspider spiderfilename domainname

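spiderfilename: the file name of the spider (it is also used as the spider's name)
domainname: the domain the spider will crawl, that is, it crawls resources located under this domain

Running the command generates a file named spiderfilename.py, which mainly contains: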

import scrapy

class spiderfilename(scrapy.Spider):
    name = 'spiderfilename'
    allowed_domains = ['domainname']
    start_urls = ['http://domainname/']  # start URLs need a scheme such as http://

    def parse(self, response):
        pass

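The code above defines a spider. In Scrapy a spider is implemented as a class that inherits from scrapy.Spider. Its attributes and methods are:

name: the name of the spider
allowed_domains: the domain given when the spider was created; the spider only crawls pages under this domain
start_urls: the URLs the spider starts from. They do not have to be domainname itself, but they must be pages under domainname
parse: in Scrapy's architecture, the Engine hands the response returned by the Downloader back to the Spider, and the spider passes it to the parse method. Within that architecture, this method has two main jobs: extracting data and generating the next request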

Once the spider is written, run it from the terminal:

scrapy crawl spidername

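This runs the spider named spidername. Running it from the terminal every time means switching back and forth between windows while debugging, which gets tedious. Instead, you can create a .py file in the project directory and launch the spider with the cmdline module: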

from scrapy import cmdline

cmdline.execute("scrapy crawl spidername".split())

This removes the window switching and makes debugging faster.

settings.py

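This file contains the crawler's settings. A few options need to be set here:

ROBOTSTXT_OBEY: set this to False; the generated settings.py has it as True. True means the crawler obeys the robots protocol: it fetches the site's robots.txt first and skips any request that file disallows
DEFAULT_REQUEST_HEADERS: the default request headers. You can add a User-Agent here so the request looks like it comes from a browser rather than a crawler
DOWNLOAD_DELAY: the delay between downloads, so the site is not hit too quickly
ITEM_PIPELINES: enables the pipelines defined in pipelines.py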
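
As a rough illustration, the four options might look like this in settings.py. The values are assumptions, not the author's actual configuration: the User-Agent string is a placeholder, the one-second delay is arbitrary, and the project name base is inferred from the import `from base.items import BaseItem` used later in this article.

```python
# settings.py (fragment) -- illustrative values only

# Do not consult robots.txt before crawling
ROBOTSTXT_OBEY = False

# Headers sent with every request unless a request overrides them
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Placeholder browser User-Agent; substitute a real one
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 1

# Enable the pipeline; the number (0-1000) sets the order, lower runs first
ITEM_PIPELINES = {
    'base.pipelines.BasePipeline': 300,
}
```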

spiders

The spider file in the spiders folder contains:

# -*- coding: utf-8 -*-
import scrapy
from base.items import BaseItem


class QsbkSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/text/page/1/']
    base_url = 'http://www.qiushibaike.com'

    def parse(self, response):
        res_parse = response.xpath("//div[@class='col1 old-style-col1']/div")
        for res in res_parse:
            author = res.xpath("./div[@class='author clearfix']//img/@alt").get()
            content = res.xpath(".//div[@class='content']//text()").getall()
            content = ''.join(content).replace('\n','')
            item = BaseItem(author=author,content=content)
            yield item

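        # Look up the next-page link first; .get() returns None on the last page,
        # so concatenate only after checking it
        next_href = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()
        if next_href:
            yield scrapy.Request(self.base_url + next_href, callback=self.parse)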

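The code above has two main parts:

The body of the for loop parses the response, which comes from the Downloader, and yields each item so that it is passed on to pipelines.py
The code after the for loop builds the next request; scrapy.Request names parse as the callback again, so the next page is handled in the same way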
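
The next-page URL in the spider above is built by concatenating base_url with the extracted href, which works as long as the href is a root-relative path. A slightly more forgiving option is URL joining, which also copes with hrefs that are already absolute. The sketch below uses the standard library's urljoin; the two href values are hypothetical. Inside a spider, response.urljoin(href) does the same thing relative to the response's own URL.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed
page_url = "http://www.qiushibaike.com/text/page/1/"

# A root-relative href is resolved against the site root
relative = urljoin(page_url, "/text/page/2/")

# An href that is already absolute is returned unchanged
absolute = urljoin(page_url, "http://www.qiushibaike.com/text/page/3/")
```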

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

This file defines the data model, which corresponds to the item yielded by the spider.

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter


class BasePipeline(object):
    def __init__(self):
        self.fp = open('joke.json','wb')
        self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')

    def open_spider(self,spider):
        print('Spider start.')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self,spider):
        # Close the output file so everything written is flushed to disk
        self.fp.close()
        print('Spider end.')

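The code above contains three methods with fixed meanings:

open_spider: runs before the spider starts
process_item: runs each time the spider yields an item; this is where the data can be stored
close_spider: runs after the spider has finished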

The data can be stored with plain JSON, or with JsonItemExporter, JsonLinesItemExporter, and the other exporters in scrapy.exporters, which defines exporters for many storage formats.
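
As a minimal sketch of the plain-JSON route, the pipeline below writes one JSON object per line using only the standard json module. The class name JsonLinesPipeline, the path parameter, and the sample item are hypothetical; the three hook methods are called by hand here, in the order Scrapy would call them.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path="joke.jl"):
        self.path = path

    def open_spider(self, spider):
        # Open the output file once, when the spider starts
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for both scrapy.Item objects and plain dicts
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.fp.close()


# Drive the hooks manually to show the call order
path = os.path.join(tempfile.mkdtemp(), "joke.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"author": "someone", "content": "a joke"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as fp:
    lines = fp.read().splitlines()
```

Writing one object per line means the file stays valid even if the crawl is interrupted partway through.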

With that, the project can be run directly and will scrape the specified pages.
