The Scrapy Framework

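Scrapy is a framework rather than a simple library because it offers considerably more functionality than an ordinary library does. Three of its most commonly used features are link extractors (LinkExtractors), automatic login, and the image downloader.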

Link Extractors (LinkExtractors)

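A spider with a link extractor is generated a little differently from a regular spider: the genspider command takes an extra template argument.
scrapy genspider -t crawl <spider_name> <domain>
If creating and starting spiders from the command line every time feels tedious, you can do what I do and keep a small .py file for creating and starting them.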

from scrapy import cmdline


class RunItem:
    def __init__(self, name, url=None):
        # Spider name
        self.name = name
        # Domain
        self.url = url

    # Start the spider
    def start_item(self):
        command = ['scrapy', 'crawl', self.name]
        print('Spider started')
        cmdline.execute(command)

    # Create a new spider
    def new_item(self, auto_page=False):
        # Create a crawl-template spider (it extracts links from pages automatically)
        if auto_page:
            command = ['scrapy', 'genspider', '-t', 'crawl', self.name, self.url]
            cmdline.execute(command)
        # Create a regular spider
        else:
            command = ['scrapy', 'genspider', self.name, self.url]
            cmdline.execute(command)

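Feel free to use the class above as-is, or write one that suits you better.
For example, to create a spider with a link extractor for the Sunshine Government Inquiry site (陽光問政, wz.sun0769.com), I only need to run RunItem('yg', 'wz.sun0769.com').new_item(True).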

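Once it is created, you will notice two differences between a spider with a link extractor and a regular spider. The first is an extra rules attribute: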

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

The second is that the callback method is named parse_item instead of parse.

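Note: a spider with a link extractor must not define its own def parse(self, response): method. CrawlSpider uses parse internally to apply the rules, so overriding it disables automatic link extraction, unless you deliberately want to reimplement that logic yourself.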

Rule and LinkExtractor

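The rules attribute specifies which links to extract from a site (LinkExtractor) and what to do with each extracted link (Rule). You can define several rules so that different kinds of links are handled differently.
Here is a brief overview of the commonly used arguments of Rule and LinkExtractor.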

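Link-following rule: Rule(link_extractor, [callback, follow, process_links])
- callback: the function called with the response of each URL that matches the rule.
- follow: whether to keep extracting recursively, that is, whether to look for further matching links in the pages reached through this rule.
- process_links: a function that receives the links produced by link_extractor; use it to filter out links you do not want to crawl.
- Less commonly used arguments: cb_kwargs, process_request, errback.
Link extractor: LinkExtractor()
- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: denied URLs. No URL matching this regular expression is extracted.
- allow_domains: allowed domains. Only URLs in the listed domains are extracted.
- deny_domains: denied domains. URLs in the listed domains are never extracted.
- restrict_xpaths: XPath restriction. Links are only taken from the page regions this XPath selects, and they are filtered together with allow.
- Other arguments: tags, attrs, canonicalize, unique, process_value, deny_extensions, restrict_css, strip, restrict_text.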
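
To make process_links more concrete, here is a minimal sketch of a filter function. It runs without Scrapy: the Link namedtuple is only a stand-in for the link objects Scrapy passes in (which also expose a .url attribute), and both the drop_print_links name and the '/print/' marker are made up for illustration.

```python
from collections import namedtuple

# Stand-in for the link objects Scrapy hands to process_links.
# Assumption for this sketch: only the .url attribute is needed.
Link = namedtuple('Link', ['url', 'text'])


def drop_print_links(links):
    """Return only the links whose URL is not a (hypothetical) print view."""
    return [link for link in links if '/print/' not in link.url]


links = [
    Link('http://example.com/Items/1', 'first'),
    Link('http://example.com/print/Items/1', 'printable'),
]
kept = drop_print_links(links)
print([link.url for link in kept])
```

In a CrawlSpider the function would be wired in roughly as Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True, process_links=drop_print_links).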

Hands-on Demo

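A handy property of Scrapy's link extractor: in the browser console we can see that the links on this site are relative rather than complete URLs, but the link extractor completes them into absolute URLs before it applies the filters.

So the pattern we give the link extractor only needs to be derived from a full URL such as http://wz.sun0769.com/political/politics/index?id=451626. Opening several pages shows that only the id changes, so we put \d+ in place of the id (in a regular expression \d matches a digit and + means one or more, so together they mean 'one or more digits').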

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_text.items import Scrapy陽光問政Item


class YgSpider(CrawlSpider):
    name = 'yg'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1']

    rules = (
        # Pagination links (follow=True: keep looking for more matching links in each newly extracted page)
        Rule(LinkExtractor(allow=r'wz\.sun0769\.com/political/index/politicsNewest\?id=1&page=\d+'), follow=True),
        # Detail-page links (callback: pass each detail page to parse_item)
        Rule(LinkExtractor(allow=r'wz\.sun0769\.com/political/politics/index\?id=\d+'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = Scrapy陽光問政Item()
        item['標題'] = response.xpath('//p[@class="focus-details"]/text()').extract()
        item['內容'] = response.xpath('//div[@class="details-box"]/pre/text()').extract()
        item['配圖'] = response.xpath('//div[@class="mr-three"]/div[3]/img/@src').extract()
        print(item)
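
To see why these allow patterns pick out the right pages, the two steps (completing a relative link, then matching it against a regular expression) can be reproduced with the standard library alone. This is only a sketch of the idea, not Scrapy's own code: the relative href is an assumed example of what the page might contain, and the patterns are applied with a regex search against the absolute URL, which is how I understand allow to work.

```python
import re
from urllib.parse import urljoin

# The list page we start from and a relative href as it might appear in its HTML
# (the href value is an assumption for illustration).
page_url = 'http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1'
relative_href = '/political/politics/index?id=451626'

# Step 1: complete the relative link against the page URL.
absolute = urljoin(page_url, relative_href)

# Step 2: test the absolute URL against the two patterns.
detail_pattern = re.compile(r'wz\.sun0769\.com/political/politics/index\?id=\d+')
list_pattern = re.compile(r'wz\.sun0769\.com/political/index/politicsNewest\?id=1&page=\d+')

print(absolute)
print(bool(detail_pattern.search(absolute)))  # detail rule matches
print(bool(list_pattern.search(absolute)))    # pagination rule does not
```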


Automatic Login

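With plain requests there are two ways to log in to a site: find the login form and simulate the user submitting it, or send the cookie of an already logged-in session.
Scrapy supports both. Beyond that, Scrapy can also locate the likely login form on its own, so that we only supply the username and password and it submits the form for us. Below we demonstrate these approaches, starting with the Steam login page (https://steamcommunity.com/).
To be on the safe side, we first change the default request headers in settings.py: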

DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'Accept-Language': 'zh-CN,zh;q=0.9,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.2,en;q=0.1',
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0'
}

Old method - logging in with a cookie and simulating the login form submission

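We want the request for the first URL to carry the login cookie. Scrapy issues that first request automatically by calling the start_requests() method, so overriding this method is all it takes to send the cookie.

Copy the cookie string from the browser as-is, with no preprocessing; the dict comprehension in start_requests below converts it into the dictionary that Scrapy expects.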

# -*- coding: utf-8 -*-
import scrapy


class SteamSpider(scrapy.Spider):
    name = 'Steam'
    allowed_domains = ['store.steampowered.com']
    start_urls = ['https://store.steampowered.com/login/?redir=%3Fsnr%3D1_join_4__global-header&redir_ssl=1']

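    def start_requests(self):
        cookie = 'paste your cookie string here'
        # Convert 'a=1; b=2' into {'a': '1', 'b': '2'}; split only on the first '='
        # so that values containing '=' are kept intact
        cookie = {i.split('=', 1)[0]: i.split('=', 1)[1] for i in cookie.split('; ')}
        yield scrapy.Request(
            url=self.start_urls[0],
            cookies=cookie,
            callback=self.parse
        )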

    def parse(self, response):
        with open('steam.html', 'w', encoding='utf-8') as steam:
            steam.write(response.body.decode())
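
A cookie value can itself contain '=' characters, and a copied string may contain stray fragments, so a slightly more defensive conversion can be useful. The following is a minimal sketch; the parse_cookie_header name and the sample cookie string are invented for illustration.

```python
def parse_cookie_header(raw):
    """Turn a browser cookie string such as 'a=1; b=2' into a dict."""
    cookies = {}
    for pair in raw.split('; '):
        if '=' not in pair:
            continue  # skip empty or malformed fragments
        # Split once, so any '=' inside the value is preserved.
        name, value = pair.split('=', 1)
        cookies[name.strip()] = value
    return cookies


raw = 'sessionid=abc123; token=dG9rZW4=; lang=en'
print(parse_cookie_header(raw))
```

The resulting dictionary can be passed to scrapy.Request through its cookies argument, exactly like the dictionary built in start_requests.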

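The idea behind simulating the login form is to locate the form first and then submit it with the required fields. We use GitHub for this example. Inspecting the login request shows that the form submits not only the username and password but also a timestamp, a secret token, and a few other fields.

POST requests are sent with scrapy.FormRequest(); the code is as follows.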

# -*- coding: utf-8 -*-
import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        commit = 'Sign in'
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        ga_id = response.xpath('//input[@name="ga_id"]/@value').extract_first()
        login = ''
        password = ''
        timestamp = response.xpath('//input[@name="timestamp"]/@value').extract_first()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').extract_first()

        post_data = {
            'commit': commit,
            'authenticity_token': authenticity_token,
            # 'ga_id': ga_id,
            'login': login,
            'password': password,
            'timestamp': timestamp,
            'timestamp_secret': timestamp_secret,
            'webauthn-support': 'supported',
            'webauthn-iuvpaa-support': 'unsupported'
        }
        # print(post_data)

        # Send the POST request
        yield scrapy.FormRequest(
            url='https://github.com/session',
            formdata=post_data,
            callback=self.after_login
        )

    def after_login(self, response):
        # print(response)
        with open('github.html', 'w', encoding='utf-8') as f:
            f.write(response.text)

New method - automatic login

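The advantage of automatic login is that Scrapy locates the form for us and pre-fills its existing fields. We only supply the values we need to enter (username and password); the hidden verification fields are carried over without any handling on our part.
The method to use here is scrapy.FormRequest.from_response().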

# -*- coding: utf-8 -*-
import scrapy


class GithubautoSpider(scrapy.Spider):
    name = 'githubAuto'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response=response,
            formdata={"login": "", "password": ""},
            callback=self.after_login
        )

    def after_login(self, response):
        with open('github.html', 'w', encoding='utf-8') as f:
            f.write(response.text)

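Besides logging in, .from_response() works as a general way to submit forms; for a search, for example, you can submit the search form with .from_response(). If the page contains several forms, use arguments such as formname, formid, or formxpath to select the right one.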
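
To illustrate the idea behind picking a form and pre-filling it, here is a conceptual sketch using only the standard library. It is not Scrapy's implementation: the FormFieldCollector and build_form_data names and the sample HTML are made up, and the sketch only handles input elements. It selects one form by id, keeps the values already present in the page (such as a hidden token), and overrides only the fields we supply.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the <input> name/value pairs of the form with a given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.inside = False   # True while we are inside the wanted form
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.inside = attrs.get('id') == self.form_id
        elif tag == 'input' and self.inside and attrs.get('name'):
            self.fields[attrs['name']] = attrs.get('value') or ''

    def handle_endtag(self, tag):
        if tag == 'form':
            self.inside = False


def build_form_data(html, form_id, overrides):
    """Pre-fill from the page, then apply the values we supply."""
    collector = FormFieldCollector(form_id)
    collector.feed(html)
    data = dict(collector.fields)
    data.update(overrides)
    return data


html = '''
<form id="search"><input name="q" value=""></form>
<form id="login">
  <input type="hidden" name="authenticity_token" value="tok123">
  <input name="login"><input name="password" type="password">
</form>
'''
data = build_form_data(html, 'login', {'login': 'me', 'password': 'secret'})
print(data)
```

With Scrapy itself, the equivalent selection would be expressed by passing formid='login' to scrapy.FormRequest.from_response() together with formdata.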

Image (File) Downloader

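The image downloader (Images Pipeline) and the file downloader (Files Pipeline) are used in much the same way and differ only in a few names, so only the image downloader is demonstrated below.
This time we use PConline (太平洋, product.pconline.com.cn) as the test site and scrape pictures of a phone brand I like.
As before, we first try downloading the old-fashioned way, without the downloader.
The spider itself does not need much code, but we also have to save the images ourselves in a pipeline.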

# -*- coding: utf-8 -*-
import scrapy


class PhoneSpider(scrapy.Spider):
    name = 'phone'
    allowed_domains = ['product.pconline.com.cn']
    start_urls = ['https://product.pconline.com.cn/pdlib/1140887_picture_tag02.html']

    def parse(self, response):
        ul = response.xpath('//div[@id="area-pics"]/div/div/ul/li/a/img')
        for li in ul:
            item = {'圖片': li.xpath('./@src').extract_first()}
            item['圖片'] = 'https:' + item['圖片']
            yield item

import os
from urllib import request

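class Scrapy圖片下載Pipeline(object):
    def process_item(self, item, spider):
        # Save into a '圖片' folder next to this file, creating it if needed
        path = os.path.join(os.path.dirname(__file__), '圖片')
        os.makedirs(path, exist_ok=True)
        # Use the last segment of the URL as the file name
        name = item['圖片'].split('/')[-1]
        print(name, item['圖片'])
        request.urlretrieve(item['圖片'], os.path.join(path, name))
        return item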
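
Taking the last '/'-separated segment of the URL as the file name keeps any query string that happens to follow it. As a hedged alternative, the name can be derived from the URL path only. The helper names and the sample URL below are invented for illustration.

```python
import os
from urllib.parse import urlparse


def image_filename(url):
    """Return the last path segment of the URL, ignoring any query string."""
    return os.path.basename(urlparse(url).path)


def target_path(base_dir, url):
    """Build the local path under base_dir where the image would be saved."""
    return os.path.join(base_dir, image_filename(url))


url = 'https://img.example.com/images/phone/abc_thumb.jpg?v=2'
print(image_filename(url))
print(target_path('images', url))
```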

Using the Image Downloader (Images Pipeline)

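Scrapy's built-in downloader gives you the following:

- It avoids re-downloading data that was downloaded recently.
- The storage path is easy to configure.
- Downloaded images are converted to a common format (JPG).
- Thumbnails are easy to generate.
- Image width and height can be checked against a minimum size.
- Downloads are asynchronous and therefore very efficient.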
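
Several of these features are controlled through settings. The fragment below is a sketch of how they might be tuned in settings.py; the setting names are Scrapy's, while the concrete values and the thumbnail names are illustrative only.

```python
# settings.py (fragment): optional Images Pipeline tuning, values are illustrative

# Do not re-download images that were fetched within the last 30 days.
IMAGES_EXPIRES = 30

# Generate thumbnails; keys are thumbnail names, values are (width, height).
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Drop images that are smaller than this minimum size.
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```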

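Steps for downloading with the Images Pipeline:

- Define an Item with two fields, image_urls and images. image_urls holds the URLs of the files to download and must be a list.
- When a download finishes, information about it (storage path, source URL, image checksum, and so on) is stored in the item's images field.
- In settings.py, set IMAGES_STORE to the directory where the files should be saved.
- Enable the pipeline by adding 'scrapy.pipelines.images.ImagesPipeline': 1 to ITEM_PIPELINES.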
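
The two settings from the steps above might look like this in settings.py. This is a minimal sketch: the 'images' directory name is an arbitrary choice, and the path is built from the current working directory only to keep the fragment self-contained.

```python
# settings.py (fragment)
import os

# Directory where downloaded images are stored ('images' is an arbitrary name).
IMAGES_STORE = os.path.join(os.getcwd(), 'images')

# Enable the built-in image pipeline; the number is its order among pipelines.
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
```

On the item side, the class in items.py needs the two fields named in the steps, declared as image_urls = scrapy.Field() and images = scrapy.Field().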
Once the configuration above is in place, the following spider is all it takes to download the images.

# -*- coding: utf-8 -*-
import scrapy
from scrapy_text.items import Scrapy圖片下載Item

class PhoneautoSpider(scrapy.Spider):
    name = 'phoneAuto'
    allowed_domains = ['product.pconline.com.cn']
    start_urls = ['https://product.pconline.com.cn/pdlib/1140887_picture_tag02.html']

    def parse(self, response):
        ul = response.xpath('//div[@id="area-pics"]/div/div/ul/li/a/img')
        for li in ul:
            item = Scrapy圖片下載Item()
            item['image_urls'] = li.xpath('./@src').extract_first()
            item['image_urls'] = ['https:' + item['image_urls']]
            yield item

The source of the image downloader lives in scrapy.pipelines.images (from scrapy.pipelines.images import ImagesPipeline); have a look if you are curious.

Fixing the image downloader error ModuleNotFoundError: No module named 'PIL'

The error says the PIL library is missing, but PIL has no Python 3 release and has been abandoned, so install Pillow, its Python 3 compatible fork, instead:
pip install pillow

Using the File Downloader (Files Pipeline)

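- Define an Item with two fields, file_urls and files. file_urls holds the URLs of the files to download and must be a list.
- When a download finishes, information about it (storage path, source URL, file checksum, and so on) is stored in the item's files field.
- In settings.py, set FILES_STORE to the directory where the files should be saved.
- Enable the pipeline by adding 'scrapy.pipelines.files.FilesPipeline': 1 to ITEM_PIPELINES.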

The source lives in scrapy.pipelines.files (from scrapy.pipelines.files import FilesPipeline). Usage mirrors the image downloader above, so it is not demonstrated again here.
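
For reference, the settings side of the Files Pipeline can be sketched as follows. The 'downloads' directory name is an assumption; the pipeline path is the module path named above.

```python
# settings.py (fragment)

# Directory where downloaded files are stored ('downloads' is an arbitrary name).
FILES_STORE = 'downloads'

# Enable the built-in file pipeline.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
```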

What is Scrapy and how do you use it? Advanced Scrapy usage: link extractors, automatic login, and the image (file) downloader (written for Scrapy 2.0+)
