CrawlSpider in Scrapy

Original post

砍箱子

2020-02-22 20:17

CrawlSpider

CrawlSpider automatically extracts URLs from each response and submits requests for them.
Command: scrapy genspider -t crawl spidername 'example.cn'

Imported modules

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

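Inherit from CrawlSpider

LinkExtractor(allow=r'Items/'): extracts the URL links that match the regular expression
If an extracted URL is incomplete (relative), CrawlSpider completes it automatically
callback='parse_item': the callback function (optional)
follow=True: whether to keep extracting URL links from the responses to the extracted links
Multiple Rule objects can be added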
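
The automatic URL completion noted above amounts to resolving each extracted link against the URL of the page it was found on. The following is a minimal sketch of that idea using only the standard library; it is not Scrapy's own code, and the page URL and href values are hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical page URL and link values, chosen only for illustration.
page_url = "http://example.cn/list/page1.html"
hrefs = ["Items/1.html", "/Items/2.html", "http://example.cn/Items/3.html"]

# Resolve each href against the URL of the page it was found on.
absolute_urls = [urljoin(page_url, href) for href in hrefs]

for url in absolute_urls:
    print(url)
```

A relative href is resolved against the page's directory, a root-relative href against the site root, and an absolute URL is left unchanged.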

class PspiderSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['']
    start_urls = ['']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

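You can also define your own functions to process the data
Do not define a parse method (CrawlSpider uses parse internally to apply its rules)
Data can also be passed on with yield
Content can be extracted with regular expressions
Content can be extracted with XPath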

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        # import re
        #item['description'] = re.findall('', response.body.decode())[0]
        return item

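Supplementary notes:
- More common LinkExtractor parameters:
  - allow: URLs matching the regular expression(s) are extracted; if empty, every URL matches.
  - deny: URLs matching the regular expression(s) are never extracted (takes precedence over allow).
  - allow_domains: the domains whose links will be extracted.
  - deny_domains: the domains whose links will never be extracted.
  - restrict_xpaths: XPath expression(s) that work together with allow to filter links; only URLs found inside the regions selected by the XPath are extracted.
- Common spiders.Rule parameters:
  - link_extractor: a LinkExtractor object that defines which links to extract.
  - callback: the callback function called for the response to each link obtained from the link extractor.
  - follow: a boolean specifying whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False.
  - process_links: names the spider method that is called with each list of links obtained from link_extractor; mainly used to filter URLs.
  - process_request: names the spider method that is called for every request extracted by this rule; used to filter requests.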
