Using Scrapy's CrawlSpider

Official documentation: https://docs.scrapy.org/en/latest/topics/spiders.html#crawlspider

CrawlSpider defines a set of rules for extracting links, which can greatly simplify writing a spider.

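rules is a list of Rule objects. Each Rule defines one way of crawling the site. If several rules match the same link, the first matching rule, in the order the rules are defined, is used.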
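The "first rule wins" behaviour can be pictured with a small pure-Python model. This is a simplified sketch and not Scrapy's own code: the rule names, the patterns, and the assign_links helper are made up for illustration, and plain regular expressions stand in for real link extractors.

```python
import re

# Hypothetical rules: (name, pattern) pairs, checked in definition order.
RULES = [
    ("follow_category", re.compile(r"category\.php")),
    ("parse_item", re.compile(r"\.php")),  # deliberately overlaps with the first rule
]


def assign_links(urls, rules=RULES):
    """Map each URL to the first rule whose pattern matches it."""
    seen = set()
    assigned = {}
    for name, pattern in rules:
        for url in urls:
            if url not in seen and pattern.search(url):
                seen.add(url)  # later rules skip URLs that were already claimed
                assigned[url] = name
    return assigned


if __name__ == "__main__":
    sample_urls = ["/category.php?id=1", "/item.php?id=7", "/about.html"]
    for url, rule_name in assign_links(sample_urls).items():
        print(url, "->", rule_name)
```

Although the second pattern also matches the category URL, that URL stays with the first rule because it was claimed earlier. A URL matched by no rule is simply not followed.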

parse_start_url(response) handles the responses for start_urls. It must return an Item object, a Request object, or an iterable containing either.

Using the crawling rule, Rule

scrapy.spiders.Rule(link_extractor, 
                    callback=None, 
                    cb_kwargs=None, 
                    follow=None, 
                    process_links=None, 
                    process_request=None)

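link_extractor is a link extractor object that defines how links are extracted;

callback is the callback function. Take care not to use parse as the callback, because CrawlSpider relies on parse for its own logic;

cb_kwargs is a dict of keyword arguments passed to the callback;

follow is a boolean that specifies whether the extracted links are followed. If callback is None, follow defaults to True; otherwise it defaults to False;

process_links can process the links extracted by link_extractor, mainly for filtering;

process_request is a callable that handles every request produced by this Rule and returns a request or None.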

Using the link extractor, link_extractor

from scrapy.linkextractors import LinkExtractor

LinkExtractor is used in the same way as LxmlLinkExtractor, so the official documentation describes the latter. LxmlLinkExtractor is built on lxml's HTMLParser:

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), 
                                                       deny=(), 
                                                       allow_domains=(), 
                                                       deny_domains=(), 
                                                       deny_extensions=None, 
                                                       restrict_xpaths=(), 
                                                       restrict_css=(), 
                                                       tags=('a', 'area'), 
                                                       attrs=('href', ), 
                                                       canonicalize=False, 
                                                       unique=True, 
                                                       process_value=None, 
                                                       strip=True)

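allow: (one, or a list) regular expressions that a link must match to be extracted. If empty, all links match;

deny: (one, or a list) regular expressions that a link must match to be excluded. It takes precedence over allow. If empty, no links are excluded;

allow_domains: (one, or a list) the domains from which links are extracted;

deny_domains: (one, or a list) the domains from which links are not extracted;

deny_extensions: (one, or a list) the file extensions to ignore. If not given, it defaults to the IGNORED_EXTENSIONS list in the scrapy.linkextractors package, shown below: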

IGNORED_EXTENSIONS = [
    # images
    'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
    'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

    # audio
    'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

    # video
    '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
    'm4a', 'm4v', 'flv',

    # office suites
    'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg',
    'odp',

    # other
    'css', 'pdf', 'exe', 'bin', 'rss', 'zip', 'rar',
]

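restrict_xpaths: (one, or a list) XPath expressions that define which regions of the response links are extracted from;

restrict_css: (one, or a list) CSS selectors that define which regions of the response links are extracted from;

tags: (one, or a list) the tags to consider when extracting links. Defaults to ('a', 'area');

attrs: (one, or a list) the attributes to look at when extracting links, only on the tags listed in tags. Defaults to ('href',);

canonicalize: (boolean) whether to canonicalize each extracted URL. Leaving it at False is recommended;

unique: (boolean) whether to filter out duplicate links;

process_value: (callable) can modify the values found while scanning the tags and attributes. Below is the example from the official documentation: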

# A link to extract
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

# What we want to extract is "../other/page.html"
import re

def process_value(value):
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
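Note that the function above returns None for any value that does not match, and the link extractor ignores a link when process_value returns None. If ordinary links should be kept as well, one possible variant (a sketch, with the GOTO_PAGE constant being a name chosen here) passes unmatched values through unchanged:

```python
import re

# Pattern taken from the example above; compiled once for reuse.
GOTO_PAGE = re.compile(r"javascript:goToPage\('(.*?)'")


def process_value(value):
    """Return the real target of a goToPage() href; pass other values through."""
    m = GOTO_PAGE.search(value)
    return m.group(1) if m else value
```

With this variant, a value such as "/item.php?id=1" comes back unchanged, while the javascript: value from the example is reduced to "../other/page.html".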

strip: (boolean) whether to strip whitespace from the extracted attribute values. Enabled by default.

The CrawlSpider example from the official documentation:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not 'subsection.php').
        # No callback is set, so follow defaults to True and the links matched by this rule keep being followed.
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the parse_item method.
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = {}  # plain dict; a bare scrapy.Item() declares no fields and rejects these keys
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

I find XPath more convenient. Taking the Maitian rental listings as the example again: http://bj.maitian.cn/zfall/PG1

Rules like these are enough:

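rules = (
    # Follow the pagination links
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"down_page")]')),
    # Parse the listing links; an XPath ending in '/' is not a valid expression
    Rule(LinkExtractor(restrict_xpaths='//div[@class="list_title"]/h1'), callback='parse_item'),
)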
