Python Crawlers: Scrapy Study Notes (2)

An explanation of CrawlSpider, plus a small CrawlSpider example


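1. Introduction to CrawlSpider

CrawlSpider is a subclass of Spider. The Spider class is designed to crawl only the pages listed in start_urls, whereas CrawlSpider defines rules (Rule objects) that provide a convenient mechanism for following links. That makes it the better fit for jobs that extract links from crawled pages and keep crawling them.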

2. Creating a CrawlSpider

Creating the project

The project is created the same way as any other Scrapy project:
scrapy startproject <project name>

Here I created a project named WXapp.

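Creating the CrawlSpider

This differs from creating an ordinary spider: the command takes an extra '-t crawl' option.
scrapy genspider -t crawl <spider name> <domain> (change into the project directory first)

Opening the newly created spider in PyCharm shows the auto-generated code.
The spider inherits from CrawlSpider, and its rules attribute is the core of the spider.

The rules attribute is described below.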

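3. The rules attribute

rules is a collection of Rule objects. Each Rule accepts the following parameters:

LinkExtractor(…): extracts the links from the response
callback='str': the callback applied to the responses of the extracted links, used to extract data and fill the item
cb_kwargs: a dict of keyword arguments passed to the callback
follow=True/False: whether the extracted links should be followed
process_links: a function for filtering the extracted links
process_request: a function for filtering the Requests built from those links

Apart from the LinkExtractor, all of these parameters are optional. When callback is None, the rule acts as a 'springboard': the page is downloaded only so that its links can be followed, and no data is extracted from it. This is typically used for pagination.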

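LinkExtractor parameters

Ten commonly used LinkExtractor parameters are:

1. allow='re_str': a regular expression string; links in the response whose URL matches it are extracted.
2. deny='re_str': links whose URL matches this regular expression are excluded.
3. restrict_xpaths='xpath_str': only links inside the regions selected by the XPath expression are extracted.
4. restrict_css='css_str': only links inside the regions selected by the CSS expression are extracted.
5. allow_domains='domain_str': the domains that are allowed.
6. deny_domains='domain_str': the domains that are excluded.
7. tags='tag'/['tag1','tag2',…]: the tags from which links are extracted; by default the a and area tags.
8. attrs=['href','src',…]: the attributes from which links are extracted; by default only href.
9. unique=True/False: whether the extracted links are de-duplicated.
10. process_value: a function applied to each extracted value; it runs before the allow pattern is applied.

These parameters can be combined to extract only the links that satisfy all of the conditions.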
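
To make the way these filters combine more concrete, here is a simplified model of allow, deny, allow_domains, and unique that uses only the standard library. It is not Scrapy's implementation: the function name filter_links and the sample URLs are made up for illustration, and the only behaviour assumed from Scrapy is that the patterns are matched with a regex search against the full URL.

```python
import re
from urllib.parse import urlparse


def filter_links(urls, allow=None, deny=None, allow_domains=None, unique=True):
    """Toy model of LinkExtractor filtering (hypothetical helper, not Scrapy code)."""
    kept, seen = [], set()
    for url in urls:
        host = urlparse(url).hostname or ""
        # allow_domains: the host must be a listed domain or a subdomain of one
        if allow_domains and not any(
            host == d or host.endswith("." + d) for d in allow_domains
        ):
            continue
        # allow: the pattern must match somewhere in the URL
        if allow and not re.search(allow, url):
            continue
        # deny: a matching URL is dropped even if allow matched
        if deny and re.search(deny, url):
            continue
        # unique: skip URLs that were already kept
        if unique and url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


sample_urls = [
    "http://www.wxapp-union.com/article-100-1.html",
    "http://www.wxapp-union.com/article-100-1.html",  # duplicate
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2",
    "http://example.com/article-7-1.html",  # other domain
]

links = filter_links(
    sample_urls,
    allow=r"article-\d+-\d+\.html",
    allow_domains=["wxapp-union.com"],
)
print(links)
```

Only the first article URL survives: the duplicate is dropped by unique, the list page fails allow, and the last URL fails allow_domains.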

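The follow parameter

follow: specifies whether the links extracted from a response by this rule should themselves be followed. It can be True or False. If callback is None, follow defaults to True; otherwise it defaults to False.

The callback parameter

callback: the callback function, called with the response of each link that the LinkExtractor extracted.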

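How do rules work?

rules is the core of a CrawlSpider. If a Rule specifies a callback, for example callback='parse_item', then parse_item is called automatically with the response of each link the Rule extracts. Do not name the callback parse, because CrawlSpider uses the parse method itself to implement its logic. If follow is True, the crawl continues: the rules are applied again to each of those responses, and the links that satisfy them are requested and parsed in turn.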
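
The flow described above can be sketched with a toy in-memory crawler. This is only a model of the idea, not how Scrapy is implemented: SITE, crawl, and the page names are invented for the example, each rule is reduced to a (pattern, callback, follow) tuple, and a link is handled by the first rule that matches it.

```python
import re
from collections import deque

# Hypothetical site: page URL -> links found on that page.
SITE = {
    "/list?page=1": ["/list?page=2", "/article-1.html"],
    "/list?page=2": ["/list?page=1", "/article-2.html"],
    "/article-1.html": ["/article-2.html", "/about"],
    "/article-2.html": [],
}


def crawl(start_url, rules):
    """Apply (pattern, callback, follow) rules breadth-first; collect callback results."""
    results = []
    seen = {start_url}
    # Each queue entry: (url, callback for that page, whether to extract its links)
    queue = deque([(start_url, None, True)])
    while queue:
        url, callback, follow = queue.popleft()
        if callback is not None:
            results.append(callback(url))
        if not follow:
            continue
        for link in SITE.get(url, []):
            if link in seen:
                continue
            for pattern, rule_callback, rule_follow in rules:
                if re.search(pattern, link):
                    seen.add(link)
                    queue.append((link, rule_callback, rule_follow))
                    break  # the first matching rule wins
    return results


def parse_item(url):
    # Stand-in for a real callback that would parse the response.
    return {"url": url}


rules = [
    (r"/list\?page=\d+", None, True),  # springboard: follow only
    (r"/article-\d+\.html", parse_item, False),  # extract data, do not follow
]

items = crawl("/list?page=1", rules)
print(items)
```

The list pages are visited only to discover links, each article page reaches parse_item once, and /about is never requested because no rule matches it.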

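4. CrawlSpider example

Here we crawl the WeChat mini program community site (wxapp-union.com)
and collect the information for every article on it.

Here is the code (the WXapp_spider.py file):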

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Import path as in the author's layout; adjust it to where your items.py lives.
from scrapy_demo.WXapp.WXapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'WXapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # List pages: no callback, followed only to discover more links.
        Rule(LinkExtractor(allow=r'mod=list&catid=2&page=\d+'), follow=True),
        # Article pages: parsed by parse_item. The dot is escaped so that it
        # matches a literal '.', and \d+ requires at least one digit.
        Rule(LinkExtractor(allow=r'article-\d+-\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Title, author and publication time of the article.
        article_name = response.xpath("//h1[@class='ph']/text()").get()
        author_name = response.xpath('//p[@class="authors"]/a/text()').get()
        article_time = response.xpath('//p[@class="authors"]//span/text()').get()
        # The body is spread over many text nodes; join them into one string.
        introduce = response.xpath('//td[@id="article_content"]//text()').getall()
        article_introduce = "".join(introduce)
        yield WxappItem(
            article_name=article_name,
            author_name=author_name,
            article_time=article_time,
            article_introduce=article_introduce,
        )

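In addition, I modified two other files in this project.

Edit settings.py: uncomment the item pipeline setting; otherwise the items are not saved as JSON when they are passed on.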
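
Once pipelines are enabled, a pipeline along the following lines could write each item to a JSON Lines file. This is a sketch under assumptions rather than the author's actual code: the class name WxappPipeline, the output file name articles.jsonl, and the dotted path in the settings comment are guesses based on the project name. The class relies only on the open_spider, process_item, and close_spider hooks that Scrapy calls on an item pipeline.

```python
import json


class WxappPipeline:
    """Hypothetical pipeline that writes each item as one line of JSON."""

    def __init__(self, path="articles.jsonl"):
        self.path = path  # assumed output file name
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict() works for plain dicts and for scrapy.Item instances alike.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# In settings.py the pipeline would then be enabled with something like this
# (the dotted path is an assumption about the project layout):
# ITEM_PIPELINES = {"WXapp.pipelines.WxappPipeline": 300}
```

Setting ensure_ascii=False keeps Chinese article titles readable in the output file instead of turning them into \u escapes.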

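6. Summary

LinkExtractor and Rule are both required; together they determine where the spider goes.
1. How to write allow patterns: a pattern must restrict matches to the URLs we want and must not also match other URLs.
2. When to use follow: if, while crawling a page, the URLs on it that satisfy the current rule need to be followed in turn, set it to True. Otherwise set it to False.
3. When to specify a callback: if the page behind a URL is only needed to obtain more URLs and its data is not needed, no callback is required. If you want the data in the page, specify a callback.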
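
One way to check point 1 before running the spider is to test the allow patterns against a few sample URLs with the re module. The sketch below assumes patterns modelled on the two rules in the example spider and a handful of hand-written URLs; the helper name matching_rules is made up for illustration.

```python
import re

# Patterns modelled on the example spider's two rules (assumed for illustration).
PATTERNS = {
    "list": r"mod=list&catid=2&page=\d+",
    "article": r"article-\d+-\d+\.html",
}


def matching_rules(url):
    """Return the names of all patterns that match the URL."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]


samples = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3",
    "http://www.wxapp-union.com/article-5432-1.html",
    "http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=3",
]

for url in samples:
    # A wanted URL should match exactly one pattern; an unwanted one should match none.
    print(url, "->", matching_rules(url))
```

If a URL shows up under more than one name, or an unwanted URL matches at all, the pattern needs to be tightened.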

Reference: https://blog.csdn.net/killeri/article/details/80255500
