Rewriting the bidding data crawler with the CrawlSpider class

Create a new project

scrapy startproject BidsSpider

Generate a new spider that uses CrawlSpider as its base class

scrapy genspider -t crawl publicBids hubeibidding.com

Key point: scrapy genspider [-t template] name domain creates a spider. The bracketed -t template part is optional; if it is omitted, the basic template is used by default.

publicBids.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from BidsSpider.items import BidsspiderItem

class PublicbidsSpider(CrawlSpider):
    name = 'publicBids'
    allowed_domains = ['hubeibidding.com']
    start_urls = ['http://www.hubeibidding.com/plus/list.php?tid=4&TotalResult=10353&PageNo=1']

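    rules = (
        # Raw string so that \d reaches the regex engine unchanged.
        # Every link whose URL contains PageNo=<digits> is parsed and followed.
        Rule(LinkExtractor(allow=r'PageNo=\d+'), callback='parse_item', follow=True),
    )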

    def parse_item(self, response):
        # Each <li> under <ul class="e2"> is one entry on the list page.
        for each in response.xpath("//ul[@class='e2']/li"):
            item = BidsspiderItem()
            item['bidsType'] = each.xpath("./div/b/a/text()").extract()[0]
            item['bidsName'] = each.xpath("./div/a/text()").extract()[0]
            # The href is site-relative, so prepend the site root.
            item['bidsLink'] = "http://www.hubeibidding.com/" + each.xpath("./div/a/@href").extract()[0]
            # The second text node matched under <span> is used as the time.
            item['bidsTime'] = each.xpath("./span/text()").extract()[1]

            yield item
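
A side note on the bidsLink line: plain string concatenation only gives a clean URL when the href has no leading slash. The sketch below compares it with urljoin from the standard library, which resolves an href against the URL of the page it was found on. The two href values are hypothetical, made up for illustration and not taken from the site.

```python
from urllib.parse import urljoin

BASE = "http://www.hubeibidding.com/"
PAGE_URL = "http://www.hubeibidding.com/plus/list.php?tid=4&TotalResult=10353&PageNo=1"

# Hypothetical href values, for illustration only.
root_relative = "/plus/view.php?aid=1"   # starts with a slash
page_relative = "view.php?aid=1"         # relative to the current page

# Concatenation produces a double slash when the href starts with "/".
concatenated = BASE + root_relative

# urljoin resolves both forms against the page URL.
joined_root = urljoin(PAGE_URL, root_relative)
joined_page = urljoin(PAGE_URL, page_relative)

print(concatenated)
print(joined_root)
print(joined_page)
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution using the response's own URL, so it could replace the concatenation if the site's hrefs turn out to vary in form.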

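Key points:
1) CrawlSpider: the spider commonly used for crawling regular websites. It defines a set of rules (Rule objects) that provide a convenient mechanism for following links.

2) LinkExtractor: extracts links.
– allow: a regular expression or a list of regular expressions. Only links whose absolute URL matches are extracted. If it is empty (the default), all links are extracted.

3) Rule: defines a rule for extracting links.
– link_extractor: a Link Extractor object. It defines how links are extracted from each crawled page.
– callback: a callable or a string (in which case the spider method with that name is used). It is called with the response for each link obtained from link_extractor. Do not use parse as the callback, because CrawlSpider uses the parse method itself to implement its logic.
– follow: a boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False, which is why follow=True is set explicitly above.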
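
The interplay of allow, callback and follow can be sketched with a toy model in plain Python. This is not Scrapy's implementation: FAKE_SITE, its example.com URLs and the crawl function are hypothetical stand-ins, and the model simplifies by de-duplicating on the exact URL string and treating the start URL as already seen. It only illustrates that allow is applied as a regex search on the absolute URL, that the callback receives each matching page, and that follow decides whether links on those pages are extracted in turn.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
FAKE_SITE = {
    "http://example.com/list.php?PageNo=1": [
        "http://example.com/list.php?PageNo=2",
        "http://example.com/about.html",
    ],
    "http://example.com/list.php?PageNo=2": [
        "http://example.com/list.php?PageNo=1",
        "http://example.com/list.php?PageNo=3",
    ],
    "http://example.com/list.php?PageNo=3": [],
}


def crawl(start_url, allow, follow=True):
    """Toy model of a single Rule.

    Returns the URLs that would be handed to the callback, in order.
    """
    pattern = re.compile(allow)
    seen = {start_url}
    # Each queue entry: (url, whether to extract links from that page).
    # Links are always extracted from the start page.
    queue = deque([(start_url, True)])
    called_back = []
    while queue:
        url, extract_links = queue.popleft()
        if not extract_links:
            continue
        for link in FAKE_SITE.get(url, []):
            # allow: regex search on the absolute URL; skip duplicates.
            if link in seen or not pattern.search(link):
                continue
            seen.add(link)
            called_back.append(link)       # callback runs on this page
            queue.append((link, follow))   # follow: extract its links too?
    return called_back


START = "http://example.com/list.php?PageNo=1"
with_follow = crawl(START, r"PageNo=\d+", follow=True)
without_follow = crawl(START, r"PageNo=\d+", follow=False)
print(with_follow)
print(without_follow)
```

In this model, follow=True reaches page 3 through page 2, while follow=False stops after the links found on the start page. The about.html link is never visited because it does not match the allow pattern.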

The other files are the same as in Lesson 2.
