Rewriting the complete-set bidding (成套招标) data crawler with the CrawlSpider class

Create a new project

scrapy startproject BidsSpider

Generate a new spider that is based on the CrawlSpider class

scrapy genspider -t crawl publicBids hubeibidding.com

Key point: scrapy genspider [-t template] name domain creates a spider. The bracketed -t template part is optional; if it is omitted, the basic template is used by default.

publicBids.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from BidsSpider.items import BidsspiderItem

class PublicbidsSpider(CrawlSpider):
    name = 'publicBids'
    allowed_domains = ['hubeibidding.com']
    start_urls = ['http://www.hubeibidding.com/plus/list.php?tid=4&TotalResult=10353&PageNo=1']

    rules = (
        # Raw string so that \d reaches the regex engine unchanged.
        Rule(LinkExtractor(allow=r'PageNo=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Each <li> under <ul class="e2"> is treated as one bid entry.
        for each in response.xpath("//ul[@class='e2']/li"):
            item = BidsspiderItem()
            item['bidsType'] = each.xpath("./div/b/a/text()").extract()[0]
            item['bidsName'] = each.xpath("./div/a/text()").extract()[0]
            item['bidsLink'] = "http://www.hubeibidding.com/" + each.xpath("./div/a/@href").extract()[0]
            item['bidsTime'] = each.xpath("./span/text()").extract()[1]

            yield item

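Key points:
1) CrawlSpider: the spider commonly used for crawling ordinary websites. It defines a set of rules (Rule) that provide a convenient mechanism for following links.

2) LinkExtractor: extracts links.
– allow: accepts a regular expression or a list of regular expressions; only links whose absolute URL matches are extracted. If it is empty (the default), all links are extracted.

3) Rule: defines a rule for extracting links.
– link_extractor: a Link Extractor object that defines how links are extracted from each crawled page.
– callback: a callable or a string (in which case the spider method with that name is used). It is called for each link obtained from link_extractor.
– follow: a boolean specifying whether the links extracted from a response by this rule should be followed.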
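
The effect of the allow pattern can be sketched with the standard re module alone. This is only an illustration, not Scrapy's actual code path: it assumes the pattern is searched for anywhere in the absolute URL, and the candidate URLs and the filter_links helper are hypothetical.

```python
import re

# Same pattern as the `allow` argument in the spider's Rule.
ALLOW_PATTERN = re.compile(r"PageNo=\d+")

# Hypothetical absolute URLs, as they might look after link extraction.
candidate_urls = [
    "http://www.hubeibidding.com/plus/list.php?tid=4&TotalResult=10353&PageNo=2",
    "http://www.hubeibidding.com/plus/list.php?tid=4&TotalResult=10353&PageNo=37",
    "http://www.hubeibidding.com/plus/view.php?aid=123",
]


def filter_links(urls, pattern):
    """Keep only the URLs in which the pattern matches somewhere."""
    return [url for url in urls if pattern.search(url)]


followed = filter_links(candidate_urls, ALLOW_PATTERN)
for url in followed:
    print(url)
```

Under these assumptions only the two list pages are kept, which is why the rule walks the pagination while ignoring detail pages such as view.php.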

For the other files, refer to Lesson 2. The result is as follows:
