Crawler steps

Step 1: install Scrapy by running the following command:

pip install Scrapy
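
To confirm that the install worked, a minimal sketch using only the standard library (Python 3.8 or later) is shown below. The helper name installed_version is hypothetical and not part of Scrapy.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("Scrapy")
    if version is None:
        print("Scrapy is not installed in this environment")
    else:
        print("Scrapy", version)
```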

Step 2: create the project by running the following command:

scrapy startproject novel
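
The generated project contains a settings module, novel/settings.py. The excerpt below is a sketch of settings that may suit a crawl like this one. The setting names are standard Scrapy settings, but the values are assumptions and do not come from the original post.

```python
# novel/settings.py (excerpt); the values below are illustrative assumptions

# Name of the bot; the project template sets this to the project name.
BOT_NAME = "novel"

# Write non-ASCII text (such as Chinese chapter content) as readable UTF-8
# in exported feeds instead of \uXXXX escape sequences in JSON output.
FEED_EXPORT_ENCODING = "utf-8"

# Pause between requests to the same site, in seconds, to reduce server load.
DOWNLOAD_DELAY = 0.5

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True
```

A longer delay slows the crawl, so the value is a trade-off between speed and politeness toward the site.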

Step 3: write the spider file and save it as novel/spiders/toscrape-xpath.py, with the following content:

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    # Name of the spider
    name = 'novel'
    # URL where the crawl starts
    start_urls = [
        'https://www.xbiquge6.com/0_638/1124120.html',
    ]

    def parse(self, response):
        # Define the format of the stored data: chapter title and chapter text
        yield {
            'text': response.xpath('//div[@class="bookname"]/h1[1]/text()').extract_first(),
            'content': response.xpath('//div[@id="content"]/text()').extract(),
        }
        # Link to the next chapter
        next_page_url = response.xpath('//div[@class="bottem1"]/a[3]/@href').extract_first()
        # Follow the link unless it is missing or points to the book's home page.
        # Note: on the last chapter, the "next chapter" link points to the home page.
        if next_page_url is not None:
            # Resolve relative links before comparing with the absolute home page URL
            next_page_url = response.urljoin(next_page_url)
            if next_page_url != 'https://www.xbiquge6.com/0_638/':
                yield scrapy.Request(next_page_url)
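
The stop condition at the end of parse can be tried in isolation with the standard library, since response.urljoin resolves links in much the same way as urllib.parse.urljoin. The sketch below is illustrative only: the function name next_chapter_url and the chapter file name 1124121.html are hypothetical.

```python
from urllib.parse import urljoin

# The book's home page, taken from the spider above.
INDEX_URL = "https://www.xbiquge6.com/0_638/"


def next_chapter_url(page_url, href, index_url=INDEX_URL):
    """Return the absolute URL of the next chapter, or None when the crawl should stop."""
    if not href:
        # No link was found on the page.
        return None
    # Resolve relative links against the URL of the current page.
    absolute = urljoin(page_url, href)
    if absolute == index_url:
        # The last chapter links back to the home page.
        return None
    return absolute


page = "https://www.xbiquge6.com/0_638/1124120.html"
print(next_chapter_url(page, "/0_638/1124121.html"))  # hypothetical next chapter
print(next_chapter_url(page, "/0_638/"))              # link back to the home page
```

Comparing after resolving the link means the check works whether the site emits relative or absolute hrefs.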

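The 'content' field yielded by the spider is a list of text nodes rather than a single string. A minimal sketch of how the list might be joined into readable chapter text follows. The helper join_content and the sample fragments are hypothetical, and the real page may need different cleanup.

```python
def join_content(fragments):
    """Join the text nodes of a chapter into one string, one paragraph per line."""
    paragraphs = []
    for fragment in fragments:
        # Treat non-breaking spaces (often used for indentation) as ordinary whitespace.
        text = fragment.replace("\xa0", " ").strip()
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Made-up fragments shaped like the output of .extract() on a chapter page.
sample = ["\xa0\xa0\xa0\xa0First paragraph.", "\r\n", "\xa0\xa0\xa0\xa0Second paragraph.\n"]
print(join_content(sample))
```

A helper like this could be called in an item pipeline or when post-processing the exported data.
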
Summary

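With the framework the crawl took 23 minutes, three times faster than using requests. Awesome! XPath is also quite pleasant to use. I will keep learning, and discussion is welcome.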
