How a Scrapy spider matches major book categories to their subcategories

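following-sibling

The following-sibling axis selects all sibling nodes that come after the current node. Like preceding-sibling, it only selects nodes that share the same parent as the current node. The difference is direction: following-sibling takes the siblings after the node, while preceding-sibling takes the siblings before it.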

1. Pseudocode for traversing JD Books:

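        # Get all major-category <dt> tags
        dt_list = '//*[@id="booksort"]/div[2]/dl/dt'

        # Iterate over the 52 major categories
        for dt in dt_list:
            category = './a/text()'
            # Find the subcategories that belong to this major category
            em_list = './following-sibling::*[1]/em'

            for em in em_list:
                small_category = './a/text()'
                # Note: the subcategory link has no scheme, so prepend 'https:'
                small_link = 'https:' + './a/@href'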
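
Prepending a scheme by hand works only when every href has the same shape. A more forgiving option is urljoin from the standard library, which resolves a link against the page it was found on. The hrefs below are invented examples (the list.example.com host is an assumption, not taken from the real page).

```python
from urllib.parse import urljoin

# Assumed page URL and hrefs, for illustration only
page_url = 'https://book.jd.com/booksort.html'
hrefs = [
    '//list.example.com/1713-3258-3297.html',  # protocol-relative
    '/booksort/sub.html',                      # root-relative
    'https://list.example.com/full.html',      # already absolute
]

# urljoin fills in whatever part of the URL the href is missing
links = [urljoin(page_url, href) for href in hrefs]
for link in links:
    print(link)
```

Inside a spider, Scrapy responses offer a similar helper, response.urljoin(href), so the base URL does not have to be passed around.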

2. Parsing data in Scrapy:

 response.xpath().extract()         returns every match as a list
 response.xpath().extract_first()   returns only the first match

Verify that each major category is crawled together with its subcategories:

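import scrapy


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['jd.com']
    # Level 1: the index page that lists all book categories
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        # Get the major-category <dt> tags. The [1] keeps only the first
        # <dt> so the test output stays short; remove it to get all of them.
        dt_list = response.xpath('//*[@id="booksort"]/div[2]/dl/dt[1]')

        # Iterate over the major categories (52 without the [1] filter)
        for dt in dt_list:
            # A complete book record holds: major category, subcategory,
            # name, author, publisher, price.
            # item is a container for the parsed data, which makes it easy
            # to save to a database later.
            item = {}
            item['category'] = dt.xpath('./a/text()').extract_first()

            # Find the subcategories that belong to this major category
            em_list = dt.xpath('./following-sibling::*[1]/em')

            for em in em_list:
                item['small_category'] = em.xpath('./a/text()').extract_first()
                # Note: the subcategory href has no scheme, so resolve it
                # against the page URL to get a full link
                small_link = response.urljoin(em.xpath('./a/@href').extract_first())
                print(item)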
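
One thing to watch in the loop above: a single item dict is reused for every subcategory. That is harmless while the spider only prints, but parse_book later reads the item from response.meta, which suggests the dict is handed to a follow-up request. The sketch below is plain Python with no Scrapy involved, and the collect function is a hypothetical stand-in for queuing requests. It shows why handing over the same dict is risky.

```python
from copy import deepcopy


def collect(subcategories, copy_each):
    """Queue one item per subcategory, the way a spider would queue requests."""
    item = {'category': 'Fiction'}
    queued = []
    for name in subcategories:
        item['small_category'] = name
        # Without a copy, every queued entry is the SAME dict object
        queued.append(deepcopy(item) if copy_each else item)
    return queued


shared = collect(['Chinese fiction', 'Foreign fiction'], copy_each=False)
copied = collect(['Chinese fiction', 'Foreign fiction'], copy_each=True)

print([i['small_category'] for i in shared])
print([i['small_category'] for i in copied])
```

In the shared case every entry ends up with the last subcategory, because later iterations overwrite the one dict. In a spider, the equivalent fix would be something like scrapy.Request(small_link, callback=self.parse_book, meta={'book': deepcopy(item)}). The original post does not show that request, so treat it as an assumption about how the two callbacks are connected.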

4. Iterate over the 60 books

        # Iterate over the 60 books and parse the details of each one
        list_book = '//*[@id="plist"]/ul/li'
        for book in list_book:
            # Title
            name = './/div[@class="p-name"]/a/em/text()'

            # Author
            author = './/span[@class="p-bi-name"]/span/a/text()'

            # Publisher
            store = './/span[@class="p-bi-store"]/a/text()'

            # Price
            price = './/div/strong[@class="J_price"]/i/text()'

            # Image URL (the src has no scheme, so prepend 'https:')
            default_image = 'https:' + './/div/div[@class="p-img"]/a/img/@src'

Code:

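    def parse_book(self, response):
        # The category item that was passed along with the request
        item = response.meta.get('book')
        # Select the blocks for all books on the page (60 of them)
        list_book = response.xpath('//*[@id="plist"]/ul/li/div')

        # Parse the details of each book. The [:2] keeps only the first two
        # while testing; remove it to process all 60.
        for book in list_book[:2]:
            # Title
            item['name'] = book.xpath('.//div[@class="p-name"]/a/em/text()').extract_first()

            # Author
            item['author'] = book.xpath('.//span[@class="p-bi-name"]/span/a/text()').extract_first()

            # Publisher
            item['store'] = book.xpath('.//span[@class="p-bi-store"]/a/text()').extract_first()

            # Price
            item['price'] = book.xpath('.//strong[@class="J_price"]/i/text()').extract_first()

            # Image URL: match the image <div> at any depth below the current
            # node, then resolve the scheme-less src against the page URL
            src = book.xpath('.//div[@class="p-img"]/a/img/@src').extract_first()
            item['default_image'] = response.urljoin(src) if src else None

            print(item)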

Result:

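5. In the console, run: scrapy crawl book -o data.json
This generates a data.json file. Note that -o only exports items that a callback yields, so replace print(item) with yield item for the file to contain data. If the output is hard to read, paste it into https://www.json.cn/ to format it.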
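
The exported file can also be made readable locally with the standard json module. The sketch below writes a small stand-in file first so that it runs on its own; the record in it is made-up sample data, not real crawl output.

```python
import json
import os
import tempfile

# Stand-in for the exported file; the record below is made-up sample data
sample = [{"category": "小說", "small_category": "中國當代小說"}]
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f)  # non-ASCII text is written as \uXXXX escapes

with open(path, encoding="utf-8") as f:
    raw = f.read()
data = json.loads(raw)

# ensure_ascii=False prints Chinese characters as-is; indent adds line breaks
pretty = json.dumps(data, ensure_ascii=False, indent=2)
print(pretty)
```

Depending on the FEED_EXPORT_ENCODING setting, Scrapy's JSON export may contain the same \uXXXX escapes. Setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py should make the file itself readable.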
