Crawling Ziroom with CrawlSpider

Original

2020-06-16 05:54

First, analyze the page:

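Next, a look at my file structure. Because this project uses CrawlSpider, we start by writing the link extractor for each listing's detail page. Write and debug incrementally, so that bugs surface early and can be fixed right away.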
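One way to debug the extraction paths before running a crawl is to try them against a small saved snippet. The sketch below is only an illustration: the HTML is a hypothetical, simplified stand-in for the real listing page, and it uses the standard library's ElementTree (which supports a limited XPath subset and needs well-formed markup) rather than Scrapy's own selectors. In ElementTree a path that searches the whole tree starts with `.//`, which plays the role of `//` in a Scrapy XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified listing page; the real Ziroom markup is more complex.
SAMPLE = """
<html><body>
  <div class="list">
    <a class="pic-wrap" href="//wh.ziroom.com/x/1.html">room 1</a>
    <a class="pic-wrap" href="//wh.ziroom.com/x/2.html">room 2</a>
    <a class="other" href="/about">about</a>
  </div>
  <div id="page"><a href="/z/p2/">2</a><a href="/z/p3/">3</a></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)


def hrefs(path):
    """Return the href of every element matched by the given path."""
    return [a.get("href") for a in root.findall(path)]


# Detail-page links: every <a class="pic-wrap"> anywhere in the tree.
detail_links = hrefs(".//a[@class='pic-wrap']")

# Pagination links: every <a> directly inside <div id="page">.
page_links = hrefs(".//div[@id='page']/a")

# Without the leading './/' only direct children of the root are searched,
# so the pager (nested inside <body>) is not found.
shallow_links = hrefs("div[@id='page']/a")

print(detail_links)
print(page_links)
print(shallow_links)
```

The last query shows why a path that is not anchored to search the whole document can silently match nothing, which is worth checking whenever a rule appears to extract no links.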

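Once the detail-page link extraction rule is written, check that the response is not empty. If everything looks fine, go on to write the code that extracts the detail-page content, and print the item to confirm the fields are extracted correctly. Finally, write the extraction rule for the next page. The spider file is as follows: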

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ziroom.items import ZiroomItem
from copy import deepcopy


class ZiruSpider(CrawlSpider):
    name = 'ziru'
    allowed_domains = ['wh.ziroom.com']
    start_urls = ['http://wh.ziroom.com/z/d23008780/?isOpen=1']

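    rules = (
        # Extract the detail-page links
        Rule(LinkExtractor(restrict_xpaths='//a[@class="pic-wrap"]'), callback='parse_item'),
        # Pagination: the leading '//' lets the XPath match the pager anywhere in the document
        Rule(LinkExtractor(restrict_xpaths='//div[@id="page"]/a'), follow=True),
    )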

    def parse_item(self, response):
        item = ZiroomItem()
        # Title: drop the first '·'-separated segment; default='' avoids errors when nothing matches
        title = response.xpath('//h1[@class="Z_name"]/text()').extract_first(default='')
        item['title'] = title.split('·')[1:]
        # Floor area (label followed by value)
        item['size'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[1]/dt/text()').extract_first(default='')
        item['size'] += response.xpath('//div[@class="Z_home_b clearfix"]/dl[1]/dd/text()').extract_first(default='')
        # Orientation
        item['face'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[2]/dd/text()').extract_first(default='')
        # Layout
        item['house_type'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[3]/dd/text()').extract_first(default='')
        # Location
        item['location'] = response.xpath('//ul[@class="Z_home_o"]/li[1]/span[@class="va"]/span/text()').extract_first(default='')
        # Description
        item['introduce'] = response.xpath('//div[@class="Z_rent_desc"]/text()').extract_first(default='')
        item['introduce'] = item['introduce'].strip()
        # Image download links, e.g. http://img.ziroom.com/pic/house_images/g2m3/M00/CF/2C/ChAZVF308mqAHTJEABghguCODIo999.jpg_C_380_285_Q80.jpg
        imgs = response.xpath('//ul[@class="Z_swiper_thumb_inner Z_sliders_nav"]/li/img/@src').extract()
        # The src values are prefixed with a scheme (this assumes they start with '//')
        item['image_urls'] = ['https:' + img for img in imgs]

        yield deepcopy(item)
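Two details of the field handling above can be tried in isolation. `extract_first()` returns None when an XPath matches nothing, so string operations on its result need a guard, and prefixing `'https:'` only works when every `src` is protocol-relative. The following is a minimal sketch using only the standard library; the helper names and the sample title and URLs are hypothetical, chosen for illustration.

```python
from urllib.parse import urljoin


def clean_text(value):
    """Return stripped text, or '' when the XPath matched nothing (None)."""
    return value.strip() if value else ''


def title_parts(raw_title):
    """Drop the first '·'-separated segment, like split('·')[1:] in the spider.

    Unlike the spider, this sketch also strips whitespace from each part.
    """
    return [part.strip() for part in clean_text(raw_title).split('·')[1:]]


def absolute_image_urls(page_url, srcs):
    """Resolve protocol-relative, absolute and relative src values alike."""
    return [urljoin(page_url, src) for src in srcs]


# Hypothetical sample values, not taken from the real site.
parts = title_parts(' Shared·Example Garden 3BR-South ')
empty = title_parts(None)
urls = absolute_image_urls(
    'https://wh.ziroom.com/x/1.html',
    ['//img.ziroom.com/pic/a.jpg', 'http://img.ziroom.com/pic/b.jpg', '/static/c.jpg'],
)

print(parts)
print(empty)
print(urls)
```

Inside a Scrapy callback, `response.urljoin(src)` performs the same resolution relative to the page URL.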

Because ImagesPipeline is used to download the images, the images and image_urls fields must be defined in the items file. The items file is as follows:

import scrapy


class ZiroomItem(scrapy.Item):
    # define the fields for your item here like:
    image_urls = scrapy.Field()
    images = scrapy.Field()

    title = scrapy.Field()
    size = scrapy.Field()
    face = scrapy.Field()
    house_type = scrapy.Field()
    location = scrapy.Field()
    introduce = scrapy.Field()

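Once the data looks right, you could enable ImagesPipeline directly to fetch the images. However, I want to use the listing's descriptive information as the folder name and store the images in that folder, so the image save path has to be overridden in a subclass of ImagesPipeline. Because file_path needs the item to build the folder name, override get_media_requests and pass the item along with each Request through meta. Since only the target folder changes, the file_path method can be copied straight from the Scrapy source, with a folder inserted into the middle of the returned path.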

from scrapy.pipelines.images import ImagesPipeline
import scrapy
from scrapy.utils.python import to_bytes
import hashlib


class ZiroomPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Attach the item to every image request so that file_path can read it
        for image_url in item['image_urls']:
            yield scrapy.Request(
                image_url,
                meta={'item': item}
            )

    def file_path(self, request, response=None, info=None, *, item=None):
        # Newer Scrapy releases pass item as a keyword; otherwise fall back to request.meta
        if item is None:
            item = request.meta.get('item')
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        # Build the folder name from the item's fields
        folder = ''.join(item['title'])
        folder += item['face']
        folder += item['size']
        folder += item['location']
        folder += item['introduce']

        return 'full/%s/%s.jpg' % (folder, image_guid)
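The folder name is built from scraped text, which may contain path separators, characters that some file systems reject, or simply be very long. The sketch below shows one possible way to normalize such a name before it goes into the path; `safe_folder_name`, `image_path` and the 80-character limit are assumptions made for illustration, not part of the original pipeline or of Scrapy.

```python
import hashlib
import re

MAX_FOLDER_LEN = 80  # arbitrary limit chosen for this sketch


def safe_folder_name(parts, max_len=MAX_FOLDER_LEN):
    """Join the non-empty parts and replace characters unsafe in a folder name."""
    raw = ''.join(p for p in parts if p)
    # Collapse path separators, characters Windows rejects, and whitespace into '_'
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', '_', raw).strip('_.')
    return cleaned[:max_len] or 'untitled'


def image_path(folder, url):
    """Same layout as the pipeline above: full/<folder>/<sha1 of url>.jpg"""
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s/%s.jpg' % (folder, image_guid)


# Hypothetical field values.
folder = safe_folder_name(['a/b', None, ' c:d '])
path = image_path(folder, 'https://img.ziroom.com/pic/a.jpg')

print(folder)
print(path)
```

In the pipeline, the parts would be the title segments followed by the face, size, location and introduce fields.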

Now configure the settings and the crawler is ready to run:

# Set the request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}


# Enable the pipeline
ITEM_PIPELINES = {
   'ziroom.pipelines.ZiroomPipeline': 300,
}

import os
# Directory where the images are saved
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
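To see which directory the IMAGES_STORE expression resolves to, the same computation can be applied to a sample path. This sketch assumes a hypothetical project location and uses posixpath so that the result is the same on every operating system; in settings.py itself, os.path and `__file__` are used as shown above.

```python
import posixpath


def images_store_for(settings_file):
    """Go up two levels from settings.py and append 'images'."""
    project_root = posixpath.dirname(posixpath.dirname(settings_file))
    return posixpath.join(project_root, 'images')


# Hypothetical location of the settings module.
store = images_store_for('/home/user/ziroom/ziroom/settings.py')
print(store)
```

With the default Scrapy project layout, this places the images folder next to scrapy.cfg rather than inside the package directory.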

Besides launching the crawler from the command line, you can also write your own start file:

from scrapy import cmdline
# Two equivalent forms; use only one, because execute() exits the process when the crawl ends
cmdline.execute(['scrapy', 'crawl', 'ziru'])
# cmdline.execute('scrapy crawl ziru'.split(' '))
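Both forms hand the same argument list to cmdline.execute. The short sketch below checks that, and shows how the standard library's shlex.split could build the list when an argument contains spaces; the log file name is a made-up example.

```python
import shlex

list_form = ['scrapy', 'crawl', 'ziru']
split_form = 'scrapy crawl ziru'.split(' ')
same = list_form == split_form

# str.split(' ') breaks on every space, even inside quotes,
# while shlex.split follows shell quoting rules.
command = 'scrapy crawl ziru -s "LOG_FILE=crawl log.txt"'
naive_args = command.split(' ')
shell_args = shlex.split(command)

print(same)
print(naive_args)
print(shell_args)
```

The resulting list can be passed to cmdline.execute in place of the hand-written one.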

The final saved result looks like this:
