Scraping Ziroom (自如网) with CrawlSpider

First, a quick analysis of the page:
(screenshot: page analysis of the listing page)
Next, a look at my project structure. Since this is a CrawlSpider, we start by writing the link extractor that pulls the detail-page links from each listing page; writing and debugging at the same time makes it easy to spot bugs and fix them promptly.
(screenshot: project file structure)
Once the detail-page link extraction rule is in place, check whether the responses come back non-empty. If that looks good, move on to the code that extracts the detail-page content and print the item to verify the fields are parsed correctly; finally, write the rule that follows the next-page links (a quick scrapy shell check is sketched right after the spider code). The spider file looks like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ziroom.items import ZiroomItem
from copy import deepcopy


class ZiruSpider(CrawlSpider):
    name = 'ziru'
    allowed_domains = ['wh.ziroom.com']
    start_urls = ['http://wh.ziroom.com/z/d23008780/?isOpen=1']

    rules = (
        # Extract the detail-page links
        Rule(LinkExtractor(restrict_xpaths='//a[@class="pic-wrap"]'), callback='parse_item'),
        # Pagination: follow the next-page links
        Rule(LinkExtractor(restrict_xpaths='//div[@id="page"]/a'), follow=True),
    )

    def parse_item(self, response):
        item = ZiroomItem()
        # Title
        item['title'] = response.xpath('//h1[@class="Z_name"]/text()').extract_first()
        item['title'] = item['title'].split('·')[1:]
        # Floor area
        item['size'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[1]/dt/text()').extract_first()
        item['size'] += response.xpath('//div[@class="Z_home_b clearfix"]/dl[1]/dd/text()').extract_first()
        # Orientation
        item['face'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[2]/dd/text()').extract_first()
        # Apartment layout
        item['house_type'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[3]/dd/text()').extract_first()
        # Location
        item['location'] = response.xpath('//ul[@class="Z_home_o"]/li[1]/span[@class="va"]/span/text()').extract_first()
        # Description
        item['introduce'] = response.xpath('//div[@class="Z_rent_desc"]/text()').extract_first()
        item['introduce'] = item['introduce'].strip()
        # Image download links, e.g. http://img.ziroom.com/pic/house_images/g2m3/M00/CF/2C/ChAZVF308mqAHTJEABghguCODIo999.jpg_C_380_285_Q80.jpg
        imgs = response.xpath('//ul[@class="Z_swiper_thumb_inner Z_sliders_nav"]/li/img/@src').extract()
        item['image_urls'] = ['https:' + img for img in imgs]

        yield deepcopy(item)
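
Before letting the whole crawl run, the XPaths used in the rules can be spot-checked interactively with scrapy shell. This is only a rough sketch (assuming the listing page is reachable from the shell with your configured headers; the URL is just the start URL above):

scrapy shell "http://wh.ziroom.com/z/d23008780/?isOpen=1"

>>> from scrapy.linkextractors import LinkExtractor
>>> # Detail-page links: should return one link per listing card
>>> LinkExtractor(restrict_xpaths='//a[@class="pic-wrap"]').extract_links(response)[:3]
>>> # Pagination links: should return the page-number anchors
>>> LinkExtractor(restrict_xpaths='//div[@id="page"]/a').extract_links(response)[:3]

If either call comes back empty, adjust the XPath before moving on.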

Because the files are downloaded with ImagesPipeline, the items file needs to define the images and image_urls fields. The items file looks like this:

import scrapy


class ZiroomItem(scrapy.Item):
    # define the fields for your item here like:
    image_urls = scrapy.Field()
    images = scrapy.Field()

    title = scrapy.Field()
    size = scrapy.Field()
    face = scrapy.Field()
    house_type = scrapy.Field()
    location = scrapy.Field()
    introduce = scrapy.Field()

Once the data all checks out, you could simply enable ImagesPipeline to fetch the images. But I want to use the listing's description as the folder name and save each listing's images inside that folder, so the image save path has to be customized by subclassing ImagesPipeline. Since file_path needs the item to build the folder name, get_media_requests is overridden so the item is passed along with the Request via meta. And because only the target folder changes, the file_path method can be copied straight from the Scrapy source, with a folder segment inserted into the middle of the returned path:

from scrapy.pipelines.images import ImagesPipeline
import scrapy
from scrapy.utils.python import to_bytes
import hashlib


class ZiroomPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:

            yield scrapy.Request(
                image_url, 
                meta={'item': item}
            )

    def file_path(self, request, response=None, info=None):
        item = request.meta.get('item')
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        # Build the folder name from the item fields
        folder = ''
        for title in item['title']:
            folder += title
        folder += item['face']
        folder += item['size']
        folder += item['location']
        folder += item['introduce']
		
        return 'full/%s/%s.jpg' % (folder, image_guid)
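
Side note: in newer Scrapy versions (2.4 and later) file_path receives the item directly, so the get_media_requests override and the meta plumbing are not strictly needed there. A minimal sketch of the same folder logic under that assumption:

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes
import hashlib


class ZiroomPipeline(ImagesPipeline):
    # Scrapy >= 2.4 passes the item into file_path as a keyword argument
    def file_path(self, request, response=None, info=None, *, item=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        folder = ''.join(item['title']) + item['face'] + item['size'] \
                 + item['location'] + item['introduce']
        return 'full/%s/%s.jpg' % (folder, image_guid)

The meta-based version above works on both old and new Scrapy, so either approach is fine.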

Now a bit of configuration in settings.py and the spider is ready to run:

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}


# Enable the image pipeline
ITEM_PIPELINES = {
   'ziroom.pipelines.ZiroomPipeline': 300,
}

import os
# Where downloaded images are stored
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
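
One environment note (a general Scrapy requirement, not specific to this project): ImagesPipeline relies on Pillow to process images, so it must be installed or the pipeline will not download anything:

pip install Pillow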

Besides starting the spider from the command line, you can also write a small start script:

from scrapy import cmdline
# Two equivalent ways to write the command; keep only one,
# since cmdline.execute() does not return
cmdline.execute(['scrapy', 'crawl', 'ziru'])
# cmdline.execute('scrapy crawl ziru'.split(' '))
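
Assuming the script is saved as start.py at the project root (next to scrapy.cfg; the name is just my choice here), launching the crawl is then simply:

python start.py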

The final saved result looks like this:
(screenshot: downloaded images saved into per-listing folders)
