Scraping Ziroom (自如网) with CrawlSpider

First, a quick analysis of the page:
(screenshot: page analysis of the listing page)
Next, a look at my project structure. Since this is a CrawlSpider, we start by writing the link extractor that pulls the detail-page links from each listing page; writing and debugging at the same time makes it easy to spot bugs and fix them promptly.
(screenshot: project file structure)
Once the detail-page link extraction rule is in place, check whether the responses come back non-empty. If that looks good, move on to the code that extracts the detail-page content and print the item to verify the fields are parsed correctly; finally, write the rule that follows the next-page links (a quick scrapy shell check is sketched right after the spider code). The spider file looks like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ziroom.items import ZiroomItem
from copy import deepcopy


class ZiruSpider(CrawlSpider):
    name = 'ziru'
    allowed_domains = ['wh.ziroom.com']
    start_urls = ['http://wh.ziroom.com/z/d23008780/?isOpen=1']

    rules = (
        # Extract the detail-page links
        Rule(LinkExtractor(restrict_xpaths='//a[@class="pic-wrap"]'), callback='parse_item'),
        # Pagination: follow the next-page links
        Rule(LinkExtractor(restrict_xpaths='//div[@id="page"]/a'), follow=True),
    )

    def parse_item(self, response):
        item = ZiroomItem()
        # Title
        item['title'] = response.xpath('//h1[@class="Z_name"]/text()').extract_first()
        item['title'] = item['title'].split('·')[1:]
        # Floor area
        item['size'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[1]/dt/text()').extract_first()
        item['size'] += response.xpath('//div[@class="Z_home_b clearfix"]/dl[1]/dd/text()').extract_first()
        # Orientation
        item['face'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[2]/dd/text()').extract_first()
        # Apartment layout
        item['house_type'] = response.xpath('//div[@class="Z_home_b clearfix"]/dl[3]/dd/text()').extract_first()
        # Location
        item['location'] = response.xpath('//ul[@class="Z_home_o"]/li[1]/span[@class="va"]/span/text()').extract_first()
        # Description
        item['introduce'] = response.xpath('//div[@class="Z_rent_desc"]/text()').extract_first()
        item['introduce'] = item['introduce'].strip()
        # Image download links, e.g. http://img.ziroom.com/pic/house_images/g2m3/M00/CF/2C/ChAZVF308mqAHTJEABghguCODIo999.jpg_C_380_285_Q80.jpg
        imgs = response.xpath('//ul[@class="Z_swiper_thumb_inner Z_sliders_nav"]/li/img/@src').extract()
        item['image_urls'] = ['https:' + img for img in imgs]

        yield deepcopy(item)
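
Before letting the whole crawl run, the XPaths used in the rules can be spot-checked interactively with scrapy shell. This is only a rough sketch (assuming the listing page is reachable from the shell with your configured headers; the URL is just the start URL above):

scrapy shell "http://wh.ziroom.com/z/d23008780/?isOpen=1"

>>> from scrapy.linkextractors import LinkExtractor
>>> # Detail-page links: should return one link per listing card
>>> LinkExtractor(restrict_xpaths='//a[@class="pic-wrap"]').extract_links(response)[:3]
>>> # Pagination links: should return the page-number anchors
>>> LinkExtractor(restrict_xpaths='//div[@id="page"]/a').extract_links(response)[:3]

If either call comes back empty, adjust the XPath before moving on.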

Because the files are downloaded with ImagesPipeline, the items file needs to define the images and image_urls fields. The items file looks like this:

import scrapy


class ZiroomItem(scrapy.Item):
    # define the fields for your item here like:
    image_urls = scrapy.Field()
    images = scrapy.Field()

    title = scrapy.Field()
    size = scrapy.Field()
    face = scrapy.Field()
    house_type = scrapy.Field()
    location = scrapy.Field()
    introduce = scrapy.Field()

Once the data all checks out, you could simply enable ImagesPipeline to fetch the images. But I want to use the listing's description as the folder name and save each listing's images inside that folder, so the image save path has to be customized by subclassing ImagesPipeline. Since file_path needs the item to build the folder name, get_media_requests is overridden so the item is passed along with the Request via meta. And because only the target folder changes, the file_path method can be copied straight from the Scrapy source, with a folder segment inserted into the middle of the returned path:

from scrapy.pipelines.images import ImagesPipeline
import scrapy
from scrapy.utils.python import to_bytes
import hashlib


class ZiroomPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:

            yield scrapy.Request(
                image_url, 
                meta={'item': item}
            )

    def file_path(self, request, response=None, info=None):
        item = request.meta.get('item')
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        # Build the folder name from the item fields
        folder = ''
        for title in item['title']:
            folder += title
        folder += item['face']
        folder += item['size']
        folder += item['location']
        folder += item['introduce']
		
        return 'full/%s/%s.jpg' % (folder, image_guid)
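
Side note: in newer Scrapy versions (2.4 and later) file_path receives the item directly, so the get_media_requests override and the meta plumbing are not strictly needed there. A minimal sketch of the same folder logic under that assumption:

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes
import hashlib


class ZiroomPipeline(ImagesPipeline):
    # Scrapy >= 2.4 passes the item into file_path as a keyword argument
    def file_path(self, request, response=None, info=None, *, item=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        folder = ''.join(item['title']) + item['face'] + item['size'] \
                 + item['location'] + item['introduce']
        return 'full/%s/%s.jpg' % (folder, image_guid)

The meta-based version above works on both old and new Scrapy, so either approach is fine.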

Now a bit of configuration in settings.py and the spider is ready to run:

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'
}


# Enable the image pipeline
ITEM_PIPELINES = {
   'ziroom.pipelines.ZiroomPipeline': 300,
}

import os
# Where downloaded images are stored
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
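
One environment note (a general Scrapy requirement, not specific to this project): ImagesPipeline relies on Pillow to process images, so it must be installed or the pipeline will not download anything:

pip install Pillow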

Besides starting the spider from the command line, you can also write a small start script:

from scrapy import cmdline
# Two equivalent ways to write the command; keep only one,
# since cmdline.execute() does not return
cmdline.execute(['scrapy', 'crawl', 'ziru'])
# cmdline.execute('scrapy crawl ziru'.split(' '))
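
Assuming the script is saved as start.py at the project root (next to scrapy.cfg; the name is just my choice here), launching the crawl is then simply:

python start.py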

The final saved result looks like this:
(screenshot: downloaded images saved into per-listing folders)
