Web Crawler -- 20. [Scrapy-Redis in Practice] Distributed Crawler for Fang.com -- Code Implementation

I. Case Introduction

Crawl listing information from Fang.com (https://www1.fang.com/).

The source code has been uploaded to GitHub.


II. Creating the Project

Open a Windows terminal and change to the directory where the project will be stored:

scrapy startproject fang

cd fang\

scrapy genspider sfw "fang.com"

The project directory structure is as follows:
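(typical layout produced by scrapy startproject and scrapy genspider)

fang/
├── scrapy.cfg          # deployment configuration
└── fang/
    ├── __init__.py
    ├── items.py        # item definitions
    ├── middlewares.py  # spider / downloader middlewares
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/
        ├── __init__.py
        └── sfw.py      # the sfw spider generated above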

III. settings.py Configuration

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}
DOWNLOADER_MIDDLEWARES = {
   'fang.middlewares.UserAgentDownloadMiddleware': 543,
}
ITEM_PIPELINES = {
   'fang.pipelines.FangPipeline': 300,
}

IV. Full Code

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for fang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'fang'

SPIDER_MODULES = ['fang.spiders']
NEWSPIDER_MODULE = 'fang.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'fang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'fang.middlewares.FangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'fang.middlewares.UserAgentDownloadMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'fang.pipelines.FangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NewHouseItem(scrapy.Item):
    # Province
    province = scrapy.Field()

    # City
    city = scrapy.Field()

    # Name of the development
    name = scrapy.Field()

    # Price
    price = scrapy.Field()

    # Number of bedrooms (list)
    rooms = scrapy.Field()

    # Floor area
    area = scrapy.Field()

    # Address
    address = scrapy.Field()

    # Administrative district
    district = scrapy.Field()

    # Sale status
    sale = scrapy.Field()

    # URL of the Fang.com detail page
    origin_url = scrapy.Field()


class ESFHouseItem(scrapy.Item):
    # Province
    province = scrapy.Field()

    # City
    city = scrapy.Field()

    # Name of the residential community
    name = scrapy.Field()

    # Rooms and halls (e.g. "3室2厅")
    rooms = scrapy.Field()

    # Floor
    floor = scrapy.Field()

    # Orientation
    toward = scrapy.Field()

    # Year built
    year = scrapy.Field()

    # Address
    address = scrapy.Field()

    # Built-up area
    area = scrapy.Field()

    # Total price
    price = scrapy.Field()

    # Price per square metre
    unit = scrapy.Field()

    # Original URL
    origin_url = scrapy.Field()

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter

from fang.items import ESFHouseItem, NewHouseItem


class FangPipeline(object):
    def __init__(self):
        self.newhouse_fp = open('newhouse.json', 'wb')
        self.esfhouse_fp = open('esfhouse.json', 'wb')
        self.newhouse_exporter = JsonLinesItemExporter(self.newhouse_fp, ensure_ascii=False)
        self.esfhouse_exporter = JsonLinesItemExporter(self.esfhouse_fp, ensure_ascii=False)

    def process_item(self, item, spider):
        # Route each item to the exporter for its type, so new-house and
        # second-hand-house records end up in separate JSON Lines files.
        if isinstance(item, NewHouseItem):
            self.newhouse_exporter.export_item(item)
        elif isinstance(item, ESFHouseItem):
            self.esfhouse_exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.newhouse_fp.close()
        self.esfhouse_fp.close()

sfw.py:

# -*- coding: utf-8 -*-
import re

import scrapy
from fang.items import NewHouseItem, ESFHouseItem


class SfwSpider(scrapy.Spider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        trs = response.xpath("//div[@class='outCont']//tr")
        province = None
        for tr in trs:
            tds = tr.xpath(".//td[not(@class)]")
            province_td = tds[0]
            province_text = province_td.xpath(".//text()").get()
            province_text = re.sub(r"\s", "", province_text)
            if province_text:
                province = province_text
            # Skip the "其它" (other/overseas) section
            if province == "其它":
                continue
            city_td = tds[1]
            city_links = city_td.xpath(".//a")
            for city_link in city_links:
                city = city_link.xpath(".//text()").get()
                city_url = city_link.xpath(".//@href").get()

                # Build the new-house and second-hand-house URLs from the city URL,
                # e.g. https://cq.fang.com/ -> https://cq.newhouse.fang.com/house/s/
                #                          and https://cq.esf.fang.com/
                url_module = city_url.split("//")
                scheme = url_module[0]
                domain_all = url_module[1].split("fang")
                domain_0 = domain_all[0]
                domain_1 = domain_all[1]
                if "bj." in domain_0:
                    # Beijing uses its own dedicated subdomains
                    newhouse_url = "https://newhouse.fang.com/house/s/"
                    esf_url = "https://esf.fang.com/"
                else:
                    # New-house listing URL
                    newhouse_url = scheme + "//" + domain_0 + "newhouse.fang" + domain_1 + "house/s/"
                    # Second-hand-house listing URL
                    esf_url = scheme + "//" + domain_0 + "esf.fang" + domain_1

                yield scrapy.Request(url=newhouse_url, callback=self.parse_newhouse,
                                     meta={"info": (province, city)})
                yield scrapy.Request(url=esf_url, callback=self.parse_esf,
                                     meta={"info": (province, city)}, dont_filter=True)


    def parse_newhouse(self, response):
        province, city = response.meta.get('info')
        lis = response.xpath("//div[contains(@class,'nl_con')]/ul/li")
        for li in lis:
            # Project (development) name
            name = li.xpath(".//div[@class='nlcd_name']/a/text()").get()
            if name is not None:
                name = name.strip()

            # House type: number of bedrooms
            house_type_list = li.xpath(".//div[contains(@class,'house_type')]/a/text()").getall()
            house_type_list = list(map(lambda x: re.sub(r"\s", "", x), house_type_list))
            rooms = list(filter(lambda x: x.endswith("居"), house_type_list))

            # Floor area
            area = "".join(li.xpath(".//div[contains(@class,'house_type')]/text()").getall())
            area = re.sub(r"\s|/|-", "", area)

            # Address
            address = li.xpath(".//div[@class='address']/a/@title").get()

            # Administrative district, e.g. 海淀, 朝阳
            district = None
            district_text = "".join(li.xpath(".//div[@class='address']/a//text()").getall())
            district_match = re.search(r".*\[(.+)\].*", district_text)
            if district_match:
                district = district_match.group(1)

            # Sale status
            sale = li.xpath(".//div[contains(@class,'fangyuan')]/span/text()").get()

            # Price
            price = "".join(li.xpath(".//div[@class='nhouse_price']//text()").getall())
            price = re.sub(r"\s|广告", "", price)

            # Detail page URL
            origin_url = li.xpath(".//div[@class='nlcd_name']/a/@href").get()

            item = NewHouseItem(name=name, rooms=rooms, area=area, address=address,
                                district=district, sale=sale, price=price,
                                origin_url=origin_url, province=province, city=city)
            yield item

        next_url = response.xpath(".//div[@class='page']//a[@class='next']/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url),
                                 callback=self.parse_newhouse,
                                 meta={"info": (province, city)})

    def parse_esf(self, response):
        # Province and city passed along in the request meta
        province, city = response.meta.get('info')

        dls = response.xpath("//div[@class='shop_list shop_list_4']/dl")
        for dl in dls:
            item = ESFHouseItem(province=province, city=city)

            # Community name
            name = dl.xpath(".//p[@class='add_shop']/a/text()").get()
            if name is not None:
                item['name'] = name.strip()

            # Combined info: rooms, floor, orientation, year built, area
            infos = dl.xpath(".//p[@class='tel_shop']/text()").getall()
            infos = list(map(lambda x: re.sub(r"\s", "", x), infos))
            for info in infos:
                if "厅" in info:
                    item['rooms'] = info
                elif '层' in info:
                    item['floor'] = info
                elif '向' in info:
                    item['toward'] = info
                elif '年' in info:
                    item['year'] = info
                elif '㎡' in info:
                    item['area'] = info

            # Address
            address = dl.xpath(".//p[@class='add_shop']/span/text()").get()
            if address is not None:
                item['address'] = address

            # Total price
            price = dl.xpath("./dd[@class='price_right']/span[1]/b/text()").getall()
            if price:
                item['price'] = "".join(price)

            # Price per square metre
            unit = dl.xpath("./dd[@class='price_right']/span[2]/text()").get()
            if unit is not None:
                item['unit'] = unit

            # Detail page URL
            detail_url = dl.xpath(".//h4[@class='clearfix']/a/@href").get()
            if detail_url is not None:
                item['origin_url'] = response.urljoin(detail_url)

            yield item

        next_url = response.xpath(".//div[@class='page_al']/p/a/@href").get()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url),
                                 callback=self.parse_esf,
                                 meta={"info": (province, city)})


middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import random


class UserAgentDownloadMiddleware(object):
    # Downloader middleware that sets a random User-Agent header on every request
    USER_AGENTS = [
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; rv:2.2) Gecko/20110201',
        'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.2.3) Gecko/20100401 Lightningquail/3.6.3',
        'Mozilla/5.0 (X11; ; Linux i686; rv:1.9.2.20) Gecko/20110805',
        'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1b3) Gecko/20090305',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.14) Gecko/2009091010',
        'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.10) Gecko/2009042523',
    ]

    def process_request(self, request, spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent'] = user_agent
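
To spot-check that the middleware is actually rotating headers, one option is a throwaway spider that fetches an endpoint which echoes back the User-Agent it received. This is only a sketch: the ua_check spider and the use of httpbin.org are assumptions, not part of the original project.

# ua_check.py -- hypothetical helper spider, placed in fang/spiders/
import scrapy


class UACheckSpider(scrapy.Spider):
    name = 'ua_check'

    def start_requests(self):
        # Request the same echo endpoint a few times; dont_filter avoids de-duplication
        for _ in range(3):
            yield scrapy.Request('https://httpbin.org/user-agent', dont_filter=True)

    def parse(self, response):
        # httpbin.org/user-agent returns a small JSON body echoing the UA header it saw
        self.logger.info(response.text)

Running scrapy crawl ua_check inside the project should log varying Mozilla/5.0 strings across the requests.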

start.py:

from scrapy import cmdline

cmdline.execute("scrapy crawl sfw".split())
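
Running this script from the project root is equivalent to invoking the crawl command directly:

scrapy crawl sfw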

At this point, running start.py in the Windows development environment will crawl the data normally.

V. Deployment

1. Generate requirements.txt on Windows

Open cmder and first activate the virtual environment:

cd C:\Users\fxd\.virtualenvs\sipder_env
.\Scripts\activate


Then change to the project directory and run the following command to generate requirements.txt:
pip freeze > requirements.txt

2. Connect to the Ubuntu server via Xshell and install the dependencies

If openssh is not installed on the server, install it first:

sudo apt-get install openssh-server

Connect to the Ubuntu server, change to the directory containing the virtual environment, and run:

source ./bin/activate

to enter the virtual environment. Then run:

rz

to upload requirements.txt, and then:

pip install -r requirements.txt

to install the project's dependencies.

Finally, install scrapy-redis:

pip install scrapy-redis

3. Modify the code

To turn a regular Scrapy project into a Scrapy-Redis project, only three changes are needed:
(1) Change the spider's base class from scrapy.Spider to scrapy_redis.spiders.RedisSpider (or from scrapy.CrawlSpider to scrapy_redis.spiders.RedisCrawlSpider).
(2) Delete the spider's start_urls and add a redis_key = "***". This key is used later to start the crawler from Redis: the crawler's first URL is pushed through it (see the sketch after the settings below).
(3) Add the following to the settings file:

# Scrapy-Redis settings
# Store requests in Redis via the scrapy_redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Make all crawler instances share the same de-duplication fingerprints
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use the Redis item pipeline
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# Keep the scrapy_redis queues in Redis instead of clearing them,
# so the crawl can be paused and resumed
SCHEDULER_PERSIST = True

# Redis connection settings
REDIS_HOST = '172.20.10.2'
REDIS_PORT = 6379
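
For this project, the spider-side change would look roughly like the following at the top of sfw.py (a minimal sketch; the key name fang:start_urls matches the lpush command used in step 4 below):

# sfw.py after switching to scrapy-redis (only the head of the class changes)
from scrapy_redis.spiders import RedisSpider

from fang.items import NewHouseItem, ESFHouseItem


class SfwSpider(RedisSpider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    # start_urls is removed; the first URL is pushed to this Redis key instead
    redis_key = 'fang:start_urls'

    # parse, parse_newhouse and parse_esf remain exactly as shown above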

4. Upload the code to the server and run it

Compress the project files, upload them in Xshell with the rz command, and extract them.

Run the crawler:
(1) On the crawler server, change to the directory containing the spider file sfw.py, then run scrapy runspider [spider filename]:

scrapy runspider sfw.py

(2) On the Redis server (Windows), start the Redis service:

redis-server redis.windows.conf
If this reports an error, run the following commands in order:
redis-cli.exe
shutdown
exit
redis-server.exe redis.windows.conf

(3) Then open another Windows terminal:

redis-cli

and push the starting URL:

lpush fang:start_urls https://www.fang.com/SoufunFamily.htm
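
If you prefer pushing the start URL from Python rather than redis-cli, a minimal sketch with the redis-py client (assuming it is installed and the address matches the REDIS_HOST/REDIS_PORT in settings.py) would be:

import redis

# Connect to the Redis instance configured in settings.py
r = redis.Redis(host='172.20.10.2', port=6379)
# Push the first URL onto the key the RedisSpider listens on
r.lpush('fang:start_urls', 'https://www.fang.com/SoufunFamily.htm')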

The crawler then starts running.

Open RedisDesktopManager to view the saved data:


Perform the same steps on the other crawler server.
The project is complete!
