I have been working with crawlers for a while now. At first I only used the requests library to fetch data and never touched a crawler framework. Out of curiosity I spent the last couple of days reading the Scrapy documentation and trying it out on some real data, and it turned out to be genuinely convenient.
The example below crawls the sales index on bitauto.com (Yiche); I won't walk through every detail of the process.
Required fields:
- Date (year and month);
- Sales volume;
- Category (including small, mini, mid-size, compact, mid-to-large, SUV, MPV, brand, manufacturer);
- Car model.
Analyzing the site
- Looking at the URL http://index.bitauto.com/xiaoliang/jincouxingche/2016m3/2/, we find that 2016 is the year, 3 is the month, 2 is the page number, and jincouxingche is the category;
- By changing only the year, month, page number, and category in the URL, we can request different data.
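The URL scheme above can be sketched as a pair of small helpers. This is only an illustration of the pattern, not part of the spider itself; the category slug, year, month, and page values come from the example URL:

```python
import re

# Build a sales-index URL from its four variable parts (per the pattern above).
def build_url(category, year, month, page=1):
    return 'http://index.bitauto.com/xiaoliang/%s/%dm%d/%d/' % (category, year, month, page)

# Recover the parts from a URL, using the same regex idea the spider uses later.
def parse_url(url):
    category, year, month = re.findall(r'xiaoliang/(.*?)/(\d+)m(\d+)', url)[0]
    return category, int(year), int(month)

url = build_url('jincouxingche', 2016, 3, 2)
# url == 'http://index.bitauto.com/xiaoliang/jincouxingche/2016m3/2/'
print(parse_url(url))  # ('jincouxingche', 2016, 3)
```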
Note: digging deeper, we find the data is also loaded dynamically, returned from an aspx endpoint. The requests above target the static page, which returns plain HTML.
Writing the spider
# -*- coding:utf-8 -*-
import scrapy
from scrapy.http import Request
import re

# Build the list of start URLs: one per (category, year, month).
url_list = []
Type = ['changshang', 'pinpai', 'jincouxingche', 'xiaoxingche', 'weixingche',
        'zhongxingche', 'zhongdaxingche', 'suv', 'mpv']
for t in Type:
    for year in range(2009, 2016):  # years 2009 through 2015
        for m in range(1, 13):
            url = 'http://index.bitauto.com/xiaoliang/' + t + '/' + str(year) + 'm' + str(m) + '/'
            url_list.append(url)

class YicheItem(scrapy.Item):
    # define the fields for your item here:
    Date = scrapy.Field()
    CarName = scrapy.Field()
    Type = scrapy.Field()
    SalesNum = scrapy.Field()

class YicheSpider(scrapy.spiders.Spider):
    name = "yiche"
    allowed_domains = ["index.bitauto.com"]
    start_urls = url_list

    def parse(self, response):
        # scrape the items on the current page
        s = response.url
        t, year, m = re.findall(r'xiaoliang/(.*?)/(\d+)m(\d+)', s, re.S)[0]
        for sel in response.xpath('//ol/li'):
            Name = sel.xpath('a/text()').extract()[0]
            SalesNum = sel.xpath('span/text()').extract()[0]
            items = YicheItem()
            items['Date'] = str(year) + '/' + str(m)
            items['CarName'] = Name
            items['Type'] = t
            items['SalesNum'] = SalesNum
            yield items
        # check for a next page; skip if there is none, otherwise follow it
        if len(response.xpath('//div[@class="the_pages"]/@class').extract()) == 0:
            pass
        else:
            next_pageclass = response.xpath('//div[@class="the_pages"]/div/span[@class="next_off"]/@class').extract()
            next_page = response.xpath('//div[@class="the_pages"]/div/span[@class="next_off"]/text()').extract()
            if len(next_page) != 0 and len(next_pageclass) != 0:
                pass  # the "next" button is disabled: this is the last page
            else:
                next_url = 'http://index.bitauto.com' + response.xpath('//div[@class="the_pages"]/div/a/@href')[-1].extract()
                yield Request(next_url, callback=self.parse)
Saving to the database
Create a table in the MySQL database to hold the data;
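A possible schema for that table is sketched below. The column names match the item fields; the column types and lengths are assumptions, so adjust them to your data:

```sql
CREATE TABLE yiche (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Date VARCHAR(10),      -- e.g. '2016/3'
    CarName VARCHAR(100),
    Type VARCHAR(20),      -- category slug, e.g. 'suv'
    SalesNum VARCHAR(20)   -- stored as the scraped text
) DEFAULT CHARSET=utf8;
```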
Edit the settings.py file:
ITEM_PIPELINES = {
    'yiche.pipelines.YichePipeline': 300,
}
- Edit the file pipelines.py as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
import MySQLdb.cursors
import logging
from twisted.enterprise import adbapi

class YichePipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool(
            dbapiName='MySQLdb',  # database driver: MySQL here
            host='127.0.0.1',     # server address (local machine)
            db='scrapy',          # database name
            user='root',          # user name
            passwd='root',        # password
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',       # connection encoding
            use_unicode=False
        )

    # default pipeline entry point
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(logging.error)  # surface database errors instead of dropping them
        return item

    # insert one item into the database
    def _conditional_insert(self, tx, item):
        params = (item['Date'], item['CarName'], item['Type'], item['SalesNum'])
        # parameterized query, so quotes in car names cannot break the SQL
        sql = "insert into yiche (Date, CarName, Type, SalesNum) values (%s, %s, %s, %s)"
        tx.execute(sql, params)
Running the crawler
Run from the command line:
scrapy crawl yiche
Check the data:
OK, we end up with roughly 30,000+ records.