I have been working with crawlers for a while now. At first I only used the requests library to fetch data and never touched a crawler framework. Out of curiosity I spent the last couple of days reading the Scrapy documentation and trying it out on some real data, and it turned out to be genuinely convenient.
The example below crawls the sales index on bitauto.com (Yiche); I won't walk through every detail of the process.
Required fields:
- Date (year and month);
- Sales volume;
- Category (including small, mini, mid-size, compact, mid-to-large, SUV, MPV, brand, manufacturer);
- Car model.
Analyzing the site
- Looking at the URL http://index.bitauto.com/xiaoliang/jincouxingche/2016m3/2/, we find that 2016 is the year, 3 is the month, 2 is the page number, and jincouxingche is the category;
- By changing only the year, month, page number, and category in the URL, we can request different data.
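The URL scheme above can be sketched as a pair of small helpers. This is only an illustration of the pattern, not part of the spider itself; the category slug, year, month, and page values come from the example URL:

```python
import re

# Build a sales-index URL from its four variable parts (per the pattern above).
def build_url(category, year, month, page=1):
    return 'http://index.bitauto.com/xiaoliang/%s/%dm%d/%d/' % (category, year, month, page)

# Recover the parts from a URL, using the same regex idea the spider uses later.
def parse_url(url):
    category, year, month = re.findall(r'xiaoliang/(.*?)/(\d+)m(\d+)', url)[0]
    return category, int(year), int(month)

url = build_url('jincouxingche', 2016, 3, 2)
# url == 'http://index.bitauto.com/xiaoliang/jincouxingche/2016m3/2/'
print(parse_url(url))  # ('jincouxingche', 2016, 3)
```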
Note: digging deeper, we find the data is also loaded dynamically, returned from an aspx endpoint. The requests above target the static page, which returns plain HTML.
Writing the spider
# -*- coding:utf-8 -*-
import scrapy
from scrapy.http import Request
import re

# Build the list of start URLs: one per (category, year, month).
url_list = []
Type = ['changshang', 'pinpai', 'jincouxingche', 'xiaoxingche', 'weixingche',
        'zhongxingche', 'zhongdaxingche', 'suv', 'mpv']
for t in Type:
    for year in range(2009, 2016):  # years 2009 through 2015
        for m in range(1, 13):
            url = 'http://index.bitauto.com/xiaoliang/' + t + '/' + str(year) + 'm' + str(m) + '/'
            url_list.append(url)

class YicheItem(scrapy.Item):
    # define the fields for your item here:
    Date = scrapy.Field()
    CarName = scrapy.Field()
    Type = scrapy.Field()
    SalesNum = scrapy.Field()

class YicheSpider(scrapy.spiders.Spider):
    name = "yiche"
    allowed_domains = ["index.bitauto.com"]
    start_urls = url_list

    def parse(self, response):
        # scrape the items on the current page
        s = response.url
        t, year, m = re.findall(r'xiaoliang/(.*?)/(\d+)m(\d+)', s, re.S)[0]
        for sel in response.xpath('//ol/li'):
            Name = sel.xpath('a/text()').extract()[0]
            SalesNum = sel.xpath('span/text()').extract()[0]
            items = YicheItem()
            items['Date'] = str(year) + '/' + str(m)
            items['CarName'] = Name
            items['Type'] = t
            items['SalesNum'] = SalesNum
            yield items
        # check for a next page; skip if there is none, otherwise follow it
        if len(response.xpath('//div[@class="the_pages"]/@class').extract()) == 0:
            pass
        else:
            next_pageclass = response.xpath('//div[@class="the_pages"]/div/span[@class="next_off"]/@class').extract()
            next_page = response.xpath('//div[@class="the_pages"]/div/span[@class="next_off"]/text()').extract()
            if len(next_page) != 0 and len(next_pageclass) != 0:
                pass  # the "next" button is disabled: this is the last page
            else:
                next_url = 'http://index.bitauto.com' + response.xpath('//div[@class="the_pages"]/div/a/@href')[-1].extract()
                yield Request(next_url, callback=self.parse)
Saving to the database
Create a table in the MySQL database to hold the data;
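A possible schema for that table is sketched below. The column names match the item fields; the column types and lengths are assumptions, so adjust them to your data:

```sql
CREATE TABLE yiche (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    Date VARCHAR(10),      -- e.g. '2016/3'
    CarName VARCHAR(100),
    Type VARCHAR(20),      -- category slug, e.g. 'suv'
    SalesNum VARCHAR(20)   -- stored as the scraped text
) DEFAULT CHARSET=utf8;
```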
Edit the settings.py file:
ITEM_PIPELINES = {
    'yiche.pipelines.YichePipeline': 300,
}
- Edit the file pipelines.py as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
import MySQLdb.cursors
import logging
from twisted.enterprise import adbapi

class YichePipeline(object):
    def __init__(self):
        self.dbpool = adbapi.ConnectionPool(
            dbapiName='MySQLdb',  # database driver: MySQL here
            host='127.0.0.1',     # server address (local machine)
            db='scrapy',          # database name
            user='root',          # user name
            passwd='root',        # password
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',       # connection encoding
            use_unicode=False
        )

    # default pipeline entry point
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        query.addErrback(logging.error)  # surface database errors instead of dropping them
        return item

    # insert one item into the database
    def _conditional_insert(self, tx, item):
        params = (item['Date'], item['CarName'], item['Type'], item['SalesNum'])
        # parameterized query, so quotes in car names cannot break the SQL
        sql = "insert into yiche (Date, CarName, Type, SalesNum) values (%s, %s, %s, %s)"
        tx.execute(sql, params)
Running the crawler
Run from the command line:
scrapy crawl yiche
Check the data:
OK, we end up with roughly 30,000+ records.