Scrapy Example: Scraping Weather Data from the China Weather Network (weather.com.cn)

1. Create the project

In the directory where you keep your projects, Shift + right-click to open a command prompt and run the command that creates the project:

PS F:\ScrapyProject> scrapy startproject weather      # weather is the project name

Press Enter and the project is created.

All this command really does is create a folder containing the files and subfolders the framework expects.

Our job is simply to edit some of those files.

Building a crawler with Scrapy is really like filling in the blanks, which makes it quite simple.

Run the following in the command prompt:

PS F:\ScrapyProject> cd weather   # enter the project directory you just created
PS F:\ScrapyProject\weather>

You are now in the project root directory.

That's it: you have created a Scrapy project.

Before going further, it is worth understanding how a Scrapy crawler actually works:

When you run scrapy crawl spidername, the program automatically builds Requests from start_urls and sends them, then calls the parse callback on each response. During parsing, the rules are used to extract matching links from the HTML (or XML) text; each extracted link generates a new Request, and the cycle repeats until the returned pages contain no more matching links, or the scheduler runs out of Request objects, at which point the program stops.
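To make that flow concrete, here is a minimal sketch (illustrative only, not the spider we will build for this project) showing the pieces involved: start_urls, a rule with a LinkExtractor, and a callback:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SketchSpider(CrawlSpider):
    name = 'sketch'
    start_urls = ['http://www.weather.com.cn/weather1d/101020100.shtml']
    # every link on a fetched page that matches `allow` becomes a new Request
    rules = (
        Rule(LinkExtractor(allow=r'weather1d/\d+\.shtml'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # called for every matched page; yield items (or further Requests) here
        pass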

2. Decide what to scrape

Here we use the China Weather Network (www.weather.com.cn) as the source.

As the saying goes, sharpen your tools before you start: always analyse the page before scraping it, and work out which pieces of information you need and how to get at them most conveniently.

Space is limited, so only part of the page source is shown here:

<div class="ctop clearfix">
			<div class="crumbs fl">
				<a href="http://js.weather.com.cn" target="_blank">江蘇</a>
				<span>></span>
				<a href="http://www.weather.com.cn/weather/101190801.shtml" target="_blank">徐州</a><span>></span>  <span>鼓樓</span>
			</div>
			<div class="time fr"></div>
		</div>

You can see that this part of the markup contains the location information, which is one of the fields we need.
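As a quick standalone check (not part of the project files), you can run the XPath we will use later directly against this snippet with Scrapy's Selector:

from scrapy import Selector

html = '''
<div class="ctop clearfix">
  <div class="crumbs fl">
    <a href="http://js.weather.com.cn" target="_blank">江蘇</a>
    <span>></span>
    <a href="http://www.weather.com.cn/weather/101190801.shtml" target="_blank">徐州</a><span>></span>  <span>鼓樓</span>
  </div>
  <div class="time fr"></div>
</div>
'''

sel = Selector(text=html)
print(sel.xpath("//div[@class='crumbs fl']/a/text()").extract())     # ['江蘇', '徐州']
print(sel.xpath("//div[@class='crumbs fl']/span/text()").extract())  # ['>', '>', '鼓樓']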

Next, keep looking through the page for the other fields we need, such as the weather description and the temperatures.

3. Fill in items.py

items.py is only there to hold the fields you want to scrape:

Give each piece of information you want a name:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    city = scrapy.Field()            # province
    city_addition = scrapy.Field()   # city
    city_addition2 = scrapy.Field()  # county / district
    weather = scrapy.Field()         # weather description
    data = scrapy.Field()            # date
    temperatureMax = scrapy.Field()  # daily high
    temperatureMin = scrapy.Field()  # daily low
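Once defined, a WeatherItem behaves much like a dict, which is how the spider will fill it in later. A quick illustration (not part of the project files):

from weather.items import WeatherItem

item = WeatherItem()
item['city'] = '江蘇'
item['temperatureMax'] = '23'
# note: assigning a key that was not declared as a Field raises a KeyError
print(dict(item))    # {'city': '江蘇', 'temperatureMax': '23'}
print(item['city'])  # 江蘇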

4. Fill in spider.py

spider.py is, as the name suggests, the crawler itself.

Before filling it in, let's look at how to get hold of the information we need.

The command prompt from earlier should still be open; if you closed it, that's fine too.

Press Win+R to open cmd again and type:

C:\Users\admin>scrapy shell http://www.weather.com.cn/weather1d/101020100.shtml#search  # the URL is the page you want to scrape

This is Scrapy's shell command: it lets you inspect and debug a site's response without starting the crawler.

The result:

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x04C42C10>
[s]   item       {}
[s]   request    <GET http://www.weather.com.cn/weather1d/101020100.shtml#search>
[s]   response   <200 http://www.weather.com.cn/weather1d/101020100.shtml>
[s]   settings   <scrapy.settings.Settings object at 0x04C42B10>
[s]   spider     <DefaultSpider 'default' at 0x4e37d90>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:

There is also a long stretch of log output, but don't worry about it; as long as the Available Scrapy objects include response, you're fine.

response is what Scrapy received back after sending the request to the target site for you.

Now let's run some tests.

Elements are located with XPath; if you have never used XPath before, don't worry, a quick search will get you going.

To explain the XPath used at In[3]: it selects the text of the a tags inside the div with class="crumbs fl".

extract() is what actually pulls the text out.

The tests show that the expression entered at In[3] gets us the information we want.
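For reference, the shell session looks roughly like this (In[3] is the expression we care about):

In [1]: response.xpath("//div[@class='crumbs fl']")                     # the breadcrumb <div>
In [2]: response.xpath("//div[@class='crumbs fl']/a")                    # the <a> tags inside it
In [3]: response.xpath("//div[@class='crumbs fl']/a/text()").extract()   # their text, e.g. ['江蘇', '徐州'] for the page shown earlier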

So let's write it into a first, minimal spider:

import scrapy
from weather.items import WeatherItem

# First, minimal version: a plain Spider that only parses the start URL.
# (The complete version further down switches to a CrawlSpider with rules.)
class Spider(scrapy.Spider):
    name = 'weatherSpider'  # the spider's name
    start_urls = [          # URL the spider starts crawling from
        "http://www.weather.com.cn/weather1d/101020100.shtml#search"
    ]

    # the default callback, called with the response of each request
    def parse(self, response):
        item = WeatherItem()
        # item['city'] is the field you defined in items.py
        item['city'] = response.xpath("//div[@class='crumbs fl']/a/text()").extract_first()
        yield item

At this point the crawler already works in a basic way. Trim items.py so that only the city field remains for now,

then run the crawler (make sure you are inside the project directory):

PS F:\ScrapyProject\weather> scrapy crawl weatherSpider # weatherSpider is the spider name you defined

So far we only get the "city" field; the remaining fields still have to be extracted from the HTML.

You can try writing the XPath expressions yourself.

The complete spider.py:

import scrapy
from weather.items import WeatherItem
from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor

class Spider(CrawlSpider):
    name = 'weatherSpider'   # the spider's name
    #allowed_domains = ["www.weather.com.cn"]    # allowed domains
    start_urls = [                               # URL the crawl starts from
        "http://www.weather.com.cn/weather1d/101020100.shtml#search"
    ]
    # rules filter out the urls we do not want to crawl
    rules = (
        Rule(LinkExtractor(allow=r'http://www.weather.com.cn/weather1d/101\d{6}.shtml#around2'),
             follow=False, callback='parse_item'),
    )   # when callback is given, follow defaults to False; without a callback, follow defaults to True

    # note: when crawling multiple pages through rules, use a custom callback name, not parse
    def parse_item(self, response):
        item = WeatherItem()
        item['city'] = response.xpath("//div[@class='crumbs fl']/a/text()").extract_first()
        city_addition = response.xpath("//div[@class='crumbs fl']/span[2]/text()").extract_first()
        if city_addition == '>':
            # breadcrumb looks like: province > city > county
            item['city_addition'] = response.xpath("//div[@class='crumbs fl']/a[2]/text()").extract_first()
        else:
            # breadcrumb has no separate city link
            item['city_addition'] = city_addition
        item['city_addition2'] = response.xpath("//div[@class='crumbs fl']/span[3]/text()").extract_first()
        weatherData = response.xpath("//div[@class='today clearfix']/input[1]/@value").extract_first()
        item['data'] = weatherData[0:6]   # e.g. '10月17日'
        item['weather'] = response.xpath("//p[@class='wea']/text()").extract_first()
        item['temperatureMax'] = response.xpath("//ul[@class='clearfix']/li[1]/p[@class='tem']/span[1]/text()").extract_first()
        item['temperatureMin'] = response.xpath("//ul[@class='clearfix']/li[2]/p[@class='tem']/span[1]/text()").extract_first()
        yield item

That is quite a bit more code, so a quick rundown:

allowed_domains: as the name suggests, the allowed domains; the spider will only crawl URLs under these domains.

rules: the crawl rules; the spider only follows URLs that match them.

  The LinkExtractor inside a rule takes an allow parameter, a regular expression describing which links to match. If you are not comfortable with regular expressions, write one and check it with an online tester; after a few tries simple patterns become easy, and the one we need here is not complicated.

  A rule can also take a callback, the function the spider calls for every URL matching the rule. Note that it must be distinct from the default parse callback.

  (All of the scraped data is visible in the command-line output.)

  A rule also takes follow. When it is True, the spider follows every matching URL found on the page; otherwise it does not. I set it to False here, because with True the crawl takes a long time: roughly two thousand weather entries.
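For reference, if you do want to crawl every matching page, the rule would simply set follow=True (expect the crawl to take much longer):

rules = (
    Rule(LinkExtractor(allow=r'http://www.weather.com.cn/weather1d/101\d{6}.shtml#around2'),
         follow=True, callback='parse_item'),
)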

To actually save the scraped data, though, you still need to write pipelines.py:

5. Fill in pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from scrapy.exporters import JsonItemExporter
from openpyxl import Workbook

# save the items as a JSON file
class JsonPipeline(object):
    # use JsonItemExporter to write the data
    def __init__(self):
        self.file = open('weather1.json', 'wb')
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        print('Write')
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        print('Close')
        self.exporter.finish_exporting()
        self.file.close()

# save the items as a TXT file
class TxtPipeline(object):
    def process_item(self, item, spider):
        # build the file path from the current working directory
        base_dir = os.getcwd()
        filename = os.path.join(base_dir, 'weather.txt')
        print('創建Txt')
        # open the file in append mode and write the fields of this item
        with open(filename, 'a', encoding='utf-8') as f:
            f.write('城市:' + item['city'] + ' ')
            f.write(item['city_addition'] + ' ')
            f.write(item['city_addition2'] + '\n')
            f.write('天氣:' + item['weather'] + '\n')
            f.write('溫度:' + item['temperatureMin'] + '~' + item['temperatureMax'] + '℃\n')
        return item

# save the items as an Excel file
class ExcelPipeline(object):
    # create the workbook and write the header row
    def __init__(self):
        self.wb = Workbook()
        self.ws = self.wb.active
        # header row
        self.ws.append(['省', '市', '縣(鄉)', '日期', '天氣', '最高溫', '最低溫'])

    def process_item(self, item, spider):
        line = [item['city'], item['city_addition'], item['city_addition2'], item['data'],
                item['weather'], item['temperatureMax'], item['temperatureMin']]
        self.ws.append(line)           # append the item as a new row of the worksheet
        self.wb.save('weather.xlsx')   # save on every item, so a partial crawl still leaves a file
        return item

The pipelines are written; now you also have to enable them in settings.py:

6. Fill in settings.py

Set DOWNLOAD_DELAY (the download delay) so you don't get blocked for hitting the site too fast.

Just uncomment the line:

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

Then enable ITEM_PIPELINES:

Again, just uncomment the entry you want; the number is the pipeline's priority, and lower values run first.

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    #'weather.pipelines.TxtPipeline': 600,
    #'weather.pipelines.JsonPipeline': 6,
    'weather.pipelines.ExcelPipeline': 300,
}

Done.

Go back to the project root directory and run the crawler:

PS F:\ScrapyProject\weather> scrapy crawl weatherSpider

Part of the output:

2018-10-17 15:16:19 [scrapy.core.engine] INFO: Spider opened
2018-10-17 15:16:19 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-17 15:16:19 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-10-17 15:16:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/robots.txt> (referer: None)
2018-10-17 15:16:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/weather1d/101020100.shtml#search> (referer: None)
2018-10-17 15:16:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/weather1d/101191101.shtml#around2> (referer: http://www.weather.com.cn/weather1d/101020100.shtml)
2018-10-17 15:16:22 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.weather.com.cn/weather1d/101191101.shtml>
{'city': '江蘇',
 'city_addition': '常州',
 'city_addition2': '城區',
 'data': '10月17日',
 'temperatureMax': '23',
 'temperatureMin': '13',
 'weather': '多雲'}
2018-10-17 15:16:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.weather.com.cn/weather1d/101190803.shtml#around2> (referer: http://www.weather.com.cn/weather1d/101020100.shtml)
2018-10-17 15:16:24 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.weather.com.cn/weather1d/101190803.shtml>
{'city': '江蘇',
 'city_addition': '徐州',
 'city_addition2': '豐縣',
 'data': '10月17日',
 'temperatureMax': '20',
 'temperatureMin': '7',
 'weather': '陰'}

The Excel file in the project root directory:

The contents written to the Excel file:

The contents written to the txt file:

Feel free to leave a comment!

Complete project code: https://github.com/sanqiansang/weatherSpider.git

 
