Author: kevin
Crawling
Some key points about crawling with Scrapy:
- We need to know which fields we are going to scrape, and define them ahead of time in items.py, for example:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class HospitalItem(scrapy.Item):
    # hospital name
    name = scrapy.Field()
    # year established
    establish_date = scrapy.Field()
    # hospital level
    level = scrapy.Field()
    # type of medical institution
    hosp_type = scrapy.Field()
    # ownership / operating nature
    serv_property = scrapy.Field()
Most of the settings in settings.py can stay at their generated defaults; the one exception is ROBOTSTXT_OBEY, which the template sets to True.
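A minimal sketch of that one change; the original only names the setting, so flipping it to False (to keep robots.txt rules from blocking the crawl) is my assumption:

# settings.py -- everything else stays as generated
ROBOTSTXT_OBEY = False  # assumption: ignore robots.txt so target pages aren't filtered out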
Then create a new spider .py file in the project's spiders directory and define a dedicated spider class. A few things to watch for:
It's recommended to add headers like the following to disguise the crawler; then just pass the headers option when calling Request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
}
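For context, a minimal sketch of where those headers live in the spider class and how they ride along on the first request (the spider name, start URL, and starting page counter are my guesses, not from the original):

import scrapy
from scrapy import Request


class HospitalSpider(scrapy.Spider):
    name = 'hospitals'  # hypothetical spider name
    headers = {
        # the User-Agent dict shown above goes here
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
    }
    i = 1  # listing-page counter used by parse() below (starting value assumed)

    def start_requests(self):
        # attach the disguise headers; the response goes to self.parse by default
        yield Request('http://yyk.qqyy.com/search_dplmhks0i1.html',
                      headers=self.headers)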
The crawler's main job is to get the XPath for every piece of information to be scraped.
For anyone without XPath knowledge, the easiest way to get it is through Chrome's inspect tool: right-click the element, choose Inspect, then Copy → Copy XPath on the highlighted node. Say we need to scrape a product's price from the page.
After copying the price's XPath this way, extract it in code as follows:
def parse_dir_contents(self, response):
    item = HospitalItem()
    item['name'] = response.xpath(
        '/html/body/div[5]/div[3]/div[1]/ul/li[1]/text()').extract()
    item['establish_date'] = response.xpath(
        '/html/body/div[5]/div[3]/div[1]/ul/li[2]/text()').extract()
    ...
If the scraped text needs to be processed with a regular expression (for example, to remove newline characters), it can be done like this:
item['qq_acc'] = response.xpath(
    '//table[@class="link"]/tr[2]/td[2]/text()').re(r'\r\n\s*(.*)\r\n')
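To see what that pattern does, here is a tiny standalone sketch with made-up input:

import re

raw = '\r\n    QQ: 12345678\r\n'  # hypothetical scraped cell text
# the capture group keeps only the content between the surrounding CRLF/whitespace
print(re.findall(r'\r\n\s*(.*)\r\n', raw))  # ['QQ: 12345678']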
If you need to crawl multiple levels of information, e.g. the top level is a listing and you have to enter each target page under it to get the details, first scrape all the target URLs from the listing page, then iterate over each URL and use a callback to invoke a dedicated function that scrapes the detail page. A concrete example:
def parse(self, response):
    # scrape each entry's URL from the listing page
    hospitals = response.xpath(
        '//div[@class="seek_int_left"]/h3/a/@href').extract()
    for hospital in hospitals:  # iterate over every URL
        url = str(hospital)
        # use a callback to invoke the function that scrapes the second-level page
        yield Request(url, callback=self.parse_dir_contents)
    urlhead = "http://yyk.qqyy.com/search_dplmhks0i"
    urltail = ".html"
    if self.i < 3018:  # iterate over every listing page
        real = urlhead + str(self.i) + urltail
        self.i = self.i + 1
        yield Request(real, headers=self.headers)
Cleaning and Modifying
Some quick notes on data cleaning with pandas:
- Simple duplicate records can be cleaned directly with drop_duplicates
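A minimal sketch of both forms, with made-up column names:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'a', 'b'],
                   'addr': ['x', 'x', 'y']})
df = df.drop_duplicates()                 # drop rows identical in every column
df = df.drop_duplicates(subset=['name'])  # or dedupe on selected columns only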
- To strip characters such as tabs and newlines from the data, use replace, like this:
df = df.replace('\n', '', regex=True)
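Several such characters can also be stripped in one pass with a regex character class:

df = df.replace(r'[\n\t\r]', '', regex=True)  # newlines, tabs, and carriage returns together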
- If the exported data all needs to be wrapped in double quotes, first cast every column's type to string, then add the quoting parameter on export:
import csv
df.to_csv('test.csv', quoting=csv.QUOTE_NONNUMERIC)
- Also, when exporting with to_csv, set index to False so the output won't include pandas' own index column.
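Putting the last two notes together (test.csv is the same placeholder name as above):

import csv

df = df.astype(str)  # cast every column to string so all values get quoted
df.to_csv('test.csv', quoting=csv.QUOTE_NONNUMERIC, index=False)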
- For fuzzy string matching I used fuzzywuzzy from GitHub, which offers several different matching modes. My impression from using it is that its support for Chinese characters isn't great, because it relies purely on the Levenshtein distance algorithm (mingpipe on GitHub is another matcher built specifically for Chinese; use it if you need fuzzy matching on Chinese text).
Here is a quick rundown of how to use fuzzywuzzy:
- First, install it:
pip3 install fuzzywuzzy[speedup]
Then import it:
# fuzz compares two strings with each other
from fuzzywuzzy import fuzz
# process compares one string against multiple others
from fuzzywuzzy import process
- Below are the 4 matching modes it provides (a higher output score means a closer match; 100 means identical):
- fuzz.ratio compares the whole string, taking word order into account
fuzz.ratio("this is a test", "this is a fun") #output 74
- fuzz.partial_ratio only tests subsections of the string
fuzz.partial_ratio("this is a test", "test a is this") #output 57
- fuzz.token_sort_ratio ignores word order
fuzz.token_sort_ratio("this is a test", "is this a test") #output 100
- fuzz.token_set_ratio ignores duplicated words
fuzz.token_set_ratio("this is a test", "this is is a test") #output 100
- When comparing one string against many, use process.extract, for example:
choices = ['fuzzy fuzzy was a bear', 'is this a test', 'THIS IS A TEST']
process.extract("this is a test", choices, scorer=fuzz.ratio)
The corresponding output will be:
[('THIS IS A TEST', 100), ('is this a test', 86), ('fuzzy fuzzy was a bear', 33)]
- For data in a pandas df we can fuzzy-match with something like the following; suppose we want to find likely duplicate addresses in the df:
lookups_addr = df[df.addr.notnull()].addr
res = [(lookup_a,) + item
       for lookup_a in lookups_addr
       for item in process.extract(lookup_a, lookups_addr, limit=2)]
df1 = pd.DataFrame(res, columns=["lookup", "matched", "score", "name"])
df1[(df1.score < 100) & (df1.score > 90)]
Visualization
Used ECharts for visualization, an open-source JavaScript charting library by Baidu
Lots of good templates to choose from: http://echarts.baidu.com/examples/
- Loading Data:
- 2 options:
- save large data as a JSON file, use jQuery to fetch it asynchronously, then load it in 'setOption' (you need to write a parser for this; see the export sketch after this list)
- provide the data directly in code as a var: good and easy for small sets of data
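The JSON option pairs naturally with the pandas step above; a minimal sketch of producing that file (data.json matches the loading snippet below, test.csv is the earlier placeholder):

import pandas as pd

df = pd.read_csv('test.csv')
# orient='records' writes a JSON array of objects, which is easy to
# reshape in JS before handing it to setOption
df.to_json('data.json', orient='records', force_ascii=False)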
- Loading animation:
- If data loading takes a really long time, we can show a loading animation to let users know the data is on its way:
myChart.showLoading();
$.get('data.json').done(function (data) {
    myChart.hideLoading();
    myChart.setOption(...);
});
- Adding a Map
- If we want to display data on the map, we need to include 'geo' or 'bmap' in 'setOption' and then set the relevant options:
geo: {
    map: 'china',
    label: {
        emphasis: {
            show: false
        }
    },
    roam: false,
    itemStyle: {
        normal: {
            areaColor: '#404448',
            borderColor: '#111'
        },
        emphasis: {
            areaColor: '#2a333d'
        }
    },
    silent: true // do not respond to mouse events on the map
}
- Data Settings:
- We set the data options in 'series'; to visualize data on a map we usually choose the 'scatter' or 'effectScatter' type:
series: [
    {
        name: 'Top 50',
        type: 'effectScatter', // here we choose 'effectScatter'
        coordinateSystem: 'geo', // 'geo' or 'bmap', matching what was specified above in setOption
        data: convertData(top50), // point data
        symbolSize: function (val) {
            // scale the symbol by the point's value; if the value range is
            // too wide, take the square root or even the cube root to compress it
            return Math.sqrt(val[2]) / 10;
        },
        showEffectOn: 'render',
        rippleEffect: {
            brushType: 'stroke'
        },
        hoverAnimation: true,
        label: {
            normal: {
                formatter: '{b}',
                position: 'right',
                show: false
            }
        },
        itemStyle: {
            normal: {
                color: '#891d14',
                shadowBlur: 5,
                shadowColor: '#333'
            }
        },
        zlevel: 1
    }
]