Data Crawling, Cleaning, and Visualization

Author: kevin

Crawling

Some key points about crawling with scrapy:
- We need to know which target fields we want to crawl and define them in items.py in advance, for example:

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/en/latest/topics/items.html

    import scrapy


    class HospitalItem(scrapy.Item):
        # Hospital Name
        name = scrapy.Field()
        # Year established
        establish_date = scrapy.Field()
        # Hospital level
        level = scrapy.Field()
        # Type of medical institution
        hosp_type = scrapy.Field()
        # Nature of operation
        serv_property = scrapy.Field()

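Most settings in settings.py can be left unchanged, except ROBOTSTXT_OBEY = True. Many websites publish a robots.txt file that defines rules for crawlers, and with this option enabled scrapy skips the URLs that file disallows. If your crawler fails to fetch pages, you can try setting it to False.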
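
To make the robots.txt check more concrete, here is a minimal sketch that uses Python's standard urllib.robotparser instead of scrapy itself. The robots.txt content, the `mybot` user agent, the example.com URLs, and the helper name `is_allowed` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves this at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "mybot", "http://example.com/private/page.html"))  # False
print(is_allowed(ROBOTS_TXT, "mybot", "http://example.com/public/page.html"))   # True
```

A disallowed URL like the first one is the kind of page a crawler skips while ROBOTSTXT_OBEY is True.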

Then create a new crawler .py file in the project's spiders directory and define a dedicated spider class in it. A few points to watch out for:

I recommend adding headers like the following to disguise the crawler as a browser, then passing them through the headers option when calling Request:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36',
}

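The main job when writing a crawler is to get the XPath of every piece of information you want to crawl.
For anyone without a background in this, the simplest way is to use Chrome's Inspect tool. Suppose we want to crawl the price of a product, as in the figure below: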

After copying the XPath of the price there, extract it in code like this:
```
def parse_dir_contents(self, response):
    item = HospitalItem()    
    item['name'] = response.xpath(
        '/html/body/div[5]/div[3]/div[1]/ul/li[1]/text()').extract()
    item['establish_date'] = response.xpath(
        '/html/body/div[5]/div[3]/div[1]/ul/li[2]/text()').extract()
    ...
```
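
A position-based XPath like the ones above walks down the tag tree and picks the n-th child at each step. The sketch below illustrates that idea with the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and is not the selector engine scrapy uses. The page fragment and the `first_text` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a real detail page.
PAGE = """
<html>
  <body>
    <div>
      <ul>
        <li>Example Hospital</li>
        <li>1998</li>
      </ul>
    </div>
  </body>
</html>
"""


def first_text(root, path):
    """Return the stripped text of the first node matching path, or None."""
    node = root.find(path)
    if node is None or node.text is None:
        return None
    return node.text.strip()


root = ET.fromstring(PAGE.strip())  # root is the <html> element
print(first_text(root, "./body/div/ul/li[1]"))  # Example Hospital
print(first_text(root, "./body/div/ul/li[2]"))  # 1998
print(first_text(root, "./body/div/ul/li[3]"))  # None
```

The third lookup returns None because there is no third list item, which is how a position-based path typically fails when the page layout changes.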
If the crawled text needs to be processed with a regular expression (for example to remove newline characters), you can do it in a way similar to this:
```
item['qq_acc'] = response.xpath(
        '//table[@class="link"]/tr[2]/td[2]/text()').re(r'\r\n\s*(.*)\r\n')
```
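
To see what that pattern captures, here is a small sketch that applies the same regular expression with the standard re module instead of a scrapy selector. The sample cell text and the `strip_padding` helper are assumptions for illustration.

```python
import re

# Same pattern as in the selector call above: skip the leading line break and
# indentation, capture the content, and stop at the next line break.
PATTERN = re.compile(r'\r\n\s*(.*)\r\n')


def strip_padding(raw):
    """Return every captured value found in raw as a list of strings."""
    return PATTERN.findall(raw)


raw_cell = "\r\n        12345678\r\n    "  # hypothetical text node from a table cell
print(strip_padding(raw_cell))  # ['12345678']
```

The result is a list, with one entry per match, so a cell with no surrounding line breaks yields an empty list.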

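To crawl multi-level information, for example when the top level is a directory and we need to enter every target page in it to get the details, we first crawl all target URLs from the directory page, then iterate over each URL and use a callback to call the function dedicated to crawling the details. A concrete example: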

def parse(self, response):

        hospitals = response.xpath(
            '//div[@class="seek_int_left"]/h3/a/@href').extract()  # crawl the URL of every item in the directory
        for hospital in hospitals:  # iterate over each URL
            url = str(hospital)
            yield Request(url, callback=self.parse_dir_contents)  # use the callback to call the function that crawls the second-level page

        urlhead = "http://yyk.qqyy.com/search_dplmhks0i"
        urltail = ".html"

        if self.i < 3018:  # iterate over each directory page
            real = urlhead + str(self.i) + urltail
            self.i = self.i + 1
            yield Request(real, headers=self.headers)
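
The URL handling in that spider can be sketched on its own, without scrapy. The snippet below builds the directory page URLs from the same head and tail strings and resolves links found on a page into absolute URLs with the standard urljoin. The starting page number, the hospital_*.html paths, and both helper names are assumptions for illustration.

```python
from urllib.parse import urljoin

URL_HEAD = "http://yyk.qqyy.com/search_dplmhks0i"
URL_TAIL = ".html"


def directory_page_urls(start, stop):
    """Yield directory page URLs for page numbers start up to stop - 1."""
    for page in range(start, stop):
        yield URL_HEAD + str(page) + URL_TAIL


def absolute_links(page_url, hrefs):
    """Resolve hrefs (relative or absolute) against the page they came from."""
    return [urljoin(page_url, href) for href in hrefs]


pages = list(directory_page_urls(1, 4))  # assumed to start at page 1
print(pages[0])    # http://yyk.qqyy.com/search_dplmhks0i1.html
print(len(pages))  # 3

# Hypothetical hrefs as they might be extracted from a directory page.
print(absolute_links(pages[0], ["/hospital_1.html", "http://yyk.qqyy.com/hospital_2.html"]))
```

Resolving links this way matters only if the extracted hrefs are relative; inside a spider, scrapy's response.urljoin serves the same purpose.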

Cleaning and Modifying

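Some quick notes on cleaning data with pandas:

Simple duplicate rows can be removed directly with drop_duplicates.
To strip characters such as tabs and newlines from the data, use replace, as follows: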
```
df = df.replace('\n','', regex=True)
```
If every exported value needs to be wrapped in double quotes, first convert all columns to the string type, then add the quoting parameter when exporting:
```
import csv
df.to_csv('test.csv',quoting=csv.QUOTE_NONNUMERIC)
```
I also recommend setting index to False when exporting with to_csv, so that the exported data does not include the column for pandas' own index.
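
The reason for converting columns to strings first can be seen with the standard csv module alone, since csv.QUOTE_NONNUMERIC is one of its constants. This is a minimal sketch rather than pandas code; the sample row and the `to_csv_text` helper are assumptions for illustration.

```python
import csv
import io


def to_csv_text(rows, quoting):
    """Write rows to an in-memory CSV and return the resulting text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=quoting, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


rows = [["Example Hospital", 1998]]  # hypothetical row: one string, one number
print(to_csv_text(rows, csv.QUOTE_NONNUMERIC))  # "Example Hospital",1998

# After converting every value to a string, every field gets quoted.
as_strings = [[str(value) for value in row] for row in rows]
print(to_csv_text(as_strings, csv.QUOTE_NONNUMERIC))  # "Example Hospital","1998"
```

With QUOTE_NONNUMERIC only non-numeric fields are quoted, so numeric columns stay bare unless they are converted to strings beforehand.
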
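For fuzzy string matching I used fuzzywuzzy from GitHub, which offers several matching modes. My impression from using it is that its support for Chinese characters is not good enough, because it relies purely on the Levenshtein Distance algorithm. (mingpipe, also on GitHub, is another tool dedicated to matching Chinese; if you need fuzzy matching for Chinese, you can use it instead.)
Here is a brief overview of how to use fuzzywuzzy: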
- Install it first
  pip3 install fuzzywuzzy[speedup]
- Then import it
```
# fuzz compares two strings with each other

from fuzzywuzzy import fuzz


# process compares one string against multiple other strings

from fuzzywuzzy import process
```
- It offers 4 different matching methods (a higher output score means a closer match; 100 means identical):
  - fuzz.ratio compares the whole strings, including word order
    fuzz.ratio("this is a test", "this is a fun") #output 74
  - fuzz.partial_ratio only compares subsections of the strings
    fuzz.partial_ratio("this is a test", "test a is this") #output 57
  - fuzz.token_sort_ratio ignores word order
    fuzz.token_sort_ratio("this is a test", "is this a test") #output 100
  - fuzz.token_set_ratio ignores duplicated words
    fuzz.token_set_ratio("this is a test", "this is is a test") #output 100
- For one-to-many comparison, use process.extract, for example:
```
choices = ['fuzzy fuzzy was a bear', 'is this a test', 'THIS IS A TEST']
process.extract("this is a test", choices, scorer=fuzz.ratio)
```
  The corresponding output is:
```
[('THIS IS A TEST', 100),
 ('is this a test', 86),
 ('fuzzy fuzzy was a bear', 33)]
```
- For data in a pandas df, we can do fuzzy matching in a way similar to the following. Suppose we want to find addresses in df that may be duplicates:
```
lookups_addr = df[df.addr.notnull()].addr
res = [(lookup_a,) + item for lookup_a in lookups_addr for item in process.extract(lookup_a, lookups_addr,limit=2)]
df1 = pd.DataFrame(res, columns=["lookup", "matched", "score", "name"])
df1[(df1.score <100) & (df1.score >90)] 
```
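
The scoring idea behind these calls can be sketched with the standard library only. The snippet below is not fuzzywuzzy: it uses difflib.SequenceMatcher as a stand-in scorer, so scores may differ from the library's in some cases. The function names and the sample addresses are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio(a, b):
    """Similarity score from 0 to 100 based on difflib's matching blocks."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Same score after lowercasing and sorting the words of each string."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return ratio(sorted_a, sorted_b)


def near_duplicates(values, low=90, high=100):
    """Return (first, second, score) for pairs scoring strictly between low and high."""
    pairs = []
    for first, second in combinations(values, 2):
        score = ratio(first, second)
        if low < score < high:
            pairs.append((first, second, score))
    return pairs


print(ratio("this is a test", "this is a fun"))              # 74
print(token_sort_ratio("this is a test", "is this a test"))  # 100

# Hypothetical addresses; the first two differ only in "Road" vs "Rd".
addresses = [
    "12 Zhongshan Road, Shanghai",
    "12 Zhongshan Rd, Shanghai",
    "88 Nanjing Road, Beijing",
]
print(near_duplicates(addresses))
```

Filtering for scores above 90 but below 100 mirrors the last line of the pandas example: exact matches are dropped and only suspiciously similar pairs remain.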

Visualization

I used ECharts for visualization, an open-source JavaScript library from Baidu.
It offers lots of good templates to choose from: http://echarts.baidu.com/examples/

Loading Data:
- 2 options:
  - Save large data as a JSON file, use jQuery to fetch it asynchronously, and then load the data in 'setOption' (you need to write a parser for this)
  - Provide the data directly in code as a var: good and easy for small data sets
- Loading animation:
  - If loading the data takes a long time, we can show a loading animation to let users know the data is loading

        myChart.showLoading();
        $.get('data.json').done(function(data){
           myChart.hideLoading();
           myChart.setOption(...);
        });
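
On the Python side, the cleaned data has to be written out as the JSON that the chart loads. Here is a minimal sketch, assuming each record carries a name, a longitude, a latitude, and a numeric value, and that the chart expects points shaped as a name plus a [lng, lat, value] array. The field names, the sample record, and the `to_echarts_points` helper are all hypothetical.

```python
import json


def to_echarts_points(records):
    """Convert records into [{'name': ..., 'value': [lng, lat, amount]}, ...]."""
    return [
        {"name": r["name"], "value": [r["lng"], r["lat"], r["amount"]]}
        for r in records
    ]


# Hypothetical cleaned record; real data would come from the crawl.
records = [
    {"name": "Example Hospital", "lng": 121.47, "lat": 31.23, "amount": 2500},
]

# ensure_ascii=False keeps Chinese names readable in the output file.
payload = json.dumps(to_echarts_points(records), ensure_ascii=False)
print(payload)
```

Writing `payload` to data.json gives the file that the `$.get('data.json')` call would fetch; if the points are already in this shape, the parser on the JavaScript side has less to do.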

Adding Map
- If we want to show data on a map, we need to include 'geo' or 'bmap' in 'setOption' and then set the relevant options

        geo: {
            map: 'china',
            label: {
                emphasis: {
                    show: false
                }
            },
            roam: false,
            itemStyle: {
                normal: {
                    areaColor: '#404448',
                    borderColor: '#111'
                },
                emphasis: {
                    areaColor: '#2a333d'
                }
            },
            silent: true, // do not respond to mouse clicks on the map

        }

Data Settings:
- We set data options in 'series'. To visualize data on a map, we usually choose the 'scatter' or 'effectScatter' type

        series : [
            {
                name: 'Top 50',
                type: 'effectScatter',  // here we choose 'effectScatter'
                coordinateSystem: 'geo', // either 'geo' or 'bmap', depending on what you specified above in setOption
                data: convertData(top50), // point data
                symbolSize: function (val) {
                    return Math.sqrt(val[2]) / 10;  // we can change the symbol size based on its value
                                                    // if the range is too large, we can take
                                                    // the square root or even the cube root to reduce it
                },
                showEffectOn: 'render',
                rippleEffect: {
                    brushType: 'stroke'
                },
                hoverAnimation: true,
                label: {
                    normal: {
                        formatter: '{b}',
                        position: 'right',
                        show: false
                    }
                },

                itemStyle: {
                    normal: {
                        color: '#891d14',
                        shadowBlur: 5,
                        shadowColor: '#333'
                    }
                },
                zlevel: 1
            },
        ]
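
The effect of the square root in symbolSize is easy to check with a few numbers. The sketch below redoes the same arithmetic in Python for a hypothetical set of values spanning four orders of magnitude; the `symbol_size` helper and its parameters are assumptions for illustration.

```python
def symbol_size(value, root=2, divisor=10):
    """Scale a raw value down to a marker size: its root-th root divided by divisor."""
    return value ** (1 / root) / divisor


values = [100, 10_000, 1_000_000]  # hypothetical raw values

# Square root: a 10,000x spread in values becomes a 100x spread in sizes.
print([round(symbol_size(v), 1) for v in values])          # [1.0, 10.0, 100.0]

# Cube root: the same values end up within roughly a 20x spread.
print([round(symbol_size(v, root=3), 1) for v in values])  # [0.5, 2.2, 10.0]
```

Which root and divisor to use depends on the value range of the data and on how large the markers may get on the map.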
