Outlier Handling

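I have not yet used outlier detection in a real project, so this is a summary to draw on in later projects.
1) Outlier detection based on Mahalanobis distance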

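Outlier detection based on Mahalanobis distance tests whole sample points. This differs from the box-plot-based detection below, which tests one attribute at a time.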

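Euclidean distance is the distance measure we use most often, but it treats the differences between an object's attributes (its indicators or variables) as equally important, which sometimes does not meet practical needs. Mahalanobis distance takes the relationships between attributes into account (for example, a piece of information about height also carries information about weight, because the two are correlated) and is scale-invariant, that is, independent of the measurement scale. For a multivariate vector x with mean μ and covariance matrix Σ, the Mahalanobis distance is sqrt( (x-μ)'Σ^(-1)(x-μ) ). When the covariance matrix is the identity matrix, the Mahalanobis distance is equivalent to the Euclidean distance.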
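The formula above can be sketched directly in NumPy. This is only an illustration under my own assumptions: the helper name `mahalanobis`, the test point and the covariance matrix are made up for the example and are not part of the height/weight data used below.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Return sqrt((x - mean)' cov^(-1) (x - mean)) for one sample x."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov * z = diff rather than forming the inverse explicitly.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(np.sqrt(diff @ z))

x = np.array([3.0, 4.0])   # hypothetical sample
mu = np.zeros(2)           # hypothetical mean

# With an identity covariance the result matches the Euclidean distance.
d_identity = mahalanobis(x, mu, np.eye(2))
d_euclid = float(np.linalg.norm(x - mu))

# Scale invariance: rescale the first attribute (e.g. cm -> m) and
# transform the mean and covariance in the same way.
cov = np.array([[4.0, 1.0], [1.0, 9.0]])   # hypothetical covariance
S = np.diag([0.01, 1.0])
d_before = mahalanobis(x, mu, cov)
d_after = mahalanobis(S @ x, S @ mu, S @ cov @ S.T)

print(d_identity, d_euclid)
print(d_before, d_after)
```

The two values on each printed line should agree up to floating-point error, which is the identity-matrix case and the scale invariance described above.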

import pandas as pd
import numpy as np
from numpy import float64
from sklearn import preprocessing
from matplotlib import pyplot as plt
from pandas import Series
from scipy.spatial import distance
from mpl_toolkits.mplot3d import Axes3D

Height_cm = np.array([164, 167, 168, 169, 169, 170, 170, 170, 171, 172, 172, 173, 173, 175, 176, 178], dtype=float64)
Weight_kg = np.array([54,  57,  58,  60,  61,  60,  61,  62,  62,  64,  62,  62,  64,  56,  66,  70], dtype=float64)
hw = {'Height_cm': Height_cm, 'Weight_kg': Weight_kg}
hw = pd.DataFrame(hw)
print(hw)

# Flag a point if either attribute lies more than 2 standard deviations from its mean.
is_height_outlier = abs(preprocessing.scale(hw['Height_cm'])) > 2
is_weight_outlier = abs(preprocessing.scale(hw['Weight_kg'])) > 2
is_outlier = is_height_outlier | is_weight_outlier
color = ['g', 'r']  # green = normal, red = outlier
# Convert each flag to int explicitly; indexing a list with a NumPy bool is not reliable.
cValue = [color[int(flag)] for flag in is_outlier]
fig = plt.figure()
plt.title('Scatter Plot')
plt.xlabel('Height_cm')
plt.ylabel('Weight_kg')
plt.scatter(hw['Height_cm'], hw['Weight_kg'], s=40, c=cValue)
plt.show()


# Flag the n_outliers points with the largest squared Mahalanobis distance from the mean.
n_outliers = 2
# DataFrame.as_matrix() was removed in pandas 1.0; use to_numpy() and invert with NumPy.
inv_cov = np.linalg.inv(hw.cov().to_numpy())
m_dist = Series([float(distance.mahalanobis(hw.iloc[i], hw.mean(), inv_cov) ** 2) for i in range(len(hw))])
m_dist_order = m_dist.sort_values(ascending=False).index.tolist()
is_outlier = [False] * len(hw)
for i in range(n_outliers):
    is_outlier[m_dist_order[i]] = True
color = ['g', 'r']  # green = normal, red = outlier
cValue = [color[int(flag)] for flag in is_outlier]
fig = plt.figure()
plt.title('Scatter Plot')
plt.xlabel('Height_cm')
plt.ylabel('Weight_kg')
plt.scatter(hw['Height_cm'], hw['Weight_kg'], s=40, c=cValue)
plt.show()

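2) Outlier detection based on standard deviation

Outlier detection based on standard deviation is also known as the 3σ rule. It can be applied to sample points or to a single attribute, and it generally suits data that follows a normal distribution: an outlier is defined as a value whose deviation from the mean exceeds 3 standard deviations.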

Under a normal distribution, P(|x-μ| > 3σ) <= 0.003, so if |x-μ| > 3σ then x is very unlikely to occur and can be treated as an outlier.
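As a minimal sketch of the 3σ rule for a single attribute, the snippet below flags values more than 3 sample standard deviations from the mean and also checks the tail probability quoted above. The function name `three_sigma_outliers` and the sample values are my own, chosen only for illustration.

```python
import math
import numpy as np

def three_sigma_outliers(values, k=3.0):
    """Return a boolean mask of values farther than k standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std(ddof=1)  # sample standard deviation
    return np.abs(values - mu) > k * sigma

# Hypothetical data: 20 values close to 50 plus one extreme value.
sample = np.array([48, 49, 50, 51, 52] * 4 + [80], dtype=float)
mask = three_sigma_outliers(sample)
print(sample[mask])

# Two-sided tail probability P(|Z| > 3) of a standard normal variable.
tail = math.erfc(3 / math.sqrt(2))
print(tail)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so on small samples a single outlier can hide itself by inflating σ.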

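3) Outlier detection based on box plots

A box plot is built from the median (50th percentile), the upper quartile (75th percentile), the lower quartile (25th percentile), the upper edge (100th percentile) and the lower edge (0th percentile). When outliers are drawn as separate points, the edges are instead the most extreme values that are not flagged as outliers.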

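Criterion for identifying outliers with a box plot: an outlier is defined as a value greater than QU + 1.5 IQR or less than QL − 1.5 IQR, where QU is the upper quartile, QL is the lower quartile, and IQR is the difference between QU and QL.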
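The criterion can be applied without drawing anything. The sketch below computes the two fences with `np.percentile`; the function name `iqr_outliers` and the play counts are hypothetical, and other quantile conventions may give slightly different fences.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values above QU + k*IQR or below QL - k*IQR."""
    values = np.asarray(values, dtype=float)
    ql, qu = np.percentile(values, [25, 75])  # lower and upper quartiles
    iqr = qu - ql
    lower, upper = ql - k * iqr, qu + k * iqr
    mask = (values < lower) | (values > upper)
    return mask, lower, upper

# Hypothetical daily play counts for one singer.
plays = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 45], dtype=float)
mask, lower, upper = iqr_outliers(plays)
print(lower, upper)
print(plays[mask])
```

Because the quartiles are barely affected by a few extreme values, this rule is less sensitive to the outliers themselves than a rule based on the mean and standard deviation.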

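You can also define other outlier tests yourself, for example a percentile rule that keeps the central 95% of the data and treats the top 2.5% and bottom 2.5% of values as outliers.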
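A percentile rule like this is a small variation on the same idea. The following is a rough sketch; the function name `percentile_outliers` and the 1 to 200 test data are assumptions made for the example.

```python
import numpy as np

def percentile_outliers(values, lower_pct=2.5, upper_pct=97.5):
    """Flag values below the lower percentile or above the upper percentile."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return (values < low) | (values > high)

# Hypothetical data: the integers 1..200.
values = np.arange(1, 201, dtype=float)
mask = percentile_outliers(values)
print(values[mask])
```

By construction this rule flags roughly 5% of any data set, whether or not those values are really unusual, so it works better as a trimming step than as evidence that a value is abnormal.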

Below is a Python example of detecting outliers with a box plot.

Dataset: http://download.csdn.net/detail/u010111016/9524285

Source code: http://blog.csdn.net/shuaishuai3409/article/details/51428106

import pandas as pd
import matplotlib.pyplot as plt

# Path to the play-count data (originally kept in a data directory one level above the code); adjust it to your own location.
number = 'D:/data/ather/all_musicers.xlsx'
data = pd.read_excel(number)

data1 = data.iloc[:, 0:10]  # play counts of 10 singers over 183 days
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
plt.figure(1, figsize=(13, 26))  # set the figure size
# Draw the box plot directly with the DataFrame method. Up to this point the code
# already shows a box plot with outliers; the rest only labels the outlier values.
p = data1.boxplot(return_type='dict')
# 'fliers' holds the outlier markers: p['fliers'][0] belongs to the 1st singer,
# and p['fliers'][j] to singer j+1. Only the first 4 singers are annotated here.
for j in range(0, 4):
    x = p['fliers'][j].get_xdata()
    y = p['fliers'][j].get_ydata()
    y.sort()  # sort in ascending order

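    for i in range(len(x)):
        # Shift each label horizontally by an amount that depends on the gap to the
        # previous outlier, so labels of nearby values overlap less. Equal values
        # fall back to the fixed offset to avoid dividing by zero.
        if i > 0 and y[i] != y[i-1]:
            plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i]+0.05 - 0.8/(y[i]-y[i-1]), y[i]))
        else:
            plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i]+0.08, y[i]))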

# Show the box plot
plt.show()