Data Cleaning: Outlier Handling

Outlier Handling

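  • Outliers are values that deviate from the normal range; they are not erroneous values
  • Outliers occur infrequently, but they can still bias the analysis in a real project
  • Outliers are usually identified with the box-plot method (interquartile range) or from the distribution (standard deviation)
  • To detect them, use either the range of the mean ± 2 standard deviations or the interquartile-range rule (below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR)
  • Outliers are typically handled by capping (replacing extreme values with a chosen percentile) or by discretizing the data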
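Discretization is mentioned above but not demonstrated in this post. The following is a minimal sketch of the idea, assuming fixed bin edges; the sample prices, the edges, and the labels are all hypothetical and not taken from MotorcycleData.csv.

```python
import pandas as pd

# Hypothetical prices; the last value is an extreme outlier.
prices = pd.Series([1200, 3500, 4800, 7900, 9500, 13000, 100000])

# Hypothetical bin edges: (0, 5000], (5000, 15000], (15000, inf].
bins = [0, 5000, 15000, float('inf')]
labels = ['low', 'mid', 'high']

# After binning, the outlier is just another member of the top bin,
# so its magnitude no longer distorts downstream statistics.
price_bin = pd.cut(prices, bins=bins, labels=labels)
print(price_bin.tolist())
```

Equal-frequency bins (pd.qcut) are another option when no natural cut points exist.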
import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python數據清洗實戰\\數據清洗之數據預處理'
os.chdir('D:\\Jupyter\\notebook\\Python數據清洗實戰\\數據')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')
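def f(x):
    # Strip a leading or trailing '$', remove thousands separators,
    # then convert to float. Missing values (NaN) stay NaN.
    x = str(x).strip('$').replace(',', '')
    return float(x)
# Convert the Price and Mileage columns to numeric values
df['Price'] = df['Price'].apply(f)
df['Mileage'] = df['Mileage'].apply(f)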
df.head(5)
Condition Condition_Desc Price Location Model_Year Mileage Exterior_Color Make Warranty Model ... Vehicle_Title OBO Feedback_Perc Watch_Count N_Reviews Seller_Status Vehicle_Tile Auction Buy_Now Bid_Count
0 Used mint!!! very low miles 11412.0 McHenry, Illinois, United States 2013.0 16000.0 Black Harley-Davidson Unspecified Touring ... NaN FALSE 8.1 NaN 2427 Private Seller Clear True FALSE 28.0
1 Used Perfect condition 17200.0 Fort Recovery, Ohio, United States 2016.0 60.0 Black Harley-Davidson Vehicle has an existing warranty Touring ... NaN FALSE 100 17 657 Private Seller Clear True TRUE 0.0
2 Used NaN 3872.0 Chicago, Illinois, United States 1970.0 25763.0 Silver/Blue BMW Vehicle does NOT have an existing warranty R-Series ... NaN FALSE 100 NaN 136 NaN Clear True FALSE 26.0
3 Used CLEAN TITLE READY TO RIDE HOME 6575.0 Green Bay, Wisconsin, United States 2009.0 33142.0 Red Harley-Davidson NaN Touring ... NaN FALSE 100 NaN 2920 Dealer Clear True FALSE 11.0
4 Used NaN 10000.0 West Bend, Wisconsin, United States 2012.0 17800.0 Blue Harley-Davidson NO WARRANTY Touring ... NaN FALSE 100 13 271 OWNER Clear True TRUE 0.0

5 rows × 22 columns

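# Handle outliers in Price
# Mean price
x_bar = df['Price'].mean()
# Standard deviation of price
x_std = df['Price'].std()
# Check for outliers above the upper bound (mean + 2 standard deviations)
any(df['Price'] > x_bar + 2 * x_std)
True
# Check for outliers below the lower bound (mean - 2 standard deviations)
any(df['Price'] < x_bar - 2 * x_std)
False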
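The any() checks above only report whether outliers exist. To see how many there are and which rows they are, a boolean mask can be used instead. This is a minimal sketch on hypothetical data, not on the motorcycle dataset; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: twenty ordinary prices plus one extreme value.
prices = pd.Series([5000.0] * 10 + [7000.0] * 10 + [100000.0])

mean, std = prices.mean(), prices.std()
upper = mean + 2 * std
lower = mean - 2 * std

# Boolean mask marking values outside mean ± 2 standard deviations.
is_outlier = (prices > upper) | (prices < lower)

print(is_outlier.sum())             # number of outliers
print(prices[is_outlier].tolist())  # the outlying values themselves
```

The same mask could be applied to df['Price'] with x_bar and x_std in place of mean and std.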
# Descriptive statistics
df['Price'].describe()
count      7493.000000
mean       9968.811557
std        8497.326850
min           0.000000
25%        4158.000000
50%        7995.000000
75%       13000.000000
max      100000.000000
Name: Price, dtype: float64
# 25th percentile (first quartile)
Q1 = df['Price'].quantile(q = 0.25)
# 75th percentile (third quartile)
Q3 = df['Price'].quantile(q = 0.75)
# Interquartile range
IQR = Q3 - Q1
any(df['Price'] > Q3 + 1.5 * IQR)
True
any(df['Price'] < Q1 - 1.5 * IQR)
False
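To make the two results above concrete, the fences can be worked out by hand from the quartiles reported by describe() (25% = 4158, 75% = 13000). This sketch only redoes that arithmetic; the variable names are illustrative.

```python
# Quartiles copied from the describe() output of df['Price'].
q1, q3 = 4158.0, 13000.0
iqr = q3 - q1                  # 8842.0

lower_fence = q1 - 1.5 * iqr   # -9105.0
upper_fence = q3 + 1.5 * iqr   # 26263.0
print(lower_fence, upper_fence)
```

The lower fence is negative while the minimum price is 0, so no value can fall below it, which matches the False result. The maximum price of 100000 is well above the upper fence, which matches the True result.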
import matplotlib.pyplot as plt
%matplotlib inline
df['Price'].plot(kind='box')
<matplotlib.axes._subplots.AxesSubplot at 0x11ddad20ac8>

[Figure: box plot of Price (output_21_1.png)]

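# Set the plot style (on matplotlib 3.6+ this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
# Draw a histogram
df.Price.plot(kind='hist', bins=30, density=True)
# Draw a kernel density estimate
df.Price.plot(kind='kde')
# Show the figure
plt.show()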

[Figure: histogram and kernel density estimate of Price (output_22_0.png)]

# Replace extreme values with the 99th and 1st percentiles
# Compute P1 and P99
P99 = df['Price'].quantile(q=0.99)
P1 = df['Price'].quantile(q=0.01)
P99
39995.32
df['Price_new'] = df['Price']
# Capping: values above P99 become P99, values below P1 become P1
df.loc[df['Price'] > P99, 'Price_new'] = P99
df.loc[df['Price'] < P1, 'Price_new'] = P1
df[['Price', 'Price_new']].describe()
               Price     Price_new
count    7493.000000   7493.000000
mean     9968.811557   9821.220873
std      8497.326850   7737.092537
min         0.000000    100.000000
25%      4158.000000   4158.000000
50%      7995.000000   7995.000000
75%     13000.000000  13000.000000
max    100000.000000  39995.320000
# df['Price_new'].plot(kind='box')
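The two .loc assignments used for capping can also be written as a single call to Series.clip. The sketch below uses hypothetical prices (0 to 100) rather than the motorcycle data and checks that both approaches give the same result; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical prices, 0.0 through 100.0.
prices = pd.Series(range(101), dtype=float)

p1 = prices.quantile(0.01)
p99 = prices.quantile(0.99)

# clip() raises values below p1 to p1 and lowers values above p99 to p99.
capped = prices.clip(lower=p1, upper=p99)

# The same capping written as boolean-mask assignments, for comparison.
manual = prices.copy()
manual[prices > p99] = p99
manual[prices < p1] = p1

print(capped.equals(manual))
```

On the real data this would correspond to df['Price'].clip(lower=P1, upper=P99).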