Data Cleaning: Handling Outliers

Original

2020-04-21 18:55

Handling outliers

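Outliers are values that deviate far from the normal range; they are not the same as erroneous values.
Outliers occur rarely, but they can still bias the analysis in a real project.
Outliers are usually identified with the box plot method (interquartile range) or the distribution plot method (standard deviation).
Detection can use a range of two standard deviations around the mean, or the interquartile range (the difference between the upper and lower quartiles).
Outliers are commonly treated by capping or by discretizing the data.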

import pandas as pd
import numpy as np
import os

os.getcwd()

'D:\\Jupyter\\notebook\\Python數據清洗實戰\\數據清洗之數據預處理'

os.chdir('D:\\Jupyter\\notebook\\Python數據清洗實戰\\數據')

df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')

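def f(x):
    # Convert a value such as '$11,412' to a float: drop the dollar sign
    # and the thousands separators, then parse what remains.
    # str(NaN) is 'nan', which float() parses back to NaN, so missing values survive.
    x = str(x).strip('$').replace(',', '')
    return float(x)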

df['Price'] = df['Price'].apply(f)

df['Mileage'] = df['Mileage'].apply(f)

df.head(5)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	11412.0	McHenry, Illinois, United States	2013.0	16000.0	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	17200.0	Fort Recovery, Ohio, United States	2016.0	60.0	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	3872.0	Chicago, Illinois, United States	1970.0	25763.0	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0
3	Used	CLEAN TITLE READY TO RIDE HOME	6575.0	Green Bay, Wisconsin, United States	2009.0	33142.0	Red	Harley-Davidson	NaN	Touring	...	NaN	FALSE	100	NaN	2920	Dealer	Clear	True	FALSE	11.0
4	Used	NaN	10000.0	West Bend, Wisconsin, United States	2012.0	17800.0	Blue	Harley-Davidson	NO WARRANTY	Touring	...	NaN	FALSE	100	13	271	OWNER	Clear	True	TRUE	0.0

5 rows × 22 columns

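# Handle outliers in the Price column
# Compute the mean price
x_bar = df['Price'].mean()

# Compute the standard deviation of the price
x_std = df['Price'].std()

# Check for outliers above the upper bound (mean + 2 standard deviations)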
any(df['Price'] > x_bar + 2 * x_std)

True

# Check for outliers below the lower bound (mean - 2 standard deviations)
any(df['Price'] < x_bar - 2 * x_std)

False

# Descriptive statistics
df['Price'].describe()

count      7493.000000
mean       9968.811557
std        8497.326850
min           0.000000
25%        4158.000000
50%        7995.000000
75%       13000.000000
max      100000.000000
Name: Price, dtype: float64

# 25th percentile (first quartile)
Q1 = df['Price'].quantile(q = 0.25)

# 75th percentile (third quartile)
Q3 = df['Price'].quantile(q = 0.75)

# Interquartile range (IQR)
IQR = Q3 - Q1

any(df['Price'] > Q3 + 1.5 * IQR)

True

any(df['Price'] < Q1 - 1.5 * IQR)

False
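The `any(...)` checks above only report whether outliers exist. A minimal sketch of how the same two rules could be turned into boolean masks, so the flagged rows can be counted and inspected, is shown here. The `prices` series is made-up toy data, not values from MotorcycleData.csv, and the variable names are illustrative.

```python
import pandas as pd

# Toy prices (hypothetical values, not taken from MotorcycleData.csv)
prices = pd.Series([4000.0, 5000.0, 6000.0, 7000.0,
                    8000.0, 9000.0, 25000.0, 100000.0])

# Standard-deviation rule: flag values outside mean +/- 2 * std
mean, std = prices.mean(), prices.std()
std_mask = (prices > mean + 2 * std) | (prices < mean - 2 * std)

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
iqr_mask = (prices > q3 + 1.5 * iqr) | (prices < q1 - 1.5 * iqr)

# Count the flagged values under each rule, then list the IQR outliers
print(std_mask.sum(), iqr_mask.sum())
print(prices[iqr_mask])
```

On this toy series the two rules disagree: the standard-deviation rule flags only 100000, while the IQR rule also flags 25000. The mean and standard deviation are themselves pulled upward by the extreme value, which is one reason the IQR rule tends to be the stricter of the two on skewed data.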

import matplotlib.pyplot as plt

%matplotlib inline

df['Price'].plot(kind='box')

<matplotlib.axes._subplots.AxesSubplot at 0x11ddad20ac8>

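# Set the plotting style
# (Matplotlib 3.6 and later name this style 'seaborn-v0_8')
plt.style.use('seaborn')
# Plot the histogram
df.Price.plot(kind='hist', bins=30, density=True)
# Plot the kernel density estimate
df.Price.plot(kind='kde')
# Show the figure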
plt.show()

# Replace extreme values with the 99th and 1st percentiles
# Compute P1 and P99
P99 = df['Price'].quantile(q=0.99)
P1 = df['Price'].quantile(q=0.01)

P99

39995.32

df['Price_new'] = df['Price']

# Capping: values above P99 become P99, values below P1 become P1
df.loc[df['Price'] > P99, 'Price_new'] = P99
df.loc[df['Price'] < P1, 'Price_new'] = P1

df[['Price', 'Price_new']].describe()

	Price	Price_new
count	7493.000000	7493.000000
mean	9968.811557	9821.220873
std	8497.326850	7737.092537
min	0.000000	100.000000
25%	4158.000000	4158.000000
50%	7995.000000	7995.000000
75%	13000.000000	13000.000000
max	100000.000000	39995.320000
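The capping step can also be written in a single call with `Series.clip`, which bounds every value to a lower and an upper limit. The sketch below compares the two approaches on a small made-up series (the values and variable names are illustrative, not taken from the dataset).

```python
import pandas as pd

# Toy prices (hypothetical values for illustration)
prices = pd.Series([0.0, 150.0, 4158.0, 7995.0, 13000.0, 100000.0])

p1 = prices.quantile(0.01)
p99 = prices.quantile(0.99)

# Capping with two .loc assignments, as in the walkthrough
capped_loc = prices.copy()
capped_loc.loc[prices > p99] = p99
capped_loc.loc[prices < p1] = p1

# The same capping in one call
capped_clip = prices.clip(lower=p1, upper=p99)

# Both approaches produce identical results
print(capped_loc.equals(capped_clip))
```

Applied to the walkthrough, the equivalent would be `df['Price'].clip(lower=P1, upper=P99)`, assuming `P1` and `P99` have been computed as shown earlier.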

# df['Price_new'].plot(kind='box')
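The introduction names discretization as the other common treatment, but the walkthrough only demonstrates capping. One possible way to discretize is equal-frequency binning with `pd.qcut`, sketched below on made-up prices. The number of bins and the label names are assumptions chosen for illustration.

```python
import pandas as pd

# Toy prices (hypothetical values for illustration)
prices = pd.Series([0.0, 2500.0, 4158.0, 7995.0,
                    9000.0, 13000.0, 40000.0, 100000.0])

# Equal-frequency binning: each of the 4 bins receives the same number of rows
labels = ['low', 'mid_low', 'mid_high', 'high']
price_bin = pd.qcut(prices, q=4, labels=labels)

# Show how many rows fall into each bin, in bin order
print(price_bin.value_counts(sort=False))
```

After binning, 40000 and 100000 both fall into the `high` category, so the extreme value no longer dominates statistics computed on the column. The cost is that the exact magnitudes are lost.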
