Data Cleaning: Data Discretization

Original

若尘

2020-04-21 18:55

Data Discretization

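Data discretization means binning: mapping continuous values onto a small number of intervals.
The two most common approaches are equal-frequency binning and equal-width binning.
In pandas these are usually done with pd.qcut (equal frequency) and pd.cut (equal width or custom edges).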
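To make the difference between the two approaches concrete, here is a minimal sketch on made-up, right-skewed data (the values are hypothetical and not taken from MotorcycleData.csv). Equal-width binning splits the value range evenly, while equal-frequency binning splits the rows evenly.

```python
import pandas as pd

# Hypothetical right-skewed sample: nine small values and one large outlier
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal-width: the range from 1 to 100 is split into 2 intervals of equal length
equal_width = pd.cut(prices, bins=2, labels=False)

# Equal-frequency: the cut point is the median, so each bin gets half the rows
equal_freq = pd.qcut(prices, q=2, labels=False)

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()

print(width_counts)  # [9, 1]
print(freq_counts)   # [5, 5]
```

With skewed data, equal-width bins tend to put almost every row in the first bin, which is the pattern the Price column shows later on.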

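pandas.cut(x, bins, right=True, labels=None)

x: the data to bin (a 1-D array or Series)
bins: the number of equal-width bins, or a sequence of bin edges
labels: the labels assigned to the resulting bins
right: whether each interval includes its right edge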
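The effect of right and labels is easiest to see on values that sit exactly on the bin edges. The sketch below uses made-up values and label names ('low', 'mid', 'high') purely for illustration.

```python
import pandas as pd

values = pd.Series([10, 20, 30])
edges = [0, 10, 20, 30]
names = ['low', 'mid', 'high']  # hypothetical label names

# right=True (the default): intervals are (0, 10], (10, 20], (20, 30]
right_closed = pd.cut(values, bins=edges, labels=names)

# right=False: intervals are [0, 10), [10, 20), [20, 30)
# 30 now lies outside every interval and becomes NaN
left_closed = pd.cut(values, bins=edges, right=False, labels=names)

print(right_closed.tolist())
print(left_closed.tolist())
```

Each value moves up one bin when right=False, because an edge value then belongs to the interval that starts at it rather than the one that ends at it.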

import pandas as pd
import numpy as np
import os

os.getcwd()

'D:\\Jupyter\\notebook\\Python數據清洗實戰\\數據'

os.chdir('D:\\Jupyter\\notebook\\Python數據清洗實戰\\數據')

df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')

def f(x):
    # Remove the currency symbol and thousands separators, then convert to float.
    # Missing values pass through unchanged because float('nan') is valid.
    return float(str(x).strip('$').replace(',', ''))

df['Price'] = df['Price'].apply(f)

df['Mileage'] = df['Mileage'].apply(f)
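The helper f above cleans one value at a time through apply. As a rough alternative sketch, assuming the column holds strings such as '$11,412', the same cleanup can be written with vectorized string methods. The series raw below is made-up sample data, not rows from MotorcycleData.csv. One difference to keep in mind: str.replace removes '$' anywhere in the string, whereas strip('$') only removes it at the ends.

```python
import pandas as pd

# Hypothetical raw values, including one missing entry
raw = pd.Series(['$11,412', '17200', None])

# Drop '$' and ',' as literal characters (regex=False), then parse as numbers
cleaned = pd.to_numeric(
    raw.str.replace('$', '', regex=False).str.replace(',', '', regex=False)
)

print(cleaned)
```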

df.head(5)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	11412.0	McHenry, Illinois, United States	2013.0	16000.0	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	17200.0	Fort Recovery, Ohio, United States	2016.0	60.0	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	3872.0	Chicago, Illinois, United States	1970.0	25763.0	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0
3	Used	CLEAN TITLE READY TO RIDE HOME	6575.0	Green Bay, Wisconsin, United States	2009.0	33142.0	Red	Harley-Davidson	NaN	Touring	...	NaN	FALSE	100	NaN	2920	Dealer	Clear	True	FALSE	11.0
4	Used	NaN	10000.0	West Bend, Wisconsin, United States	2012.0	17800.0	Blue	Harley-Davidson	NO WARRANTY	Touring	...	NaN	FALSE	100	13	271	OWNER	Clear	True	TRUE	0.0

5 rows × 22 columns

df['Price_bin'] = pd.cut(df['Price'], 5, labels=range(5))

# Count how many rows fall into each bin
df['Price_bin'].value_counts()

0    6762
1     659
2      50
3      20
4       2
Name: Price_bin, dtype: int64
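The counts above are very uneven because equal-width bins depend only on the minimum and maximum, not on where most of the data lies. A small sketch, using made-up prices, shows how to inspect the edges pd.cut chose by passing retbins=True.

```python
import pandas as pd

# Hypothetical prices: most are small, one is very large
prices = pd.Series([0, 2500, 5000, 7500, 100000])

# retbins=True also returns the array of bin edges
codes, edges = pd.cut(prices, bins=5, labels=range(5), retbins=True)

print(edges)           # 6 edges, each bin about 20000 wide
print(codes.tolist())  # [0, 0, 0, 0, 4]
```

pandas pushes the lowest edge slightly below the minimum so that the minimum itself lands in the first bin.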

%matplotlib inline

df['Price_bin'].value_counts().plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x1b35fba9048>

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35f681278>

w = [100, 1000, 5000, 10000, 20000, 100000]

df['Price_bin'] = pd.cut(df['Price'], bins=w, labels=range(5))

df[['Price', 'Price_bin']].head(5)

	Price	Price_bin
0	11412.0	3
1	17200.0	3
2	3872.0	1
3	6575.0	2
4	10000.0	2
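With hand-written edges, any value outside the covered range gets no bin at all. The sketch below reuses the edge list w from above on a few made-up prices to show which values end up as NaN; it may be worth checking for this after binning real data.

```python
import pandas as pd

w = [100, 1000, 5000, 10000, 20000, 100000]

# Hypothetical prices chosen to hit the boundaries
prices = pd.Series([0.0, 100.0, 3872.0, 100000.0, 150000.0])

binned = pd.cut(prices, bins=w, labels=range(5))

# 0.0 is below the first edge, 100.0 sits on the excluded left edge,
# and 150000.0 is above the last edge, so all three become NaN
n_unbinned = int(binned.isna().sum())

print(binned.tolist())
print(n_unbinned)  # 3
```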

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35fb99898>

# Quantile levels: k + 1 evenly spaced values from 0 to 1
k = 5
w = [1.0 * i/k for i in range(k+1)]
w

[0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
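A list of evenly spaced quantile levels like w can be passed to pd.qcut as q. As a minimal sketch on made-up data (the integers 1 to 100), passing the integer k instead appears to give the same assignment, and either way each bin receives the same number of rows.

```python
import pandas as pd

# Hypothetical data: the integers 1..100
prices = pd.Series(range(1, 101))

k = 5
w = [i / k for i in range(k + 1)]

by_levels = pd.qcut(prices, q=w, labels=range(k))  # explicit quantile levels
by_count = pd.qcut(prices, q=k, labels=range(k))   # just the number of bins

counts = by_levels.value_counts().sort_index().tolist()
same_assignment = by_levels.tolist() == by_count.tolist()

print(counts)           # [20, 20, 20, 20, 20]
print(same_assignment)  # True
```

The explicit list is mainly useful when the levels are not evenly spaced, for example [0, 0.5, 0.9, 1.0].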

# Equal-frequency binning into 5 segments
df['Price_bin'] = pd.qcut(df['Price'], q=w, labels=range(5))

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35fe2a080>

# Compute the quantile cut points
k = 5
w1 = df['Price'].quantile([1.0 * i/k for i in range(k+1)])

w1

0.0         0.0
0.2      3500.0
0.4      6491.0
0.6      9777.0
0.8     14999.0
1.0    100000.0
Name: Price, dtype: float64

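# The first cut point should usually sit slightly below the actual minimum
# and the last slightly above the actual maximum, so the extremes fall inside the bins.
# Note: the minimum here is 0.0, so scaling it by 0.95 leaves it at 0.0, and rows
# priced at exactly 0 still fall outside the first bin (0.0, 3500.0].
w1[0.0] = w1[0.0] * 0.95
w1[1.0] = w1[1.0] * 1.1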

w1

0.0         0.0
0.2      3500.0
0.4      6491.0
0.6      9777.0
0.8     14999.0
1.0    110000.0
Name: Price, dtype: float64
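The lowest cut point is still 0.0 after the adjustment, and pd.cut excludes the left edge of the first interval by default. A possible alternative, sketched here on made-up prices and edges, is the include_lowest=True parameter of pd.cut, which closes the first interval on the left.

```python
import pandas as pd

# Hypothetical prices whose minimum equals the lowest edge
prices = pd.Series([0.0, 3500.0, 9777.0, 100000.0])
edges = [0.0, 5000.0, 100000.0]

# Default: the first interval is (0.0, 5000.0], so 0.0 gets no bin
default = pd.cut(prices, bins=edges, labels=range(2))

# include_lowest=True: the first interval also contains 0.0
inclusive = pd.cut(prices, bins=edges, labels=range(2), include_lowest=True)

missing_default = int(default.isna().sum())
missing_inclusive = int(inclusive.isna().sum())

print(missing_default)    # 1
print(missing_inclusive)  # 0
```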

# Bin the prices using the adjusted cut points
df['Price_bin'] = pd.cut(df['Price'], bins=w1, labels=range(5))

df['Price_bin'].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x1b35e53fa20>
