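First, let me introduce a data preprocessing tool, onehot [one-hot encoding]:
One-hot encoding, also called one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is set at any time. Take handwritten digit recognition as an example: we need to convert the ten digit labels 0-9 into onehot labels. For instance, the digit label "6" becomes the onehot label [0,0,0,0,0,0,1,0,0,0].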
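As a quick illustration of the digit example, here is a minimal sketch using NumPy. The helper name `to_onehot` is my own for illustration and is not part of any library.

```python
import numpy as np

def to_onehot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=int)[labels]

print(to_onehot([6], 10)[0].tolist())  # [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```

Indexing the identity matrix works because its row i already has a 1 in position i and 0 everywhere else.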
My first attempt used the following function:
import pandas as pd

def convert2onehot(data):
    # convert data to its onehot representation
    return pd.get_dummies(data, prefix=data.columns)
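It raised an error:
ValueError: Length of 'prefix' (15) did not match the length of the columns being encoded (9).
The message points at the cause: prefix=data.columns supplies 15 prefixes, one per column, but get_dummies only encodes the 9 string-typed columns, so the two lengths do not match. I decided to look for another approach.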
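To make the length mismatch concrete, here is a small sketch on a made-up three-column frame (the values are invented for illustration, not taken from adult.csv). It reproduces the same kind of ValueError and then shows one way around it, which is simply to leave `prefix` unset.

```python
import pandas as pd

# Toy frame (made-up values): one numeric column, two string columns.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "sex": ["Male", "Male", "Female"],
})

# One prefix per column (3) versus only 2 columns that get encoded.
try:
    pd.get_dummies(df, prefix=df.columns)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With prefix unset, pandas names each indicator after its source column.
encoded = pd.get_dummies(df)
print(list(encoded.columns))
```

Numeric columns such as `age` pass through unchanged, which is why they do not count toward "the columns being encoded".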
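Many people have discussed this problem on stackoverflow and in sklearn's github issues. sklearn versions before 0.20 did not support string categorical variables in OneHotEncoder (0.20 and later accept strings directly), so the usual approach was a roundabout one:
- Method 1: first use LabelEncoder() to convert the strings into consecutive integer codes, then binarize them with OneHotEncoder();
- Method 2: binarize directly with LabelBinarizer().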
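The two methods can be sketched side by side as follows. The sample values are made up, and I call `.toarray()` on the default sparse result so the snippet does not depend on the name of the dense-output parameter, which has changed across sklearn versions.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

# Made-up sample of a string-valued categorical column.
workclass = np.array(["Private", "State-gov", "Private", "Self-emp"])

# Method 1: strings -> integer codes -> one indicator column per code.
codes = LabelEncoder().fit_transform(workclass)  # shape (4,)
onehot_1 = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()

# Method 2: strings -> indicator columns in a single step.
onehot_2 = LabelBinarizer().fit_transform(workclass)

print(onehot_1.shape, onehot_2.shape)
print(np.array_equal(onehot_1, onehot_2))
```

Both routes sort the class names before assigning columns, so on this sample they should produce the same matrix.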
import pandas as pd
# import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
col_names = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
"relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
data = pd.read_csv("D://ML//Data//adult.csv", names=col_names)
# Use the encoders to convert text into numbers that tensorflow can process
# OneHotEncoder(sparse = False).fit_transform( data[['age','education-num']] )
data['age']=LabelBinarizer().fit_transform(data['age'])
data['workclass']=LabelBinarizer().fit_transform(data['workclass'])
data['fnlwgt']=data['fnlwgt']/100
# Note: sklearn 1.2 renamed `sparse` to `sparse_output`, and `sparse` was removed in 1.4
data['fnlwgt']=OneHotEncoder(sparse = False).fit_transform( data[['fnlwgt']] )
data['education']=LabelBinarizer().fit_transform(data['education'])
data['education-num']=LabelBinarizer().fit_transform(data['education-num'])
data['marital-status']=LabelBinarizer().fit_transform(data['marital-status'])
data['occupation']=LabelBinarizer().fit_transform(data['occupation'])
data['relationship']=LabelBinarizer().fit_transform(data['relationship'])
data['race']=LabelBinarizer().fit_transform(data['race'])
data['sex']=LabelBinarizer().fit_transform(data['sex'])
data['capital-gain']=LabelBinarizer().fit_transform(data['capital-gain'])
data['capital-loss']=LabelBinarizer().fit_transform(data['capital-loss'])
data['hours-per-week']=LabelBinarizer().fit_transform(data['hours-per-week'])
data['native-country']=LabelBinarizer().fit_transform(data['native-country'])
data['result']=LabelBinarizer().fit_transform(data['result'])
print(data[:10])
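The printed result is shown below. Note that every column holds a single 0/1 value rather than a full onehot vector: for a column with more than two classes, the encoder returns one indicator column per class, and a single DataFrame column cannot store all of them.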
   age  workclass  fnlwgt  education  education-num  marital-status  \
0    0          0     0.0          0              0               0
1    0          0     0.0          0              0               0
2    0          0     0.0          0              0               1
3    0          0     0.0          0              0               0
4    0          0     0.0          0              0               0
5    0          0     0.0          0              0               0
6    0          0     0.0          0              0               0
7    0          0     0.0          0              0               0
8    0          0     0.0          0              0               0
9    0          0     0.0          0              0               0

   occupation  relationship  race  sex  capital-gain  capital-loss  \
0           0             0     0    1             0             1
1           0             1     0    1             1             1
2           0             0     0    1             1             1
3           0             1     0    1             1             1
4           0             0     0    0             1             1
5           0             0     0    0             1             1
6           0             0     0    0             1             1
7           0             1     0    1             1             1
8           0             0     0    0             0             1
9           0             1     0    1             0             1

   hours-per-week  native-country  result
0               0               0       0
1               0               0       0
2               0               0       0
3               0               0       0
4               0               0       0
5               0               0       0
6               0               0       0
7               0               0       1
8               0               0       1
9               0               0       1
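If the goal is to keep every indicator column that LabelBinarizer produces, one option is to expand them into separately named columns instead of assigning them back to the original one. The sketch below is only an illustration under my own assumptions: the rows are made up, and the `race_` naming scheme is a choice of mine, not something sklearn does for you.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Made-up rows reusing two column names from the adult data.
data = pd.DataFrame({
    "race": ["White", "Black", "White", "Asian-Pac-Islander"],
    "hours-per-week": [40, 13, 40, 45],
})

binarizer = LabelBinarizer()
indicators = binarizer.fit_transform(data["race"])  # shape (4, 3): one column per class

# Name each indicator column after its class, then swap it in for the original column.
indicator_frame = pd.DataFrame(
    indicators,
    columns=[f"race_{cls}" for cls in binarizer.classes_],
    index=data.index,
)
expanded = pd.concat([data.drop(columns="race"), indicator_frame], axis=1)
print(expanded)
```

One caveat: for a column with exactly two classes, LabelBinarizer returns a single 0/1 column, so the naming step above would need adjusting in that case.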