Continuing from the previous post, Machine Learning Example (1):
Our next task is to sort out the characteristics of each feature in the dataset and propose an appropriate treatment for each. Machine learning algorithms work with numeric values, but part of this dataset is text, so the different text fields need different handling. That is the next step:
1. Feature types
age: continuous numeric; a possible treatment is to bucket it into age ranges;
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. Employer type, multi-category; a common treatment is to map the categories to numbers, e.g. the eight values above could be coded 1-8 (for illustration only; this post does not recommend it);
fnlwgt: continuous numeric; the number of people the census takers believe the observation represents. This variable is not used in this post, as I consider the feature unimportant.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. Education level, multi-category; handled the same way as workclass;
education-num: continuous numeric, years of education; generally, the larger this value, the higher the income;
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. Marital status, multi-category;
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. Occupation, multi-category;
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. Household relationship, multi-category;
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. Race, multi-category; although the US opposes racial discrimination, in practice this feature turns out to matter quite a bit when distinguishing incomes;
sex: Female, Male. Gender, the simplest binary encoding (0 & 1);
capital-gain: capital gains, continuous numeric;
capital-loss: capital losses, continuous numeric;
hours-per-week: working hours, continuous numeric;
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. Country of origin, multi-category;
result: the label, ">50K" or "<=50K", binary, and the prediction target of this exercise (0 & 1);
2. Feature processing:
As we can see, the features fall into just three kinds:
1. continuous numeric features, such as age, the easiest to handle;
2. binary text features, handled with a 0/1 encoding;
3. multi-category text features.
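A minimal sketch of one way to treat each of the three kinds with pandas (the tiny frame and its values below are invented for illustration, not the actual dataset):

```python
import pandas as pd

# One toy feature of each kind described above
df = pd.DataFrame({
    "age": [22, 45, 67],                                # continuous numeric
    "sex": ["Male", "Female", "Male"],                  # binary text
    "workclass": ["Private", "State-gov", "Private"],   # multi-category text
})

# 1. Continuous: optionally bucket into ranges
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 100],
                         labels=["young", "mid", "senior"])

# 2. Binary text: map to 0/1
df["sex_code"] = df["sex"].map({"Female": 0, "Male": 1})

# 3. Multi-category text: one-hot encode, which avoids implying
#    an artificial order the way a 1-8 integer coding would
df = pd.concat([df, pd.get_dummies(df["workclass"], prefix="workclass")], axis=1)
```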
Now let's process the features:
First, check the dataset for missing values:
adult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education-num 32561 non-null int64
marital-status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital-gain 32561 non-null int64
capital-loss 32561 non-null int64
hours-per-week 32561 non-null int64
native-country 32561 non-null object
result 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
Or, a second way:
adult.isnull().any()
age False
workclass False
fnlwgt False
education False
education-num False
marital-status False
occupation False
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country False
result False
dtype: bool
But what is the actual situation?
Open the actual csv file and it is easy to see that the "?" entries (highlighted in red in the screenshot) are effectively invalid: not technically missing (NaN), but useless, and something we want to remove;
Here is a small trick:
First look at
adult.shape
(32561, 15)
So there are currently 32561 rows of data. Using a regular expression, we replace "?" with NaN:
import numpy as np
adult_clean = adult.replace(regex=[r'\?|\.|\$'], value=np.nan)
This replaces the three symbols ? . $ with NaN; adapt the pattern to your own data and needs.
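The effect of `replace` with `regex` can be verified on a toy frame (values invented for illustration); because the replacement value is not a string, pandas replaces the whole matching cell:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the '?' placeholders found in the csv
toy = pd.DataFrame({"workclass": ["Private", "?", "State-gov"],
                    "occupation": ["Sales", "Tech-support", "?"]})

# Same call as above: cells matching ? . or $ become NaN
toy_clean = toy.replace(regex=[r'\?|\.|\$'], value=np.nan)
```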
Now check again:
adult_clean.isnull().any()
age False
workclass True
fnlwgt False
education False
education-num False
marital-status False
occupation True
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country True
result False
dtype: bool
Next we drop every row that contains a missing value. For some datasets one would fill empty cells with the mean instead, but since this dataset is used to predict income I simply drop the rows; this does not hurt the result, whereas filling with means would actually degrade the quality of the training set.
adult=adult_clean.dropna(how='any')
# drop every row that contains any NaN
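What `how='any'` means can be seen on a tiny invented frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, np.nan], "y": [4, np.nan, 6]})
# how='any' drops a row if ANY of its cells is NaN;
# how='all' would drop only rows whose cells are ALL NaN
cleaned = toy.dropna(how='any')
```

Only the first row, which has no NaN, survives.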
Check again:
adult.shape
(30162, 15)
About 2,400 rows had a "?" in some feature. With that, the missing-data cleanup is done;
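The drop count follows directly from the two shapes reported above:

```python
rows_before = 32561   # adult.shape before cleaning
rows_after = 30162    # adult.shape after dropna
dropped = rows_before - rows_after
print(dropped, round(dropped / rows_before * 100, 1))  # 2399 rows, about 7.4%
```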
Since we want to predict a label and the given data is labeled, we use supervised learning;
First, separate the training and test sets;
Start by dropping the "most useless" feature:
adult=adult.drop(['fnlwgt'],axis=1)
adult.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age 30162 non-null int64
workclass 30162 non-null object
education 30162 non-null object
education-num 30162 non-null int64
marital-status 30162 non-null object
occupation 30162 non-null object
relationship 30162 non-null object
race 30162 non-null object
sex 30162 non-null object
capital-gain 30162 non-null int64
capital-loss 30162 non-null int64
hours-per-week 30162 non-null int64
native-country 30162 non-null object
result 30162 non-null object
dtypes: int64(5), object(9)
memory usage: 4.7+ MB
Import the function needed for splitting into training and test sets:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation",
"relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
Split the data, 75% training and 25% test; random_state can be any number;
X_train,X_test,y_train,y_test=train_test_split(adult[col_names[1:13]],adult[col_names[13]],test_size=0.25,random_state=33)
# X_train,X_test,y_train,y_test=train_test_split(adult[1:13],adult[13],test_size=0.25,random_state=33)
Many people would write the second, commented-out form, but it is wrong; try it and see;
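The reason the commented-out line fails: with a plain integer slice, pandas selects rows, not columns; selecting columns needs a list of names. A tiny invented frame makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame({"a": range(20), "b": range(20), "c": range(20)})

rows = df[1:13]        # integer slice -> rows 1..12, ALL columns
cols = df[["a", "b"]]  # list of names -> the named columns, ALL rows
```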
Check the shapes of the training and test sets:
print(X_train.shape)
print(X_test.shape)
(22621, 12)
(7541, 12)
The test set is split the same way; concretely:
X_train.head()
Partial output:
workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
20607 Private Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 50 United-States
31257 Private HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 50 United-States
31892 Private HS-grad 9 Never-married Adm-clerical Not-in-family White Female 0 0 45 United-States
20220 Private HS-grad 9 Divorced Machine-op-inspct Unmarried Black Female 0 0 40 United-States
24044 Private Some-college 10 Divorced Sales Not-in-family White Female 0 0 45 United-States
y_train.head()
20607 >50K
31257 <=50K
31892 <=50K
20220 <=50K
24044 >50K
Name: result, dtype: object
That completes the train/test split. So can we unleash the random forest right away? Not so fast: the most important features are still unprocessed. How to handle them will be revealed next time;
Part three, the finale, is coming!