Machine Learning Example (2)

Continuing from the previous installment, Machine Learning Example (1):
The next task is to sort out the characteristics of each feature in the dataset and work out an appropriate way to handle each of them:

Machine learning algorithms work on numeric data, but part of this dataset is text, so the different text fields need to be handled in different ways. That brings us to the next step:
1. Feature categories

age: continuous numeric variable; a possible treatment is to bin it into age groups;
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. Employer type, multi-category; the usual treatment is to map the categories to numbers, e.g. the eight values above could be coded 1-8 (just an example, and not what this article recommends; see the sketch after this list);
fnlwgt: continuous numeric variable; the number of people the census takers believe the observation represents. This feature is not used in this article, since I do not consider it important.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. Education level, multi-category data, handled the same way as workclass;
education-num: continuous numeric variable, years of education; generally, the larger this value, the higher the income;
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. Marital status, multi-category data;
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. Occupation, multi-category data;
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. Household relationship, multi-category data;
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. Race, multi-category data; although the United States opposes racial discrimination, in practice this feature matters quite a bit when it comes to distinguishing income levels;
sex: Female, Male. Gender, the simplest binary split (0 & 1);
capital-gain: capital gains, continuous numeric;
capital-loss: capital losses, continuous numeric;
hours-per-week: weekly working hours, continuous numeric;
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. Native country, multi-category data;
result: ">50K" or "<=50K", a binary label, and the target of the machine learning in this article (0 & 1);
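
As a rough illustration of the generic "map categories to numbers" idea mentioned under workclass and sex above, here is a minimal pandas sketch. This is illustration only: the column names come from the dataset, but the toy values and the encoding shown are not what this series ultimately uses.

import pandas as pd

# toy example: turn the multi-category workclass column into integer codes,
# and the binary sex column into 0/1 (illustration only)
demo = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Self-emp-inc", "Private"],
    "sex": ["Male", "Female", "Female", "Male"],
})
demo["workclass_code"] = demo["workclass"].astype("category").cat.codes  # 0 .. k-1, one code per category
demo["sex_code"] = (demo["sex"] == "Male").astype(int)                   # simple 0/1 split
print(demo)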

2. Feature processing:
As you can see, the features fall into just three types:
1. Continuous numeric features, such as age, which are the easiest to handle;
2. Binary text features, handled with a simple 0/1 encoding;
3. Multi-category text features, which still need a proper encoding (dealt with later in this series).
Now on to the actual feature processing:
First, check the dataset for missing values:

adult.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
result            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB

Or, a second method:

adult.isnull().any()

age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
result            False
dtype: bool

But what does the data actually look like?
(Screenshot of the raw CSV file, with the "?" cells highlighted in red.)
If you open the actual CSV file, it is easy to see that the cells marked "?" (the red cells in the screenshot) are effectively invalid characters: they are not missing values (NaN), but they are useless all the same, and we need a way to get rid of them;
Here is a small trick:
First take a look:

adult.shape

(32561, 15)

As you can see, there are currently 32561 rows of data. With the help of a regular expression, we replace "?" with NaN:

import numpy as np  # for np.nan (skip if already imported in part (1))

adult_clean=adult.replace(regex=[r'\?|\.|\$'],value=np.nan)
#replace the three symbols ? . $ with NaN; adjust the pattern to match your own data
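
Before re-running isnull().any(), you can also count exactly how many cells were turned into NaN in each column. This is just a supplementary check, not a step from the original workflow:

adult_clean.isnull().sum()
# per-column count of NaN cells; only workclass, occupation and native-country should be non-zero here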

Now check again:

adult_clean.isnull().any()

age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
result            False
dtype: bool

Then we drop every row that contains a missing value. For some datasets it makes sense to fill missing values instead (for numeric columns, e.g. with the mean), but here the affected columns are categorical and the goal is to predict income, so I simply drop these rows; this does not materially affect the result, whereas imputing values here would only add noise to the training data;

adult=adult_clean.dropna(how='any')
#drop every row that contains a NaN
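
For reference, the fill-in alternative discussed above would look roughly like the following. Since the affected columns are categorical, it would mean filling with each column's most frequent value rather than a mean. This is only a sketch of the option we decided against (the adult_filled name is purely illustrative) and it is not used anywhere in this article:

# alternative (NOT used here): fill each categorical gap with the column's most frequent value
adult_filled = adult_clean.fillna({
    col: adult_clean[col].mode()[0]
    for col in ['workclass', 'occupation', 'native-country']
})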

Check again:

adult.shape

(30162, 15)

Roughly 2,400 rows (32561 - 30162 = 2399) had an unusable value in at least one feature. With that, the missing-data cleanup is complete;
Since we want to make predictions and the data come with labels, this is a supervised learning problem;
First, split the data into a training set and a test set;
Before that, drop the "least useful" feature:

adult=adult.drop(['fnlwgt'],axis=1)
adult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age               30162 non-null int64
workclass         30162 non-null object
education         30162 non-null object
education-num     30162 non-null int64
marital-status    30162 non-null object
occupation        30162 non-null object
relationship      30162 non-null object
race              30162 non-null object
sex               30162 non-null object
capital-gain      30162 non-null int64
capital-loss      30162 non-null int64
hours-per-week    30162 non-null int64
native-country    30162 non-null object
result            30162 non-null object
dtypes: int64(5), object(9)
memory usage: 4.7+ MB

Import what we need to split off the training and test sets:

from sklearn.model_selection import train_test_split  # in older scikit-learn versions this lived in sklearn.cross_validation
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation", 
             "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
# split the data: 75% for training, 25% for testing; random_state can be any integer
X_train,X_test,y_train,y_test=train_test_split(adult[col_names[1:13]],adult[col_names[13]],test_size=0.25,random_state=33)
# X_train,X_test,y_train,y_test=train_test_split(adult[1:13],adult[13],test_size=0.25,random_state=33)

Many people would write it the second way, but that is wrong: adult[1:13] slices rows by position instead of selecting columns, and adult[13] looks for a column labelled 13, which does not exist; a quick check is shown below;
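
A quick way to see the difference (an illustrative check, not part of the original code):

print(adult[1:13].shape)   # (12, 14): rows at positions 1-12 with all 14 columns, not the 12 feature columns we want
# adult[13]                # would raise a KeyError, because there is no column labelled 13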
Have a look at the shapes of the training and test sets:

print(X_train.shape)
print(X_test.shape)

(22621, 12)
(7541, 12)

The test set follows the same pattern; in more detail:

X_train.head()

Partial output:
    workclass   education   education-num   marital-status  occupation  relationship    race    sex capital-gain    capital-loss    hours-per-week  native-country
20607   Private Some-college    10  Married-civ-spouse  Craft-repair    Husband White   Male    0   0   50  United-States
31257   Private HS-grad 9   Married-civ-spouse  Other-service   Husband Black   Male    0   0   50  United-States
31892   Private HS-grad 9   Never-married   Adm-clerical    Not-in-family   White   Female  0   0   45  United-States
20220   Private HS-grad 9   Divorced    Machine-op-inspct   Unmarried   Black   Female  0   0   40  United-States
24044   Private Some-college    10  Divorced    Sales   Not-in-family   White   Female  0   0   45  United-States

y_train.head()

20607      >50K
31257     <=50K
31892     <=50K
20220     <=50K
24044      >50K
Name: result, dtype: object
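
As a small peek ahead, the "0 & 1" encoding mentioned for result in the feature list could be done roughly like this (a hedged sketch only; the names y_train_01 and y_test_01 are made up here, and the actual preprocessing is covered in Part 3):

# illustration only: map the text labels to 0/1; str.contains is used so that
# any stray leading spaces in the raw strings do not matter
y_train_01 = y_train.str.contains('>50K').astype(int)
y_test_01 = y_test.str.contains('>50K').astype(int)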

At this point the training and test sets are separated. So can we fire up the random forest right away? Not so fast: the most important features have not been processed yet. How to handle them will be revealed next time;
Part 3, the final installment, is here
