Continuing from the previous post, Machine Learning Example (1):
Our next task is to sort out the characteristics of each feature in the dataset and propose an appropriate treatment for each. Machine learning algorithms work with numeric values, but part of this dataset is text, so the different text fields need different handling. That is the next step:
1. Feature types
age: continuous numeric; a possible treatment is to bucket it into age ranges;
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. Employer type, multi-category; a common treatment is to map the categories to numbers, e.g. the eight values above could be coded 1-8 (for illustration only; this post does not recommend it);
fnlwgt: continuous numeric; the number of people the census takers believe the observation represents. This variable is not used in this post, as I consider the feature unimportant.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. Education level, multi-category; handled the same way as workclass;
education-num: continuous numeric, years of education; generally, the larger this value, the higher the income;
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. Marital status, multi-category;
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. Occupation, multi-category;
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. Household relationship, multi-category;
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. Race, multi-category; although the US opposes racial discrimination, in practice this feature turns out to matter quite a bit when distinguishing incomes;
sex: Female, Male. Gender, the simplest binary encoding (0 & 1);
capital-gain: capital gains, continuous numeric;
capital-loss: capital losses, continuous numeric;
hours-per-week: working hours, continuous numeric;
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. Country of origin, multi-category;
result: the label, ">50K" or "<=50K", binary, and the prediction target of this exercise (0 & 1);
2. Feature processing:
As we can see, the features fall into just three kinds:
1. continuous numeric features, such as age, the easiest to handle;
2. binary text features, handled with a 0/1 encoding;
3. multi-category text features.
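A minimal sketch of one way to treat each of the three kinds with pandas (the tiny frame and its values below are invented for illustration, not the actual dataset):

```python
import pandas as pd

# One toy feature of each kind described above
df = pd.DataFrame({
    "age": [22, 45, 67],                                # continuous numeric
    "sex": ["Male", "Female", "Male"],                  # binary text
    "workclass": ["Private", "State-gov", "Private"],   # multi-category text
})

# 1. Continuous: optionally bucket into ranges
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 60, 100],
                         labels=["young", "mid", "senior"])

# 2. Binary text: map to 0/1
df["sex_code"] = df["sex"].map({"Female": 0, "Male": 1})

# 3. Multi-category text: one-hot encode, which avoids implying
#    an artificial order the way a 1-8 integer coding would
df = pd.concat([df, pd.get_dummies(df["workclass"], prefix="workclass")], axis=1)
```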
Now let's process the features:
First, check the dataset for missing values:
adult.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education-num 32561 non-null int64
marital-status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital-gain 32561 non-null int64
capital-loss 32561 non-null int64
hours-per-week 32561 non-null int64
native-country 32561 non-null object
result 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
Or, a second way:
adult.isnull().any()
age False
workclass False
fnlwgt False
education False
education-num False
marital-status False
occupation False
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country False
result False
dtype: bool
But what is the actual situation?
Open the actual csv file and it is easy to see that the "?" entries (highlighted in red in the screenshot) are effectively invalid: not technically missing (NaN), but useless, and something we want to remove;
Here is a small trick:
First look at
adult.shape
(32561, 15)
So there are currently 32561 rows of data. Using a regular expression, we replace "?" with NaN:
import numpy as np
adult_clean = adult.replace(regex=[r'\?|\.|\$'], value=np.nan)
This replaces the three symbols ? . $ with NaN; adapt the pattern to your own data and needs.
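The effect of `replace` with `regex` can be verified on a toy frame (values invented for illustration); because the replacement value is not a string, pandas replaces the whole matching cell:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the '?' placeholders found in the csv
toy = pd.DataFrame({"workclass": ["Private", "?", "State-gov"],
                    "occupation": ["Sales", "Tech-support", "?"]})

# Same call as above: cells matching ? . or $ become NaN
toy_clean = toy.replace(regex=[r'\?|\.|\$'], value=np.nan)
```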
Now check again:
adult_clean.isnull().any()
age False
workclass True
fnlwgt False
education False
education-num False
marital-status False
occupation True
relationship False
race False
sex False
capital-gain False
capital-loss False
hours-per-week False
native-country True
result False
dtype: bool
Next we drop every row that contains a missing value. For some datasets one would fill empty cells with the mean instead, but since this dataset is used to predict income I simply drop the rows; this does not hurt the result, whereas filling with means would actually degrade the quality of the training set.
adult=adult_clean.dropna(how='any')
# drop every row that contains any NaN
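What `how='any'` means can be seen on a tiny invented frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, np.nan], "y": [4, np.nan, 6]})
# how='any' drops a row if ANY of its cells is NaN;
# how='all' would drop only rows whose cells are ALL NaN
cleaned = toy.dropna(how='any')
```

Only the first row, which has no NaN, survives.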
Check again:
adult.shape
(30162, 15)
About 2,400 rows had a "?" in some feature. With that, the missing-data cleanup is done;
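The drop count follows directly from the two shapes reported above:

```python
rows_before = 32561   # adult.shape before cleaning
rows_after = 30162    # adult.shape after dropna
dropped = rows_before - rows_after
print(dropped, round(dropped / rows_before * 100, 1))  # 2399 rows, about 7.4%
```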
Since we want to predict a label and the given data is labeled, we use supervised learning;
First, separate the training and test sets;
Start by dropping the "most useless" feature:
adult=adult.drop(['fnlwgt'],axis=1)
adult.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 14 columns):
age 30162 non-null int64
workclass 30162 non-null object
education 30162 non-null object
education-num 30162 non-null int64
marital-status 30162 non-null object
occupation 30162 non-null object
relationship 30162 non-null object
race 30162 non-null object
sex 30162 non-null object
capital-gain 30162 non-null int64
capital-loss 30162 non-null int64
hours-per-week 30162 non-null int64
native-country 30162 non-null object
result 30162 non-null object
dtypes: int64(5), object(9)
memory usage: 4.7+ MB
Import the function needed for splitting into training and test sets:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
col_names = ["age", "workclass", "education", "education-num", "marital-status", "occupation",
"relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "result"]
Split the data, 75% training and 25% test; random_state can be any number;
X_train,X_test,y_train,y_test=train_test_split(adult[col_names[1:13]],adult[col_names[13]],test_size=0.25,random_state=33)
# X_train,X_test,y_train,y_test=train_test_split(adult[1:13],adult[13],test_size=0.25,random_state=33)
Many people would write the second, commented-out form, but it is wrong; try it and see;
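The reason the commented-out line fails: with a plain integer slice, pandas selects rows, not columns; selecting columns needs a list of names. A tiny invented frame makes the difference visible:

```python
import pandas as pd

df = pd.DataFrame({"a": range(20), "b": range(20), "c": range(20)})

rows = df[1:13]        # integer slice -> rows 1..12, ALL columns
cols = df[["a", "b"]]  # list of names -> the named columns, ALL rows
```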
Check the shapes of the training and test sets:
print(X_train.shape)
print(X_test.shape)
(22621, 12)
(7541, 12)
The test set is split the same way; concretely:
X_train.head()
Partial output:
workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
20607 Private Some-college 10 Married-civ-spouse Craft-repair Husband White Male 0 0 50 United-States
31257 Private HS-grad 9 Married-civ-spouse Other-service Husband Black Male 0 0 50 United-States
31892 Private HS-grad 9 Never-married Adm-clerical Not-in-family White Female 0 0 45 United-States
20220 Private HS-grad 9 Divorced Machine-op-inspct Unmarried Black Female 0 0 40 United-States
24044 Private Some-college 10 Divorced Sales Not-in-family White Female 0 0 45 United-States
y_train.head()
20607 >50K
31257 <=50K
31892 <=50K
20220 <=50K
24044 >50K
Name: result, dtype: object
That completes the train/test split. So can we unleash the random forest right away? Not so fast: the most important features are still unprocessed. How to handle them will be revealed next time;
Part three, the finale, is coming!