As a machine-learning beginner, I had only practiced on the Titanic dataset before, so when I first ran into the Kaggle house-price dataset (81 features) I was a bit lost, mainly because I didn't know where to start with the data processing. After consulting some references I managed to get a pipeline running, and I'm writing it down here. Experts, please feel free to skip this.
1. Load the dataset
The dataset can be downloaded from the Kaggle website.
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
df_train = pd.read_csv("./datasets/housing_price/train.csv")
df_test = pd.read_csv("./datasets/housing_price/test.csv")
df_train.info()
The dtype of each column and its non-null count are listed below:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
2. Count the missing values per feature
missing = df_train.isnull().sum()
fig, ax = plt.subplots(1, 2, figsize=(10, 6))
# print(ax.shape)
# plot the number of missing values per feature
ax[0].set_ylabel("missing count")
missing[missing > 0].sort_values().plot.bar(ax=ax[0])
# plot the fraction of missing values per feature
ax[1].set_ylabel("missing percent")
missing_percent = missing[missing > 0].sort_values() / len(df_train)
missing_percent.plot.bar(ax=ax[1])
As you can see, a few features have a very high proportion of missing values, more than 50%.
3. Drop the features with more than 50% missing values
missing_percent_gt_50 = missing_percent[missing_percent > 0.5].index.values
df_train_drop_missing = df_train.drop(missing_percent_gt_50, axis=1)
df_train_drop_missing.drop(["Id"], axis=1, inplace=True)
df_train_drop_missing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(34), object(39)
memory usage: 867.0+ KB
4. Split the features into numeric and categorical
Numeric and categorical features need different treatment, so they are separated here. The split simply checks whether a column's dtype is object. Note that in the info() output above MSSubClass is typed as int, but it is actually a categorical feature.
num_features = [feature for feature in df_train_drop_missing.columns if df_train_drop_missing.dtypes[feature] != "object"]
cate_features = [feature for feature in df_train_drop_missing.columns if df_train_drop_missing.dtypes[feature] == "object"]
num_features.remove("MSSubClass")
cate_features.append("MSSubClass")
df_train_num = df_train_drop_missing[num_features].copy()
df_train_cate = df_train_drop_missing[cate_features].copy()
# convert MSSubClass from int to str so that it is treated as a categorical feature
df_train_cate["MSSubClass"] = df_train_cate["MSSubClass"].apply(lambda x: str(x))
5. Look at the relationship between each numeric feature and SalePrice
for feature in df_train_num.columns[:-1]:
    df_train_num.plot.scatter(x=feature, y="SalePrice")
The full set of plots is long, so only the LotFrontage and OverallQual vs. SalePrice scatter plots are shown here:
From the first plot, LotFrontage has a clear linear relationship with SalePrice, although it also seems to have a couple of outliers. OverallQual likewise has a clear relationship with SalePrice, and OverallQual could alternatively be treated as a categorical feature.
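To take a closer look at those suspected LotFrontage outliers, a quick filter works; the threshold of 200 below is just an eyeballed guess from the scatter plot, not anything principled:
# inspect the points with unusually large LotFrontage (threshold chosen by eye)
df_train_num[df_train_num["LotFrontage"] > 200][["LotFrontage", "SalePrice"]]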
6. Look at the correlations between the numeric features
num_corr = df_train_num.corr()
import seaborn as sns
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(num_corr, square=True, ax=ax)
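Besides the heatmap, it can be handy to rank the numeric features by their correlation with SalePrice; this is just a different view of the num_corr matrix computed above:
# the numeric features most strongly correlated with the target
num_corr["SalePrice"].drop("SalePrice").sort_values(ascending=False).head(10)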
7. Transform the target variable so that its distribution is closer to normal
First look at the distribution of SalePrice:
df_train_drop_missing["SalePrice"].hist(bins=30)
The SalePrice histogram has a long right tail (a heavy-tailed distribution), so a log transform can bring it closer to a normal distribution. Since log and exp are inverses of each other (here np.log1p and np.expm1), log-transforming the target does not change what the model can ultimately predict: train on the log-transformed target, then apply the exp transform to the model's predictions to recover real prices.
df_train_drop_missing["SalePrice"].apply(lambda x: np.log1p(x)).hist(bins=30)
Let's verify that the log and exp transforms really invert each other: applying log1p and then expm1 should leave the values unchanged.
df_train_drop_missing["SalePrice"].apply(lambda x: np.log1p(x)).apply(lambda x: np.expm1(x)).hist(bins=30)
The result matches expectations.
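For a numeric sanity check on top of the histograms, a minimal sketch: np.expm1 should undo np.log1p up to floating-point error.
prices = df_train_drop_missing["SalePrice"].values
print(np.allclose(prices, np.expm1(np.log1p(prices))))  # expected: True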
8. Handle the categorical features
df_train_cate.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 40 columns):
MSZoning 1460 non-null object
Street 1460 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinType2 1422 non-null object
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
KitchenQual 1460 non-null object
Functional 1460 non-null object
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageFinish 1379 non-null object
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
SaleType 1460 non-null object
SaleCondition 1460 non-null object
MSSubClass 1460 non-null object
dtypes: object(40)
memory usage: 456.3+ KB
Bar charts of the value counts for each categorical feature:
import warnings
warnings.filterwarnings("ignore")
for cate_col in df_train_cate.columns.values:
    fig = plt.figure(figsize=(3, 3))
    df_train_cate[cate_col].value_counts().plot.bar(legend=True)
The plots are long, so they are omitted here; you get the idea.
9. Find the categorical features that have missing values
df_train_cate_missing_sum = df_train_cate.isnull().sum()
cate_missing_features = df_train_cate_missing_sum[df_train_cate_missing_sum > 0].index.values
cate_missing_features
array(['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'], dtype=object)
These are the features with missing values.
10. Fill the missing values; here we simply use each feature's most frequent category (its mode)
# mode() gives the most frequent value of every column; keep it as a Series indexed by feature name
fill_missing_value = df_train_cate.mode().loc[0]
for feature in cate_missing_features:
    print(feature, fill_missing_value[feature])
    df_train_cate[feature].fillna(fill_missing_value[feature], inplace=True)
df_train_cate.info()
The result:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 40 columns):
MSZoning 1460 non-null object
Street 1460 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1460 non-null object
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1460 non-null object
BsmtCond 1460 non-null object
BsmtExposure 1460 non-null object
BsmtFinType1 1460 non-null object
BsmtFinType2 1460 non-null object
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1460 non-null object
KitchenQual 1460 non-null object
Functional 1460 non-null object
FireplaceQu 1460 non-null object
GarageType 1460 non-null object
GarageFinish 1460 non-null object
GarageQual 1460 non-null object
GarageCond 1460 non-null object
PavedDrive 1460 non-null object
SaleType 1460 non-null object
SaleCondition 1460 non-null object
MSSubClass 1460 non-null object
dtypes: object(40)
memory usage: 456.3+ KB
All the missing values have been filled in.
11. One-hot encode the categorical features; some models can only handle numeric features, so this conversion is necessary
df_train_cate_dummies = pd.get_dummies(df_train_cate)
df_train_cate_dummies.head()
12. Fill the missing values in the numeric features; here we simply use each feature's mean
# compute the mean of each numeric feature
num_feature_mean = df_train_num.mean(axis=0)
# fill each column's missing values with that column's mean
df_train_num = df_train_num.fillna(num_feature_mean)
df_train_num.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 36 columns):
LotFrontage 1460 non-null float64
LotArea 1460 non-null float64
OverallQual 1460 non-null float64
OverallCond 1460 non-null float64
YearBuilt 1460 non-null float64
YearRemodAdd 1460 non-null float64
MasVnrArea 1460 non-null float64
BsmtFinSF1 1460 non-null float64
BsmtFinSF2 1460 non-null float64
BsmtUnfSF 1460 non-null float64
TotalBsmtSF 1460 non-null float64
1stFlrSF 1460 non-null float64
2ndFlrSF 1460 non-null float64
LowQualFinSF 1460 non-null float64
GrLivArea 1460 non-null float64
BsmtFullBath 1460 non-null float64
BsmtHalfBath 1460 non-null float64
FullBath 1460 non-null float64
HalfBath 1460 non-null float64
BedroomAbvGr 1460 non-null float64
KitchenAbvGr 1460 non-null float64
TotRmsAbvGrd 1460 non-null float64
Fireplaces 1460 non-null float64
GarageYrBlt 1460 non-null float64
GarageCars 1460 non-null float64
GarageArea 1460 non-null float64
WoodDeckSF 1460 non-null float64
OpenPorchSF 1460 non-null float64
EnclosedPorch 1460 non-null float64
3SsnPorch 1460 non-null float64
ScreenPorch 1460 non-null float64
PoolArea 1460 non-null float64
MiscVal 1460 non-null float64
MoSold 1460 non-null float64
YrSold 1460 non-null float64
SalePrice 1460 non-null float64
dtypes: float64(36)
memory usage: 410.7 KB
The missing values in the numeric features are filled in as well. Next, look at the value ranges of the numeric features:
df_train_num.describe()
(The full output is a very wide table with 36 columns; an excerpt is enough to make the point.)
              count       mean        std      min       max
LotArea      1460.0   10516.83    9981.26   1300.0  215245.0
OverallQual  1460.0       6.10       1.38      1.0      10.0
GrLivArea    1460.0    1515.46     525.48    334.0    5642.0
SalePrice    1460.0  180921.20   79442.50  34900.0  755000.0
The value ranges differ wildly, so let's deal with that.
13. Standardize the numeric features; the goal is to bring all features onto a comparable scale so that no feature dominates the model simply because of its units
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_train_num_scaled = scaler.fit_transform(df_train_num.values[:, :-1])
df_train_num_scaled = pd.DataFrame(df_train_num_scaled, columns=df_train_num.columns[:-1])
df_train_num_scaled.describe()
(Another very wide table; after scaling, every column has mean ≈ 0 and standard deviation ≈ 1. An excerpt:)
              count        mean     std      min      max
LotFrontage  1460.0   4.08e-16  1.0003  -2.2279  11.0349
LotArea      1460.0  -5.84e-17  1.0003  -0.9237  20.5183
GrLivArea    1460.0  -1.28e-16  1.0003  -2.2491   7.8556
After standardization the numeric features are on a uniform scale, with mean 0 and variance 1.
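As a quick sanity check on what StandardScaler actually did, the scaled values of any column should match the manual z-score (x - mean) / std; note that StandardScaler uses the population standard deviation (ddof=0). A minimal sketch:
col = "GrLivArea"  # any numeric column except SalePrice would do
manual = (df_train_num[col] - df_train_num[col].mean()) / df_train_num[col].std(ddof=0)
print(np.allclose(manual.values, df_train_num_scaled[col].values))  # expected: True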
14. Assemble the training set
df_train_y = df_train_num["SalePrice"].apply(lambda x: np.log1p(x))
df_train_y_matrix = df_train_y.values
df_train_x_matrix = pd.concat([df_train_num_scaled, df_train_cate_dummies], axis=1).values
train_X = df_train_x_matrix.copy()
train_y = df_train_y_matrix.copy()
print(train_X.shape, train_y.shape)
(1460, 301) (1460,)
Now that the training set is ready, can we finally get to the modelling? Not so fast. The original data had only 81 columns including SalePrice, yet after all of the processing above the training set has 301 features. What are they all? Are they useful? Will they help or hurt the final model? And won't training cost a lot more with this many features?
15. Feature selection; here a Lasso model (linear regression with L1 regularization) is used to select features. For the feature-selection utilities in sklearn, see the referenced article on common feature-selection methods in scikit-learn.
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
lasso = LassoCV()
lasso.fit(train_X, train_y)
model = SelectFromModel(lasso, prefit=True)
train_X_new = model.transform(train_X)
print(train_X_new.shape)
print("特徵選擇標記:", model.get_support())
(1460, 108)
Selected feature mask: [False True True True True True True True True False True True False True True True True True True True True True True True True True True True True True True True True True True True True False False True False False False True False False True False False False False False True True False False False False True False False False False True True False True True False False True True True False False True True False False False False True True False False True False True False False False False False False False False False False True False False False True False False True False False False False False False False False False False True False False False False True False False False False False False False False False False True False False True False True False False False True True False False False False False False False False False False False False False False True False False False False False False True False False True True False False False False True True False True False False False True False False False True False False False False True False True True False True True False False True False False True False True False False False False True False False True False False False True True True False False True False False False False False False False
True False False True False True False False False False True False False True True False False False True False False False False False False False False True False False True False False False False False True False False False True False True True False False False False False True False True True False False True True False False True False False True True False False False True False False False False False]
Features whose mask entry is True are the selected ones. After feature selection, 108 features remain.
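The raw boolean mask is hard to read, so it helps to map it back to column names. A minimal sketch, assuming the column order matches the concatenation used to build train_X in step 14 (numeric columns first, then the one-hot columns):
all_features = list(df_train_num_scaled.columns) + list(df_train_cate_dummies.columns)
selected_features = [f for f, keep in zip(all_features, model.get_support()) if keep]
print(len(selected_features))   # should be 108
print(selected_features[:10])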
16. Now for the main event: model training
1) First, try a RandomForest model:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
rfr_params = {
"n_estimators": [100, 200, 300],
# "criterion": ["gini"],
"max_depth": [2, 3, 4],
"min_samples_split": [2, 3, 5],
"min_samples_leaf": [2, 3, 5],
"max_features": ["sqrt", "log2", "auto"]
}
rfr = RandomForestRegressor(n_jobs=-1)
rfr_grid_search = GridSearchCV(rfr, rfr_params, cv=5, scoring="neg_mean_squared_error")
rfr_grid_search.fit(train_X_new, train_y)
best_params = rfr_grid_search.best_params_
best_scores = rfr_grid_search.best_score_
Using the best RandomForest hyper-parameters found by the grid search, look at the cross-validation results on the training set:
from sklearn.model_selection import cross_val_score
best_rfr = RandomForestRegressor(**best_params)
scores = cross_val_score(best_rfr, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error")
scores_rmse = np.sqrt(-scores)
print("score_rmse: {}, {} +/ {}".format(scores_rmse, np.mean(scores_rmse), np.std(scores_rmse)))
score_rmse: [0.16991974 0.18253832 0.17872277 0.16521411 0.17719227], 0.17471744250579152 +/ 0.006271601381559145
- Plot the RandomForest learning (error) curve. The parameters below are set by hand for this plot; note that they differ slightly from the grid that was actually searched.
from sklearn.model_selection import learning_curve
best_params = {
'max_depth': 4,
'max_features': 'auto',
'min_samples_leaf': 10,
'min_samples_split': 10,
'n_estimators': 200
}
best_rfr = RandomForestRegressor(**best_params)
train_sizes, train_neg_mse, test_neg_mse = learning_curve(best_rfr, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error",
train_sizes=np.linspace(0.1, 1.0, 20))
train_rmse = np.mean(np.sqrt(-train_neg_mse), axis=1)
test_rmse = np.mean(np.sqrt(-test_neg_mse), axis=1)
plt.plot(train_sizes, train_rmse, color="r", marker="o", label="train error")
plt.plot(train_sizes, test_rmse, color="b", marker="^", label="test error")
plt.xlabel("Train Size")
plt.ylabel("rmse")
plt.legend(loc="upper right")
Judging from the learning curve, the model shows essentially no overfitting, and the result is acceptable.
2) Train a GBDT model
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
gbdt_params = {
"n_estimators": [50, 100],
"max_depth": [3, 5, 7],
"min_samples_split": [3, 5, 7],
"min_samples_leaf": [3, 5, 7],
"max_features": ["sqrt", "log2"],
# "loss": ["deviance", "exponential"],
"learning_rate": [0.1, 0.5, 0.05, 0.01, 0.005],
"subsample": [0.5, 0.7]
}
gbdt = GradientBoostingRegressor()
gbdt_grid_search = GridSearchCV(gbdt, gbdt_params, cv=5, scoring="neg_mean_squared_error")
gbdt_grid_search.fit(train_X_new, train_y)
gbdt_best_params = gbdt_grid_search.best_params_
gbdt_best_scores = gbdt_grid_search.best_score_
gbdt_best_params, gbdt_best_scores
Now look at the GBDT cross-validation results on the training set:
from sklearn.model_selection import cross_val_score
best_gbdt = GradientBoostingRegressor(**gbdt_best_params)
scores = cross_val_score(best_gbdt, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error")
scores_rmse = np.sqrt(-scores)
print("score_rmse: {}, {} +/ {}".format(scores_rmse, np.mean(scores_rmse), np.std(scores_rmse)))
score_rmse: [0.11242787 0.13998242 0.13887749 0.11183915 0.13257266], 0.1271399183125363 +/ 0.012512177359738105
RMSE=0.127,比RandomForest提升不少。
- Plot the learning curve
The learning curve drawn with the best hyper-parameters from the GBDT grid search showed overfitting, so some of the parameters were adjusted by hand:
from sklearn.model_selection import learning_curve
best_params = {
'learning_rate': 0.03,
'max_depth': 5,
'max_features': 'sqrt',
'min_samples_leaf': 7,
'min_samples_split': 7,
'n_estimators': 100,
'subsample': 0.7
}
best_gbdt = GradientBoostingRegressor(**best_params)
train_sizes, train_neg_mse, test_neg_mse = learning_curve(best_gbdt, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error",
train_sizes=np.linspace(0.1, 1.0, 20))
train_rmse = np.mean(np.sqrt(-train_neg_mse), axis=1)
test_rmse = np.mean(np.sqrt(-test_neg_mse), axis=1)
plt.plot(train_sizes, train_rmse, color="r", marker="o", label="train error")
plt.plot(train_sizes, test_rmse, color="b", marker="^", label="test error")
plt.xlabel("Train Size")
plt.ylabel("rmse")
plt.legend(loc="upper right")
The result still looks slightly overfit, but it is acceptable.
17. With the models trained, it is time to predict on the test set. Let's go!
1) First, fill the missing values in the test data in the same way as in the training data.
# drop the features with too many missing values
df_test_drop_missing = df_test.drop(missing_percent_gt_50, axis=1)
df_test_drop_missing.drop(["Id"], axis=1, inplace=True)
# df_test_drop_missing.info()
df_test_num = df_test_drop_missing[num_features[:-1]].copy()
df_test_cate = df_test_drop_missing[cate_features].copy()
df_test_cate["MSSubClass"] = df_test_cate["MSSubClass"].apply(lambda x: str(x))
# fill missing values in the categorical features using the modes learned from the training set
for feature in df_test_cate.columns.values:
    print(feature, fill_missing_value[feature])
    df_test_cate[feature].fillna(fill_missing_value[feature], inplace=True)
# fill missing values in the numeric features using the training-set means
df_test_num = df_test_num.fillna(num_feature_mean)
2) Then one-hot encode the categorical features of the test data
df_test_cate_dummies = pd.get_dummies(df_test_cate)
df_test_cate_dummies.info()
3) Standardize the test numeric features with the scaler that was fitted on the training data
df_test_num_scaled = scaler.transform(df_test_num.values)
df_test_num_scaled = pd.DataFrame(df_test_num_scaled, columns=df_test_num.columns)
df_test_num_scaled.info()
4) Assemble the numeric and categorical features into the test matrix
df_test_x_matrix = pd.concat([df_test_num_scaled, df_test_cate_dummies], axis=1).values
df_test_x_matrix.shape
(1459, 274)
What, you thought that was it and we could feed this straight into the model? Remember that the processed training data ended up with 301 features, yet here there are only 274. Where did the missing 27 features go? Which ones are they, and why are they missing?
The reason is that a categorical feature in the test set may contain fewer distinct categories than in the training set. Say both sets have a color feature: the training set contains "red", "yellow" and "blue", but the test set only contains "red" and "yellow". Then get_dummies produces three columns (color_red, color_yellow, color_blue) for the training set but only two (color_red, color_yellow) for the test set, leaving the test set one feature short. That is where the missing columns come from, and once we know the cause, we simply add them back. Onward.
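A tiny toy example (the color column here is made up purely for illustration; it is not part of the housing data) shows the mismatch, and how reindex can close the gap in one call:
# toy illustration only: a categorical value present in train but absent in test
train_toy = pd.get_dummies(pd.DataFrame({"color": ["red", "yellow", "blue"]}))
test_toy = pd.get_dummies(pd.DataFrame({"color": ["red", "yellow"]}))
print(train_toy.columns.tolist())  # ['color_blue', 'color_red', 'color_yellow']
print(test_toy.columns.tolist())   # ['color_red', 'color_yellow']
test_toy = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(test_toy.columns.tolist())   # now matches the training columns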
index = 0
for feature in df_train_cate_dummies.columns:
    if feature not in df_test_cate_dummies:
        index += 1
        print(index, feature)
        df_test_cate_dummies[feature] = 0
df_test_cate_dummies.info()
The missing feature columns are filled in with 0.
What, you thought that was enough? Think about it: if the training set can have dummy columns the test set lacks, could the test set also have columns the training set lacks? Of course it could; in that case the extra test-only columns simply get dropped.
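A minimal sketch of that check, dropping any dummy columns that appear only in the test set (in this pipeline the reindex in the next step would discard them anyway, since it keeps only the training columns):
extra_cols = [c for c in df_test_cate_dummies.columns if c not in df_train_cate_dummies.columns]
print(extra_cols)  # dummy columns that exist only in the test set, if any
df_test_cate_dummies = df_test_cate_dummies.drop(columns=extra_cols)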
Can we predict now? Not quite; the results would still be dreadful. The test set now has the same number and kind of features as the training set, but the column order almost certainly differs, so the test columns have to be aligned to the training column order.
df_test_cate_dummies = df_test_cate_dummies.reindex(columns=df_train_cate_dummies.columns)
df_test_x_matrix = pd.concat([df_test_num_scaled, df_test_cate_dummies], axis=1).values
df_test_x_matrix.shape
(1459, 301)
Then apply the same feature selection to the test set:
test_X_new = model.transform(df_test_x_matrix)
test_X_new.shape
(1459, 108)
18. Predict; finally, the last step!
# fit the tuned RandomForest on the full training set before predicting
best_rfr.fit(train_X_new, train_y)
y_pred = best_rfr.predict(test_X_new)
y_pred.shape
y_pred[:10]
array([11.74767867, 11.87027992, 12.03688675, 12.09013487, 12.26120576, 12.07010157, 11.89011736, 12.05931896, 12.14915021, 11.73820217])
Remember that SalePrice was log-transformed earlier; the predictions now need to be transformed back.
y_pred = np.expm1(y_pred)
y_pred[:10]
array([126458.66356193, 142953.24835482, 168869.38535629, 178105.12766462, 211335.25272186, 174572.57668388, 145817.40958437, 172700.3291329 , 188932.46869204, 125265.92920833])
Save the predictions to a CSV file:
output = pd.DataFrame(
{
"Id": df_test["Id"],
"SalePrice": y_pred
}
)
from datetime import datetime
now_time = datetime.now().strftime("%Y%m%d%H%M%S%f")
output.to_csv("./outputs/random_forest_%s.csv" % now_time, index=False)
Finally, submit the prediction file to Kaggle and see how it scores!