Kaggle House Prices prediction, score = 0.12986

As a machine-learning novice who had only practiced on the Titanic dataset before, I was a bit lost when I first ran into the Kaggle House Prices dataset (81 features), mainly because I had no idea where to start with the data processing. After working through some references I managed to get a pipeline running, and I'm writing it up here. Experts, please feel free to skip ahead.

1. Load the dataset

The dataset can be downloaded from the Kaggle website.
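
If the kaggle Python package is installed and an API token is configured (an assumption; downloading the files manually from the competition page works just as well), something along these lines should fetch the data into the directory used below:

import kaggle  # assumes ~/.kaggle/kaggle.json holds your API credentials

# downloads the competition zip archive; extract train.csv / test.csv
# into ./datasets/housing_price/ before running the cells below
kaggle.api.competition_download_files(
    "house-prices-advanced-regression-techniques", path="./datasets/housing_price"
)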

%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

df_train = pd.read_csv("./datasets/housing_price/train.csv")
df_test = pd.read_csv("./datasets/housing_price/test.csv")

df_train.info()

The dtype of each column and the missing-value counts are shown below:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
PoolQC           7 non-null object
Fence            281 non-null object
MiscFeature      54 non-null object
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

2. Compute the fraction of missing values per feature

missing = df_train.isnull().sum()
fig, ax = plt.subplots(1, 2, figsize=(10, 6))
# plot the missing-value counts
ax[0].set_ylabel("missing count")
missing[missing > 0].sort_values().plot.bar(ax=ax[0])
# plot the missing-value proportions
ax[1].set_ylabel("missing percent")
missing_percent = missing[missing > 0].sort_values() / len(df_train)
missing_percent.plot.bar(ax=ax[1])

As the plots show, several features have a very high fraction of missing values, above 50%.

3. Drop features with more than 50% missing values

missing_percent_gt_50 = missing_percent[missing_percent > 0.5].index.values
df_train_drop_missing = df_train.drop(missing_percent_gt_50, axis=1)
df_train_drop_missing.drop(["Id"], axis=1, inplace=True)
df_train_drop_missing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 76 columns):
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
MasVnrArea       1452 non-null float64
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinSF1       1460 non-null int64
BsmtFinType2     1422 non-null object
BsmtFinSF2       1460 non-null int64
BsmtUnfSF        1460 non-null int64
TotalBsmtSF      1460 non-null int64
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
1stFlrSF         1460 non-null int64
2ndFlrSF         1460 non-null int64
LowQualFinSF     1460 non-null int64
GrLivArea        1460 non-null int64
BsmtFullBath     1460 non-null int64
BsmtHalfBath     1460 non-null int64
FullBath         1460 non-null int64
HalfBath         1460 non-null int64
BedroomAbvGr     1460 non-null int64
KitchenAbvGr     1460 non-null int64
KitchenQual      1460 non-null object
TotRmsAbvGrd     1460 non-null int64
Functional       1460 non-null object
Fireplaces       1460 non-null int64
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageYrBlt      1379 non-null float64
GarageFinish     1379 non-null object
GarageCars       1460 non-null int64
GarageArea       1460 non-null int64
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
WoodDeckSF       1460 non-null int64
OpenPorchSF      1460 non-null int64
EnclosedPorch    1460 non-null int64
3SsnPorch        1460 non-null int64
ScreenPorch      1460 non-null int64
PoolArea         1460 non-null int64
MiscVal          1460 non-null int64
MoSold           1460 non-null int64
YrSold           1460 non-null int64
SaleType         1460 non-null object
SaleCondition    1460 non-null object
SalePrice        1460 non-null int64
dtypes: float64(3), int64(34), object(39)
memory usage: 867.0+ KB

4. Split numeric and categorical features

Numeric and categorical features need different treatment, so they are separated here; a feature is simply treated as numeric whenever its dtype is not object. Note from the info() summary above that MSSubClass has dtype int, but it is really a categorical feature.

num_features = [feature for feature in df_train_drop_missing.columns if df_train_drop_missing.dtypes[feature] != "object"]
cate_features = [feature for feature in df_train_drop_missing.columns if df_train_drop_missing.dtypes[feature] == "object"]
num_features.remove("MSSubClass")
cate_features.append("MSSubClass")
df_train_num = df_train_drop_missing[num_features].copy()
df_train_cate = df_train_drop_missing[cate_features].copy()

# Convert MSSubClass from int to str so it is treated as a categorical feature
df_train_cate["MSSubClass"] = df_train_cate["MSSubClass"].apply(lambda x: str(x))

5. Examine how the numeric features relate to SalePrice

for feature in df_train_num.columns[:-1]:
    df_train_num.plot.scatter(x=feature, y="SalePrice")

The full set of plots is long, so only the LotFrontage and OverallQual plots against SalePrice are shown here:

The plots show a clear linear relationship between LotFrontage and SalePrice, although LotFrontage appears to contain two outliers. OverallQual is also clearly linearly related to SalePrice, and OverallQual could alternatively be treated as a categorical feature.
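
Nothing is done about those suspected outliers in this walkthrough, but if you wanted to drop them, a minimal sketch would look like this (the cutoff of 300 is just a guess read off the scatter plot, not a value derived from the analysis):

# hypothetical cutoff eyeballed from the LotFrontage scatter plot; verify before using
outlier_idx = df_train_num[df_train_num["LotFrontage"] > 300].index
df_train_num_clean = df_train_num.drop(outlier_idx)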

6. Examine correlations among the numeric features

import seaborn as sns

num_corr = df_train_num.corr()
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(num_corr, square=True, ax=ax)
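
The heatmap gives the big picture; to rank the numeric features by their correlation with the target directly, a one-liner on the num_corr matrix computed above is enough:

# the ten numeric features most correlated with SalePrice
print(num_corr["SalePrice"].sort_values(ascending=False).head(10))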

7. Transform the target variable so its distribution is closer to normal

First, look at the distribution of SalePrice:

df_train_drop_missing["SalePrice"].hist(bins=30)

The SalePrice histogram has a long right tail (a heavy-tailed distribution), so a log transform brings the distribution closer to normal. Since log and exp invert each other, log-transforming the target does not lose any information: train the model on the log-transformed target, then apply the exponential to its predictions to recover actual prices. (Conveniently, the Kaggle leaderboard for this competition scores RMSE on the logarithm of SalePrice, so cross-validated RMSE on the log target tracks the leaderboard score.)

df_train_drop_missing["SalePrice"].apply(lambda x: np.log1p(x)).hist(bins=30)

Let's verify that log and exp really invert each other: applying the log transform and then the exp transform should leave the distribution unchanged.

df_train_drop_missing["SalePrice"].apply(lambda x: np.log1p(x)).apply(lambda x: np.expm1(x)).hist(bins=30)

The result matches expectations.

8. Process the categorical features

df_train_cate.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 40 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1452 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1423 non-null object
BsmtCond         1423 non-null object
BsmtExposure     1422 non-null object
BsmtFinType1     1423 non-null object
BsmtFinType2     1422 non-null object
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1459 non-null object
KitchenQual      1460 non-null object
Functional       1460 non-null object
FireplaceQu      770 non-null object
GarageType       1379 non-null object
GarageFinish     1379 non-null object
GarageQual       1379 non-null object
GarageCond       1379 non-null object
PavedDrive       1460 non-null object
SaleType         1460 non-null object
SaleCondition    1460 non-null object
MSSubClass       1460 non-null object
dtypes: object(40)
memory usage: 456.3+ KB

Bar charts of the category counts for each categorical feature:

import warnings

warnings.filterwarnings("ignore")
for cate_col in df_train_cate.columns.values:
    fig = plt.figure(figsize=(3, 3))
    df_train_cate[cate_col].value_counts().plot.bar(legend=True)

The figure is long; you get the idea.
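
Instead of scanning 40 bar charts, a compact cardinality summary of the df_train_cate frame gives much the same overview:

# number of distinct categories per categorical feature, lowest first
print(df_train_cate.nunique().sort_values())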

9. Find the categorical features that contain missing values

df_train_cate_missing_sum = df_train_cate.isnull().sum()
cate_missing_features = df_train_cate_missing_sum[df_train_cate_missing_sum > 0].index.values
cate_missing_features
array(['MasVnrType', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'], dtype=object)

These are the features with missing values.

10. Fill the missing values; here we simply use each feature's most frequent category

# Most frequent value of each categorical column, kept as a Series indexed by
# column name so that every feature is filled with its own mode
fill_missing_value = df_train_cate.mode().loc[0]

for feature in cate_missing_features:
    print(feature, fill_missing_value[feature])
    df_train_cate[feature].fillna(fill_missing_value[feature], inplace=True)
df_train_cate.info()

The result:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 40 columns):
MSZoning         1460 non-null object
Street           1460 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-null object
Exterior2nd      1460 non-null object
MasVnrType       1460 non-null object
ExterQual        1460 non-null object
ExterCond        1460 non-null object
Foundation       1460 non-null object
BsmtQual         1460 non-null object
BsmtCond         1460 non-null object
BsmtExposure     1460 non-null object
BsmtFinType1     1460 non-null object
BsmtFinType2     1460 non-null object
Heating          1460 non-null object
HeatingQC        1460 non-null object
CentralAir       1460 non-null object
Electrical       1460 non-null object
KitchenQual      1460 non-null object
Functional       1460 non-null object
FireplaceQu      1460 non-null object
GarageType       1460 non-null object
GarageFinish     1460 non-null object
GarageQual       1460 non-null object
GarageCond       1460 non-null object
PavedDrive       1460 non-null object
SaleType         1460 non-null object
SaleCondition    1460 non-null object
MSSubClass       1460 non-null object
dtypes: object(40)
memory usage: 456.3+ KB

All of the categorical missing values are now filled.

11. One-hot encode the categorical features; some models can only handle numeric inputs, so the encoding is required

df_train_cate_dummies = pd.get_dummies(df_train_cate)
df_train_cate_dummies.head()

12. Fill the missing values in the numeric features; here we simply use each feature's mean

# Column means of the numeric features
num_feature_mean = df_train_num.mean(axis=0)

# Fill each column's missing values with that column's own mean
df_train_num = df_train_num.fillna(num_feature_mean)
df_train_num.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 36 columns):
LotFrontage      1460 non-null float64
LotArea          1460 non-null float64
OverallQual      1460 non-null float64
OverallCond      1460 non-null float64
YearBuilt        1460 non-null float64
YearRemodAdd     1460 non-null float64
MasVnrArea       1460 non-null float64
BsmtFinSF1       1460 non-null float64
BsmtFinSF2       1460 non-null float64
BsmtUnfSF        1460 non-null float64
TotalBsmtSF      1460 non-null float64
1stFlrSF         1460 non-null float64
2ndFlrSF         1460 non-null float64
LowQualFinSF     1460 non-null float64
GrLivArea        1460 non-null float64
BsmtFullBath     1460 non-null float64
BsmtHalfBath     1460 non-null float64
FullBath         1460 non-null float64
HalfBath         1460 non-null float64
BedroomAbvGr     1460 non-null float64
KitchenAbvGr     1460 non-null float64
TotRmsAbvGrd     1460 non-null float64
Fireplaces       1460 non-null float64
GarageYrBlt      1460 non-null float64
GarageCars       1460 non-null float64
GarageArea       1460 non-null float64
WoodDeckSF       1460 non-null float64
OpenPorchSF      1460 non-null float64
EnclosedPorch    1460 non-null float64
3SsnPorch        1460 non-null float64
ScreenPorch      1460 non-null float64
PoolArea         1460 non-null float64
MiscVal          1460 non-null float64
MoSold           1460 non-null float64
YrSold           1460 non-null float64
SalePrice        1460 non-null float64
dtypes: float64(36)
memory usage: 410.7 KB

The numeric missing values are filled as well. Next, look at the value ranges of the numeric features:

df_train_num.describe()
	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageYrBlt	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SalePrice
count	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000	1460.000000
mean	70.049958	10516.828082	6.099315	5.575342	1971.267808	1984.865753	103.500959	443.639726	46.549315	567.240411	1057.429452	1162.626712	346.992466	5.844521	1515.463699	0.425342	0.057534	1.565068	0.382877	2.866438	1.046575	6.517808	0.613014	1872.626059	1.767123	472.980137	94.244521	46.660274	21.954110	3.409589	15.060959	2.758904	43.489041	6.321918	2007.815753	180921.195890
std	22.024023	9981.264932	1.382997	1.112799	30.202904	20.645407	180.586195	456.098091	161.319273	441.866955	438.705324	386.587738	436.528436	48.623081	525.480383	0.518911	0.238753	0.550916	0.502885	0.815778	0.220338	1.625393	0.644666	437.679677	0.747315	213.804841	125.338794	66.256028	61.119149	29.317331	55.757415	40.177307	496.123024	2.703626	1.328095	79442.502883
min	21.000000	1300.000000	1.000000	1.000000	1872.000000	1950.000000	0.000000	0.000000	0.000000	0.000000	0.000000	334.000000	0.000000	0.000000	334.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	2.000000	0.000000	70.049958	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	2006.000000	34900.000000
25%	60.000000	7553.500000	5.000000	5.000000	1954.000000	1967.000000	0.000000	0.000000	0.000000	223.000000	795.750000	882.000000	0.000000	0.000000	1129.500000	0.000000	0.000000	1.000000	0.000000	2.000000	1.000000	5.000000	0.000000	1958.000000	1.000000	334.500000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	5.000000	2007.000000	129975.000000
50%	70.049958	9478.500000	6.000000	5.000000	1973.000000	1994.000000	0.000000	383.500000	0.000000	477.500000	991.500000	1087.000000	0.000000	0.000000	1464.000000	0.000000	0.000000	2.000000	0.000000	3.000000	1.000000	6.000000	1.000000	1977.000000	2.000000	480.000000	0.000000	25.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000	2008.000000	163000.000000
75%	79.000000	11601.500000	7.000000	6.000000	2000.000000	2004.000000	164.250000	712.250000	0.000000	808.000000	1298.250000	1391.250000	728.000000	0.000000	1776.750000	1.000000	0.000000	2.000000	1.000000	3.000000	1.000000	7.000000	1.000000	2001.000000	2.000000	576.000000	168.000000	68.000000	0.000000	0.000000	0.000000	0.000000	0.000000	8.000000	2009.000000	214000.000000
max	313.000000	215245.000000	10.000000	9.000000	2010.000000	2010.000000	1600.000000	5644.000000	1474.000000	2336.000000	6110.000000	4692.000000	2065.000000	572.000000	5642.000000	3.000000	2.000000	3.000000	2.000000	8.000000	3.000000	14.000000	3.000000	2010.000000	4.000000	1418.000000	857.000000	547.000000	552.000000	508.000000	480.000000	738.000000	15500.000000	12.000000	2010.000000	755000.000000

The value ranges are all over the place, so let's deal with that.

13. Standardize the numeric features, so that their value ranges are comparable and each feature has a comparable influence on the model

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit on every numeric column except the target (SalePrice is the last column)
df_train_num_scaled = scaler.fit_transform(df_train_num.values[:, :-1])

df_train_num_scaled = pd.DataFrame(df_train_num_scaled, columns=df_train_num.columns[:-1])

df_train_num_scaled.describe()
	LotFrontage	LotArea	OverallQual	OverallCond	YearBuilt	YearRemodAdd	MasVnrArea	BsmtFinSF1	BsmtFinSF2	BsmtUnfSF	TotalBsmtSF	1stFlrSF	2ndFlrSF	LowQualFinSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	TotRmsAbvGrd	Fireplaces	GarageYrBlt	GarageCars	GarageArea	WoodDeckSF	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold
count	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03	1.460000e+03
mean	4.075887e-16	-5.840077e-17	1.387018e-16	3.540547e-16	1.046347e-15	4.496860e-15	-5.840077e-17	-2.433366e-17	-3.406712e-17	-6.600504e-17	2.457699e-16	6.509253e-17	-1.825024e-17	1.216683e-17	-1.277517e-16	2.311697e-17	2.433366e-17	1.180182e-16	2.083569e-17	2.141362e-16	4.501726e-16	-1.022014e-16	-4.866731e-18	-3.394545e-16	1.216683e-16	-1.216683e-17	5.596741e-17	3.041707e-17	-2.311697e-17	4.866731e-18	5.475072e-17	1.946692e-17	-2.676702e-17	7.543433e-17	3.567436e-14
std	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00	1.000343e+00
min	-2.227875e+00	-9.237292e-01	-3.688413e+00	-4.112970e+00	-3.287824e+00	-1.689368e+00	-5.733352e-01	-9.730182e-01	-2.886528e-01	-1.284176e+00	-2.411167e+00	-2.144172e+00	-7.951632e-01	-1.202417e-01	-2.249120e+00	-8.199644e-01	-2.410610e-01	-2.841822e+00	-7.616207e-01	-3.514952e+00	-4.751486e+00	-2.780469e+00	-9.512265e-01	-4.119894e+00	-2.365440e+00	-2.212963e+00	-7.521758e-01	-7.044833e-01	-3.593249e-01	-1.163393e-01	-2.702084e-01	-6.869175e-02	-8.768781e-02	-1.969111e+00	-1.367655e+00
25%	-4.564744e-01	-2.969908e-01	-7.951515e-01	-5.171998e-01	-5.719226e-01	-8.656586e-01	-5.733352e-01	-9.730182e-01	-2.886528e-01	-7.793259e-01	-5.966855e-01	-7.261556e-01	-7.951632e-01	-1.202417e-01	-7.347485e-01	-8.199644e-01	-2.410610e-01	-1.026041e+00	-7.616207e-01	-1.062465e+00	-2.114536e-01	-9.341298e-01	-9.512265e-01	1.951272e-01	-1.026858e+00	-6.479160e-01	-7.521758e-01	-7.044833e-01	-3.593249e-01	-1.163393e-01	-2.702084e-01	-6.869175e-02	-8.768781e-02	-4.891101e-01	-6.144386e-01
50%	6.454645e-16	-1.040633e-01	-7.183611e-02	-5.171998e-01	5.737148e-02	4.425864e-01	-5.733352e-01	-1.319022e-01	-2.886528e-01	-2.031633e-01	-1.503334e-01	-1.956933e-01	-7.951632e-01	-1.202417e-01	-9.797004e-02	-8.199644e-01	-2.410610e-01	7.897405e-01	-7.616207e-01	1.637791e-01	-2.114536e-01	-3.186833e-01	6.004949e-01	2.385528e-01	3.117246e-01	3.284429e-02	-7.521758e-01	-3.270298e-01	-3.593249e-01	-1.163393e-01	-2.702084e-01	-6.869175e-02	-8.768781e-02	-1.191097e-01	1.387775e-01
75%	4.065156e-01	1.087080e-01	6.514792e-01	3.817427e-01	9.516316e-01	9.271216e-01	3.365144e-01	5.891327e-01	-2.886528e-01	5.450557e-01	5.491227e-01	5.915905e-01	8.731117e-01	-1.202417e-01	4.974036e-01	1.107810e+00	-2.410610e-01	7.897405e-01	1.227585e+00	1.637791e-01	-2.114536e-01	2.967633e-01	6.004949e-01	2.934062e-01	3.117246e-01	4.820057e-01	5.886506e-01	3.221901e-01	-3.593249e-01	-1.163393e-01	-2.702084e-01	-6.869175e-02	-8.768781e-02	6.208910e-01	8.919936e-01
max	1.103492e+01	2.051827e+01	2.821425e+00	3.078570e+00	1.282839e+00	1.217843e+00	8.289736e+00	1.140575e+01	8.851638e+00	4.004295e+00	1.152095e+01	9.132681e+00	3.936963e+00	1.164775e+01	7.855574e+00	4.963359e+00	8.138680e+00	2.605522e+00	3.216791e+00	6.294997e+00	8.868612e+00	4.604889e+00	3.703938e+00	3.139762e-01	2.988889e+00	4.421526e+00	6.087635e+00	7.554198e+00	8.675309e+00	1.721723e+01	8.341462e+00	1.830618e+01	3.116527e+01	2.100892e+00	1.645210e+00

After standardization the numeric features are on a common scale: mean 0, variance 1.
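
A quick numerical sanity check confirms this (note that pandas' std() uses ddof=1 while StandardScaler uses ddof=0, hence the loose tolerance):

# each column should now have (near-)zero mean and (near-)unit standard deviation
print(np.allclose(df_train_num_scaled.mean(), 0))
print(np.allclose(df_train_num_scaled.std(), 1, atol=1e-2))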

14. Assemble the training set

df_train_y = df_train_num["SalePrice"].apply(lambda x: np.log1p(x))
df_train_y_matrix = df_train_y.values

df_train_x_matrix = pd.concat([df_train_num_scaled, df_train_cate_dummies], axis=1).values

train_X = df_train_x_matrix.copy()
train_y = df_train_y_matrix.copy()
print(train_X.shape, train_y.shape)

(1460, 301) (1460,)

With the training set ready, is it finally time for the main event? Not so fast! Including SalePrice, the original dataset had only 81 columns, but after all of the processing above there are now 301 features. What are they all? What are they good for? Will they help or hurt the final model? And doesn't training get much more expensive with this many features?

15. Feature selection, using a Lasso model (linear regression with L1 regularization). For feature-selection utilities in sklearn, see the article "An introduction to common feature-selection methods in scikit-learn".

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

lasso = LassoCV()
lasso.fit(train_X, train_y)

model = SelectFromModel(lasso, prefit=True)
train_X_new = model.transform(train_X)
print(train_X_new.shape)
print("Selection mask:", model.get_support())

(1460, 108)
Selection mask: [False  True  True  True  True  True  True  True  True False ...  True False False False False False]

The features whose mask entries are True are the ones that were selected; 108 features survive the selection.
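
To see which features survived, the mask can be mapped back to column names (a small sketch; it assumes the same concatenation order used to build train_X above):

# column order must match pd.concat([df_train_num_scaled, df_train_cate_dummies], axis=1)
all_columns = list(df_train_num_scaled.columns) + list(df_train_cate_dummies.columns)
selected_features = [col for col, keep in zip(all_columns, model.get_support()) if keep]
print(len(selected_features), selected_features[:10])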

16. Now for the main event: model training

1) First, try a RandomForest model:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rfr_params = {
    "n_estimators": [100, 200, 300],
#     "criterion": ["gini"],
    "max_depth": [2, 3, 4],
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [2, 3, 5],
    "max_features": ["sqrt", "log2", "auto"]
}
rfr = RandomForestRegressor(n_jobs=-1)
rfr_grid_search = GridSearchCV(rfr, rfr_params, cv=5, scoring="neg_mean_squared_error")
rfr_grid_search.fit(train_X_new, train_y)

best_params = rfr_grid_search.best_params_
best_scores = rfr_grid_search.best_score_

Check the cross-validated performance on the training set with the best RandomForest hyperparameters found by the grid search:

from sklearn.model_selection import cross_val_score

best_rfr = RandomForestRegressor(**best_params)
scores = cross_val_score(best_rfr, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error")
scores_rmse = np.sqrt(-scores)
print("score_rmse: {}, {} +/- {}".format(scores_rmse, np.mean(scores_rmse), np.std(scores_rmse)))
score_rmse: [0.16991974 0.18253832 0.17872277 0.16521411 0.17719227], 0.17471744250579152 +/- 0.006271601381559145
  • Plot the RandomForest learning curves

from sklearn.model_selection import learning_curve

# hand-tuned values (min_samples_leaf/min_samples_split of 10 fall outside the grid above)
best_params = {
    'max_depth': 4,
    'max_features': 'auto',
    'min_samples_leaf': 10,
    'min_samples_split': 10,
    'n_estimators': 200
}
best_rfr = RandomForestRegressor(**best_params)

train_sizes, train_neg_mse, test_neg_mse = learning_curve(best_rfr, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error",
                                                         train_sizes=np.linspace(0.1, 1.0, 20))
train_rmse = np.mean(np.sqrt(-train_neg_mse), axis=1)
test_rmse = np.mean(np.sqrt(-test_neg_mse), axis=1)

plt.plot(train_sizes, train_rmse, color="r", marker="o", label="train error")
plt.plot(train_sizes, test_rmse, color="b", marker="^", label="test error")
plt.xlabel("Train Size")
plt.ylabel("rmse")
plt.legend(loc="upper right")

Judging from the learning curves, the model is essentially not overfitting, and the result is acceptable.

2) Train a GBDT model

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

gbdt_params = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, 7],
    "min_samples_split": [3, 5, 7],
    "min_samples_leaf": [3, 5, 7],
    "max_features": ["sqrt", "log2"],
#     "loss": ["deviance", "exponential"],
    "learning_rate": [0.1, 0.5, 0.05, 0.01, 0.005],
    "subsample": [0.5, 0.7]
}
gbdt = GradientBoostingRegressor()
gbdt_grid_search = GridSearchCV(gbdt, gbdt_params, cv=5, scoring="neg_mean_squared_error")
gbdt_grid_search.fit(train_X_new, train_y)

gbdt_best_params = gbdt_grid_search.best_params_
gbdt_best_scores = gbdt_grid_search.best_score_
gbdt_best_params, gbdt_best_scores

Check GBDT's cross-validated performance on the training set:

from sklearn.model_selection import cross_val_score

best_gbdt = GradientBoostingRegressor(**gbdt_best_params)
scores = cross_val_score(best_gbdt, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error")
scores_rmse = np.sqrt(-scores)
print("score_rmse: {}, {} +/- {}".format(scores_rmse, np.mean(scores_rmse), np.std(scores_rmse)))
score_rmse: [0.11242787 0.13998242 0.13887749 0.11183915 0.13257266], 0.1271399183125363 +/- 0.012512177359738105

An RMSE of 0.127 is a solid improvement over the RandomForest.

  • Plot the learning curves

Because the learning curves for the GBDT hyperparameters found by grid search showed overfitting, some of the parameters were adjusted by hand:

from sklearn.model_selection import learning_curve

best_params = {
    'learning_rate': 0.03,
    'max_depth': 5,
    'max_features': 'sqrt',
    'min_samples_leaf': 7,
    'min_samples_split': 7,
    'n_estimators': 100,
    'subsample': 0.7
}
best_gbdt = GradientBoostingRegressor(**best_params)

train_sizes, train_neg_mse, test_neg_mse = learning_curve(best_gbdt, train_X_new, train_y, cv=5, scoring="neg_mean_squared_error",
                                                         train_sizes=np.linspace(0.1, 1.0, 20))
train_rmse = np.mean(np.sqrt(-train_neg_mse), axis=1)
test_rmse = np.mean(np.sqrt(-test_neg_mse), axis=1)

plt.plot(train_sizes, train_rmse, color="r", marker="o", label="train error")
plt.plot(train_sizes, test_rmse, color="b", marker="^", label="test error")
plt.xlabel("Train Size")
plt.ylabel("rmse")
plt.legend(loc="upper right")

The curves still look slightly overfit, but the result is acceptable.

17. With the models trained, it's time to predict on the test set. Let's go!

1) First, fill the missing values in the test data in the same way as the training data.

# Drop the features with too many missing values
df_test_drop_missing = df_test.drop(missing_percent_gt_50, axis=1)
df_test_drop_missing.drop(["Id"], axis=1, inplace=True)
# df_test_drop_missing.info()

df_test_num = df_test_drop_missing[num_features[:-1]].copy()
df_test_cate = df_test_drop_missing[cate_features].copy()
df_test_cate["MSSubClass"] = df_test_cate["MSSubClass"].apply(lambda x: str(x))

# Fill categorical missing values with the training-set modes
for feature in df_test_cate.columns.values:
    df_test_cate[feature].fillna(fill_missing_value[feature], inplace=True)

# Fill numeric missing values with the training-set means
df_test_num = df_test_num.fillna(num_feature_mean)

2) Then one-hot encode the categorical features of the test data

df_test_cate_dummies = pd.get_dummies(df_test_cate)
df_test_cate_dummies.info()

3) Standardize the test data's numeric features with the same scaler that was fitted on the training data

df_test_num_scaled = scaler.transform(df_test_num.values)
df_test_num_scaled = pd.DataFrame(df_test_num_scaled, columns=df_test_num.columns)
df_test_num_scaled.info()

4) Assemble the numeric and categorical features into the test matrix

df_test_x_matrix = pd.concat([df_test_num_scaled, df_test_cate_dummies], axis=1).values
df_test_x_matrix.shape
(1459, 274)

What??? You thought that was it, and the data could go straight into the model? Remember that the processed training data has 301 features, yet here there are only 274. Where did the missing 27 features go? Which ones are they, and why are they missing?

The missing columns arise because a categorical feature may have fewer distinct categories in the test set than in the training set. Suppose both sets contain a feature called color: the training set has red, yellow, and blue, but the test set only has red and yellow. get_dummies then produces three columns for the training set (color_red, color_yellow, color_blue) but only two for the test set (color_red, color_yellow), leaving the test set one column short. That is where the missing features went; knowing the cause, we just add them back. Onward.

index = 0
for feature in df_train_cate_dummies.columns:
    if feature not in df_test_cate_dummies:
        index += 1
        print(index, feature)
        df_test_cate_dummies[feature] = 0
df_test_cate_dummies.info()

The missing columns are added and filled with zeros.

You thought that was enough? Think about it: if the training set can end up with more dummy columns than the test set, can the test set end up with more than the training set? Of course it can, and in that case the extra test-only columns simply get dropped.
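
Both directions of the mismatch, plus the column-ordering problem discussed next, can also be handled in a single step (a sketch equivalent to the loop above followed by the reindex below):

# add train-only dummy columns filled with 0, drop test-only ones, and fix the order
df_test_cate_dummies = df_test_cate_dummies.reindex(
    columns=df_train_cate_dummies.columns, fill_value=0
)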

Ready to predict now? The results would still be dreadful. The test set now has the same number and kinds of features as the training set, but the column order is almost certainly different, so the test-set columns must be aligned to the training-set order.

df_test_cate_dummies = df_test_cate_dummies.reindex(columns=df_train_cate_dummies.columns)

df_test_x_matrix = pd.concat([df_test_num_scaled, df_test_cate_dummies], axis=1).values
df_test_x_matrix.shape

(1459, 301)

Then apply the same feature selection to the test set:

test_X_new = model.transform(df_test_x_matrix)
test_X_new.shape

(1459, 108)

18. Predict. We've finally reached the last step!

# cross_val_score and learning_curve fit clones internally, so fit the model itself first
best_rfr.fit(train_X_new, train_y)
y_pred = best_rfr.predict(test_X_new)
y_pred.shape
y_pred[:10]

array([11.74767867, 11.87027992, 12.03688675, 12.09013487, 12.26120576, 12.07010157, 11.89011736, 12.05931896, 12.14915021, 11.73820217])

Remember that the SalePrice target was log-transformed earlier; the predictions now have to be mapped back to the original scale.

y_pred = np.expm1(y_pred)
y_pred[:10]

array([126458.66356193, 142953.24835482, 168869.38535629, 178105.12766462,  211335.25272186, 174572.57668388, 145817.40958437, 172700.3291329 , 188932.46869204, 125265.92920833])

Save the predictions to a CSV file:

output = pd.DataFrame(
    {
        "Id": df_test["Id"],
        "SalePrice": y_pred
    }
)

from datetime import datetime

now_time = datetime.now().strftime("%Y%m%d%H%M%S%f")
output.to_csv("./outputs/random_forest_%s.csv" % now_time, index=False)
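
Since the GBDT model had the noticeably better cross-validation RMSE, it is worth producing a second submission with it (a sketch reusing the objects defined above; like the random forest, the model must first be fitted on the full training set):

# fit the hand-tuned GBDT on all training data and predict on the selected test features
best_gbdt.fit(train_X_new, train_y)
y_pred_gbdt = np.expm1(best_gbdt.predict(test_X_new))
pd.DataFrame({"Id": df_test["Id"], "SalePrice": y_pred_gbdt}).to_csv(
    "./outputs/gbdt_%s.csv" % now_time, index=False
)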

Finally, submit the prediction file to Kaggle and see how it scores!
