Kaggle House Prices
Predict the sale price of each house from residential data for Ames, Iowa.
This is a regression problem; submissions are evaluated with root mean squared error (computed on the logarithm of the sale price).
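For reference, the metric can be sketched as a small helper. This is a hypothetical illustration with made-up prices, not competition code; the name `rmse_log` is our own:

```python
# Minimal sketch of the evaluation metric: RMSE on log-prices.
# rmse_log and the example prices are illustrative, not part of the competition kit.
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(1 + price) values; log1p keeps a zero price well-defined."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

perfect = rmse_log(np.array([200000.0, 150000.0]), np.array([200000.0, 150000.0]))
off_by_half = rmse_log(np.array([200000.0]), np.array([100000.0]))
print(perfect)      # 0.0 for a perfect prediction
print(off_by_half)  # roughly log(2) ≈ 0.69 when the prediction is half the true price
```

Because the error is measured on log-prices, over- and under-predictions of the same ratio are penalized equally, which is why the target is log-transformed later in this notebook.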
Data Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.special import boxcox1p
import missingno as msno
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
# Load the data
house_train = pd.read_csv('/home/aistudio/data/data32288/train.csv')
house_test = pd.read_csv('/home/aistudio/data/data32288/test.csv')
house_train.shape,house_test.shape
((1460, 81), (1459, 80))
house_train.info()
print('-'*40)
house_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1459 non-null int64
1 MSSubClass 1459 non-null int64
2 MSZoning 1455 non-null object
3 LotFrontage 1232 non-null float64
4 LotArea 1459 non-null int64
5 Street 1459 non-null object
6 Alley 107 non-null object
7 LotShape 1459 non-null object
8 LandContour 1459 non-null object
9 Utilities 1457 non-null object
10 LotConfig 1459 non-null object
11 LandSlope 1459 non-null object
12 Neighborhood 1459 non-null object
13 Condition1 1459 non-null object
14 Condition2 1459 non-null object
15 BldgType 1459 non-null object
16 HouseStyle 1459 non-null object
17 OverallQual 1459 non-null int64
18 OverallCond 1459 non-null int64
19 YearBuilt 1459 non-null int64
20 YearRemodAdd 1459 non-null int64
21 RoofStyle 1459 non-null object
22 RoofMatl 1459 non-null object
23 Exterior1st 1458 non-null object
24 Exterior2nd 1458 non-null object
25 MasVnrType 1443 non-null object
26 MasVnrArea 1444 non-null float64
27 ExterQual 1459 non-null object
28 ExterCond 1459 non-null object
29 Foundation 1459 non-null object
30 BsmtQual 1415 non-null object
31 BsmtCond 1414 non-null object
32 BsmtExposure 1415 non-null object
33 BsmtFinType1 1417 non-null object
34 BsmtFinSF1 1458 non-null float64
35 BsmtFinType2 1417 non-null object
36 BsmtFinSF2 1458 non-null float64
37 BsmtUnfSF 1458 non-null float64
38 TotalBsmtSF 1458 non-null float64
39 Heating 1459 non-null object
40 HeatingQC 1459 non-null object
41 CentralAir 1459 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1459 non-null int64
44 2ndFlrSF 1459 non-null int64
45 LowQualFinSF 1459 non-null int64
46 GrLivArea 1459 non-null int64
47 BsmtFullBath 1457 non-null float64
48 BsmtHalfBath 1457 non-null float64
49 FullBath 1459 non-null int64
50 HalfBath 1459 non-null int64
51 BedroomAbvGr 1459 non-null int64
52 KitchenAbvGr 1459 non-null int64
53 KitchenQual 1458 non-null object
54 TotRmsAbvGrd 1459 non-null int64
55 Functional 1457 non-null object
56 Fireplaces 1459 non-null int64
57 FireplaceQu 729 non-null object
58 GarageType 1383 non-null object
59 GarageYrBlt 1381 non-null float64
60 GarageFinish 1381 non-null object
61 GarageCars 1458 non-null float64
62 GarageArea 1458 non-null float64
63 GarageQual 1381 non-null object
64 GarageCond 1381 non-null object
65 PavedDrive 1459 non-null object
66 WoodDeckSF 1459 non-null int64
67 OpenPorchSF 1459 non-null int64
68 EnclosedPorch 1459 non-null int64
69 3SsnPorch 1459 non-null int64
70 ScreenPorch 1459 non-null int64
71 PoolArea 1459 non-null int64
72 PoolQC 3 non-null object
73 Fence 290 non-null object
74 MiscFeature 51 non-null object
75 MiscVal 1459 non-null int64
76 MoSold 1459 non-null int64
77 YrSold 1459 non-null int64
78 SaleType 1458 non-null object
79 SaleCondition 1459 non-null object
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB
# Summary statistics
house_train.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Id | 1460.0 | 730.500000 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
MSSubClass | 1460.0 | 56.897260 | 42.300571 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.00 | 69.0 | 80.00 | 313.0 |
LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.50 | 9478.5 | 11601.50 | 215245.0 |
OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.00 | 1973.0 | 2000.00 | 2010.0 |
YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.00 | 1994.0 | 2004.00 | 2010.0 |
MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.00 | 0.0 | 166.00 | 1600.0 |
BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.00 | 383.5 | 712.25 | 5644.0 |
BsmtFinSF2 | 1460.0 | 46.549315 | 161.319273 | 0.0 | 0.00 | 0.0 | 0.00 | 1474.0 |
BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.00 | 477.5 | 808.00 | 2336.0 |
TotalBsmtSF | 1460.0 | 1057.429452 | 438.705324 | 0.0 | 795.75 | 991.5 | 1298.25 | 6110.0 |
1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.00 | 1087.0 | 1391.25 | 4692.0 |
2ndFlrSF | 1460.0 | 346.992466 | 436.528436 | 0.0 | 0.00 | 0.0 | 728.00 | 2065.0 |
LowQualFinSF | 1460.0 | 5.844521 | 48.623081 | 0.0 | 0.00 | 0.0 | 0.00 | 572.0 |
GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.50 | 1464.0 | 1776.75 | 5642.0 |
BsmtFullBath | 1460.0 | 0.425342 | 0.518911 | 0.0 | 0.00 | 0.0 | 1.00 | 3.0 |
BsmtHalfBath | 1460.0 | 0.057534 | 0.238753 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
FullBath | 1460.0 | 1.565068 | 0.550916 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
HalfBath | 1460.0 | 0.382877 | 0.502885 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
BedroomAbvGr | 1460.0 | 2.866438 | 0.815778 | 0.0 | 2.00 | 3.0 | 3.00 | 8.0 |
KitchenAbvGr | 1460.0 | 1.046575 | 0.220338 | 0.0 | 1.00 | 1.0 | 1.00 | 3.0 |
TotRmsAbvGrd | 1460.0 | 6.517808 | 1.625393 | 2.0 | 5.00 | 6.0 | 7.00 | 14.0 |
Fireplaces | 1460.0 | 0.613014 | 0.644666 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
GarageYrBlt | 1379.0 | 1978.506164 | 24.689725 | 1900.0 | 1961.00 | 1980.0 | 2002.00 | 2010.0 |
GarageCars | 1460.0 | 1.767123 | 0.747315 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.50 | 480.0 | 576.00 | 1418.0 |
WoodDeckSF | 1460.0 | 94.244521 | 125.338794 | 0.0 | 0.00 | 0.0 | 168.00 | 857.0 |
OpenPorchSF | 1460.0 | 46.660274 | 66.256028 | 0.0 | 0.00 | 25.0 | 68.00 | 547.0 |
EnclosedPorch | 1460.0 | 21.954110 | 61.119149 | 0.0 | 0.00 | 0.0 | 0.00 | 552.0 |
3SsnPorch | 1460.0 | 3.409589 | 29.317331 | 0.0 | 0.00 | 0.0 | 0.00 | 508.0 |
ScreenPorch | 1460.0 | 15.060959 | 55.757415 | 0.0 | 0.00 | 0.0 | 0.00 | 480.0 |
PoolArea | 1460.0 | 2.758904 | 40.177307 | 0.0 | 0.00 | 0.0 | 0.00 | 738.0 |
MiscVal | 1460.0 | 43.489041 | 496.123024 | 0.0 | 0.00 | 0.0 | 0.00 | 15500.0 |
MoSold | 1460.0 | 6.321918 | 2.703626 | 1.0 | 5.00 | 6.0 | 8.00 | 12.0 |
YrSold | 1460.0 | 2007.815753 | 1.328095 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
SalePrice | 1460.0 | 180921.195890 | 79442.502883 | 34900.0 | 129975.00 | 163000.0 | 214000.00 | 755000.0 |
Missing Values
msno.matrix(house_train, labels=True)
msno.bar(house_train)
msno.heatmap(house_train)
data_null = house_train.isnull().sum()
data_null[data_null>0].sort_values(ascending=False)
PoolQC 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
LotFrontage 259
GarageYrBlt 81
GarageType 81
GarageFinish 81
GarageQual 81
GarageCond 81
BsmtFinType2 38
BsmtExposure 38
BsmtFinType1 37
BsmtCond 37
BsmtQual 37
MasVnrArea 8
MasVnrType 8
Electrical 1
dtype: int64
Visualization
# Scatter plot of every numeric feature against SalePrice
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for col in house_train.columns:
    if house_train[col].dtype in numeric_dtypes:
        # Skip engineered columns that are only added later in the notebook
        if col in ['TotalSF', 'Total_Bathrooms', 'Total_porch_sf', 'haspool', 'hasgarage', 'hasbsmt', 'hasfireplace']:
            continue
        numeric.append(col)
fig = plt.figure(figsize=(12, 120))
# Adjust subplot spacing
plt.subplots_adjust(right=2, top=2)
# Use 8 evenly spaced colors from the husl color space
sns.color_palette("husl", 8)
# Enumerate from 1 so the subplot index starts at 1
for i, feature in enumerate(numeric, 1):
    if feature == 'MiscVal':
        break
    plt.subplot(len(numeric), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice', palette='Blues', data=house_train)
    # labelpad: spacing between the axis label and the tick labels
    plt.xlabel('{}'.format(feature), size=15, labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.legend(loc='best', prop={'size': 10})
plt.show()
Examining the target variable
- SalePrice is not normally distributed; it is highly skewed.
- The mean sale price, $180,921, is pulled upward toward the high-end outliers.
- The median, $163,000, is below the mean.
- There are outliers at the upper end.
house_train['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.distplot(house_train['SalePrice'],fit=stats.norm,ax=ax[0])
sns.boxplot(house_train['SalePrice'])
#skewness and kurtosis
print("Skewness: {}".format(house_train['SalePrice'].skew()))
print("Kurtosis: {}".format(house_train['SalePrice'].kurt()))
Skewness: 1.8828757597682129
Kurtosis: 6.536281860064529
fig = plt.figure()
stats.probplot(house_train['SalePrice'],plot=plt)
((array([-3.30513952, -3.04793228, -2.90489705, ..., 2.90489705,
3.04793228, 3.30513952]),
array([ 34900, 35311, 37900, ..., 625000, 745000, 755000])),
(74160.16474519414, 180921.19589041095, 0.9319665641512983))
# Transform the target with log(1 + x)
house_train['SalePrice'] = np.log1p(house_train['SalePrice'])
fig = plt.figure()
stats.probplot(house_train['SalePrice'],plot=plt)
((array([-3.30513952, -3.04793228, -2.90489705, ..., 2.90489705,
3.04793228, 3.30513952]),
array([10.46027076, 10.47197813, 10.54273278, ..., 13.34550853,
13.52114084, 13.53447435])),
(0.398259646654151, 12.024057394918403, 0.9953761551826702))
Feature Correlations
def draw_corr(data):
    corr = data.corr()
    plt.subplots(figsize=(12, 12))
    sns.heatmap(corr, vmax=1, square=True, cmap='Blues')
    plt.show()
draw_corr(house_train)
# The 10 features most correlated with SalePrice
corrmat = house_train.corr()
plt.subplots(figsize=(10,8))
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(house_train[cols].values.T)
# annot_kws: styling for the cell annotations when annot=True (size, color, weight, italics, etc.)
# fmt: number format; keep 2 decimal places here
sns.heatmap(cm,cbar=True, annot=True, square=True,
fmt='.2f', annot_kws={'size': 10},
yticklabels=cols.values, xticklabels=cols.values)
1.OverallQual
f,ax = plt.subplots(figsize=(8,6))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=house_train)
# Remove outliers (note that SalePrice is on the log scale here)
mask = (house_train['OverallQual']<5)&(house_train['SalePrice']>12)
house_train.drop(house_train[mask].index, inplace=True)
house_train.plot.scatter(x='OverallQual', y='SalePrice')
2.GrLivArea
house_train.plot.scatter(x='GrLivArea', y='SalePrice',alpha=0.3)
# Remove the two outliers in the lower-right corner
mask = (house_train['GrLivArea']>4000)&(house_train['SalePrice']<12.5)
house_train= house_train.drop(house_train[mask].index)
# After removing the outliers
fig,ax = plt.subplots()
ax.scatter(x=house_train['GrLivArea'],y=house_train['SalePrice'])
plt.xlabel('GrLivArea',fontsize=13)
plt.ylabel('SalePrice',fontsize=13)
3.GarageCars
house_train.plot.scatter(x='GarageCars', y='SalePrice', alpha=0.3)
4.GarageArea
house_train.plot.scatter(x='GarageArea', y='SalePrice')
# Remove outliers
mask = (house_train['GarageArea']>1100)&(house_train['SalePrice']<12.5)
house_train.drop(house_train[mask].index, inplace=True)
house_train.plot.scatter(x='GarageArea', y='SalePrice')
5.TotalBsmtSF
house_train.plot.scatter(x='TotalBsmtSF', y='SalePrice')
6.1stFlrSF
house_train.plot.scatter(x='1stFlrSF', y='SalePrice')
7.FullBath
house_train.plot.scatter(x='FullBath', y='SalePrice')
8.YearBuilt
house_train.plot.scatter(x='YearBuilt', y='SalePrice')
# Remove outliers
mask = (house_train['YearBuilt']<1900)&(house_train['SalePrice']>12.3)
house_train= house_train.drop(house_train[mask].index)
# After removing the outliers
house_train.plot.scatter(x='YearBuilt', y='SalePrice')
9.YearRemodAdd
house_train.plot.scatter(x='YearRemodAdd', y='SalePrice')
# Reset the index
house_train.reset_index(drop=True,inplace=True)
Feature Engineering
Concatenate the training and test sets so feature engineering is applied to all the data at once.
train_num = house_train.shape[0]
test_num = house_test.shape[0]
train_y = house_train.SalePrice.values
all_data = pd.concat((house_train,house_test)).reset_index(drop=True)
all_data.drop(['SalePrice','Id'],axis=1,inplace=True)
all_data.shape,train_num,test_num
((2908, 79), 1449, 1459)
Missing Value Handling
- Fill missing values in numeric features with 0 or their mode, as appropriate
- Fill missing values in categorical features with "None"
- Drop useless features
- Encode categorical features (get_dummies)
count = all_data.isnull().sum().sort_values(ascending=False)
ratio = count/len(all_data)*100
cols_type = all_data[count.index].dtypes
missing_data = pd.concat([count,ratio,cols_type],axis=1,keys=['count','ratio','cols_type'])
missing_data=missing_data[missing_data.ratio>0]
missing_data
 | count | ratio | cols_type |
---|---|---|---|
PoolQC | 2899 | 99.690509 | object |
MiscFeature | 2804 | 96.423659 | object |
Alley | 2711 | 93.225585 | object |
Fence | 2338 | 80.398900 | object |
FireplaceQu | 1418 | 48.762036 | object |
LotFrontage | 484 | 16.643741 | float64 |
GarageCond | 159 | 5.467675 | object |
GarageQual | 159 | 5.467675 | object |
GarageYrBlt | 159 | 5.467675 | float64 |
GarageFinish | 159 | 5.467675 | object |
GarageType | 157 | 5.398900 | object |
BsmtCond | 82 | 2.819807 | object |
BsmtExposure | 82 | 2.819807 | object |
BsmtQual | 81 | 2.785420 | object |
BsmtFinType2 | 80 | 2.751032 | object |
BsmtFinType1 | 79 | 2.716644 | object |
MasVnrType | 24 | 0.825309 | object |
MasVnrArea | 23 | 0.790922 | float64 |
MSZoning | 4 | 0.137552 | object |
BsmtHalfBath | 2 | 0.068776 | float64 |
Utilities | 2 | 0.068776 | object |
Functional | 2 | 0.068776 | object |
BsmtFullBath | 2 | 0.068776 | float64 |
BsmtFinSF2 | 1 | 0.034388 | float64 |
BsmtFinSF1 | 1 | 0.034388 | float64 |
Exterior2nd | 1 | 0.034388 | object |
BsmtUnfSF | 1 | 0.034388 | float64 |
TotalBsmtSF | 1 | 0.034388 | float64 |
Exterior1st | 1 | 0.034388 | object |
SaleType | 1 | 0.034388 | object |
Electrical | 1 | 0.034388 | object |
KitchenQual | 1 | 0.034388 | object |
GarageArea | 1 | 0.034388 | float64 |
GarageCars | 1 | 0.034388 | float64 |
# Visualize the missing-value ratios
f,axis = plt.subplots(figsize=(15,12))
plt.xticks(rotation='90')
sns.barplot(x=missing_data.index,y=missing_data.ratio)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
The competition's feature documentation explains that some values are missing simply because the house does not have that feature at all. For those features, impute according to data type: categorical features get a new "None" category and numeric features get 0. The remaining missing values are imputed with the mode.
Fill with None
str_cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageType", "GarageFinish", "GarageQual", "GarageCond",
            "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "MasVnrType", "MSSubClass"]
for col in str_cols:
    all_data[col].fillna("None", inplace=True)
del str_cols, col
Fill with 0
num_cols = ["BsmtUnfSF", "TotalBsmtSF", "BsmtFinSF2", "BsmtFinSF1", "BsmtFullBath", "BsmtHalfBath",
            "MasVnrArea", "GarageCars", "GarageArea", "GarageYrBlt"]
for col in num_cols:
    all_data[col].fillna(0, inplace=True)
del num_cols, col
Fill with the mode
other_cols = ["MSZoning", "Electrical", "KitchenQual", "Exterior1st", "Exterior2nd", "SaleType"]
for col in other_cols:
    all_data[col].fillna(all_data[col].mode()[0], inplace=True)
del other_cols, col
LotFrontage
Neighboring houses on the same street tend to have similar lot frontage, so fill missing values with the median LotFrontage of each Neighborhood.
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
Utilities
Only 2 values are missing, and almost all the rest are AllPub, so the feature is nearly constant; consider dropping it.
all_data["Utilities"].isnull().sum()
2
all_data["Utilities"].value_counts()
AllPub 2905
NoSeWa 1
Name: Utilities, dtype: int64
# Drop the feature
all_data.drop(['Utilities'],axis=1,inplace=True)
Functional
all_data["Functional"] = all_data["Functional"].fillna("Typ")
# Check that no missing values remain
mask = all_data.isnull().sum() > 0
all_data.columns[mask]
Index([], dtype='object')
Encoding
- Ordinal-variable encoding
- LabelEncoder encoding
- One-hot encoding (get_dummies)
# Encode the ordinal quality variables (note: a lower code means better quality here)
def custom_coding(x):
    if x == 'Ex':
        r = 0
    elif x == 'Gd':
        r = 1
    elif x == 'TA':
        r = 2
    elif x == 'Fa':
        r = 3
    elif x == 'None':
        r = 4
    else:
        r = 5
    return r
## Apply the ordinal encoding to the quality features
cols = ['BsmtCond','BsmtQual','ExterCond','ExterQual','FireplaceQu','GarageCond','GarageQual','HeatingQC','KitchenQual','PoolQC']
for col in cols:
    all_data[col] = all_data[col].apply(custom_coding)
del cols, col
Some features are numeric in form but the numbers carry no quantitative meaning, such as years and category codes. Convert them to strings, i.e. categorical variables.
cols = ['MSSubClass', 'YrSold', 'MoSold', 'OverallCond', "MSZoning", "BsmtFullBath", "BsmtHalfBath", "HalfBath",
        "Functional", "Electrical", "KitchenQual", "KitchenAbvGr", "SaleType", "Exterior1st", "Exterior2nd", "YearBuilt",
        "YearRemodAdd", "GarageYrBlt", "BedroomAbvGr", "LowQualFinSF"]
for col in cols:
    all_data[col] = all_data[col].astype(str)
del cols, col
# Label-encode the year-like features
from sklearn.preprocessing import LabelEncoder
str_cols = ["YearBuilt", "YearRemodAdd", 'GarageYrBlt', "YrSold", 'MoSold']
for col in str_cols:
    all_data[col] = LabelEncoder().fit_transform(all_data[col])
# Label-encode the features that will be used to build meaningful new features later
lab_cols = ['Heating', 'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
            'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 'YrSold', 'MoSold',
            'MSZoning', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'Exterior1st', 'MasVnrType',
            'Foundation', 'GarageType', 'SaleType', 'SaleCondition']
for col in lab_cols:
    new_col = "labfit_" + col
    all_data[new_col] = LabelEncoder().fit_transform(all_data[col])
del col, str_cols, lab_cols, new_col
Building New Features
# Area-related features are very important for price; add a total house area feature
all_data['TotalHouseArea'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
# The time between remodeling (YearRemodAdd) and sale (YrSold) also tends to affect price
all_data['YearsSinceRemodel'] = all_data['YrSold'].astype(int) - all_data['YearRemodAdd'].astype(int)
# Overall quality of the house is another important price factor
all_data['Total_Home_Quality'] = all_data['OverallQual'].astype(int) + all_data['OverallCond'].astype(int)
The presence or absence of certain spaces also influences price; for example a house with a three-season porch (3SsnPorch) or an open porch (OpenPorchSF) may sell for more than one without. Add indicator features for these areas.
# NOTE: as written, each flag equals 1 when the area is *absent* (area == 0);
# the encoding is consistent as long as it is applied uniformly
all_data['HasWoodDeck'] = (all_data['WoodDeckSF'] == 0) * 1
all_data['HasOpenPorch'] = (all_data['OpenPorchSF'] == 0) * 1
all_data['HasEnclosedPorch'] = (all_data['EnclosedPorch'] == 0) * 1
all_data['Has3SsnPorch'] = (all_data['3SsnPorch'] == 0) * 1
all_data['HasScreenPorch'] = (all_data['ScreenPorch'] == 0) * 1
# Total house area plus garage area
all_data["TotalAllArea"] = all_data["TotalHouseArea"] + all_data["GarageArea"]
# Total house area times overall material quality
all_data["TotalHouse_and_OverallQual"] = all_data["TotalHouseArea"] * all_data["OverallQual"]
# Above-ground living area times overall quality
all_data["GrLivArea_and_OverallQual"] = all_data["GrLivArea"] * all_data["OverallQual"]
# Lot area times overall quality
all_data["LotArea_and_OverallQual"] = all_data["LotArea"] * all_data["OverallQual"]
# General zoning classification times total house area
all_data["MSZoning_and_TotalHouse"] = all_data["labfit_MSZoning"] * all_data["TotalHouseArea"]
# General zoning classification plus overall quality
all_data["MSZoning_and_OverallQual"] = all_data["labfit_MSZoning"] + all_data["OverallQual"]
# General zoning classification plus year built
all_data["MSZoning_and_YearBuilt"] = all_data["labfit_MSZoning"] + all_data["YearBuilt"]
# Neighborhood times total house area
all_data["Neighborhood_and_TotalHouse"] = all_data["labfit_Neighborhood"] * all_data["TotalHouseArea"]
all_data["Neighborhood_and_OverallQual"] = all_data["labfit_Neighborhood"] + all_data["OverallQual"]
all_data["Neighborhood_and_YearBuilt"] = all_data["labfit_Neighborhood"] + all_data["YearBuilt"]
# Type 1 finished basement area times overall quality
all_data["BsmtFinSF1_and_OverallQual"] = all_data["BsmtFinSF1"] * all_data["OverallQual"]
## Home functionality rating times total house area
all_data["Functional_and_TotalHouse"] = all_data["labfit_Functional"] * all_data["TotalHouseArea"]
all_data["Functional_and_OverallQual"] = all_data["labfit_Functional"] + all_data["OverallQual"]
all_data["TotalHouse_and_LotArea"] = all_data["TotalHouseArea"] + all_data["LotArea"]
# Proximity to main road or railroad (Condition1) times total house area
all_data["Condition1_and_TotalHouse"] = all_data["labfit_Condition1"] * all_data["TotalHouseArea"]
all_data["Condition1_and_OverallQual"] = all_data["labfit_Condition1"] + all_data["OverallQual"]
# Total basement-related area
all_data["Bsmt"] = all_data["BsmtFinSF1"] + all_data["BsmtFinSF2"] + all_data["BsmtUnfSF"]
# Above-ground full bathrooms plus total rooms above ground
all_data["Rooms"] = all_data["FullBath"] + all_data["TotRmsAbvGrd"]
# Total porch area: open porch, enclosed porch, three-season porch, screen porch
all_data["PorchArea"] = all_data["OpenPorchSF"] + all_data["EnclosedPorch"] + all_data["3SsnPorch"] + all_data["ScreenPorch"]
## Total area of all functional spaces (house, basement, garage, porches, etc.)
all_data["TotalPlace"] = all_data["TotalAllArea"] + all_data["PorchArea"]
Log Transformation
Apply a log transform to the numeric features whose absolute skewness exceeds 0.75, correcting non-normal data toward an approximately normal distribution to better satisfy the assumptions of linear models.
Why transform the distribution of the original numeric features?
- Transformation can make relationships in the data easier to discover, turning apparent non-relationships into usable ones, so the model can exploit the data better;
- Many features are severely skewed (e.g. many small values clustered together); transformation spreads them apart;
- It helps the data satisfy the assumptions the model theory requires, e.g. approximate normality after transformation.
Common transforms include the log transform and the Box-Cox transform. The log transform is the most widely used: taking logs does not change the nature of the data or the relationships between variables, but it compresses the scale, which greatly simplifies computation.
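As a quick self-contained illustration of the effect (on synthetic lognormal data, not the competition features), `log1p` pulls a heavily right-skewed sample back toward symmetry:

```python
# Demonstrate skewness reduction via log1p on synthetic right-skewed data.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=8.0, sigma=1.0, size=1000)  # right-skewed, LotArea-like scale

skew_before = skew(x)
skew_after = skew(np.log1p(x))
print(round(skew_before, 2))  # well above the 0.75/1.0 thresholds used in this notebook
print(round(skew_after, 2))   # near zero: approximately normal after the transform
```

The same comparison can be run per column on `all_data` to sanity-check which features actually benefit from the transform.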
from scipy.stats import norm, skew
# Compute the skewness of every numeric feature
num_features = all_data.select_dtypes(include=['int64','float64','int32']).copy()
num_feature_names = list(num_features.columns)
skewed_feats = all_data[num_feature_names].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness[skewness["Skew"].abs()>0.75]
 | Skew |
---|---|
MiscVal | 21.915535 |
PoolArea | 17.661095 |
LotArea | 13.334935 |
labfit_Condition2 | 12.437736 |
TotalHouse_and_LotArea | 12.380094 |
labfit_Heating | 12.136394 |
LotArea_and_OverallQual | 11.799484 |
3SsnPorch | 11.354131 |
labfit_LandSlope | 5.009358 |
BsmtFinSF2 | 4.137116 |
EnclosedPorch | 4.005089 |
ScreenPorch | 3.926054 |
GarageCond | 3.153395 |
labfit_Condition1 | 3.005668 |
GarageQual | 2.863557 |
MasVnrArea | 2.619878 |
Condition1_and_TotalHouse | 2.544979 |
BsmtCond | 2.542349 |
OpenPorchSF | 2.493685 |
PorchArea | 2.232411 |
labfit_BldgType | 2.186631 |
BsmtFinSF1_and_OverallQual | 2.017572 |
WoodDeckSF | 1.852261 |
TotalHouse_and_OverallQual | 1.615116 |
GrLivArea_and_OverallQual | 1.485190 |
1stFlrSF | 1.264660 |
LotFrontage | 1.106714 |
GrLivArea | 1.048644 |
TotalHouseArea | 1.012116 |
BsmtFinSF1 | 0.982488 |
BsmtUnfSF | 0.919524 |
TotalAllArea | 0.891388 |
TotalPlace | 0.887892 |
2ndFlrSF | 0.853227 |
Neighborhood_and_TotalHouse | 0.852391 |
ExterQual | -0.784824 |
ExterCond | -0.838720 |
Functional_and_OverallQual | -0.920453 |
labfit_BsmtExposure | -1.116930 |
labfit_MSZoning | -1.745237 |
HasEnclosedPorch | -1.880501 |
labfit_Fence | -1.990335 |
labfit_SaleCondition | -2.785113 |
HasScreenPorch | -2.915483 |
labfit_PavedDrive | -2.979584 |
labfit_BsmtFinType2 | -3.036904 |
labfit_CentralAir | -3.461892 |
labfit_SaleType | -3.737598 |
labfit_Functional | -4.062504 |
Has3SsnPorch | -8.695256 |
labfit_Street | -16.166862 |
PoolQC | -20.309793 |
Set the threshold to 1 and apply a log transform to the features whose absolute skewness exceeds it.
skew_cols = list(skewness[skewness["Skew"].abs() > 1].index)
for col in skew_cols:
    # Alternative: Box-Cox transform for the highly skewed features
    # all_data[col] = boxcox1p(all_data[col], 0.15)
    # Log transform for the highly skewed features
    all_data[col] = np.log1p(all_data[col])
# Inspect the remaining string-typed features
all_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2908 entries, 0 to 2907
Columns: 135 entries, MSSubClass to TotalPlace
dtypes: float64(54), int64(40), object(41)
memory usage: 3.0+ MB
# One-hot encode the remaining string features into numeric columns
all_data = pd.get_dummies(all_data)
all_data.head()
 | LotFrontage | LotArea | OverallQual | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | ExterCond | BsmtQual | BsmtCond | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.189655 | 9.042040 | 7 | 109 | 53 | 5.283204 | 1 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 4.394449 | 9.169623 | 6 | 82 | 26 | 0.000000 | 2 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 4.234107 | 9.328212 | 7 | 107 | 52 | 5.093750 | 1 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 4.110874 | 9.164401 | 7 | 24 | 20 | 0.000000 | 2 | 2 | 2 | 0.693147 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 4.442651 | 9.565284 | 8 | 106 | 50 | 5.860786 | 1 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 393 columns
all_data.info()
all_data.shape
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2908 entries, 0 to 2907
Columns: 393 entries, LotFrontage to SaleCondition_Partial
dtypes: float64(54), int64(40), uint8(299)
memory usage: 2.9 MB
(2908, 393)
Dimensionality Reduction
We have only 2908 rows of data but 393 feature columns. With this many features, use Lasso here to reduce the dimensionality.
# Split back into training and test sets
def split_data(all_data, train_index):
    cols = list(all_data.columns)
    # Replace infinities produced during feature engineering with the column median
    for col in cols:
        all_data[col].values[np.isinf(all_data[col].values)] = all_data[col].median()
    del cols, col
    train_data = all_data[:train_index]
    test_data = all_data[train_index:]
    return train_data, test_data
train_X,test_X = split_data(all_data,train_num)
train_X.shape,test_X.shape,train_y.shape
((1449, 393), (1459, 393), (1449,))
1. Standardization robust to outliers
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
# Fit on and transform the training features
train_X = scaler.fit_transform(train_X)
# Transform the test features with the same scaler
test_X = scaler.transform(test_X)
2. Modeling
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha=0.001)
lasso_model.fit(train_X,train_y)
Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the displayed value width to 100 (default 50)
pd.set_option('max_colwidth', 100)
# Feature importances (Lasso coefficients) indexed by feature name
FI_lasso = pd.DataFrame({"Feature Importance": lasso_model.coef_},
                        index=all_data.columns)
# Sort from high to low
FI_lasso.sort_values("Feature Importance", ascending=False).round(5).head(10)
 | Feature Importance |
---|---|
Neighborhood_Crawfor | 0.09052 |
Total_Home_Quality | 0.08677 |
TotalPlace | 0.07877 |
GrLivArea | 0.06999 |
KitchenQual_0 | 0.05483 |
Functional_and_TotalHouse | 0.04605 |
labfit_SaleCondition | 0.04488 |
Exterior1st_BrkFace | 0.04458 |
YearBuilt | 0.03844 |
MSZoning_and_YearBuilt | 0.03626 |
3. Visualization
# Features with non-zero coefficients
FI_lasso = FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance")
FI_lasso.plot(kind="barh", figsize=(12, 40), color='g')
plt.xticks(rotation=90)
display(FI_lasso.shape)
4. Feature Selection
# Keep only the selected features
choose_cols = FI_lasso.index.tolist()
choose_data = all_data[choose_cols].copy()
choose_data.shape
(2908, 86)
Model Building
# Split the dataset
train_X, test_X = choose_data[:train_num], choose_data[train_num:]
# Robust standardization
scaler = RobustScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)
train_X.shape,test_X.shape,train_y.shape
((1449, 86), (1459, 86), (1449,))
# Models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LinearRegression
from sklearn.linear_model import ElasticNet, ElasticNetCV, SGDRegressor, BayesianRidge
from sklearn.svm import SVR, LinearSVR
from mlxtend.regressor import StackingCVRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
# Misc
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
# 12-fold cross-validation
kf = KFold(n_splits=12,random_state=42,shuffle=True)
Scoring Functions
# Root mean squared error
def rmse(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

def cv_rmse(model, X, y):
    # sklearn's neg_mean_squared_error returns the negated MSE, so flip the sign before the square root
    rmse = np.sqrt(-cross_val_score(model, X, y,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
Principal Component Analysis
The newly constructed features are correlated with the original features they were built from, which can introduce strong multicollinearity; apply PCA to decorrelate them.
pca_model = PCA(n_components=60)
train_X = pca_model.fit_transform(train_X)
test_X = pca_model.transform(test_X)
Ensemble Learning
For a given problem we can build several models, each explaining the internal structure of the data from a different angle, and fuse them to obtain a better solution. In ensemble learning the task is to construct individual learners and combine them well. The difficulty is that models trained on the same training set tend to be highly correlated, whereas we want them to be different, so that they can cover each other's weaknesses and achieve a better result.
- Bagging works on the data side: it draws bootstrap samples (sampling with replacement) from the training set and trains an individual learner on each sample.
- Boosting trains individual learners sequentially: after each learner, the weights of the samples it got wrong are increased and the weights of those it got right are decreased, and the next learner is trained on the reweighted data, repeating until the error rate falls below a threshold.
Stacking and Blending are model-combination methods (distinct from bagging) that differ in how they fuse the individual learners: stacking fits a meta-learner, possibly non-linear, on out-of-fold predictions, while blending combines predictions linearly on a holdout set.
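A minimal stacking sketch, using scikit-learn's `StackingRegressor` on synthetic data purely for illustration (the notebook itself imports mlxtend's `StackingCVRegressor`, which follows the same idea): base models are fit on folds, and a meta-learner is fit on their out-of-fold predictions.

```python
# Stacking sketch: Lasso and Ridge base models, Ridge meta-learner
# trained on their out-of-fold predictions (synthetic data, illustration only).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=42)

stack = StackingRegressor(
    estimators=[("lasso", Lasso(alpha=0.1)), ("ridge", Ridge(alpha=1.0))],
    final_estimator=Ridge(),  # combines the base models' out-of-fold predictions
    cv=5,
)
score = cross_val_score(stack, X, y, scoring="r2", cv=5).mean()
print(round(score, 3))  # close to 1 on this easy linear problem
```

The meta-learner here is linear; swapping in a tree-based `final_estimator` gives the non-linear fusion that distinguishes stacking from simple blending.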
# Grid search helper
def get_best_model_and_accuracy(model, params, X, y):
    # error_score=0.: score a parameter combination as 0 if it raises an error
    grid = GridSearchCV(model, params, scoring='neg_mean_squared_error',
                        cv=5, n_jobs=-1, error_score=0.)
    grid.fit(X, y)  # fit over the parameter grid
    # Best cross-validated RMSE
    print("Best Score: {}".format(np.sqrt(-grid.best_score_)))
    # Parameters that achieved the best score
    print("Best Parameters: {}".format(grid.best_params_))
    # Average fit time in seconds
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # Average scoring time in seconds;
    # a rough indicator of how the model would perform in production
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))
    grid.cv_results_['mean_test_score'] = np.sqrt(-grid.cv_results_['mean_test_score'])
    # Print every parameter combination with its mean and std test score
    print(pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
    return grid
Lasso
param_Lasso = {'alpha': [0.0004, 0.0005, 0.0006],
               'max_iter': [10000], 'random_state': [1]}
Lasso_grid = get_best_model_and_accuracy(Lasso(), param_Lasso, train_X, train_y)
Best Score: 0.11233809637926326
Best Parameters: {'alpha': 0.0004, 'max_iter': 10000, 'random_state': 1}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.0
params mean_test_score \
0 {'alpha': 0.0004, 'max_iter': 10000, 'random_state': 1} 0.112338
1 {'alpha': 0.0005, 'max_iter': 10000, 'random_state': 1} 0.112341
2 {'alpha': 0.0006, 'max_iter': 10000, 'random_state': 1} 0.112416
std_test_score
0 0.000861
1 0.000884
2 0.000907
Ridge
param_Ridge = {'alpha': [35, 40, 45, 50, 55]}
Ridge_grid = get_best_model_and_accuracy(Ridge(), param_Ridge, train_X, train_y)
Best Score: 0.11201108834987004
Best Parameters: {'alpha': 35}
Average Time to Fit (s): 0.001
Average Time to Score (s): 0.0
params mean_test_score std_test_score
0 {'alpha': 35} 0.112011 0.000953
1 {'alpha': 40} 0.112035 0.000967
2 {'alpha': 45} 0.112073 0.000980
3 {'alpha': 50} 0.112122 0.000991
4 {'alpha': 55} 0.112180 0.001001
SVR
param_SVR = {'C': [11, 12, 13, 14, 15], 'kernel': ["rbf"], "gamma": [0.0003, 0.0004],
             "epsilon": [0.008, 0.009]}
SVR_grid = get_best_model_and_accuracy(SVR(), param_SVR, train_X, train_y)
Best Score: 0.11185206657627142
Best Parameters: {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
Average Time to Fit (s): 0.317
Average Time to Score (s): 0.044
params \
0 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
1 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
2 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
3 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
4 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
5 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
6 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
7 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
8 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
9 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
10 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
11 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
12 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
13 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
14 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
15 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
16 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
17 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
18 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
19 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
mean_test_score std_test_score
0 0.112221 0.001143
1 0.111954 0.001126
2 0.112240 0.001131
3 0.112010 0.001115
4 0.112148 0.001147
5 0.111916 0.001105
6 0.112193 0.001135
7 0.111954 0.001103
8 0.112077 0.001141
9 0.111902 0.001092
10 0.112097 0.001137
11 0.111994 0.001098
12 0.112045 0.001135
13 0.111888 0.001081
14 0.112054 0.001127
15 0.111958 0.001082
16 0.112021 0.001123
17 0.111852 0.001068
18 0.112056 0.001113
19 0.111902 0.001071
KernelRidge
param_KernelRidge = {'alpha': [0.3, 0.4, 0.5], 'kernel': ["polynomial"],
                     'degree': [3], 'coef0': [0.8, 1, 1.2]}
KernelRidge_grid = get_best_model_and_accuracy(KernelRidge(), param_KernelRidge, train_X, train_y)
Best Score: 0.12053877269961878
Best Parameters: {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
Average Time to Fit (s): 0.207
Average Time to Score (s): 0.037
params \
0 {'alpha': 0.3, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
1 {'alpha': 0.3, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
2 {'alpha': 0.3, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
3 {'alpha': 0.4, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
4 {'alpha': 0.4, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
5 {'alpha': 0.4, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
6 {'alpha': 0.5, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
7 {'alpha': 0.5, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
8 {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
mean_test_score std_test_score
0 0.131492 0.001534
1 0.124723 0.001179
2 0.123360 0.001052
3 0.132097 0.001687
4 0.123652 0.001257
5 0.121633 0.001096
6 0.133186 0.001837
7 0.123168 0.001331
8 0.120539 0.001138
ElasticNet
ElasticNet can be seen as a middle ground between Lasso and Ridge. It also regularizes ordinary linear regression, but its loss function is neither purely L1- nor purely L2-regularized; instead a weight parameter ρ balances the proportions of L1 and L2 regularization.
(Reference: Ridge regression, Lasso regression and ElasticNet regression)
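The weight parameter ρ (l1_ratio in scikit-learn) can be checked at its endpoint: with l1_ratio=1 the penalty is pure L1 and ElasticNet reduces to Lasso (toy data; alpha chosen arbitrarily):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 3.0]) + 0.01 * rng.randn(100)

# l1_ratio=1.0 makes the ElasticNet penalty pure L1, i.e. exactly Lasso's objective
enet = ElasticNet(alpha=0.01, l1_ratio=1.0, max_iter=10000).fit(X, y)
lasso = Lasso(alpha=0.01, max_iter=10000).fit(X, y)
```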
param_ElasticNet = {'alpha': [0.0008, 0.004, 0.005], 'l1_ratio': [0.08, 0.1, 0.3, 0.5],
                    'max_iter': [10000], 'random_state': [3]}
ElasticNet_grid = get_best_model_and_accuracy(ElasticNet(), param_ElasticNet, train_X, train_y)
Best Score: 0.11223819703859092
Best Parameters: {'alpha': 0.005, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
Average Time to Fit (s): 0.001
Average Time to Score (s): 0.0
params \
0 {'alpha': 0.0008, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
1 {'alpha': 0.0008, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
2 {'alpha': 0.0008, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
3 {'alpha': 0.0008, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
4 {'alpha': 0.004, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
5 {'alpha': 0.004, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
6 {'alpha': 0.004, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
7 {'alpha': 0.004, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
8 {'alpha': 0.005, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
9 {'alpha': 0.005, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
10 {'alpha': 0.005, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
11 {'alpha': 0.005, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
mean_test_score std_test_score
0 0.112599 0.000791
1 0.112573 0.000795
2 0.112379 0.000828
3 0.112327 0.000865
4 0.112244 0.000872
5 0.112254 0.000888
6 0.113251 0.001022
7 0.114522 0.001099
8 0.112238 0.000895
9 0.112282 0.000914
10 0.113737 0.001056
11 0.115224 0.001138
bay = BayesianRidge()
xgb = XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,subsample=0.5213,
silent=1,random_state =7, nthread = -1)
lgbm = LGBMRegressor(objective='regression',num_leaves=5,learning_rate=0.05,
n_estimators=700,max_bin = 55,
bagging_fraction = 0.8,bagging_freq = 5,
feature_fraction = 0.25,feature_fraction_seed=9,
bagging_seed=9,min_data_in_leaf = 6, min_sum_hessian_in_leaf = 11)
GBR = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state =5)
Stacking
Stacking uses two layers of learners: the first layer contains several weak learners that each make predictions, which are passed to the second layer; the second-layer learner then predicts from those first-layer predictions.
Stacking overfits easily, so training uses K-Fold:
- Split the training set into 5 folds and iterate 5 times. In each iteration, train every base model on 4 folds and predict on the held-out fold; also save each base model's predictions on the test set. After the 5-fold loop we obtain a matrix of shape (training rows) x (number of models): during CV each base model predicts every hold-out fold once, which together covers the whole training set. This matrix is the second layer's training data, producing model B.
- Average each base model's saved test-set predictions into a matrix of shape (test rows) x (number of models): each base model predicts the test set once per fold, so its 5 sets of predictions are averaged before assembling the matrix.
- Model B predicts on this test-set matrix to produce the final result.
class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod                # first-layer learners
        self.meta_model = meta_model  # second-layer (meta) learner
        # number of folds used to build the stacked features
        self.k = 5
        self.kf = KFold(n_splits=self.k, random_state=42, shuffle=True)
    # fit on the training data
    def fit(self, X, y):
        # self.saved_model holds the fitted copies of every first-layer learner
        self.saved_model = [list() for i in self.mod]
        # shape: (training rows) x (number of models)
        oof_train = np.zeros((X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):  # index and model
            # indices of the train/validation split for each fold
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)  # fresh copy of the model
                # train on this fold's training portion
                renew_model.fit(X[train_index], y[train_index])
                # keep the fitted model
                self.saved_model[i].append(renew_model)
                # save the model's out-of-fold predictions
                oof_train[val_index, i] = renew_model.predict(X[val_index])
        # train the meta-learner, using only the first-layer predictions as features
        self.meta_model.fit(oof_train, y)
        return self
    # predict on test data
    def predict(self, X):
        # first-layer predictions for the whole test set;
        # np.column_stack joins columns, mean(axis=1) averages across folds
        whole_test = np.column_stack([
            np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
            for single_model in self.saved_model])
        # final prediction: the meta-learner applied to the first-layer predictions
        return self.meta_model.predict(whole_test)
    ## build the stacked features from the first-layer results
    def get_oof(self, X, y, test_X):
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], self.k))
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                # save out-of-fold predictions
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            # average each model's K-Fold test-set predictions
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean
lasso = Lasso_grid.best_estimator_
ridge = Ridge_grid.best_estimator_
svr = SVR_grid.best_estimator_
ker = KernelRidge_grid.best_estimator_
ela = ElasticNet_grid.best_estimator_
stack_model = stacking(mod=[bay,lasso,ridge,svr,ker,ela], meta_model=ker)
# cross-validated RMSE on the training set
score = cv_rmse(stack_model,train_X,train_y)
display(score.mean())
0.10746634249868159
# build the features for the second-layer learner
x_train_stack, x_test_stack = stack_model.get_oof(train_X,train_y,test_X)
train_X.shape,train_y.shape,test_X.shape
((1449, 60), (1449,), (1459, 60))
# 6 models in the first layer
x_train_stack.shape, x_test_stack.shape
((1449, 6), (1459, 6))
In stacking, once the base models' predictions are obtained, training the meta-model on them is normally enough. In this case, however, we also try concatenating the first-layer stacked features with the original features, then feeding the combined features back into stacking for training.
# concatenate the stacking features with the original features, horizontally
x_train_add = np.hstack((train_X,x_train_stack))
x_test_add = np.hstack((test_X,x_test_stack))
x_train_add.shape,x_test_add.shape
((1449, 66), (1459, 66))
# accuracy with the concatenated features: slightly improved
score = cv_rmse(stack_model,x_train_add,train_y)
print(score.mean())
0.10195220877304757
Retrain via stacking on x_train_add, train_y and x_test_add
param_Lasso = {'alpha': [0.0004, 0.0005, 0.0006],
               'max_iter': [10000], 'random_state': [1]}
Lasso_2 = get_best_model_and_accuracy(Lasso(), param_Lasso, x_train_add, train_y)
Best Score: 0.11162310214215297
Best Parameters: {'alpha': 0.0005, 'max_iter': 10000, 'random_state': 1}
Average Time to Fit (s): 0.009
Average Time to Score (s): 0.0
params mean_test_score \
0 {'alpha': 0.0004, 'max_iter': 10000, 'random_state': 1} 0.111637
1 {'alpha': 0.0005, 'max_iter': 10000, 'random_state': 1} 0.111623
2 {'alpha': 0.0006, 'max_iter': 10000, 'random_state': 1} 0.111662
std_test_score
0 0.000880
1 0.000896
2 0.000909
param_Ridge = {'alpha': [35, 40, 45, 50, 55]}
Ridge_2 = get_best_model_and_accuracy(Ridge(), param_Ridge, x_train_add, train_y)
Best Score: 0.1118608032209135
Best Parameters: {'alpha': 35}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.0
params mean_test_score std_test_score
0 {'alpha': 35} 0.111861 0.000949
1 {'alpha': 40} 0.111892 0.000962
2 {'alpha': 45} 0.111924 0.000973
3 {'alpha': 50} 0.111960 0.000983
4 {'alpha': 55} 0.111999 0.000992
param_SVR = {'C': [11, 12, 13, 14, 15], 'kernel': ["rbf"], "gamma": [0.0003, 0.0004],
             "epsilon": [0.008, 0.009]}
SVR_2 = get_best_model_and_accuracy(SVR(), param_SVR, x_train_add, train_y)
Best Score: 0.11187202151025108
Best Parameters: {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
Average Time to Fit (s): 0.316
Average Time to Score (s): 0.044
params \
0 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
1 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
2 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
3 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
4 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
5 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
6 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
7 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
8 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
9 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
10 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
11 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
12 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
13 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
14 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
15 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
16 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
17 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
18 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
19 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
mean_test_score std_test_score
0 0.112114 0.001168
1 0.111980 0.001131
2 0.112167 0.001164
3 0.112013 0.001132
4 0.112075 0.001161
5 0.111909 0.001112
6 0.112136 0.001158
7 0.111960 0.001113
8 0.112050 0.001159
9 0.111898 0.001082
10 0.112133 0.001152
11 0.111930 0.001096
12 0.112024 0.001159
13 0.111873 0.001057
14 0.112087 0.001149
15 0.111928 0.001074
16 0.111989 0.001150
17 0.111872 0.001046
18 0.112041 0.001143
19 0.111910 0.001060
param_KernelRidge = {'alpha': [0.3, 0.4, 0.5], 'kernel': ["polynomial"],
                     'degree': [3], 'coef0': [0.8, 1, 1.2]}
KernelRidge_2 = get_best_model_and_accuracy(KernelRidge(), param_KernelRidge, x_train_add, train_y)
Best Score: 0.11754411372302964
Best Parameters: {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
Average Time to Fit (s): 0.184
Average Time to Score (s): 0.032
params \
0 {'alpha': 0.3, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
1 {'alpha': 0.3, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
2 {'alpha': 0.3, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
3 {'alpha': 0.4, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
4 {'alpha': 0.4, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
5 {'alpha': 0.4, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
6 {'alpha': 0.5, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
7 {'alpha': 0.5, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
8 {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
mean_test_score std_test_score
0 0.121835 0.002417
1 0.119743 0.002347
2 0.118019 0.002291
3 0.121416 0.002253
4 0.119359 0.002201
5 0.117628 0.002159
6 0.121293 0.002123
7 0.119272 0.002083
8 0.117544 0.002051
param_ElasticNet = {'alpha': [0.0008, 0.004, 0.005], 'l1_ratio': [0.08, 0.1, 0.3, 0.5],
                    'max_iter': [10000], 'random_state': [3]}
ElasticNet_2 = get_best_model_and_accuracy(ElasticNet(), param_ElasticNet, x_train_add, train_y)
Best Score: 0.10667612140906058
Best Parameters: {'alpha': 0.0008, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
Average Time to Fit (s): 0.025
Average Time to Score (s): 0.0
params \
0 {'alpha': 0.0008, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
1 {'alpha': 0.0008, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
2 {'alpha': 0.0008, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
3 {'alpha': 0.0008, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
4 {'alpha': 0.004, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
5 {'alpha': 0.004, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
6 {'alpha': 0.004, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
7 {'alpha': 0.004, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
8 {'alpha': 0.005, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
9 {'alpha': 0.005, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
10 {'alpha': 0.005, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
11 {'alpha': 0.005, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
mean_test_score std_test_score
0 0.106676 0.000741
1 0.107021 0.000758
2 0.111335 0.000889
3 0.111619 0.000880
4 0.111584 0.000877
5 0.111586 0.000891
6 0.112205 0.001007
7 0.113027 0.001072
8 0.111594 0.000896
9 0.111623 0.000914
10 0.112603 0.001041
11 0.113622 0.001111
bay_2 = BayesianRidge()
xgb_2 = XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,learning_rate=0.05,
max_depth=3,min_child_weight=1.7817, n_estimators=2200,reg_alpha=0.4640,
reg_lambda=0.8571,subsample=0.5213, silent=1,random_state =7, nthread = -1)
lgbm_2 = LGBMRegressor(objective='regression',num_leaves=5,learning_rate=0.05,
n_estimators=700,max_bin = 55,bagging_fraction = 0.8,bagging_freq = 5,
feature_fraction = 0.25,feature_fraction_seed=9,
bagging_seed=9,min_data_in_leaf = 6,
min_sum_hessian_in_leaf = 11)
GBR_2 = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,max_depth=4,
max_features='sqrt',min_samples_leaf=15,
min_samples_split=10,
loss='huber',
random_state =5)
lasso_2 = Lasso_2.best_estimator_
ridge_2 = Ridge_2.best_estimator_
svr_2 = SVR_2.best_estimator_
ker_2 = KernelRidge_2.best_estimator_
ela_2 = ElasticNet_2.best_estimator_
stack_model_2 = stacking(mod=[bay_2,lasso_2,ridge_2,svr_2,ker_2,ela_2], meta_model=ker_2)
last_x_train_stack, last_x_test_stack = stack_model_2.get_oof(x_train_add,train_y,x_test_add)
last_x_train_stack.shape, last_x_test_stack.shape
((1449, 6), (1459, 6))
Parameter search for the second-layer KernelRidge model
param_ker = {'alpha': [0.2, 0.3, 0.4, 0.5], 'kernel': ["polynomial"],
             'degree': [3, 4], 'coef0': [0.8, 1, 1.2]}
Ker_stack_model = get_best_model_and_accuracy(KernelRidge(), param_ker,
                                              last_x_train_stack, train_y).best_estimator_
Best Score: 0.08808555947636867
Best Parameters: {'alpha': 0.2, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
Average Time to Fit (s): 0.186
Average Time to Score (s): 0.03
params \
0 {'alpha': 0.2, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
1 {'alpha': 0.2, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
2 {'alpha': 0.2, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
3 {'alpha': 0.2, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
4 {'alpha': 0.2, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
5 {'alpha': 0.2, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
6 {'alpha': 0.3, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
7 {'alpha': 0.3, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
8 {'alpha': 0.3, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
9 {'alpha': 0.3, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
10 {'alpha': 0.3, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
11 {'alpha': 0.3, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
12 {'alpha': 0.4, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
13 {'alpha': 0.4, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
14 {'alpha': 0.4, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
15 {'alpha': 0.4, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
16 {'alpha': 0.4, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
17 {'alpha': 0.4, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
18 {'alpha': 0.5, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
19 {'alpha': 0.5, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
20 {'alpha': 0.5, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
21 {'alpha': 0.5, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
22 {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
23 {'alpha': 0.5, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
mean_test_score std_test_score
0 0.089836 0.000473
1 0.088086 0.000600
2 0.089773 0.000480
3 0.088102 0.000599
4 0.089749 0.000485
5 0.088118 0.000599
6 0.090043 0.000456
7 0.088470 0.000586
8 0.089935 0.000462
9 0.088486 0.000586
10 0.089885 0.000468
11 0.088501 0.000586
12 0.090214 0.000443
13 0.088732 0.000579
14 0.090070 0.000449
15 0.088748 0.000580
16 0.089994 0.000455
17 0.088762 0.000581
18 0.090358 0.000434
19 0.088920 0.000576
20 0.090184 0.000439
21 0.088936 0.000577
22 0.090086 0.000445
23 0.088949 0.000578
cv_rmse(Ker_stack_model,last_x_train_stack,train_y).mean()
0.08791312508608311
# note: the target was log-transformed (log1p) earlier, so invert with expm1
y_pred_stack = np.expm1(Ker_stack_model.predict(last_x_test_stack))
Alternatively, the stacking class's own fit/predict methods can be used directly:
stack_model = stacking(mod=[lgbm,ela,svr,ridge,lasso,bay,xgb,GBR,ker],
meta_model=KernelRidge(alpha=0.2 ,kernel='polynomial',
degree=4, coef0=0.8))
stack_model.fit(x_train_add,train_y)
y_pred_stack_2 = np.expm1(stack_model.predict(x_test_add))  # expm1 inverts the earlier log1p transform
XGBoost modeling and prediction
xgb.fit(last_x_train_stack,train_y)
y_pred_xgb = np.expm1(xgb.predict(last_x_test_stack))
# cross-validation
cv_rmse(xgb,x_train_stack,train_y).mean()
0.1139198877562616
# training-set error
y_train_xgb = xgb.predict(last_x_train_stack)
rmse(y_train_xgb,train_y)
0.08778404527191365
LightGBM modeling and prediction
lgbm.fit(last_x_train_stack,train_y)
y_pred_lgbm = np.expm1(lgbm.predict(last_x_test_stack))
cv_rmse(lgbm,x_train_stack,train_y).mean()
0.1161628433489873
y_train_lgbm = lgbm.predict(last_x_train_stack)
rmse(y_train_lgbm,train_y)
0.10937253913955777
# model fusion: weighted average of the three predictions
y_pred = (0.7*y_pred_stack)+(0.15*y_pred_xgb)+(0.15*y_pred_lgbm)
submission = pd.read_csv("/home/aistudio/data/data32288/submission.csv")
submission.shape,y_pred.shape
((1459, 2), (1459,))
submission.iloc[:,1] = y_pred
submission.to_csv(r'./house_submission.csv',index=False)
submission.head()
     Id      SalePrice
0  1461  119962.721230
1  1462  161987.446003
2  1463  188901.912081
3  1464  194701.643631
4  1465  194480.370160
Blending
Blending differs from Stacking mainly in that the second-stage features are not out-of-fold predictions from K-Fold CV; instead a hold-out set is created, and the second-stage stacker model is fitted on the first-stage models' predictions for that hold-out set. In other words, the K-Fold CV in the Stacking workflow is replaced by hold-out CV.
Steps:
- Split the original training set into two parts, e.g. 70% as the training set and the remaining 30% as the validation set. First round of training: train several models on the 70%, predict the labels of the 30% validation data to get pre_val_set, and also predict the test set with these models to get pre_test_set.
- Second round of training: train the second-layer model, Model B, using pre_val_set as new features.
- Use Model B to predict on pre_test_set to obtain the final result.
Advantages of Blending:
- Simpler than stacking (no k rounds of cross-validation to build the stacker features)
- Avoids an information-leak problem: the generalizers and the stacker use different data
- No need to share random seeds with teammates during team modeling
Disadvantages of Blending:
- Uses only a small portion of the data
- The blender may overfit (most likely a consequence of the first point)
- Stacking's repeated CV is more robust
from sklearn.model_selection import train_test_split
# individual models used in the fusion
clfs = [BayesianRidge(), Lasso(), Ridge(), SVR(), KernelRidge(), ElasticNet()]
# split the training data into train and validation parts
X_train, X_val, y_train, y_val = train_test_split(train_X, train_y, test_size=0.33, random_state=1855)
dataset_val = np.zeros((X_val.shape[0], len(clfs)))    # predictions on the validation set
dataset_test = np.zeros((test_X.shape[0], len(clfs)))  # predictions on the test set
# train each individual model in turn
for j, clf in enumerate(clfs):
    # train on X_train; the validation predictions become the second stage's new features
    clf.fit(X_train, y_train)
    dataset_val[:, j] = clf.predict(X_val)
    # for the test set, these models' predictions are used directly as new features
    dataset_test[:, j] = clf.predict(test_X)
# model used for the fusion
clf = XGBRegressor()
clf.fit(dataset_val, y_val)
# note: the target was log-transformed earlier
y_submission = np.expm1(clf.predict(dataset_test))
cv_rmse(clf,train_X,train_y).mean()
0.14310972129182878
y_submission
array([122274.41, 142203.67, 176042.67, ..., 164987.31, 107128.92,
250321.12], dtype=float32)
y_pred_stack
array([118603.60717676, 162614.48976635, 190387.78002988, ...,
179561.60366542, 117042.61233382, 223750.10906997])
Visualizing model prediction accuracy
# uses the mlxtend package
from mlxtend.regressor import StackingCVRegressor
stack_gen = StackingCVRegressor(regressors=(lgbm, ela, svr, ridge, lasso, bay, xgb, GBR, ker),
                                meta_regressor=ker,
                                # the meta-regressor is trained on the base regressors' predictions plus the original dataset
                                use_features_in_secondary=True)
Get each model's cross-validation score
scores = {}
score = cv_rmse(lgbm,train_X,train_y)
print("lightgbm: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['lgbm'] = (score.mean(), score.std())
lightgbm: 0.1280 (0.0148)
score = cv_rmse(ela,train_X,train_y)
print("ElasticNet: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['ela'] = (score.mean(), score.std())
ElasticNet: 0.1108 (0.0151)
score = cv_rmse(svr,train_X,train_y)
print("SVR: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['svr'] = (score.mean(), score.std())
SVR: 0.1096 (0.0172)
score = cv_rmse(ridge,train_X,train_y)
print("ridge: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['ridge'] = (score.mean(), score.std())
ridge: 0.1106 (0.0154)
score = cv_rmse(lasso,train_X,train_y)
print("Lasso: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['Lasso'] = (score.mean(), score.std())
Lasso: 0.1108 (0.0150)
score = cv_rmse(bay,train_X,train_y)
print("bay: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['bay'] = (score.mean(), score.std())
bay: 0.1106 (0.0152)
score = cv_rmse(xgb,train_X,train_y)
print("xgb: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['xgb'] = (score.mean(), score.std())
xgb: 0.1259 (0.0156)
score = cv_rmse(GBR,train_X,train_y)
print("GBR: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['GBR'] = (score.mean(), score.std())
GBR: 0.1326 (0.0189)
score = cv_rmse(ker,train_X,train_y)
print("ker: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['ker'] = (score.mean(), score.std())
ker: 0.1178 (0.0167)
score = cv_rmse(stack_gen,train_X,train_y)
print("stack_gen: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['stack_gen'] = (score.mean(), score.std())
stack_gen: 0.1338 (0.0191)
Identify the best-performing model
sns.set_style("white")
fig = plt.figure(figsize=(24, 12))
ax = sns.pointplot(x=list(scores.keys()), y=[score for score, _ in scores.values()],
markers=['o'], linestyles=['-'])
for i, score in enumerate(scores.values()):
ax.text(i, score[0] + 0.002, '{:.6f}'.format(score[0]),
horizontalalignment='left', size='large', color='black', weight='semibold')
plt.ylabel('Score (RMSE)', size=20, labelpad=12.5)
plt.xlabel('Model', size=20, labelpad=12.5)
plt.tick_params(axis='x', labelsize=13.5)
plt.tick_params(axis='y', labelsize=12.5)
plt.title('Scores of Models', size=20)
plt.show()