Kaggle House Prices
Predict the sale price of each house from residential data for Ames, Iowa.
This is a regression problem; submissions are evaluated with root mean squared error (computed on the logarithm of the sale price).
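For reference, the metric can be sketched as a small helper. This is a hypothetical illustration with made-up prices, not competition code; the name `rmse_log` is our own:

```python
# Minimal sketch of the evaluation metric: RMSE on log-prices.
# rmse_log and the example prices are illustrative, not part of the competition kit.
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(1 + price) values; log1p keeps a zero price well-defined."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

perfect = rmse_log(np.array([200000.0, 150000.0]), np.array([200000.0, 150000.0]))
off_by_half = rmse_log(np.array([200000.0]), np.array([100000.0]))
print(perfect)      # 0.0 for a perfect prediction
print(off_by_half)  # roughly log(2) ≈ 0.69 when the prediction is half the true price
```

Because the error is measured on log-prices, over- and under-predictions of the same ratio are penalized equally, which is why the target is log-transformed later in this notebook.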
Data Analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.special import boxcox1p
import missingno as msno
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
# Load the data
house_train = pd.read_csv('/home/aistudio/data/data32288/train.csv')
house_test = pd.read_csv('/home/aistudio/data/data32288/test.csv')
house_train.shape,house_test.shape
((1460, 81), (1459, 80))
house_train.info()
print('-'*40)
house_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1459 non-null int64
1 MSSubClass 1459 non-null int64
2 MSZoning 1455 non-null object
3 LotFrontage 1232 non-null float64
4 LotArea 1459 non-null int64
5 Street 1459 non-null object
6 Alley 107 non-null object
7 LotShape 1459 non-null object
8 LandContour 1459 non-null object
9 Utilities 1457 non-null object
10 LotConfig 1459 non-null object
11 LandSlope 1459 non-null object
12 Neighborhood 1459 non-null object
13 Condition1 1459 non-null object
14 Condition2 1459 non-null object
15 BldgType 1459 non-null object
16 HouseStyle 1459 non-null object
17 OverallQual 1459 non-null int64
18 OverallCond 1459 non-null int64
19 YearBuilt 1459 non-null int64
20 YearRemodAdd 1459 non-null int64
21 RoofStyle 1459 non-null object
22 RoofMatl 1459 non-null object
23 Exterior1st 1458 non-null object
24 Exterior2nd 1458 non-null object
25 MasVnrType 1443 non-null object
26 MasVnrArea 1444 non-null float64
27 ExterQual 1459 non-null object
28 ExterCond 1459 non-null object
29 Foundation 1459 non-null object
30 BsmtQual 1415 non-null object
31 BsmtCond 1414 non-null object
32 BsmtExposure 1415 non-null object
33 BsmtFinType1 1417 non-null object
34 BsmtFinSF1 1458 non-null float64
35 BsmtFinType2 1417 non-null object
36 BsmtFinSF2 1458 non-null float64
37 BsmtUnfSF 1458 non-null float64
38 TotalBsmtSF 1458 non-null float64
39 Heating 1459 non-null object
40 HeatingQC 1459 non-null object
41 CentralAir 1459 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1459 non-null int64
44 2ndFlrSF 1459 non-null int64
45 LowQualFinSF 1459 non-null int64
46 GrLivArea 1459 non-null int64
47 BsmtFullBath 1457 non-null float64
48 BsmtHalfBath 1457 non-null float64
49 FullBath 1459 non-null int64
50 HalfBath 1459 non-null int64
51 BedroomAbvGr 1459 non-null int64
52 KitchenAbvGr 1459 non-null int64
53 KitchenQual 1458 non-null object
54 TotRmsAbvGrd 1459 non-null int64
55 Functional 1457 non-null object
56 Fireplaces 1459 non-null int64
57 FireplaceQu 729 non-null object
58 GarageType 1383 non-null object
59 GarageYrBlt 1381 non-null float64
60 GarageFinish 1381 non-null object
61 GarageCars 1458 non-null float64
62 GarageArea 1458 non-null float64
63 GarageQual 1381 non-null object
64 GarageCond 1381 non-null object
65 PavedDrive 1459 non-null object
66 WoodDeckSF 1459 non-null int64
67 OpenPorchSF 1459 non-null int64
68 EnclosedPorch 1459 non-null int64
69 3SsnPorch 1459 non-null int64
70 ScreenPorch 1459 non-null int64
71 PoolArea 1459 non-null int64
72 PoolQC 3 non-null object
73 Fence 290 non-null object
74 MiscFeature 51 non-null object
75 MiscVal 1459 non-null int64
76 MoSold 1459 non-null int64
77 YrSold 1459 non-null int64
78 SaleType 1458 non-null object
79 SaleCondition 1459 non-null object
dtypes: float64(11), int64(26), object(43)
memory usage: 912.0+ KB
# Summary statistics
house_train.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
Id | 1460.0 | 730.500000 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
MSSubClass | 1460.0 | 56.897260 | 42.300571 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.00 | 69.0 | 80.00 | 313.0 |
LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.50 | 9478.5 | 11601.50 | 215245.0 |
OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.00 | 1973.0 | 2000.00 | 2010.0 |
YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.00 | 1994.0 | 2004.00 | 2010.0 |
MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.00 | 0.0 | 166.00 | 1600.0 |
BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.00 | 383.5 | 712.25 | 5644.0 |
BsmtFinSF2 | 1460.0 | 46.549315 | 161.319273 | 0.0 | 0.00 | 0.0 | 0.00 | 1474.0 |
BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.00 | 477.5 | 808.00 | 2336.0 |
TotalBsmtSF | 1460.0 | 1057.429452 | 438.705324 | 0.0 | 795.75 | 991.5 | 1298.25 | 6110.0 |
1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.00 | 1087.0 | 1391.25 | 4692.0 |
2ndFlrSF | 1460.0 | 346.992466 | 436.528436 | 0.0 | 0.00 | 0.0 | 728.00 | 2065.0 |
LowQualFinSF | 1460.0 | 5.844521 | 48.623081 | 0.0 | 0.00 | 0.0 | 0.00 | 572.0 |
GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.50 | 1464.0 | 1776.75 | 5642.0 |
BsmtFullBath | 1460.0 | 0.425342 | 0.518911 | 0.0 | 0.00 | 0.0 | 1.00 | 3.0 |
BsmtHalfBath | 1460.0 | 0.057534 | 0.238753 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
FullBath | 1460.0 | 1.565068 | 0.550916 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
HalfBath | 1460.0 | 0.382877 | 0.502885 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
BedroomAbvGr | 1460.0 | 2.866438 | 0.815778 | 0.0 | 2.00 | 3.0 | 3.00 | 8.0 |
KitchenAbvGr | 1460.0 | 1.046575 | 0.220338 | 0.0 | 1.00 | 1.0 | 1.00 | 3.0 |
TotRmsAbvGrd | 1460.0 | 6.517808 | 1.625393 | 2.0 | 5.00 | 6.0 | 7.00 | 14.0 |
Fireplaces | 1460.0 | 0.613014 | 0.644666 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
GarageYrBlt | 1379.0 | 1978.506164 | 24.689725 | 1900.0 | 1961.00 | 1980.0 | 2002.00 | 2010.0 |
GarageCars | 1460.0 | 1.767123 | 0.747315 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.50 | 480.0 | 576.00 | 1418.0 |
WoodDeckSF | 1460.0 | 94.244521 | 125.338794 | 0.0 | 0.00 | 0.0 | 168.00 | 857.0 |
OpenPorchSF | 1460.0 | 46.660274 | 66.256028 | 0.0 | 0.00 | 25.0 | 68.00 | 547.0 |
EnclosedPorch | 1460.0 | 21.954110 | 61.119149 | 0.0 | 0.00 | 0.0 | 0.00 | 552.0 |
3SsnPorch | 1460.0 | 3.409589 | 29.317331 | 0.0 | 0.00 | 0.0 | 0.00 | 508.0 |
ScreenPorch | 1460.0 | 15.060959 | 55.757415 | 0.0 | 0.00 | 0.0 | 0.00 | 480.0 |
PoolArea | 1460.0 | 2.758904 | 40.177307 | 0.0 | 0.00 | 0.0 | 0.00 | 738.0 |
MiscVal | 1460.0 | 43.489041 | 496.123024 | 0.0 | 0.00 | 0.0 | 0.00 | 15500.0 |
MoSold | 1460.0 | 6.321918 | 2.703626 | 1.0 | 5.00 | 6.0 | 8.00 | 12.0 |
YrSold | 1460.0 | 2007.815753 | 1.328095 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
SalePrice | 1460.0 | 180921.195890 | 79442.502883 | 34900.0 | 129975.00 | 163000.0 | 214000.00 | 755000.0 |
Missing Values
msno.matrix(house_train, labels=True)
msno.bar(house_train)
msno.heatmap(house_train)
data_null = house_train.isnull().sum()
data_null[data_null>0].sort_values(ascending=False)
PoolQC 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
LotFrontage 259
GarageYrBlt 81
GarageType 81
GarageFinish 81
GarageQual 81
GarageCond 81
BsmtFinType2 38
BsmtExposure 38
BsmtFinType1 37
BsmtCond 37
BsmtQual 37
MasVnrArea 8
MasVnrType 8
Electrical 1
dtype: int64
Visualization
# Scatter plot of every numeric feature against SalePrice
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numeric = []
for col in house_train.columns:
    if house_train[col].dtype in numeric_dtypes:
        # Skip engineered columns that are only added later in the notebook
        if col in ['TotalSF', 'Total_Bathrooms', 'Total_porch_sf', 'haspool', 'hasgarage', 'hasbsmt', 'hasfireplace']:
            continue
        numeric.append(col)
fig = plt.figure(figsize=(12, 120))
# Adjust subplot spacing
plt.subplots_adjust(right=2, top=2)
# Use 8 evenly spaced colors from the husl color space
sns.color_palette("husl", 8)
# Enumerate from 1 so the subplot index starts at 1
for i, feature in enumerate(numeric, 1):
    if feature == 'MiscVal':
        break
    plt.subplot(len(numeric), 3, i)
    sns.scatterplot(x=feature, y='SalePrice', hue='SalePrice', palette='Blues', data=house_train)
    # labelpad: spacing between the axis label and the tick labels
    plt.xlabel('{}'.format(feature), size=15, labelpad=12.5)
    plt.ylabel('SalePrice', size=15, labelpad=12.5)
    plt.tick_params(axis='x', labelsize=12)
    plt.tick_params(axis='y', labelsize=12)
    plt.legend(loc='best', prop={'size': 10})
plt.show()
Examining the target variable
- SalePrice is not normally distributed; it is highly skewed.
- The mean sale price, $180,921, is pulled upward toward the high-end outliers.
- The median, $163,000, is below the mean.
- There are outliers at the upper end.
house_train['SalePrice'].describe()
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.distplot(house_train['SalePrice'],fit=stats.norm,ax=ax[0])
sns.boxplot(house_train['SalePrice'])
#skewness and kurtosis
print("Skewness: {}".format(house_train['SalePrice'].skew()))
print("Kurtosis: {}".format(house_train['SalePrice'].kurt()))
Skewness: 1.8828757597682129
Kurtosis: 6.536281860064529
fig = plt.figure()
stats.probplot(house_train['SalePrice'],plot=plt)
((array([-3.30513952, -3.04793228, -2.90489705, ..., 2.90489705,
3.04793228, 3.30513952]),
array([ 34900, 35311, 37900, ..., 625000, 745000, 755000])),
(74160.16474519414, 180921.19589041095, 0.9319665641512983))
# Transform the target with log(1 + x)
house_train['SalePrice'] = np.log1p(house_train['SalePrice'])
fig = plt.figure()
stats.probplot(house_train['SalePrice'],plot=plt)
((array([-3.30513952, -3.04793228, -2.90489705, ..., 2.90489705,
3.04793228, 3.30513952]),
array([10.46027076, 10.47197813, 10.54273278, ..., 13.34550853,
13.52114084, 13.53447435])),
(0.398259646654151, 12.024057394918403, 0.9953761551826702))
Feature Correlations
def draw_corr(data):
    corr = data.corr()
    plt.subplots(figsize=(12, 12))
    sns.heatmap(corr, vmax=1, square=True, cmap='Blues')
    plt.show()
draw_corr(house_train)
# The 10 features most correlated with SalePrice
corrmat = house_train.corr()
plt.subplots(figsize=(10,8))
k = 10
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(house_train[cols].values.T)
# annot_kws: styling for the cell annotations when annot=True (size, color, weight, italics, etc.)
# fmt: number format; keep 2 decimal places here
sns.heatmap(cm,cbar=True, annot=True, square=True,
fmt='.2f', annot_kws={'size': 10},
yticklabels=cols.values, xticklabels=cols.values)
1.OverallQual
f,ax = plt.subplots(figsize=(8,6))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=house_train)
# Remove outliers (note that SalePrice is on the log scale here)
mask = (house_train['OverallQual']<5)&(house_train['SalePrice']>12)
house_train.drop(house_train[mask].index, inplace=True)
house_train.plot.scatter(x='OverallQual', y='SalePrice')
2.GrLivArea
house_train.plot.scatter(x='GrLivArea', y='SalePrice',alpha=0.3)
# Remove the two outliers in the lower-right corner
mask = (house_train['GrLivArea']>4000)&(house_train['SalePrice']<12.5)
house_train= house_train.drop(house_train[mask].index)
# After removing the outliers
fig,ax = plt.subplots()
ax.scatter(x=house_train['GrLivArea'],y=house_train['SalePrice'])
plt.xlabel('GrLivArea',fontsize=13)
plt.ylabel('SalePrice',fontsize=13)
3.GarageCars
house_train.plot.scatter(x='GarageCars', y='SalePrice', alpha=0.3)
4.GarageArea
house_train.plot.scatter(x='GarageArea', y='SalePrice')
# Remove outliers
mask = (house_train['GarageArea']>1100)&(house_train['SalePrice']<12.5)
house_train.drop(house_train[mask].index, inplace=True)
house_train.plot.scatter(x='GarageArea', y='SalePrice')
5.TotalBsmtSF
house_train.plot.scatter(x='TotalBsmtSF', y='SalePrice')
6.1stFlrSF
house_train.plot.scatter(x='1stFlrSF', y='SalePrice')
7.FullBath
house_train.plot.scatter(x='FullBath', y='SalePrice')
8.YearBuilt
house_train.plot.scatter(x='YearBuilt', y='SalePrice')
# Remove outliers
mask = (house_train['YearBuilt']<1900)&(house_train['SalePrice']>12.3)
house_train= house_train.drop(house_train[mask].index)
# After removing the outliers
house_train.plot.scatter(x='YearBuilt', y='SalePrice')
9.YearRemodAdd
house_train.plot.scatter(x='YearRemodAdd', y='SalePrice')
# Reset the index
house_train.reset_index(drop=True,inplace=True)
Feature Engineering
Concatenate the training and test sets so feature engineering is applied to all the data at once.
train_num = house_train.shape[0]
test_num = house_test.shape[0]
train_y = house_train.SalePrice.values
all_data = pd.concat((house_train,house_test)).reset_index(drop=True)
all_data.drop(['SalePrice','Id'],axis=1,inplace=True)
all_data.shape,train_num,test_num
((2908, 79), 1449, 1459)
Missing Value Handling
- Fill missing values in numeric features with 0 or their mode, as appropriate
- Fill missing values in categorical features with "None"
- Drop useless features
- Encode categorical features (get_dummies)
count = all_data.isnull().sum().sort_values(ascending=False)
ratio = count/len(all_data)*100
cols_type = all_data[count.index].dtypes
missing_data = pd.concat([count,ratio,cols_type],axis=1,keys=['count','ratio','cols_type'])
missing_data=missing_data[missing_data.ratio>0]
missing_data
 | count | ratio | cols_type |
---|---|---|---|
PoolQC | 2899 | 99.690509 | object |
MiscFeature | 2804 | 96.423659 | object |
Alley | 2711 | 93.225585 | object |
Fence | 2338 | 80.398900 | object |
FireplaceQu | 1418 | 48.762036 | object |
LotFrontage | 484 | 16.643741 | float64 |
GarageCond | 159 | 5.467675 | object |
GarageQual | 159 | 5.467675 | object |
GarageYrBlt | 159 | 5.467675 | float64 |
GarageFinish | 159 | 5.467675 | object |
GarageType | 157 | 5.398900 | object |
BsmtCond | 82 | 2.819807 | object |
BsmtExposure | 82 | 2.819807 | object |
BsmtQual | 81 | 2.785420 | object |
BsmtFinType2 | 80 | 2.751032 | object |
BsmtFinType1 | 79 | 2.716644 | object |
MasVnrType | 24 | 0.825309 | object |
MasVnrArea | 23 | 0.790922 | float64 |
MSZoning | 4 | 0.137552 | object |
BsmtHalfBath | 2 | 0.068776 | float64 |
Utilities | 2 | 0.068776 | object |
Functional | 2 | 0.068776 | object |
BsmtFullBath | 2 | 0.068776 | float64 |
BsmtFinSF2 | 1 | 0.034388 | float64 |
BsmtFinSF1 | 1 | 0.034388 | float64 |
Exterior2nd | 1 | 0.034388 | object |
BsmtUnfSF | 1 | 0.034388 | float64 |
TotalBsmtSF | 1 | 0.034388 | float64 |
Exterior1st | 1 | 0.034388 | object |
SaleType | 1 | 0.034388 | object |
Electrical | 1 | 0.034388 | object |
KitchenQual | 1 | 0.034388 | object |
GarageArea | 1 | 0.034388 | float64 |
GarageCars | 1 | 0.034388 | float64 |
# Visualize the missing-value ratios
f,axis = plt.subplots(figsize=(15,12))
plt.xticks(rotation='90')
sns.barplot(x=missing_data.index,y=missing_data.ratio)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
The competition's feature documentation explains that some values are missing simply because the house does not have that feature at all. For those features, impute according to data type: categorical features get a new "None" category and numeric features get 0. The remaining missing values are imputed with the mode.
Fill with None
str_cols = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageType", "GarageFinish", "GarageQual", "GarageCond",
            "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "MasVnrType", "MSSubClass"]
for col in str_cols:
    all_data[col].fillna("None", inplace=True)
del str_cols, col
Fill with 0
num_cols = ["BsmtUnfSF", "TotalBsmtSF", "BsmtFinSF2", "BsmtFinSF1", "BsmtFullBath", "BsmtHalfBath",
            "MasVnrArea", "GarageCars", "GarageArea", "GarageYrBlt"]
for col in num_cols:
    all_data[col].fillna(0, inplace=True)
del num_cols, col
Fill with the mode
other_cols = ["MSZoning", "Electrical", "KitchenQual", "Exterior1st", "Exterior2nd", "SaleType"]
for col in other_cols:
    all_data[col].fillna(all_data[col].mode()[0], inplace=True)
del other_cols, col
LotFrontage
Neighboring houses on the same street tend to have similar lot frontage, so fill missing values with the median LotFrontage of each Neighborhood.
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
Utilities
Only 2 values are missing, and almost all the rest are AllPub, so the feature is nearly constant; consider dropping it.
all_data["Utilities"].isnull().sum()
2
all_data["Utilities"].value_counts()
AllPub 2905
NoSeWa 1
Name: Utilities, dtype: int64
# Drop the feature
all_data.drop(['Utilities'],axis=1,inplace=True)
Functional
all_data["Functional"] = all_data["Functional"].fillna("Typ")
# Check that no missing values remain
mask = all_data.isnull().sum() > 0
all_data.columns[mask]
Index([], dtype='object')
Encoding
- Ordinal-variable encoding
- LabelEncoder encoding
- One-hot encoding (get_dummies)
# Encode the ordinal quality variables (note: a lower code means better quality here)
def custom_coding(x):
    if x == 'Ex':
        r = 0
    elif x == 'Gd':
        r = 1
    elif x == 'TA':
        r = 2
    elif x == 'Fa':
        r = 3
    elif x == 'None':
        r = 4
    else:
        r = 5
    return r
## Apply the ordinal encoding to the quality features
cols = ['BsmtCond','BsmtQual','ExterCond','ExterQual','FireplaceQu','GarageCond','GarageQual','HeatingQC','KitchenQual','PoolQC']
for col in cols:
    all_data[col] = all_data[col].apply(custom_coding)
del cols, col
Some features are numeric in form but the numbers carry no quantitative meaning, such as years and category codes. Convert them to strings, i.e. categorical variables.
cols = ['MSSubClass', 'YrSold', 'MoSold', 'OverallCond', "MSZoning", "BsmtFullBath", "BsmtHalfBath", "HalfBath",
        "Functional", "Electrical", "KitchenQual", "KitchenAbvGr", "SaleType", "Exterior1st", "Exterior2nd", "YearBuilt",
        "YearRemodAdd", "GarageYrBlt", "BedroomAbvGr", "LowQualFinSF"]
for col in cols:
    all_data[col] = all_data[col].astype(str)
del cols, col
# Label-encode the year-like features
from sklearn.preprocessing import LabelEncoder
str_cols = ["YearBuilt", "YearRemodAdd", 'GarageYrBlt', "YrSold", 'MoSold']
for col in str_cols:
    all_data[col] = LabelEncoder().fit_transform(all_data[col])
# Label-encode the features that will be used to build meaningful new features later
lab_cols = ['Heating', 'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
            'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 'YrSold', 'MoSold',
            'MSZoning', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'Exterior1st', 'MasVnrType',
            'Foundation', 'GarageType', 'SaleType', 'SaleCondition']
for col in lab_cols:
    new_col = "labfit_" + col
    all_data[new_col] = LabelEncoder().fit_transform(all_data[col])
del col, str_cols, lab_cols, new_col
Building New Features
# Area-related features are very important for price; add a total house area feature
all_data['TotalHouseArea'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
# The time between remodeling (YearRemodAdd) and sale (YrSold) also tends to affect price
all_data['YearsSinceRemodel'] = all_data['YrSold'].astype(int) - all_data['YearRemodAdd'].astype(int)
# Overall quality of the house is another important price factor
all_data['Total_Home_Quality'] = all_data['OverallQual'].astype(int) + all_data['OverallCond'].astype(int)
The presence or absence of certain spaces also influences price; for example a house with a three-season porch (3SsnPorch) or an open porch (OpenPorchSF) may sell for more than one without. Add indicator features for these areas.
# NOTE: as written, each flag equals 1 when the area is *absent* (area == 0);
# the encoding is consistent as long as it is applied uniformly
all_data['HasWoodDeck'] = (all_data['WoodDeckSF'] == 0) * 1
all_data['HasOpenPorch'] = (all_data['OpenPorchSF'] == 0) * 1
all_data['HasEnclosedPorch'] = (all_data['EnclosedPorch'] == 0) * 1
all_data['Has3SsnPorch'] = (all_data['3SsnPorch'] == 0) * 1
all_data['HasScreenPorch'] = (all_data['ScreenPorch'] == 0) * 1
# Total house area plus garage area
all_data["TotalAllArea"] = all_data["TotalHouseArea"] + all_data["GarageArea"]
# Total house area times overall material quality
all_data["TotalHouse_and_OverallQual"] = all_data["TotalHouseArea"] * all_data["OverallQual"]
# Above-ground living area times overall quality
all_data["GrLivArea_and_OverallQual"] = all_data["GrLivArea"] * all_data["OverallQual"]
# Lot area times overall quality
all_data["LotArea_and_OverallQual"] = all_data["LotArea"] * all_data["OverallQual"]
# General zoning classification times total house area
all_data["MSZoning_and_TotalHouse"] = all_data["labfit_MSZoning"] * all_data["TotalHouseArea"]
# General zoning classification plus overall quality
all_data["MSZoning_and_OverallQual"] = all_data["labfit_MSZoning"] + all_data["OverallQual"]
# General zoning classification plus year built
all_data["MSZoning_and_YearBuilt"] = all_data["labfit_MSZoning"] + all_data["YearBuilt"]
# Neighborhood times total house area
all_data["Neighborhood_and_TotalHouse"] = all_data["labfit_Neighborhood"] * all_data["TotalHouseArea"]
all_data["Neighborhood_and_OverallQual"] = all_data["labfit_Neighborhood"] + all_data["OverallQual"]
all_data["Neighborhood_and_YearBuilt"] = all_data["labfit_Neighborhood"] + all_data["YearBuilt"]
# Type 1 finished basement area times overall quality
all_data["BsmtFinSF1_and_OverallQual"] = all_data["BsmtFinSF1"] * all_data["OverallQual"]
## Home functionality rating times total house area
all_data["Functional_and_TotalHouse"] = all_data["labfit_Functional"] * all_data["TotalHouseArea"]
all_data["Functional_and_OverallQual"] = all_data["labfit_Functional"] + all_data["OverallQual"]
all_data["TotalHouse_and_LotArea"] = all_data["TotalHouseArea"] + all_data["LotArea"]
# Proximity to main road or railroad (Condition1) times total house area
all_data["Condition1_and_TotalHouse"] = all_data["labfit_Condition1"] * all_data["TotalHouseArea"]
all_data["Condition1_and_OverallQual"] = all_data["labfit_Condition1"] + all_data["OverallQual"]
# Total basement-related area
all_data["Bsmt"] = all_data["BsmtFinSF1"] + all_data["BsmtFinSF2"] + all_data["BsmtUnfSF"]
# Above-ground full bathrooms plus total rooms above ground
all_data["Rooms"] = all_data["FullBath"] + all_data["TotRmsAbvGrd"]
# Total porch area: open porch, enclosed porch, three-season porch, screen porch
all_data["PorchArea"] = all_data["OpenPorchSF"] + all_data["EnclosedPorch"] + all_data["3SsnPorch"] + all_data["ScreenPorch"]
## Total area of all functional spaces (house, basement, garage, porches, etc.)
all_data["TotalPlace"] = all_data["TotalAllArea"] + all_data["PorchArea"]
Log Transformation
Apply a log transform to the numeric features whose absolute skewness exceeds 0.75, correcting non-normal data toward an approximately normal distribution to better satisfy the assumptions of linear models.
Why transform the distribution of the original numeric features?
- Transformation can make relationships in the data easier to discover, turning apparent non-relationships into usable ones, so the model can exploit the data better;
- Many features are severely skewed (e.g. many small values clustered together); transformation spreads them apart;
- It helps the data satisfy the assumptions the model theory requires, e.g. approximate normality after transformation.
Common transforms include the log transform and the Box-Cox transform. The log transform is the most widely used: taking logs does not change the nature of the data or the relationships between variables, but it compresses the scale, which greatly simplifies computation.
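As a quick self-contained illustration of the effect (on synthetic lognormal data, not the competition features), `log1p` pulls a heavily right-skewed sample back toward symmetry:

```python
# Demonstrate skewness reduction via log1p on synthetic right-skewed data.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=8.0, sigma=1.0, size=1000)  # right-skewed, LotArea-like scale

skew_before = skew(x)
skew_after = skew(np.log1p(x))
print(round(skew_before, 2))  # well above the 0.75/1.0 thresholds used in this notebook
print(round(skew_after, 2))   # near zero: approximately normal after the transform
```

The same comparison can be run per column on `all_data` to sanity-check which features actually benefit from the transform.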
from scipy.stats import norm, skew
# Compute the skewness of every numeric feature
num_features = all_data.select_dtypes(include=['int64','float64','int32']).copy()
num_feature_names = list(num_features.columns)
skewed_feats = all_data[num_feature_names].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness[skewness["Skew"].abs()>0.75]
 | Skew |
---|---|
MiscVal | 21.915535 |
PoolArea | 17.661095 |
LotArea | 13.334935 |
labfit_Condition2 | 12.437736 |
TotalHouse_and_LotArea | 12.380094 |
labfit_Heating | 12.136394 |
LotArea_and_OverallQual | 11.799484 |
3SsnPorch | 11.354131 |
labfit_LandSlope | 5.009358 |
BsmtFinSF2 | 4.137116 |
EnclosedPorch | 4.005089 |
ScreenPorch | 3.926054 |
GarageCond | 3.153395 |
labfit_Condition1 | 3.005668 |
GarageQual | 2.863557 |
MasVnrArea | 2.619878 |
Condition1_and_TotalHouse | 2.544979 |
BsmtCond | 2.542349 |
OpenPorchSF | 2.493685 |
PorchArea | 2.232411 |
labfit_BldgType | 2.186631 |
BsmtFinSF1_and_OverallQual | 2.017572 |
WoodDeckSF | 1.852261 |
TotalHouse_and_OverallQual | 1.615116 |
GrLivArea_and_OverallQual | 1.485190 |
1stFlrSF | 1.264660 |
LotFrontage | 1.106714 |
GrLivArea | 1.048644 |
TotalHouseArea | 1.012116 |
BsmtFinSF1 | 0.982488 |
BsmtUnfSF | 0.919524 |
TotalAllArea | 0.891388 |
TotalPlace | 0.887892 |
2ndFlrSF | 0.853227 |
Neighborhood_and_TotalHouse | 0.852391 |
ExterQual | -0.784824 |
ExterCond | -0.838720 |
Functional_and_OverallQual | -0.920453 |
labfit_BsmtExposure | -1.116930 |
labfit_MSZoning | -1.745237 |
HasEnclosedPorch | -1.880501 |
labfit_Fence | -1.990335 |
labfit_SaleCondition | -2.785113 |
HasScreenPorch | -2.915483 |
labfit_PavedDrive | -2.979584 |
labfit_BsmtFinType2 | -3.036904 |
labfit_CentralAir | -3.461892 |
labfit_SaleType | -3.737598 |
labfit_Functional | -4.062504 |
Has3SsnPorch | -8.695256 |
labfit_Street | -16.166862 |
PoolQC | -20.309793 |
Set the threshold to 1 and apply a log transform to the features whose absolute skewness exceeds it.
skew_cols = list(skewness[skewness["Skew"].abs() > 1].index)
for col in skew_cols:
    # Alternative: Box-Cox transform for the highly skewed features
    # all_data[col] = boxcox1p(all_data[col], 0.15)
    # Log transform for the highly skewed features
    all_data[col] = np.log1p(all_data[col])
# Inspect the remaining string-typed features
all_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2908 entries, 0 to 2907
Columns: 135 entries, MSSubClass to TotalPlace
dtypes: float64(54), int64(40), object(41)
memory usage: 3.0+ MB
# One-hot encode the remaining string features into numeric columns
all_data = pd.get_dummies(all_data)
all_data.head()
 | LotFrontage | LotArea | OverallQual | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | ExterCond | BsmtQual | BsmtCond | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.189655 | 9.042040 | 7 | 109 | 53 | 5.283204 | 1 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 4.394449 | 9.169623 | 6 | 82 | 26 | 0.000000 | 2 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 4.234107 | 9.328212 | 7 | 107 | 52 | 5.093750 | 1 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 4.110874 | 9.164401 | 7 | 24 | 20 | 0.000000 | 2 | 2 | 2 | 0.693147 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 4.442651 | 9.565284 | 8 | 106 | 50 | 5.860786 | 1 | 2 | 1 | 1.098612 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 393 columns
all_data.info()
all_data.shape
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2908 entries, 0 to 2907
Columns: 393 entries, LotFrontage to SaleCondition_Partial
dtypes: float64(54), int64(40), uint8(299)
memory usage: 2.9 MB
(2908, 393)
Dimensionality Reduction
We have only 2908 rows of data but 393 feature columns. With this many features, use Lasso here to reduce the dimensionality.
# Split back into training and test sets
def split_data(all_data, train_index):
    cols = list(all_data.columns)
    # Replace infinities produced during feature engineering with the column median
    for col in cols:
        all_data[col].values[np.isinf(all_data[col].values)] = all_data[col].median()
    del cols, col
    train_data = all_data[:train_index]
    test_data = all_data[train_index:]
    return train_data, test_data
train_X,test_X = split_data(all_data,train_num)
train_X.shape,test_X.shape,train_y.shape
((1449, 393), (1459, 393), (1449,))
1. Standardization robust to outliers
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
# Fit on and transform the training features
train_X = scaler.fit_transform(train_X)
# Transform the test features with the same scaler
test_X = scaler.transform(test_X)
2. Modeling
from sklearn.linear_model import Lasso
lasso_model = Lasso(alpha=0.001)
lasso_model.fit(train_X,train_y)
Lasso(alpha=0.001, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)
# Set the displayed value width to 100 (default 50)
pd.set_option('max_colwidth', 100)
# Feature importances (Lasso coefficients) indexed by feature name
FI_lasso = pd.DataFrame({"Feature Importance": lasso_model.coef_},
                        index=all_data.columns)
# Sort from high to low
FI_lasso.sort_values("Feature Importance", ascending=False).round(5).head(10)
 | Feature Importance |
---|---|
Neighborhood_Crawfor | 0.09052 |
Total_Home_Quality | 0.08677 |
TotalPlace | 0.07877 |
GrLivArea | 0.06999 |
KitchenQual_0 | 0.05483 |
Functional_and_TotalHouse | 0.04605 |
labfit_SaleCondition | 0.04488 |
Exterior1st_BrkFace | 0.04458 |
YearBuilt | 0.03844 |
MSZoning_and_YearBuilt | 0.03626 |
3. Visualization
# Features with non-zero coefficients
FI_lasso = FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance")
FI_lasso.plot(kind="barh", figsize=(12, 40), color='g')
plt.xticks(rotation=90)
display(FI_lasso.shape)
4. Feature Selection
# Keep only the selected features
choose_cols = FI_lasso.index.tolist()
choose_data = all_data[choose_cols].copy()
choose_data.shape
(2908, 86)
Model Building
# Split the dataset
train_X, test_X = choose_data[:train_num], choose_data[train_num:]
# Robust standardization
scaler = RobustScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)
train_X.shape,test_X.shape,train_y.shape
((1449, 86), (1459, 86), (1449,))
# Models
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LinearRegression
from sklearn.linear_model import ElasticNet, ElasticNetCV, SGDRegressor, BayesianRidge
from sklearn.svm import SVR, LinearSVR
from mlxtend.regressor import StackingCVRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
# Misc
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
# 12-fold cross-validation
kf = KFold(n_splits=12,random_state=42,shuffle=True)
Scoring Functions
# Root mean squared error
def rmse(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))

def cv_rmse(model, X, y):
    # sklearn's neg_mean_squared_error returns the negated MSE, so flip the sign before the square root
    rmse = np.sqrt(-cross_val_score(model, X, y,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
Principal Component Analysis
The newly constructed features are correlated with the original features they were built from, which can introduce strong multicollinearity; apply PCA to decorrelate them.
pca_model = PCA(n_components=60)
train_X = pca_model.fit_transform(train_X)
test_X = pca_model.transform(test_X)
Ensemble Learning
For a given problem we can build several models, each explaining the internal structure of the data from a different angle, and fuse them to obtain a better solution. In ensemble learning the task is to construct individual learners and combine them well. The difficulty is that models trained on the same training set tend to be highly correlated, whereas we want them to be different, so that they can cover each other's weaknesses and achieve a better result.
- Bagging works on the data side: it draws bootstrap samples (sampling with replacement) from the training set and trains an individual learner on each sample.
- Boosting trains individual learners sequentially: after each learner, the weights of the samples it got wrong are increased and the weights of those it got right are decreased, and the next learner is trained on the reweighted data, repeating until the error rate falls below a threshold.
Stacking and Blending are model-combination methods (distinct from bagging) that differ in how they fuse the individual learners: stacking fits a meta-learner, possibly non-linear, on out-of-fold predictions, while blending combines predictions linearly on a holdout set.
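A minimal stacking sketch, using scikit-learn's `StackingRegressor` on synthetic data purely for illustration (the notebook itself imports mlxtend's `StackingCVRegressor`, which follows the same idea): base models are fit on folds, and a meta-learner is fit on their out-of-fold predictions.

```python
# Stacking sketch: Lasso and Ridge base models, Ridge meta-learner
# trained on their out-of-fold predictions (synthetic data, illustration only).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=42)

stack = StackingRegressor(
    estimators=[("lasso", Lasso(alpha=0.1)), ("ridge", Ridge(alpha=1.0))],
    final_estimator=Ridge(),  # combines the base models' out-of-fold predictions
    cv=5,
)
score = cross_val_score(stack, X, y, scoring="r2", cv=5).mean()
print(round(score, 3))  # close to 1 on this easy linear problem
```

The meta-learner here is linear; swapping in a tree-based `final_estimator` gives the non-linear fusion that distinguishes stacking from simple blending.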
# Grid search helper
def get_best_model_and_accuracy(model, params, X, y):
    # error_score=0.: score a parameter combination as 0 if it raises an error
    grid = GridSearchCV(model, params, scoring='neg_mean_squared_error',
                        cv=5, n_jobs=-1, error_score=0.)
    grid.fit(X, y)  # fit over the parameter grid
    # Best cross-validated RMSE
    print("Best Score: {}".format(np.sqrt(-grid.best_score_)))
    # Parameters that achieved the best score
    print("Best Parameters: {}".format(grid.best_params_))
    # Average fit time in seconds
    print("Average Time to Fit (s): {}".format(round(grid.cv_results_['mean_fit_time'].mean(), 3)))
    # Average scoring time in seconds;
    # a rough indicator of how the model would perform in production
    print("Average Time to Score (s): {}".format(round(grid.cv_results_['mean_score_time'].mean(), 3)))
    grid.cv_results_['mean_test_score'] = np.sqrt(-grid.cv_results_['mean_test_score'])
    # Print every parameter combination with its mean and std test score
    print(pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'std_test_score']])
    return grid
Lasso
param_Lasso = {'alpha': [0.0004, 0.0005, 0.0006],
               'max_iter': [10000], 'random_state': [1]}
Lasso_grid = get_best_model_and_accuracy(Lasso(), param_Lasso, train_X, train_y)
Best Score: 0.11233809637926326
Best Parameters: {'alpha': 0.0004, 'max_iter': 10000, 'random_state': 1}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.0
params mean_test_score \
0 {'alpha': 0.0004, 'max_iter': 10000, 'random_state': 1} 0.112338
1 {'alpha': 0.0005, 'max_iter': 10000, 'random_state': 1} 0.112341
2 {'alpha': 0.0006, 'max_iter': 10000, 'random_state': 1} 0.112416
std_test_score
0 0.000861
1 0.000884
2 0.000907
Ridge
param_Ridge = {'alpha': [35, 40, 45, 50, 55]}
Ridge_grid = get_best_model_and_accuracy(Ridge(), param_Ridge, train_X, train_y)
Best Score: 0.11201108834987004
Best Parameters: {'alpha': 35}
Average Time to Fit (s): 0.001
Average Time to Score (s): 0.0
params mean_test_score std_test_score
0 {'alpha': 35} 0.112011 0.000953
1 {'alpha': 40} 0.112035 0.000967
2 {'alpha': 45} 0.112073 0.000980
3 {'alpha': 50} 0.112122 0.000991
4 {'alpha': 55} 0.112180 0.001001
SVR
param_SVR = {'C': [11, 12, 13, 14, 15], 'kernel': ["rbf"], "gamma": [0.0003, 0.0004],
             "epsilon": [0.008, 0.009]}
SVR_grid = get_best_model_and_accuracy(SVR(), param_SVR, train_X, train_y)
Best Score: 0.11185206657627142
Best Parameters: {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
Average Time to Fit (s): 0.317
Average Time to Score (s): 0.044
params \
0 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
1 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
2 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
3 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
4 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
5 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
6 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
7 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
8 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
9 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
10 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
11 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
12 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
13 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
14 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
15 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
16 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
17 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
18 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
19 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
mean_test_score std_test_score
0 0.112221 0.001143
1 0.111954 0.001126
2 0.112240 0.001131
3 0.112010 0.001115
4 0.112148 0.001147
5 0.111916 0.001105
6 0.112193 0.001135
7 0.111954 0.001103
8 0.112077 0.001141
9 0.111902 0.001092
10 0.112097 0.001137
11 0.111994 0.001098
12 0.112045 0.001135
13 0.111888 0.001081
14 0.112054 0.001127
15 0.111958 0.001082
16 0.112021 0.001123
17 0.111852 0.001068
18 0.112056 0.001113
19 0.111902 0.001071
KernelRidge
param_KernelRidge = {'alpha': [0.3, 0.4, 0.5], 'kernel': ["polynomial"],
                     'degree': [3], 'coef0': [0.8, 1, 1.2]}
KernelRidge_grid = get_best_model_and_accuracy(KernelRidge(), param_KernelRidge, train_X, train_y)
Best Score: 0.12053877269961878
Best Parameters: {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
Average Time to Fit (s): 0.207
Average Time to Score (s): 0.037
params \
0 {'alpha': 0.3, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
1 {'alpha': 0.3, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
2 {'alpha': 0.3, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
3 {'alpha': 0.4, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
4 {'alpha': 0.4, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
5 {'alpha': 0.4, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
6 {'alpha': 0.5, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
7 {'alpha': 0.5, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
8 {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
mean_test_score std_test_score
0 0.131492 0.001534
1 0.124723 0.001179
2 0.123360 0.001052
3 0.132097 0.001687
4 0.123652 0.001257
5 0.121633 0.001096
6 0.133186 0.001837
7 0.123168 0.001331
8 0.120539 0.001138
ElasticNet
ElasticNet can be seen as a middle ground between Lasso and Ridge. It also regularizes ordinary linear regression, but its loss function is neither purely L1- nor purely L2-regularized; instead a weight parameter ρ balances the proportions of L1 and L2 regularization.
(Reference: Ridge regression, Lasso regression and ElasticNet regression)
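The weight parameter ρ (l1_ratio in scikit-learn) can be checked at its endpoint: with l1_ratio=1 the penalty is pure L1 and ElasticNet reduces to Lasso (toy data; alpha chosen arbitrarily):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 3.0]) + 0.01 * rng.randn(100)

# l1_ratio=1.0 makes the ElasticNet penalty pure L1, i.e. exactly Lasso's objective
enet = ElasticNet(alpha=0.01, l1_ratio=1.0, max_iter=10000).fit(X, y)
lasso = Lasso(alpha=0.01, max_iter=10000).fit(X, y)
```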
param_ElasticNet = {'alpha': [0.0008, 0.004, 0.005], 'l1_ratio': [0.08, 0.1, 0.3, 0.5],
                    'max_iter': [10000], 'random_state': [3]}
ElasticNet_grid = get_best_model_and_accuracy(ElasticNet(), param_ElasticNet, train_X, train_y)
Best Score: 0.11223819703859092
Best Parameters: {'alpha': 0.005, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
Average Time to Fit (s): 0.001
Average Time to Score (s): 0.0
params \
0 {'alpha': 0.0008, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
1 {'alpha': 0.0008, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
2 {'alpha': 0.0008, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
3 {'alpha': 0.0008, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
4 {'alpha': 0.004, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
5 {'alpha': 0.004, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
6 {'alpha': 0.004, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
7 {'alpha': 0.004, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
8 {'alpha': 0.005, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
9 {'alpha': 0.005, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
10 {'alpha': 0.005, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
11 {'alpha': 0.005, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
mean_test_score std_test_score
0 0.112599 0.000791
1 0.112573 0.000795
2 0.112379 0.000828
3 0.112327 0.000865
4 0.112244 0.000872
5 0.112254 0.000888
6 0.113251 0.001022
7 0.114522 0.001099
8 0.112238 0.000895
9 0.112282 0.000914
10 0.113737 0.001056
11 0.115224 0.001138
bay = BayesianRidge()
xgb = XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,subsample=0.5213,
silent=1,random_state =7, nthread = -1)
lgbm = LGBMRegressor(objective='regression',num_leaves=5,learning_rate=0.05,
n_estimators=700,max_bin = 55,
bagging_fraction = 0.8,bagging_freq = 5,
feature_fraction = 0.25,feature_fraction_seed=9,
bagging_seed=9,min_data_in_leaf = 6, min_sum_hessian_in_leaf = 11)
GBR = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
max_depth=4, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10,
loss='huber', random_state =5)
Stacking
Stacking uses two layers of learners: the first layer contains several weak learners that each make predictions, which are passed to the second layer; the second-layer learner then predicts from those first-layer predictions.
Stacking overfits easily, so training uses K-Fold:
- Split the training set into 5 folds and iterate 5 times. In each iteration, train every base model on 4 folds and predict on the held-out fold; also save each base model's predictions on the test set. After the 5-fold loop we obtain a matrix of shape (training rows) x (number of models): during CV each base model predicts every hold-out fold once, which together covers the whole training set. This matrix is the second layer's training data, producing model B.
- Average each base model's saved test-set predictions into a matrix of shape (test rows) x (number of models): each base model predicts the test set once per fold, so its 5 sets of predictions are averaged before assembling the matrix.
- Model B predicts on this test-set matrix to produce the final result.
class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod                # first-layer learners
        self.meta_model = meta_model  # second-layer (meta) learner
        # number of folds used to build the stacked features
        self.k = 5
        self.kf = KFold(n_splits=self.k, random_state=42, shuffle=True)
    # fit on the training data
    def fit(self, X, y):
        # self.saved_model holds the fitted copies of every first-layer learner
        self.saved_model = [list() for i in self.mod]
        # shape: (training rows) x (number of models)
        oof_train = np.zeros((X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):  # index and model
            # indices of the train/validation split for each fold
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)  # fresh copy of the model
                # train on this fold's training portion
                renew_model.fit(X[train_index], y[train_index])
                # keep the fitted model
                self.saved_model[i].append(renew_model)
                # save the model's out-of-fold predictions
                oof_train[val_index, i] = renew_model.predict(X[val_index])
        # train the meta-learner, using only the first-layer predictions as features
        self.meta_model.fit(oof_train, y)
        return self
    # predict on test data
    def predict(self, X):
        # first-layer predictions for the whole test set;
        # np.column_stack joins columns, mean(axis=1) averages across folds
        whole_test = np.column_stack([
            np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
            for single_model in self.saved_model])
        # final prediction: the meta-learner applied to the first-layer predictions
        return self.meta_model.predict(whole_test)
    ## build the stacked features from the first-layer results
    def get_oof(self, X, y, test_X):
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], self.k))
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                # save out-of-fold predictions
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            # average each model's K-Fold test-set predictions
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean
lasso = Lasso_grid.best_estimator_
ridge = Ridge_grid.best_estimator_
svr = SVR_grid.best_estimator_
ker = KernelRidge_grid.best_estimator_
ela = ElasticNet_grid.best_estimator_
stack_model = stacking(mod=[bay,lasso,ridge,svr,ker,ela], meta_model=ker)
# cross-validated RMSE on the training set
score = cv_rmse(stack_model,train_X,train_y)
display(score.mean())
0.10746634249868159
# build the features for the second-layer learner
x_train_stack, x_test_stack = stack_model.get_oof(train_X,train_y,test_X)
train_X.shape,train_y.shape,test_X.shape
((1449, 60), (1449,), (1459, 60))
# 6 models in the first layer
x_train_stack.shape, x_test_stack.shape
((1449, 6), (1459, 6))
In stacking, once the base models' predictions are obtained, training the meta-model on them is normally enough. In this case, however, we also try concatenating the first-layer stacked features with the original features, then feeding the combined features back into stacking for training.
# concatenate the stacking features with the original features, horizontally
x_train_add = np.hstack((train_X,x_train_stack))
x_test_add = np.hstack((test_X,x_test_stack))
x_train_add.shape,x_test_add.shape
((1449, 66), (1459, 66))
# accuracy with the concatenated features: slightly improved
score = cv_rmse(stack_model,x_train_add,train_y)
print(score.mean())
0.10195220877304757
Retrain via stacking on x_train_add, train_y and x_test_add
param_Lasso = {'alpha': [0.0004, 0.0005, 0.0006],
               'max_iter': [10000], 'random_state': [1]}
Lasso_2 = get_best_model_and_accuracy(Lasso(), param_Lasso, x_train_add, train_y)
Best Score: 0.11162310214215297
Best Parameters: {'alpha': 0.0005, 'max_iter': 10000, 'random_state': 1}
Average Time to Fit (s): 0.009
Average Time to Score (s): 0.0
params mean_test_score \
0 {'alpha': 0.0004, 'max_iter': 10000, 'random_state': 1} 0.111637
1 {'alpha': 0.0005, 'max_iter': 10000, 'random_state': 1} 0.111623
2 {'alpha': 0.0006, 'max_iter': 10000, 'random_state': 1} 0.111662
std_test_score
0 0.000880
1 0.000896
2 0.000909
param_Ridge = {'alpha': [35, 40, 45, 50, 55]}
Ridge_2 = get_best_model_and_accuracy(Ridge(), param_Ridge, x_train_add, train_y)
Best Score: 0.1118608032209135
Best Parameters: {'alpha': 35}
Average Time to Fit (s): 0.002
Average Time to Score (s): 0.0
params mean_test_score std_test_score
0 {'alpha': 35} 0.111861 0.000949
1 {'alpha': 40} 0.111892 0.000962
2 {'alpha': 45} 0.111924 0.000973
3 {'alpha': 50} 0.111960 0.000983
4 {'alpha': 55} 0.111999 0.000992
param_SVR = {'C': [11, 12, 13, 14, 15], 'kernel': ["rbf"], "gamma": [0.0003, 0.0004],
             "epsilon": [0.008, 0.009]}
SVR_2 = get_best_model_and_accuracy(SVR(), param_SVR, x_train_add, train_y)
Best Score: 0.11187202151025108
Best Parameters: {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
Average Time to Fit (s): 0.316
Average Time to Score (s): 0.044
params \
0 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
1 {'C': 11, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
2 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
3 {'C': 11, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
4 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
5 {'C': 12, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
6 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
7 {'C': 12, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
8 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
9 {'C': 13, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
10 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
11 {'C': 13, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
12 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
13 {'C': 14, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
14 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
15 {'C': 14, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
16 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0003, 'kernel': 'rbf'}
17 {'C': 15, 'epsilon': 0.008, 'gamma': 0.0004, 'kernel': 'rbf'}
18 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0003, 'kernel': 'rbf'}
19 {'C': 15, 'epsilon': 0.009, 'gamma': 0.0004, 'kernel': 'rbf'}
mean_test_score std_test_score
0 0.112114 0.001168
1 0.111980 0.001131
2 0.112167 0.001164
3 0.112013 0.001132
4 0.112075 0.001161
5 0.111909 0.001112
6 0.112136 0.001158
7 0.111960 0.001113
8 0.112050 0.001159
9 0.111898 0.001082
10 0.112133 0.001152
11 0.111930 0.001096
12 0.112024 0.001159
13 0.111873 0.001057
14 0.112087 0.001149
15 0.111928 0.001074
16 0.111989 0.001150
17 0.111872 0.001046
18 0.112041 0.001143
19 0.111910 0.001060
param_KernelRidge = {'alpha': [0.3, 0.4, 0.5], 'kernel': ["polynomial"],
                     'degree': [3], 'coef0': [0.8, 1, 1.2]}
KernelRidge_2 = get_best_model_and_accuracy(KernelRidge(), param_KernelRidge, x_train_add, train_y)
Best Score: 0.11754411372302964
Best Parameters: {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
Average Time to Fit (s): 0.184
Average Time to Score (s): 0.032
params \
0 {'alpha': 0.3, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
1 {'alpha': 0.3, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
2 {'alpha': 0.3, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
3 {'alpha': 0.4, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
4 {'alpha': 0.4, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
5 {'alpha': 0.4, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
6 {'alpha': 0.5, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
7 {'alpha': 0.5, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
8 {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
mean_test_score std_test_score
0 0.121835 0.002417
1 0.119743 0.002347
2 0.118019 0.002291
3 0.121416 0.002253
4 0.119359 0.002201
5 0.117628 0.002159
6 0.121293 0.002123
7 0.119272 0.002083
8 0.117544 0.002051
param_ElasticNet = {'alpha': [0.0008, 0.004, 0.005], 'l1_ratio': [0.08, 0.1, 0.3, 0.5],
                    'max_iter': [10000], 'random_state': [3]}
ElasticNet_2 = get_best_model_and_accuracy(ElasticNet(), param_ElasticNet, x_train_add, train_y)
Best Score: 0.10667612140906058
Best Parameters: {'alpha': 0.0008, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
Average Time to Fit (s): 0.025
Average Time to Score (s): 0.0
params \
0 {'alpha': 0.0008, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
1 {'alpha': 0.0008, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
2 {'alpha': 0.0008, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
3 {'alpha': 0.0008, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
4 {'alpha': 0.004, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
5 {'alpha': 0.004, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
6 {'alpha': 0.004, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
7 {'alpha': 0.004, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
8 {'alpha': 0.005, 'l1_ratio': 0.08, 'max_iter': 10000, 'random_state': 3}
9 {'alpha': 0.005, 'l1_ratio': 0.1, 'max_iter': 10000, 'random_state': 3}
10 {'alpha': 0.005, 'l1_ratio': 0.3, 'max_iter': 10000, 'random_state': 3}
11 {'alpha': 0.005, 'l1_ratio': 0.5, 'max_iter': 10000, 'random_state': 3}
mean_test_score std_test_score
0 0.106676 0.000741
1 0.107021 0.000758
2 0.111335 0.000889
3 0.111619 0.000880
4 0.111584 0.000877
5 0.111586 0.000891
6 0.112205 0.001007
7 0.113027 0.001072
8 0.111594 0.000896
9 0.111623 0.000914
10 0.112603 0.001041
11 0.113622 0.001111
bay_2 = BayesianRidge()
xgb_2 = XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,learning_rate=0.05,
max_depth=3,min_child_weight=1.7817, n_estimators=2200,reg_alpha=0.4640,
reg_lambda=0.8571,subsample=0.5213, silent=1,random_state =7, nthread = -1)
lgbm_2 = LGBMRegressor(objective='regression',num_leaves=5,learning_rate=0.05,
n_estimators=700,max_bin = 55,bagging_fraction = 0.8,bagging_freq = 5,
feature_fraction = 0.25,feature_fraction_seed=9,
bagging_seed=9,min_data_in_leaf = 6,
min_sum_hessian_in_leaf = 11)
GBR_2 = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,max_depth=4,
max_features='sqrt',min_samples_leaf=15,
min_samples_split=10,
loss='huber',
random_state =5)
lasso_2 = Lasso_2.best_estimator_
ridge_2 = Ridge_2.best_estimator_
svr_2 = SVR_2.best_estimator_
ker_2 = KernelRidge_2.best_estimator_
ela_2 = ElasticNet_2.best_estimator_
stack_model_2 = stacking(mod=[bay_2,lasso_2,ridge_2,svr_2,ker_2,ela_2], meta_model=ker_2)
last_x_train_stack, last_x_test_stack = stack_model_2.get_oof(x_train_add,train_y,x_test_add)
last_x_train_stack.shape, last_x_test_stack.shape
((1449, 6), (1459, 6))
Parameter search for the second-layer KernelRidge model
param_ker = {'alpha': [0.2, 0.3, 0.4, 0.5], 'kernel': ["polynomial"],
             'degree': [3, 4], 'coef0': [0.8, 1, 1.2]}
Ker_stack_model = get_best_model_and_accuracy(KernelRidge(), param_ker,
                                              last_x_train_stack, train_y).best_estimator_
Best Score: 0.08808555947636867
Best Parameters: {'alpha': 0.2, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
Average Time to Fit (s): 0.186
Average Time to Score (s): 0.03
params \
0 {'alpha': 0.2, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
1 {'alpha': 0.2, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
2 {'alpha': 0.2, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
3 {'alpha': 0.2, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
4 {'alpha': 0.2, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
5 {'alpha': 0.2, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
6 {'alpha': 0.3, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
7 {'alpha': 0.3, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
8 {'alpha': 0.3, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
9 {'alpha': 0.3, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
10 {'alpha': 0.3, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
11 {'alpha': 0.3, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
12 {'alpha': 0.4, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
13 {'alpha': 0.4, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
14 {'alpha': 0.4, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
15 {'alpha': 0.4, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
16 {'alpha': 0.4, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
17 {'alpha': 0.4, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
18 {'alpha': 0.5, 'coef0': 0.8, 'degree': 3, 'kernel': 'polynomial'}
19 {'alpha': 0.5, 'coef0': 0.8, 'degree': 4, 'kernel': 'polynomial'}
20 {'alpha': 0.5, 'coef0': 1, 'degree': 3, 'kernel': 'polynomial'}
21 {'alpha': 0.5, 'coef0': 1, 'degree': 4, 'kernel': 'polynomial'}
22 {'alpha': 0.5, 'coef0': 1.2, 'degree': 3, 'kernel': 'polynomial'}
23 {'alpha': 0.5, 'coef0': 1.2, 'degree': 4, 'kernel': 'polynomial'}
mean_test_score std_test_score
0 0.089836 0.000473
1 0.088086 0.000600
2 0.089773 0.000480
3 0.088102 0.000599
4 0.089749 0.000485
5 0.088118 0.000599
6 0.090043 0.000456
7 0.088470 0.000586
8 0.089935 0.000462
9 0.088486 0.000586
10 0.089885 0.000468
11 0.088501 0.000586
12 0.090214 0.000443
13 0.088732 0.000579
14 0.090070 0.000449
15 0.088748 0.000580
16 0.089994 0.000455
17 0.088762 0.000581
18 0.090358 0.000434
19 0.088920 0.000576
20 0.090184 0.000439
21 0.088936 0.000577
22 0.090086 0.000445
23 0.088949 0.000578
cv_rmse(Ker_stack_model,last_x_train_stack,train_y).mean()
0.08791312508608311
# note: the target was log-transformed (log1p) earlier, so invert with expm1
y_pred_stack = np.expm1(Ker_stack_model.predict(last_x_test_stack))
Alternatively, the stacking class's own fit/predict methods can be used directly:
stack_model = stacking(mod=[lgbm,ela,svr,ridge,lasso,bay,xgb,GBR,ker],
meta_model=KernelRidge(alpha=0.2 ,kernel='polynomial',
degree=4, coef0=0.8))
stack_model.fit(x_train_add,train_y)
y_pred_stack_2 = np.expm1(stack_model.predict(x_test_add))  # expm1 inverts the earlier log1p transform
XGBoost modeling and prediction
xgb.fit(last_x_train_stack,train_y)
y_pred_xgb = np.expm1(xgb.predict(last_x_test_stack))
# cross-validation
cv_rmse(xgb,x_train_stack,train_y).mean()
0.1139198877562616
# training-set error
y_train_xgb = xgb.predict(last_x_train_stack)
rmse(y_train_xgb,train_y)
0.08778404527191365
LightGBM modeling and prediction
lgbm.fit(last_x_train_stack,train_y)
y_pred_lgbm = np.expm1(lgbm.predict(last_x_test_stack))
cv_rmse(lgbm,x_train_stack,train_y).mean()
0.1161628433489873
y_train_lgbm = lgbm.predict(last_x_train_stack)
rmse(y_train_lgbm,train_y)
0.10937253913955777
# model fusion: weighted average of the three predictions
y_pred = (0.7*y_pred_stack)+(0.15*y_pred_xgb)+(0.15*y_pred_lgbm)
submission = pd.read_csv("/home/aistudio/data/data32288/submission.csv")
submission.shape,y_pred.shape
((1459, 2), (1459,))
submission.iloc[:,1] = y_pred
submission.to_csv(r'./house_submission.csv',index=False)
submission.head()
     Id      SalePrice
0  1461  119962.721230
1  1462  161987.446003
2  1463  188901.912081
3  1464  194701.643631
4  1465  194480.370160
Blending
Blending differs from Stacking mainly in that the second-stage features are not out-of-fold predictions from K-Fold CV; instead a hold-out set is created, and the second-stage stacker model is fitted on the first-stage models' predictions for that hold-out set. In other words, the K-Fold CV in the Stacking workflow is replaced by hold-out CV.
Steps:
- Split the original training set into two parts, e.g. 70% as the training set and the remaining 30% as the validation set. First round of training: train several models on the 70%, predict the labels of the 30% validation data to get pre_val_set, and also predict the test set with these models to get pre_test_set.
- Second round of training: train the second-layer model, Model B, using pre_val_set as new features.
- Use Model B to predict on pre_test_set to obtain the final result.
Advantages of Blending:
- Simpler than stacking (no k rounds of cross-validation to build the stacker features)
- Avoids an information-leak problem: the generalizers and the stacker use different data
- No need to share random seeds with teammates during team modeling
Disadvantages of Blending:
- Uses only a small portion of the data
- The blender may overfit (most likely a consequence of the first point)
- Stacking's repeated CV is more robust
from sklearn.model_selection import train_test_split
# individual models used in the fusion
clfs = [BayesianRidge(), Lasso(), Ridge(), SVR(), KernelRidge(), ElasticNet()]
# split the training data into train and validation parts
X_train, X_val, y_train, y_val = train_test_split(train_X, train_y, test_size=0.33, random_state=1855)
dataset_val = np.zeros((X_val.shape[0], len(clfs)))    # predictions on the validation set
dataset_test = np.zeros((test_X.shape[0], len(clfs)))  # predictions on the test set
# train each individual model in turn
for j, clf in enumerate(clfs):
    # train on X_train; the validation predictions become the second stage's new features
    clf.fit(X_train, y_train)
    dataset_val[:, j] = clf.predict(X_val)
    # for the test set, these models' predictions are used directly as new features
    dataset_test[:, j] = clf.predict(test_X)
# model used for the fusion
clf = XGBRegressor()
clf.fit(dataset_val, y_val)
# note: the target was log-transformed earlier
y_submission = np.expm1(clf.predict(dataset_test))
cv_rmse(clf,train_X,train_y).mean()
0.14310972129182878
y_submission
array([122274.41, 142203.67, 176042.67, ..., 164987.31, 107128.92,
250321.12], dtype=float32)
y_pred_stack
array([118603.60717676, 162614.48976635, 190387.78002988, ...,
179561.60366542, 117042.61233382, 223750.10906997])
Visualizing model prediction accuracy
# uses the mlxtend package
from mlxtend.regressor import StackingCVRegressor
stack_gen = StackingCVRegressor(regressors=(lgbm, ela, svr, ridge, lasso, bay, xgb, GBR, ker),
                                meta_regressor=ker,
                                # the meta-regressor is trained on the base regressors' predictions plus the original dataset
                                use_features_in_secondary=True)
Get each model's cross-validation score
scores = {}
score = cv_rmse(lgbm,train_X,train_y)
print("lightgbm: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['lgbm'] = (score.mean(), score.std())
lightgbm: 0.1280 (0.0148)
score = cv_rmse(ela,train_X,train_y)
print("ElasticNet: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['ela'] = (score.mean(), score.std())
ElasticNet: 0.1108 (0.0151)
score = cv_rmse(svr,train_X,train_y)
print("SVR: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['svr'] = (score.mean(), score.std())
SVR: 0.1096 (0.0172)
score = cv_rmse(ridge,train_X,train_y)
print("ridge: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['ridge'] = (score.mean(), score.std())
ridge: 0.1106 (0.0154)
score = cv_rmse(lasso,train_X,train_y)
print("Lasso: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['Lasso'] = (score.mean(), score.std())
Lasso: 0.1108 (0.0150)
score = cv_rmse(bay,train_X,train_y)
print("bay: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['bay'] = (score.mean(), score.std())
bay: 0.1106 (0.0152)
score = cv_rmse(xgb,train_X,train_y)
print("xgb: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['xgb'] = (score.mean(), score.std())
xgb: 0.1259 (0.0156)
score = cv_rmse(GBR,train_X,train_y)
print("GBR: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['GBR'] = (score.mean(), score.std())
GBR: 0.1326 (0.0189)
score = cv_rmse(ker,train_X,train_y)
print("ker: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['ker'] = (score.mean(), score.std())
ker: 0.1178 (0.0167)
score = cv_rmse(stack_gen,train_X,train_y)
print("stack_gen: {:.4f} ({:.4f})".format(score.mean(), score.std()))
scores['stack_gen'] = (score.mean(), score.std())
stack_gen: 0.1338 (0.0191)
Identify the best-performing model
sns.set_style("white")
fig = plt.figure(figsize=(24, 12))
ax = sns.pointplot(x=list(scores.keys()), y=[score for score, _ in scores.values()],
markers=['o'], linestyles=['-'])
for i, score in enumerate(scores.values()):
ax.text(i, score[0] + 0.002, '{:.6f}'.format(score[0]),
horizontalalignment='left', size='large', color='black', weight='semibold')
plt.ylabel('Score (RMSE)', size=20, labelpad=12.5)
plt.xlabel('Model', size=20, labelpad=12.5)
plt.tick_params(axis='x', labelsize=13.5)
plt.tick_params(axis='y', labelsize=12.5)
plt.title('Scores of Models', size=20)
plt.show()