My First Look at Kaggle: House Price Prediction

I am very interested in data mining. After teaching myself Python's data-processing libraries and some machine-learning algorithms, I tried a hands-on project on Kaggle: house price prediction. Many people have shared good approaches online; working through the whole pipeline myself gave me a first real feel for data processing and mining.

Preprocessing

Importing libraries

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import scipy.stats as stats
import sklearn.linear_model as linear_model
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')

from sklearn.cluster import KMeans

numpy and pandas are the standard data-handling libraries; matplotlib and seaborn are used for exploratory plots; K-Means is used later when processing the Neighborhood feature.

Reading the training and test data

train = pd.read_csv('~/train.csv')  # training data
test = pd.read_csv('~/test.csv')    # test data
# sizes of the training and test sets
print('original train set: ',train.shape)
print('original test set:',test.shape)
# since Id carries no information, save it and drop it
test_Id = test['Id']
test.drop('Id',axis=1,inplace=True)
train_Id = train['Id']
train.drop('Id',axis=1,inplace=True)
print('train drop Id:',train.shape)
print('test drop Id:',test.shape)
original train set:  (1460, 81)
original test set: (1459, 80)
train drop Id: (1460, 80)
test drop Id: (1459, 79)

Feature correlation heatmap analysis

corrmat = train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)

(Figure: correlation heatmap of all numerical features)
Sorting by correlation coefficient, the ten features most strongly correlated with SalePrice are:

k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

(Figure: heatmap of the ten features most correlated with SalePrice)
The features most correlated with SalePrice are OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, 1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, YearRemodAdd, GarageYrBlt, MasVnrArea and Fireplaces.
- Among these, GarageArea is strongly correlated with GarageCars and GarageYrBlt, so the representative GarageCars can be kept;
- TotalBsmtSF and 1stFlrSF are strongly correlated and could be combined into a new Basement feature;
- TotRmsAbvGrd behaves much like GrLivArea, so GrLivArea is kept for analysis (a programmatic check is sketched below).
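As a sanity check on these observations, one can also list all highly correlated feature pairs programmatically. A minimal sketch (the 0.8 cutoff is an arbitrary choice, not from the original analysis):

# list feature pairs whose absolute correlation exceeds a threshold (illustrative sketch)
import itertools
pairs = []
for a, b in itertools.combinations(corrmat.columns, 2):
    if abs(corrmat.loc[a, b]) > 0.8:
        pairs.append((a, b, corrmat.loc[a, b]))
for a, b, r in sorted(pairs, key=lambda t: -abs(t[2])):
    print('%s vs %s: %.2f' % (a, b, r))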

The target feature: SalePrice

# since SalePrice is the target, save it
y = train.SalePrice
train_labels = y.values.copy()
#train.drop('SalePrice',axis=1,inplace=True)
print(y.describe())
sns.distplot(y)
print('Skewness: %f' %y.skew())
print('Kurtosis: %f' %y.kurt())

The distribution shows positive skewness; taking the logarithm brings it closer to a normal distribution, which suits many machine-learning algorithms.
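A quick sanity check of what the log transform does to the skewness (illustrative only; the transform itself is applied in the log-transform section further below):

# compare skewness of the target before and after log(1+x)
y_log = np.log1p(y)
print('Skewness before: %f' % y.skew())
print('Skewness after : %f' % y_log.skew())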

Separating features by type (numerical and categorical)

# collect the numerical and categorical features of the training set
num_f = [f for f in train.columns if train.dtypes[f] != 'object']  # numerical features
num_f.pop()  # drop the last entry, which is the target SalePrice
print('numerical feature length:',len(num_f))
category_f = [f for f in train.columns if train.dtypes[f] == 'object']  # categorical features
print('category feature length:',len(category_f))

numerical feature length: 36
category feature length: 43

Analysis of the numerical features (num_f)

def jointplot(x, y, **kwargs):
    # regression scatter of each numerical feature against SalePrice;
    # fall back to printing value counts if regplot fails
    try:
        sns.regplot(x=x, y=y)
    except Exception:
        print(x.value_counts())
f = pd.melt(train, id_vars=['SalePrice'], value_vars=num_f)
g = sns.FacetGrid(f, col='variable', col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(jointplot, 'value', 'SalePrice')

(Figure: regression scatter plots of each numerical feature against SalePrice)

GrLivArea (living area, numerical) vs. SalePrice

var = 'GrLivArea'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000))

(Figure: GrLivArea vs. SalePrice scatter plot)

Living area has a clearly linear relationship with price. The exceptions are the two points at the lower right: very large area but low price, which defies the usual pattern, so they can be removed:

train.drop(train[(train['GrLivArea']>4000)&(train.SalePrice<300000)].index,inplace=True)

It turned out later that noise points strongly affect the results, so finding and removing them is crucial.
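One rough way to hunt for further noise points, not part of the original workflow but in the same spirit: fit a quick linear model on two strong predictors and inspect the rows with the largest residuals (X_out, y_out are throwaway names for this sketch):

# illustrative sketch: flag candidate outliers by residual size
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
X_out = train[['GrLivArea', 'OverallQual']]
y_out = np.log1p(train['SalePrice'])
lr.fit(X_out, y_out)
resid = (y_out - lr.predict(X_out)).abs()
print(train.loc[resid.nlargest(5).index, ['GrLivArea', 'OverallQual', 'SalePrice']])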

OverallQual (numerical) vs. price

var = 'OverallQual'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

(Figure: boxplot of SalePrice by OverallQual)

YearBuilt vs. price

var = 'YearBuilt'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90)

(Figure: boxplot of SalePrice by YearBuilt)

Categorical feature analysis

Overview of the categorical features

'''
# optional: make NaN an explicit 'MISSING' category before plotting
for c in category_f:
    train[c] = train[c].astype('category')
    if train[c].isnull().any():
        train[c] = train[c].cat.add_categories(['MISSING'])
        train[c] = train[c].fillna('MISSING')
'''
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(train, id_vars=['SalePrice'], value_vars=category_f)
g = sns.FacetGrid(f, col="variable", col_wrap=3, sharex=False, sharey=False, size=5)
g = g.map(boxplot, "value", "SalePrice")

(Figure: boxplots of SalePrice for each categorical feature)

The categorical features with the biggest effect on price: Neighborhood varies markedly; Partial under SaleCondition runs somewhat higher; and the QUAL-type features (whether there is a pool, fireplace, or basement).

One-way ANOVA on the categorical features

def anova(frame):
    anv = pd.DataFrame()
    anv['feature'] = category_f
    pvals = []
    for c in category_f:
        samples = []
        for cls in frame[c].unique():
            s = frame[frame[c] == cls]['SalePrice'].values
            samples.append(s)
        pval = stats.f_oneway(*samples)[1]
        pvals.append(pval)
    anv['pval'] = pvals
    return anv.sort_values('pval')

a = anova(train)
a['disparity'] = np.log(1./a['pval'].values)
sns.barplot(data=a, x='feature', y='disparity')
x=plt.xticks(rotation=90)

(Figure: bar plot of the ANOVA disparity, log(1/pval), for each categorical feature)
The plot shows that the specific values of many categorical variables have a large effect on the final price (Neighborhood, for instance, implicitly captures location, a major driver of house prices). We can therefore rank the values of each categorical variable by the mean sale price within each value and assign ordinal codes 1, 2, 3, 4, ... to quantify their effect, i.e. turn the categorical variables into ordered numerical ones:

Neighborhood

var = 'Neighborhood'
data = pd.concat([train['SalePrice'], train[var]], axis=1)
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90)

(Figure: boxplot of SalePrice by Neighborhood)

Clustering neighborhoods by price

Location matters a great deal for house prices, so K-Means is used here to cluster the neighborhoods by their mean sale price.

data = data.groupby(by='Neighborhood').SalePrice.mean()
location = np.array(data.index)
price_lo = np.array(data.values).reshape(-1, 1)
km = KMeans(n_clusters=3)
label = km.fit_predict(price_lo)  # compute cluster centres and assign each neighborhood a cluster index
expense = np.sum(km.cluster_centers_, axis=1)  # mean price of each cluster
CityCluster = [[], [], []]
for i in range(len(location)):
    CityCluster[label[i]].append(location[i])
for i in range(len(CityCluster)):
    print("Expense:%.2f" % expense[i])
    print(CityCluster[i])
for i in range(len(expense)):
    print("Ratio:%.1f" % (expense[i] / sum(expense)))
# note: K-Means cluster indices are arbitrary, so check the printed centres
# before treating clusters 0/1/2 as average/expensive/cheap
nei_average = pd.Series(CityCluster[0]).drop_duplicates()
nei_exp = pd.Series(CityCluster[1]).drop_duplicates()
nei_cheap = pd.Series(CityCluster[2]).drop_duplicates()

Neighborhoods with similar mean prices are grouped together; each group is then given a value in proportion to its mean price, producing a dictionary:

nei_dict = dict.fromkeys(['NoRidge', 'NridgHt', 'StoneBr'], 5)
nei_dict.update(dict.fromkeys(['Blueste', 'BrDale', 'BrkSide', 'Edwards',
                               'IDOTRR', 'MeadowV', 'Mitchel', 'NAmes', 'NPkVill',
                               'OldTown', 'SWISU', 'Sawyer'], 2))
nei_dict.update(dict.fromkeys(['Blmngtn', 'ClearCr', 'CollgCr', 'Crawfor', 'Gilbert',
                               'NWAmes', 'SawyerW', 'Somerst', 'Timber', 'Veenker'], 3))
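Since a missing key would produce NaN in the later .map() step (which then calls .astype(int)), it is worth checking that every neighborhood is covered. A small defensive check, not in the original:

# every Neighborhood value should have an entry in nei_dict
missing = set(train['Neighborhood'].unique()) - set(nei_dict)
print('Unmapped neighborhoods:', missing if missing else 'none')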

Handling missing values

NAs = pd.concat([train.isnull().sum(),
                 train.isnull().sum() / train.isnull().count(),
                 test.isnull().sum(),
                 test.isnull().sum() / test.isnull().count()],
                axis=1, keys=["train", "percent_train", "test", "percent_test"])
NAs[NAs.sum(axis=1) > 1].sort_values(by="percent_test", ascending=False)
all_data = pd.concat([train,test],keys=["train","test"])
# handle missing values across both train and test
  • missing_col_NA: where a missing value simply means the feature is absent, fill categorical columns with "NA" and numerical ones with 0.
missing_col_NA = ["Alley","MasVnrType","FireplaceQu","GarageType","PoolQC","Fence",
                  "MiscFeature","GarageQual","GarageCond","GarageFinish"]
missing_col_0 = ["MasVnrArea","GarageCars","GarageArea"]
for col in missing_col_NA:
    all_data[col].fillna("NA",inplace=True)
for col in missing_col_0:
    all_data[col].fillna(0,inplace=True)
  • missing_col_mode: some missing values can be filled with the mean or the mode.
missing_col_mode = ["MSZoning","KitchenQual","Functional","SaleType","Electrical","Exterior1st","Exterior2nd"]
for col in missing_col_mode:
    all_data[col].fillna(all_data[col].mode()[0],inplace=True)
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].apply(lambda x:x.fillna(x.median()))
#all_data['LotFrontage'] = all_data.groupby("Neighborhood")["LotFrontage"].transform(lambda x:x.fillna(x.median()))
  • For the Bsmt-related missing values, handle them according to whether the house has a basement:
# the counts of missing values differ between columns, so a house is treated
# as having no basement only when BsmtCond and BsmtQual are both missing
NoBmstIndex = pd.isnull(all_data["BsmtCond"]) & pd.isnull(all_data["BsmtQual"])
all_data.loc[NoBmstIndex, "BsmtCond"] = all_data.loc[NoBmstIndex, "BsmtCond"].fillna("NA")
all_data.loc[NoBmstIndex, "BsmtQual"] = all_data.loc[NoBmstIndex, "BsmtQual"].fillna("NA")
all_data.loc[NoBmstIndex, "BsmtExposure"] = all_data.loc[NoBmstIndex, "BsmtExposure"].fillna("NA")
# fill the remaining ones with the mode
all_data.BsmtCond.fillna(all_data.BsmtCond.mode()[0],inplace=True)
all_data.BsmtQual.fillna(all_data.BsmtQual.mode()[0],inplace=True)
all_data.BsmtExposure.fillna(all_data.BsmtExposure.mode()[0],inplace=True)

all_data.BsmtFinSF1.fillna(0,inplace=True)
all_data.BsmtFinSF2.fillna(0,inplace=True)
all_data.BsmtFinType1.fillna("NA",inplace=True)
all_data.BsmtFinType2.fillna("NA",inplace=True)

# distinguish houses with no basement from those with an unfinished one:
# for unfinished basements, set the finished-area fields to the median
all_data.loc[all_data["BsmtFinType1"]=="Unf","BsmtFinSF1"] = all_data.BsmtFinSF1.median()
all_data.loc[all_data["BsmtFinType2"]=="Unf","BsmtFinSF2"] = all_data.BsmtFinSF2.median()

all_data.BsmtUnfSF.fillna(0,inplace=True)
all_data.TotalBsmtSF.fillna(0,inplace=True)

all_data.BsmtFullBath.fillna(0, inplace=True)
all_data.BsmtHalfBath.fillna(0, inplace=True)
'''
Attributes such as Exterior1st and Exterior2nd describe the exterior covering
material; the second is only meaningful when there are two materials, yet many
rows simply repeat the first value. Where Exterior2nd equals Exterior1st it is
therefore set to "None"; Condition1 and Condition2 are handled the same way.
'''
all_data.loc[all_data["Exterior1st"]==all_data["Exterior2nd"],"Exterior2nd"]="None"
all_data.loc[all_data["Condition1"]==all_data["Condition2"],"Condition2"]="None"
# Utilities takes almost a single value, so it carries no signal; drop it
all_data.Utilities.value_counts()
all_data.drop('Utilities',axis=1, inplace=True)

Preparing the training and test data

Dropping unused features

all_data.drop(["GarageArea","GarageYrBlt","MiscFeature","Fence","LotFrontage",
               "Street","SaleType","FullBath","1stFlrSF","MasVnrArea","MasVnrType"],
             axis=1,inplace=True)

Adding new features

Add a feature derived from HouseStyle: unfinished houses map to 0, finished ones to 1. Also combine the basement areas into a TotalSF feature.

Unf = ['2.5Unf', '1.5Unf']
all_data["HouseStyleFinish"] = all_data["HouseStyle"].map(lambda x: 0 if x in Unf else 1)
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['BsmtFinSF1'] + all_data['BsmtFinSF2']

Converting ordinal categorical variables to numbers

# Quality map
qual_dict = { 'NA':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5}
all_data["ExterQual"] = all_data["ExterQual"].map(qual_dict).astype(int)
all_data["ExterCond"] = all_data["ExterCond"].map(qual_dict).astype(int)
all_data["BsmtQual"] = all_data["BsmtQual"].map(qual_dict).astype(int)
all_data["BsmtCond"] = all_data["BsmtCond"].map(qual_dict).astype(int)
all_data["HeatingQC"] = all_data["HeatingQC"].map(qual_dict).astype(int)
all_data["KitchenQual"] = all_data["KitchenQual"].map(qual_dict).astype(int)
all_data["FireplaceQu"] = all_data["FireplaceQu"].map(qual_dict).astype(int)
all_data["GarageQual"] = all_data["GarageQual"].map(qual_dict).astype(int)
all_data["GarageCond"] = all_data["GarageCond"].map(qual_dict).astype(int)
all_data['PoolQC'] = all_data.PoolQC.replace({'NA':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
#basement finish type map dictionary
bsmt_fin_dict = {'NA': 0, "Unf": 1, "LwQ": 2, "Rec": 3, "BLQ": 4, "ALQ": 5, "GLQ": 6}
all_data["BsmtFinType1"] = all_data["BsmtFinType1"].map(bsmt_fin_dict).astype(int)
all_data["BsmtFinType2"] = all_data["BsmtFinType2"].map(bsmt_fin_dict).astype(int)
all_data['BsmtExposure'] = all_data.BsmtExposure.replace({'NA':0,'No':1,'Mn':2,'Av':3,'Gd':4})
# other features with an obvious ordinal scale

all_data['CentralAir'] = all_data.CentralAir.replace({'N':0,'Y':1})
all_data['GarageFinish'] = all_data.GarageFinish.replace({'NA':0,'Unf':1,'RFn':2,'Fin':3})  # 'RFn' is the spelling used in the data
all_data['GarageType'] = all_data.GarageType.replace({'NA':0,'Detchd':1,'CarPort':2,
                                                      'BuiltIn':3,'Basment':4,'Attchd':5,
                                                     '2Types':6})
all_data['Alley'] = all_data.Alley.replace({'NA':0,'Pave':1,'Grvl':2})
#Neighborhood map
all_data['Neighborhood']=all_data['Neighborhood'].map(nei_dict).astype(int)
# house age relative to 2010, the latest sale year in the data
all_data["Age"] = 2010 - all_data["YearBuilt"]
# whether the house has been remodeled
all_data['Remodeled'] = (all_data["YearRemodAdd"] != all_data["YearBuilt"]) * 1
# years since the remodel
all_data['AgeSinceRem'] = 2010 - all_data['YearRemodAdd']
# whether the (re)model year equals the year sold; note that the second
# assignment overwrites the first, so 'New' ends up based on YearRemodAdd
all_data['New'] = ((all_data["YearBuilt"] == all_data["YrSold"])) * 1
all_data['New'] = ((all_data["YearRemodAdd"] == all_data["YrSold"])) * 1
def encode(frame, feature):
    '''
    For a categorical variable, rank its values by the mean SalePrice observed
    within each value and assign ordinal codes 1, 2, 3, ... accordingly,
    effectively turning the categorical variable into an ordered numerical one.
    Note: this writes a new column (feature + '_E') directly into frame.
    '''
    ordering = pd.DataFrame()
    ordering['val'] = frame[feature].unique()
    ordering.index = ordering.val
    ordering['price_mean'] = frame[[feature, 'SalePrice']].groupby(feature).mean()['SalePrice']
    # groupby() collects rows sharing a value of the feature; mean() then gives
    # the average price for each value
    ordering = ordering.sort_values('price_mean')
    ordering['order'] = range(1, ordering.shape[0]+1)
    ordering = ordering['order'].to_dict()
    for attr_v, score in ordering.items():
        # e.g. {'Grvl': 1, 'MISSING': 3, 'Pave': 2}
        frame.loc[frame[feature] == attr_v, feature+'_E'] = score

category_encoded = []
# the categorical set contains both genuinely non-numeric variables and
# pseudo-numeric ones (ratings and grades already mapped to 1, 2, 3, ... above),
# so only the remaining non-numeric variables go through encode().
# With one-hot encoding instead, all of them would be handled by pd.get_dummies().
category_left = ['BldgType','Condition1','Condition2','Electrical','Exterior1st',
                'Exterior2nd','Foundation','Functional','Heating','HouseStyle',
                'LandContour','LandSlope','LotConfig','LotShape','MSZoning',
                'PavedDrive','RoofMatl','RoofStyle','SaleCondition']
for q in category_left:
    encode(all_data, q)
    category_encoded.append(q+'_E')
all_data.drop(category_left, axis=1, inplace=True)  # the encoded _E columns replace the original variables

print(category_encoded)
print(len(category_encoded))

Converting some numerical features to categorical

all_data.YrSold = all_data.YrSold.astype(str)
all_data.MoSold = all_data.MoSold.astype(str)
all_data.MSSubClass = all_data.MSSubClass.astype(str)
all_data.HalfBath = all_data.HalfBath.astype(str)
all_data.BedroomAbvGr = all_data.BedroomAbvGr.astype(str)
#all_data.GarageCars = all_data.GarageCars.astype(str)
num_new = [f for f in all_data.columns if all_data.dtypes[f] != 'object']
category_new = [f for f in all_data.columns if all_data.dtypes[f] == 'object']
print(num_new)
print(category_new)
print(len(num_new))
print(len(category_new))
# feature cross-correlation analysis and selection
def spearman(frame, features):
    '''
    Use Spearman rank correlation to measure each variable's relationship with
    SalePrice. For the ordinal variables produced above, and for non-linear
    relationships generally, rank correlation is a better gauge than Pearson.
    '''
    spr = pd.DataFrame()
    spr['feature'] = features
    spr['corr'] = [frame[f].corr(frame['SalePrice'], 'spearman') for f in features]
    spr = spr.sort_values('corr')
    plt.figure(figsize=(6, 0.25*len(features)))
    sns.barplot(data=spr, y='feature', x='corr', orient='h')    
#features = num_data + category_encoded
spearman(all_data, num_new)

(Figure: horizontal bar plot of Spearman correlation with SalePrice for all numerical features)
The encoded features behave as one would expect: overall quality, neighborhood and living area rank near the top, while the age-related features are negatively correlated.

Log-transforming skewed numerical features

from scipy.stats import skew
# log-transform the target using log(1+x)
train["SalePrice"] = np.log1p(train["SalePrice"])
sns.distplot(train['SalePrice'])
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
y = train.SalePrice
train_labels = y.values.copy()
all_data.drop('SalePrice',axis=1,inplace=True)
Skewness: 0.121580
Kurtosis: 0.804751

(Figure: distribution of SalePrice after the log transform)

# log-transform skewed numeric features
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna()))  # compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

One-hot encoding the remaining categorical features

# Create dummy features for categorical values via one-hot encoding
for col in category_new:
    for_dummy = all_data.pop(col)
    extra_data = pd.get_dummies(for_dummy,prefix=col)
    all_data = pd.concat([all_data, extra_data],axis=1)

Standardization

Note that the scaler is fitted on all of the data, training and test sets together, so that both are scaled on the same basis.

from sklearn.preprocessing import StandardScaler
N = StandardScaler()
scal_all_data = N.fit_transform(all_data)
from sklearn.model_selection import cross_val_score, train_test_split
print("New number of features : " + str(all_data.shape[1]))
train_all_data = scal_all_data[:train.shape[0]]
test_all_data = scal_all_data[train.shape[0]:]

print("train:" +str(train_all_data.shape))
print("test:" +str(test_all_data.shape))
print(len(y))
#test_all_data.tail()
New number of features : 112
train:(1458, 112)
test:(1459, 112)
1458

Creating the train/validation split

X_train, X_test, y_train, y_test = train_test_split(train_all_data, y,test_size = 0.3, random_state=0)
print("X_train : " + str(X_train.shape))
print("X_test : " + str(X_test.shape))
print("y_train : " + str(y_train.shape))
print("y_test : " + str(y_test.shape))

Training the models

Define a helper that computes a model's cross-validated RMSE, used to compare the different models.


# cross-validate separately on the training and held-out splits;
# the error measure for official scoring is RMSE

from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score
import xgboost
def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)
def rmse_cv_test(model):
    rmse= np.sqrt(-cross_val_score(model, X_test, y_test, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)

Polynomial regression (kernel ridge)

from sklearn.kernel_ridge import KernelRidge
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
score = rmse_cv(KRR)
score_test = rmse_cv_test(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score_test.mean(), score_test.std()))
Kernel Ridge score: 0.1189 (0.0108)

Kernel Ridge score: 0.1270 (0.0117)

Ridge regression

ridge = RidgeCV(alphas = [0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6, 10, 30, 60])
ridge.fit(X_train, y_train)
alpha = ridge.alpha_
print("Best alpha :", alpha)

print("Try again for more precision with alphas centered around " + str(alpha))
ridge = RidgeCV(alphas = [alpha * .5,alpha * .55,alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, 
                          alpha * .9, alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15,
                          alpha * 1.25, alpha * 1.3, alpha * 1.35, alpha * 1.4], 
                cv = 10)
ridge.fit(X_train, y_train)
alpha = ridge.alpha_
print("Best alpha :", alpha)
print("Ridge RMSE on Training set :", rmse_cv(ridge).mean())
print("Ridge RMSE on Test set :", rmse_cv_test(ridge).mean())
Best alpha : 30.0
Try again for more precision with alphas centered around 30.0
Best alpha : 27.0
Ridge RMSE on Training set : 0.118370003199
Ridge RMSE on Test set : 0.126548632298
# Plot important coefficients
coefs = pd.Series(ridge.coef_, index = all_data.columns)
print("Ridge picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  \
      str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Coefficients in the Ridge Model")
plt.show()
Ridge picked 112 features and eliminated the other 0 features

(Figure: ten most negative and ten most positive coefficients in the Ridge model)

Lasso model

#lasso = LassoCV(alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 
#                         0.3, 0.6, 1], 
#                max_iter = 50000, cv = 10)
lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005])
lasso.fit(X_train, y_train)
alpha = lasso.alpha_
print("Best alpha :", alpha)

print("Try again for more precision with alphas centered around " + str(alpha))
lasso = LassoCV(alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, 
                          alpha * .85, alpha * .9, alpha * .95, alpha, alpha * 1.05, 
                          alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, alpha * 1.35, 
                          alpha * 1.4], 
                max_iter = 50000, cv = 10)
lasso.fit(X_train, y_train)
alpha = lasso.alpha_
print("Best alpha :", alpha)

print("Lasso RMSE on Training set :", rmse_cv(lasso).mean())
print("Lasso RMSE on Test set :", rmse_cv_test(lasso).mean())
Best alpha : 0.001
Try again for more precision with alphas centered around 0.001
Best alpha : 0.0014
Lasso RMSE on Training set : 0.115958983923
Lasso RMSE on Test set : 0.123666217616
# Plot important coefficients
coefs = pd.Series(lasso.coef_, index = all_data.columns)
print("Lasso picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  \
      str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")
plt.show()
Lasso picked 80 features and eliminated the other 32 features

(Figure: ten most negative and ten most positive coefficients in the Lasso model)

ElasticNet Model

from sklearn.linear_model import ElasticNetCV
elasticNet = ElasticNetCV(l1_ratio = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 0.95, 1],
                          alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 
                                    0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                          max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )

print("Try again for more precision with l1_ratio centered around " + str(ratio))
elasticNet = ElasticNetCV(l1_ratio = [ratio * .85, ratio * .9, ratio * .95, ratio, ratio * 1.05, ratio * 1.1, ratio * 1.15],
                          alphas = [0.0001, 0.0003, 0.0006, 0.001, 0.003, 0.006, 0.01, 0.03, 0.06, 0.1, 0.3, 0.6, 1, 3, 6], 
                          max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
if (elasticNet.l1_ratio_ > 1):
    elasticNet.l1_ratio_ = 1    
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )

print("Now try again for more precision on alpha, with l1_ratio fixed at " + str(ratio) + 
      " and alpha centered around " + str(alpha))
elasticNet = ElasticNetCV(l1_ratio = ratio,
                          alphas = [alpha * .6, alpha * .65, alpha * .7, alpha * .75, alpha * .8, alpha * .85, alpha * .9, 
                                    alpha * .95, alpha, alpha * 1.05, alpha * 1.1, alpha * 1.15, alpha * 1.25, alpha * 1.3, 
                                    alpha * 1.35, alpha * 1.4], 
                          max_iter = 50000, cv = 10)
elasticNet.fit(X_train, y_train)
if (elasticNet.l1_ratio_ > 1):
    elasticNet.l1_ratio_ = 1    
alpha = elasticNet.alpha_
ratio = elasticNet.l1_ratio_
print("Best l1_ratio :", ratio)
print("Best alpha :", alpha )

print("ElasticNet RMSE on Training set :", rmse_cv(elasticNet).mean())
print("ElasticNet RMSE on Test set :", rmse_cv_test(elasticNet).mean())
Best l1_ratio : 0.5
Best alpha : 0.003
Try again for more precision with l1_ratio centered around 0.5
Best l1_ratio : 0.475
Best alpha : 0.003
Now try again for more precision on alpha, with l1_ratio fixed at 0.475 and alpha centered around 0.003
Best l1_ratio : 0.475
Best alpha : 0.003
ElasticNet RMSE on Training set : 0.116031335159
ElasticNet RMSE on Test set : 0.12208609128
coefs = pd.Series(elasticNet.coef_, index = all_data.columns)
print("ElasticNet picked " + str(sum(coefs != 0)) + " features and eliminated the other " +  str(sum(coefs == 0)) + " features")
imp_coefs = pd.concat([coefs.sort_values().head(10),
                     coefs.sort_values().tail(10)])
imp_coefs.plot(kind = "barh")
plt.title("Coefficients in the ElasticNet Model")
plt.show()
ElasticNet picked 80 features and eliminated the other 32 features

(Figure: ten most negative and ten most positive coefficients in the ElasticNet model)

XGBoost

The xgboost parameters were taken from kernels shared online; they have a large effect on running time.

from xgboost import XGBRegressor
'''
xgb = XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)
'''
xgb = XGBRegressor(n_estimators=700,learning_rate=0.07,subsample=0.9,colsample_bytree=0.7)
print('Xgboost RMSE on training data:',rmse_cv(xgb).mean())
print('Xgboost RMSE on test data:',rmse_cv_test(xgb).mean())
Xgboost RMSE on training data: 0.12726105661
Xgboost RMSE on test data: 0.126590562523

GradientBoost

from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                max_depth=4, max_features='sqrt',
                                min_samples_leaf=15, min_samples_split=10, 
                                loss='huber', random_state =5)
print('GradientBoost RMSE on training data:',rmse_cv(gbr).mean())
print('GradientBoost RMSE on test data:',rmse_cv_test(gbr).mean())
GradientBoost RMSE on training data: 0.125016856097
GradientBoost RMSE on test data: 0.127225404959

Although elasticNet and Lasso do well on the training set, they overfit easily and are sensitive to the data, while ensemble methods such as xgboost are very stable. So the models are combined with a weighted sum.

Ensemble


from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    '''Weighted average of several regressors.'''
    def __init__(self, models, weights):
        self.models = models
        self.weights = np.array(weights)
    def fit(self, X, y):
        # fit clones so the passed-in models stay untouched
        self.models_ = [clone(x) for x in self.models]
        for model in self.models_:
            model.fit(X, y)
        return self
    def predict(self, X):
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.sum(self.weights * predictions, axis=1)
#aver_model = AveragingModels(models=(elasticNet,KRR,ridge,lasso),weights=(0.5,0.1,0.25,0.15))
#aver_model = AveragingModels(models=(elasticNet,xgb,ridge,lasso),weights=(0.5,0.1,0.25,0.15))
aver_model = AveragingModels(models=(xgb,gbr,elasticNet,ridge,lasso),weights=(0.6,0.1,0.2,0.05,0.05))
print('aver model on train:',rmse_cv(aver_model).mean())
print('aver model on test:',rmse_cv_test(aver_model).mean())
 aver model on train: 0.118569419587
 aver model on test: 0.119069294424

xgboost gets the highest weight; elasticnet and lasso perform similarly, so with more weight on elasticnet the lasso weight can stay small.
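The weights above were picked by hand. One way to choose them more systematically is a small search over candidate weightings using the same CV helper; a sketch (the candidate tuples are arbitrary examples, and the full CV loop is slow):

# evaluate a few candidate weightings and keep the one with the lowest CV RMSE
candidates = [(0.6, 0.1, 0.2, 0.05, 0.05),
              (0.5, 0.1, 0.2, 0.1, 0.1),
              (0.4, 0.2, 0.2, 0.1, 0.1)]
best = None
for w in candidates:
    m = AveragingModels(models=(xgb, gbr, elasticNet, ridge, lasso), weights=w)
    score = rmse_cv(m).mean()
    print(w, '->', score)
    if best is None or score < best[1]:
        best = (w, score)
print('best weights:', best[0])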

#pred_y_log = elasticNet.predict(test_all_data)
aver_model.fit(train_all_data,y)
pred_y_log = aver_model.predict(test_all_data)
pred_y = np.expm1(pred_y_log)
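Finally, to produce a submission file in the format Kaggle expects, with Id and SalePrice columns (a minimal sketch; the file name is arbitrary):

# build the submission from the Ids saved at the start and the back-transformed predictions
submission = pd.DataFrame({'Id': test_Id, 'SalePrice': pred_y})
submission.to_csv('submission.csv', index=False)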

References:
- 1. A study on Regression applied to the Ames dataset (by juliencs): https://www.kaggle.com/juliencs/a-study-on-regression-applied-to-the-ames-dataset/notebook
- 2. Comprehensive data exploration with Python (by Pedro Marcelino): https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python/notebook
- 3. House Price EDA (by Dominik Gawlik): https://www.kaggle.com/dgawlik/house-prices-eda
- 4. Stacked Regressions: Top 4% on LeaderBoard (by Serigne): https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard/notebook
- 5. XGBoost + Lasso (by Human Analog): https://www.kaggle.com/humananalog/xgboost-lasso/comments
