Join us for this immersive study session. The solution shared here covers essentially the full pipeline of a structured-data competition: data analysis, data preprocessing, feature engineering, model training, and model ensembling. Feel free to work through it over the weekend.
Competition: Sberbank Russian Housing Market
Link: https://www.kaggle.com/c/sberbank-russian-housing-market
Competition Background
Housing costs demand significant investment from both consumers and developers. And when it comes to planning a budget, whether personal or corporate, neither side can be sure until the very end which item will be the largest expense. Sberbank, Russia's oldest and largest bank, helps its customers anticipate budgets by forecasting realty prices, so that renters, developers, and lenders have more confidence in each other when signing a lease or purchasing a building.
Although the housing market in Russia is relatively stable, the country's volatile economy makes forecasting prices from apartment features a unique challenge. Complex interactions between housing attributes, such as the number of bedrooms and the location, are enough to make price prediction difficult. Adding the instability of the economy means Sberbank and its customers need more than a simple regression model from their machine-learning toolbox.
In this competition, Sberbank challenged Kagglers to develop algorithms that use a broad range of features to predict realty prices. Competitors rely on a rich dataset that includes housing data and macroeconomic patterns. An accurate forecasting model allows Sberbank to provide more certainty to its customers in an uncertain economy.
Task Overview
The goal of this competition is to predict the sale price of each property. The target variable is called price_doc in train.csv. The training data runs from August 2011 to June 2015, and the test set from July 2015 to May 2016. The dataset also includes information on the overall state of Russia's economy and financial sector, so you can focus on producing accurate price predictions for each property without having to guess how the business cycle will move.
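Because the training data ends in June 2015 and the test set starts in July 2015, local validation should respect time order rather than shuffle rows randomly. A minimal sketch with a toy frame (only `timestamp` and `price_doc` follow the real schema; the rows and the cutoff date are illustrative):

```python
import pandas as pd

# Toy stand-in for train.csv; `timestamp` and `price_doc` match the
# real column names, the values are made up.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2013-01-15", "2014-06-01", "2015-01-10", "2015-05-20"]),
    "price_doc": [5_000_000, 6_200_000, 5_800_000, 7_100_000],
})

# Hold out the most recent months as validation, mimicking the
# train/test time split instead of a random shuffle.
cutoff = pd.Timestamp("2015-01-01")  # illustrative cutoff
train_part = df[df["timestamp"] < cutoff]
valid_part = df[df["timestamp"] >= cutoff]
print(len(train_part), len(valid_part))  # 2 2
```

Validating on the most recent slice gives a score that better reflects how the model will behave on the truly later test period.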
Competition Data
- train.csv, test.csv: information on individual transactions. Rows are indexed by the "id" field, which refers to a single transaction (a given property may appear in several separate transactions). These files also include supplementary information about each property's local area.
- macro.csv: data on Russia's macroeconomy and financial sector (can be joined to the train and test sets on "timestamp")
- data_dictionary.txt: explanations of the fields available in the other data files
- sample_submission.csv: a sample submission file in the correct format
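The macro join mentioned above is a plain left join on `timestamp`. A minimal sketch with toy frames (only `timestamp`, `id`, and `price_doc` follow the real schema; the macro column and all values are illustrative):

```python
import pandas as pd

# Toy stand-ins for train.csv and macro.csv.
train = pd.DataFrame({
    "id": [1, 2],
    "timestamp": ["2014-01-01", "2014-01-02"],
    "price_doc": [5_000_000, 6_000_000],
})
macro = pd.DataFrame({
    "timestamp": ["2014-01-01", "2014-01-02"],
    "oil_urals": [107.2, 106.9],  # illustrative macro indicator values
})

# Left join keeps every transaction row and attaches that day's macro data.
merged = train.merge(macro, on="timestamp", how="left")
print(merged.shape)  # (2, 4)
```

A left join is the safe choice here: every transaction survives even if a macro row were missing for its date.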
There are quite a few fields: the data_dictionary file reveals 200+ of them, so the competition data is rich and fairly objective, and well worth studying.
Exploratory Data Analysis
Source: https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-sberbank
- Distribution of property prices
Sorting the prices in ascending order, we plot the price of each property as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train_df = pd.read_csv('train.csv')  # competition training data

plt.figure(figsize=(8,6))
plt.scatter(range(train_df.shape[0]), np.sort(train_df.price_doc.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()
- Price trend over time
train_df['yearmonth'] = train_df['timestamp'].apply(lambda x: x[:4]+x[5:7])
grouped_df = train_df.groupby('yearmonth')['price_doc'].aggregate(np.median).reset_index()

plt.figure(figsize=(12,8))
sns.barplot(x=grouped_df.yearmonth.values, y=grouped_df.price_doc.values, alpha=0.8, color=sns.color_palette()[2])
plt.ylabel('Median Price', fontsize=12)
plt.xlabel('Year Month', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()
- Features with the highest importance
Since there are 292 variables, let's build a basic XGBoost model and look at the important variables first.
from sklearn import preprocessing
import xgboost as xgb

# Label-encode the categorical (object-dtype) columns.
for f in train_df.columns:
    if train_df[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train_df[f].values))
        train_df[f] = lbl.transform(list(train_df[f].values))

train_y = train_df.price_doc.values
train_X = train_df.drop(["id", "timestamp", "price_doc"], axis=1)

xgb_params = {
    'eta': 0.05,
    'max_depth': 8,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:linear',
    'eval_metric': 'rmse',
    'silent': 1
}
dtrain = xgb.DMatrix(train_X, train_y, feature_names=train_X.columns.values)
model = xgb.train(dict(xgb_params, silent=0), dtrain, num_boost_round=100)

# plot the important features #
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
The five most important features and their descriptions are:
- full_sq: total area in square metres, including loggias, balconies, and other non-residential areas
- life_sq: living area in square metres, excluding loggias, balconies, and other non-residential areas
- floor: for dwellings, the floor the unit is on within the building
- max_floor: total number of floors in the building
- build_year: year the building was built
- full_sq vs. price distribution
# Clip extreme values at the 0.5th/99.5th percentiles before plotting.
ulimit = np.percentile(train_df.price_doc.values, 99.5)
llimit = np.percentile(train_df.price_doc.values, 0.5)
train_df.loc[train_df['price_doc'] > ulimit, 'price_doc'] = ulimit
train_df.loc[train_df['price_doc'] < llimit, 'price_doc'] = llimit

col = "full_sq"
ulimit = np.percentile(train_df[col].values, 99.5)
llimit = np.percentile(train_df[col].values, 0.5)
train_df.loc[train_df[col] > ulimit, col] = ulimit
train_df.loc[train_df[col] < llimit, col] = llimit

plt.figure(figsize=(12,12))
sns.jointplot(x=np.log1p(train_df.full_sq.values), y=np.log1p(train_df.price_doc.values), height=10)
plt.ylabel('Log of Price', fontsize=12)
plt.xlabel('Log of Total area in square metre', fontsize=12)
plt.show()
- life_sq vs. price distribution
col = "life_sq"
train_df[col].fillna(0, inplace=True)
ulimit = np.percentile(train_df[col].values, 95)
llimit = np.percentile(train_df[col].values, 5)
train_df.loc[train_df[col] > ulimit, col] = ulimit
train_df.loc[train_df[col] < llimit, col] = llimit

plt.figure(figsize=(12,12))
sns.jointplot(x=np.log1p(train_df.life_sq.values), y=np.log1p(train_df.price_doc.values),
              kind='kde', height=10)
plt.ylabel('Log of Price', fontsize=12)
plt.xlabel('Log of living area in square metre', fontsize=12)
plt.show()
- Median price by floor
grouped_df = train_df.groupby('floor')['price_doc'].aggregate(np.median).reset_index()

plt.figure(figsize=(12,8))
sns.pointplot(x=grouped_df.floor.values, y=grouped_df.price_doc.values, color=sns.color_palette()[2])
plt.ylabel('Median Price', fontsize=12)
plt.xlabel('Floor number', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()
Top 1% Code Walkthrough
- Data.py: data cleaning and feature engineering
- Exploration.py: exploratory data analysis
- Model.py: XGBoost model
- BaseModel.py: baseline models (RandomForestRegressor, GradientBoostingRegressor, Lasso, etc.)
- lightGBM.py: LightGBM model
- Stacking.py: model stacking (final model)
The code is clear and concise, which makes it great study material for data-mining newcomers, and the author's stacking implementation in particular is very elegant. Let's take a look:
Stacking is an ensembling technique that combines multiple models through a meta-classifier or meta-regressor. The base models are trained on the full training set, and the meta-model is trained on the base models' predictions as its features. Stacking usually mixes base models of different types.
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.cross_validation import KFold  # sklearn < 0.20 API; newer versions: sklearn.model_selection.KFold
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
import xgboost as xgb
import lightgbm as lgb
from sklearn.preprocessing import StandardScaler
# Wrap LightGBM so it can be called inside the stacking ensemble.
class LGBregressor(object):
    def __init__(self, params):
        self.params = params

    def fit(self, X, y, w):
        y /= 10000000
        # self.scaler = StandardScaler().fit(y)
        # y = self.scaler.transform(y)
        # Hold out 20% of the data to pick the number of boosting rounds.
        split = int(X.shape[0] * 0.8)
        indices = np.random.permutation(X.shape[0])
        train_id, test_id = indices[:split], indices[split:]
        x_train, y_train, w_train = X[train_id], y[train_id], w[train_id]
        x_valid, y_valid, w_valid = X[test_id], y[test_id], w[test_id]
        d_train = lgb.Dataset(x_train, y_train, weight=w_train)
        d_valid = lgb.Dataset(x_valid, y_valid, weight=w_valid)
        partial_bst = lgb.train(self.params, d_train, 10000, valid_sets=d_valid, early_stopping_rounds=50)
        num_round = partial_bst.best_iteration
        # Retrain on all data with the tuned number of rounds.
        d_all = lgb.Dataset(X, label=y, weight=w)
        self.bst = lgb.train(self.params, d_all, num_round)

    def predict(self, X):
        return self.bst.predict(X) * 10000000
        # return self.scaler.inverse_transform(self.bst.predict(X))
# Wrap XGBoost so it can be called inside the stacking ensemble.
class XGBregressor(object):
    def __init__(self, params):
        self.params = params

    def fit(self, X, y, w=None):
        if w is None:  # `w == None` on a numpy array does not do what the original intended
            w = np.ones(X.shape[0])
        # Hold out 20% of the data to pick the number of boosting rounds.
        split = int(X.shape[0] * 0.8)
        indices = np.random.permutation(X.shape[0])
        train_id, test_id = indices[:split], indices[split:]
        x_train, y_train, w_train = X[train_id], y[train_id], w[train_id]
        x_valid, y_valid, w_valid = X[test_id], y[test_id], w[test_id]
        d_train = xgb.DMatrix(x_train, label=y_train, weight=w_train)
        d_valid = xgb.DMatrix(x_valid, label=y_valid, weight=w_valid)
        watchlist = [(d_train, 'train'), (d_valid, 'valid')]
        partial_bst = xgb.train(self.params, d_train, 10000, early_stopping_rounds=50, evals=watchlist, verbose_eval=100)
        num_round = partial_bst.best_iteration
        # Retrain on all data with the tuned number of rounds.
        d_all = xgb.DMatrix(X, label=y, weight=w)
        self.bst = xgb.train(self.params, d_all, num_round)

    def predict(self, X):
        test = xgb.DMatrix(X)
        return self.bst.predict(test)
# This object modified from Wille on https://dnc1994.com/2016/05/rank-10-percent-in-first-kaggle-competition-en/
class Ensemble(object):
    def __init__(self, n_folds, stacker, base_models):
        self.n_folds = n_folds
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, trainDf, testDf):
        X = trainDf.drop(['price_doc', 'w'], 1).values
        y = trainDf['price_doc'].values
        w = trainDf['w'].values
        T = testDf.values
        X_fillna = trainDf.drop(['price_doc', 'w'], 1).fillna(-999).values
        T_fillna = testDf.fillna(-999).values

        folds = list(KFold(len(y), n_folds=self.n_folds, shuffle=True))
        # Out-of-fold predictions of each base model become the meta-features.
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            print('Training base model ' + str(i+1) + '...')
            S_test_i = np.zeros((T.shape[0], len(folds)))
            for j, (train_idx, test_idx) in enumerate(folds):
                print('Training round ' + str(j+1) + '...')
                if clf not in [xgb1, lgb1]:  # sklearn models cannot handle missing values.
                    X = X_fillna
                    T = T_fillna
                X_train = X[train_idx]
                y_train = y[train_idx]
                w_train = w[train_idx]
                X_holdout = X[test_idx]
                # w_holdout = w[test_idx]
                # y_holdout = y[test_idx]
                clf.fit(X_train, y_train, w_train)
                y_pred = clf.predict(X_holdout)
                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict(T)
            # Average this base model's per-fold test predictions.
            S_test[:, i] = S_test_i.mean(1)
        self.S_train, self.S_test, self.y = S_train, S_test, y  # for diagnosis purpose
        self.corr = pd.concat([pd.DataFrame(S_train), trainDf['price_doc']], 1).corr()  # correlation of predictions by different models.
        # cv_stack = ShuffleSplit(n_splits=6, test_size=0.2)
        # score_stacking = cross_val_score(self.stacker, S_train, y, cv=cv_stack, n_jobs=1, scoring='neg_mean_squared_error')
        # print(np.sqrt(-score_stacking.mean()))  # CV result of stacking
        # Fit the meta-model on the out-of-fold predictions.
        self.stacker.fit(S_train, y)
        y_pred = self.stacker.predict(S_test)
        return y_pred
if __name__ == "__main__":
    trainDf = pd.read_csv('train_featured.csv')
    testDf = pd.read_csv('test_featured.csv')

    params1 = {'eta':0.05, 'max_depth':5, 'subsample':0.8, 'colsample_bytree':0.8, 'min_child_weight':1,
               'gamma':0, 'silent':1, 'objective':'reg:linear', 'eval_metric':'rmse'}
    xgb1 = XGBregressor(params1)
    params2 = {'booster':'gblinear', 'alpha':0,  # for gblinear, delete this line if change back to gbtree
               'eta':0.1, 'max_depth':2, 'subsample':1, 'colsample_bytree':1, 'min_child_weight':1,
               'gamma':0, 'silent':1, 'objective':'reg:linear', 'eval_metric':'rmse'}
    xgb2 = XGBregressor(params2)
    RF = RandomForestRegressor(n_estimators=500, max_features=0.2)
    ETR = ExtraTreesRegressor(n_estimators=500, max_features=0.3, max_depth=None)
    Ada = AdaBoostRegressor(DecisionTreeRegressor(max_depth=15), n_estimators=200)
    GBR = GradientBoostingRegressor(n_estimators=200, max_depth=5, max_features=0.5)
    LR = LinearRegression()
    params_lgb = {'objective':'regression', 'metric':'rmse',
                  'learning_rate':0.05, 'max_depth':-1, 'sub_feature':0.7, 'sub_row':1,
                  'num_leaves':15, 'min_data':30, 'max_bin':20,
                  'bagging_fraction':0.9, 'bagging_freq':40, 'verbosity':0}
    lgb1 = LGBregressor(params_lgb)

    # 5-fold stacking: six base models, with a linear XGBoost (xgb2) as the meta-model.
    E = Ensemble(5, xgb2, [xgb1, lgb1, RF, ETR, Ada, GBR])
    prediction = E.fit_predict(trainDf, testDf)

    output = pd.read_csv('test.csv')
    output = output[['id']]
    output['price_doc'] = prediction
    output.to_csv(r'Ensemble\Submission_Stack.csv', index=False)
What Else Can We Learn
The discussion section of each competition is where the top teams exchange their solutions, and reading their shared summaries and write-ups often pays off even more than reading the code.
Link: https://www.kaggle.com/c/sberbank-russian-housing-market/discussion/35684
What benefited me most from the first-place solution:
- Instead of predicting the target variable directly, they predicted the price per square metre and converted it back afterwards.
- They tried many independent models: they found that mixing two transaction types (Investment and OwnerOccupier) made the model behave very differently, so they split the data into two groups with different feature sets and fed each to its own model.
- They removed outliers and trained separate models on them.
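The first trick above, predicting price per square metre rather than the raw price, is just a target transform applied before training and inverted at inference. A minimal sketch with synthetic numbers (the "model" is a stand-in that predicts the training mean; in practice any regressor goes in its place):

```python
import numpy as np

# Toy data: total area and sale price; in the real data these are
# the full_sq and price_doc columns.
full_sq = np.array([40.0, 55.0, 70.0, 90.0])
price_doc = np.array([4_800_000.0, 6_050_000.0, 8_400_000.0, 9_900_000.0])

# 1) Train on price per square metre rather than the raw price.
price_per_sq = price_doc / full_sq

# Stand-in for a fitted model: here it just predicts the training mean.
predicted_per_sq = np.full_like(full_sq, price_per_sq.mean())

# 2) Convert back to a full-price prediction at inference time.
predicted_price = predicted_per_sq * full_sq
print(predicted_price)
```

The transform removes most of the size-driven variance from the target, so the model only has to learn how location, quality, and the economy move the per-metre rate.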
Further reading: https://www.one-tab.com/page/Yv_JbxErRU6yE3oa7MsgnQ