Mean encoding (target encoding)
# Per-category target means, learned on the training split only
means = X_tr.groupby(col)['target'].mean()
train_new[col + '_mean_target'] = train_new[col].map(means)
val_new[col + '_mean_target'] = val_new[col].map(means)
2) Mean encoding helps separate the target classes, whereas plain Label Encoding is unsupervised, so the ordering it assigns to categories is essentially arbitrary.
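The fragment above assumes `X_tr`, `train_new`, and `val_new` already exist. A minimal self-contained sketch of the same idea, using hypothetical toy data (`city` as the categorical column):

```python
import pandas as pd

# Toy data; the frame and column names are illustrative only
df = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'B', 'C'],
    'target': [1,   0,   1,   1,   0,   0],
})
train, val = df.iloc[:4].copy(), df.iloc[4:].copy()

# Per-category target means, computed on the training part only
means = train.groupby('city')['target'].mean()
train['city_mean_target'] = train['city'].map(means)
val['city_mean_target'] = val['city'].map(means)
```

Note that category `C` never appears in the training part, so its encoding in `val` is NaN; this is why the CV-loop code further down fills missing encodings with the global mean.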
- CV loop inside training data (recommended)
- Smoothing
- Adding random noise
- Sorting and calculating expanding mean
from sklearn.model_selection import StratifiedKFold

y_tr = df_tr['target'].values
train_new = df_tr.copy()
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for tr_ind, val_ind in skf.split(df_tr, y_tr):
    X_tr, X_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind]
    for col in cols:  # iterate over the columns that need encoding
        # Means learned on the other folds, mapped onto this fold
        means = X_val[col].map(X_tr.groupby(col)['target'].mean())
        train_new.loc[X_val.index, col + '_mean_target'] = means
prior = df_tr['target'].mean()  # global mean for categories unseen in a fold
train_new.fillna(prior, inplace=True)
Smoothing: adds a coefficient α on top of plain mean encoding as regularization; α needs to be tuned. A standard form of the formula is: encoding = (category_mean * n + global_mean * α) / (n + α), where n is the number of rows in the category.
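A minimal sketch of that smoothed encoding, assuming toy data and an illustrative α of 10 (`cat` and `df_tr` here are stand-in names):

```python
import pandas as pd

alpha = 10  # regularization strength; needs tuning
df_tr = pd.DataFrame({
    'cat':    ['a'] * 2 + ['b'] * 50,
    'target': [1, 1] + [0, 1] * 25,
})

global_mean = df_tr['target'].mean()
agg = df_tr.groupby('cat')['target'].agg(['mean', 'count'])

# (category_mean * n + global_mean * alpha) / (n + alpha):
# rare categories are pulled toward the global mean
smoothed = (agg['mean'] * agg['count'] + global_mean * alpha) / (agg['count'] + alpha)
df_tr['cat_mean_target'] = df_tr['cat'].map(smoothed)
```

The rare category `a` (2 rows, raw mean 1.0) is shrunk strongly toward the global mean, while the frequent category `b` barely moves from its raw mean of 0.5.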
Sorting and calculating expanding mean:
# For each row, the mean of target over all previous rows of the same
# category (the current row's own target is excluded)
cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']
cumcnt = df_tr.groupby(col).cumcount()
train_new[col + '_mean_target'] = cumsum / cumcnt
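Of the four regularization options listed above, only "adding random noise" has no snippet. A minimal sketch, assuming an already-computed encoding column and a noise level that, like α, needs tuning:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
noise_level = 0.01  # needs tuning; too much destroys the feature

# Hypothetical frame holding an already-computed mean encoding
train_new = pd.DataFrame({'cat_mean_target': [0.2, 0.5, 0.8]})

# Slightly degrade the encoding on the training set so the model
# cannot memorize the exact per-category target means
train_new['cat_mean_target'] += rng.normal(0, noise_level, size=len(train_new))
```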
There are many other representative techniques:
Statistical and neighborhood features
# Named aggregation: output column = (input column, aggregation function)
gb = df.groupby(['User_id', 'Page_id'], as_index=False).agg(
    Max_price=('Ad_price', 'max'),
    Min_price=('Ad_price', 'min'),
)
df = pd.merge(df, gb, how='left', on=['User_id', 'Page_id'])
Note: the above is an example of applying the agg function.
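A runnable version of that agg pattern on hypothetical ad-log data (the values here are made up for illustration):

```python
import pandas as pd

# Toy ad log mirroring the column names in the snippet above
df = pd.DataFrame({
    'User_id':  [1, 1, 1, 2],
    'Page_id':  [10, 10, 20, 10],
    'Ad_price': [5.0, 9.0, 3.0, 7.0],
})

# Per (user, page) price statistics, joined back as new feature columns
gb = df.groupby(['User_id', 'Page_id'], as_index=False).agg(
    Max_price=('Ad_price', 'max'),
    Min_price=('Ad_price', 'min'),
)
df = pd.merge(df, gb, how='left', on=['User_id', 'Page_id'])
```

After the merge, every row carries the min/max price seen for its (user, page) pair, e.g. both rows for user 1 on page 10 get Max_price 9.0 and Min_price 5.0.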
Matrix factorization
- Dimensionality reduction can be applied to just a subset of the features
- Provides additional diversity
  - beneficial for model ensembling
- Some information is lost, but it is effective for certain specific tasks
- The reduced dimensionality is usually 5-100
- SVD and PCA
- TruncatedSVD
  - for sparse matrices
- Non-negative Matrix Factorization (NMF)
  - guarantees all elements are non-negative
  - works well for count data
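A minimal sketch of the two factorizations named above, on a hypothetical sparse count matrix (shapes and hyperparameters are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF, TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical count-like feature matrix, e.g. bag-of-words
X = csr_matrix(rng.poisson(1.0, size=(100, 30)).astype(float))

# TruncatedSVD works directly on sparse input (unlike plain PCA)
svd = TruncatedSVD(n_components=5, random_state=0)
X_svd = svd.fit_transform(X)

# NMF: both factors are non-negative, which suits count data
nmf = NMF(n_components=5, init='nndsvda', random_state=0, max_iter=500)
X_nmf = nmf.fit_transform(X)
```

Both give 5-dimensional features; the NMF output is guaranteed non-negative, the SVD output is not.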
Feature interactions
tSNE
- Try different hyperparameters (e.g. perplexity)
- Reduce train and test together, so they land in the same embedding space
- When the matrix dimensionality is very large, reduce dimensions first (e.g. with PCA), then run tSNE
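The tSNE points above can be sketched as follows, with made-up train/test matrices and illustrative sizes:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 200))  # hypothetical high-dimensional features
X_test = rng.normal(size=(20, 200))

# Stack train and test so both get embedded in the same space
X_all = np.vstack([X_train, X_test])

# Pre-reduce with PCA before t-SNE, since the raw dimensionality is large
X_pca = PCA(n_components=30, random_state=0).fit_transform(X_all)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

# Split the joint embedding back into train and test parts
emb_train, emb_test = emb[:80], emb[80:]
```

t-SNE has no `transform` for new data, which is exactly why train and test must be embedded together rather than fitting on train alone.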