Mean encoding
# Encode each category with the mean target computed on the training split
means = X_tr.groupby(col)['target'].mean()
train_new[col + '_mean_target'] = train_new[col].map(means)
val_new[col + '_mean_target'] = val_new[col].map(means)
2) Mean encoding helps separate the target classes: categories end up ordered by their target rate. Plain Label Encoding, being unsupervised, assigns codes in an essentially arbitrary order.
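A toy sketch of this difference (the data here is made up): label codes just follow alphabetical order, while mean-encoded values track each category's target rate.

```python
import pandas as pd

# Hypothetical data: three cities with different target rates
df = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'B', 'C'],
    'target': [ 1,   1,   0,   1,   0,   0 ],
})

# Label encoding: alphabetical integer codes, unrelated to the target
df['city_label'] = df['city'].astype('category').cat.codes

# Mean encoding: each category is replaced by its mean target value,
# so the encoded feature is already ordered by target rate
means = df.groupby('city')['target'].mean()   # A -> 1.0, B -> ~0.33, C -> 0.0
df['city_mean_target'] = df['city'].map(means)
```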
Ways to regularize mean encoding:
- CV loop inside training data (recommended)
- Smoothing
- Adding random noise
- Sorting and calculating expanding mean
from sklearn.model_selection import StratifiedKFold

y_tr = df_tr['target'].values
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
for tr_ind, val_ind in skf.split(df_tr, y_tr):
    X_tr, X_val = df_tr.iloc[tr_ind], df_tr.iloc[val_ind].copy()
    for col in cols:  # iterate over the columns that need encoding
        # Map each validation-fold category to its mean target in the
        # other folds, so a row never sees its own target value
        means = X_val[col].map(X_tr.groupby(col)['target'].mean())
        X_val[col + '_mean_target'] = means
    train_new.iloc[val_ind] = X_val
prior = df_tr['target'].mean()          # global mean
train_new.fillna(prior, inplace=True)   # categories unseen in a fold get the prior
Smoothing: add a regularization coefficient α on top of the plain mean encoding, pulling rare categories toward the global mean; α is a hyperparameter to tune. The formula is:

(mean(target) * nrows + global_mean * α) / (nrows + α)

where nrows is the number of rows in the category and global_mean is the target mean over the whole training set.
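A minimal sketch of this smoothing (the function name and default α are illustrative; `df_tr` and `col` follow the snippets in this section):

```python
import pandas as pd

def smoothed_mean_encode(df_tr, col, target='target', alpha=10.0):
    """Mean encoding regularized toward the global mean; alpha is tunable."""
    global_mean = df_tr[target].mean()
    agg = df_tr.groupby(col)[target].agg(['mean', 'count'])
    smoothed = (agg['mean'] * agg['count'] + global_mean * alpha) \
               / (agg['count'] + alpha)
    return df_tr[col].map(smoothed)
```

Large α pulls every category toward the global mean; α = 0 recovers the plain mean encoding.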
# Sorting and calculating expanding mean: each row is encoded with the
# target mean of the preceding rows of the same category (the current
# row's own target is subtracted out)
cumsum = df_tr.groupby(col)['target'].cumsum() - df_tr['target']
cumcnt = df_tr.groupby(col).cumcount()
train_new[col + '_mean_target'] = cumsum / cumcnt
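Of the four regularization options listed above, adding random noise has no snippet yet; a minimal sketch (function name, noise level, and seed are illustrative and need tuning per dataset):

```python
import numpy as np
import pandas as pd

def noisy_mean_encode(train, col, target='target', noise_level=0.05, seed=123):
    """Mean encoding with Gaussian noise added to the training-side codes,
    so the model cannot memorize the encoded column exactly."""
    means = train.groupby(col)[target].mean()
    encoded = train[col].map(means)
    rng = np.random.default_rng(seed)
    return encoded + rng.normal(0.0, noise_level, size=len(encoded))
```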
There are many other representative techniques:
Statistical features and neighborhood features
# Per (User_id, Page_id) min/max ad price as new statistical features
gb = df.groupby(['User_id', 'Page_id'], as_index=False).agg(
    Min_price=('Ad_price', 'min'),
    Max_price=('Ad_price', 'max'),
)
df = pd.merge(df, gb, how='left', on=['User_id', 'Page_id'])
Note: the snippet above is an example of applying the agg function.
Matrix factorization
- Can be applied to just a subset of the features
- Provides extra diversity
  - Useful for model ensembling
- Lossy, but quite effective for certain specific tasks
- The reduced dimensionality is typically 5-100
  - Depends on the specific task
- SVD and PCA
- TruncatedSVD
  - For sparse matrices
- Non-negative Matrix Factorization (NMF)
  - Guarantees all elements are non-negative
  - Works well for count-type data
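A small scikit-learn sketch of the two factorizations above, run on a synthetic sparse non-negative matrix (shapes and parameters are illustrative):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import NMF, TruncatedSVD

# Synthetic sparse non-negative matrix: 100 samples, 50 features
X = sparse_random(100, 50, density=0.1, format='csr', random_state=0)

# TruncatedSVD accepts sparse input directly (no centering, unlike PCA)
svd = TruncatedSVD(n_components=10, random_state=0)
X_svd = svd.fit_transform(X)   # shape (100, 10)

# NMF keeps both factor matrices non-negative -- a good fit for count data
nmf = NMF(n_components=10, init='nndsvda', random_state=0, max_iter=500)
X_nmf = nmf.fit_transform(X)   # shape (100, 10), all entries >= 0
```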
Feature interactions
tSNE
- Try different hyperparameters
- Project train and test together, not separately
- When the matrix dimensionality is very large, reduce it first (e.g. with PCA), then run tSNE
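These points can be sketched as follows (the data is synthetic; the intermediate PCA width of 50 is a common choice, not a rule):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
train = rng.normal(size=(300, 200))   # high-dimensional train features
test = rng.normal(size=(100, 200))    # high-dimensional test features

# Project train and test TOGETHER so they share one embedding space
both = np.vstack([train, test])

# Large dimensionality: compress with PCA first, then run tSNE on top
both_pca = PCA(n_components=50, random_state=0).fit_transform(both)
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(both_pca)

train_2d = embedding[:len(train)]
test_2d = embedding[len(train):]
```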