Kaggle: Santander Customer Transaction Prediction

Competition page:
https://www.kaggle.com/c/santander-customer-transaction-prediction

1. Post-Competition Summary

1.1 Learning from Others

1.1.1 List of Fake Samples and Public/Private LB split

https://www.kaggle.com/yag320/list-of-fake-samples-and-public-private-lb-split
First, the test set and the training set look statistically very similar, yet their unique-value counts differ a lot. The guess is therefore that part of the test set was generated synthetically by sampling feature values from real samples. On that basis, 100,000 fake examples and 100,000 real examples can be identified in the test set. Assuming further that the synthetic sampling was done after the public/private LB split, the real examples can be divided into 50,000 public + 50,000 private.
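A minimal sketch of that detection idea (my reconstruction, not the kernel's exact code): a test row that contains at least one value occurring nowhere else in the test set must be real, while a row whose every value also appears elsewhere is assumed to be synthetic.

import numpy as np
import pandas as pd

# Assumes the competition's test.csv has been downloaded; the path is illustrative.
test_df = pd.read_csv('test.csv')
features = [c for c in test_df.columns if c.startswith('var_')]

unique_flags = np.zeros((len(test_df), len(features)), dtype=int)
for i, var in enumerate(features):
    value_counts = test_df[var].value_counts()
    # Mark rows whose value for this variable occurs exactly once in the whole test set.
    unique_flags[:, i] = (test_df[var].map(value_counts) == 1).astype(int)

# A row with at least one test-wide unique value must be real;
# rows with none are assumed to be synthetic (fake).
real_idx = np.argwhere(unique_flags.sum(axis=1) > 0).flatten()
fake_idx = np.argwhere(unique_flags.sum(axis=1) == 0).flatten()
print(len(real_idx), len(fake_idx))   # per the kernel: 100000 and 100000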

1.1.2 giba single model public 0.9245 private 0.9234

https://www.kaggle.com/titericz/giba-single-model-public-0-9245-private-0-9234

>Reverse features
I don't understand why the negatively correlated features are reversed. Presumably it is so that, once all 200 variables are stacked into a single column later on, every variable correlates with the target in the same (positive) direction and the model can learn one shared monotone pattern.

# Reverse features: flip the sign of every variable that correlates negatively with the target
for var in features:
    if np.corrcoef( train_df['target'], train_df[var] )[1][0] < 0:
        train_df[var] = train_df[var] * -1
        test_df[var]  = test_df[var]  * -1

>Feature generation
For each original variable, four columns are generated: the raw value, its frequency count, the feature_id, and the rank. (The role of feature_id was not obvious to me at first; the quoted comment at the end explains that it lets the model adapt its behaviour to each individual variable.) The var column is also normalized afterwards. Note that the result is a (40000000, 4) matrix, i.e. the 200 variables are stacked vertically rather than kept side by side.

def var_to_feat(vr, var_stats, feat_id ):
    # Build the 4-column representation of one variable:
    # raw value, frequency count, variable id, rank divided by the row count (200000).
    new_df = pd.DataFrame()
    new_df["var"] = vr.values
    new_df["hist"] = pd.Series(vr).map(var_stats)
    new_df["feature_id"] = feat_id
    new_df["var_rank"] = new_df["var"].rank()/200000.
    return new_df.values

# Replicate the target 200 times to match the vertical stacking of the 200 variables.
TARGET = np.array( list(train_df['target'].values) * 200 )

TRAIN = []
var_mean = {}
var_var  = {}
for var in features:
    tmp = var_to_feat(train_df[var], var_stats[var], int(var[4:]) )
    var_mean[var] = np.mean(tmp[:,0]) 
    var_var[var]  = np.var(tmp[:,0])
    # Note: the kernel scales by the variance here, not the standard deviation.
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]
    TRAIN.append( tmp )
TRAIN = np.vstack( TRAIN )

del train_df
_=gc.collect()

print( TRAIN.shape, len( TARGET ) )
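To make the stacking concrete, here is a toy illustration (my own, not from the kernel) of what var_to_feat returns for one variable; var_stats[var] is assumed to be a per-variable value-to-count mapping (a value_counts Series) built elsewhere in the kernel.

import pandas as pd

toy = pd.Series([0.5, 0.7, 0.5, 0.9])
toy_stats = toy.value_counts()            # 0.5 occurs twice, the other values once
print(var_to_feat(toy, toy_stats, 3))
# Each row is [raw value, frequency count, feature_id=3, rank/200000]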

>LGBM model
Train an LGBM classifier on the stacked matrix with 10-fold stratified cross-validation, keeping all 10 fold models for prediction.

model = lgb.LGBMClassifier(**{
     'learning_rate': 0.04,
     'num_leaves': 31,
     'max_bin': 1023,
     'min_child_samples': 1000,
     'reg_alpha': 0.1,
     'reg_lambda': 0.2,
     'feature_fraction': 1.0,
     'bagging_freq': 1,
     'bagging_fraction': 0.85,
     'objective': 'binary',
     'n_jobs': -1,
     'n_estimators':200,})

MODELS = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=11111)
for fold_, (train_indexes, valid_indexes) in enumerate(skf.split(TRAIN, TARGET)):
    print('Fold:', fold_ )
    model = model.fit( TRAIN[train_indexes], TARGET[train_indexes],
                      eval_set = (TRAIN[valid_indexes], TARGET[valid_indexes]),
                      verbose = 10,
                      eval_metric='auc',
                      early_stopping_rounds=25,
                      categorical_feature = [2] )
    MODELS.append( model )
    
del TRAIN, TARGET
_=gc.collect()

>Prediction
Apply the same feature engineering to the test data, predict each variable with every fold model, then combine the 200 per-variable predictions via log(x) - log(1-x) (the logit). Why? (A short check follows the prediction code below.)

Why also apply sub['target'] = sub['target'].rank() / 200000. at the end?
Original answer: "rank or not it produces the same score since the metric is rank based (AUC). I used rank just to normalize to the range [0-1]"
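A quick numeric check of that answer (my own example, not from the kernel): AUC depends only on the ordering of the scores, so replacing the scores by their normalized ranks leaves it unchanged.

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # dummy labels
scores = rng.random(1000)                    # dummy model scores
ranks = pd.Series(scores).rank() / len(scores)
print(np.isclose(roc_auc_score(y, scores), roc_auc_score(y, ranks)))   # True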

ypred = np.zeros( (200000,200) )    # one column of averaged fold predictions per variable
for feat,var in enumerate(features):
    tmp = var_to_feat(test_df[var], var_stats[var], int(var[4:]) )
    tmp[:,0] = (tmp[:,0]-var_mean[var])/var_var[var]
    for model_id in range(10):
        model = MODELS[model_id]
        ypred[:,feat] += model.predict_proba( tmp )[:,1] / 10.
# Average the 200 per-variable predictions in logit space (logit is scipy.special.logit).
ypred = np.mean( logit(ypred), axis=1 )

sub = test_df[['ID_code']]
sub['target'] = ypred
sub['target'] = sub['target'].rank() / 200000.
sub.to_csv('golden_sub.csv', index=False)
print( sub.head(10) )
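Regarding the earlier question about the logit: a small numeric check (my own, not from the kernel) showing that averaging in logit space is a monotone transform of the product of the per-variable odds, i.e. the 200 probabilities are combined as independent pieces of evidence rather than simply averaged; this is what the quoted comment below calls combining everything "mathematically accurate by using mean logit".

import numpy as np
from scipy.special import logit

p = np.array([0.6, 0.7, 0.9])                 # per-variable probabilities for one row
mean_logit = logit(p).mean()                  # what the kernel computes per row
log_odds_product = np.log(np.prod(p / (1 - p)))
print(np.isclose(mean_logit, log_odds_product / len(p)))   # True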

Quoted comment from the kernel's discussion:

I studied your code some more. This is a brilliant solution !! Reversing some variables and stacking all of them into 4 columns is really ingenious. It simulates ideas from an NN convolution where the model can use patterns it learns from one variable to assist in its pattern detection of another variable. This also prevents LGBM from modeling spurious interactions between variables. But it’s more advanced than a convolution (that uses the same weights for all variables) because you provide column 3 which has the original variable’s number (0-199), so your LGBM can customize its prediction for each variable. Lastly you combine everything back together mathematically accurate by using mean logit. Very very nice. Setting the frequency count as a categorical value is a nice touch which allows LGBM to efficiently divide the different distributions. You maximized the modeling ability of an LGBM duplicating other participants’ success with NNs. I am quite impressed !!

To be continued...
