Kaggle Ensembling Guide

Model ensembling is a very powerful technique to increase accuracy on a variety of ML tasks. In this article I will share my ensembling approaches for Kaggle Competitions.

For the first part we look at creating ensembles from submission files. The second part will look at creating ensembles through stacked generalization/blending.

I answer why ensembling reduces the generalization error. Finally I show different methods of ensembling, together with their results and code to try it out for yourself.

“This is how you win ML competitions: you take other peoples’ work and ensemble them together.” Vitaly Kuznetsov NIPS2014

Creating ensembles from submission files

The most basic and convenient way to ensemble is to ensemble Kaggle submission CSV files. You only need the predictions on the test set for these methods — no need to retrain a model. This makes it a quick way to ensemble already existing model predictions, ideal when teaming up.
Voting ensembles.

We first take a look at a simple majority vote ensemble. Let’s see why model ensembling reduces error rate and why it works better to ensemble low-correlated model predictions.
Error correcting codes

During space missions it is very important that all signals are correctly relayed.

If we have a signal in the form of a binary string like:

1110110011101111011111011011

and somehow this signal is corrupted (a bit is flipped) to:

1010110011101111011111011011

then lives could be lost.

A coding solution was found in error correcting codes. The simplest error correcting code is a repetition-code: Relay the signal multiple times in equally sized chunks and have a majority vote.

Original signal:
1110110011

Encoded (10-bit chunks, repeated 3 times) and received with one flipped bit:
1010110011 1110110011 1110110011

Decoding:
1010110011
1110110011
1110110011

Majority vote:
1110110011

Signal corruption is rare and often occurs in small bursts, so it is even rarer for a majority vote to be corrupted.

As long as the corruption is not completely unpredictable (a 50% chance of occurring), the signal can be repaired.
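
The decoding step above can be sketched in a few lines of Python. This is a minimal illustration, not a production decoder: the function name `decode_repetition` is made up for this example, and it assumes the received string is an exact multiple of the chunk size.

```python
from collections import Counter

def decode_repetition(received: str, chunk_size: int) -> str:
    """Split the received bits into equal chunks and majority-vote each position."""
    if len(received) % chunk_size != 0:
        raise ValueError("received length must be a multiple of chunk_size")
    chunks = [received[i:i + chunk_size] for i in range(0, len(received), chunk_size)]
    # zip(*chunks) yields, for every bit position, the bits seen across all copies.
    return "".join(Counter(bits).most_common(1)[0][0] for bits in zip(*chunks))

# Three copies of the signal; the first copy arrived with one flipped bit.
received = "1010110011" + "1110110011" + "1110110011"
print(decode_repetition(received, chunk_size=10))  # 1110110011
```

With an odd number of copies a binary vote can never tie, which is why repetition codes usually repeat an odd number of times.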
A machine learning example

Suppose we have a test set of 10 samples. The ground truth is all positive (“1”):

1111111111

We furthermore have 3 binary classifiers (A,B,C) with a 70% accuracy. You can view these classifiers for now as pseudo-random number generators which output a “1” 70% of the time and a “0” 30% of the time.

We will now show how these pseudo-classifiers are able to obtain 78% accuracy through a voting ensemble.
A pinch of maths

For a majority vote with 3 members we can expect 4 outcomes:

All three are correct
0.7 * 0.7 * 0.7
= 0.343

Two are correct
0.7 * 0.7 * 0.3
0.7 * 0.3 * 0.7
0.3 * 0.7 * 0.7
= 0.441

Two are wrong
0.3 * 0.3 * 0.7
0.3 * 0.7 * 0.3
0.7 * 0.3 * 0.3
= 0.189

All three are wrong
0.3 * 0.3 * 0.3
= 0.027

The most likely outcome (~44%) is that the majority vote corrects a single error. This majority vote ensemble will be correct ~78% of the time on average (0.343 + 0.441 = 0.784).
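
The same arithmetic can be checked by brute force. The sketch below enumerates every correct/wrong combination for a set of independent voters and adds up the probability of the combinations in which a strict majority is correct. The function name is made up for this example, and independence between the voters is an assumption carried over from the pseudo-classifier setup.

```python
from itertools import product

def majority_vote_accuracy(p_correct: float, n_voters: int) -> float:
    """Probability that a strict majority of independent voters is correct."""
    total = 0.0
    # Each outcome is a tuple of booleans: True means that voter is correct.
    for outcome in product([True, False], repeat=n_voters):
        prob = 1.0
        for correct in outcome:
            prob *= p_correct if correct else 1.0 - p_correct
        if sum(outcome) > n_voters / 2:
            total += prob
    return total

print(round(majority_vote_accuracy(0.7, 3), 3))  # 0.784
```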
Number of voters

Just as repetition codes gain error-correcting capability with more repetitions, ensembles usually improve as more members are added.

Using the same pinch of maths as above: a voting ensemble of 5 pseudo-random classifiers with 70% accuracy would be correct ~84% of the time (0.16807 + 0.36015 + 0.3087 = 0.83692). One or two errors are corrected in ~67% of the majority votes (0.36015 + 0.3087).
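
For larger ensembles, enumerating every outcome gets slow, but the binomial distribution gives the same answer directly. The sketch below is an illustration under the same independence assumption; the function name is made up, and it is meant for an odd number of voters so that ties cannot occur.

```python
from math import comb

def majority_accuracy_binomial(p: float, n: int) -> float:
    """P(more than half of n independent voters are correct); use odd n."""
    # Sum the binomial probabilities for k = majority size up to all n correct.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))

for n in (3, 5, 7, 9):
    print(n, round(majority_accuracy_binomial(0.7, n), 4))
```

Each extra pair of voters helps a little less than the previous pair, and real models are never fully independent, so actual gains are smaller than these figures suggest.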
Correlation

When I first joined the team for KDD-cup 2014, Marios Michailidis (KazAnova) proposed something peculiar. He calculated the Pearson correlation for all our submission files and gathered a few well-performing models which were less correlated.

Creating an averaging ensemble from these diverse submissions gave us the biggest 50-spot jump on the leaderboard. Uncorrelated submissions clearly do better when ensembled than correlated submissions. But why?

To see this, let us take 3 simple models again. The ground truth is still all 1’s:

1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy.

These models are highly correlated in their predictions. When we take a majority vote we see no improvement:

1111111100 = 80% accuracy

Now we compare this to 3 lower-performing but largely uncorrelated models:

1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy

When we ensemble this with a majority vote we get:

1111111101 = 90% accuracy

This is an improvement: lower correlation between ensemble members seems to increase the error-correcting capability.
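
The two toy ensembles can be reproduced with a short script that takes the majority vote and also reports the mean pairwise Pearson correlation between the members. This is only a sketch: the helper names are made up for this example, and the correlation is computed on the hard 0/1 predictions, which may differ from how correlation was measured on real submission files.

```python
from itertools import combinations
from math import sqrt

def to_bits(s):
    return [int(c) for c in s]

def majority_vote(models):
    # A position is predicted 1 when more than half of the models say 1.
    return [1 if 2 * sum(col) > len(models) else 0 for col in zip(*models)]

def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)  # undefined for constant predictions

def mean_pairwise_correlation(models):
    pairs = list(combinations(models, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

truth = [1] * 10
correlated = [to_bits(s) for s in ("1111111100", "1111111100", "1011111100")]
uncorrelated = [to_bits(s) for s in ("1111111100", "0111011101", "1000101111")]

for name, models in (("correlated", correlated), ("uncorrelated", uncorrelated)):
    vote = majority_vote(models)
    print(name, accuracy(vote, truth), round(mean_pairwise_correlation(models), 3))
```

The correlated trio votes its way to 80% accuracy, while the weaker but less correlated trio reaches 90%.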
Use for Kaggle: Forest Cover Type prediction

Majority votes make most sense when the evaluation metric requires hard predictions, for instance with (multiclass-) classification accuracy.

The forest cover type prediction challenge uses the UCI Forest CoverType dataset. The dataset has 54 attributes and there are 7 classes.

We create a simple starter model with a 500-tree Random Forest. We then create a few more models and pick the best performing one. For this task and our model selection an ExtraTreesClassifier works best.
Weighing

We then use a weighted majority vote. Why weighing? Usually we want to give a better model more weight in a vote. So in our case we count the vote by the best model 3 times. The other 4 models count for one vote each.

The reasoning is as follows: The only way for the inferior models to overrule the best model (expert) is for them to collectively (and confidently) agree on an alternative.

We can expect this ensemble to repair a few erroneous choices by the best model, leading to a small improvement only. That’s our punishment for forgoing a democracy and creating a Plato’s Republic.

“Every city encompasses two cities that are at war with each other.” Plato in The Republic

Table 1 shows the result of training 5 models, and the resulting score when combining these with a weighted majority vote.

Model                                Public Accuracy Score
GradientBoostingMachine              0.65057
RandomForest Gini                    0.75107
RandomForest Entropy                 0.75222
ExtraTrees Entropy                   0.75524
ExtraTrees Gini (Best)               0.75571
Voting Ensemble (Democracy)          0.75337
Voting Ensemble (3*Best vs. Rest)    0.75667
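
A weighted majority vote of this kind can be sketched as follows. This is not the script used for Table 1: the function name, the model names, and the toy labels are all made up for illustration, and ties are assumed to go to whichever label was counted first (here the best model, because it is listed first).

```python
from collections import Counter

def weighted_vote(predictions, weights):
    """predictions: model name -> list of labels; weights: model name -> votes (default 1)."""
    names = list(predictions)
    n_rows = len(predictions[names[0]])
    result = []
    for i in range(n_rows):
        tally = Counter()
        for name in names:
            tally[predictions[name][i]] += weights.get(name, 1)
        # On a tie, the label that was tallied first wins.
        result.append(tally.most_common(1)[0][0])
    return result

# Hypothetical class labels for 4 test rows from 5 models.
predictions = {
    "best": [1, 2, 3, 1],
    "m2":   [1, 2, 2, 2],
    "m3":   [1, 3, 2, 2],
    "m4":   [2, 2, 2, 2],
    "m5":   [1, 2, 1, 2],
}
print(weighted_vote(predictions, {"best": 3}))  # [1, 2, 3, 2]
print(weighted_vote(predictions, {}))           # [1, 2, 2, 2]
```

In the third row three of the weaker models agree, but that only ties the expert's three votes. In the fourth row all four agree and overrule it.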

Use for Kaggle: CIFAR-10 Object detection in images

CIFAR-10 is another multi-class classification challenge where accuracy matters.

Our team leader for this challenge, Phil Culliton, first found the best setup to replicate a good model from Dr. Graham.

Then he used a voting ensemble of around 30 convnet submissions (all scoring above 90% accuracy). The best single model of the ensemble scored 0.93170.

A voting ensemble of 30 models scored 0.94120. That is a reduction in error rate of ~0.01, pushing the resulting score beyond the estimated human classification accuracy.
Code

We have a sample voting script you could use at the MLWave Github repo. It operates on a directory of Kaggle submissions and creates a new submission. Update: Armando Segnini has added weighing.

“Ensembling. Train 10 neural networks and average their predictions. It’s a fairly trivial technique that results in easy, sizeable performance improvements.

One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers have an error rate of 70%. Then, when they agree they are right. But when they disagree, one of them is often right, so now the average prediction will place much more weight on the correct answer.

The effect will be especially strong whenever the network is confident when it’s right and unconfident when it’s wrong.” Ilya Sutskever in A brief overview of Deep Learning

Averaging

Averaging works well for a wide range of problems (both classification and regression) and metrics (AUC, squared error or logarithmic loss).

There is not much more to averaging than taking the mean of individual model predictions. An often heard shorthand for this on Kaggle is “bagging submissions”.

Averaging predictions often reduces overfit. You ideally want a smooth separation between classes, and a single model’s predictions can be a little rough around the edges.
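
Averaging submission files can be sketched with the standard library alone. The column names `Id` and `Prediction` and the function name are assumptions for this example, since real competitions use their own headers. The CSV contents are passed in as strings so the sketch is self-contained; in practice they would be read from files.

```python
import csv
import io
from collections import defaultdict

def average_submissions(csv_texts, id_col="Id", pred_col="Prediction"):
    """Average the prediction column of several submission CSVs, row by id."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            sums[row[id_col]] += float(row[pred_col])
            counts[row[id_col]] += 1
    # Ids are kept as strings, exactly as they appear in the files.
    return {row_id: sums[row_id] / counts[row_id] for row_id in sums}

# Two hypothetical submissions with predicted probabilities.
sub_a = "Id,Prediction\n1,0.9\n2,0.2\n"
sub_b = "Id,Prediction\n1,0.7\n2,0.4\n"
averaged = average_submissions([sub_a, sub_b])
for row_id, value in averaged.items():
    print(row_id, round(value, 3))
```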
