Ensemble Model
Ensemble models are very effective for classification problems. Deep learning, for example in image recognition, is also effective, but it is a black box.
For a grading (scoring) system, we typically use GBDT, XGBoost, etc.
In engineering, interpretability is very important, because it lets us pinpoint the cause when we run into an issue.
How to build an ensemble model? Bagging and Boosting
Bagging: Random forest
Boosting: GBDT, XGBoost
We average the predictions of all the models to get the final prediction.
We use the variance / standard deviation of the predictions to evaluate the stability of the model.
Averaging makes the ensemble more stable (lower variance) than any single model.
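The stabilizing effect of averaging can be illustrated with a small simulation. This is only a sketch under stated assumptions: every "model" is a hypothetical predictor that returns the true value plus independent Gaussian noise, and the constants (TRUE_VALUE, NOISE_STD, N_MODELS, N_TRIALS) are made up for illustration. Real models are rarely fully independent, so the reduction is usually smaller in practice.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 10.0   # hypothetical quantity every model tries to predict
NOISE_STD = 2.0     # hypothetical noise level of a single model
N_MODELS = 25       # number of models averaged in the ensemble
N_TRIALS = 2000     # how many times the experiment is repeated


def single_prediction():
    """One noisy model: the true value plus independent Gaussian noise."""
    return random.gauss(TRUE_VALUE, NOISE_STD)


def ensemble_prediction():
    """Average the predictions of N_MODELS independent noisy models."""
    return statistics.mean(single_prediction() for _ in range(N_MODELS))


single = [single_prediction() for _ in range(N_TRIALS)]
ensemble = [ensemble_prediction() for _ in range(N_TRIALS)]

# The standard deviation measures how stable each kind of prediction is
single_std = statistics.stdev(single)
ensemble_std = statistics.stdev(ensemble)
print(f"std of a single model:     {single_std:.3f}")
print(f"std of the averaged model: {ensemble_std:.3f}")
```

With independent errors, the standard deviation of the average shrinks roughly by a factor of the square root of the number of models.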
Random Forest
Bagging is a framework for building ensemble models.
A random forest combines multiple decision trees to make the final prediction.
It can also be used for regression problems (the prediction is the mean value of the trees' outputs).
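For the regression case, the "mean value" can be checked directly. The sketch below assumes scikit-learn's RandomForestRegressor and a synthetic data set from make_regression (both chosen here for illustration, not taken from the notes); it compares the forest's prediction with a manually computed mean over the individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y)

# Prediction of the whole forest for the first five samples
forest_pred = reg.predict(X[:5])

# Predictions of every single tree, averaged by hand
tree_preds = np.array([t.predict(X[:5]) for t in reg.estimators_])
manual_mean = tree_preds.mean(axis=0)

print("forest prediction equals the mean of the trees:",
      np.allclose(forest_pred, manual_mean))
```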
Build the random forest
If the decision trees we train are highly correlated, the random forest will not perform well.
Diversity is the most important property of a random forest.
1) Randomize the training samples: each decision tree in the forest is trained on a different part of the training data, drawn by sampling with replacement.
2) Randomize the features. For example, if we have 100 features, we choose 10 of them at random and build the decision tree from those 10 features.
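The two sources of randomness can be sketched with the standard library alone. The sizes below (N_SAMPLES, N_TREES) and the helper draw_tree_inputs are hypothetical and exist only for this illustration; the 100 and 10 features mirror the example in the text. Note that scikit-learn's RandomForestClassifier draws the feature subset at every split rather than once per tree, so this is a simplified picture, not a description of that implementation.

```python
import random

random.seed(42)

N_SAMPLES = 1000   # hypothetical training-set size
N_FEATURES = 100   # total number of features, as in the text
N_CHOSEN = 10      # features handed to each tree, as in the text
N_TREES = 5        # hypothetical number of trees


def draw_tree_inputs():
    """Pick the row indices and feature indices for one tree."""
    # Rows: sample WITH replacement (bootstrap), so duplicates are expected
    rows = random.choices(range(N_SAMPLES), k=N_SAMPLES)
    # Features: sample WITHOUT replacement, so each feature appears at most once
    features = sorted(random.sample(range(N_FEATURES), k=N_CHOSEN))
    return rows, features


for tree_id in range(N_TREES):
    rows, features = draw_tree_inputs()
    unique_fraction = len(set(rows)) / N_SAMPLES
    print(f"tree {tree_id}: {unique_fraction:.0%} of the rows are distinct, "
          f"features = {features}")
```

Because each bootstrap sample leaves out a different part of the data, and each tree sees different features, the trees end up less correlated with each other.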
Overfitting of random forest
Hyperparameter of random forest:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
n_estimators: the number of decision trees used. The more decision trees, the longer the random forest takes to train.
criterion: the measure of the quality of a split, used to choose the feature for the current node: gini or entropy.
max_depth: the maximum depth of each decision tree.
min_samples_split, min_samples_leaf: the minimum number of samples required to split an internal node and to be at a leaf node; both limit how far a tree grows and so control the number of leaves.
max_features: the number of features to consider when looking for the best split
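To see how one of these hyperparameters relates to overfitting, the following sketch trains the same forest with a few values of max_depth and records the train and test accuracy. The data set (load_digits), the depth values, and the results dictionary are choices made for this illustration; the exact scores depend on the scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for depth in (2, 5, None):  # None means the trees grow without a depth limit
    clf = RandomForestClassifier(
        n_estimators=100, max_depth=depth,
        max_features="sqrt", random_state=0)
    clf.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for this depth
    results[depth] = (clf.score(X_train, y_train),
                      clf.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, "
          f"test={results[depth][1]:.2f}")
```

A large gap between the train and test accuracy is a sign of overfitting; shallow trees usually narrow the gap at the cost of a lower train accuracy.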
An example:
# import the data set and the random forest classifier
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# load the data
digits = load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# create and train the random forest classifier
clf = RandomForestClassifier(
    n_estimators=400, criterion='entropy', max_depth=5,
    min_samples_split=3, max_features='sqrt', random_state=0)
clf.fit(X_train, y_train)
print("Accuracy in train data set is: %.2f, in the test data set is %.2f"
      % (clf.score(X_train, y_train), clf.score(X_test, y_test)))
output:
Accuracy in train data set is: 0.98, in the test data set is 0.95
Another Demo:
Predicting employee turnover
# Turnover rate demo
# import package
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
#%matplotlib inline
from sklearn.model_selection import train_test_split
# read the data as a pandas DataFrame
df = pd.read_csv('HR_comma_sep.csv', index_col=None)
# check whether any data is missing
print(df.isnull().any(), '\n\n')
# print the first few rows
print(df.head(), "\n\n")
   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157
1                0.80             0.86               5                   262
2                0.11             0.88               7                   272
3                0.72             0.87               5                   223
4                0.37             0.52               2                   159

   time_spend_company  Work_accident  left  promotion_last_5years  sales  \
0                   3              0     1                      0  sales
1                   6              0     1                      0  sales
2                   4              0     1                      0  sales
3                   5              0     1                      0  sales
4                   3              0     1                      0  sales

   salary
0     low
1  medium
2  medium
3     low
4     low
# rename the columns
df = df.rename(columns={
    'satisfaction_level': 'satisfaction',
    'last_evaluation': 'evaluation',
    'number_project': 'projectCount',
    'average_montly_hours': 'averageMonthlyHours',
    'time_spend_company': 'yearsAtCompany',
    'Work_accident': 'workAccident',
    'promotion_last_5years': 'promotion',
    'sales': 'department',
    'left': 'turnover',
})
# move the label to the first column
front = df['turnover']
df.drop(labels=['turnover'], axis = 1, inplace = True)
df.insert(0, 'turnover', front)
#df.head()
# calculate the turnover rate
turnover_rate = df.turnover.value_counts() / len(df)
print ("the turnover rate is: %.2f\n\n" % turnover_rate[1])
# print the describe() info
print(df.describe(), "\n\n")
           turnover  satisfaction    evaluation  projectCount  \
count  12504.000000  12504.000000  12504.000000  12504.000000
mean       0.200256      0.621834      0.716446      3.803503
std        0.400208      0.245010      0.169745      1.196592
min        0.000000      0.090000      0.360000      2.000000
25%        0.000000      0.450000      0.560000      3.000000
50%        0.000000      0.650000      0.720000      4.000000
75%        0.000000      0.820000      0.870000      5.000000
max        1.000000      1.000000      1.000000      7.000000

       averageMonthlyHours  yearsAtCompany  workAccident     promotion
count         12504.000000    12504.000000  12504.000000  12504.000000
mean            200.721769        3.385717      0.149472      0.016555
std              49.341169        1.321437      0.356568      0.127601
min              96.000000        2.000000      0.000000      0.000000
25%             157.000000        3.000000      0.000000      0.000000
50%             200.000000        3.000000      0.000000      0.000000
75%             244.000000        4.000000      0.000000      0.000000
max             310.000000       10.000000      1.000000      1.000000
# convert the string values into integer category codes
df['department'] = df['department'].astype('category').cat.codes
df['salary'] = df['salary'].astype('category').cat.codes
# split the train / test data set
target_name = 'turnover'
X = df.drop('turnover', axis = 1)
y = df[target_name]
# stratify=y keeps the turnover rate in the train and test sets equal to the rate in the full dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
# now train the models
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
# train the decision tree
dtree = tree.DecisionTreeClassifier(
    criterion='entropy',
    # max_depth=3,  # limit the depth of the tree to prevent overfitting
    min_weight_fraction_leaf=0.01  # each leaf must hold at least 1% of the total sample weight
)
dtree = dtree.fit(X_train, y_train)
print("\n\n ---Decision Tree---")
print(classification_report(y_test, dtree.predict(X_test)))
 ---Decision Tree---
              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1500
           1       0.93      0.89      0.91       376

    accuracy                           0.96      1876
   macro avg       0.95      0.94      0.94      1876
weighted avg       0.96      0.96      0.96      1876
# train the random forest
rf = RandomForestClassifier(
    criterion='entropy',
    n_estimators=1000,
    max_depth=None,  # None means no depth limit; set a value to limit overfitting
    min_samples_split=10,  # minimum number of samples a node needs before it can be split
    # min_weight_fraction_leaf=0.02  # minimum fraction of the total sample weight at a leaf, to prevent overfitting
)
rf.fit(X_train, y_train)
print("\n\n ---Random Forest---")
print(classification_report(y_test, rf.predict(X_test)))
 ---Random Forest---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1500
           1       0.99      0.90      0.94       376

    accuracy                           0.98      1876
   macro avg       0.98      0.95      0.96      1876
weighted avg       0.98      0.98      0.98      1876