Algorithm: Random Forest, ensemble model

Ensemble Model

For classification problems, ensemble models are very effective, for example in image recognition via deep learning (a black box).

For a scoring system, we typically use GBDT, XGBoost, etc.

In engineering, interpretability is very important, since it lets us pinpoint the problem when an issue comes up.

 

How to build an ensemble model? Two approaches: bagging and boosting

Bagging: Random forest

Boosting: GBDT, XGBoost (a minimal sketch follows)
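
For contrast with bagging, here is a minimal, illustrative sketch of boosting using scikit-learn's GradientBoostingClassifier (a GBDT implementation); the dataset is synthetic and exists only for this demo:

# minimal boosting sketch (illustrative; synthetic data)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# generate a toy classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# in boosting the trees are built sequentially: each new tree
# corrects the errors of the ensemble built so far
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0)
gbdt.fit(X_tr, y_tr)
print("test accuracy: %.2f" % gbdt.score(X_te, y_te))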

In bagging, we take the average of the predictions from all of the models.

We use the variance / standard deviation of the predictions to evaluate the stability of the model.

 

This averaging makes the ensemble more stable; the quick simulation below makes the effect concrete.
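
A self-contained simulation of the effect (the noise level 0.5 and the ensemble size 50 are made-up numbers, purely for illustration):

# toy illustration: averaging independent noisy predictions reduces variance
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0

# 10000 trials, each with 50 independent "model" predictions around the truth
preds = true_value + rng.normal(0, 0.5, size=(10000, 50))

print("std of a single model:    %.3f" % preds[:, 0].std())
print("std of the 50-model mean: %.3f" % preds.mean(axis=1).std())
# the averaged prediction is far more stable (std shrinks roughly by sqrt(50))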

 

Random Forest

Bagging is a framework for building ensemble models.

A random forest uses multiple decision trees and combines their votes for the final prediction.

It can also be used for regression problems (by taking the mean of the trees' outputs).

Build the random forest

If the decision trees are highly correlated with each other, the random forest built from them will not perform well.

Diversity among the trees is the most important property of a random forest.

1) Randomize the training samples: we choose a different part of the training data for each decision tree in the random forest.

We sample with replacement (bootstrap sampling).

2) We can also randomize the features. For example, if we have 100 features, we randomly choose 10 of the 100 and build the decision tree on those 10 features (see the sketch below).
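
A small numpy sketch of both randomizations (the shapes 1000 x 100 and the subset size 10 are assumptions for illustration):

# sketch of the two sources of randomness in a random forest
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 1000, 100

# 1) bootstrap: sample row indices with replacement for each tree
row_idx = rng.integers(0, n_samples, size=n_samples)

# 2) feature randomization: pick a random subset of columns, e.g. 10 of 100
col_idx = rng.choice(n_features, size=10, replace=False)

# each tree would then be trained on X[row_idx][:, col_idx], y[row_idx]

Note that scikit-learn's max_features actually re-samples the candidate features at every split rather than once per tree; the idea is the same, decorrelating the trees.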

Overfitting of random forest

Hyperparameters of the random forest:

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

n_estimators: the number of decision trees used. The more trees, the longer the training time of the random forest.

criterion: how to measure the quality of a split when choosing the feature for the current node; either gini or entropy (formulas below this list).

max_depth: the maximum depth of each decision tree.

min_samples_split, min_samples_leaf: control when a node may be split and how small a leaf may be, which limits the number of leaves.

max_features: the number of features to consider when looking for the best split
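
For reference, the two impurity measures behind criterion, where p_k is the fraction of class-k samples at the current node:

Gini = 1 − Σ_k p_k²
Entropy = − Σ_k p_k log₂(p_k)

Both are 0 for a pure node; the split that reduces the impurity the most is preferred.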

An example:

# import the data set
from sklearn.datasets import load_digits

# import the random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# import data
digits = load_digits()

X = digits.data
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# create the random forest classifier
clf = RandomForestClassifier(n_estimators=400, criterion='entropy',
                             max_depth=5, min_samples_split=3,
                             max_features='sqrt', random_state=0)
clf.fit(X_train, y_train)

print("Accuracy in train data set is: %.2f, in the test data set is %.2f"
      % (clf.score(X_train, y_train), clf.score(X_test, y_test)))

output:

Accuracy in train data set is: 0.98, in the test data set is 0.95

Another Demo:

Predicting the employee turnover rate

# Turnover rate demo

# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
#%matplotlib inline
from sklearn.model_selection import train_test_split

# read data as pandas dataframe
df = pd.read_csv('HR_comma_sep.csv', index_col = None)

# check whether any data is missing
print (df.isnull().any(), '\n\n')

# print the first few rows
print(df.head(), "\n\n")

output:
   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157   
1                0.80             0.86               5                   262   
2                0.11             0.88               7                   272   
3                0.72             0.87               5                   223   
4                0.37             0.52               2                   159   

   time_spend_company  Work_accident  left  promotion_last_5years  sales  \
0                   3              0     1                      0  sales   
1                   6              0     1                      0  sales   
2                   4              0     1                      0  sales   
3                   5              0     1                      0  sales   
4                   3              0     1                      0  sales   

   salary  
0     low  
1  medium  
2  medium  
3     low  
4     low   
# rename the columns
df = df.rename(columns = {'satisfaction_level' : 'satisfaction',
                          'last_evaluation' : 'evaluation',
                          'number_project' : 'projectCount',
                          'average_montly_hours' : 'averageMonthlyHours',
                          'time_spend_company' : 'yearsAtCompany',
                          'Work_accident' : 'workAccident',
                          'promotion_last_5years' : 'promotion',
                          'sales' : 'department',
                          'left' : 'turnover'
                        })

# move the label to the first column
front = df['turnover']
df.drop(labels=['turnover'], axis = 1, inplace = True)
df.insert(0, 'turnover', front)
#df.head()

# calculate the turnover rate
turnover_rate = df.turnover.value_counts() / len(df)
print ("the turnover rate is: %.2f\n\n" % turnover_rate[1])

# print the summary statistics from describe()
print(df.describe(), "\n\n")

output:
           turnover  satisfaction    evaluation  projectCount  \
count  12504.000000  12504.000000  12504.000000  12504.000000   
mean       0.200256      0.621834      0.716446      3.803503   
std        0.400208      0.245010      0.169745      1.196592   
min        0.000000      0.090000      0.360000      2.000000   
25%        0.000000      0.450000      0.560000      3.000000   
50%        0.000000      0.650000      0.720000      4.000000   
75%        0.000000      0.820000      0.870000      5.000000   
max        1.000000      1.000000      1.000000      7.000000   

       averageMonthlyHours  yearsAtCompany  workAccident     promotion  
count         12504.000000    12504.000000  12504.000000  12504.000000  
mean            200.721769        3.385717      0.149472      0.016555  
std              49.341169        1.321437      0.356568      0.127601  
min              96.000000        2.000000      0.000000      0.000000  
25%             157.000000        3.000000      0.000000      0.000000  
50%             200.000000        3.000000      0.000000      0.000000  
75%             244.000000        4.000000      0.000000      0.000000  
max             310.000000       10.000000      1.000000      1.000000   
# convert the string values into integer category codes
df['department'] = df['department'].astype('category').cat.codes
df['salary'] = df['salary'].astype('category').cat.codes

# split the train / test data set
target_name = 'turnover'
X = df.drop('turnover', axis = 1)
y = df[target_name]

# stratify=y keeps the turnover rate in each split equal to the rate in the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=123, stratify=y)
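
# quick sanity check (not in the original demo): stratification should keep
# the turnover rate roughly equal in both subsets
print("train turnover rate: %.3f, test turnover rate: %.3f"
      % (y_train.mean(), y_test.mean()))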

# now train the models
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# train the decision tree
dtree = tree.DecisionTreeClassifier(
    criterion='entropy',
    #max_depth=3, # constrain the depth of the tree to prevent overfitting
    min_weight_fraction_leaf=0.01 # require each leaf to hold at least 1% of the samples
    )
dtree = dtree.fit(X_train, y_train)
print("\n\n ---Decision Tree---")
print(classification_report(y_test, dtree.predict(X_test)))

output:
 ---Decision Tree---
              precision    recall  f1-score   support

           0       0.97      0.98      0.98      1500
           1       0.93      0.89      0.91       376

    accuracy                           0.96      1876
   macro avg       0.95      0.94      0.94      1876
weighted avg       0.96      0.96      0.96      1876

 

# train the random forest
rf = RandomForestClassifier(
    criterion='entropy',
    n_estimators=1000,
    max_depth=None, # None means no depth limit; set a value to curb overfitting
    min_samples_split=10, # minimum number of samples required to split a node
    #min_weight_fraction_leaf=0.02 # minimum fraction of samples per leaf, to prevent overfitting
    )
rf.fit(X_train, y_train)
print("\n\n ---Random Forest---")
print(classification_report(y_test, rf.predict(X_test)))

output:

 ---Random Forest---
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1500
           1       0.99      0.90      0.94       376

    accuracy                           0.98      1876
   macro avg       0.98      0.95      0.96      1876
weighted avg       0.98      0.98      0.98      1876
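
Since interpretability was called out as important at the start, a natural follow-up is to ask which features drive the prediction. A short sketch using the rf fitted above (feature_importances_ is scikit-learn's impurity-based importance measure):

# rank the features by their importance in the fitted forest
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))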

 
