Building a Random Forest with PySpark
- Decision Tree
- Random Forests
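Decision Tree
- The basic building block of a random forest (RF) is the decision tree (DT)
- Decision trees are commonly used for classification and regression tasks
- Entropy
- Entropy of the target
- Entropy of the target with features
- Information gain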
# example dataset
import pandas as pd
toy_data = pd.DataFrame({
    'Age_group': ['old', 'teenager', 'young', 'old', 'young', 'teenager', 'teenager',
                  'old', 'teenager', 'young', 'young', 'teenager', 'young', 'old'],
    'Smoker': ['yes', 'yes', 'yes', 'no', 'yes', 'no', 'no',
               'no', 'no', 'no', 'yes', 'yes', 'no', 'yes'],
    'Medical_condition': ['yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no',
                          'no', 'yes', 'yes', 'no', 'no', 'no', 'no'],
    'Salary_level': ['high', 'medium', 'medium', 'high', 'high', 'low', 'low',
                     'low', 'medium', 'low', 'high', 'medium', 'medium', 'medium'],
    'insurance_premium': ['high', 'high', 'low', 'high', 'low', 'high', 'low',
                          'high', 'high', 'high', 'low', 'low', 'high', 'high'],
})
toy_data
| | Age_group | Smoker | Medical_condition | Salary_level | insurance_premium |
|---|---|---|---|---|---|
| 0 | old | yes | yes | high | high |
| 1 | teenager | yes | yes | medium | high |
| 2 | young | yes | yes | medium | low |
| 3 | old | no | yes | high | high |
| 4 | young | yes | yes | high | low |
| 5 | teenager | no | yes | low | high |
| 6 | teenager | no | no | low | low |
| 7 | old | no | no | low | high |
| 8 | teenager | no | yes | medium | high |
| 9 | young | no | yes | low | high |
| 10 | young | yes | no | high | low |
| 11 | teenager | yes | no | medium | low |
| 12 | young | no | no | medium | high |
| 13 | old | yes | no | medium | high |
Entropy
Computing the entropy of the target
- target column : insurance_premium
- high 9
- low 5
- probability high : 9/14 = 0.64
- probability low : 5/14 = 0.36
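The counts above can be turned into an entropy value with a few lines of Python. This is a minimal sketch, assuming the usual base-2 Shannon entropy; the helper name `entropy` is illustrative and not part of the original notebook.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    # Empty classes are skipped: p * log2(1/p) tends to 0 as p -> 0.
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

# Target column insurance_premium: 9 'high' and 5 'low' out of 14 rows.
target_entropy = entropy([9, 5])
print(round(target_entropy, 2))  # 0.94
```

A perfectly balanced two-class target would give 1.0 and a pure one 0.0, so 0.94 means the target is close to maximally mixed before any split.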
Computing the entropy of the target with features
- feature : smoker
- yes : high 3, low 4
- no : high 6, low 1
The other features are computed the same way:
- Entropy(smoker) = 0.79
- Entropy(age_group) = 0.69
- Entropy(medical_condition) = 0.89
- Entropy(salary_level) = 0.91
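As a sketch of how these per-feature values can be obtained: each feature value defines a group of rows, and the entropy of the target within each group is weighted by the group's share of the data. The (high, low) counts below were tallied by hand from the toy table; the names `weighted_entropy` and `splits` are illustrative assumptions.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)

def weighted_entropy(groups):
    """Entropy of the target after a split: per-group entropy weighted by group size."""
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * entropy(g) for g in groups)

# (high, low) premium counts per feature value, tallied from the toy table.
splits = {
    'smoker': [(3, 4), (6, 1)],                # yes, no
    'age_group': [(4, 0), (3, 2), (2, 3)],     # old, teenager, young
    'medical_condition': [(6, 2), (3, 3)],     # yes, no
    'salary_level': [(2, 2), (4, 2), (3, 1)],  # high, medium, low
}

feature_entropy = {f: round(weighted_entropy(g), 2) for f, g in splits.items()}
print(feature_entropy)
```

For smoker this gives 0.5 * 0.985 + 0.5 * 0.592, which rounds to 0.79.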
Information Gain ( IG )
The IG of the other features is computed the same way:
- IG(smoker) = 0.15
- IG(age_group) = 0.25
- IG(medical_condition) = 0.05
- IG(salary_level) = 0.03
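Information gain is the target entropy minus the weighted entropy after splitting on a feature. The sketch below recomputes it directly from the raw toy columns using only the standard library; the function names are illustrative, not taken from the notebook.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def information_gain(feature, target):
    """IG = entropy(target) - weighted entropy of the target within each feature value."""
    n = len(target)
    remainder = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(target) - remainder

# Same values as toy_data, written as whitespace-separated strings for brevity.
columns = {
    'smoker': 'yes yes yes no yes no no no no no yes yes no yes'.split(),
    'age_group': ('old teenager young old young teenager teenager old '
                  'teenager young young teenager young old').split(),
    'medical_condition': 'yes yes yes yes yes yes no no yes yes no no no no'.split(),
    'salary_level': ('high medium medium high high low low low medium low '
                     'high medium medium medium').split(),
}
target = 'high high low high low high low high high high low low high high'.split()

gains = {name: round(information_gain(col, target), 2) for name, col in columns.items()}
print(gains)
print(max(gains, key=gains.get))  # age_group
```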
Clearly, age_group has the largest information gain, so the root node of the decision tree splits on age_group, dividing the dataset into three parts:
- toy_data(age_group == teenager)
- toy_data(age_group == young)
- toy_data(age_group == old)
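Then the same calculation is applied recursively to each of the three subsets: find the feature with the largest information gain and split on it, until no further split is possible.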
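The recursive procedure can be sketched as a small ID3-style builder. This is an illustrative sketch, not Spark's implementation: it assumes categorical features, uses each feature at most once per path, and stops when a node is pure, no features remain, or no split yields any gain. All function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def gain(rows, feature, target):
    """Information gain of splitting `rows` on `feature`."""
    n = len(rows)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

def build_tree(rows, features, target):
    """Return a leaf label, or {feature: {value: subtree}} for an internal node."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not features:
        return majority
    best = max(features, key=lambda f: gain(rows, f, target))
    if gain(rows, best, target) < 1e-12:
        return majority  # no split reduces the entropy
    rest = [f for f in features if f != best]
    return {best: {
        value: build_tree([r for r in rows if r[best] == value], rest, target)
        for value in sorted({r[best] for r in rows})
    }}

# Same values as toy_data, written as whitespace-separated strings for brevity.
names = ['Age_group', 'Smoker', 'Medical_condition', 'Salary_level', 'insurance_premium']
columns = [
    ('old teenager young old young teenager teenager old '
     'teenager young young teenager young old').split(),
    'yes yes yes no yes no no no no no yes yes no yes'.split(),
    'yes yes yes yes yes yes no no yes yes no no no no'.split(),
    ('high medium medium high high low low low medium low '
     'high medium medium medium').split(),
    'high high low high low high low high high high low low high high'.split(),
]
rows = [dict(zip(names, values)) for values in zip(*columns)]

tree = build_tree(rows, names[:-1], 'insurance_premium')
print(tree)
```

On the toy data the root is Age_group, and the `old` branch is already a pure leaf (`high`), so only the `teenager` and `young` subsets need further splitting.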
Random Forests
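Now that the decision tree calculation is clear, we can move on to random forests. As the name suggests, a random forest is made up of many decision trees and combines their results to produce the final prediction. This works well in practice: a random forest is usually more accurate than a single tree.
Combination strategies:
- Regression: averaging, weighted averaging
- Classification: voting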
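The two combination strategies can be illustrated in plain Python. This is a sketch over hand-written per-tree outputs, not how Spark aggregates its trees internally; `average` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter

def average(predictions, weights=None):
    """Regression: plain or weighted mean of the per-tree predictions."""
    if weights is None:
        weights = [1.0] * len(predictions)
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)

def majority_vote(predictions):
    """Classification: the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

print(average([10.0, 12.0, 14.0]))             # 12.0
print(average([10.0, 20.0], weights=[3, 1]))   # 12.5
print(majority_vote(['high', 'low', 'high']))  # high
```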
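Characteristics of random forests:
- Feature importance: useful for feature selection
- Improved performance: typically better than a single decision tree
- Reduced overfitting
- Higher computational cost: many decision trees have to be trained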
Implementation
Let's build a random forest model using Spark's MLlib:
- create a SparkSession & load the dataset
- EDA
- feature engineering
- split into train/test sets
- build & train the models
- evaluation
SparkSession & load data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('random_forest').getOrCreate()
# load data
df = spark.read.csv('./Data/affairs.csv', inferSchema=True, header=True)
EDA
print((df.count(), len(df.columns)))
(6366, 6)
df.printSchema()
root
|-- rate_marriage: integer (nullable = true)
|-- age: double (nullable = true)
|-- yrs_married: double (nullable = true)
|-- children: double (nullable = true)
|-- religious: integer (nullable = true)
|-- affairs: integer (nullable = true)
df.show(5, False)
+-------------+----+-----------+--------+---------+-------+
|rate_marriage|age |yrs_married|children|religious|affairs|
+-------------+----+-----------+--------+---------+-------+
|5 |32.0|6.0 |1.0 |3 |0 |
|4 |22.0|2.5 |0.0 |2 |0 |
|3 |32.0|9.0 |3.0 |3 |1 |
|3 |27.0|13.0 |3.0 |1 |1 |
|4 |22.0|2.5 |0.0 |1 |1 |
+-------------+----+-----------+--------+---------+-------+
only showing top 5 rows
df.describe().show()
+-------+------------------+------------------+-----------------+------------------+------------------+------------------+
|summary| rate_marriage| age| yrs_married| children| religious| affairs|
+-------+------------------+------------------+-----------------+------------------+------------------+------------------+
| count| 6366| 6366| 6366| 6366| 6366| 6366|
| mean| 4.109644989004084|29.082862079798932| 9.00942507068803|1.3968740182218033|2.4261702796104303|0.3224945020420987|
| stddev|0.9614295945655025| 6.847881883668817|7.280119972766412| 1.433470828560344|0.8783688402641785| 0.467467779921086|
| min| 1| 17.5| 0.5| 0.0| 1| 0|
| max| 5| 42.0| 23.0| 5.5| 4| 1|
+-------+------------------+------------------+-----------------+------------------+------------------+------------------+
df.groupBy('affairs').count().show() # about 32% had an affair
+-------+-----+
|affairs|count|
+-------+-----+
| 1| 2053|
| 0| 4313|
+-------+-----+
df.groupBy('rate_marriage').count().show() # most respondents rate their marriage 4 or 5
+-------------+-----+
|rate_marriage|count|
+-------------+-----+
| 1| 99|
| 3| 993|
| 5| 2684|
| 4| 2242|
| 2| 348|
+-------------+-----+
# affairs broken down by marriage rating
temp_df = df.groupBy('rate_marriage', 'affairs').count().orderBy('rate_marriage','affairs','count', ascending=True)
temp_df.show()
+-------------+-------+-----+
|rate_marriage|affairs|count|
+-------------+-------+-----+
| 1| 0| 25|
| 1| 1| 74|
| 2| 0| 127|
| 2| 1| 221|
| 3| 0| 446|
| 3| 1| 547|
| 4| 0| 1518|
| 4| 1| 724|
| 5| 0| 2197|
| 5| 1| 487|
+-------------+-------+-----+
# number of people with an affair, per rating
temp_df = temp_df.filter(temp_df.affairs==1)
temp_df.show()
+-------------+-------+-----+
|rate_marriage|affairs|count|
+-------------+-------+-----+
| 1| 1| 74|
| 2| 1| 221|
| 3| 1| 547|
| 4| 1| 724|
| 5| 1| 487|
+-------------+-------+-----+
# total number of people, per rating
temp_2 = df.groupBy('rate_marriage').count()
temp_2.show()
+-------------+-----+
|rate_marriage|count|
+-------------+-----+
| 1| 99|
| 3| 993|
| 5| 2684|
| 4| 2242|
| 2| 348|
+-------------+-----+
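The two tables above are most informative when combined into an affair rate per rating. The sketch below does this in plain Python with the counts copied from the printed outputs, so it runs without a Spark session; the variable names are illustrative.

```python
# Counts copied from the two tables above.
affairs_by_rating = {1: 74, 2: 221, 3: 547, 4: 724, 5: 487}
total_by_rating = {1: 99, 2: 348, 3: 993, 4: 2242, 5: 2684}

# Share of respondents with an affair, per marriage rating.
affair_rate = {
    rating: round(affairs_by_rating[rating] / total_by_rating[rating], 3)
    for rating in sorted(total_by_rating)
}
print(affair_rate)
# {1: 0.747, 2: 0.635, 3: 0.551, 4: 0.323, 5: 0.181}
```

The rate falls steadily as the rating rises, which suggests rate_marriage carries a lot of signal for the target.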
# religious
df.groupBy('religious', 'affairs').count().orderBy('religious', 'affairs', 'count', ascending=True).show()
+---------+-------+-----+
|religious|affairs|count|
+---------+-------+-----+
| 1| 0| 613|
| 1| 1| 408|
| 2| 0| 1448|
| 2| 1| 819|
| 3| 0| 1715|
| 3| 1| 707|
| 4| 0| 537|
| 4| 1| 119|
+---------+-------+-----+
# children
df.groupBy('children', 'affairs').count().orderBy('children', 'affairs', 'count', ascending=True).show()
+--------+-------+-----+
|children|affairs|count|
+--------+-------+-----+
| 0.0| 0| 1912|
| 0.0| 1| 502|
| 1.0| 0| 747|
| 1.0| 1| 412|
| 2.0| 0| 873|
| 2.0| 1| 608|
| 3.0| 0| 460|
| 3.0| 1| 321|
| 4.0| 0| 197|
| 4.0| 1| 131|
| 5.5| 0| 124|
| 5.5| 1| 79|
+--------+-------+-----+
df.groupBy('affairs').mean().show()
+-------+------------------+------------------+------------------+------------------+------------------+------------+
|affairs|avg(rate_marriage)| avg(age)| avg(yrs_married)| avg(children)| avg(religious)|avg(affairs)|
+-------+------------------+------------------+------------------+------------------+------------------+------------+
| 1|3.6473453482708234|30.537018996590355|11.152459814905017|1.7289332683877252| 2.261568436434486| 1.0|
| 0| 4.329700904242986| 28.39067934152562| 7.989334569904939|1.2388128912589844|2.5045212149316023| 0.0|
+-------+------------------+------------------+------------------+------------------+------------------+------------+
create feature data
from pyspark.ml.feature import VectorAssembler
df_assembler = VectorAssembler(inputCols=['rate_marriage', 'age', 'yrs_married', 'children', 'religious'], outputCol='features')
df = df_assembler.transform(df)
df.show(5)
+-------------+----+-----------+--------+---------+-------+--------------------+
|rate_marriage| age|yrs_married|children|religious|affairs| features|
+-------------+----+-----------+--------+---------+-------+--------------------+
| 5|32.0| 6.0| 1.0| 3| 0|[5.0,32.0,6.0,1.0...|
| 4|22.0| 2.5| 0.0| 2| 0|[4.0,22.0,2.5,0.0...|
| 3|32.0| 9.0| 3.0| 3| 1|[3.0,32.0,9.0,3.0...|
| 3|27.0| 13.0| 3.0| 1| 1|[3.0,27.0,13.0,3....|
| 4|22.0| 2.5| 0.0| 1| 1|[4.0,22.0,2.5,0.0...|
+-------------+----+-----------+--------+---------+-------+--------------------+
only showing top 5 rows
df.select(['features', 'affairs']).show(5)
+--------------------+-------+
| features|affairs|
+--------------------+-------+
|[5.0,32.0,6.0,1.0...| 0|
|[4.0,22.0,2.5,0.0...| 0|
|[3.0,32.0,9.0,3.0...| 1|
|[3.0,27.0,13.0,3....| 1|
|[4.0,22.0,2.5,0.0...| 1|
+--------------------+-------+
only showing top 5 rows
data = df.select(['features', 'affairs'])
splitting train/test set
train_df , test_df = data.randomSplit([0.75, 0.25])
print('train set (%d, %d)'%(train_df.count(), len(train_df.columns)))
print('test set (%d, %d)'%(test_df.count(), len(test_df.columns)))
train set (4784, 2)
test set (1582, 2)
build model
- Logistic Regression vs. Decision Tree vs. Random Forests
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, DecisionTreeClassifier
rf = RandomForestClassifier(labelCol='affairs', numTrees=50).fit(train_df)
lr = LogisticRegression(labelCol='affairs').fit(train_df)
dt = DecisionTreeClassifier(labelCol='affairs').fit(train_df)
rf_pred = rf.transform(test_df)
lr_pred = lr.transform(test_df)
dt_pred = dt.transform(test_df)
Evaluation
- Accuracy
- Precision
- AUC
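To make the three metrics concrete, here is a minimal pure-Python sketch on a tiny hand-made example. It assumes weighted precision means per-class precision weighted by each class's share of the true labels, and computes AUC as the fraction of (positive, negative) pairs ranked correctly, with ties counted as half. The helper names and the example labels and scores are made up for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_precision(y_true, y_pred):
    """Per-class precision, weighted by how often each class occurs in y_true."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)
    return sum(
        (correct[c] / predicted[c] if predicted[c] else 0.0) * support[c] / total
        for c in support
    )

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]                # one positive is missed
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.9]   # hypothetical probabilities of class 1

print(round(accuracy(y_true, y_pred), 3))            # 0.833
print(round(weighted_precision(y_true, y_pred), 3))  # 0.867
print(auc(y_true, scores))                           # 0.875
```

Accuracy and weighted precision depend only on the hard predictions, while AUC depends on how the scores rank the examples, which is why a model can look similar on the first two and differ clearly on the third.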
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator #auc
rf_accuracy = MulticlassClassificationEvaluator(labelCol='affairs', metricName='accuracy').evaluate(rf_pred)
print("RF's accuracy is %f"%rf_accuracy)
lr_accuracy = MulticlassClassificationEvaluator(labelCol='affairs', metricName='accuracy').evaluate(lr_pred)
print("LR's accuracy is %f"%lr_accuracy)
dt_accuracy= MulticlassClassificationEvaluator(labelCol='affairs', metricName='accuracy').evaluate(dt_pred)
print("DT's accuracy is %f"%dt_accuracy)
RF's accuracy is 0.727560
LR's accuracy is 0.724399
DT's accuracy is 0.719343
rf_precision = MulticlassClassificationEvaluator(labelCol='affairs', metricName='weightedPrecision').evaluate(rf_pred)
print("RF's precision is %f"%rf_precision)
lr_precision = MulticlassClassificationEvaluator(labelCol='affairs', metricName='weightedPrecision').evaluate(lr_pred)
print("LR's precision is %f"%lr_precision)
dt_precision= MulticlassClassificationEvaluator(labelCol='affairs', metricName='weightedPrecision').evaluate(dt_pred)
print("DT's precision is %f"%dt_precision)
RF's precision is 0.709906
LR's precision is 0.706239
DT's precision is 0.707323
# BinaryClassificationEvaluator's default metric is areaUnderROC
rf_auc = BinaryClassificationEvaluator(labelCol='affairs').evaluate(rf_pred)
print("RF's AUC is %f"%rf_auc)
lr_auc = BinaryClassificationEvaluator(labelCol='affairs').evaluate(lr_pred)
print("LR's AUC is %f"%lr_auc)
dt_auc = BinaryClassificationEvaluator(labelCol='affairs').evaluate(dt_pred)
print("DT's AUC is %f"%dt_auc)
RF's AUC is 0.752915
LR's AUC is 0.745961
DT's AUC is 0.609049
feature importances
rf.featureImportances
SparseVector(5, {0: 0.5652, 1: 0.0286, 2: 0.2444, 3: 0.0781, 4: 0.0836})
df.schema['features'].metadata['ml_attr']['attrs']
{'numeric': [{'idx': 0, 'name': 'rate_marriage'},
{'idx': 1, 'name': 'age'},
{'idx': 2, 'name': 'yrs_married'},
{'idx': 3, 'name': 'children'},
{'idx': 4, 'name': 'religious'}]}
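The importance vector is indexed by position, so it helps to pair it with the feature names from the metadata. The sketch below does that in plain Python with the values copied from the two outputs above; in a live session the same dictionaries would come from `rf.featureImportances` and the `ml_attr` metadata.

```python
# Values copied from the two outputs above.
importances = {0: 0.5652, 1: 0.0286, 2: 0.2444, 3: 0.0781, 4: 0.0836}
attrs = [
    {'idx': 0, 'name': 'rate_marriage'},
    {'idx': 1, 'name': 'age'},
    {'idx': 2, 'name': 'yrs_married'},
    {'idx': 3, 'name': 'children'},
    {'idx': 4, 'name': 'religious'},
]

# Map vector index -> feature name, then rank by importance (highest first).
names = {a['idx']: a['name'] for a in attrs}
ranked = sorted(
    ((names[i], value) for i, value in importances.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in ranked:
    print(f'{name:15s} {value:.4f}')
```

rate_marriage and yrs_married come out on top, together accounting for about 0.81 of the total importance.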