VII. Decision Trees
1. Basic principle
Similar inputs lead to similar outputs.
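Age: young-1, middle-aged-2, senior-3
Education: junior college-1, bachelor-2, master-3, doctorate-4
Experience: lacking-1, average-2, rich-3, veteran-4
Gender: male-1, female-2
Salary level: 1-low, 2-medium, 3-high, 4-very high
Age  Education  Experience  Gender  ->  Salary  Level
1    1          1           2           5000    1
1    2          2           1           8000    2
2    3          3           2           10000   3
3    4          4           1           30000   4
...
------------------------------------------------------
1    2          2           1           ?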
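Regression: average    \  weighted by the degree
Classification: vote   /  of feature similarity
As the sub-tables are split further, the information entropy keeps decreasing, the amount of information keeps increasing, and the data become more ordered.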
11123 2... 3...
12211 2... 3...
11221 2... 3...
11... 12... 13... 14...
11... 12... 13... 14...
11... 12... 13... 14...
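Each column of the original sample matrix is selected in turn to build a tree of sub-tables, where each sub-table groups the samples that share the same value of that feature. In a leaf-level sub-table all feature values are identical. An input with unknown output is assigned to a leaf-level sub-table by the same rules, and its predicted output is computed from the outputs of the samples in that sub-table, by averaging (regression) or by voting (classification).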
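The leaf lookup described above can be sketched in a few lines. This is a minimal illustration, not the notes' own implementation: it collapses the column-by-column splitting into a single dictionary keyed by the full feature tuple, which lands each sample in the same leaf. The first four rows come from the table above; the last two rows are hypothetical, added only so that the leaf holds more than one sample.

```python
from collections import Counter, defaultdict

# Rows: (age, education, experience, gender), salary, salary level.
# The first four rows come from the table above; the last two are
# hypothetical rows added only so the leaf holds more than one sample.
rows = [
    ((1, 1, 1, 2), 5000, 1),
    ((1, 2, 2, 1), 8000, 2),
    ((2, 3, 3, 2), 10000, 3),
    ((3, 4, 4, 1), 30000, 4),
    ((1, 2, 2, 1), 9000, 2),    # hypothetical
    ((1, 2, 2, 1), 10000, 3),   # hypothetical
]


def build_leaves(rows):
    """Group samples whose feature values are all identical (leaf sub-tables)."""
    leaves = defaultdict(list)
    for features, salary, level in rows:
        leaves[features].append((salary, level))
    return leaves


def predict(leaves, features):
    """Return (mean salary, majority level) of the matching leaf, or None."""
    leaf = leaves.get(features)
    if not leaf:
        return None
    salaries = [salary for salary, _ in leaf]
    levels = [level for _, level in leaf]
    mean_salary = sum(salaries) / len(salaries)        # regression: average
    voted_level = Counter(levels).most_common(1)[0][0]  # classification: vote
    return mean_salary, voted_level


leaves = build_leaves(rows)
result = predict(leaves, (1, 2, 2, 1))
print(result)  # (9000.0, 2)
```

An input whose exact feature combination never occurs in the training rows has no leaf here and returns None. A real decision tree avoids this by stopping the split early and predicting from a coarser sub-table.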
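2. Engineering optimizations
- Measure each feature's influence on the prediction by the reduction in information entropy it produces: the larger the reduction, the greater the influence.
- Choose the features used to split the sub-tables in descending order of that influence, i.e. split on the most influential feature first.
- Stop splitting early according to conditions given in advance. This simplifies the tree structure and shortens construction and search time, improving model performance at little cost in prediction accuracy.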
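As a rough illustration of the first two points, the sketch below computes the entropy reduction obtained by splitting on each feature and ranks the features by it. The helper names and the four-sample toy data are hypothetical, chosen so that one feature determines the label and the other carries no information.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def entropy_reduction(column, labels):
    """Entropy before the split minus the weighted entropy of the sub-tables."""
    total = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    after = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - after


# Hypothetical toy data: two features and a binary label.
# Feature 0 determines the label exactly; feature 1 is uninformative.
x = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 1, 1]

gains = [entropy_reduction([row[j] for row in x], y) for j in range(2)]
# Split on the feature with the largest entropy reduction first.
order = sorted(range(2), key=lambda j: gains[j], reverse=True)
print(gains, order)
```

Here feature 0 removes the full one bit of entropy and feature 1 removes none, so feature 0 is chosen first.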
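3. Ensemble algorithms
- An ensemble algorithm, also called an ensemble of weak learners, combines the conclusions of several different learners by averaging or voting to give a more reliable prediction. The weak learners chosen should be sufficiently diverse in algorithm or in data so that they reflect different tendencies; only then does the combined conclusion generalize better.
- A decision-tree-based ensemble builds, according to some rule, multiple decision tree models that differ from one another, lets each one predict the unknown sample, and combines the predictions by averaging or voting.
- Depending on the rule used to build the trees, decision-tree-based ensembles fall into the following kinds:
A: Bootstrap aggregating (bagging): randomly draw a subset of the original training samples, with or without replacement, and build one decision tree; repeat to obtain several decision trees. This weakens the influence of certain dominant samples on the prediction and improves model accuracy.
B: Random forest: if, on top of bootstrap aggregating, each tree is built not only from randomly selected samples (rows) but also from randomly selected features (columns), the result is called a random forest.
C: Boosting: first assign equal weights to all training samples and build the first decision tree; use it to predict the training samples and raise the weights of those predicted incorrectly; then build the next decision tree, and so on. The result is a series of decision trees, each trained with different sample weights.
Code: house.py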
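The listing of house.py is not reproduced in these notes. As an independent, minimal sketch of the three kinds of ensemble (A, B, C), the following fits scikit-learn's BaggingRegressor, RandomForestRegressor and AdaBoostRegressor on synthetic data. The data set, the hyperparameters and the variable names are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, BaggingRegressor,
                              RandomForestRegressor)
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption; not the data used by house.py)
x, y = make_regression(n_samples=300, n_features=6, n_informative=4,
                       noise=10.0, random_state=7)
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.2, random_state=7)

models = {
    # A: bagging, each tree is fitted on a bootstrap sample of the rows
    'bagging': BaggingRegressor(
        DecisionTreeRegressor(max_depth=4),
        n_estimators=50, random_state=7),
    # B: random forest, random rows plus a random subset of the columns
    'random forest': RandomForestRegressor(
        max_depth=4, n_estimators=50, max_features=0.5, random_state=7),
    # C: boosting (AdaBoost), sample weights are updated between trees
    'adaboost': AdaBoostRegressor(
        DecisionTreeRegressor(max_depth=4),
        n_estimators=50, random_state=7),
}

scores = {}
for name, model in models.items():
    model.fit(train_x, train_y)
    scores[name] = r2_score(test_y, model.predict(test_x))
    print(name, round(scores[name], 3))
```

The R2 scores depend on the synthetic data and on the library version, so no particular ranking of the three methods should be read into them.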
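4. Feature importance
- When a decision tree model decides which feature to split on first, it applies the principle of maximum entropy reduction. As a by-product of training, the model therefore yields the strength of each feature's influence on the output, i.e. the feature importance: feature_importances_. This output depends on the model algorithm.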
Code: fi.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import matplotlib.pyplot as mp
# Note: load_boston was removed in scikit-learn 1.2,
# so this script needs an older scikit-learn release.
boston = sd.load_boston()
feature_names = boston.feature_names
x, y = su.shuffle(boston.data, boston.target, random_state=7)
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
# Feature importances given by the decision tree regressor
fi_dt = model.feature_importances_
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400, random_state=7)
model.fit(train_x, train_y)
# Feature importances given by the decision-tree-based AdaBoost regressor
fi_ab = model.feature_importances_
mp.figure('Feature Importance', facecolor='lightgray')
mp.subplot(211)
mp.title('Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
# Sort features by importance, largest first
sorted_indices = fi_dt.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dt[sorted_indices],
       facecolor='deepskyblue', edgecolor='steelblue')
mp.xticks(pos, feature_names[sorted_indices], rotation=30)
mp.subplot(212)
mp.title('AdaBoost Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_ab.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_ab[sorted_indices],
       facecolor='lightcoral', edgecolor='indianred')
mp.xticks(pos, feature_names[sorted_indices], rotation=30)
mp.tight_layout()
mp.show()
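The feature importances a model computes depend not only on the chosen algorithm but also on the granularity at which the data were collected.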
Code: bike.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import csv
import numpy as np
import sklearn.utils as su
import sklearn.ensemble as se
import sklearn.metrics as sm
import matplotlib.pyplot as mp
with open('../../data/bike_day.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
# The first row holds the column names
fn_dy = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# Random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000, min_samples_split=2)
model.fit(train_x, train_y)
# Feature importances based on the "day" data set
fi_dy = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
with open('../../data/bike_hour.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_hr = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# Random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000, min_samples_split=2)
model.fit(train_x, train_y)
# Feature importances based on the "hour" data set
fi_hr = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
mp.figure('Bike', facecolor='lightgray')
mp.subplot(211)
mp.title('Day', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dy.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dy[sorted_indices],
       facecolor='deepskyblue', edgecolor='steelblue')
mp.xticks(pos, fn_dy[sorted_indices], rotation=30)
mp.subplot(212)
mp.title('Hour', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_hr.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_hr[sorted_indices],
       facecolor='lightcoral', edgecolor='indianred')
mp.xticks(pos, fn_hr[sorted_indices], rotation=30)
mp.tight_layout()
mp.show()
VIII. Manual Classification
- Input                    Output
  ----------------------   ------
  Feature 1   Feature 2
  3           1            0
  2           5            1
  1           8            1
  6           4            0
  5           2            0
  3           5            1
  4           7            1
  4           -1           0
  ------------------------------
  6           8            1
  5           1            0
Code: simple.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import matplotlib.pyplot as mp
x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
# Build a dense grid covering the samples, with a margin of 1 on each side
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
# Hand-written rule: class 1 where feature 1 < feature 2, otherwise class 0
flat_y = np.zeros(len(flat_x), dtype=int)
flat_y[flat_x[:, 0] < flat_x[:, 1]] = 1
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Simple Classification', facecolor='lightgray')
mp.title('Simple Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
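IX. Logistic Classification
- Linear prediction:
$$y = w_0 + w_1 x_1 + w_2 x_2$$
Continuous prediction -> discrete prediction:
$$(-\infty, +\infty) \to \{0, 1\}$$
Logistic function:
$$\mathrm{sigmoid}(y) = \frac{1}{1 + e^{-y}}$$
Nonlinearization:
$$y = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}}$$
3 1 -> 0.2 0
2 5 -> 0.8 1
1 8 -> 0.7 1
6 4 -> 0.3 0
...
The output of the prediction function is treated as the probability that the input belongs to class 1, and the class with the higher probability is chosen as the prediction.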
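A small numeric sketch of this mapping from a linear score to a probability and then to a class label. The weights w0, w1, w2 below are hypothetical values picked by hand for illustration, not fitted ones, so the probabilities differ from the illustrative figures in the notes while the resulting labels agree with them.

```python
import math


def sigmoid(y):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def predict(x1, x2, w0, w1, w2, threshold=0.5):
    """Return (probability of class 1, predicted class) for one sample."""
    p = sigmoid(w0 + w1 * x1 + w2 * x2)
    return p, int(p >= threshold)


# Hypothetical weights chosen by hand for illustration (not fitted).
w0, w1, w2 = 0.0, -1.0, 1.0

samples = [(3, 1), (2, 5), (1, 8), (6, 4)]
labels = []
for x1, x2 in samples:
    p, label = predict(x1, x2, w0, w1, w2)
    labels.append(label)
    print(x1, x2, round(p, 3), label)
# labels == [0, 1, 1, 0]
```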
Code: log2.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp
x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
# Logistic classifier
model = lm.LogisticRegression(
    solver='liblinear', C=1)
model.fit(x, y)
# Build a dense grid covering the samples and classify every grid point
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Logistic Classification', facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
Multiclass classification
      /\
  walk  not walk
          /\
     cycle  not cycle
              /\
      take bus  not take bus
                  /\
             drive  not drive
                      ...
            A          B          C
... -> A    1  0.7     0  0.2     0  0.1    -> A
... -> B    0  0.1     1  0.8     0  0.4    -> B
... -> C    0  0.3     0  0.3     1  0.9    -> C
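The table above can be read as three binary classifiers (A or not A, B or not B, C or not C), each reporting a class-1 probability, with the final class being the one whose classifier is most confident. A minimal sketch of that final step, using the probabilities from the table; the variable names are illustrative.

```python
import numpy as np

classes = np.array(['A', 'B', 'C'])

# Rows are samples; columns are the class-1 probabilities reported by the
# "A vs. rest", "B vs. rest" and "C vs. rest" binary classifiers
# (values copied from the table above).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.4],
    [0.3, 0.3, 0.9]])

# Pick, for every sample, the class whose binary classifier is most confident
predicted = classes[probs.argmax(axis=1)]
print(predicted)  # ['A' 'B' 'C']
```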
Code: log3.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp
x = np.array([
    [4, 7],
    [3.5, 8],
    [3.1, 6.2],
    [0.5, 1],
    [1, 2],
    [1.2, 1.9],
    [6, 2],
    [5.7, 1.5],
    [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Logistic classifier
model = lm.LogisticRegression(
    solver='liblinear', C=1000)
model.fit(x, y)
# Build a dense grid covering the samples and classify every grid point
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Logistic Classification', facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
X. Naive Bayes Classification
- Algorithm principle
1000  10  1
P(meeting a beauty) = 10/1000 = 0.01
P(being loved by a beauty) = 1/10 = 0.1
Bayes' theorem:
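$$P(A|B) = \frac{P(A)P(B|A)}{P(B)}$$
$$P(A|B)P(B) = P(B|A)P(A)$$
$$P(A,B) = P(B,A)$$
3 1 -> 0 or 1
$$P(C|x) = \frac{P(C)P(x|C)}{P(x)}$$
$$\begin{aligned}
P(C)P(x|C) &= P(C,x) \\
&= P(C,x_1,x_2) \\
&= P(x_1,x_2,C) \\
&= P(x_1|x_2,C)P(x_2,C) \\
&= P(x_1|x_2,C)P(x_2|C)P(C) \\
&= P(x_1|C)P(x_2|C)P(C) \quad \text{(naive: conditional independence assumption)}
\end{aligned}$$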
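To make the final line of the derivation concrete, the sketch below evaluates P(C)P(x1|C)P(x2|C) for the sample (3, 1) with each class C and picks the larger value. It reuses the eight training samples from the manual classification example and assumes that each P(xi|C) is a Gaussian estimated from the class's mean and variance, which is the modelling choice behind sklearn's GaussianNB. The function names are illustrative.

```python
import numpy as np

# Training samples reused from the manual classification example
x = np.array([[3, 1], [2, 5], [1, 8], [6, 4],
              [5, 2], [3, 5], [4, 7], [4, -1]], dtype=float)
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])


def gaussian_pdf(value, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return np.exp(-(value - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)


def joint_scores(sample):
    """Return {C: P(C) P(x1|C) P(x2|C)} for every class C."""
    scores = {}
    for c in np.unique(y):
        xc = x[y == c]
        prior = len(xc) / len(x)                      # P(C)
        # Per-feature likelihoods P(x1|C), P(x2|C); Gaussian by assumption
        likelihood = gaussian_pdf(sample, xc.mean(axis=0), xc.var(axis=0))
        scores[int(c)] = prior * likelihood.prod()
    return scores


scores = joint_scores(np.array([3.0, 1.0]))
total = sum(scores.values())                          # plays the role of P(x)
posterior = {c: s / total for c, s in scores.items()}
predicted = max(scores, key=scores.get)
print(posterior, predicted)
```

Because P(x) is the same for every class, comparing the unnormalized products is enough to choose the class; the division only turns them into posteriors that sum to 1. For the sample (3, 1) the prediction is class 0.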
Code: nb.py
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp
x, y = [], []
with open('../../data/multiple1.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        # All columns but the last are features; the last one is the label
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
# Naive Bayes classifier
model = nb.GaussianNB()
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_y = model.predict(x)
# Accuracy on the training samples
print((pred_y == y).sum() / pred_y.size)
mp.figure('Naive Bayes Classification', facecolor='lightgray')
mp.title('Naive Bayes Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()