VII. Decision Trees
1. Basic Principle
Similar inputs lead to similar outputs.
Age: young-1, middle-aged-2, elderly-3
Education: associate-1, bachelor-2, master-3, doctorate-4
Experience: little-1, average-2, rich-3, senior-4
Gender: male-1, female-2
Salary: 1-low, 2-medium, 3-high, 4-very high
Age  Education  Experience  Gender  ->  Salary   Class
1    1          1           2           5000     1
1    2          2           1           8000     2
2    3          3           2           10000    3
3    4          4           1           30000    4
...
------------------------------------------
1    2          2           1           ?
Regression: average the outputs; classification: vote among them. In either case the results can be weighted by how similar the features are.
As the sub-tables are split further, their information entropy keeps decreasing (each split yields an information gain), and the data become more and more ordered.
[Splitting sketch: split the original table by the first feature's values (1/2/3) into sub-tables, then split each sub-table by the second feature's values (1/2/3/4), and so on, feature by feature, down to leaf-level sub-tables.]
Take each column of the original sample matrix in turn and split the table into sub-tables whose rows share that feature value; in a leaf-level sub-table all feature values agree. For an input with unknown output, route it by the same rules into some leaf-level sub-table, then compute the predicted output from that sub-table's samples, by averaging (regression) or by voting (classification).
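As a minimal sketch of this idea only (exact feature matches, no actual tree structure; the grouping below is illustrative, not the library algorithm), using the salary table above:

import numpy as np

# training samples: age, education, experience, gender -> salary
X = np.array([[1, 1, 1, 2],
              [1, 2, 2, 1],
              [2, 3, 3, 2],
              [3, 4, 4, 1]])
y = np.array([5000, 8000, 10000, 30000])

def predict(query):
    """Route the query to the sub-table of rows with matching features and
    average their outputs (regression); fall back to the global mean when
    no row matches."""
    mask = (X == query).all(axis=1)
    return y[mask].mean() if mask.any() else y.mean()

print(predict([1, 2, 2, 1]))  # -> 8000.0, from the row with identical features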
2. Engineering Optimizations
- Use the reduction in information entropy to measure each feature's influence on the prediction: the larger the entropy reduction a feature brings, the greater its influence (see the entropy sketch after this list).
- Using the influences computed above, choose the feature to split sub-tables on in descending order of influence, i.e. always prefer the most influential remaining feature.
- End the splitting of sub-tables early, according to conditions given in advance, to simplify the tree's structure and shorten construction and search time, improving model performance at little cost in prediction accuracy.
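A minimal sketch of entropy-based feature scoring under this framing, assuming discrete features; the information gain of a split is the parent entropy minus the weighted entropy of the child sub-tables (the toy columns are illustrative):

import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Entropy reduction achieved by splitting on one discrete feature column."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature):
        sub = [l for f, l in zip(feature, labels) if f == v]
        gain -= len(sub) / n * entropy(sub)
    return gain

# toy check: the education column vs. the salary class from the table above
education = [1, 2, 3, 4]
salary_cls = [1, 2, 3, 4]
print(information_gain(education, salary_cls))  # 2.0 bits: a perfect split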
3. Ensemble Algorithms
- An ensemble algorithm, also called a weak-learner ensemble, rests on one core idea: combine the conclusions of several different learners, by averaging or by voting, into one comparatively reliable prediction. The chosen weak learners should be sufficiently diverse, in algorithm or in data, so that they embody different tendencies; only then does the combined conclusion generalize better.
- A decision-tree-based ensemble builds, by some rule, multiple mutually different decision trees, lets each give its own prediction for the unknown sample, and finally averages or votes to reach a combined conclusion.
- Depending on the rule used to build the multiple trees, decision-tree-based ensembles divide into the following kinds:
A. Bootstrap aggregating (bagging): randomly draw part of the samples from the original training set, with or without replacement, and build one decision tree; repeat to obtain several trees. This weakens the influence of individual dominant samples on the prediction and improves model accuracy.
B. Random forest: if, on top of bagging, each tree is built from not only randomly chosen samples (rows) but also randomly chosen features (columns), the ensemble is called a random forest.
C. Boosting (AdaBoost): first assign equal weights to the training samples and build the first tree; use it to predict the training samples and raise the weights of the mispredicted ones; build the next tree on the reweighted samples, and so on, yielding several trees, each trained with different per-sample weights.
Code: house.py
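house.py itself is not reproduced in these notes. A plausible minimal version, judging from fi.py below, would compare a single decision tree with an AdaBoost ensemble on the Boston housing data; everything beyond the parameter values seen in fi.py is an assumption here (note also that load_boston has been removed from recent scikit-learn releases):

import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import sklearn.metrics as sm

boston = sd.load_boston()  # as in the original fi.py; removed in newer sklearn
x, y = su.shuffle(boston.data, boston.target, random_state=7)
train_size = int(len(x) * 0.8)
train_x, test_x = x[:train_size], x[train_size:]
train_y, test_y = y[:train_size], y[train_size:]

# single decision tree regressor
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
print(sm.r2_score(test_y, model.predict(test_x)))

# AdaBoost ensemble built on the same base tree
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400, random_state=7)
model.fit(train_x, train_y)
print(sm.r2_score(test_y, model.predict(test_x)))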
4. Feature Importance
- When deciding which feature to split sub-tables on first, a decision tree model relies on the maximum-entropy-reduction principle. As a by-product of training, it therefore yields each feature's influence on the output, i.e. the feature importance: feature_importances_. This output depends on the model's algorithm.
Code: fi.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import matplotlib.pyplot as mp

boston = sd.load_boston()
feature_names = boston.feature_names
x, y = su.shuffle(boston.data, boston.target, random_state=7)
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
# feature importances from the decision tree regressor
fi_dt = model.feature_importances_
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400, random_state=7)
model.fit(train_x, train_y)
# feature importances from the AdaBoost (boosted decision tree) regressor
fi_ab = model.feature_importances_
mp.figure('Feature Importance', facecolor='lightgray')
mp.subplot(211)
mp.title('Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dt.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dt[sorted_indices], facecolor='deepskyblue',
       edgecolor='steelblue')
mp.xticks(pos, feature_names[sorted_indices], rotation=30)
mp.subplot(212)
mp.title('AdaBoost Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_ab.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_ab[sorted_indices], facecolor='lightcoral',
       edgecolor='indianred')
mp.xticks(pos, feature_names[sorted_indices], rotation=30)
mp.tight_layout()
mp.show()
A learning model's computed feature importances depend not only on the chosen algorithm but also on the granularity at which the data were collected.
Code: bike.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import csv
import numpy as np
import sklearn.utils as su
import sklearn.ensemble as se
import sklearn.metrics as sm
import matplotlib.pyplot as mp

with open('../../data/bike_day.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_dy = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
# feature importances from the "day" data set
fi_dy = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
with open('../../data/bike_hour.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_hr = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
# feature importances from the "hour" data set
fi_hr = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
mp.figure('Bike', facecolor='lightgray')
mp.subplot(211)
mp.title('Day', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dy.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dy[sorted_indices], facecolor='deepskyblue',
       edgecolor='steelblue')
mp.xticks(pos, fn_dy[sorted_indices], rotation=30)
mp.subplot(212)
mp.title('Hour', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_hr.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_hr[sorted_indices], facecolor='lightcoral',
       edgecolor='indianred')
mp.xticks(pos, fn_hr[sorted_indices], rotation=30)
mp.tight_layout()
mp.show()
VIII. Manual Classification
Input                 Output
feature1  feature2
3         1           0
2         5           1
1         8           1
6         4           0
5         2           0
3         5           1
4         7           1
4         -1          0
------------------
6         8           1
5         1           0
(Hand-written rule: class 1 whenever feature2 > feature1; the rows below the line are test samples.)
Code: simple.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import matplotlib.pyplot as mp

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = np.zeros(len(flat_x), dtype=int)
# hand-written rule: class 1 wherever feature2 exceeds feature1
flat_y[flat_x[:, 0] < flat_x[:, 1]] = 1
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Simple Classification', facecolor='lightgray')
mp.title('Simple Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
IX. Logistic Classification
- Prediction function: y = w0 + w1*x1 + w2*x2
continuous prediction values -> discrete prediction values
(-oo, +oo) -> {0, 1}
Logistic function: sigmoid(y) = 1 / (1 + e^-y)
Nonlinearization:
y = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))
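A quick numeric check that the sigmoid squashes any real y into (0, 1); a throwaway sketch, not part of the course code:

import numpy as np

def sigmoid(y):
    # maps (-oo, +oo) into the open interval (0, 1)
    return 1 / (1 + np.exp(-y))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# [~0.000045, 0.269, 0.5, 0.731, ~0.99995]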
x1  x2      P(class=1)  prediction
3   1   ->  0.2         0
2   5   ->  0.8         1
1   8   ->  0.7         1
6   4   ->  0.3         0
...
Treat the prediction function's output as the probability that the input belongs to class 1, and choose the class with the larger probability as the prediction.
Code: log2.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
# logistic classifier
model = lm.LogisticRegression(
    solver='liblinear', C=1)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Logistic Classification', facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
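To see the class probabilities the note above refers to, query the fitted model with predict_proba (standard scikit-learn API; the shape of the output, not the exact numbers, is the point here):

import numpy as np
import sklearn.linear_model as lm

x = np.array([[3, 1], [2, 5], [1, 8], [6, 4],
              [5, 2], [3, 5], [4, 7], [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
model = lm.LogisticRegression(solver='liblinear', C=1)
model.fit(x, y)
# one row per sample, one column per class; predict() returns
# the class whose column holds the larger probability
print(model.predict_proba(x))
print(model.predict(x))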
Multiclass classification
A chain of binary decisions:
walk / don't walk
  -> ride a bike / don't ride
    -> take a bus / don't take one
      -> drive / don't drive
        -> ...
Equivalently, train one binary classifier per class (one-vs-rest), each on one-hot labels; every sample then gets one probability per class, and the class with the largest probability wins:

         A-vs-rest     B-vs-rest     C-vs-rest
sample   label  prob   label  prob   label  prob   prediction
... -> A   1    0.7      0    0.2      0    0.1    A
... -> B   0    0.1      1    0.8      0    0.4    B
... -> C   0    0.3      0    0.3      1    0.9    C
Code: log3.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp

x = np.array([
    [4, 7],
    [3.5, 8],
    [3.1, 6.2],
    [0.5, 1],
    [1, 2],
    [1.2, 1.9],
    [6, 2],
    [5.7, 1.5],
    [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# logistic classifier
model = lm.LogisticRegression(
    solver='liblinear', C=1000)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Logistic Classification', facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
X. Naive Bayes Classification
- Algorithm principle
Example: among 1000 people you pass, 10 are beauties, and 1 of those loves you:
P(meet a beauty) = 10/1000 = 0.01
P(loved by her | met a beauty) = 1/10 = 0.1
Bayes' theorem:
            P(A) P(B|A)
P(A|B) = ---------------
                P(B)
i.e. P(A|B) P(B) = P(B|A) P(A), which is just P(A,B) = P(B,A).
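A quick worked check with the numbers above, where the joint event is "meet a beauty and be loved by her":

P(\text{meet},\,\text{loved}) = P(\text{meet})\,P(\text{loved}\mid\text{meet}) = 0.01 \times 0.1 = 0.001

that is, 1 in 1000, matching the raw counts.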
For a sample such as (3, 1) -> class 0, the class decision needs P(C|x):
            P(C) P(x|C)
P(C|x) = ---------------
                P(x)
P(x) is the same for every class, so comparing classes only requires the numerator:
P(C) P(x|C) = P(C, x)
            = P(C, x1, x2)
            = P(x1, x2, C)
            = P(x1 | x2, C) P(x2, C)
            = P(x1 | x2, C) P(x2 | C) P(C)
"Naive" = the conditional independence assumption, under which P(x1 | x2, C) = P(x1 | C), giving
            = P(x1 | C) P(x2 | C) P(C)
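A minimal sketch of this scoring rule with Gaussian likelihoods (the same family GaussianNB below uses, via scipy's normal pdf); the toy data are illustrative, not from the notes:

import numpy as np
from scipy.stats import norm

# toy training data: two features, two classes
x = np.array([[3.0, 1.0], [2.8, 1.2], [2.0, 5.0], [1.8, 4.6]])
y = np.array([0, 0, 1, 1])

def score(sample, c):
    """P(C) * P(x1|C) * P(x2|C), with per-class Gaussian likelihoods."""
    xc = x[y == c]
    prior = len(xc) / len(x)
    likelihood = np.prod(norm.pdf(sample, xc.mean(axis=0), xc.std(axis=0)))
    return prior * likelihood

sample = [2.9, 1.1]
print(max((0, 1), key=lambda c: score(sample, c)))  # -> 0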
Code: nb.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp

x, y = [], []
with open('../../data/multiple1.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
# naive Bayes classifier
model = nb.GaussianNB()
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_y = model.predict(x)
# training-set accuracy
print((pred_y == y).sum() / pred_y.size)
mp.figure('Naive Bayes Classification', facecolor='lightgray')
mp.title('Naive Bayes Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()