VII. Decision Trees
1. Basic Principle
Similar inputs lead to similar outputs.
Age: young-1, middle-aged-2, elderly-3
Education: associate-1, bachelor-2, master-3, doctorate-4
Experience: little-1, average-2, rich-3, senior-4
Gender: male-1, female-2
Salary: 1-low, 2-medium, 3-high, 4-very high
Age  Education  Experience  Gender  ->  Salary   Class
1    1          1           2           5000     1
1    2          2           1           8000     2
2    3          3           2           10000    3
3    4          4           1           30000    4
...
------------------------------------------
1    2          2           1           ?
Regression: average the outputs; classification: vote among them. In either case the results can be weighted by how similar the features are.
As the sub-tables are split further, their information entropy keeps decreasing (each split yields an information gain), and the data become more and more ordered.
[Splitting sketch: split the original table by the first feature's values (1/2/3) into sub-tables, then split each sub-table by the second feature's values (1/2/3/4), and so on, feature by feature, down to leaf-level sub-tables.]
Take each column of the original sample matrix in turn and split the table into sub-tables whose rows share that feature value; in a leaf-level sub-table all feature values agree. For an input with unknown output, route it by the same rules into some leaf-level sub-table, then compute the predicted output from that sub-table's samples, by averaging (regression) or by voting (classification).
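As a minimal sketch of this idea only (exact feature matches, no actual tree structure; the grouping below is illustrative, not the library algorithm), using the salary table above:

import numpy as np

# training samples: age, education, experience, gender -> salary
X = np.array([[1, 1, 1, 2],
              [1, 2, 2, 1],
              [2, 3, 3, 2],
              [3, 4, 4, 1]])
y = np.array([5000, 8000, 10000, 30000])

def predict(query):
    """Route the query to the sub-table of rows with matching features and
    average their outputs (regression); fall back to the global mean when
    no row matches."""
    mask = (X == query).all(axis=1)
    return y[mask].mean() if mask.any() else y.mean()

print(predict([1, 2, 2, 1]))  # -> 8000.0, from the row with identical features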
2. Engineering Optimizations
- Use the reduction in information entropy to measure each feature's influence on the prediction: the larger the entropy reduction a feature brings, the greater its influence (see the entropy sketch after this list).
- Using the influences computed above, choose the feature to split sub-tables on in descending order of influence, i.e. always prefer the most influential remaining feature.
- End the splitting of sub-tables early, according to conditions given in advance, to simplify the tree's structure and shorten construction and search time, improving model performance at little cost in prediction accuracy.
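A minimal sketch of entropy-based feature scoring under this framing, assuming discrete features; the information gain of a split is the parent entropy minus the weighted entropy of the child sub-tables (the toy columns are illustrative):

import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Entropy reduction achieved by splitting on one discrete feature column."""
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature):
        sub = [l for f, l in zip(feature, labels) if f == v]
        gain -= len(sub) / n * entropy(sub)
    return gain

# toy check: the education column vs. the salary class from the table above
education = [1, 2, 3, 4]
salary_cls = [1, 2, 3, 4]
print(information_gain(education, salary_cls))  # 2.0 bits: a perfect split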
3. Ensemble Algorithms
- An ensemble algorithm, also called a weak-learner ensemble, rests on one core idea: combine the conclusions of several different learners, by averaging or by voting, into one comparatively reliable prediction. The chosen weak learners should be sufficiently diverse, in algorithm or in data, so that they embody different tendencies; only then does the combined conclusion generalize better.
- A decision-tree-based ensemble builds, by some rule, multiple mutually different decision trees, lets each give its own prediction for the unknown sample, and finally averages or votes to reach a combined conclusion.
- Depending on the rule used to build the multiple trees, decision-tree-based ensembles divide into the following kinds:
A. Bootstrap aggregating (bagging): randomly draw part of the samples from the original training set, with or without replacement, and build one decision tree; repeat to obtain several trees. This weakens the influence of individual dominant samples on the prediction and improves model accuracy.
B. Random forest: if, on top of bagging, each tree is built from not only randomly chosen samples (rows) but also randomly chosen features (columns), the ensemble is called a random forest.
C. Boosting (AdaBoost): first assign equal weights to the training samples and build the first tree; use it to predict the training samples and raise the weights of the mispredicted ones; build the next tree on the reweighted samples, and so on, yielding several trees, each trained with different per-sample weights.
Code: house.py
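house.py itself is not reproduced in these notes. A plausible minimal version, judging from fi.py below, would compare a single decision tree with an AdaBoost ensemble on the Boston housing data; everything beyond the parameter values seen in fi.py is an assumption here (note also that load_boston has been removed from recent scikit-learn releases):

import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import sklearn.metrics as sm

boston = sd.load_boston()  # as in the original fi.py; removed in newer sklearn
x, y = su.shuffle(boston.data, boston.target, random_state=7)
train_size = int(len(x) * 0.8)
train_x, test_x = x[:train_size], x[train_size:]
train_y, test_y = y[:train_size], y[train_size:]

# single decision tree regressor
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
print(sm.r2_score(test_y, model.predict(test_x)))

# AdaBoost ensemble built on the same base tree
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400, random_state=7)
model.fit(train_x, train_y)
print(sm.r2_score(test_y, model.predict(test_x)))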
4. Feature Importance
- When deciding which feature to split sub-tables on first, a decision tree model relies on the maximum-entropy-reduction principle. As a by-product of training, it therefore yields each feature's influence on the output, i.e. the feature importance: feature_importances_. This output depends on the model's algorithm.
Code: fi.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import matplotlib.pyplot as mp

boston = sd.load_boston()
feature_names = boston.feature_names
x, y = su.shuffle(boston.data, boston.target, random_state=7)
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
# feature importances from the decision tree regressor
fi_dt = model.feature_importances_
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400, random_state=7)
model.fit(train_x, train_y)
# feature importances from the AdaBoost (boosted decision tree) regressor
fi_ab = model.feature_importances_
mp.figure('Feature Importance', facecolor='lightgray')
mp.subplot(211)
mp.title('Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dt.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dt[sorted_indices], facecolor='deepskyblue',
       edgecolor='steelblue')
mp.xticks(pos, feature_names[sorted_indices], rotation=30)
mp.subplot(212)
mp.title('AdaBoost Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_ab.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_ab[sorted_indices], facecolor='lightcoral',
       edgecolor='indianred')
mp.xticks(pos, feature_names[sorted_indices], rotation=30)
mp.tight_layout()
mp.show()
A learning model's computed feature importances depend not only on the chosen algorithm but also on the granularity at which the data were collected.
Code: bike.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import csv
import numpy as np
import sklearn.utils as su
import sklearn.ensemble as se
import sklearn.metrics as sm
import matplotlib.pyplot as mp

with open('../../data/bike_day.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_dy = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
# feature importances from the "day" data set
fi_dy = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
with open('../../data/bike_hour.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_hr = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# random forest regressor
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
# feature importances from the "hour" data set
fi_hr = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
mp.figure('Bike', facecolor='lightgray')
mp.subplot(211)
mp.title('Day', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dy.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dy[sorted_indices], facecolor='deepskyblue',
       edgecolor='steelblue')
mp.xticks(pos, fn_dy[sorted_indices], rotation=30)
mp.subplot(212)
mp.title('Hour', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_hr.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_hr[sorted_indices], facecolor='lightcoral',
       edgecolor='indianred')
mp.xticks(pos, fn_hr[sorted_indices], rotation=30)
mp.tight_layout()
mp.show()
VIII. Manual Classification
Input                 Output
feature1  feature2
3         1           0
2         5           1
1         8           1
6         4           0
5         2           0
3         5           1
4         7           1
4         -1          0
------------------
6         8           1
5         1           0
(Hand-written rule: class 1 whenever feature2 > feature1; the rows below the line are test samples.)
Code: simple.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import matplotlib.pyplot as mp

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = np.zeros(len(flat_x), dtype=int)
# hand-written rule: class 1 wherever feature2 exceeds feature1
flat_y[flat_x[:, 0] < flat_x[:, 1]] = 1
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Simple Classification', facecolor='lightgray')
mp.title('Simple Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
IX. Logistic Classification
- Prediction function: y = w0 + w1*x1 + w2*x2
continuous prediction values -> discrete prediction values
(-oo, +oo) -> {0, 1}
Logistic function: sigmoid(y) = 1 / (1 + e^-y)
Nonlinearization:
y = 1 / (1 + e^-(w0 + w1*x1 + w2*x2))
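A quick numeric check that the sigmoid squashes any real y into (0, 1); a throwaway sketch, not part of the course code:

import numpy as np

def sigmoid(y):
    # maps (-oo, +oo) into the open interval (0, 1)
    return 1 / (1 + np.exp(-y))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# [~0.000045, 0.269, 0.5, 0.731, ~0.99995]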
x1  x2      P(class=1)  prediction
3   1   ->  0.2         0
2   5   ->  0.8         1
1   8   ->  0.7         1
6   4   ->  0.3         0
...
Treat the prediction function's output as the probability that the input belongs to class 1, and choose the class with the larger probability as the prediction.
Code: log2.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp

x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
# logistic classifier
model = lm.LogisticRegression(
    solver='liblinear', C=1)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Logistic Classification', facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
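To see the class probabilities the note above refers to, query the fitted model with predict_proba (standard scikit-learn API; the shape of the output, not the exact numbers, is the point here):

import numpy as np
import sklearn.linear_model as lm

x = np.array([[3, 1], [2, 5], [1, 8], [6, 4],
              [5, 2], [3, 5], [4, 7], [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
model = lm.LogisticRegression(solver='liblinear', C=1)
model.fit(x, y)
# one row per sample, one column per class; predict() returns
# the class whose column holds the larger probability
print(model.predict_proba(x))
print(model.predict(x))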
Multiclass classification
A chain of binary decisions:
walk / don't walk
  -> ride a bike / don't ride
    -> take a bus / don't take one
      -> drive / don't drive
        -> ...
Equivalently, train one binary classifier per class (one-vs-rest), each on one-hot labels; every sample then gets one probability per class, and the class with the largest probability wins:

         A-vs-rest     B-vs-rest     C-vs-rest
sample   label  prob   label  prob   label  prob   prediction
... -> A   1    0.7      0    0.2      0    0.1    A
... -> B   0    0.1      1    0.8      0    0.4    B
... -> C   0    0.3      0    0.3      1    0.9    C
Code: log3.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp

x = np.array([
    [4, 7],
    [3.5, 8],
    [3.1, 6.2],
    [0.5, 1],
    [1, 2],
    [1.2, 1.9],
    [6, 2],
    [5.7, 1.5],
    [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# logistic classifier
model = lm.LogisticRegression(
    solver='liblinear', C=1000)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Logistic Classification', facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()
X. Naive Bayes Classification
- Algorithm principle
Example: among 1000 people you pass, 10 are beauties, and 1 of those loves you:
P(meet a beauty) = 10/1000 = 0.01
P(loved by her | met a beauty) = 1/10 = 0.1
Bayes' theorem:
            P(A) P(B|A)
P(A|B) = ---------------
                P(B)
i.e. P(A|B) P(B) = P(B|A) P(A), which is just P(A,B) = P(B,A).
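A quick worked check with the numbers above, where the joint event is "meet a beauty and be loved by her":

P(\text{meet},\,\text{loved}) = P(\text{meet})\,P(\text{loved}\mid\text{meet}) = 0.01 \times 0.1 = 0.001

that is, 1 in 1000, matching the raw counts.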
For a sample such as (3, 1) -> class 0, the class decision needs P(C|x):
            P(C) P(x|C)
P(C|x) = ---------------
                P(x)
P(x) is the same for every class, so comparing classes only requires the numerator:
P(C) P(x|C) = P(C, x)
            = P(C, x1, x2)
            = P(x1, x2, C)
            = P(x1 | x2, C) P(x2, C)
            = P(x1 | x2, C) P(x2 | C) P(C)
"Naive" = the conditional independence assumption, under which P(x1 | x2, C) = P(x1 | C), giving
            = P(x1 | C) P(x2 | C) P(C)
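A minimal sketch of this scoring rule with Gaussian likelihoods (the same family GaussianNB below uses, via scipy's normal pdf); the toy data are illustrative, not from the notes:

import numpy as np
from scipy.stats import norm

# toy training data: two features, two classes
x = np.array([[3.0, 1.0], [2.8, 1.2], [2.0, 5.0], [1.8, 4.6]])
y = np.array([0, 0, 1, 1])

def score(sample, c):
    """P(C) * P(x1|C) * P(x2|C), with per-class Gaussian likelihoods."""
    xc = x[y == c]
    prior = len(xc) / len(x)
    likelihood = np.prod(norm.pdf(sample, xc.mean(axis=0), xc.std(axis=0)))
    return prior * likelihood

sample = [2.9, 1.1]
print(max((0, 1), key=lambda c: score(sample, c)))  # -> 0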
Code: nb.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp

x, y = [], []
with open('../../data/multiple1.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
# naive Bayes classifier
model = nb.GaussianNB()
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_y = model.predict(x)
# training-set accuracy
print((pred_y == y).sum() / pred_y.size)
mp.figure('Naive Bayes Classification', facecolor='lightgray')
mp.title('Naive Bayes Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()