机器学习day02

七.决策树

1.基本原理
相似的输入导致相似的输出。
年龄：青年-1，中年-2，老年-3
学历：专科-1，本科-2，硕士-3，博士-4
经验：缺乏-1，一般-2，丰富-3，资深-4
性别：男-1，女-2
薪资：1-低，2-中，3-高，4-超高
年龄学历工作经验性别 -> 薪资
1 1 1 2 5000 1
1 2 2 1 8000 2
2 3 3 2 10000 3
3 4 4 1 30000 4
...
------------------------------------------
1 2 2 1 ?
回归——平均 \ 结合特征的相
分类——投票 / 似程度做加权
随着子表的划分，信息熵越来越小，信息量越来越大，
数据越来越有序。
11123 2... 3...
12211 2... 3...
11221 2... 3...

11... 12... 13... 14...
11... 12... 13... 14...
11... 12... 13... 14...
依次选择原始样本矩阵中的每一列，构建相应特征值相同的若干子表树，在叶级子表中所有特征值都是相同的，对于未知输出的输入，按照同样的规则，归属到某个叶级子表，将该子表中各样本的输出按照取平均(回归)或者取投票(分类)的方法，计算预测输出。
2.工程优化

根据信息熵的减少量计算每个特征对预测结果的影响程度，信息熵减少量越大的特征对预测结果的影响也越大。
根据上一步计算出的影响程度，按照从大到小的顺序，选择划分子表的特征依据，即优先选择影响程度最大的特征。
根据事先给定的条件，提前结束子表的划分过程，借以简化决策树的结构，缩短构建和搜索的时间，在预测精度牺牲不大的前提下，提高模型性能。

3.集合算法

所谓集合算法,亦称集合弱学习方法,其核心思想就是,通过平均或者投票,将多个不同学习方法的结论加以综合,给出一个相对可靠预测结果,所选择弱学习方法,在算法或数据上应该具备足够分散性,以体现相对不同的倾向性,这样得出的综合结论才能够更加泛化.
基于决策树的集合算法,就是按照某种规则,构建多棵彼此不同的决策树模型,分别给出针对未知样本的预测结果,最后通过平均或投票得到相对综合的结论.
根据构建多棵决策树所依据的规则不同,基于决策树的集合算法可被分为以下几种:
A:从原始训练样本中，以有放回或无放回抽样的方式，随机选取部分样本，构建一棵决策树，重复以上过程，得到若干棵决策输，以此弱化某些强势样本对预测结果的影响力，提高模型精度。
B:如果在自助聚合的基础上，每次构建决策树时，不但随机选择样本(行)，而且其特征(列)也是随机选择的，则称为随机森林。
C:正向激励：首先为训练样本分配相等的权重，构建第一棵决策树，用该决策树对训练样本进行预测，为预测错误的样本提升权重，再次构建下一棵决策树，以此类推，得到针对每个样本拥有不同权重的多棵决策树。
代码：house.py

4.特征重要性

决策树模型,在确定划分子表优先选择特征的过程中,需要根据最大未写减原则,确定划分子表的依据,因此,作为学习模型的副产品,可以得到每个特征对于输出的影响力度,即特征重要性:feature_importances_,该输出与模型算法有关
代码:fi.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.datasets as sd
import sklearn.utils as su
import sklearn.tree as st
import sklearn.ensemble as se
import matplotlib.pyplot as mp
boston = sd.load_boston()
feature_names = boston.feature_names
x, y = su.shuffle(boston.data, boston.target,
                  random_state=7)
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
# 决策树回归器给出的特征重要性
fi_dt = model.feature_importances_
model = se.AdaBoostRegressor(
    st.DecisionTreeRegressor(max_depth=4),
    n_estimators=400, random_state=7)
model.fit(train_x, train_y)
# 基于决策树的正向激励回归器给出的特征重要性
fi_ab = model.feature_importances_
mp.figure('Feature Importance', facecolor='lightgray')
mp.subplot(211)
mp.title('Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dt.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dt[sorted_indices],
       facecolor='deepskyblue', edgecolor='steelblue')
mp.xticks(pos, feature_names[sorted_indices],
          rotation=30)
mp.subplot(212)
mp.title('AdaBoost Decision Tree', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_ab.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_ab[sorted_indices],
       facecolor='lightcoral', edgecolor='indianred')
mp.xticks(pos, feature_names[sorted_indices],
          rotation=30)
mp.tight_layout()
mp.show()

学习模型关于特征重要性的计算，除了与选择的算法有关以外，还与数据的采集粒度有关。
代码：bike.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import csv
import numpy as np
import sklearn.utils as su
import sklearn.ensemble as se
import sklearn.metrics as sm
import matplotlib.pyplot as mp
with open('../../data/bike_day.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_dy = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# 随机森林回归器
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
# 基于“天”数据集的特征重要性
fi_dy = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
with open('../../data/bike_hour.csv', 'r') as f:
    reader = csv.reader(f)
    x, y = [], []
    for row in reader:
        x.append(row[2:13])
        y.append(row[-1])
fn_hr = np.array(x[0])
x = np.array(x[1:], dtype=float)
y = np.array(y[1:], dtype=float)
x, y = su.shuffle(x, y, random_state=7)
train_size = int(len(x) * 0.9)
train_x, test_x, train_y, test_y = \
    x[:train_size], x[train_size:], \
    y[:train_size], y[train_size:]
# 随机森林回归器
model = se.RandomForestRegressor(
    max_depth=10, n_estimators=1000,
    min_samples_split=2)
model.fit(train_x, train_y)
# 基于“小时”数据集的特征重要性
fi_hr = model.feature_importances_
pred_test_y = model.predict(test_x)
print(sm.r2_score(test_y, pred_test_y))
mp.figure('Bike', facecolor='lightgray')
mp.subplot(211)
mp.title('Day', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_dy.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_dy[sorted_indices],
       facecolor='deepskyblue', edgecolor='steelblue')
mp.xticks(pos, fn_dy[sorted_indices],
          rotation=30)
mp.subplot(212)
mp.title('Hour', fontsize=16)
mp.ylabel('Importance', fontsize=12)
mp.tick_params(labelsize=10)
mp.grid(axis='y', linestyle=':')
sorted_indices = fi_hr.argsort()[::-1]
pos = np.arange(sorted_indices.size)
mp.bar(pos, fi_hr[sorted_indices],
       facecolor='lightcoral', edgecolor='indianred')
mp.xticks(pos, fn_hr[sorted_indices],
          rotation=30)
mp.tight_layout()
mp.show()

八.人工分类

输入输出
-------------- -----
特征1 特征2
3 1 0
2 5 1
1 8 1
6 4 0
5 2 0
3 5 1
4 7 1
4 -1 0
------------------
6 8 1
5 1 0
代码：simple.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import matplotlib.pyplot as mp
x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = np.zeros(len(flat_x), dtype=int)
flat_y[flat_x[:, 0] < flat_x[:, 1]] = 1
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Simple Classification',
          facecolor='lightgray')
mp.title('Simple Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()

九.逻辑分类

y = w0+w1x1+w2x2
连续的预测值->离散的预测值
[-oo, +oo]->{0, 1} 1
逻辑函数：sigmoid = ---------
1+e^-y
非线性化
1
y = -----------------------------
1+e^-(w0+w1x1+w2x2)
3 1 -> 0.2 0
2 5 -> 0.8 1
1 8 -> 0.7 1
6 4 -> 0.3 0
...
将预测函数的输出看做输入被划分为1类的概率，择概率大的类别作为预测结果。
代码：log2.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp
x = np.array([
    [3, 1],
    [2, 5],
    [1, 8],
    [6, 4],
    [5, 2],
    [3, 5],
    [4, 7],
    [4, -1]])
y = np.array([0, 1, 1, 0, 0, 1, 1, 0])
# 逻辑分类器
model = lm.LogisticRegression(
    solver='liblinear', C=1)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
mp.figure('Logistic Classification',
          facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()

多元分类
/\
走不走
/\
骑车不骑车
/\
坐车不坐车
/\
开车不开车
...
A B C
... -> A 1 0.7 0 0.2 0 0.1 -> A
... -> B 0 0.1 1 0.8 0 0.4 -> B
... -> C 0 0.3 0 0.3 1 0.9 -> C
代码：log3.py

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.linear_model as lm
import matplotlib.pyplot as mp
x = np.array([
    [4, 7],
    [3.5, 8],
    [3.1, 6.2],
    [0.5, 1],
    [1, 2],
    [1.2, 1.9],
    [6, 2],
    [5.7, 1.5],
    [5.4, 2.2]])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# 逻辑分类器
model = lm.LogisticRegression(
    solver='liblinear', C=1000)
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)

mp.figure('Logistic Classification',
          facecolor='lightgray')
mp.title('Logistic Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()

十.朴素贝叶斯分类

算法原理
1000

10
1
P(遇到美女)=10/1000=0.01
P(被美女爱)=1/10=0.1

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import numpy as np
import sklearn.naive_bayes as nb
import matplotlib.pyplot as mp
x, y = [], []
with open('../../data/multiple1.txt', 'r') as f:
    for line in f.readlines():
        data = [float(substr) for substr
                in line.split(',')]
        x.append(data[:-1])
        y.append(data[-1])
x = np.array(x)
y = np.array(y, dtype=int)
# 朴素贝叶斯分类器
model = nb.GaussianNB()
model.fit(x, y)
l, r, h = x[:, 0].min() - 1, x[:, 0].max() + 1, 0.005
b, t, v = x[:, 1].min() - 1, x[:, 1].max() + 1, 0.005
grid_x = np.meshgrid(np.arange(l, r, h),
                     np.arange(b, t, v))
flat_x = np.c_[grid_x[0].ravel(), grid_x[1].ravel()]
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_y = model.predict(x)
print((pred_y == y).sum() / pred_y.size)
mp.figure('Naive Bayes Classification',
          facecolor='lightgray')
mp.title('Naive Bayes Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y,
              cmap='gray')
mp.scatter(x[:, 0], x[:, 1], c=y, cmap='brg', s=80)
mp.show()

Wireshark 安装+使用（一）

開源堡壘機jumpserver的搭建與使用

源碼搭建svn

機器學習day01

機器學習day04

機器學習day02

Mac下配置sublime實現LaTeX

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結