2. [Python] Machine Learning: Supervised Learning

Keywords

Classification

Regression

Generalization

Overfitting

Underfitting

2.1 Classification and Regression

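Supervised machine learning problems fall into two types: classification and regression.

Classification: the goal is to predict a class label, chosen from a predefined list of possibilities. Classification problems are usually divided into binary classification and multiclass classification.

In binary classification, one of the two classes is called the positive class and the other the negative class.

Regression: the goal is to predict a continuous value. An easy way to tell classification from regression is to ask whether the output has some kind of continuity.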
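As a small illustration of the distinction, scikit-learn can report what kind of target a label vector looks like through `type_of_target`. The three label vectors below are hypothetical values made up for this sketch.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, invented only to illustrate the three cases.
binary_labels = [0, 1, 1, 0, 1]        # two classes: binary classification
multiclass_labels = [0, 2, 1, 2, 0]    # three classes: multiclass classification
continuous_targets = [0.5, 1.7, 2.25]  # real values: regression

kinds = [type_of_target(t)
         for t in (binary_labels, multiclass_labels, continuous_targets)]
print(kinds)  # ['binary', 'multiclass', 'continuous']
```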

2.2 Generalization, Overfitting, and Underfitting

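Generalization: I think of it as a kind of extension. If a model can make accurate predictions on new, unseen data, we say it generalizes from the training set to the test set.
Overfitting: when building and testing a model, if the model performs very well on the training set but cannot generalize to new data, it is overfitting.
Underfitting: the opposite of overfitting. The model performs poorly even on the training set, so it cannot generalize to new data either.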
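The two failure modes can be seen by comparing training and test scores. The sketch below is only an illustration under assumptions: it uses synthetic noisy sine data (not from this post) and the k-nearest-neighbors regressor introduced in section 2.3. With k=1 the model memorizes the training set, and with k equal to the training-set size it predicts the training mean everywhere.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic noisy sine data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 5, len(X_train)):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    # score returns R^2; store (training score, test score) for each k
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print("k={:2d}  train R^2={:.2f}  test R^2={:.2f}".format(k, *scores[k]))
```

A perfect training score paired with a lower test score is the signature of overfitting. A training score near zero means the model is too simple to capture the pattern, which is underfitting.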

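Relationship between model complexity and dataset size: the more variety the data points in a dataset cover, the more complex a model can be without overfitting.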

2.3 Supervised Learning Algorithms

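Key points

Explain how each algorithm makes predictions

How model complexity changes

Outline how each algorithm builds a model

Strengths and weaknesses of each algorithm

What kind of data each algorithm suits best

Explain the meaning of the most important parameters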

Classification dataset
The example below uses the built-in forge dataset to illustrate binary classification.

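import mglearn
import matplotlib.pyplot as plt
import numpy as np

# Generate the built-in forge dataset: X holds the two features, y the class labels.
X, y = mglearn.datasets.make_forge()
# Scatter plot of the first feature against the second, with one marker style per class.
mglearn.discrete_scatter(X[:, 0], X[:, 1], y)
print("X shape:{}".format(X.shape))
plt.xlabel("First Feature")
plt.ylabel("Second Feature")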

X shape:(26, 2)



As X.shape shows, the dataset consists of 26 data points with 2 features each.

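Regression dataset
To illustrate regression we use the synthetic wave dataset. It has a single input feature and a continuous target variable (or response), which is what the model tries to predict.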

import mglearn
import matplotlib.pyplot as plt

X, y = mglearn.datasets.make_wave(n_samples=40)
# Call plt.plot here. Calling the pyplot module itself, as in
# matplotlib.pyplot(X, y, 'o'), raises "TypeError: 'module' object is not callable".
plt.plot(X, y, 'o')
plt.ylim(-3, 3)

import matplotlib
print("{}".format(matplotlib.__version__))

3.0.2

2.3.1 k-Nearest Neighbors

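The k-NN algorithm is the simplest one: building the model consists only of storing the training dataset.
In its simplest and most intuitive form, we consider a single nearest neighbor, namely the training data point closest to the point we want to predict. The prediction is then the known output of that training point.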
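The single-neighbor rule fits in a few lines of NumPy. This is a minimal sketch, not scikit-learn's implementation: the helper `predict_1nn` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def predict_1nn(X_train, y_train, x_new):
    """Return the label of the training point closest to x_new (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    return y_train[np.argmin(distances)]

# Hypothetical toy data: two features, two classes.
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 1, 1])

label_a = predict_1nn(X_train, y_train, np.array([1.5, 1.0]))  # closest to [1.0, 1.0]
label_b = predict_1nn(X_train, y_train, np.array([8.2, 8.9]))  # closest to [8.0, 9.0]
print(label_a, label_b)  # 0 1
```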

# n_neighbors is the number of nearest neighbors used for each prediction
mglearn.plots.plot_knn_classification(n_neighbors=4)


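Instead of only the single closest neighbor, we can consider an arbitrary number k of neighbors, which is where the name k-nearest neighbors comes from. With more than one neighbor, we use voting to assign a label: for each test point, we count how many neighbors belong to class 0 and how many belong to class 1, then predict the more frequent class.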
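The voting step can be sketched in plain Python. The function `predict_knn` and the toy points are hypothetical, Euclidean distance is assumed, and `math.dist` requires Python 3.8 or later. The same query point gets a different label once enough far-away neighbors join the vote.

```python
import math
from collections import Counter

def predict_knn(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the labels of the k closest points.
    votes = Counter(train_labels[i] for i in order[:k])
    # most_common(1) returns [(label, count)] for the most frequent label.
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two points of class 0 and three of class 1.
train_points = [(1.0, 1.0), (1.5, 2.0), (3.5, 3.0), (6.0, 6.0), (7.0, 6.5)]
train_labels = [0, 0, 1, 1, 1]
query = (2.0, 2.0)

pred_k1 = predict_knn(train_points, train_labels, query, k=1)
pred_k3 = predict_knn(train_points, train_labels, query, k=3)
pred_k5 = predict_knn(train_points, train_labels, query, k=5)
print(pred_k1, pred_k3, pred_k5)  # 0 0 1
```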

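from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the data from mglearn
# Split it into training and test sets (train_test_split defaults to 75% / 25%)
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Instantiate the classifier with three neighbors
clf = KNeighborsClassifier(n_neighbors=3)
# Fit the classifier on the training set. For KNeighborsClassifier this just stores the dataset so that distances to neighbors can be computed at prediction time
clf.fit(X_train, y_train)
# Call predict on the test data. For each test point, it finds the nearest neighbors in the training set and returns the most common class among them
print("Test set prediction:{}".format(clf.predict(X_test)))
# Evaluate how well the model generalizes: score returns the mean accuracy on the test set
print("Test set accuracy:{:.2f}".format(clf.score(X_test, y_test)))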

Test set prediction:[1 0 1 0 1 0 0]
Test set accuracy:0.86
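For a classifier, `score` returns the mean accuracy, the fraction of test points whose predicted label matches the true label. With 7 test points, an accuracy of 0.86 corresponds to 6 correct predictions. The sketch below reproduces that arithmetic. The predictions are the ones printed above, but the true labels are hypothetical, chosen only so that exactly one prediction is wrong.

```python
# Predictions as printed above; true_labels is an assumption for illustration,
# not the actual test labels of the forge split.
predictions = [1, 0, 1, 0, 1, 0, 0]
true_labels = [1, 0, 1, 0, 1, 0, 1]

correct = sum(p == t for p, t in zip(predictions, true_labels))
accuracy = correct / len(true_labels)
print("{}/{} correct, accuracy: {:.2f}".format(correct, len(true_labels), accuracy))
# 6/7 correct, accuracy: 0.86
```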



2.3.2 Analyzing KNeighborsClassifier

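For a two-dimensional dataset, we can show the prediction for every possible test point in the xy-plane. Coloring the plane by the class assigned to each point lets us see the decision boundary.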
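The idea behind such a plot can be sketched without mglearn: predict the class of every point on a dense grid and look at where the predicted label changes. This is only a sketch under assumptions. It uses `make_blobs` as a stand-in for the forge data, and the grid resolution and margin are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Stand-in 2D dataset (assumption: make_blobs instead of mglearn's forge data).
X, y = make_blobs(n_samples=40, centers=2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Build a grid covering the data, with a small margin around it.
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

# Predict the class of every grid point, then restore the grid shape.
grid_pred = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
print(grid_pred.shape)  # (100, 100)
```

Passing `xx`, `yy`, and `grid_pred` to matplotlib's `contourf` would shade the two predicted regions. The decision boundary is where the shading changes.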

import matplotlib.pyplot as plt

# Visualize the decision boundary for 1, 3, and 9 neighbors.
fig, axes = plt.subplots(1, 3, figsize=(10, 3))

for n_neighbors, ax in zip([1, 3, 9], axes):
    # fit returns the estimator itself, so we can instantiate and fit in one line
    clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
    mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
    mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
    ax.set_title("{} neighbor(s)".format(n_neighbors))
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")

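As the figure shows, the larger n_neighbors is, the smoother the decision boundary becomes. A small number of neighbors corresponds to high model complexity, and a large number of neighbors corresponds to low model complexity.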

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
# stratify keeps the class proportions the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=66)

training_accuracy = []
test_accuracy = []
# Try n_neighbors from 1 to 10
n_s = range(1, 11)
for n_neighbors in n_s:
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    # Record accuracy on the training set and on the test set
    training_accuracy.append(clf.score(X_train, y_train))
    test_accuracy.append(clf.score(X_test, y_test))

plt.plot(n_s, training_accuracy, label="training")
plt.plot(n_s, test_accuracy, label="test", linestyle='--', color='g')
plt.xlabel("n_neighbors")
plt.ylabel("Accuracy")
plt.legend()


2.3.3 k-Nearest Neighbors Regression

This section uses the wave dataset.

mglearn.plots.plot_knn_regression(n_neighbors=3)

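from sklearn.neighbors import KNeighborsRegressor

X, y = mglearn.datasets.make_wave(n_samples=40)
# Split the wave dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Instantiate the regressor with three neighbors and fit it on the training set
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)
print("pre:{}".format(reg.predict(X_test)))
# For regressors, score returns the R^2 (coefficient of determination)
print("score:{:.2f}".format(reg.score(X_test, y_test)))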

pre:[-0.05396539  0.35686046  1.13671923 -1.89415682 -1.13881398 -1.63113382
  0.35686046  0.91241374 -0.44680446 -1.13881398]
score:0.83
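With its default uniform weights, `KNeighborsRegressor` predicts the mean of the target values of the k nearest training points. The sketch below checks a hand-written version against the library on synthetic one-feature data. The data and the helper `knn_regress` are assumptions for illustration, not the wave dataset or scikit-learn internals.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic one-feature data (an assumption; not the wave dataset itself).
rng = np.random.RandomState(42)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.2, size=30)
X_new = np.array([[-1.0], [0.5], [2.0]])

def knn_regress(X_train, y_train, X_new, k):
    """Predict each query as the mean target of its k nearest training points."""
    preds = []
    for x in X_new:
        dist = np.abs(X_train[:, 0] - x[0])  # one feature, so distance is |difference|
        nearest = np.argsort(dist)[:k]       # indices of the k closest points
        preds.append(y_train[nearest].mean())
    return np.array(preds)

manual = knn_regress(X_train, y_train, X_new, k=3)
library = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train).predict(X_new)
print(np.allclose(manual, library))
```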

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(10, 3))
# Create 1,000 data points, evenly spaced between -3 and 3
line = np.linspace(-3, 3, 1000).reshape(-1, 1)
for n_neighbors, ax in zip([1, 3, 9], axes):
    reg = KNeighborsRegressor(n_neighbors=n_neighbors).fit(X_train, y_train)
    # Model predictions over the whole range, then training and test points
    ax.plot(line, reg.predict(line))
    ax.plot(X_train, y_train, '^', c=mglearn.cm2(0), markersize=8)
    ax.plot(X_test, y_test, 'v', c=mglearn.cm2(1), markersize=8)
    ax.set_title("{} neighbor(s)\n train score: {:.2f} test score: {:.2f}".format(
        n_neighbors, reg.score(X_train, y_train), reg.score(X_test, y_test)))
    ax.set_xlabel("Feature")
    ax.set_ylabel("Target")

2.3.4 Strengths and Weaknesses

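The KNeighbors classifier has two important parameters: the number of neighbors and the way distance between data points is measured. In general, 3 to 5 neighbors give fairly good results.
One of the strengths of k-NN is that the model is very easy to understand, and it gives decent results without much tuning. That is its biggest advantage: simple, easy to learn, and quick to get started with.
However, the algorithm struggles on datasets with many features and is also slow, so it is not often used in practice.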
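Both parameters named above map directly onto `KNeighborsClassifier` arguments, `n_neighbors` and `metric`. The sketch below uses two hypothetical training points, placed so that the choice of distance metric alone changes which one is nearest to the query and therefore changes the prediction.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: point A = (2.5, 2.5) has class 0, point B = (0, 4) has class 1.
X_train = np.array([[2.5, 2.5], [0.0, 4.0]])
y_train = np.array([0, 1])
query = np.array([[0.0, 0.0]])

# Euclidean distance from the query: A is about 3.54, B is 4.0, so A is nearest.
# Manhattan distance from the query: A is 5.0, B is 4.0, so B is nearest.
euclid = KNeighborsClassifier(n_neighbors=1, metric="euclidean").fit(X_train, y_train)
manhattan = KNeighborsClassifier(n_neighbors=1, metric="manhattan").fit(X_train, y_train)

pred_e = euclid.predict(query)[0]
pred_m = manhattan.predict(query)[0]
print(pred_e, pred_m)  # 0 1
```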
