Machine Learning: k-Nearest Neighbors (kNN), Part 2

1. Hyperparameters in the kNN algorithm

"""
    超參數 :運行機器學習算法之前需要指定的參數
    模型參數:算法過程中學習的參數

    kNN算法沒有模型參數
    kNN算法中的k是典型的超參數

    尋找最好的k
"""

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()

# feature matrix
X = digits.data
# labels
Y = digits.target

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

best_score = 0.0
best_k = -1
best_method = ""
# First pass: search over k only
# for k in range(1, 11):
#     knn_clf = KNeighborsClassifier(n_neighbors=k)
#     knn_clf.fit(x_train, y_train)
#     score = knn_clf.score(x_test, y_test)
#     if score > best_score:
#         best_k = k
#         best_score = score

# Second pass: also search over the weighting scheme
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method

print(best_k)
print(best_score)
print(best_method)
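
Besides k and the weighting scheme, the Minkowski distance exponent p (p=1 is Manhattan distance, p=2 Euclidean) is another kNN hyperparameter. A minimal sketch of extending the manual search to p, assuming distance weighting (the grid is larger, so this runs noticeably longer):

best_score = 0.0
best_k, best_p = -1, -1
for k in range(1, 11):
    for p in range(1, 6):
        # p changes the distance metric used to find and weight neighbors
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(x_train, y_train)
        score = knn_clf.score(x_test, y_test)
        if score > best_score:
            best_k, best_p, best_score = k, p, score

print(best_k)
print(best_p)
print(best_score)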

2. Using GridSearchCV

"""
    Grid Search
"""

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)],
    },
    {
        # the Minkowski exponent p is searched only together with distance weighting
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]

digits = datasets.load_digits()

# feature matrix
X = digits.data
# labels
Y = digits.target

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

knn_clf = KNeighborsClassifier()

grid_search = GridSearchCV(knn_clf, param_grid, verbose=2)

grid_search.fit(x_train, y_train)

best_estimator = grid_search.best_estimator_
best_score = grid_search.best_score_
best_params = grid_search.best_params_

print(best_estimator)
print(best_score)
print(best_params)
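
With the default refit=True, GridSearchCV refits the best parameter combination on the whole training split, so best_estimator_ can be scored directly on the held-out test set. A minimal sketch (note that best_score_ above is a cross-validation score on the training split, so it generally differs from the test score):

# evaluate the refitted best model on the held-out test split
best_knn = grid_search.best_estimator_
print(best_knn.score(x_test, y_test))

# the search can also be parallelized across CPU cores:
# grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)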

3. Why normalize the data?

            Tumor size (cm)   Time since discovery (days)
Sample 1    1                 200
Sample 2    5                 100

The distance between samples is dominated by the time-since-discovery feature.
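
A quick check of the numbers above (a small sketch, not from the original post) shows how lopsided the distance is:

import numpy as np

sample1 = np.array([1, 200])  # (tumor size in cm, days since discovery)
sample2 = np.array([5, 100])

# sqrt(4**2 + 100**2) ≈ 100.08: the 100-day gap swamps the 4 cm gap
print(np.linalg.norm(sample1 - sample2))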

Data normalization

Solution: map all the data onto the same scale.

Min-max scaling (normalization): maps all data into the interval [0, 1]

x_scale = (x - x_min) / (x_max - x_min)

Works well for distributions with clear boundaries; strongly affected by outliers.

Mean-variance normalization (standardization): maps all data to a distribution with mean 0 and variance 1.

Works well for distributions without clear boundaries, where extreme values may be present.

x_scale = (x - x_mean) / S

Note: x_mean is the mean and S is the standard deviation.

Example

import numpy as np

x = np.random.randint(0, 100, size=100)

# min-max normalization
x_data = (x - np.min(x)) / (np.max(x) - np.min(x))

X = np.random.randint(0, 100, (50, 2))
X = np.array(X, dtype=float)

# scale each column by its own min and max
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))

# mean-variance normalization (standardization)
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)

X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])

print(np.mean(X2[:, 0]))
print(np.std(X2[:, 0]))
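
scikit-learn also ships these transformations: MinMaxScaler for min-max scaling and StandardScaler (used in the next section) for mean-variance normalization. A minimal MinMaxScaler sketch:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X3 = np.random.randint(0, 100, (50, 2)).astype(float)

min_max_scaler = MinMaxScaler()
X3_scaled = min_max_scaler.fit_transform(X3)  # each column mapped into [0, 1]

print(np.min(X3_scaled, axis=0))  # [0. 0.]
print(np.max(X3_scaled, axis=0))  # [1. 1.]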

Normalizing the test set

Normalize the test set using the mean_train and std_train obtained from the training data:

(x_test - mean_train) / std_train

This yields the normalized test set.

The test set simulates the real environment,
and in the real environment we very likely cannot obtain the mean and variance of all incoming data.
Normalizing the data is therefore part of the algorithm itself: its statistics must come from the training set.
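
Written out by hand in NumPy, the rule looks like this (a sketch with hypothetical arrays; scikit-learn's StandardScaler below does this bookkeeping for you):

import numpy as np

x_train = np.random.random((80, 4))  # hypothetical training features
x_test = np.random.random((20, 4))   # hypothetical test features

mean_train = np.mean(x_train, axis=0)  # per-feature mean, from the training set only
std_train = np.std(x_train, axis=0)    # per-feature std, from the training set only

x_test_standard = (x_test - mean_train) / std_train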

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data
y = iris.target

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)

std_scaler = StandardScaler()
# after fit, the scaler stores the per-feature mean and standard deviation
std_scaler.fit(x_train)

x_train_standard = std_scaler.transform(x_train)
x_test_standard = std_scaler.transform(x_test)

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(x_train_standard, y_train)
score = knn_clf.score(x_test_standard, y_test)

print(std_scaler.mean_)
print(std_scaler.scale_)
print(score)  # 1.0

After normalization, the classification accuracy is 1.0.
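
As a sanity check (not in the original post), the same classifier can be trained on the raw, un-scaled split for comparison. On iris the four features are on similar scales, so the gap may be small; on data like the tumor example above it can be large:

# same split, no scaling, for comparison
knn_raw = KNeighborsClassifier(n_neighbors=3)
knn_raw.fit(x_train, y_train)  # raw features from the split above
print(knn_raw.score(x_test, y_test))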
