Tuning the KNN Algorithm

Original

ckllf

2019-12-18 13:02

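　　1. Methods used:

　　Cross-validation and grid search

　　Cross-validation (used to make the evaluation of the model more accurate and trustworthy):

　　The training data is split into N equal parts; N parts means N-fold cross-validation. Each part takes one turn as the validation set while the model is trained on the remaining N-1 parts, and the N validation scores are averaged.

　　Grid search: tunes hyperparameters by trying every candidate value and keeping the one with the best cross-validation score. For k-nearest neighbors, the hyperparameter is K.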
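　　The two ideas above can be sketched by hand before reaching for GridSearchCV: score each candidate K with N-fold cross-validation and keep the best one. This is a minimal sketch, assuming the Iris data set as a small stand-in (it is not the data used in this post) and an illustrative candidate list.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Iris is only a small stand-in data set (an assumption for illustration).
X, y = load_iris(return_X_y=True)

# Candidate values of the hyperparameter K (illustrative choice).
candidates = [1, 3, 5]

mean_scores = {}
for k in candidates:
    # cv=5 splits the data into 5 parts; each part is the validation set once.
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    mean_scores[k] = fold_scores.mean()

# "Grid search" over one hyperparameter: keep the K with the best mean score.
best_k = max(mean_scores, key=mean_scores.get)

print("mean validation score per K:", mean_scores)
print("best K:", best_k)
```

　　GridSearchCV automates this loop and also refits the best model on the full training data.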

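　　2. API:

　　sklearn.model_selection.GridSearchCV (CV stands for cross-validation)

　　GridSearchCV(estimator, param_grid, cv=None)

　　.Performs an exhaustive search over the specified parameter values for an estimator

　　.estimator: the estimator object

　　.param_grid: a dict mapping the estimator's parameter names to lists of candidate values, e.g. {"n_neighbors": [1, 3, 5]}

　　.cv: the number of cross-validation folds

　　.fit: fits the search on the training data

　　.Analyzing the results:

　　.best_score_: the best mean score obtained in cross-validation

　　.best_estimator_: the model refit with the best parameters

　　.cv_results_: the validation-set scores for every parameter combination and fold (training-set scores are included only when return_train_score=True)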
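　　The attributes listed above can be tried on a small data set. This is a minimal sketch, assuming Iris as a stand-in data set and cv=3; return_train_score=True is set so that cv_results_ also carries the training-set scores.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Iris is only a small stand-in data set (an assumption for illustration).
X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    cv=3,
    return_train_score=True,  # also record training-set scores
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best mean validation score:", search.best_score_)
print("best model:", search.best_estimator_)
print("mean validation score per K:", search.cv_results_["mean_test_score"])
print("mean training score per K:", search.cv_results_["mean_train_score"])
```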

　　3. Tuning the earlier check-in prediction example:

# -*- coding: utf-8 -*-
'''
@Author: Jason
'''
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def knn():
    '''
    Predict a user's check-in place with k-nearest neighbors.
    :return: None
    '''
    # 1. Read the data
    data = pd.read_csv(r"./files/FBlocation/train.csv")
    print(data.head())

    # 2. Process the data
    # 2.1 Shrink the data set; query() filters rows much like a SQL WHERE clause.
    #     The result must be assigned back, otherwise the filter has no effect.
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75").copy()

    # 2.2 Convert the timestamps (unit: seconds)
    time_value = pd.to_datetime(data["time"], unit="s")
    print(time_value)

    # 2.3 Wrap them in a DatetimeIndex so that day, hour, etc. are available as attributes
    time_value = pd.DatetimeIndex(time_value)

    # 2.4 Build new features; year and month are the same for every row, so they are skipped
    data["day"] = time_value.day
    data["hour"] = time_value.hour
    data["weekday"] = time_value.weekday

    # 2.5 Drop the raw timestamp (axis=1 drops a column)
    data = data.drop(["time"], axis=1)

    # 2.6 Remove places with too few check-ins
    place_count = data.groupby("place_id").count()  # check-ins per place_id
    # Keep places with more than 3 check-ins; reset_index() turns place_id back into a column
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data["place_id"].isin(tf.place_id)]

    # 2.7 Separate the features and the target
    y = data["place_id"]
    x = data.drop(["place_id"], axis=1)

    # 2.8 Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # 3. Feature engineering (standardization), for comparison with the earlier run
    std = StandardScaler()
    # Fit the scaler on the training features only
    x_train_std = std.fit_transform(x_train)
    # The scaler is already fitted, so the test features only need transform()
    x_test_std = std.transform(x_test)

    # 4. Run the algorithm; n_neighbors is the hyperparameter the search overrides
    knn = KNeighborsClassifier(n_neighbors=5)

    # Candidate values to search over
    param = {"n_neighbors": [3, 5, 10]}

    # Run the grid search with 2-fold cross-validation
    gc = GridSearchCV(knn, param_grid=param, cv=2)
    gc.fit(x_train_std, y_train)

    # Report accuracy
    print("Accuracy on the test set:", gc.score(x_test_std, y_test))
    print("Best cross-validation score:", gc.best_score_)
    print("Best model:", gc.best_estimator_)
    print("Cross-validation results for each hyperparameter value:", gc.cv_results_)

    return None


if __name__ == "__main__":
    knn()
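　　Two of the preprocessing steps in the script above, extracting day/hour/weekday from a timestamp in seconds and dropping places with too few check-ins, can be tried in isolation. This is a minimal sketch, assuming a made-up toy table in place of train.csv; only the column names row_id, time and place_id come from the script.

```python
import pandas as pd

# Toy rows standing in for train.csv (the values are made up).
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3, 4, 5],
    "time": [0, 3600, 90000, 180000, 270000, 360000],  # seconds
    "place_id": [11, 11, 11, 11, 22, 22],
})

# Seconds since the epoch -> datetime values, wrapped in a DatetimeIndex
# whose .day / .hour / .weekday attributes give one value per row.
stamps = pd.DatetimeIndex(pd.to_datetime(toy["time"], unit="s"))
toy["day"] = stamps.day
toy["hour"] = stamps.hour
toy["weekday"] = stamps.weekday  # Monday is 0

# Count check-ins per place and keep only places with more than 3.
counts = toy.groupby("place_id").count()
frequent = counts[counts.row_id > 3].reset_index()
toy = toy[toy["place_id"].isin(frequent.place_id)]

print(toy)
```

　　In this toy table place 11 has four check-ins and is kept, while place 22 has only two and is dropped.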

　　Result of running the script:

　　The output shows that the final model selected K = 10.
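　　The chosen K can also be read directly from best_params_ instead of from the printed estimator, and cv_results_ is easier to scan as a table. The sketch below is one possible variation, not the author's script: it assumes synthetic data from make_classification in place of train.csv, and it wraps StandardScaler and the classifier in a Pipeline (step names "scale" and "knn" are arbitrary) so that scaling is refit inside each cross-validation fold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the post uses FBlocation/train.csv).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scaling and KNN as one estimator, so each fold scales on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
gc = GridSearchCV(pipe, param_grid={"knn__n_neighbors": [3, 5, 10]}, cv=2)
gc.fit(x_train, y_train)

print("chosen K:", gc.best_params_["knn__n_neighbors"])

# One row per candidate K, best first.
table = pd.DataFrame(gc.cv_results_)[
    ["param_knn__n_neighbors", "mean_test_score", "rank_test_score"]]
print(table.sort_values("rank_test_score"))

print("test-set accuracy:", gc.score(x_test, y_test))
```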
