20200203_KNN Classification Algorithm

This was a small freelance job for an overseas client; overall there is nothing technically difficult about it.
In this homework, you will develop a model to predict whether a given car gets high or low gas mileage, based on the Auto data set.

import numpy as np
import pandas as pd
%matplotlib inline
# Read in the data
test=pd.read_csv('Auto.csv')
# Show the first 5 rows
test.head()
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
# Display DataFrame info
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      397 non-null object
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
# The horsepower column uses '?' for missing values; replace them with NaN and drop those rows
test.replace('?', np.nan, inplace=True)
test.dropna(inplace=True)
# Cast horsepower to integer
test['horsepower'] = test['horsepower'].astype('int')
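
As a quick sanity check (a minimal sketch reusing the test DataFrame above, not part of the original post), the cleanup can be verified by re-inspecting the dtype and the remaining row count:

# After cleaning, horsepower should be integer and no missing values should remain
print(test['horsepower'].dtype)   # expected: int64 (int32 on some platforms)
print(test.isnull().sum().sum())  # expected: 0
print(len(test))                  # 392 rows for the standard Auto.csv (5 rows with '?' dropped)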

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. (10 points)
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

# Compute and display the median of mpg
median_mpg = test['mpg'].median()
print(median_mpg)
# Create the binary variable mpg01: 1 if mpg is above its median, 0 otherwise
def function(x):
    if x > median_mpg:
        return 1
    else:
        return 0
test['mpg01'] = test['mpg'].apply(function)
# Check the correlations between the features and mpg01
test.corr()
import seaborn as sns
g = sns.pairplot(test, hue='mpg01', palette='seismic', diag_kind = 'kde',diag_kws=dict(shade=True))
g.set()
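
Since the prompt also suggests boxplots, a short sketch along the following lines ranks the numeric features by their correlation with mpg01 and draws boxplots split by class. The choice of weight, displacement and horsepower as candidates is only illustrative, not taken from the original post.

# Rank numeric features by absolute correlation with mpg01
# (select_dtypes excludes the non-numeric 'name' column)
corr_with_target = test.select_dtypes('number').corr()['mpg01'].drop(['mpg', 'mpg01'])
print(corr_with_target.abs().sort_values(ascending=False))

# Boxplots of a few candidate features, grouped by mpg01
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ['weight', 'displacement', 'horsepower']):
    test.boxplot(column=col, by='mpg01', ax=ax)
plt.tight_layout()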

(c) Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
# Use train_test_split to split the data: 80% training, 20% test
x=test.drop(['mpg01','mpg','name'],axis=1)
y=test['mpg01']
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
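
Because train_test_split shuffles randomly, the test errors reported below will vary from run to run; passing a fixed random_state (for example random_state=42, an assumption not in the original code) would make them reproducible. A quick check of the resulting split:

# Check the split sizes and the class balance in the training set
print(X_train.shape, X_test.shape)           # roughly 80% / 20% of the rows
print(y_train.value_counts(normalize=True))  # both classes should be close to 50%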

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)

test.info()
# Import the LDA class
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Use weight, the feature most strongly associated with mpg01 in (b)
numerical = ['weight']
X_train1=X_train[numerical]
X_test1=X_test[numerical]
lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(X_train1, y_train)
print(lda.score(X_test1, y_test))  # score returns the classification accuracy on the test set
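
The question asks for the test error rather than the accuracy; since score() on a scikit-learn classifier returns mean accuracy, the test error is simply its complement:

# Test error of the LDA model = 1 - test accuracy
print('LDA test error: {:.3f}'.format(1 - lda.score(X_test1, y_test)))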

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
Qda = QuadraticDiscriminantAnalysis()
Qda.fit(X_train1, y_train)
print(Qda.score(X_test1, y_test))  # score returns the classification accuracy on the test set
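
As with LDA, the QDA test error is one minus the test accuracy; a confusion matrix (an addition not in the original post) also shows where the misclassifications occur:

# Test error and confusion matrix for the QDA model
from sklearn.metrics import confusion_matrix
print('QDA test error: {:.3f}'.format(1 - Qda.score(X_test1, y_test)))
print(confusion_matrix(y_test, Qda.predict(X_test1)))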

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# Use weight and cylinders as the predictors for logistic regression
numerical = ['weight', 'cylinders']
X_train1=X_train[numerical]
X_test1=X_test[numerical]
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train1,y_train)
from sklearn.metrics import classification_report
print('----------------Train Set----------------------')
print(classification_report(y_train, lr.predict(X_train1)))
print('----------------Test Set----------------------')
print(classification_report(y_test, lr.predict(X_test1)))
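
classification_report prints precision, recall and F1, but not the single test-error number the question asks for; accuracy_score from sklearn.metrics gives it directly:

# Test error of the logistic regression model
from sklearn.metrics import accuracy_score
print('Logistic regression test error: {:.3f}'.format(1 - accuracy_score(y_test, lr.predict(X_test1))))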

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

from sklearn.neighbors import KNeighborsClassifier
# Candidate values of K
neighbors = range(1, 30)
# Again use weight as the predictor
numerical = ['weight']
X_train1 = X_train[numerical]
X_test1 = X_test[numerical]
# Test-set accuracy for each K
knn_acc = []
# Train a KNeighborsClassifier for every K in neighbors, compute its accuracy
# on the test set, and append each result to knn_acc.
for i in neighbors:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train1, y_train)
    knn_acc.append(model.score(X_test1, y_test))
print(knn_acc)
import matplotlib.pyplot as plt
# Plot test accuracy against K
plt.plot(neighbors, knn_acc, label='test accuracy')
plt.legend()
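
To answer which K performs best, it is enough to pick the K with the highest test accuracy (equivalently, the lowest test error); a short sketch:

# Report the best K and its test error (the exact value depends on the random split above)
best_k = neighbors[int(np.argmax(knn_acc))]
print('Best K: {}, test error: {:.3f}'.format(best_k, 1 - max(knn_acc)))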