20200203_KNN Classification Algorithm

This is a job from an overseas client; overall, there is nothing technically difficult about it.
In this homework, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

import numpy as np
import pandas as pd
%matplotlib inline
# Read the data
test = pd.read_csv('Auto.csv')
# Show the first 5 rows
test.head()
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130 3504 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165 3693 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150 3436 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150 3433 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140 3449 10.5 70 1 ford torino
# Show basic information about the data
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      397 non-null object
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.0+ KB
# The horsepower column has missing values marked with '?', so replace them with NaN and drop those rows
test.replace('?', np.nan, inplace=True)
test.dropna(inplace=True)
# Cast horsepower to an integer type
test['horsepower'] = test['horsepower'].astype('int')

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. (10 points)
(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

# Compute the median of mpg
median_mpg = test['mpg'].median()
median_mpg
# Helper function that splits mpg into the binary class variable
def function(x):
    if x > median_mpg:
        return 1
    else:
        return 0
test['mpg01'] = test['mpg'].apply(function)
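As a side note, the same column can also be built without a helper function; a minimal vectorized sketch, assuming the DataFrame is still named test:

# Vectorized alternative: compare against the median and cast the boolean result to 0/1
test['mpg01'] = (test['mpg'] > test['mpg'].median()).astype(int)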
# Check how strongly each feature correlates with mpg01
test.corr()
import seaborn as sns
# Pairwise scatterplots colored by mpg01, with KDEs on the diagonal
g = sns.pairplot(test, hue='mpg01', palette='seismic', diag_kind='kde', diag_kws=dict(shade=True))
g.set()
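The question also suggests boxplots; a minimal sketch that draws boxplots of a few candidate predictors grouped by mpg01 (the feature list here is just an illustrative assumption):

import matplotlib.pyplot as plt
# Boxplots of selected features, grouped by the mpg01 class
test.boxplot(column=['weight', 'horsepower', 'displacement', 'acceleration'], by='mpg01', figsize=(10, 8))
plt.show()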

(c) Split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
# Use train_test_split to split the data: 80% training set, 20% test set
x = test.drop(['mpg01', 'mpg', 'name'], axis=1)
y = test['mpg01']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
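Without a fixed seed the split, and therefore every test error reported below, changes from run to run; if reproducibility is desired, one option is to pass a seed (the value 0 here is an arbitrary choice):

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)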

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)

# Re-check the column types after cleaning
test.info()
# Import the LDA model
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Use weight, the feature chosen from part (b), as the predictor
numerical = ['weight']
X_train1 = X_train[numerical]
X_test1 = X_test[numerical]
lda = LinearDiscriminantAnalysis(n_components=1)
lda.fit(X_train1, y_train)
print(lda.score(X_test1, y_test))  # score is the classification accuracy on the test set
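Since the question asks for the test error rather than the accuracy, a minimal sketch converting the score:

# Test error = 1 - classification accuracy on the test set
lda_error = 1 - lda.score(X_test1, y_test)
print(lda_error)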

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained? (15 points)

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
Qda = QuadraticDiscriminantAnalysis()
Qda.fit(X_train1, y_train)
print(Qda.score(X_test1, y_test))  # score is the classification accuracy; the test error is 1 - score

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

# For logistic regression, use weight and cylinders as the predictors
numerical = ['weight', 'cylinders']
X_train1 = X_train[numerical]
X_test1 = X_test[numerical]
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr = lr.fit(X_train1, y_train)
from sklearn.metrics import classification_report
print('----------------Train Set----------------------')
print(classification_report(y_train, lr.predict(X_train1)))
print('----------------Test Set----------------------')
print(classification_report(y_test, lr.predict(X_test1)))
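classification_report shows precision, recall, and F1, but not the single test-error number the question asks for; a minimal sketch using accuracy_score:

from sklearn.metrics import accuracy_score
# Test error = 1 - accuracy on the held-out set
lr_error = 1 - accuracy_score(y_test, lr.predict(X_test1))
print(lr_error)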

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

from sklearn.neighbors import KNeighborsClassifier
# Candidate values of K
neighbors = range(1, 30)
# Use weight as the single predictor again
numerical = ['weight']
X_train1 = X_train[numerical]
X_test1 = X_test[numerical]
# Test-set accuracy for each K
knn_acc = []
# For every K in neighbors, fit a KNeighborsClassifier on the training set,
# score it on the test set, and collect the accuracy in knn_acc
for i in neighbors:
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train1, y_train)
    knn_acc.append(model.score(X_test1, y_test))
print(knn_acc)
import matplotlib.pyplot as plt
# Plot test accuracy against K
plt.plot(neighbors, knn_acc, label='test accuracy')
plt.xlabel('K')
plt.ylabel('test accuracy')
plt.legend()
plt.show()
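To answer which value of K performs best, a minimal sketch that picks the K with the highest test accuracy (equivalently, the lowest test error):

import numpy as np
# neighbors starts at 1, so the best K is the index of the maximum accuracy plus 1
best_k = list(neighbors)[int(np.argmax(knn_acc))]
print('best K:', best_k, 'test error:', 1 - max(knn_acc))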