Data Analysis: Multivariable Regression to Predict Red Wine Taste

1. Python Code

#!/usr/bin/env python3
# encoding: utf-8
'''
@file: WineCV.py
@time: 2020/6/13 0013 18:51
@author: Jack
@contact: [email protected]
'''

import urllib.request
import numpy as np
from sklearn.linear_model import LassoCV
from math import sqrt
import matplotlib.pyplot as plt

## Read the red wine quality dataset from the UCI repository
target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")
data = urllib.request.urlopen(target_url)
xList = []
labels = []
names = []
firstLine = True
for line in data:
    if firstLine:
        # The first line holds the semicolon-separated column names
        names = str(line, encoding='utf-8').strip().split(";")
        firstLine = False
    else:
        # Remaining lines: the last field is the quality label,
        # the preceding fields are the attributes
        row = str(line, encoding='utf-8').strip().split(";")
        labels.append(float(row[-1]))
        row.pop()
        floatRow = [float(num) for num in row]
        xList.append(floatRow)

nrows = len(xList)
ncols = len(xList[0])

## Compute per-column means and standard deviations
xMean = []
xSD = []
for i in range(ncols):
    col = [xList[j][i] for j in range(nrows)]
    mean = sum(col) / nrows
    xMean.append(mean)
    colDiff = [(xList[j][i] - mean) for j in range(nrows)]
    sumSq = sum([colDiff[k] * colDiff[k] for k in range(nrows)])
    stdDev = sqrt(sumSq / nrows)
    xSD.append(stdDev)

## Normalize each attribute column to zero mean and unit standard deviation
xNormalized = []
for i in range(nrows):
    rowNormalized = [(xList[i][j] - xMean[j]) / xSD[j] for j in range(ncols)]
    xNormalized.append(rowNormalized)


## Compute the mean and standard deviation of the labels, then normalize them
meanLabel = sum(labels) / nrows
sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)]) / nrows)

labelNormalized = [(labels[i] - meanLabel) / sdLabel for i in range(nrows)]

## Unnormalized labels
Y = np.array(labels)

## Normalized labels (uncomment to use these instead)
# Y = np.array(labelNormalized)

## Unnormalized X's
X = np.array(xList)

## Normalized X's (this overwrites the unnormalized X above)
X = np.array(xNormalized)

## Call LassoCV from sklearn.linear_model
wineModel = LassoCV(cv=10).fit(X, Y)

## Display the cross-validation results

plt.figure()
plt.plot(wineModel.alphas_, wineModel.mse_path_, ':')
plt.plot(wineModel.alphas_, wineModel.mse_path_.mean(axis=-1),
         label='Average MSE Across Folds', linewidth=2)
plt.axvline(wineModel.alpha_, linestyle='--',
            label='CV Estimate of Best alpha')
plt.semilogx()
plt.legend()
ax = plt.gca()
ax.invert_xaxis()
plt.xlabel('alpha')
plt.ylabel('Mean Square Error')
plt.axis('tight')
plt.show()

# Print the value of alpha that minimizes the CV error
print("alpha Value that Minimizes CV Error  ", wineModel.alpha_)
print("Minimum MSE  ", min(wineModel.mse_path_.mean(axis=-1)))

[Figure: MSE on each fold and the average MSE across folds plotted against alpha on a log scale, with the CV-selected alpha marked by a dashed vertical line]

alpha Value that Minimizes CV Error   0.010948337166040092
Minimum MSE   0.4338019871536978
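Beyond the optimal alpha and its MSE, it is often useful to see which attributes the cross-validated Lasso actually retains. The following is a minimal sketch that assumes the script above has already run, so that wineModel and names are still in scope; names[:-1] drops the final column name, which belongs to the quality label.

## Inspect the fitted coefficients (sketch; assumes wineModel and names
## from the script above are still in scope)
attributeNames = names[:-1]  # the last column name is the label
coefList = list(zip(attributeNames, wineModel.coef_))
## Sort by absolute magnitude so the most influential attributes come first
coefList.sort(key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in coefList:
    print(name, coef)

Coefficients driven to exactly zero correspond to attributes the Lasso has dropped at the chosen alpha.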

2. Code Explanation

The code above runs 10-fold cross-validation for a Lasso model and plots the result. The first part reads the data from the UCI website, converts it into a list of lists, and then normalizes both the attributes and the labels. The lists are then converted into the numpy array X (attribute matrix) and the array Y (label vector). Both X and Y are defined in two versions, one using the normalized values and one using the unnormalized values; commenting out one version and re-running the code shows the practical effect of normalizing the attributes or the labels. A single line sets the number of cross-validation folds (10) and trains the model. The program then plots the error as a function of alpha for each of the 10 folds, together with the average error across the folds.
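For comparison, the hand-written mean and standard-deviation loops in the first part of the code can be replaced by an equivalent numpy version. The sketch below is only an illustration, under the assumption that xList and labels have already been filled by the loading loop above; numpy's default standard deviation (division by n) matches the manual sumSq / nrows calculation.

## Equivalent normalization with numpy (sketch; assumes xList and labels
## have been filled by the loading loop above)
import numpy as np

Xraw = np.array(xList)
Yraw = np.array(labels)

## np.std divides by n by default (population formula), matching the
## manual computation in the script
Xnorm = (Xraw - Xraw.mean(axis=0)) / Xraw.std(axis=0)
Ynorm = (Yraw - Yraw.mean()) / Yraw.std()

Fitting LassoCV(cv=10) on Xnorm and Yraw should reproduce the results above up to floating-point differences.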
