Data Analysis: Multivariable Regression for Predicting Wine Taste

1. Python Code

#!/usr/bin/env python3
# encoding: utf-8
'''
@file: WineCV.py
@time: 2020/6/13 0013 18:51
@author: Jack
@contact: [email protected]
'''

import urllib.request
import numpy as np
from sklearn.linear_model import LassoCV
from math import sqrt
import matplotlib.pyplot as plt

## Read the dataset
target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")
data = urllib.request.urlopen(target_url)
xList = []
labels = []
names = []
firstLine = True
for line in data:
    if firstLine:
        names = str(line, encoding='utf-8').strip().split(";")
        firstLine = False
    else:
        row = str(line, encoding='utf-8').strip().split(";")
        labels.append(float(row[-1]))
        row.pop()
        floatRow = [float(num) for num in row]
        xList.append(floatRow)

nrows = len(xList)
ncols = len(xList[0])

## Compute column means and standard deviations
xMean = []
xSD = []
for i in range(ncols):
    col = [xList[j][i] for j in range(nrows)]
    mean = sum(col) / nrows
    xMean.append(mean)
    colDiff = [(xList[j][i] - mean) for j in range(nrows)]
    sumSq = sum(d * d for d in colDiff)
    stdDev = sqrt(sumSq / nrows)
    xSD.append(stdDev)

xNormalized = []
for i in range(nrows):
    rowNormalized = [(xList[i][j] - xMean[j]) / xSD[j] for j in range(ncols)]
    xNormalized.append(rowNormalized)


meanLabel = sum(labels) / nrows
sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)]) / nrows)

labelNormalized = [(labels[i] - meanLabel) / sdLabel for i in range(nrows)]

## Un-normalized labels
Y = np.array(labels)

## Normalized labels
# Y = np.array(labelNormalized)

## Un-normalized X's
X = np.array(xList)

## Normalized X's
X = np.array(xNormalized)

## Call LassoCV from sklearn.linear_model
wineModel = LassoCV(cv=10).fit(X, Y)

## Display the results

plt.figure()
plt.plot(wineModel.alphas_, wineModel.mse_path_, ':')
plt.plot(wineModel.alphas_, wineModel.mse_path_.mean(axis=-1),
         label='Average MSE Across Folds', linewidth=2)
plt.axvline(wineModel.alpha_, linestyle='--',
            label='CV Estimate of Best alpha')
plt.semilogx()
plt.legend()
ax = plt.gca()
ax.invert_xaxis()
plt.xlabel('alpha')
plt.ylabel('Mean Square Error')
plt.axis('tight')
plt.show()

# Print out the value of alpha that minimizes the CV error
print("alpha Value that Minimizes CV Error  ", wineModel.alpha_)
print("Minimum MSE  ", min(wineModel.mse_path_.mean(axis=-1)))

[Figure: MSE vs. alpha for each of the 10 cross-validation folds (dotted), their average (solid), and a dashed vertical line at the CV-selected best alpha]

alpha Value that Minimizes CV Error   0.010948337166040092
Minimum MSE   0.4338019871536978
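Beyond the best alpha and minimum MSE, the fitted model also exposes the Lasso coefficients, which can be paired with the column names parsed from the CSV header to see which attributes drive the prediction. The snippet below is a sketch of that idea on small synthetic data (so it runs without re-downloading the UCI file); the feature names `feat0`..`feat3` are placeholders, and with the real model you would use `wineModel.coef_` together with the `names` list instead.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the wine data: 100 samples, 4 features,
# where only the first two features actually drive the target.
rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.randn(100)

model = LassoCV(cv=10).fit(X, y)

# Pair each coefficient with its (hypothetical) column name and rank
# by absolute magnitude -- the same pattern applies to wineModel.coef_
# and the `names` list parsed from the CSV header line.
names = ["feat0", "feat1", "feat2", "feat3"]
ranked = sorted(zip(names, model.coef_), key=lambda t: -abs(t[1]))
for name, coef in ranked:
    print(f"{name}: {coef:.3f}")
```

Because Lasso shrinks uninformative coefficients toward zero, the two signal features should dominate the ranking while the noise features end up near zero.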

2. Code Explanation

The code above performs 10-fold cross-validation and plots the results. The first part reads the data from the UCI website, converts it into a list of lists, and then normalizes both the attributes and the labels. The lists are then converted into the numpy array X (the attribute matrix) and the array Y (the label vector). The regression setup has two versions: one uses normalized values and the other uses un-normalized values. For either choice, you can comment out the corresponding assignment and rerun the code to see the actual effect of normalizing the attributes or the labels. A single line defines the number of cross-validation folds (10) and fits the model. The program then plots the error as a function of alpha for each of the 10 folds, along with the average error across all 10 folds.
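The manual normalization loops in part 1 can also be written in a few vectorized numpy lines. The sketch below uses a toy stand-in for `xList`; note the population standard deviation (`ddof=0`, numpy's default), which matches the manual code's `sqrt(sumSq / nrows)`.

```python
import numpy as np

# Toy stand-in for xList: 4 rows, 3 attribute columns.
xList = [[1.0, 10.0, 100.0],
         [2.0, 20.0, 200.0],
         [3.0, 30.0, 300.0],
         [4.0, 40.0, 400.0]]

X = np.array(xList)
# Column-wise mean and population standard deviation (ddof=0),
# equivalent to the manual xMean / xSD loops.
xMean = X.mean(axis=0)
xSD = X.std(axis=0)
xNormalized = (X - xMean) / xSD  # broadcasts over the rows

# Each column now has mean 0 and standard deviation 1.
print(xNormalized.mean(axis=0))
print(xNormalized.std(axis=0))
```

The same two lines applied to the labels reproduce `labelNormalized`.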
