1. Python Code
#!/usr/bin/env python3
# encoding: utf-8
'''
@file: WineCV.py
@time: 2020/6/13 0013 18:51
@author: Jack
@contact: [email protected]
'''
import urllib.request
import numpy as np
from sklearn.linear_model import LassoCV
from math import sqrt
import matplotlib.pyplot as plt
## Read the dataset
target_url = ("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv")
data = urllib.request.urlopen(target_url)
xList = []
labels = []
names = []
firstLine = True
for line in data:
    if firstLine:
        names = str(line, encoding='utf-8').strip().split(";")
        firstLine = False
    else:
        row = str(line, encoding='utf-8').strip().split(";")
        labels.append(float(row[-1]))
        row.pop()
        floatRow = [float(num) for num in row]
        xList.append(floatRow)
nrows = len(xList)
ncols = len(xList[0])
## Compute column means and standard deviations
xMean = []
xSD = []
for i in range(ncols):
    col = [xList[j][i] for j in range(nrows)]
    mean = sum(col) / nrows
    xMean.append(mean)
    colDiff = [(xList[j][i] - mean) for j in range(nrows)]
    sumSq = sum([d * d for d in colDiff])
    stdDev = sqrt(sumSq / nrows)
    xSD.append(stdDev)
xNormalized = []
for i in range(nrows):
    rowNormalized = [(xList[i][j] - xMean[j]) / xSD[j] for j in range(ncols)]
    xNormalized.append(rowNormalized)
meanLabel = sum(labels) / nrows
sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)]) / nrows)
labelNormalized = [(labels[i] - meanLabel) / sdLabel for i in range(nrows)]
## Unnormalized labels
Y = np.array(labels)
## Normalized labels
# Y = np.array(labelNormalized)
## Unnormalized X's
X = np.array(xList)
## Normalized X's
X = np.array(xNormalized)
## Call LassoCV from sklearn.linear_model
wineModel = LassoCV(cv=10).fit(X, Y)
## Display the results
plt.figure()
plt.plot(wineModel.alphas_, wineModel.mse_path_, ':')
plt.plot(wineModel.alphas_, wineModel.mse_path_.mean(axis=-1),
         label='Average MSE Across Folds', linewidth=2)
plt.axvline(wineModel.alpha_, linestyle='--',
            label='CV Estimate of Best alpha')
plt.semilogx()
plt.legend()
ax = plt.gca()
ax.invert_xaxis()
plt.xlabel('alpha')
plt.ylabel('Mean Square Error')
plt.axis('tight')
plt.show()
# Print the value of alpha that minimizes the CV error
print("alpha Value that Minimizes CV Error ", wineModel.alpha_)
print("Minimum MSE ", min(wineModel.mse_path_.mean(axis=-1)))
alpha Value that Minimizes CV Error 0.010948337166040092
Minimum MSE 0.4338019871536978
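The mse_path_ array plotted above holds one row per candidate alpha and one column per fold; its fold-wise mean gives the average-MSE curve. A minimal sketch on synthetic data (made up here, not the wine dataset) confirms the shapes the plot relies on:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic regression data -- a made-up stand-in for the wine data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
coef = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
Y = X @ coef + 0.1 * rng.normal(size=100)

model = LassoCV(cv=10).fit(X, Y)
# mse_path_ has shape (n_alphas, n_folds); (100, 10) with the defaults
print(model.mse_path_.shape)
# alpha_ is the regularization strength minimizing the mean CV error
print(model.alpha_)
```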
2. Code Explanation
The code above runs 10-fold cross-validation and plots the results. The first part reads the data from the UCI website, converts it into a list of lists, and normalizes both the attributes and the labels. The lists are then converted into the numpy array X (attribute matrix) and the array Y (label vector). The regression inputs are defined in two versions, one using normalized values and one using unnormalized values. For either the attributes or the labels, you can comment out one version and uncomment the other, then rerun the code to see the actual effect of normalization. A single line sets the number of cross-validation folds (10) and fits the model. The program then plots the error as a function of alpha for each of the 10 folds, together with the average error across the 10 folds.
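The hand-rolled normalization in part 1 is column-wise z-scoring with the population standard deviation (dividing by n, not n-1). sklearn's StandardScaler follows the same convention, so it could replace the manual loops; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up matrix standing in for the wine attributes
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# Column-wise z-scoring as in the manual loops: population SD (divide by n)
xMean = X.mean(axis=0)
xSD = X.std(axis=0)
xNormalized = (X - xMean) / xSD

# StandardScaler uses the same population-SD convention
scaler = StandardScaler()
print(np.allclose(xNormalized, scaler.fit_transform(X)))  # True
```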