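1. Introduction
Features that differ widely in offset and range can hurt the performance of a machine learning model, so normalizing (standardizing) the data often improves results.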
2. Data standardization
from sklearn import preprocessing  # module that provides data standardization
import numpy as np
data = np.array([[13, 54, 7, -5],
                 [67, 98, 11, 34],
                 [-56, 49, 22, 39]], dtype=np.float64)
print(data)
print(preprocessing.scale(data))  # preprocessing.scale standardizes each column
# Output
[[ 13. 54. 7. -5.]
[ 67. 98. 11. 34.]
[-56. 49. 22. 39.]]
[[ 0.09932686 -0.59050255 -0.99861783 -1.40657764]
[ 1.17205693 1.40812146 -0.36791183 0.57618843]
[-1.27138379 -0.81761891 1.36652966 0.83038921]]
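After standardization, each column (feature) has a mean of 0 and a variance of 1. The data are only shifted and rescaled; the shape of their distribution does not change.
data_ = preprocessing.scale(data)
print(data_.mean(axis=0))  # per-column mean, approximately 0
print(data_.std(axis=0))   # per-column standard deviation, 1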
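For reference, the same numbers can be reproduced by hand: subtract each column's mean and divide by its standard deviation. The following is a minimal sketch, assuming the population standard deviation (NumPy's default, ddof=0). The helper name standardize_columns is made up for illustration, and the sketch does not guard against constant columns, where the standard deviation is 0.

```python
import numpy as np

def standardize_columns(a):
    """Column-wise z-score: (x - mean) / std, using the population std (ddof=0)."""
    a = np.asarray(a, dtype=np.float64)
    mean = a.mean(axis=0)  # one mean per column
    std = a.std(axis=0)    # one standard deviation per column
    return (a - mean) / std

data = np.array([[13, 54, 7, -5],
                 [67, 98, 11, 34],
                 [-56, 49, 22, 39]], dtype=np.float64)

manual = standardize_columns(data)
print(manual)
```

The first column, for example, has a mean of 8, so its first entry becomes (13 - 8) divided by that column's standard deviation, which gives the 0.0993 shown in the output above.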
3. Comparing results before and after standardization
from sklearn import preprocessing  # module that provides data standardization
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification  # generates a synthetic classification dataset
from sklearn.svm import SVC
import matplotlib.pyplot as plt
# number of features = n_informative + n_redundant + n_repeated
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42,
                           n_clusters_per_class=1, scale=100)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
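3.1. Before standardization
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = SVC()
model.fit(x_train, y_train)
print("Classification accuracy:", model.score(x_test, y_test))
# Output
Classification accuracy: 0.48333333333333334
Before standardization, the prediction accuracy is only 0.48. The split is random, so the exact figure varies between runs. It also depends on the SVC defaults: scikit-learn releases from 0.22 onward default to gamma='scale', which already takes feature variance into account, so the gap may be much smaller there.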
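One plausible reason for the poor score, offered as an assumption rather than something the run above proves: SVC uses an RBF kernel by default, exp(-gamma * ||a - b||^2). When features span hundreds of units, the squared distances are so large that the kernel value collapses to almost zero for nearly every pair of points. The sketch below uses two made-up points and gamma = 0.5 (what gamma='auto' yields for two features). Dividing by 100 merely stands in for standardization.

```python
import numpy as np

def rbf_kernel_value(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return np.exp(-gamma * np.dot(diff, diff))

gamma = 0.5  # assumed value: 1 / n_features for two features

# Hypothetical points on the raw scale (feature values in the hundreds)
raw_a, raw_b = [120.0, -80.0], [95.0, -60.0]
# The same two points divided by 100, standing in for standardized values
scaled_a, scaled_b = [1.20, -0.80], [0.95, -0.60]

k_raw = rbf_kernel_value(raw_a, raw_b, gamma)
k_scaled = rbf_kernel_value(scaled_a, scaled_b, gamma)
print(k_raw, k_scaled)
```

On the raw scale the two points look completely unrelated to the kernel, while on the rescaled data the same pair is recognized as close neighbours.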
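3.2. After standardization
Standardization changes the units of the data: every feature in X is rescaled into a comparable range.
X = preprocessing.scale(X)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = SVC()
model.fit(x_train, y_train)
print("Classification accuracy:", model.score(x_test, y_test))
# Output
Classification accuracy: 0.9166666666666666
After standardization, the prediction accuracy rises to 0.92.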
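Note that preprocessing.scale(X) computes the mean and standard deviation over the whole dataset, test rows included. A common alternative is to learn the scaling from the training split only and reuse it on the test split. The following is a minimal sketch using StandardScaler inside a pipeline. The random_state=0 on the split is an assumption added for reproducibility, and the resulting accuracy will not necessarily match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same synthetic dataset as above
X, y = make_classification(n_samples=400, n_features=2, n_redundant=0,
                           n_informative=2, random_state=42,
                           n_clusters_per_class=1, scale=100)

# random_state=0 is an assumed value so the split is repeatable
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# fit() fits the scaler on x_train only; score() reuses those
# training statistics to transform x_test before predicting
model = make_pipeline(StandardScaler(), SVC())
model.fit(x_train, y_train)
accuracy = model.score(x_test, y_test)
print("Classification accuracy:", accuracy)
```

Bundling the scaler and the classifier also means any later call to model.predict applies the same scaling automatically.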