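Data Loading

Dataset: pima-indians.data.csv

The "Pima Indians Diabetes" problem serves as the test dataset. It contains records for 768 patients. Each record has 8 numeric features in columns 1 through 8 (there is no record-index column; the first column is the number of pregnancies), and column 9 holds a 0/1 label indicating whether the patient developed diabetes within 5 years.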

# coding=utf-8
# Load the data
import numpy as np
from urllib.request import urlopen  # Python 3 location of urlopen

# url with dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

# download the file
raw_data = urlopen(url)

# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data, delimiter=",")

# separate the data from the target attributes
X = dataset[:,0:8]            # feature matrix X: columns 0-7 hold the 8 features
y = dataset[:,8]              # label vector y: the last column
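
The same column split can be checked offline. The sketch below parses a few hand-typed rows in the dataset's layout (8 features followed by the 0/1 label) from an in-memory buffer; the name sample_csv and the three rows are illustrative assumptions, not part of the original code.

```python
import io

import numpy as np

# Three illustrative rows in the dataset's layout: 8 features, then the 0/1 label.
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# np.loadtxt accepts any file-like object, not only a downloaded response.
dataset = np.loadtxt(sample_csv, delimiter=",")

X = dataset[:, :8]             # feature matrix: every column except the last
y = dataset[:, 8].astype(int)  # label vector: the last column, as integers

print(X.shape, y.shape)  # (3, 8) (3,)
print(y)                 # [1 0 1]
```

Passing the path of a local copy of the file to np.loadtxt instead of the buffer should work the same way.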

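Data Normalization

Apply either normalization or standardization.

Normalization: a linear transformation of the raw data that maps it into the range [0,1].

Rationale: different variables usually have different units and scales. Normalization removes the effect of those scales on the final result and makes the variables comparable.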

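Formula: x' = (x - min) / (max - min)

Here min is the smallest value in the sample and max is the largest. In a streaming setting both values keep changing, and both are easily distorted by outliers, so this method is not robust and only suits traditional, small, clean datasets.

Standardization: remove the mean and scale to unit variance, that is, for each feature subtract its mean and then divide by its standard deviation. For every feature the values end up centered around 0 with variance 1.

Formula: z = (x - μ) / σ

Rationale: the formula expresses how many standard deviations a raw value lies from the mean. It is a relative value, so it also removes the effect of units, and it gives each feature mean 0 and standard deviation 1. In a distance computation a variable's influence is proportional to its variance in the dataset, so giving every dimension standard deviation 1 (and hence variance 1) makes all dimensions equally important when distances are computed.

Here μ is the sample mean and σ is the sample standard deviation, both estimated from the available samples. With enough samples the estimates are fairly stable, which suits noisy, large datasets.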
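
As a worked example of the two formulas, here is a minimal NumPy sketch on a made-up 4x2 matrix whose two features have very different scales. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Min-max normalization: x' = (x - min) / (max - min), per feature (column).
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization: z = (x - mean) / std, per feature (column).
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population standard deviation (ddof=0)
X_std = (X - mu) / sigma

print(X_minmax)            # every column now spans exactly [0, 1]
print(X_std.mean(axis=0))  # approximately 0 for every feature
print(X_std.std(axis=0))   # 1 for every feature
```

After either transform the two columns are identical, which shows that the original difference in scale has been removed.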

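# Data normalization
from sklearn import preprocessing

# normalize the data attributes
# note: preprocessing.normalize rescales each sample (row) to unit L2 norm,
# which is not the min-max scaling to [0,1] described above
normalized_X = preprocessing.normalize(X)

# min-max scale each feature (column) to [0,1], as described above
minmax_X = preprocessing.minmax_scale(X)

# standardize the data attributes
standardized_X = preprocessing.scale(X)  # standardize the given data directly

# scaler = preprocessing.StandardScaler().fit(X)
# Alternative: the sklearn.preprocessing.StandardScaler class stores the
# parameters (mean, variance) fitted on the training set, so the same scaler
# object can then transform the test set with the training-set parameters.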
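
The StandardScaler route can be sketched as follows. The data here is synthetic (drawn from a normal distribution) and the split sizes are arbitrary assumptions; the point is only that the scaler is fitted on the training part and then reused on the test part.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 3 features (an assumption for this sketch).
rng = np.random.RandomState(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Fit on the training set only: this stores the per-feature mean and scale.
scaler = StandardScaler().fit(X_train)

# Apply the stored training-set parameters to both parts.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(scaler.mean_)              # per-feature means learned from the training set
print(X_train_std.mean(axis=0))  # approximately 0
print(X_test_std.mean(axis=0))   # close to 0, but not exactly 0
```

The test set does not come out with mean exactly 0, because it is scaled with the training-set statistics rather than its own.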

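When to use which: whenever distances between points are computed, either normalization or standardization generally improves the final result. If every dimension should contribute equally to the distance, choose standardization; if you want to keep the implicit weighting among features that their standard deviations reflect in the raw data, choose normalization.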

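Feature Selection

When solving real problems, choosing suitable features, or re-abstracting and constructing new ones, matters a great deal. Many ready-made algorithms exist for feature selection. The example below uses ExtraTreesClassifier to compute how informative each feature is: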

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X,y)

# display the relative importance of each attribute
print(model.feature_importances_)

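The importance of each feature is reported as a floating-point value. In this run, the second feature has the strongest discriminative power.

The classifier belongs to the Extremely Randomized Trees family, which provides two classes: ExtraTreesClassifier for classification and ExtraTreesRegressor for regression.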
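
To read the importances more easily, they can be sorted into a ranking. The sketch below does this on synthetic data from make_classification rather than on the diabetes data, so the printed values say nothing about the real dataset; n_estimators and random_state are arbitrary choices made for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in: 8 features, of which 3 carry signal (an assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds one non-negative score per feature; they sum to 1.
importances = model.feature_importances_

# Indices of the features, ordered from most to least important.
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print("feature %d: %.3f" % (idx, importances[idx]))
```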

Basic Algorithms

Logistic Regression

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))

print(metrics.confusion_matrix(expected, predicted))        # compute the confusion matrix to assess classification accuracy

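Confusion matrix:

Common evaluation metrics for classification include the confusion matrix, accuracy, recall, and f1-score. The sklearn.metrics module covers most of them.

The confusion matrix is an important tool for visualizing and evaluating the performance of a classification model. The term is used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. Each column represents a predicted class and each row represents the actual class of the samples, so all correct predictions lie on the diagonal. Structure for the binary 0/1 case:

              predicted 0         predicted 1
actual 0      true negatives      false positives
actual 1      false negatives     true positives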
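
A small worked example makes the layout concrete. The two label lists below are made up for illustration; precision and recall are then derived by hand from the four cells.

```python
from sklearn import metrics

# Hypothetical labels for 10 samples (not taken from the diabetes data).
expected  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(expected, predicted)
print(cm)
# [[4 1]
#  [1 4]]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found

print(accuracy, precision, recall)  # 0.8 0.8 0.8
```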

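Naive Bayes

Naive Bayes works by recovering the distribution density of the training samples, and it performs well on multi-class classification.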

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

KNN

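KNN is often used as one component of a larger classification algorithm, and it can also be used to evaluate and select features.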

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

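Decision Tree

Decision trees come in two kinds, for classification and for regression (Classification and Regression Trees, CART). They are commonly used for classification or regression problems whose features carry categorical information, and they handle multi-class problems.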

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

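The results show the decision tree scoring best, but only because the test set is the same as the training set: an unpruned tree can simply memorize the data it was trained on.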
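
One way to see the effect is to hold out part of the data. The sketch below uses synthetic data from make_classification with some label noise (flip_y), not the diabetes data; the 70/30 split and all parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 10% randomly flipped labels (an assumption).
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           n_redundant=0, flip_y=0.1, random_state=0)

# Keep 30% of the samples aside; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on the data the tree memorized
test_acc = model.score(X_test, y_test)     # accuracy on unseen data

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The training accuracy of the fully grown tree is perfect, while the held-out accuracy is the more honest estimate of how the model generalizes.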

SVM

from sklearn import metrics
from sklearn.svm import SVC

# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)

# make predictions
expected = y
predicted = model.predict(X)

# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Likewise, the support vectors were learned from the very data used for testing, so this run also shows no errors.

Tuning Algorithm Parameters

This is what is usually called parameter tuning: choosing, for example, K in KNN or λ in an SVM.

Grid Search

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])

# create and fit a ridge regression model, testing each alpha
model = Ridge()
# GridSearchCV implements fit, predict, predict_proba and related methods, and uses
# cross-validation to search the parameter space for the best parameters.
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)

# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
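
The same search applies to the K of KNN mentioned at the start of this section. This is a minimal sketch on synthetic data; the candidate values for n_neighbors and the choice of 5 folds are assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values of K to try; every value is scored by 5-fold cross-validation.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the K with the highest mean cross-validated accuracy
print(grid.best_score_)   # that mean accuracy
```

Because each candidate is scored on held-out folds, best_score_ is not inflated by testing on the training data.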

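Random Search

Sample parameter values at random from a given range, evaluate the algorithm with each sampled value, and keep the best one.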

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20

# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}

# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)

# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
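
To see what the search actually samples, the distribution object can be queried directly. The sketch below draws a few values from uniform(), as used for alpha above, and, as an assumed alternative not used in the original, from a log-uniform distribution, which spreads the draws more evenly across orders of magnitude.

```python
from scipy.stats import loguniform, uniform

# uniform() with the default loc=0, scale=1 draws values from the interval [0, 1].
alpha_uniform = uniform().rvs(size=5, random_state=0)

# Assumed alternative: log-uniform draws between 1e-4 and 1.
alpha_log = loguniform(1e-4, 1.0).rvs(size=5, random_state=0)

print(alpha_uniform)
print(alpha_log)
```

Either object could be passed as the value for 'alpha' in param_distributions, since RandomizedSearchCV only needs an object with an rvs method.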

These experiments follow a simple introductory tutorial. The training and test sets overlap, which remains to be improved.
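
One possible improvement is k-fold cross-validation, where every sample is used for testing exactly once and never by a model that was trained on it. The sketch below compares three of the classifiers from this post on synthetic data; the dataset, the choice of models, and cv=5 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "naive_bayes": GaussianNB(),
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    # Five folds: train on four of them, score on the remaining one, then rotate.
    fold_scores = cross_val_score(model, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print("%s: %.3f" % (name, scores[name]))
```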

References:
http://www.jianshu.com/p/1c6efdbce226
http://www.cnblogs.com/zhaokui/archive/2016/01/08/5112287.html
