sklearn - Basic Usage

Data cleaning can be handled with pandas; when it comes to prediction you reach for the famous sklearn, which contains many fundamental algorithms and helps a data scientist solve a lot of problems.
(a) Data normalization

from sklearn import preprocessing
# normalize the data attributes
normalized_X = preprocessing.normalize(X)    # rescale each sample (row) to unit norm
# standardize the data attributes
standardized_X = preprocessing.scale(X)      # zero mean, unit variance per feature

(b) Feature selection

from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()  # tree ensemble; importances reflect how much each feature reduces impurity across the trees
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)   # recursively drops the weakest feature until 3 remain
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The feature_selection module

Univariate feature selection
The idea of univariate feature selection is to compute a statistical measure for each variable independently, use that measure to judge which variables are important, and discard the unimportant ones.

The main methods in the sklearn.feature_selection module are the following:
SelectKBest and SelectPercentile are quite similar: the former keeps the top-n ranked variables, the latter keeps the top n% of variables. Which statistic is used to rank the variables has to be specified additionally.
For regression problems you can use the f_regression score; for classification problems, chi2 or f_classif.
Usage example:
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile=10)
X_new = selector.fit_transform(X, y)   # keep the top 10% of features

There are a few other methods that select variables using other common univariate statistical tests for each feature: false positive rate (SelectFpr), false discovery rate (SelectFdr), or family-wise error (SelectFwe).

The documentation says that with sparse matrices only the chi2 score can be used directly, and the others require converting to a dense matrix; in practice, however, I have found that f_classif also works with sparse input.
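
To illustrate the sparse-input point above, here is a minimal sketch using SelectKBest with the chi2 score on a scipy sparse matrix (the data below is synthetic and purely illustrative):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2
# synthetic non-negative count data; chi2 requires non-negative features
rng = np.random.RandomState(0)
X_sparse = csr_matrix(rng.poisson(1.0, size=(100, 20)))
y = rng.randint(0, 2, size=100)
selector = SelectKBest(chi2, k=5)                # keep the 5 highest-scoring features
X_reduced = selector.fit_transform(X_sparse, y)  # stays sparse, shape (100, 5)
print(X_reduced.shape)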

Recursive feature elimination
Rather than testing the value of each variable in isolation, the features are evaluated jointly. Given an external estimator that assigns weights to features (an SVM with a linear kernel, for example), RFE starts from the full feature set, fits the estimator, removes the least important feature(s), and repeats this pruning recursively until the desired number of features remains. RFECV additionally uses cross-validation error to decide how many features to keep.
Implemented by the following two classes: sklearn.feature_selection.RFE and sklearn.feature_selection.RFECV.
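
RFECV is mentioned above but not shown in the earlier snippet; a minimal sketch follows (the iris data, the linear-kernel SVC and cv=5 are illustrative choices, not from the original text):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
# a linear kernel is used so the estimator exposes coef_ for ranking features
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1, cv=5)
rfecv.fit(X, y)
print(rfecv.n_features_)   # number of features chosen by cross-validation
print(rfecv.support_)      # boolean mask of the selected features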

L1-based feature selection
The idea: linear models fitted with an L1 penalty (for example Lasso, or LogisticRegression/LinearSVC with penalty='l1') tend to produce sparse solutions, meaning that many coefficients are exactly zero or very close to zero. Those variables contribute little to the model and can be removed.
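
A minimal sketch of this idea using LinearSVC with an L1 penalty together with SelectFromModel (the iris data and the C value are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
X, y = load_iris(return_X_y=True)
# the L1 penalty drives many coefficients to exactly zero
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
selector = SelectFromModel(lsvc, prefit=True)
X_new = selector.transform(X)   # keeps only features with non-zero coefficients
print(X_new.shape)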

Tree-based feature selection
Select features based on the feature importances produced by a decision-tree (or tree-ensemble) algorithm.
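
A minimal sketch that combines the ExtraTreesClassifier importances shown earlier with SelectFromModel (the iris data and the estimator settings are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)              # impurity-based importances
selector = SelectFromModel(forest, prefit=True)
X_new = selector.transform(X)                   # keeps features above the mean importance
print(X_new.shape)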

(c) Algorithm development

As I have said, Scikit-Learn has implemented all the basic algorithms of machine learning. Let’s take a look at some of them.

Logistic Regression

Most often used for binary classification tasks, though multiclass classification (via the so-called one-vs-rest, or one-vs-all, method) is also supported. An advantage of this algorithm is that it outputs, for each object, the probability of belonging to each class.

from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
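
Since the text notes that logistic regression outputs class-membership probabilities, a short illustrative addition (it assumes the fitted model from the snippet above):

# per-class probabilities for each sample; columns follow model.classes_
probabilities = model.predict_proba(X)
print(probabilities[:5])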

Naive Bayes

Naive Bayes is also one of the best-known machine learning algorithms; its main task is to estimate the density of the data distribution of the training sample. This method often provides good quality in multiclass classification problems.

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

k-Nearest Neighbours

The kNN (k-Nearest Neighbors) method is often used as a component of a more complex classification algorithm; for instance, its estimate can be used as a feature of an object. Sometimes a simple kNN provides great quality on well-chosen features. When the parameters (mostly the distance metric) are chosen well, the algorithm also often gives good quality in regression problems.

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
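
To illustrate the remark about regression and the choice of distance metric, a minimal sketch with KNeighborsRegressor (the diabetes dataset, n_neighbors and metric values are illustrative assumptions):

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
X_reg, y_reg = load_diabetes(return_X_y=True)
# the number of neighbours and the distance metric are the main parameters to tune
model = KNeighborsRegressor(n_neighbors=10, metric="manhattan")
model.fit(X_reg, y_reg)
print(mean_squared_error(y_reg, model.predict(X_reg)))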

Decision Trees

Classification and Regression Trees (CART) are often used in problems where objects have categorical features, and they work for both regression and classification. Trees are also very well suited to multiclass classification.

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Support Vector Machines

SVM (Support Vector Machines) is one of the most popular machine learning algorithms, used mainly for classification problems. Like logistic regression, SVM supports multiclass classification via the one-vs-all (one-vs-rest) method.

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

In addition to classification and regression algorithms, Scikit-Learn includes a huge number of more complex algorithms, including clustering, and also implements techniques for building compositions of algorithms, including Bagging and Boosting.
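
As a quick illustration of such compositions, a minimal sketch using BaggingClassifier and GradientBoostingClassifier from sklearn.ensemble (the iris data and the parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
# bagging: many trees trained on bootstrap samples, predictions averaged
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)
# boosting: trees built sequentially, each one correcting its predecessors
boosting = GradientBoostingClassifier(n_estimators=100).fit(X, y)
print(bagging.score(X, y), boosting.score(X, y))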

How to Optimize Algorithm Parameters

One of the most difficult stages in creating really efficient algorithms is choosing correct parameters. It’s usually easier with experience, but one way or another, we have to do the search. Fortunately, Scikit-Learn provides many implemented functions for this purpose.

As an example, let’s take a look at the selection of the regularization parameter, in which several values are searched in turn:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV   # sklearn.grid_search in older scikit-learn versions
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

Sometimes it is more efficient to randomly select a parameter from the given range, estimate the algorithm quality for this parameter and choose the best one.

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV   # sklearn.grid_search in older scikit-learn versions
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
(d) Cross-validation
The most important function in sklearn's cross validation module is the following:
sklearn.cross_validation.cross_val_score. It is called as scores = cross_validation.cross_val_score(clf, raw_data, raw_target, cv=5, score_func=None)
Parameters:
clf is the classifier and can be any classifier, for example a support vector machine: clf = svm.SVC(kernel='linear', C=1)
The cv parameter selects the cross-validation strategy. If cv is an integer, StratifiedKFold splitting is used when the estimator is a classifier and the target is provided; otherwise KFold splitting is used.
The return value of cross_val_score is the classification accuracy obtained on the test data for each different split of the raw data. How the score is computed can be specified with the score_func parameter (scoring in newer versions); if not specified, the estimator's own default score method is used.
The remaining parameters are not very important.
A concrete usage example of cross_val_score is shown below:
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...    clf, raw_data, raw_target, cv=5)
...
>>> scores
array([ 1.  ...,  0.96...,  0.9 ...,  0.96...,  1.        ])
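
Note that in recent scikit-learn releases the cross_validation module has been replaced by model_selection; a minimal sketch of the same call with the current API, using the iris data as raw_data/raw_target (an illustrative assumption):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
raw_data, raw_target = load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, raw_data, raw_target, cv=5)   # one accuracy per fold
print(scores)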

Besides the KFold and StratifiedKFold splitting methods just mentioned, there are many other ways to split the raw data. The other splitters are invoked slightly differently from the first two (but are used in the same way overall). The ShuffleSplit method is taken as an example below:
>>> n_samples = raw_data.shape[0]
>>> cv = cross_validation.ShuffleSplit(n_samples, n_iter=3,
...     test_size=0.3, random_state=0)

>>> cross_validation.cross_val_score(clf, raw_data, raw_target, cv=cv)
...
array([ 0.97...,  0.97...,  1.        ])

Other splitting methods include the following:
cross_validation.Bootstrap
cross_validation.LeaveOneLabelOut
cross_validation.LeaveOneOut
cross_validation.LeavePLabelOut
cross_validation.LeavePOut
cross_validation.StratifiedShuffleSplit

They are invoked in the same way as ShuffleSplit, but each has its own parameters. For the exact meaning of these methods, see a machine learning textbook.
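
As a small illustration of one of the splitters listed above, a sketch using LeaveOneOut (shown with the current model_selection module; the iris data stands in for the raw data):

from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
raw_data, raw_target = load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
loo = LeaveOneOut()                      # each sample is held out exactly once
scores = cross_val_score(clf, raw_data, raw_target, cv=loo)
print(scores.mean())                     # average accuracy over all splits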