Feature Selection and Classification Models in sklearn


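Data format:

The raw feature input file uses the libsvm format: each line has the form label index1:value1 index2:value2, a sparse-matrix representation in which only the nonzero features of a sample are listed.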
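
To make the format concrete, here is a minimal sketch that parses one libsvm-style line by hand. The function name parse_libsvm_line and the sample line are made up for illustration; in practice load_svmlight_file does this for the whole file, and this sketch ignores extras such as qid fields.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {index: value})."""
    # Drop an optional trailing comment, then split on whitespace.
    parts = line.split("#", 1)[0].split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


# Hypothetical sample: label 1, with features 3, 10 and 27 set.
label, features = parse_libsvm_line("1 3:1 10:0.5 27:1")
print(label, features)
```

Every feature index that does not appear on the line is implicitly zero, which is why the format works well for high-dimensional sparse data.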

 

sklearn ships with many feature selection algorithms. Which one to choose depends on the dataset and on the model being trained.

 

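The following is a usage example for chi2. chi2 selects features with the chi-squared test; it requires non-negative feature values and is a good fit for 0/1 (binary) features and sparse matrices.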

from joblib import Memory  # sklearn.externals.joblib has been removed from scikit-learn
from sklearn.datasets import load_svmlight_file, dump_svmlight_file
from sklearn.feature_selection import SelectKBest, chi2

# Cache the parsed file on disk so repeated runs skip the slow load
mem = Memory("./mycache")

@mem.cache
def get_data():
    data = load_svmlight_file("labeled_fea.txt")
    return data[0], data[1]

X, y = get_data()

# Keep the 10000 features with the highest chi-squared scores
data = SelectKBest(chi2, k=10000).fit_transform(X, y)

# Write the reduced matrix back out in libsvm format with 1-based indices
dump_svmlight_file(data, y, "labeled_chi2_fea.txt", zero_based=False)
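
To give a feel for what the chi-squared score measures, here is a rough pure-Python sketch. It assumes the usual observed-versus-expected formulation: for each feature, compare the per-class sum of the feature with what that sum would be if the feature were independent of the class. The function name chi2_scores and the toy data are made up for illustration; sklearn's own implementation works on sparse matrices and may differ in detail.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature for non-negative X (list of rows) and labels y."""
    classes = sorted(set(y))
    n_samples = len(y)
    scores = []
    for j in range(len(X[0])):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Observed: how much of feature j falls into class c.
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected under independence: class share times the feature total.
            class_prob = sum(1 for label in y if label == c) / n_samples
            expected = class_prob * feature_total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# Toy data: features 0 and 1 track the label, feature 2 does not.
X_toy = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y_toy = [1, 1, 0, 0]
scores = chi2_scores(X_toy, y_toy)
# Indices of the k=2 best features, highest score first.
top2 = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:2]
print(scores, top2)
```

A feature spread evenly across classes scores near zero and is dropped first; SelectKBest keeps the k features with the highest scores.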

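sklearn also offers many classification models, all behind a uniform interface, which makes them easy to use and to swap.

Before classification you can skip feature selection entirely, run feature selection as a separate step first, or use a pipeline to integrate feature selection and classification into one estimator.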


from joblib import Memory  # sklearn.externals.joblib has been removed from scikit-learn
from sklearn.datasets import load_svmlight_file
mem = Memory("./mycache")
@mem.cache
def get_data():
    data = load_svmlight_file("labeled_fea.txt")
    return data[0], data[1]

X, y = get_data()

train_X = X[0:800000]
train_y = y[0:800000]
test_X = X[800000:]
test_y = y[800000:]
print(train_X.shape)
print(test_X.shape)

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import Perceptron
from sklearn.neighbors import NearestCentroid
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from time import time

# Standalone feature selection: fit on the training split only, then apply it to the test split
ch2 = SelectKBest(chi2, k=10000)
train_X = ch2.fit_transform(train_X, train_y)
test_X = ch2.transform(test_X)

# Train the given classifier, then evaluate it on the test split
def benchmark(clf):
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(train_X, train_y)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)
    t0 = time()
    pred = clf.predict(test_X)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)
    score = metrics.accuracy_score(test_y, pred)
    print("accuracy:   %0.3f" % score)
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time

clf = RandomForestClassifier(n_estimators=100)
#clf = RidgeClassifier(tol=1e-2, solver="lsqr")
#clf = Perceptron(max_iter=50)  # n_iter was replaced by max_iter in newer scikit-learn
#clf = LinearSVC()
#clf = GradientBoostingClassifier() 

# n_iter was replaced by max_iter in newer scikit-learn
#clf = SGDClassifier(alpha=.0001, max_iter=50, penalty="l1")
#clf = SGDClassifier(alpha=.0001, max_iter=50, penalty="elasticnet")

#clf = NearestCentroid()
#clf = MultinomialNB(alpha=.01)
#clf = BernoulliNB(alpha=.01)

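# A pipeline combines feature selection and classification in one estimator.
# LinearSVC no longer exposes transform(), so wrap it in SelectFromModel
# (needs: from sklearn.feature_selection import SelectFromModel).
#clf = Pipeline([('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))), ('classification', LinearSVC())])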

benchmark(clf)
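
As a small, self-contained illustration of the pipeline approach, the sketch below chains SelectKBest(chi2) with BernoulliNB on synthetic data. The data, the step names and the choice of k=5 are assumptions made for this example, not values from the experiment above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Synthetic binary data: feature 0 copies the label, the other 20 are noise.
rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)
noise = rng.randint(0, 2, size=(200, 20))
X = csr_matrix(np.hstack([y.reshape(-1, 1), noise]))

pipe = Pipeline([
    ("feature_selection", SelectKBest(chi2, k=5)),
    ("classification", BernoulliNB(alpha=0.01)),
])

# fit() runs the selector and the classifier in order on the training rows;
# score() applies the same fitted selector to the held-out rows before predicting.
pipe.fit(X[:150], y[:150])
accuracy = pipe.score(X[150:], y[150:])
selected = pipe.named_steps["feature_selection"].get_support(indices=True)
print(accuracy, selected)
```

Because the selector is fitted inside the pipeline, it only ever sees the training rows, and the whole chain can be passed around as a single estimator.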

 

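Note that the program above runs training and prediction in the same process. In real applications the two are separate, so you need Python's object serialization: after each training run, serialize the model object to save its state; at prediction time, deserialize it to restore that state. A separately fitted feature selector has to be saved and restored in the same way as the classifier.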
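
One way to do this is with the standard pickle module, sketched below on toy data. The file name model.pkl and the data are placeholders for illustration; joblib's dump and load are a common alternative for models holding large arrays.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary data (hypothetical); stands in for the selected training features.
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical file name

# Training program: fit, then serialize the fitted model to disk.
clf = BernoulliNB(alpha=0.01).fit(X, y)
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Prediction program: deserialize and predict without retraining.
with open(model_path, "rb") as f:
    restored = pickle.load(f)
predictions = restored.predict(X)
print(predictions)
```

Only unpickle files you trust, and load the model with the same scikit-learn version that produced it, since pickled estimators are not guaranteed to work across versions.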

 

 

 

 

References:

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html

http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection

http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py



Author: linger

Original link: http://blog.csdn.net/lingerlanlan/article/details/47960127




