Python
Python machine learning library: scikit-learn
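2.1 Features:
Simple and efficient tools for data mining and machine learning analysis
Accessible to all users and highly reusable for different needs
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable: BSD license
2.2 Problem domains covered: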
Classification, regression, clustering, dimensionality reduction
Model selection, preprocessing
Using scikit-learn
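Installing scikit-learn: pip, easy_install, Windows installer
Installing the required packages: NumPy, SciPy, and matplotlib; Anaconda can be used instead (it bundles NumPy, SciPy, and other common scientific computing packages)
Installation notes: Python interpreter version (2.7 or 3.4?), 32-bit or 64-bit system
Example: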
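The points in the installation notes (interpreter version, 32-bit or 64-bit, required packages) can be checked from Python itself. The following is a minimal sketch and not part of the original notes; the helper name describe_environment is hypothetical, and it assumes each package exposes a __version__ attribute, which NumPy, SciPy, matplotlib, and scikit-learn do.

```python
import platform
import struct
from importlib import import_module


def describe_environment(packages=("numpy", "scipy", "matplotlib", "sklearn")):
    """Return interpreter details and the version of each importable package.

    Hypothetical helper written for illustration only.
    """
    info = {
        "python": platform.python_version(),
        # Size of a C pointer in bits: 32 or 64 depending on the interpreter build.
        "bits": struct.calcsize("P") * 8,
    }
    for name in packages:
        try:
            info[name] = import_module(name).__version__
        except ImportError:
            info[name] = None  # package is not installed
    return info


for key, value in describe_environment().items():
    print(key, value)
```

A value of None for a package means it still has to be installed. The decision tree example itself follows.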
import csv

import numpy as np
from sklearn import tree, preprocessing
from sklearn.feature_extraction import DictVectorizer
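
# Read the CSV file; the first row holds the column names.
with open(r'C://AllElectronics.csv', newline='') as allElectronicsData:
    reader = csv.reader(allElectronicsData)
    headers = next(reader)
    print(headers)

    featureList = []
    labelList = []
    for row in reader:
        # The last column is the class label.
        labelList.append(row[-1])
        # The columns between the first and the last are features; column 0 is skipped.
        rowDict = {}
        for i in range(1, len(row) - 1):
            rowDict[headers[i]] = row[i]
        featureList.append(rowDict)

print(featureList)

# One-hot encode the categorical feature values.
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()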
print("dummyX:" + str(dummyX))
print(vec.get_feature_names_out())
print("labelList:" + str(labelList))

# Binarize the class labels.
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:" + str(dummyY))
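
# Train a decision tree that chooses splits by information gain (entropy).
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(dummyX, dummyY)
print("clf:" + str(clf))

# Export the fitted tree in Graphviz DOT format.
with open("allElectronicInformationGainOri.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f, feature_names=list(vec.get_feature_names_out()))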
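
# Take the first encoded row and modify a copy of it to build a new sample.
oneRowX = dummyX[0, :]
print("oneRowX:" + str(oneRowX))
newRowX = oneRowX.copy()  # copy so that dummyX itself is not modified
newRowX[0] = 1
newRowX[1] = 0
print("newRowX:" + str(newRowX))

# predict() expects a 2D array of shape (n_samples, n_features).
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY:" + str(predictedY))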