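MALLET ships shell scripts in its bin/ directory. This post instead shows how to run the classification programs from MyEclipse by supplying the same command-line arguments.
I. Running the Text2Vectors class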
In the Run configuration, open the Arguments tab and enter the following under Program arguments: --input e:/mallet/20_newsgroups/talk.politics.* --skip-header --output e:/mallet/news2.vectors
--input gives the location of the input files.
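--skip-header starts the analysis of each document after its header, that is, after the first blank line (two consecutive newlines).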
--output gives the name and location of the output file.
Output:
Labels =
e:/mallet/20_newsgroups/talk.politics.guns
e:/mallet/20_newsgroups/talk.politics.mideast
e:/mallet/20_newsgroups/talk.politics.misc
These three labels are the directories matched by e:/mallet/20_newsgroups/talk.politics.*
The file news2.vectors is created under e:/mallet/.
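To make the wildcard concrete, here is a minimal Python sketch of how a pattern such as talk.politics.* picks out the label directories. The directory list is a hypothetical sample of 20_newsgroups folder names, and fnmatch only imitates the matching; it is not how MALLET itself resolves the argument.

```python
from fnmatch import fnmatch

# Hypothetical sample of sub-directories found under 20_newsgroups.
directories = [
    "sci.space",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
]

def matching_labels(names, pattern):
    """Return the directory names that match the wildcard pattern, sorted."""
    return sorted(name for name in names if fnmatch(name, pattern))

labels = matching_labels(directories, "talk.politics.*")
print(labels)
```

Only the three talk.politics directories survive the filter, which is why the vectors file ends up with exactly three labels.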
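II. Running the Vectors2Classify class
Under Program arguments in the Run configuration's Arguments tab, enter --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3
--trainer selects the training algorithm; this example uses NaiveBayes.
--training-portion 0.6 uses 60% of the data for training and the remaining 40% for testing.
--num-trials 3 runs the train/test procedure three times.
Output: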
-------------------- Trial 0 --------------------
Trial 0 Training NaiveBayesTrainer with 1800 instances
Trial 0 Training NaiveBayesTrainer finished
Trial 0 Trainer NaiveBayesTrainer training data accuracy= 0.9511111111111111
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted
accuracy=0.8958333333333334
       label   0   1   2  |total
  0     guns 395   2  18  |415
  1  mideast   2 360  33  |395
  2     misc  52  18 320  |390
Trial 0 Trainer NaiveBayesTrainer test data accuracy= 0.8958333333333334
-------------------- Trial 1 --------------------
Trial 1 Training NaiveBayesTrainer with 1800 instances
Trial 1 Training NaiveBayesTrainer finished
Trial 1 Trainer NaiveBayesTrainer training data accuracy= 0.9522222222222222
Trial 1 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted
accuracy=0.8891666666666667
       label   0   1   2  |total
  0     guns 392   3  18  |413
  1  mideast   5 350  30  |385
  2     misc  58  19 325  |402
Trial 1 Trainer NaiveBayesTrainer test data accuracy= 0.8891666666666667
-------------------- Trial 2 --------------------
Trial 2 Training NaiveBayesTrainer with 1800 instances
Trial 2 Training NaiveBayesTrainer finished
Trial 2 Trainer NaiveBayesTrainer training data accuracy= 0.9533333333333334
Trial 2 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted
accuracy=0.895
       label   0   1   2  |total
  0     guns 392   .  21  |413
  1  mideast  12 383  30  |425
  2     misc  44  19 299  |362
Trial 2 Trainer NaiveBayesTrainer test data accuracy= 0.895
NaiveBayesTrainer
Summary. train accuracy mean = 0.9522222222222222 stddev = 9.072184232530348E-4 stderr = 5.237828008789275E-4
Summary. test accuracy mean = 0.8933333333333334 stddev = 0.002965855070008714 stderr = 0.0017123372230469474
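The summary figures can be checked by hand. The Python sketch below recomputes each trial's test accuracy from its confusion matrix (diagonal divided by total) and then the mean, standard deviation, and standard error over the three trials. Using the population standard deviation and stddev / sqrt(n) for the standard error is my assumption; it is the choice that reproduces the values printed above.

```python
from math import sqrt
from statistics import mean, pstdev

# Test-data confusion matrices copied from the three trials above
# (rows = true label, columns = predicted label; "." read as 0).
trials = [
    [[395, 2, 18], [2, 360, 33], [52, 18, 320]],
    [[392, 3, 18], [5, 350, 30], [58, 19, 325]],
    [[392, 0, 21], [12, 383, 30], [44, 19, 299]],
]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all test instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

accuracies = [accuracy(m) for m in trials]
acc_mean = mean(accuracies)
acc_stddev = pstdev(accuracies)                  # population standard deviation
acc_stderr = acc_stddev / sqrt(len(accuracies))  # standard error of the mean

print(accuracies)
print(acc_mean, acc_stddev, acc_stderr)
```

Each matrix sums to 1200 test instances, the 40% left over after 1800 instances were taken for training.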
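The arguments --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 are equivalent to
--input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 --report test:confusion train:accuracy test:accuracy
which matches the output above: a test-data confusion matrix plus the training and test accuracy.
The --report option can output confusion, accuracy, f1, and raw values; select whichever you need.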
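For readers unfamiliar with f1, the following Python sketch derives precision, recall, and F1 for each label from the Trial 0 confusion matrix. It uses the standard textbook definitions and has not been checked against what MALLET's own f1 report prints.

```python
# Trial 0 test-data confusion matrix from the output above
# (rows = true label, columns = predicted label).
labels = ["guns", "mideast", "misc"]
matrix = [
    [395, 2, 18],
    [2, 360, 33],
    [52, 18, 320],
]

def f1_per_label(matrix, labels):
    """Compute (precision, recall, F1) for each label."""
    scores = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                     # row sum
        precision = true_positive / predicted_as_k
        recall = true_positive / actually_k
        f1 = 2 * precision * recall / (precision + recall)
        scores[label] = (precision, recall, f1)
    return scores

scores = f1_per_label(matrix, labels)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```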
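Discussion 1: Choosing the training and test sets
--training-portion 0.6 randomly selects 60% of the data as the training set and uses the rest as the test set.
The default value of --training-portion is 1.0, meaning all the data is used for training and none is held out for testing.
A further option, --validation-portion, reserves part of the data as a validation set.
For example: --training-portion 0.6 --validation-portion 0.1
uses 60% for training, 10% for validation, and the remaining 30% for testing.
Although MALLET's classification algorithms accept a validation set, at present none of them makes good use of it.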
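The three-way split can be pictured with a short Python sketch. It shuffles instance indices and cuts them by the two portions; the function name, the seed, and the use of round() are my own choices, and the 3000 instances are inferred from the 1800 training instances at 60% in the output above. MALLET's own sampling may differ in detail.

```python
import random

def split_indices(n, training_portion, validation_portion=0.0, seed=0):
    """Shuffle 0..n-1 and cut it into training, validation and test index lists."""
    if training_portion + validation_portion > 1.0:
        raise ValueError("portions must not add up to more than 1.0")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * training_portion)
    n_valid = round(n * validation_portion)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever is left over
    return train, valid, test

train, valid, test = split_indices(3000, 0.6, 0.1)
print(len(train), len(valid), len(test))
```

With a training portion of 1.0 and no validation portion, the test list comes back empty, which mirrors the default behaviour described above.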
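Discussion 2: Separate training and test data
When the training and test data are already in separate files, the syntax is vectors2classify --training-file train.vectors --testing-file test.vectors
An existing vectors file can also be split in two, with the syntax vectors2vectors --input news2.vectors --training-portion .6 --training-file train.vectors --testing-file test.vectors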
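Discussion 3: Classification algorithms
MALLET's default classification algorithm is Naive Bayes, but Maximum Entropy, Decision Tree, and Winnow are also available. The syntax for choosing an algorithm is vectors2classify --input news2.vectors --trainer MaxEnt --training-portion 0.7, which selects Maximum Entropy.
Several algorithms can be selected at once, for example: vectors2classify --input news2.vectors --trainer NaiveBayes --trainer MaxEnt --training-portion 0.7
In that case each algorithm is trained and tested separately.
A trainer can also be given parameters, for example: vectors2classify --input news2.vectors --trainer "new MaxEntTrainer(0.01)" --training-portion 0.6, which selects Maximum Entropy with a Gaussian prior variance of 0.01.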
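For intuition about what the 0.01 controls: in the usual formulation of maximum entropy with a Gaussian prior, training maximizes the log-likelihood minus a penalty of sum(w^2) / (2 * variance) on the feature weights. The Python sketch below assumes that textbook form and uses hypothetical weights; it is not taken from MALLET's source.

```python
def gaussian_prior_penalty(weights, variance):
    """Penalty sum(w^2) / (2 * variance) subtracted from the log-likelihood."""
    return sum(w * w for w in weights) / (2.0 * variance)

weights = [0.5, -1.0, 0.25]  # hypothetical feature weights
for variance in (0.01, 1.0):
    print(variance, gaussian_prior_penalty(weights, variance))
```

The same weights cost 100 times more at variance 0.01 than at variance 1.0, so a small variance pushes the weights strongly toward zero.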
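III. Running the Vectors2Info class to display various information
1. Word information
The arguments --input e:/mallet/news2.vectors --print-infogain 10 print the ten words in news2.vectors with the highest information gain. The result is: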
0 israel
1 israeli
2 arab
3 turkish
4 gun
5 turks
6 jews
7 armenia
8 muslim
9 armenian
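As a reminder of what is being ranked, here is a minimal Python sketch of information gain for one word, using the common presence/absence definition: the entropy of the class distribution minus the expected entropy after splitting the documents on whether they contain the word. The counts are hypothetical, and MALLET's exact computation may differ.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(with_word, without_word):
    """
    with_word[k]    = number of class-k documents that contain the word
    without_word[k] = number of class-k documents that do not
    """
    n_with, n_without = sum(with_word), sum(without_word)
    n = n_with + n_without
    class_counts = [a + b for a, b in zip(with_word, without_word)]
    conditional = 0.0
    if n_with:
        conditional += n_with / n * entropy(with_word)
    if n_without:
        conditional += n_without / n * entropy(without_word)
    return entropy(class_counts) - conditional

# Hypothetical counts for two classes of 100 documents each.
print(information_gain([90, 5], [10, 95]))   # word concentrated in one class
print(information_gain([50, 50], [50, 50]))  # word spread evenly across classes
```

A word such as israel, which occurs mostly in one newsgroup, behaves like the first case and ranks high; a word spread evenly across the classes scores close to zero.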
2. Class label information
The arguments --input e:/mallet/news2.vectors --print-labels print the class labels in news2.vectors. The result is:
guns
mideast
misc
3. Word/document matrix
The arguments --input e:/mallet/news2.vectors --print-matrix siw print the word/document matrix of news2.vectors. The result is:
file:/e:/mallet/20_newsgroups/talk.politics.guns/55057 guns in 5 writes 1 you 2 ...
file:/e:/mallet/20_newsgroups/talk.politics.guns/54866 guns in 1 c 2 got 1 was 1 tear 1 gas 1 the 34 davidians 6 their 3 or 3 so 1 children 1 to ...
In --print-matrix siw, the letters s, i, and w stand for sparse, integer, and word. Each of the three positions picks one option from the corresponding group below.
Group 1: print entries for all words in the vocabulary, or only the words that actually occur in the document.
  a   all
  s   sparse (default)
Group 2: print word counts as integers or as binary presence/absence indicators.
  b   binary
  i   integer (default)
Group 3: how to indicate the word itself.
  n   integer word index
  w   word string
  c   combination of integer word index and word string (default)
  e   empty; print nothing to indicate the identity of the word
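To see how the three letters interact, here is a small Python sketch that renders one document's word counts under a given three-letter code. Only the siw layout ("word count" pairs) is taken from the output shown above; the vocabulary, the function name, and especially the "index:word" layout for option c are made up for illustration and are not MALLET's actual output format.

```python
def render(doc_counts, vocabulary, code="sic"):
    """Render one document's word counts according to a three-letter code."""
    coverage, value, identity = code
    parts = []
    for index, word in enumerate(vocabulary):
        count = doc_counts.get(word, 0)
        if coverage == "s" and count == 0:
            continue  # sparse: skip words absent from the document
        shown = (1 if count else 0) if value == "b" else count
        if identity == "n":
            name = str(index)
        elif identity == "w":
            name = word
        elif identity == "c":
            name = f"{index}:{word}"  # made-up layout for the combination
        else:  # "e": no word identity at all
            name = ""
        parts.append(f"{name} {shown}".strip())
    return " ".join(parts)

vocabulary = ["in", "writes", "you", "gas"]  # hypothetical vocabulary
doc = {"in": 5, "writes": 1, "you": 2}
print(render(doc, vocabulary, "siw"))
print(render(doc, vocabulary, "abn"))
```

With siw the absent word gas is omitted, while abn lists every vocabulary index with a 0 or 1.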