MALLET Command-Line Tools

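MALLET provides its command-line tools as shell scripts under the bin/ directory. This post shows how to run the same classification tools from inside MyEclipse by passing the command-line arguments to the corresponding classes.

I. Running the Text2Vectors class

In the Run configuration, under Arguments > Program arguments, enter: --input e:/mallet/20_newsgroups/talk.politics.* --skip-header --output e:/mallet/news2.vectors

--input gives the location of the input files

--skip-header skips each document's header: analysis starts after the first blank line (two consecutive newlines)

--output gives the name and location of the output file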
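The effect of --skip-header described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MALLET's code: the function name `skip_header`, the sample message, and the choice to return the text unchanged when no blank line exists are all assumptions of this sketch.

```python
def skip_header(text: str) -> str:
    """Return the part of `text` that follows the first blank line.

    If the text contains no blank line, it is returned unchanged
    (a choice made for this sketch).
    """
    # Normalise Windows line endings so that "\n\n" marks a blank line.
    normalized = text.replace("\r\n", "\n")
    _head, separator, body = normalized.partition("\n\n")
    return body if separator else normalized


# Hypothetical newsgroup-style message: header lines, blank line, body.
message = (
    "From: someone@example.com\n"
    "Subject: hypothetical post\n"
    "\n"
    "This is the body that would be analysed.\n"
)
print(skip_header(message))
```

Only the body survives, so header fields such as the sender address do not end up as features.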

Output:

Labels =
   e:/mallet/20_newsgroups/talk.politics.guns
   e:/mallet/20_newsgroups/talk.politics.mideast
   e:/mallet/20_newsgroups/talk.politics.misc

These three directories are the ones matched by e:/mallet/20_newsgroups/talk.politics.*

The file news2.vectors was created under e:/mallet/
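To see why exactly three labels appear, the wildcard can be tried against a list of directory names. The sketch below uses Python's fnmatch on a small, hand-picked subset of 20 Newsgroups directory names; it says nothing about which component performs the expansion in the actual run.

```python
from fnmatch import fnmatch

# A hand-picked subset of 20 Newsgroups directory names, for illustration.
directories = [
    "sci.space",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
]

pattern = "talk.politics.*"

# Keep only the names that match the wildcard pattern.
matched = [name for name in directories if fnmatch(name, pattern)]
print(matched)
```

Each matched directory becomes one class label, which is why the Labels list above has three entries.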

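II. Running the Vectors2Classify class

In the Run configuration, under Arguments > Program arguments, enter: --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3

--trainer selects the training algorithm; this example uses NaiveBayes

--training-portion 0.6 uses 60% of the data for training and the remaining 40% for testing

--num-trials 3 runs three trials; the row totals in the output differ from trial to trial, so each trial draws its own random split

Output: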

-------------------- Trial 0 --------------------

Trial 0 Training NaiveBayesTrainer with 1800 instances
Trial 0 Training NaiveBayesTrainer finished
Trial 0 Trainer NaiveBayesTrainer training data accuracy= 0.9511111111111111
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted accuracy=0.8958333333333334
      label   0   1   2 |total
0    guns 395   2 18 |415
1 mideast   2 360 33 |395
2    misc 52 18 320 |390

Trial 0 Trainer NaiveBayesTrainer test data accuracy= 0.8958333333333334

-------------------- Trial 1 --------------------

Trial 1 Training NaiveBayesTrainer with 1800 instances
Trial 1 Training NaiveBayesTrainer finished
Trial 1 Trainer NaiveBayesTrainer training data accuracy= 0.9522222222222222
Trial 1 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted accuracy=0.8891666666666667
      label   0   1   2 |total
0    guns 392   3 18 |413
1 mideast   5 350 30 |385
2    misc 58 19 325 |402

Trial 1 Trainer NaiveBayesTrainer test data accuracy= 0.8891666666666667

-------------------- Trial 2 --------------------

Trial 2 Training NaiveBayesTrainer with 1800 instances
Trial 2 Training NaiveBayesTrainer finished
Trial 2 Trainer NaiveBayesTrainer training data accuracy= 0.9533333333333334
Trial 2 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted accuracy=0.895
      label   0   1   2 |total
0    guns 392   . 21 |413
1 mideast 12 383 30 |425
2    misc 44 19 299 |362

Trial 2 Trainer NaiveBayesTrainer test data accuracy= 0.895

NaiveBayesTrainer
Summary. train accuracy mean = 0.9522222222222222 stddev = 9.072184232530348E-4 stderr = 5.237828008789275E-4
Summary. test accuracy mean = 0.8933333333333334 stddev = 0.002965855070008714 stderr = 0.0017123372230469474
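The accuracy and summary lines can be checked by hand from the confusion matrices. The following Python sketch recomputes them from the numbers printed above. Two points are assumptions read off the output rather than documented behaviour: the "." in Trial 2 is treated as 0 (the row total 413 = 392 + 21 supports this), and the standard deviation is the population form (dividing by the number of trials), which is the form that reproduces the printed values.

```python
import math
import statistics

# Test-data confusion matrices copied from the three trials above
# (rows = true label, columns = predicted label; "." read as 0).
trials = [
    [[395, 2, 18], [2, 360, 33], [52, 18, 320]],
    [[392, 3, 18], [5, 350, 30], [58, 19, 325]],
    [[392, 0, 21], [12, 383, 30], [44, 19, 299]],
]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


accuracies = [accuracy(m) for m in trials]
mean = statistics.mean(accuracies)
stddev = statistics.pstdev(accuracies)          # population standard deviation
stderr = stddev / math.sqrt(len(accuracies))    # standard error of the mean

print(accuracies)
print(mean, stddev, stderr)
```

For Trial 0, for instance, the diagonal sums to 395 + 360 + 320 = 1075 out of 1200 test instances, which gives the reported 0.8958.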

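The arguments --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 are equivalent to

--input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 --report test:confusion train:accuracy test:accuracy

since the output above contains the test confusion matrix, the training accuracy, and the test accuracy. The --report option can print confusion, accuracy, f1, and raw values; select whichever are needed.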
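For readers unfamiliar with the f1 value, here is a minimal Python sketch of how per-class precision, recall, and F1 follow from a confusion matrix, using the Trial 0 test matrix printed earlier. The helper name `per_class_scores` is hypothetical, and the sketch does not reproduce how MALLET formats its own f1 report.

```python
labels = ["guns", "mideast", "misc"]

# Trial 0 test confusion matrix (rows = true label, columns = predicted).
matrix = [[395, 2, 18], [2, 360, 33], [52, 18, 320]]


def per_class_scores(matrix, k):
    """Return (precision, recall, f1) for class index k."""
    true_positives = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)   # column total
    actually_k = sum(matrix[k])                      # row total
    precision = true_positives / predicted_as_k if predicted_as_k else 0.0
    recall = true_positives / actually_k if actually_k else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k, label in enumerate(labels):
    precision, recall, f1 = per_class_scores(matrix, k)
    print(f"{label}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Unlike overall accuracy, these scores show which class is hardest: in this matrix most errors involve the misc class.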

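Discussion 1: Choosing the training and test sets

--training-portion 0.6 randomly selects 60% of the instances as the training set and uses the rest as the test set

The default value of --training-portion is 1.0, meaning all the data is used for training and none is held out for testing

There is also a --validation-portion option, which reserves part of the data as a validation set

For example: --training-portion 0.6 --validation-portion 0.1

means 60% for training, 10% for validation, and the remaining 30% for testing.

Although a validation set can be specified for MALLET's classifiers, none of the current algorithms makes much use of it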
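The split arithmetic can be sketched as follows. This is an illustrative Python version of a random three-way split, not MALLET's implementation: the function name `split_instances`, the rounding rule, and the seeding are assumptions. The figure of 3000 instances is inferred from the output above (1800 training instances at a portion of 0.6, plus 1200 test instances).

```python
import random


def split_instances(instances, training_portion, validation_portion=0.0, seed=0):
    """Shuffle the instances and cut them into (train, validation, test)."""
    if training_portion + validation_portion > 1.0 + 1e-9:
        raise ValueError("portions must not sum to more than 1.0")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable example
    n_train = round(training_portion * len(shuffled))
    n_valid = round(validation_portion * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever is left over
    return train, valid, test


instances = range(3000)  # stand-ins for the 3000 documents
train, valid, test = split_instances(instances, 0.6, 0.1)
print(len(train), len(valid), len(test))  # 1800 300 900
```

With the default training portion of 1.0 and no validation portion, the test list comes out empty, which matches the note that nothing is left for testing.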

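Discussion 2: Separate data files

When the training and test data are stored in separate files, the syntax is vectors2classify --training-file train.vectors --testing-file test.vectors
An existing vectors file can also be split into two files, with the syntax vectors2vectors --input news2.vectors --training-portion .6 --training-file train.vectors --testing-file test.vectors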

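Discussion 3: Classification algorithms

MALLET's default classification algorithm is Naive Bayes, but Maximum Entropy, Decision Tree, and Winnow are also available. An algorithm is selected with, for example, vectors2classify --input news2.vectors --trainer MaxEnt --training-portion 0.7, which classifies with Maximum Entropy

Several algorithms can be selected at once, for example: vectors2classify --input news2.vectors --trainer NaiveBayes --trainer MaxEnt --training-portion 0.7
The two algorithms are then trained and tested separately

A trainer can also be given constructor arguments, for example: vectors2classify --input news2.vectors --trainer "new MaxEntTrainer(0.01)" --training-portion 0.6, which selects Maximum Entropy with a Gaussian prior variance of 0.01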

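III. Running the Vectors2Info class to display information

1. Word information

With the arguments --input e:/mallet/news2.vectors --print-infogain 10, the ten words with the highest information gain in news2.vectors are printed: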

0 israel
1 israeli
2 arab
3 turkish
4 gun
5 turks
6 jews
7 armenia
8 muslim
9 armenian
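Information gain measures how much knowing whether a word occurs in a document reduces uncertainty about the document's class. A minimal Python sketch of that idea follows. It uses word presence or absence and a tiny made-up corpus; both the corpus and the function names are hypothetical, and MALLET's own computation may differ in detail.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(documents, word):
    """H(class) minus H(class | word present or absent).

    `documents` is a list of (label, set_of_words) pairs.
    """
    all_labels = [label for label, _ in documents]
    with_word = [label for label, words in documents if word in words]
    without_word = [label for label, words in documents if word not in words]
    n = len(documents)
    conditional = (len(with_word) / n) * entropy(with_word) \
        + (len(without_word) / n) * entropy(without_word)
    return entropy(all_labels) - conditional


# Hypothetical toy corpus, not taken from 20 Newsgroups.
documents = [
    ("mideast", {"israel", "peace", "the"}),
    ("mideast", {"israel", "arab", "the"}),
    ("guns", {"gun", "law", "the"}),
    ("guns", {"gun", "rights", "the"}),
]

vocabulary = sorted(set().union(*(words for _, words in documents)))
ranked = sorted(vocabulary,
                key=lambda w: information_gain(documents, w),
                reverse=True)
for word in ranked[:3]:
    print(word, round(information_gain(documents, word), 4))
```

In the toy corpus a word that appears in every document carries no information about the class, while a word that appears in exactly one class carries the most, which is consistent with topical words such as israel and gun leading the list above.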

2. Class label information

With the arguments --input e:/mallet/news2.vectors --print-labels, the class labels in news2.vectors are printed:

guns
mideast
misc

3. Word/document matrix

With the arguments --input e:/mallet/news2.vectors --print-matrix siw, the word/document matrix of news2.vectors is printed:

file:/e:/mallet/20_newsgroups/talk.politics.guns/55057 guns in 5 writes 1 you 2 ...
file:/e:/mallet/20_newsgroups/talk.politics.guns/54866 guns in 1 c 2 got 1 was 1 tear 1 gas 1 the 34 davidians 6 their 3 or 3 so 1 children 1 to ...

In --print-matrix siw, the letters s, i, and w stand for sparse, integer, and word string. The three groups of options are described below

Print entries for all words in the vocabulary, or just print the words that actually occur in the document.
`a`	all
`s`	sparse, (default)
Print word counts as integers or as binary presence/absence indicators.
`b`	binary
`i`	integer, (default)
How to indicate the word itself.
`n`	integer word index
`w`	word string
`c`	combination of integer word index and word string, (default)
`e`	empty, don't print anything to indicate the identity of the word
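The three option groups combine into a three-letter string such as siw. The Python sketch below shows how such a string could drive the rendering of one document's counts. It is illustrative only: the function name `format_document`, the sample vocabulary, and especially the "index:word" form used for `c` are assumptions of this sketch, not MALLET's actual output format.

```python
def format_document(counts, vocabulary, flags="sic"):
    """Render one document's word counts according to a three-letter string.

    flags[0]: 'a' = every vocabulary word, 's' = only words that occur
    flags[1]: 'b' = binary presence/absence, 'i' = integer count
    flags[2]: 'n' = word index, 'w' = word string,
              'c' = index and word, 'e' = nothing
    """
    coverage, value_kind, word_kind = flags
    if coverage not in "as" or value_kind not in "bi" or word_kind not in "nwce":
        raise ValueError(f"unsupported flags: {flags!r}")
    parts = []
    for index, word in enumerate(vocabulary):
        count = counts.get(word, 0)
        if coverage == "s" and count == 0:
            continue  # sparse: skip words absent from the document
        value = (1 if count else 0) if value_kind == "b" else count
        if word_kind == "n":
            name = str(index)
        elif word_kind == "w":
            name = word
        elif word_kind == "c":
            name = f"{index}:{word}"  # separator is this sketch's own choice
        else:
            name = ""
        parts.append(f"{name} {value}".strip())
    return " ".join(parts)


# Small hypothetical vocabulary; the counts echo the sample output above.
vocabulary = ["gas", "tear", "the", "was"]
counts = {"tear": 1, "gas": 1, "the": 34}

print(format_document(counts, vocabulary, "siw"))  # gas 1 tear 1 the 34
print(format_document(counts, vocabulary, "abn"))  # 0 1 1 1 2 1 3 0
```

The sparse form grows with the document length, whereas the `a` form prints one entry per vocabulary word and quickly becomes very long on a real corpus.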
