MALLET Machine Learning Toolkit: Getting-Started Tests

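Mallet is aimed mainly at text classification, so its design leans toward that use case.

I need its maximum entropy and naive Bayes algorithms, so I had to study it.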

Home page: http://mallet.cs.umass.edu/index.php

Reference articles: http://mallet.cs.umass.edu/classifier-devel.php

http://mallet.cs.umass.edu/import-devel.php

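I searched online and found little material, so I had to work through the official guides and the API docs myself, and then read the source code.

My goal is to import MALLET into my own Java project (in Eclipse) and then use some of its algorithms, naive Bayes and maximum entropy, for text classification.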

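Importing into the project:

Download link: http://mallet.cs.umass.edu/download.php (the latest version at the time of writing was 2.0.7)

After unpacking the archive, copy the src folder and the jars in lib into the project, and add the jars to the project's build path.

My final project layout is as follows. src holds my own classes,

malletSrc holds the Mallet source code,

and the mallet folder holds the corresponding jars.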

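Below are my study notes:

How each class is used can only be worked out from the API docs, the source code, and my own tests.

Some test examples follow.

To create an Instance, several pieces have to be sorted out first. REF: http://mallet.cs.umass.edu/import-devel.php

There seem to be many subclasses; I only looked into the few I need.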

A comment from the source code:

An instance contains four generic fields of predefined name:
     "data", "target", "name", and "source".   "Data" holds the data represented
     by the instance, "target" is often a label associated with the instance,
     "name" is a short identifying name for the instance (such as a filename),
     and "source" is human-readable source information, (such as the original text).

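About Data:

This needs an Alphabet and a FeatureVector working together. The Alphabet stores the names of the attributes, and the FeatureVector stores an object's value for each attribute.

Test code 1: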

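import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;

public class FeatureVectorTest1 {
    public static void main(String[] args) {
        // Attribute names: 長 = length, 寬 = width, 高 = height
        String[] attributeStr = new String[]{"長", "寬", "高"};
        Alphabet dict = new Alphabet(attributeStr);
        double[] values = new double[]{1, 2, 3};
        FeatureVector vector = new FeatureVector(dict, values);
        System.out.println(vector.toString());
    }
}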

Output:

長(0)=1.0
寬(1)=2.0
高(2)=3.0

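We can also specify which attribute each value belongs to, using indices that start from 0: 長 (length) is 0, 寬 (width) is 1, and 高 (height) is 2. A test: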

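import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;

public class FeatureVectorTest2 {
    public static void main(String[] args) {
        String[] attributeStr = new String[]{"長", "寬", "高"};
        Alphabet dict = new Alphabet(attributeStr);
        double[] values = new double[]{1, 2, 3};
        // indices[i] names the attribute that values[i] belongs to
        int[] indices = new int[]{2, 0, 1};
        FeatureVector vector = new FeatureVector(dict, indices, values);
        System.out.println(vector.toString());
    }
}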

Output:

長(0)=2.0
寬(1)=3.0
高(2)=1.0

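One thing to watch out for: if the indices given for the values contain duplicates, for example the values 2 and 3 are both assigned to 長 (length), the resulting value is accumulated rather than overwritten, giving 5. This is the same effect as counting words.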

    String[] attributeStr = new String[]{"長", "寬", "高"};
    Alphabet dict = new Alphabet(attributeStr);
    double[] values = new double[]{1, 2, 3};
    // Index 0 appears twice, so the values 2 and 3 are both assigned to 長
    int[] indices = new int[]{2, 0, 0};
    FeatureVector vector = new FeatureVector(dict, indices, values);
    System.out.println(vector.toString());

Output:

長(0)=5.0
高(2)=1.0
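
The accumulation seen in this output can be mimicked in plain Java. The following is a minimal sketch and not Mallet's actual implementation: the class name `SparseAccumulationSketch` and the method `accumulate` are hypothetical, and a `TreeMap` is assumed here as a stand-in for whatever Mallet does internally. It only illustrates that values sharing an index are summed rather than overwritten.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mallet: sums the values that share an index.
class SparseAccumulationSketch {

    // Returns index -> summed value, ordered by index.
    public static Map<Integer, Double> accumulate(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        Map<Integer, Double> result = new TreeMap<>();
        for (int i = 0; i < indices.length; i++) {
            // merge() adds to an existing entry instead of replacing it
            result.merge(indices[i], values[i], Double::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"長", "寬", "高"}; // length, width, height
        Map<Integer, Double> summed = accumulate(new int[]{2, 0, 0}, new double[]{1, 2, 3});
        for (Map.Entry<Integer, Double> e : summed.entrySet()) {
            System.out.println(names[e.getKey()] + "(" + e.getKey() + ")=" + e.getValue());
        }
    }
}
```

With the same indices and values as the test above, index 0 ends up holding 2 + 3 = 5 and index 1 has no entry at all, which matches the two-line output.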

OK, that takes care of Data. A FeatureVector is the data I need.

Source: I simply leave it as null.

Label:

/** You should never call this directly. New Label objects are
created on-demand by calling LabelAlphabet.lookupIndex(obj). */

The above is a comment from the source code: a Label has to be created through a LabelAlphabet, so I looked into LabelAlphabet as well and ran the following test.

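import cc.mallet.types.Label;
import cc.mallet.types.LabelAlphabet;
public class LabelTest {
    public static void main(String[] args) {
        LabelAlphabet labels = new LabelAlphabet();
        // 桌子 = table; the Label is obtained from the alphabet, not constructed directly
        Label label = labels.lookupLabel("桌子");
        System.out.println(label.toString());
    }
}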

The output is 桌子 (table), so Label is sorted out too.
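
The "created on demand" idea from the source comment can be pictured with a small stand-alone sketch. This is an illustration under assumptions rather than Mallet's code: `OnDemandAlphabetSketch` is a hypothetical class whose method names merely echo the API, it works on plain strings, and the real LabelAlphabet hands out Label objects and does more.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not Mallet's LabelAlphabet.
class OnDemandAlphabetSketch {
    private final Map<String, Integer> indexOf = new HashMap<>();
    private final List<String> entries = new ArrayList<>();

    // Returns the index of the entry, assigning the next free index on first sight.
    public int lookupIndex(String entry) {
        Integer existing = indexOf.get(entry);
        if (existing != null) {
            return existing;
        }
        int index = entries.size();
        indexOf.put(entry, index);
        entries.add(entry);
        return index;
    }

    // Reverse lookup: the entry stored at a given index.
    public String lookupObject(int index) {
        return entries.get(index);
    }

    public int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        OnDemandAlphabetSketch labels = new OnDemandAlphabetSketch();
        System.out.println(labels.lookupIndex("桌子")); // first new entry gets index 0
        System.out.println(labels.lookupIndex("椅子")); // next new entry gets index 1
        System.out.println(labels.lookupIndex("桌子")); // repeated lookup returns 0 again
    }
}
```

The point of the design is that callers never construct entries themselves; asking for an unseen name is what registers it, so every name maps to exactly one index.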

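Name: this serves as the ID of an instance, so I simply use an integer sequence number.

With all four pieces in place, an Instance can be created. Once the Instances have been added to an InstanceList, classification can proceed as described at http://mallet.cs.umass.edu/classifier-devel.php

The classification test code is as follows: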

import cc.mallet.classify.Classifier;
import cc.mallet.classify.MaxEntTrainer;
import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.LabelAlphabet;
import cc.mallet.types.Labeling;

public class Test {
    String label;   // class of the sample
    double length;
    double width;
    double height;

    public Test(String label, double length, double width, double height) {
        this.label = label;
        this.length = length;
        this.width = width;
        this.height = height;
    }

    // Builds a dense FeatureVector (length, width, height) for this sample.
    FeatureVector toFeatureVector(Alphabet dict) {
        return new FeatureVector(dict, new double[]{length, width, height});
    }

    public static void main(String[] args) {
        LabelAlphabet labels = new LabelAlphabet();
        // Attribute names: 長 = length, 寬 = width, 高 = height
        Alphabet dic = new Alphabet(new String[]{"長", "寬", "高"});
        labels.lookupIndex("桌子"); // table
        labels.lookupIndex("椅子"); // chair
        InstanceList list = new InstanceList(dic, labels);
        int id = 0; // integer sequence number used as the instance name

        // Training set: 100 identical tables and 100 identical chairs
        for (int i = 0; i < 100; ++i) {
            Test table = new Test("桌子", 4, 2, 3);
            Test chair = new Test("椅子", 0, 0, 0);
            list.add(new Instance(table.toFeatureVector(dic), labels.lookupLabel(table.label), ++id, null));
            list.add(new Instance(chair.toFeatureVector(dic), labels.lookupLabel(chair.label), ++id, null));
        }

        // Create a test sample with no target label (未知 = unknown)
        Test unknown = new Test("未知", 0, 0, 2);
        Instance testIns = new Instance(unknown.toFeatureVector(dic), null, ++id, null);

        // Train a maximum entropy classifier and classify the test sample
        MaxEntTrainer trainer = new MaxEntTrainer();
        Classifier classifier = trainer.train(list);
        Labeling label = classifier.classify(testIns).getLabeling();
        System.out.println(label.getBestLabel().toString());
    }
}

Output:

The classification result is 椅子 (chair).

As for the exception that shows up in the output, the message itself notes: (This is not necessarily cause for alarm. Sometimes this happens close to the maximum, where the function may be very flat.)
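
To make the last two lines of the test a little less opaque: a maximum entropy classifier assigns each label a score and normalizes the scores into probabilities with a softmax, and the best label is the one with the highest probability. The sketch below shows only that final step. It is not Mallet code; `SoftmaxSketch` is a hypothetical class and the scores are made up rather than taken from a trained model.

```java
// Hypothetical illustration, not Mallet code: turns per-label scores into probabilities.
class SoftmaxSketch {

    // Numerically stable softmax: subtract the largest score before exponentiating.
    public static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double sum = 0.0;
        double[] probs = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            probs[i] = Math.exp(scores[i] - max);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    // Index of the largest probability, analogous to picking the best label.
    public static int bestIndex(double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] labels = {"桌子", "椅子"}; // table, chair
        double[] scores = {-1.0, 2.0};     // made-up scores, not from a trained model
        double[] probs = softmax(scores);
        System.out.println(labels[bestIndex(probs)]);
    }
}
```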

OK, that's enough study for now. It basically covers what I need, so I'll leave the notes here.
