Implementing TF-IDF with WVToolTest

 

Here is the source code first:
package edu.wvtool.test;
 
import java.io.FileWriter;
 
import edu.udo.cs.wvtool.config.WVTConfiguration;
import edu.udo.cs.wvtool.config.WVTConfigurationFact;
import edu.udo.cs.wvtool.generic.output.WordVectorWriter;
import edu.udo.cs.wvtool.generic.stemmer.DummyStemmer;
import edu.udo.cs.wvtool.generic.tokenizer.NGramTokenizer;
import edu.udo.cs.wvtool.generic.tokenizer.SimpleTokenizer;
import edu.udo.cs.wvtool.generic.vectorcreation.TFIDF;
import edu.udo.cs.wvtool.main.WVTDocumentInfo;
import edu.udo.cs.wvtool.main.WVTFileInputList;
import edu.udo.cs.wvtool.main.WVTWordVector;
import edu.udo.cs.wvtool.main.WVTool;
import edu.udo.cs.wvtool.wordlist.WVTWordList;
 
public class WVToolTest4 {
   
    public static void main(String[] args) throws Exception { 
           // Create a WVTool instance
           WVTool wvt = new WVTool(true);

           // Create a configuration object
           WVTConfiguration config = new WVTConfiguration();

           // Configure it: use DummyStemmer for the stemming step
           config.setConfigurationRule(WVTConfiguration.STEP_STEMMER, new WVTConfigurationFact(new DummyStemmer()));
           
           // Create an input list for documents in two classes
           WVTFileInputList list = new WVTFileInputList(2);
          
            // Add entries 
            // Each WVTDocumentInfo describes one input; its sourceName can be a directory or a single file, and the last argument (0 here) is the class of the document
            // Sample data
            list.addEntry(new WVTDocumentInfo("E:/VSMTest/edu.txt", "txt", "utf-8", "chinese", 0));
            list.addEntry(new WVTDocumentInfo("E:/VSMTest/gov.txt", "txt", "utf-8", "chinese", 1));
            // Build the word list
           WVTWordList wordList = wvt.createWordList(list, config); 
            // Prune the word list by frequency, using 1 and 5 as the lower and upper bounds
           wordList.pruneByFrequency(1, 5); 
     
            // Write the word list to a file
           wordList.storePlain(new FileWriter("E:/VSMTest/wordlist.txt")); 
             
           FileWriter outFile = new FileWriter("E:/VSMTest/wv.txt");
           WordVectorWriter wvw = new WordVectorWriter(outFile, true);
 
           config.setConfigurationRule(WVTConfiguration.STEP_OUTPUT, new WVTConfigurationFact(wvw));
 
           config.setConfigurationRule(WVTConfiguration.STEP_VECTOR_CREATION, new WVTConfigurationFact(new TFIDF()));
 
           // Create the vectors
           wvt.createVectors(list, config, wordList);
 
           // Close the output file
           wvw.close();
           outFile.close();
        }
}
 
Sample data:
Contents of E:/VSMTest/edu.txt:
Education in its broadest, general sense is the means through which the aims and habits of a group of people lives on from one generation to the next.China Education! China Education!
 
Contents of E:/VSMTest/gov.txt:
This article is about the People's Republic of China.
 
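The computation first counts the term frequency TF of each word, the number of documents N, and the document frequency DF of each word.
Output:
Contents of wordlist.txt (stop words removed):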
Education
broadest
general
sense
means
aims
habits
group
people
lives
generation
China
article
People
Republic
 
Contents of wv.txt:
E:/VSMTest/edu.txt; 0:0.6882472016116853 1:0.22941573387056174 2:0.22941573387056174 3:0.22941573387056174 4:0.22941573387056174 5:0.22941573387056174 6:0.22941573387056174 7:0.22941573387056174 8:0.22941573387056174 9:0.22941573387056174 10:0.22941573387056174
E:/VSMTest/gov.txt; 12:0.5773502691896257 13:0.5773502691896257 14:0.5773502691896257
 
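Each entry has the form index:value. The index is the word's 0-based position in wordlist.txt (0 is Education, 12 is article), and the value is the normalized TF-IDF weight.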
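
To see where these numbers come from, here is a minimal sketch in plain Java that recomputes them without WVTool. It rests on assumptions rather than on WVTool's source: the weight is taken to be TF * log(N / DF), each vector is then scaled to unit Euclidean length, and the tokens are typed in by hand from the word list above. The class name TfIdfSketch and its methods are made up for this illustration. Any constant factor in TF or in the logarithm cancels out during normalization, so the exact variant WVTool uses should not matter for this check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it is not part of WVTool.
class TfIdfSketch {

    /** Counts how often each term occurs in one tokenized document (TF). */
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /** Returns one unit-length TF-IDF vector per document (assumed formula: TF * log(N / DF)). */
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        int n = docs.size();                                // number of documents N
        List<Map<String, Integer>> tfs = new ArrayList<>();
        Map<String, Integer> df = new LinkedHashMap<>();    // document frequency DF per term
        for (List<String> doc : docs) {
            Map<String, Integer> tf = termFrequencies(doc);
            tfs.add(tf);
            for (String term : tf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : tfs) {
            Map<String, Double> vec = new LinkedHashMap<>();
            double sumSq = 0.0;
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                double w = e.getValue() * Math.log((double) n / df.get(e.getKey()));
                if (w != 0.0) {                             // terms found in every document get weight 0 and are skipped
                    vec.put(e.getKey(), w);
                    sumSq += w * w;
                }
            }
            double norm = Math.sqrt(sumSq);
            if (norm > 0.0) {
                vec.replaceAll((term, weight) -> weight / norm);   // scale to unit Euclidean length
            }
            vectors.add(vec);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Tokens left after stop-word removal, copied by hand from the sample data and word list.
        List<String> edu = Arrays.asList("Education", "broadest", "general", "sense", "means",
                "aims", "habits", "group", "people", "lives", "generation",
                "China", "Education", "China", "Education");
        List<String> gov = Arrays.asList("article", "People", "Republic", "China");

        for (Map<String, Double> vec : tfIdf(Arrays.asList(edu, gov))) {
            System.out.println(vec);
        }
    }
}
```

Under these assumptions the edu.txt vector has length sqrt(3^2 + 10 * 1^2) = sqrt(19) before scaling, so Education gets 3/sqrt(19), about 0.688, and every other word gets 1/sqrt(19), about 0.229. The three gov.txt words each get 1/sqrt(3), about 0.577. These agree with wv.txt. China (index 11) occurs in both documents, so log(N / DF) = log(2/2) = 0, which would explain why it is missing from both vectors.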