Implementing TF-IDF with WVToolTest

Original

enlai1988

2019-02-22 21:47

Let's start with the source code:

package edu.wvtool.test;

import java.io.FileWriter;

import edu.udo.cs.wvtool.config.WVTConfiguration;

import edu.udo.cs.wvtool.config.WVTConfigurationFact;

import edu.udo.cs.wvtool.generic.output.WordVectorWriter;

import edu.udo.cs.wvtool.generic.stemmer.DummyStemmer;

import edu.udo.cs.wvtool.generic.tokenizer.NGramTokenizer;

import edu.udo.cs.wvtool.generic.tokenizer.SimpleTokenizer;

import edu.udo.cs.wvtool.generic.vectorcreation.TFIDF;

import edu.udo.cs.wvtool.main.WVTDocumentInfo;

import edu.udo.cs.wvtool.main.WVTFileInputList;

import edu.udo.cs.wvtool.main.WVTWordVector;

import edu.udo.cs.wvtool.main.WVTool;

import edu.udo.cs.wvtool.wordlist.WVTWordList;

public class WVToolTest4 {

public static void main(String[] args) throws Exception {

// Initialize a WVTool object

WVTool wvt = new WVTool(true);

// Initialize a configuration object

WVTConfiguration config = new WVTConfiguration();

// Configure config: use DummyStemmer for the stemming step

config.setConfigurationRule(WVTConfiguration.STEP_STEMMER, new WVTConfigurationFact(new DummyStemmer()));

// Define an input list with two classes

WVTFileInputList list = new WVTFileInputList(2);

// Add entries

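// Add a document info object (WVTDocumentInfo) to the input list. The sourceName may be either a directory name or a file name; the last argument (0 here) is the class of this document.

// Sample data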

list.addEntry(new WVTDocumentInfo("E:/VSMTest/edu.txt", "txt", "utf-8", "chinese", 0));

list.addEntry(new WVTDocumentInfo("E:/VSMTest/gov.txt", "txt", "utf-8", "chinese", 1));

// Generate the word list

WVTWordList wordList = wvt.createWordList(list, config);

// Restrict the word list by term frequency: keep terms whose frequency lies between 1 and 5

wordList.pruneByFrequency(1, 5);

// Write the word list to a file

wordList.storePlain(new FileWriter("E:/VSMTest/wordlist.txt"));

FileWriter outFile = new FileWriter("E:/VSMTest/wv.txt");

WordVectorWriter wvw = new WordVectorWriter(outFile, true);

config.setConfigurationRule(WVTConfiguration.STEP_OUTPUT, new WVTConfigurationFact(wvw));

config.setConfigurationRule(WVTConfiguration.STEP_VECTOR_CREATION, new WVTConfigurationFact(new TFIDF()));

// Create the vectors

wvt.createVectors(list, config, wordList);

// Close the output file

wvw.close();

outFile.close();

}

}

Sample data:

Contents of E:/VSMTest/edu.txt:

Education in its broadest, general sense is the means through which the aims and habits of a group of people lives on from one generation to the next.China Education! China Education!

Contents of E:/VSMTest/gov.txt:

This article is about the People's Republic of China.

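The computation first counts, for each word, its term frequency TF, together with the number of documents N and the document frequency DF.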
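To make the three quantities concrete, here is a minimal sketch of the counting step in plain Java. It does not use WVTool; the class name `TermStatsSketch`, its method names, and the shortened token lists are assumptions chosen for illustration, and the tokens are a hand-picked subset of the sample files rather than the real tokenizer output.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class TermStatsSketch {

    // TF: how many times each token occurs in a single document.
    static Map<String, Integer> termFrequency(List<String> tokens) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // DF: in how many documents each token occurs at least once.
    static Map<String, Integer> documentFrequency(List<List<String>> docs) {
        Map<String, Integer> df = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate so that a document counts once per token.
            for (String token : new LinkedHashSet<>(doc)) {
                df.merge(token, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened token lists (not the full files).
        List<String> edu = Arrays.asList("Education", "people", "China", "Education", "China", "Education");
        List<String> gov = Arrays.asList("article", "People", "Republic", "China");
        List<List<String>> docs = Arrays.asList(edu, gov);

        System.out.println("N  = " + docs.size());
        System.out.println("TF(edu) = " + termFrequency(edu));
        System.out.println("DF = " + documentFrequency(docs));
    }
}
```

With these lists, "Education" has TF 3 in the first document and DF 1, while "China" has DF 2 because it occurs in both documents.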

Output:

Contents of wordlist.txt (stop words removed):

Education

broadest

general

sense

means

aims

habits

group

people

lives

generation

China

article

People

Republic

Contents of wv.txt:

E:/VSMTest/edu.txt; 0:0.6882472016116853 1:0.22941573387056174 2:0.22941573387056174 3:0.22941573387056174 4:0.22941573387056174 5:0.22941573387056174 6:0.22941573387056174 7:0.22941573387056174 8:0.22941573387056174 9:0.22941573387056174 10:0.22941573387056174

E:/VSMTest/gov.txt; 12:0.5773502691896257 13:0.5773502691896257 14:0.5773502691896257

The values are the normalized TF-IDF weights.
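As a sanity check, the numbers in wv.txt can be reproduced by hand. The sketch below assumes the weight of a term is tf * log(N / df) and that each document vector is then scaled to unit Euclidean length. That formula is an assumption that happens to be consistent with the output above, not something taken from the WVTool source, and the class `TfIdfSketch` is hypothetical. Under this assumption a term such as "China", which occurs in both documents, gets log(2/2) = 0, which would explain why index 11 is missing from both vectors.

```java
public class TfIdfSketch {

    // tf[i]: count of term i in the document; df[i]: number of documents containing term i.
    // Returns tf * log(N / df), scaled so the vector has Euclidean length 1.
    static double[] tfidf(int[] tf, int[] df, int numDocs) {
        double[] weights = new double[tf.length];
        double sumOfSquares = 0.0;
        for (int i = 0; i < tf.length; i++) {
            weights[i] = tf[i] * Math.log((double) numDocs / df[i]);
            sumOfSquares += weights[i] * weights[i];
        }
        double norm = Math.sqrt(sumOfSquares);
        if (norm > 0) {
            for (int i = 0; i < weights.length; i++) {
                weights[i] /= norm;
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        // edu.txt: Education (3 times), ten terms occurring once, China (2 times, in both documents).
        int[] tf = {3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2};
        int[] df = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2};
        double[] weights = tfidf(tf, df, 2);
        for (int i = 0; i < weights.length; i++) {
            System.out.println(i + ":" + weights[i]);
        }
    }
}
```

For edu.txt this gives roughly 0.688 for "Education" (3/sqrt(19)) and roughly 0.229 for each of the ten single-occurrence terms (1/sqrt(19)), matching the first line of wv.txt. If the tool divides tf by the document length first, the result is the same, since a constant factor cancels in the normalization.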
