TF-IDF (Term Frequency–Inverse Document Frequency): Source Code Analysis of a Java Implementation

Original

2020-02-23 01:58

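TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency; IDF stands for inverse document frequency.

Term frequency (TF) is how often a term (keyword) occurs in a text.

The raw count is usually normalized (typically by dividing it by the total number of words in the document) so that it is not biased toward long documents.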

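Formula:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

That is:

$$\mathrm{TF}_w = \frac{\text{number of occurrences of term } w \text{ in the document}}{\text{total number of terms in the document}}$$

where $n_{i,j}$ is the number of times term $t_i$ occurs in document $d_j$, and the denominator is the total number of occurrences of all terms in $d_j$.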
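The TF definition above can be illustrated with a minimal sketch. It assumes the document has already been split into tokens; the class name `TermFrequencyExample` and the sample tokens are hypothetical and are not taken from the library discussed later.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermFrequencyExample {

    /** Returns tf = (occurrences of the token) / (total number of tokens) for every distinct token. */
    static Map<String, Double> termFrequency(List<String> tokens) {
        // Count how often each token occurs in the document.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        // Normalize each count by the document length.
        Map<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), e.getValue() / (double) tokens.size());
        }
        return tf;
    }

    public static void main(String[] args) {
        // Hypothetical, already-tokenized document with 5 tokens.
        List<String> doc = Arrays.asList("cat", "sat", "cat", "mat", "cat");
        // "cat" occurs 3 times out of 5 tokens.
        System.out.println(termFrequency(doc).get("cat"));
    }
}
```

An empty token list yields an empty map, so no division by zero occurs.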

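Inverse document frequency (IDF):

The IDF of a given term is obtained by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of the quotient.

The fewer documents contain term t, the larger its IDF, which indicates that the term discriminates well between categories.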

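Formula:

$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing term $t_i$ (that is, the number of documents with $n_{i,j} \neq 0$). If the term does not occur in the corpus at all, the denominator is zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead.

That is:

$$\mathrm{IDF} = \log \frac{\text{total number of documents in the corpus}}{\text{number of documents containing the term} + 1}$$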
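A minimal sketch of the smoothed IDF computation described above. The text does not fix the base of the logarithm, so the natural logarithm is an assumption here; the class name and the toy corpus are likewise hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class InverseDocumentFrequencyExample {

    /** idf = ln(|D| / (1 + number of documents containing the term)). */
    static double idf(String term, List<Set<String>> corpus) {
        int docsWithTerm = 0;
        for (Set<String> doc : corpus) {
            if (doc.contains(term)) {
                docsWithTerm++;
            }
        }
        // The +1 keeps the denominator non-zero for terms that occur in no document.
        return Math.log((double) corpus.size() / (1 + docsWithTerm));
    }

    public static void main(String[] args) {
        // Hypothetical corpus of 4 documents, each reduced to its set of distinct words.
        List<Set<String>> corpus = Arrays.asList(
                new HashSet<>(Arrays.asList("common", "rare")),
                new HashSet<>(Arrays.asList("common")),
                new HashSet<>(Arrays.asList("common")),
                new HashSet<>(Arrays.asList("other")));
        System.out.println("rare:   " + idf("rare", corpus));   // term found in 1 of 4 documents
        System.out.println("common: " + idf("common", corpus)); // term found in 3 of 4 documents
    }
}
```

One consequence of the +1 smoothing is that a term occurring in every document gets a slightly negative IDF, since the denominator then exceeds the number of documents.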

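TF-IDF (term frequency–inverse document frequency)

A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and keep the important ones.

Formula:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$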
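To make the product of the two factors concrete, here is a small self-contained sketch that scores terms of one document against a toy corpus. It uses the smoothed IDF with a natural logarithm (an assumption), and every name and document in it is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class TfIdfExample {

    /** Share of the document's tokens that equal the term. */
    static double tf(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return doc.isEmpty() ? 0.0 : (double) count / doc.size();
    }

    /** Smoothed IDF: ln(|D| / (1 + number of documents containing the term)). */
    static double idf(String term, List<List<String>> corpus) {
        int docsWithTerm = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                docsWithTerm++;
            }
        }
        return Math.log((double) corpus.size() / (1 + docsWithTerm));
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("java", "tfidf", "java", "code");
        List<List<String>> corpus = Arrays.asList(
                doc,
                Arrays.asList("java", "code"),
                Arrays.asList("java", "search"),
                Arrays.asList("code", "index"));
        // "java" is frequent in doc but also common across the corpus;
        // "tfidf" is rarer in doc but occurs in no other document.
        System.out.println("java:  " + tfIdf("java", doc, corpus));
        System.out.println("tfidf: " + tfIdf("tfidf", doc, corpus));
    }
}
```

In this toy corpus the more frequent word "java" scores lower than "tfidf" because it appears in three of the four documents, which is the filtering effect described above.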

Java implementation

Full code: https://github.com/huaban/jieba-analysis

First, prepare a stop-word list and an IDF dictionary:

stop_words.txt

idf_dict.txt

	/**
	 * TF-IDF analysis method.
	 * @param content the text/document content to analyze
	 * @param topN number of keywords with the highest TF-IDF values to return; if it exceeds the number of candidate words in content, all of them are returned
	 * @return the keywords in sorted order, truncated to at most topN entries
	 */
	public List<Keyword> analyze(String content, int topN) {
		List<Keyword> keywordList = new ArrayList<>();
		// Lazily load the stop-word list and the precomputed IDF dictionary
		if (stopWordsSet == null) {
			stopWordsSet = new HashSet<>();
			loadStopWords(stopWordsSet, this.getClass().getResourceAsStream("/stop_words.txt"));
		}
		if (idfMap == null) {
			idfMap = new HashMap<>();
			loadIDFMap(idfMap, this.getClass().getResourceAsStream("/idf_dict.txt"));
		}
		// Segment the content (using the jieba segmentation API), then compute each word's TF
		Map<String, Double> tfMap = getTF(content);
		for (Map.Entry<String, Double> entry : tfMap.entrySet()) {
			String word = entry.getKey();
			// Use the word's IDF if the dictionary has it; otherwise fall back to the median IDF
			// (newly coined words such as internet slang may need to be added to the dictionary periodically)
			double idf = idfMap.containsKey(word) ? idfMap.get(word) : idfMedian;
			keywordList.add(new Keyword(word, idf * entry.getValue()));
		}

		// Relies on the natural ordering defined by Keyword (Comparable)
		Collections.sort(keywordList);

		// Keep only the first topN entries
		if (keywordList.size() > topN) {
			keywordList.subList(topN, keywordList.size()).clear();
		}
		return keywordList;
	}
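The method above calls `Collections.sort(keywordList)` and then keeps the first `topN` entries, so it depends on `Keyword` being `Comparable` with the highest TF-IDF value first. The `Keyword` class itself is not shown here; the following is a hypothetical stand-in that sketches one way such an ordering could be written, not the class from the repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class KeywordSortExample {

    /** Hypothetical stand-in for the Keyword class used by analyze(). */
    static final class Keyword implements Comparable<Keyword> {
        private final String name;
        private final double tfidfValue;

        Keyword(String name, double tfidfValue) {
            this.name = name;
            this.tfidfValue = tfidfValue;
        }

        String getName() {
            return name;
        }

        double getTfidfValue() {
            return tfidfValue;
        }

        @Override
        public int compareTo(Keyword other) {
            // Arguments are swapped so that sorting puts the highest score first.
            return Double.compare(other.tfidfValue, this.tfidfValue);
        }
    }

    /** Returns a new list holding the n highest-scoring keywords (or all of them if fewer exist). */
    static List<Keyword> topN(List<Keyword> keywords, int n) {
        List<Keyword> sorted = new ArrayList<>(keywords);
        Collections.sort(sorted);
        int limit = Math.min(Math.max(n, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Keyword> keywords = Arrays.asList(
                new Keyword("a", 0.1), new Keyword("b", 0.5), new Keyword("c", 0.3));
        for (Keyword k : topN(keywords, 2)) {
            System.out.println(k.getName() + " " + k.getTfidfValue());
        }
    }
}
```

`Double.compare` is used rather than subtracting the scores, which avoids rounding a small difference to zero when converting to `int`.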

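	/**
	 * TF formula:
	 * tf = N(i,j) / (sum(N(k,j) for all k))
	 * N(i,j) is the number of times word i occurs in document d (content); sum(N(k,j)) is the total number of occurrences of all words in d.
	 * @param content the text to segment and count
	 * @return a map from each retained word to its TF value
	 */
	private Map<String, Double> getTF(String content) {
		Map<String, Double> tfMap = new HashMap<>();
		if (content == null || content.equals(""))
			return tfMap;

		JiebaSegmenter segmenter = new JiebaSegmenter();
		List<String> segments = segmenter.sentenceProcess(content);
		Map<String, Integer> freqMap = new HashMap<>();

		int wordSum = 0;
		for (String segment : segments) {
			// Skip stop words and single-character words
			if (!stopWordsSet.contains(segment) && segment.length() > 1) {
				wordSum++;
				if (freqMap.containsKey(segment)) {
					freqMap.put(segment, freqMap.get(segment) + 1);
				} else {
					freqMap.put(segment, 1);
				}
			}
		}

		// Compute the TF as a double.
		// Note: the 0.1 factor scales every value by the same constant, so the ranking is
		// unaffected, but each value is one tenth of what the formula above gives; use 1.0 for the exact TF.
		for (String word : freqMap.keySet()) {
			tfMap.put(word, freqMap.get(word) * 0.1 / wordSum);
		}

		return tfMap;
	}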

	/**
	 * Loads the default jieba stop-word list.
	 * url: https://github.com/yanyiwu/nodejieba/blob/master/dict/stop_words.utf8
	 * @param set the set to fill with stop words
	 * @param in input stream of the stop-word file, read line by line
	 */
	private void loadStopWords(Set<String> set, InputStream in) {
		// try-with-resources closes the reader even if reading fails;
		// the charset is given explicitly because the upstream file is UTF-8
		try (BufferedReader bufr = new BufferedReader(
				new InputStreamReader(in, java.nio.charset.StandardCharsets.UTF_8))) {
			String line;
			while ((line = bufr.readLine()) != null) {
				set.add(line.trim());
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

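	/**
	 * IDF values would normally be computed from a corpus using the formula, but jieba already ships a good IDF dictionary, so that dictionary is used by default.
	 * url: https://raw.githubusercontent.com/yanyiwu/nodejieba/master/dict/idf.utf8
	 * @param map the map to fill with word-to-IDF entries
	 * @param in input stream of the IDF dictionary, one space-separated "word idf" pair per line
	 */
	private void loadIDFMap(Map<String, Double> map, InputStream in) {
		// try-with-resources closes the reader even if reading fails;
		// the charset is given explicitly because the upstream file is UTF-8
		try (BufferedReader bufr = new BufferedReader(
				new InputStreamReader(in, java.nio.charset.StandardCharsets.UTF_8))) {
			String line;
			while ((line = bufr.readLine()) != null) {
				String[] kv = line.trim().split(" ");
				map.put(kv[0], Double.parseDouble(kv[1]));
			}

			// Compute the median IDF, used as the fallback for words missing from the dictionary
			// (for an even number of entries this picks the upper of the two middle values)
			List<Double> idfList = new ArrayList<>(map.values());
			Collections.sort(idfList);
			idfMedian = idfList.get(idfList.size() / 2);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}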
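In the loader above, a single blank or malformed line throws inside the loop, is caught by the outer handler, and leaves the rest of the dictionary unread and the median unset. The following sketch shows a more tolerant variant of the same idea together with the median fallback lookup. It works on in-memory lines so that it is self-contained; the class name, method names, and sample values are hypothetical and are not part of the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IdfDictionaryExample {

    /** Parses "word idf" lines, skipping blank or malformed ones instead of aborting. */
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> idf = new HashMap<>();
        for (String line : lines) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length != 2) {
                continue; // blank line or wrong number of fields
            }
            try {
                idf.put(kv[0], Double.parseDouble(kv[1]));
            } catch (NumberFormatException e) {
                // Second field is not a number: ignore this line.
            }
        }
        return idf;
    }

    /** Same rule as the loader: the element at index size/2 of the sorted values (map must be non-empty). */
    static double median(Map<String, Double> idf) {
        List<Double> values = new ArrayList<>(idf.values());
        Collections.sort(values);
        return values.get(values.size() / 2);
    }

    /** Returns the word's IDF, or the median for words missing from the dictionary. */
    static double lookup(Map<String, Double> idf, double median, String word) {
        return idf.getOrDefault(word, median);
    }

    public static void main(String[] args) {
        // Hypothetical sample lines; real values come from the dictionary file.
        List<String> lines = Arrays.asList("alpha 2.0", "beta 8.0", "", "broken", "gamma 5.0");
        Map<String, Double> idf = parse(lines);
        double median = median(idf);
        System.out.println("beta:    " + lookup(idf, median, "beta"));
        System.out.println("unknown: " + lookup(idf, median, "unknown"));
    }
}
```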
