Introduction to LDA in Mahout, with an Example

犀利-sharp

2020-02-21 02:50

Translated from: https://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation

Introduction:

Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning method that clusters words into topics and represents each document as a mixture of those topics.

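A topic model is a hierarchical Bayesian model in which each document is a probability distribution over topics and each topic is a probability distribution over words. For example, a topic might contain words such as "sports", "baseball", and "home run", while a document about the use of banned drugs in baseball might involve "sports", "baseball", and "banned drugs". Labels like these are assigned by a human after the fact; the algorithm itself only associates words with probabilities. The goal of parameter estimation is to learn what the topics are and in what proportion each document uses each of them.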
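The two distributions described above can be made concrete with a small sketch. All topic names, words, and probabilities here are invented for illustration; they are not output from Mahout. The sketch shows how a document's topic proportions and each topic's word distribution combine into the probability of seeing a word in that document.

```python
# Toy numbers, invented purely for illustration.
# Each topic is a probability distribution over words.
topics = {
    "sports": {"game": 0.5, "home run": 0.3, "player": 0.2},
    "drugs": {"steroid": 0.6, "player": 0.3, "doping": 0.1},
}

# Each document is a probability distribution over topics.
doc_topics = {"sports": 0.7, "drugs": 0.3}


def word_probability(word, doc_topics, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_topics.items()
    )


vocabulary = sorted({w for dist in topics.values() for w in dist})
doc_word_dist = {w: word_probability(w, doc_topics, topics) for w in vocabulary}

for word, p in doc_word_dist.items():
    print(f"{word}: {p:.2f}")
```

Note that "player" gets probability mass from both topics, which is exactly why a single word cannot be assigned to one topic with certainty.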

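Another way to look at a topic model is as a generalization of a mixture model such as Dirichlet Process Clustering. Start from an ordinary mixture model, where there is a single global mixture of several distributions; in a topic model, each document instead has its own mixture over those globally shared components. In Dirichlet Process Clustering, each document has one latent variable, drawn from the global mixture, that determines which model the document belongs to. In LDA, each word in a document has its own latent variable, drawn from that document's mixture.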

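The idea is to explain the observed data with a probabilistic mixture of several models. Each observed data point is assumed to come from one of the models in the mixture, but we do not know which one, so we introduce a so-called latent variable that records which model each data point came from.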
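The difference between one latent variable per document and one latent variable per word is easier to see in code. The sketch below is a simplified illustration, not Mahout code: the component names, words, and probabilities are made up, and the document's mixture is passed in directly rather than drawn from a Dirichlet prior as full LDA would do.

```python
import random


def sample_from(dist, rng):
    """Draw one key from a {key: probability} dict."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_mixture_doc(global_mix, components, length, rng):
    """Plain mixture model: ONE latent variable for the whole document."""
    z = sample_from(global_mix, rng)
    words = [sample_from(components[z], rng) for _ in range(length)]
    return z, words


def generate_lda_doc(doc_mix, components, length, rng):
    """LDA-style: one latent variable PER WORD, drawn from the document's mixture."""
    assignments, words = [], []
    for _ in range(length):
        z = sample_from(doc_mix, rng)
        assignments.append(z)
        words.append(sample_from(components[z], rng))
    return assignments, words


# Hypothetical components (topics) over a three-word vocabulary.
components = {
    "A": {"x": 0.9, "y": 0.1},
    "B": {"y": 0.2, "z": 0.8},
}
mix = {"A": 0.5, "B": 0.5}

rng = random.Random(0)
cluster, mixture_words = generate_mixture_doc(mix, components, 5, rng)
assignments, lda_words = generate_lda_doc(mix, components, 5, rng)
print(cluster, mixture_words)
print(assignments, lda_words)
```

In the mixture case every word comes from the same component, while in the LDA-style case the component can change from word to word. Inference runs this process in reverse: the words are observed and the latent assignments are what has to be estimated.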

Collapsed Variational Bayes

The CVB algorithm used in Mahout's LDA implementation combines variational Bayes and Gibbs sampling.

Usage:

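Mahout's LDA implementation works on sparse vectors of term frequencies. The frequencies must be non-negative, since negative counts have no meaning in a probabilistic model. Make sure the vectors are weighted with TF rather than TF-IDF.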
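To show what such input looks like, here is a minimal sketch that builds a term dictionary and sparse term-frequency vectors from already tokenized documents. It only illustrates the shape of the data; Mahout produces its vectors with its own tooling and file formats, and the helper names below are hypothetical.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Map each distinct term to an integer index (sorted for determinism)."""
    terms = sorted({t for doc in tokenized_docs for t in doc})
    return {term: i for i, term in enumerate(terms)}


def tf_vector(tokens, dictionary):
    """Sparse term-frequency vector as {term_index: count}.

    Raw counts only: no IDF weighting is applied, so every value is a
    non-negative integer.
    """
    counts = Counter(tokens)
    return {dictionary[t]: c for t, c in counts.items() if t in dictionary}


docs = [["ball", "game", "ball"], ["game", "drug", "test"]]
dictionary = build_dictionary(docs)
vectors = [tf_vector(d, dictionary) for d in docs]

print(dictionary)
print(vectors)
print("number of unique features:", len(dictionary))
```

The number of unique features printed at the end is the kind of value the -nt option in the command below asks for.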

Invoke it as follows:

bin/mahout cvb \
    -i <input path for document vectors> \
    -dict <path to term-dictionary file(s) , glob expression supported> \
    -o <output path for topic-term distributions> \
    -dt <output path for doc-topic distributions> \
    -k <number of latent topics> \
    -nt <number of unique features defined by input document vectors> \
    -mt <path to store model state after each iteration> \
    -maxIter <max number of iterations> \
    -mipd <max number of iterations per doc for learning> \
    -a <smoothing for doc topic distributions> \
    -e <smoothing for term topic distributions> \
    -seed <random seed> \
    -tf <fraction of data to hold for testing> \
    -block <number of iterations per perplexity check, ignored unless test_set_percentage>0>
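The -a and -e options are described only as smoothing for the doc-topic and term-topic distributions. One common reading, assumed here and not verified against Mahout's source, is that they act as additive pseudo-counts, which is the usual role of Dirichlet hyperparameters. Under that assumption, a small sketch of the effect:

```python
def smooth(counts, pseudo_count):
    """Turn raw counts into probabilities after adding pseudo_count to each entry.

    ASSUMPTION: this models smoothing as additive pseudo-counts; it is an
    illustration, not Mahout's actual update rule.
    """
    total = sum(counts) + pseudo_count * len(counts)
    return [(c + pseudo_count) / total for c in counts]


# Hypothetical topic counts for one document with three topics.
doc_topic_counts = [8, 2, 0]

lightly_smoothed = smooth(doc_topic_counts, 0.1)
heavily_smoothed = smooth(doc_topic_counts, 10.0)

print([round(p, 3) for p in lightly_smoothed])
print([round(p, 3) for p in heavily_smoothed])
```

A small value keeps the distribution close to the observed counts while still giving unseen topics a non-zero probability. A large value pulls every document toward a uniform mix of topics.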

When choosing the number of topics, it is advisable to try several values and compare the results.
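One common way to compare runs with different numbers of topics is perplexity on held-out documents, which is what the -tf and -block options above refer to. The following is a minimal sketch of that calculation. It assumes the doc-topic proportions for the held-out documents are already available, and the input layout is a made-up convention, not Mahout's output format.

```python
import math


def perplexity(held_out_docs, doc_topic, topic_term):
    """exp(-average log-likelihood per token) over held-out documents.

    held_out_docs: list of {term_index: count}
    doc_topic[d][k]  = p(topic k | doc d)
    topic_term[k][w] = p(term w | topic k)
    """
    log_likelihood = 0.0
    token_count = 0
    num_topics = len(topic_term)
    for d, doc in enumerate(held_out_docs):
        for term, count in doc.items():
            p = sum(doc_topic[d][k] * topic_term[k][term] for k in range(num_topics))
            log_likelihood += count * math.log(p)
            token_count += count
    return math.exp(-log_likelihood / token_count)


held_out = [{0: 2, 3: 1}]

# A model that knows nothing: uniform over a 4-term vocabulary.
uniform_ppl = perplexity(held_out, [[0.5, 0.5]], [[0.25] * 4, [0.25] * 4])

# A model whose first topic favours the terms that actually occur.
fitted_ppl = perplexity(held_out, [[1.0, 0.0]], [[0.6, 0.05, 0.05, 0.3], [0.25] * 4])

print(uniform_ppl, fitted_ppl)
```

Lower perplexity means the model explains the held-out text better, so a reasonable procedure is to run with several values of -k and keep the one where perplexity stops improving.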

After running LDA, you can print the resulting topics with the following tool:

bin/mahout ldatopics \
    -i <input vectors directory> \
    -d <input dictionary file> \
    -w <optional number of words to print> \
    -o <optional output working directory. Default is to console> \
    -h <print out help> \
    -dt <optional dictionary type (text|sequencefile). Default is text>
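Conceptually, printing topics means taking each topic's distribution over terms, sorting it, and looking the top indices up in the dictionary. The sketch below illustrates that idea with invented data; it is not the tool's implementation and does not read Mahout's file formats.

```python
def top_words(topic_term_row, dictionary, n=3):
    """Return the n (term, probability) pairs with the highest probability."""
    index_to_term = {i: t for t, i in dictionary.items()}
    ranked = sorted(enumerate(topic_term_row), key=lambda pair: pair[1], reverse=True)
    return [(index_to_term[i], p) for i, p in ranked[:n]]


# Hypothetical dictionary and topic-term distributions.
dictionary = {"ball": 0, "drug": 1, "game": 2, "test": 3}
topic_term = [
    [0.50, 0.05, 0.40, 0.05],  # topic 0
    [0.05, 0.55, 0.10, 0.30],  # topic 1
]

for k, row in enumerate(topic_term):
    words = ", ".join(f"{term} ({p:.2f})" for term, p in top_words(row, dictionary, n=2))
    print(f"topic {k}: {words}")
```

The n argument plays the same role as the optional number of words to print in the command above.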

Example:

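A detailed example script is available at mahout/examples/bin/build-reuters.sh. The script automatically downloads the data set, builds a Lucene index, and converts the Lucene index into vectors. Uncomment the last two lines to make it run LDA on the vectors and print the results.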

To adapt the example to your own data, you will need to build your own Lucene index, which takes some adaptation; the rest should stay much the same.

Parameter estimation:

The EM algorithm is used.
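The post does not describe the estimation procedure further. As a rough illustration of what an EM loop looks like for a topic model, here is a simplified pLSA-style sketch with no Dirichlet priors. It is a stand-in chosen for brevity, not Mahout's CVB implementation, and the function name and data layout are hypothetical.

```python
import math
import random


def plsa_em(docs, num_topics, num_terms, iterations=20, seed=0):
    """EM for a pLSA-style topic model.

    docs: list of non-empty {term_index: count} dicts.
    Returns (doc_topic, topic_term, log_likelihood_history).
    """
    rng = random.Random(seed)

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    # Random, strictly positive starting distributions.
    doc_topic = [normalized([rng.random() + 0.1 for _ in range(num_topics)])
                 for _ in docs]
    topic_term = [normalized([rng.random() + 0.1 for _ in range(num_terms)])
                  for _ in range(num_topics)]
    history = []

    for _ in range(iterations):
        new_doc_topic = [[0.0] * num_topics for _ in docs]
        new_topic_term = [[0.0] * num_terms for _ in range(num_topics)]
        log_likelihood = 0.0
        for d, doc in enumerate(docs):
            for w, count in doc.items():
                # E-step: how responsible is each topic for term w in doc d?
                joint = [doc_topic[d][k] * topic_term[k][w] for k in range(num_topics)]
                p_w = sum(joint)
                log_likelihood += count * math.log(p_w)
                for k in range(num_topics):
                    expected = count * joint[k] / p_w
                    new_doc_topic[d][k] += expected
                    new_topic_term[k][w] += expected
        # M-step: renormalize the expected counts into distributions.
        doc_topic = [normalized(row) for row in new_doc_topic]
        topic_term = [normalized(row) for row in new_topic_term]
        # Log-likelihood of the parameters used in this iteration's E-step.
        history.append(log_likelihood)

    return doc_topic, topic_term, history


docs = [{0: 3, 1: 2}, {2: 4, 3: 1}, {0: 1, 1: 1, 2: 1}]
doc_topic, topic_term, history = plsa_em(docs, num_topics=2, num_terms=4)
print(history[0], history[-1])
```

The E-step estimates the latent topic assignment of each word occurrence, and the M-step re-estimates the two sets of distributions from those estimates. The log-likelihood should not decrease from one iteration to the next, which is a handy sanity check when experimenting.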
