JGibbLDA Output Files: Explanation and Example

2015-12-13 21:14:00

JGibbLDA writes the following output files:

The .others file stores the LDA model parameters, such as alpha and beta.

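The .phi file stores the topic-word distribution. Each element is p(word|topic), each row is one topic, and each column is one word. The columns cover every word in the vocabulary, not only the top words configured with twords.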

The .theta file stores the document-topic distribution. Each element is p(topic|document), each row is one document, and each column holds the probability of one topic.

The .tassign file holds the topic assignment of every word occurrence in the training corpus. Each line is one document of the corpus.

The .twords file lists the top words selected for each topic together with their weights.

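wordmap.txt lists every distinct word in the corpus with its id. Ids are assigned in order of first appearance in the corpus, but the lines of wordmap.txt are not written in id order (in the example further down they are neither in id order nor alphabetical).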
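Since the ids in .tassign and the columns of .phi refer back to wordmap.txt, a small loader is often the first thing needed. The following is a minimal sketch, assuming the layout described above: a first line with the vocabulary size, then one "word id" pair per line. The function name parse_wordmap and the three-line sample are illustrative only and are not part of JGibbLDA.

```python
def parse_wordmap(lines):
    """Parse the lines of a wordmap.txt file into a {word: id} dict.

    Assumed layout: the first non-empty line is the vocabulary size,
    every following line is "<word> <id>".
    """
    rows = [line.strip() for line in lines if line.strip()]
    declared_size = int(rows[0])

    word_to_id = {}
    for row in rows[1:]:
        # Split on the last run of whitespace so the id is always the final field.
        word, word_id = row.rsplit(maxsplit=1)
        word_to_id[word] = int(word_id)

    if len(word_to_id) != declared_size:
        raise ValueError(
            f"header says {declared_size} words, found {len(word_to_id)}"
        )
    return word_to_id


# Hypothetical shortened sample; a real file would be read with open(...).
sample = ["3", "sport 0", "Spanish 1", "football 2"]
word_to_id = parse_wordmap(sample)

# Invert the mapping to look up a word from an id found in .tassign or .phi.
id_to_word = {word_id: word for word, word_id in word_to_id.items()}
print(id_to_word[2])  # football
```

Keeping both directions of the mapping makes it easy to label .phi columns with words and to translate .tassign ids back into text.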

The following example illustrates the output.

test_input.txt contains 4 documents. The first two are about sport (football) and the last two are about travel. The content of test_input.txt is:

    4
    sport Spanish football association competition club tickets scored win winners keeper shots best goal campaign season's Champions League France team France Football Federation president national team training session Champions record European competition without recording a single victory quit my job to travel passport world travel is a luxury for the privileged the rich or the retired travel stories Have a long-term plan visa-free destinations Central Station City of London dry gin drinking building older foundations River Fleet flavour gin and tonic be served with cubed ice fruit floral spicy earthy savoury citrus

The LDA model parameters are:

    alpha 0.5
    beta 0.1
    topicNum 2
    niters 1000
    savestep 1000
    twords 10

The settings specify 2 topics, with 10 top words listed per topic. Start with wordmap.txt. test_input.txt contains 81 distinct words, so the first line of the file is the vocabulary size, 81, and word ids start at 0:

    81
    competition 4
    Central 54
    ice 74
    earthy 78
    without 28
    building 62
    passport 38
    Federation 21
    record 26
    club 5
    Spanish 1
    plan 51
    floral 76
    League 17
    goal 13
    drinking 61
    Fleet 66
    keeper 10
    destinations 53
    foundations 64
    is 40
    Have 49
    dry 59
    City 56
    spicy 77
    European 27
    my 34
    privileged 44
    Station 55
    savoury 79
    served 71
    London 58
    campaign 14
    tonic 69
    shots 11
    job 35
    tickets 6
    be 70
    season's 15
    session 25
    fruit 75
    for 42
    association 3
    recording 29
    best 12
    training 24
    gin 60
    world 39
    and 68
    of 57
    national 23
    River 65
    retired 47
    older 63
    France 18
    win 8
    winners 9
    a 30
    or 46
    stories 48
    flavour 67
    cubed 73
    victory 32
    rich 45
    football 2
    team 19
    Football 20
    citrus 80
    single 31
    the 43
    Champions 16
    with 72
    scored 7
    luxury 41
    quit 33
    to 36
    visa-free 52
    travel 37
    sport 0
    president 22
    long-term 50

Each line of the .tassign file corresponds to one document. Each element has the form word_id:topic_id, meaning that this occurrence of the word with id word_id is assigned to topic topic_id:

    0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:1 18:0 19:0 18:0 20:0 21:0 22:0 23:0 19:0 24:0 25:0 16:0 26:0 27:0 4:0 28:0 29:0 30:0 31:0 32:0 33:1 34:1 35:1 36:1 37:1 38:0 39:1 37:1 40:0 30:1 41:1 42:1 43:1 44:1 43:1 45:1 46:1 43:1 47:1 37:1 48:1 49:1 30:1 50:1 51:1 52:1 53:1 54:1 55:0 56:1 57:1 58:0 59:1 60:1 61:1 62:1 63:1 64:1 65:1 66:1 67:1 60:1 68:1 69:0 70:1 71:1 72:1 73:1 74:0 75:1 76:1 77:1 78:1 79:0 80:0

The .twords file lists the highest-weighted words of each topic together with their weights:

    Topic 0th:
        competition 0.04030710172744722
        Champions 0.04030710172744722
        France 0.04030710172744722
        team 0.04030710172744722
        sport 0.02111324376199616
        Spanish 0.02111324376199616
        football 0.02111324376199616
        association 0.02111324376199616
        club 0.02111324376199616
        tickets 0.02111324376199616
    Topic 1th:
        travel 0.05525846702317291
        the 0.05525846702317291
        a 0.03743315508021391
        gin 0.03743315508021391
        League 0.0196078431372549
        quit 0.0196078431372549
        my 0.0196078431372549
        job 0.0196078431372549
        to 0.0196078431372549
        world 0.0196078431372549

The two most important output files are .phi and .theta.

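.phi is the topic-word matrix. This test has only 2 topics, so the matrix has 2 rows. The number of columns is not the number of topic words set in the parameters: that setting only controls how many words are displayed per topic, while the computation uses all words. The columns therefore cover every word in wordmap.txt, so the column dimension is 81. The .phi file is: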

    0.021113 0.021113 0.021113 0.021113 0.040307 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.040307 0.001919 0.040307 0.040307 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.021113 0.001919 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.021113 0.001919 0.001919 0.001919 0.001919 0.021113 0.021113
    0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.019608 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.001783 0.037433 0.001783 0.001783 0.019608 0.019608 0.019608 0.019608 0.055258 0.001783 0.019608 0.001783 0.019608 0.019608 0.055258 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.001783 0.019608 0.019608 0.001783 0.019608 0.037433 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.019608 0.001783 0.019608 0.019608 0.019608 0.019608 0.001783 0.019608 0.019608 0.019608 0.019608 0.001783 0.001783
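The .phi values can be cross-checked against the .tassign counts. The sketch below assumes the usual smoothed estimate for collapsed Gibbs sampling, phi[k][w] = (n[k][w] + beta) / (n[k] + V * beta), where n[k][w] is the number of occurrences of word w assigned to topic k, n[k] is the total number of tokens assigned to topic k, and V is the vocabulary size. The function names are illustrative, not JGibbLDA API. The assignment pairs are copied from the .tassign listing of this example.

```python
from collections import Counter

BETA = 0.1        # beta from the model parameters of the example
NUM_TOPICS = 2    # topicNum
VOCAB_SIZE = 81   # first line of wordmap.txt

# word_id:topic_id pairs from the example .tassign file.
# Line breaks here are only for readability and carry no meaning.
TASSIGN = """
0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0 8:0 9:0 10:0 11:0 12:0 13:0 14:0 15:0 16:0 17:1
18:0 19:0 18:0 20:0 21:0 22:0 23:0 19:0 24:0 25:0 16:0 26:0 27:0 4:0 28:0 29:0
30:0 31:0 32:0 33:1 34:1 35:1 36:1 37:1 38:0 39:1 37:1 40:0 30:1 41:1 42:1 43:1
44:1 43:1 45:1 46:1 43:1 47:1 37:1 48:1 49:1 30:1 50:1 51:1 52:1 53:1 54:1 55:0
56:1 57:1 58:0 59:1 60:1 61:1 62:1 63:1 64:1 65:1 66:1 67:1 60:1 68:1 69:0 70:1
71:1 72:1 73:1 74:0 75:1 76:1 77:1 78:1 79:0 80:0
"""


def parse_tassign(text):
    """Turn 'word_id:topic_id' tokens into a list of (word_id, topic_id) tuples."""
    pairs = []
    for token in text.split():
        word_id, topic_id = token.split(":")
        pairs.append((int(word_id), int(topic_id)))
    return pairs


def estimate_phi(pairs, num_topics, vocab_size, beta):
    """Smoothed topic-word probabilities computed from assignment counts."""
    word_topic_count = Counter(pairs)                  # n[k][w], keyed by (w, k)
    topic_total = Counter(topic for _, topic in pairs)  # n[k]
    return [
        [
            (word_topic_count[(w, k)] + beta) / (topic_total[k] + vocab_size * beta)
            for w in range(vocab_size)
        ]
        for k in range(num_topics)
    ]


phi = estimate_phi(parse_tassign(TASSIGN), NUM_TOPICS, VOCAB_SIZE, BETA)
print(f"{phi[0][4]:.6f}")   # competition (id 4) under topic 0: 0.040307
print(f"{phi[1][37]:.6f}")  # travel (id 37) under topic 1: 0.055258
```

With these counts, topic 0 receives 44 tokens and topic 1 receives 48. A word that never occurs in a topic still gets beta / (n[k] + V * beta), which is where the small values 0.001919 and 0.001783 in the two rows come from.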

.theta is the document-topic matrix. This test has 4 documents and 2 topics, so the matrix has 4 rows and 2 columns: