Text Classification Based on GibbsLDA

Previous posts covered document topic models, but my primary task is still classification; topic models come into it mainly for text representation, since a topic model can sharply reduce the dimensionality of document vectors. Topic models can of course do much more than this, but for a classification task, that part is all I need.

 

As discussed before, LDA is a fairly mature document topic model. This time I used the open-source JGibbsLDA implementation for the LDA work. It is simple to use; usage instructions are on the official site, or easily found via Google.

 

Following the parameters and input format described on the official site, training on the corpus produces the following files:

  1. model-final.twords: the topic-word file, i.e. the most probable words for each topic
  2. model-final.others: the parameters LDA was trained with
  3. model-final.phi: the topic-word distribution, a (number of topics) × (vocabulary size) matrix
  4. model-final.tassign: the topic assigned to each word occurrence in each document
  5. model-final.theta: the one we need — each document's topic probability distribution
  6. wordmap.txt: the mapping between words and their integer IDs

The file we need is model-final.theta; it supplies the document vectors that feed the neural-network classifier.

 

Now for the experiment:

Corpus: 20_newsgroups, news articles in 20 classes, split 1:1 into training and test sets

Environment: JDK 1.8, Windows 7

LDA tool: the open-source JGibbsLDA

Classifier: a simple three-layer 100×300×20 BP neural network, built with JOONE

 

First, preprocess the corpus: remove stopwords and irrelevant tokens (dates, years, e-mail addresses, and so on). Stemming was not used in this experiment. I initially planned to use Lucene's stemming tool, but its results were poor — it stems "does" to "doe" and "integrate" to "intergr", which defeats the purpose. I then tried Stanford CoreNLP's stemming tool, whose results are good, but because its processing is context-based it was too slow for my needs. So in the end no stemming was applied.
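As a rough illustration, the filtering described above can be sketched like this. The stopword list and the exact token rules here are stand-ins for this sketch, not the ones actually used in the experiment:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Preprocess {
    // A tiny stand-in stopword list; a real run would use a full list.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "to", "and", "is", "in"));

    /** Lower-cases the text, keeps alphabetic tokens only, and drops stopwords
     *  and tokens containing digits or '@' (dates, years, e-mail addresses). */
    public static String clean(String text) {
        StringBuilder sb = new StringBuilder();
        for (String tok : text.toLowerCase().split("\\s+")) {
            if (tok.isEmpty()) continue;
            if (tok.matches(".*[0-9@].*")) continue;   // dates, years, e-mail addresses
            tok = tok.replaceAll("[^a-z]", "");        // strip punctuation
            if (tok.isEmpty() || STOPWORDS.contains(tok)) continue;
            sb.append(tok).append(' ');
        }
        return sb.toString().trim();
    }
}
```

The output of this step is exactly the space-separated bag-of-words form shown in the trainScale sample below.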

 

Since LDA does not work well on short texts, the corpus was filtered to keep only articles longer than 5000 characters — a threshold I picked myself, with no particular justification. After this filtering, the training set shrank to 126 documents and the test set to 121 (from roughly 9500 documents each). Note: this experiment only aims to test LDA's text-representation performance, so a small subset of the corpus is enough for the purpose.

 

The processed training file trainScale looks like this (only one document is shown):

126
archive atheism resources alt atheism archive resources modified december version atheist resources addresses atheist organizations usa freedom religion foundation darwin fish bumper stickers assorted atheist paraphernalia freedom religion foundation write ffrf box madison wi telephone evolution designs evolution designs sell darwin fish fish symbol christians stick cars feet word darwin written inside deluxe moulded plastic fish postpaid write evolution designs laurel canyon north hollywood san francisco bay area darwin fish lynn gold mailing net lynn directly price fish american atheist press aap publish atheist books critiques bible lists biblical contradictions book bible handbook ball foote american atheist press isbn edition bible contradictions absurdities atrocities immoralities ball foote bible contradicts aap based king james version bible write american atheist press box austin tx cameron road austin tx telephone fax prometheus books sell books including haught holy horrors write east amherst street buffalo york telephone alternate address newer older prometheus books glenn drive buffalo ny african americans humanism organization promoting black secular humanism uncovering history black freethought publish quarterly newsletter aah examiner write norm allen jr african americans humanism box buffalo ny united kingdom rationalist press association national secular society islington high street holloway road london ew london nl british humanist association south place ethical society lamb conduit passage conway hall london wc rh red lion square london wc rl fax national secular society publish freethinker monthly magazine founded germany ibka internationaler bund der konfessionslosen und atheisten postfach berlin germany ibka publish journal miz materialien und informationen zur zeit politisches journal der konfessionslosesn und atheisten hrsg ibka miz vertrieb postfach berlin germany atheist books write ibdk internationaler ucherdienst der konfessionslosen 
postfach hannover germany telephone books fiction thomas disch santa claus compromise short story ultimate proof santa exists characters events fictitious similarity living dead gods uh walter miller jr canticle leibowitz gem atomic doomsday novel monks spent lives copying blueprints saint leibowitz filling sheets paper ink leaving white lines letters edgar pangborn davy atomic doomsday novel set clerical church example forbids produce describe substance atoms philip dick philip dick dick wrote philosophical thought provoking short stories novels stories bizarre times approachable wrote sf wrote truth religion technology believed met sort god remained sceptical novels relevance galactic pot healer fallible alien deity summons group earth craftsmen women remote planet raise giant cathedral beneath oceans deity demand faith earthers pot healer joe fernwright unable comply polished ironic amusing novel maze death noteworthy description technology based religion valis schizophrenic hero searches hidden mysteries gnostic christianity reality fired brain pink laser beam unknown divine origin accompanied dogmatic dismissively atheist friend assorted odd characters divine invasion god invades earth making young woman pregnant returns star system terminally ill assisted dead man brain wired hour listening music margaret atwood handmaid tale story based premise congress mysteriously assassinated fundamentalists charge nation set book diary woman life live christian theocracy women property revoked bank accounts closed sinful luxuries outlawed radio readings bible crimes punished retroactively doctors performed legal abortions hunted hanged atwood writing style difficult tale grows chilling authors bible dull rambling work criticized worth reading ll fuss exists versions true version books fiction peter de rosa vicars christ bantam press de rosa christian catholic enlighting history papal immoralities adulteries fallacies german translation gottes erste diener die dunkle 
seite des papsttums droemer knaur michael martin atheism philosophical justification temple university press philadelphia usa detailed scholarly justification atheism outstanding appendix defining terminology usage tendentious area argues negative atheism belief existence god positive atheism belief existence god includes refutations challenging arguments god attention paid refuting contempory theists platinga swinburne isbn hardcover paperback case christianity temple university press comprehensive critique christianity considers contemporary defences christianity ultimately demonstrates unsupportable incoherent isbn james turner god creed johns hopkins university press baltimore md usa subtitled origins unbelief america examines unbelief agnostic atheistic mainstream alternative view focusses period considering france britain emphasis american england developments religious history secularization atheism god creed intellectual history fate single idea belief god exists isbn hardcover paper george seldes editor thoughts ballantine books york usa dictionary quotations kind concentrating statements writings explicitly implicitly person philosophy view includes obscure suppressed opinions popular observations traces expressed twisted idea centuries number quotations derived cardiff men religion noyes views religion isbn paper richard swinburne existence god revised edition clarendon paperbacks oxford book second volume trilogy began coherence theism concluded faith reason work swinburne attempts construct series inductive arguments existence god arguments tendentious rely imputation late century western christian values aesthetics god supposedly simple conceived decisively rejected mackie miracle theism revised edition existence god swinburne includes appendix incoherent attempt rebut mackie mackie miracle theism oxford posthumous volume comprehensive review principal arguments existence god ranges classical philosophical positions descartes anselm berkeley hume al 
moral arguments newman kant sidgwick restatements classical theses plantinga swinburne addresses positions push concept god realm rational kierkegaard kung philips replacements god lelie axiarchism book delight read formalistic written martin works refreshingly direct compared hand waving swinburne james haught holy horrors illustrated history religious murder madness prometheus books religious persecution ancient times christians library congress catalog card number norm allen jr african american humanism anthology listing african americans humanism gordon stein anthology atheism rationalism prometheus books anthology covering wide range subjects including devil evil morality history freethought comprehensive bibliography edmund cohen mind bible believer prometheus books study christian fundamentalists net resources small mail based archive server mantis uk carries archives alt atheism moderated articles assorted files send mail archive uk send atheism mail reply mathew ?

Each line represents one document, and the words on a line are that document's words. Since this is a bag-of-words model, word order does not affect the result.

The 126 on the first line means there are 126 documents.
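This input format — the document count on the first line, then one space-separated bag-of-words document per line — could be produced with a helper along these lines (`TrainScaleWriter` and `format` are names invented for this sketch):

```java
import java.util.ArrayList;
import java.util.List;

public class TrainScaleWriter {
    /** Builds the JGibbsLDA input format: the first line is the number of
     *  documents, each following line is one document's space-separated words.
     *  Persist the result with Files.write(path, lines). */
    public static List<String> format(List<String> docs) {
        List<String> lines = new ArrayList<>();
        lines.add(Integer.toString(docs.size()));  // e.g. "126" in this experiment
        lines.addAll(docs);
        return lines;
    }
}
```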

 

Next, feed this training file to LDA. The main code:

	public void lda() {
		LDACmdOption ldaOption = new LDACmdOption();
		ldaOption.est = true;
		ldaOption.K = 100;                             // number of topics
		ldaOption.beta = 0.1;                          // beta hyperparameter
		ldaOption.alpha = 10.0 / ldaOption.K;          // alpha hyperparameter
		ldaOption.niters = 500;                        // number of iterations
		ldaOption.savestep = 200;                      // save the model every 200 iterations
		ldaOption.modelName = "model-train";           // model name
		ldaOption.dir = "D:\\J2ee_workspace\\LDATest"; // directory containing the training file
		ldaOption.dfile = "trainScale";                // training file

		Estimator estimator = new Estimator();
		estimator.init(ldaOption);
		estimator.estimate();                          // start parameter estimation
	}
The parameters are all commented in the code above. The resulting model-final.theta looks like this (only part of the file is shown):

1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.3045054945054945;0.002307692307692308;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.004505494505494505;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.5671428571428572;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.017692307692307695;0.0078021978021978015;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.002307692307692308;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;0.027582417582417584;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.02208791208791209;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0012087912087912088;1.0989010989010989E-4;0.01989010989010989;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;0.0078021978021978015;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;1.0989010989010989E-4;
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.563985837126961E-4;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.35058168942842693;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.0010622154779969652;5.563985837126961E-4;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.640414769853313;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;0.0010622154779969652;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.563985837126961E-4;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.0581689428426914E-5;5.563985837126961E-4;5.0581689428426914E-5;

Note that I modified the JGibbsLDA code slightly so that its output matches the input format my neural-network classifier expects. The first 20 fields of each line encode the class label: the position holding a 1 marks the class, so the line above belongs to class 1. The fields after the first 20 are the document's topic distribution; since I used 100 topics, there are 100 numbers.
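A hypothetical parser for this modified line format (20 one-hot label fields followed by the topic distribution, all separated by `;`) might look like this; `ThetaLine` is a name invented for this sketch:

```java
public class ThetaLine {
    public final int label;        // index of the 1 among the first numClasses fields
    public final double[] theta;   // the topic probabilities that follow

    /** Parses one line of the modified theta file: numClasses one-hot label
     *  fields, then the topic distribution, all separated by ';'. */
    public ThetaLine(String line, int numClasses) {
        String[] parts = line.split(";");
        int lab = -1;
        for (int i = 0; i < numClasses; i++) {
            if (Double.parseDouble(parts[i]) == 1.0) lab = i;
        }
        this.label = lab;
        this.theta = new double[parts.length - numClasses];
        for (int i = numClasses; i < parts.length; i++) {
            theta[i - numClasses] = Double.parseDouble(parts[i]);
        }
    }
}
```

In the experiment above, numClasses would be 20 and the theta part 100-dimensional.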

 

With the LDA model trained on the training set, we can generate vectors for the test documents against that model. There are several ways to do this. The simplest is to throw the test documents back in with the training documents and re-run LDA from scratch; this is obviously time-consuming and not recommended, though it may be worth trying when the test set is large and the training set small. The usual approach is the second one: for new documents, generate their vectors on top of the model learned from the training documents. In practice this means running Gibbs sampling only over the new documents while keeping the model's topic-word distribution (twords) fixed. JGibbsLDA makes this easy:


	public void generateWithLDAModel() {
		LDACmdOption ldaOption = new LDACmdOption();
		ldaOption.inf = true;
		ldaOption.estc = false;
		ldaOption.dir = "D:\\J2ee_workspace\\LDATest";
		ldaOption.modelName = "model-final"; // model trained from the training documents; the model files must sit in this directory
		ldaOption.dfile = "testScale";       // test file

		Inferencer inferencer = new Inferencer();
		inferencer.init(ldaOption);
		Model newModel = inferencer.inference();
		newModel.saveModelTheta("./vector/test/testScale"); // where to save the newly generated document vectors
	}

The generated test document vector file looks like this (only a few lines shown):

1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;0.001765650080256822;0.004975922953451044;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.4158908507223114;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.07078651685393259;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.0033707865168539327;1.6051364365971107E-4;0.09486356340288925;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.38218298555377206;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.001765650080256822;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;1.6051364365971107E-4;0.006581059390048154;
1;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;0.06335379892555641;3.8372985418265546E-5;4.22102839600921E-4;0.24102072141212588;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;8.058326937835764E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;4.22102839600921E-4;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;0.6876822716807367;3.8372985418265546E-5;0.0011895625479662318;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;3.8372985418265546E-5;

These lines have the same meaning as the training document vectors above.

 

With these files in hand, the data can go into the JOONE neural-network classifier (a simple three-layer 100×300×20 BP network) for classification.

The classification results:

Of 121 test cases, 100 were classified correctly, an accuracy of roughly 83%. I find this acceptable: the result is likely no better than a simple tf-idf + SVM model, but the goal of this experiment was to explore whether LDA's dimensionality reduction is viable for classification, and with documents reduced to 100 dimensions, 83% is just about acceptable.
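The accuracy figure is simply the ratio of correctly classified test cases to the total; as a one-line sketch:

```java
public class Accuracy {
    /** Classification accuracy: correct predictions over total test cases. */
    public static double of(int correct, int total) {
        return (double) correct / total;  // 100 correct out of 121 ≈ 0.826
    }
}
```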



