[Repost] Roundup: LDA theory, variants, optimization, applications, toolkits

Original URL: http://site.douban.com/204776/widget/notes/12599608/note/287085506/


2013-07-08 19:22:18


http://www.douban.com/note/287085419/

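Enough said: these past few days I have been completely obsessed.
My own LDA framework is now in order too; once I work through everything here one more time, it should all finally click into place!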

#LDA Theory
-- Roundup of topic model papers
http://site.douban.com/204776/widget/notes/12599608/note/286839088/
##Survey:
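1. 基於文檔主題結構的關鍵詞抽取方法研究 (Research on Keyword Extraction Methods Based on Document Topic Structure)
Liu Zhiyuan's doctoral thesis; as I recall, he was the author of the Weibo keyword application at the time.
It also proposes some improved methods for short texts.
2. Parameter estimation for text analysis
This one is definitely a heavyweight.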


##Short-Text:
1. Automatic Keyphrase Extraction by Bridging Vocabulary Gap

##Practice / In Action (especially in Chinese)
1. A new method of N-gram statistics for large number of n and automatic extraction of words and phrases from large text data of Japanese 
2. A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora
Statistical Substring Reduction in Linear Time
3. The Mathematics of Statistical Machine Translation: Parameter Estimation

##Anecdote:
LDA數學八卦 (LDA Math Gossip)
Written by rickjin and serialized on 統計之都 (Capital of Statistics).
http://vdisk.weibo.com/s/qghK5

##LDA variation:
A remarkably capable woman researcher has recently summarized all kinds of LDA variants
in two of her recent papers:
1. On the design of LDA models for aspect-based opinion mining
2. The FLDA model for aspect-based opinion mining: addressing the cold start problem (WWW'13)



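##Bundle of almost all the LDA papers I have read
Some of them are annotated with highlights (-noted):
It includes some of the papers mentioned above, but far more besides.
You can go straight to the noted folder inside, since I consider the ones I did not annotate not useful.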
http://vdisk.weibo.com/s/BA3xC

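#LDA Optimization
-- Roundup of papers on optimized LDA implementations
http://site.douban.com/204776/widget/notes/12599608/note/286923972/
I find these more valuable in practice: the number of documents is sometimes very large, so implementation-level optimization becomes necessary.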

Fast inference algorithms:
Fast Collapsed Gibbs Sampling For Latent Dirichlet Allocation

Online learning:
Online Learning for Latent Dirichlet Allocation
http://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf
http://videolectures.net/nips2010_hoffman_oll/
www.ece.duke.edu/~lcarin/Lingbo4.15.2011.pptx

Inference algorithms for text streams:
Topic models over text streams: a study of batch and online unsupervised learning
Efficient Methods for Topic Model Inference on Streaming Document Collections

Distributed learning:
Distributed Inference for Latent Dirichlet Allocation
PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing

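#LDA Applications
-- LDA variants for applications
http://site.douban.com/204776/widget/notes/12599608/note/286930572/
A few LDA variants for different applications; each makes subtle adjustments, and each brings new problems.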

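##Sentiment analysis
Opinion Integration Through Semi-supervised Topic Modeling
Takes the traditional topic model, a typical unsupervised method, and extends it to a semi-supervised one by adding prior information to the model. For some automobile products, descriptions of each product feature are extracted from Wikipedia and then trained into priors.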

Jointly Modeling Aspects and Opinions with a MaxEnt-LDA Hybrid
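Jointly extracts topics and opinions. It introduces a supervised learning method to distinguish topic words from sentiment words, and then uses LDA for clustering.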



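##Academic mining
For example author modeling, which also appears at KDD 2013 this year, or detection of academic hot topics...
The author-topic model for authors and documents
Models authors and topics simultaneously. Each word is attributed to a single author taken from the document's author list, each author is a distribution over topics, and the author-topic distribution replaces the document-topic distribution.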
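
The generative story of the author-topic model can be sketched in a few lines. This is only an illustrative toy, not code from the paper: the function name `generate_document`, the author names, and the hand-written distributions are all assumptions made for the example. For every word, one of the document's authors is chosen uniformly, a topic is drawn from that author's topic distribution, and a word is drawn from that topic's word distribution.

```python
import random


def generate_document(authors, author_topic, topic_word, length, rng):
    """Sample (author, topic, word) triples under the author-topic story.

    authors      -- list of author names for this document
    author_topic -- dict: author name -> list of topic probabilities
    topic_word   -- list (one entry per topic) of word probabilities
    """
    tokens = []
    num_topics = len(topic_word)
    for _ in range(length):
        # Each word is attributed to one author, chosen uniformly.
        a = rng.choice(authors)
        # The topic comes from the author's distribution, not the document's.
        k = rng.choices(range(num_topics), weights=author_topic[a])[0]
        # The word comes from the chosen topic's word distribution.
        w = rng.choices(range(len(topic_word[k])), weights=topic_word[k])[0]
        tokens.append((a, k, w))
    return tokens


# Hypothetical toy parameters: two authors, two topics, four word ids.
AUTHOR_TOPIC = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
TOPIC_WORD = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

if __name__ == "__main__":
    sample = generate_document(
        ["alice", "bob"], AUTHOR_TOPIC, TOPIC_WORD, 10, random.Random(0)
    )
    for author, topic, word in sample:
        print(author, topic, word)
```

Because the toy distributions are deliberately degenerate, every word attributed to "alice" falls in topic 0 and every word attributed to "bob" in topic 1, which makes the author-to-topic link easy to see in the output.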

Joint latent topic models for text and citations
Models topics and citations simultaneously, establishing citation links.

Detecting Topic Evolution in Scientific Literature: How Can Citations Help?
Builds a topic evolution model using citation information.



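##Social media topics

There is far too much research on Twitter, and the SNA section of this site has already summarized a lot of it, so I will not write more here.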

#LDA Toolkits
-- LDA toolkits
http://site.douban.com/204776/widget/notes/12599608/note/287084873/
(This section is still missing R; I will comment on it once I have used it myself.)



First, a link to a nicely formatted list (though it is incomplete):
http://mengjunxie.github.io/ae-lda/topic-modeling.html
 





####Latent Dirichlet allocation
http://www.cs.princeton.edu/~blei/lda-c/
This is a C implementation of variational EM for latent Dirichlet allocation (LDA), a topic model for text or other discrete data. LDA allows you to analyze a corpus and extract the topics that combined to form its documents. For example, the project page links to the topics estimated from a small corpus of Associated Press documents. LDA is fully described in Blei et al. (2003).



####Discrete Component Analysis
http://www.nicta.com.au/people/buntinew/discrete_component_analysis
The Discrete Component Analysis (DCA) software is being developed as a stand-alone package, and as a plug-in to the Elefant system, a machine learning toolbox from NICTA. Currently the software is being run in stand-alone mode using the data streaming libraries from the older and now unsupported MPCA system, developed at Helsinki Institute for IT. The software itself is written in the C language and compiles on a Linux and a Mac OS X environment.

The models presented here are known under many names, such as latent Dirichlet allocation, multi-aspect models, multinomial PCA, and non-negative matrix factorisation.



####Infinite LDA
http://www.arbylon.net/projects/knowceans-ilda/readme.txt
https://bitbucket.org/gchrupala/colada/wiki/Resources
Implementations of Latent Dirichlet Allocation (LDA) and
Hierarchical Dirichlet Processes (HDP)

@author Gregor Heinrich, gregor :: arbylon : net
@version 0.96
@date 1 Mar 2011 

 - History: ILDA version 0.1: May 2008, LDA version 0.1: Feb. 2005, based
   on http://arbylon.net/projects/LdaGibbsSampler.java 

 - Simple implementations of Gibbs sampling for LDA and HDP
 
 - Scientific documentation: see texts lda.pdf and ilda.pdf
 
 - Technical documentation: see Javadoc and source (packages *.corpus and 
   *.utils are from knowceans-tools on SourceForge)
 
 - Data documentation: see nips/readme.txt including source references
 
 - License: All code is licensed under GPL v3.0. 
 
 - If the code is used in scientific work, please refer to its source
   via the URL: 
   
   http://arbylon.net/projects/knowceans-ilda.zip
           
   or the documentation of the ILDA or LDA implementations:
   
   G. Heinrich. "Infinite LDA" -- implementing the HDP with minimum code
   complexity. TN2011/1, http://arbylon.net/publications/ilda.pdf, 2011
   
   G. Heinrich. Parameter estimation for text analysis. Technical report,
   No. 09RP008-FIGD, Fraunhofer IGD, 2009 
 
TODO:

 - Diverse checks, e.g., Antoniak distribution sampling, hyperparameter
   estimators, general quantitative validation of HDP model

 - Output formatting
 
 - Visual matrix implementation for HDP / IldaGibbs
 





####MAchine Learning for LanguagE Toolkit
http://mallet.cs.umass.edu/
MALLET is open source software. For research use, please remember to cite MALLET.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
MALLET includes sophisticated tools for document classification: efficient routines for converting text to "features", a wide variety of algorithms (including Naïve Bayes, Maximum Entropy, and Decision Trees), and code for evaluating classifier performance using several commonly used metrics. 





####Multithreaded LDA
https://sites.google.com/site/rameshnallapati/software
Multithreaded extension of Blei's LDA implementation, written in C by Ramesh Nallapati. Speeds up the computation by orders of magnitude depending on the number of processors.





####GibbsLDA++: A C/C++ Implementation of Latent Dirichlet Allocation
http://gibbslda.sourceforge.net/
GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling technique for parameter estimation and inference. It is very fast and is designed to analyze hidden/latent topic structures of large-scale datasets including large collections of text/Web documents. LDA was first introduced by David Blei et al [Blei03]. There have been several implementations of this model in C (using Variational Methods), Java, and Matlab. We decided to release this implementation of LDA in C/C++ using Gibbs Sampling to provide an alternative to the topic-model community.

GibbsLDA++ is useful for the following potential application areas:

Information retrieval and search (analyzing semantic/latent topic/concept structures of large text collection for a more intelligent information search).
Document classification/clustering, document summarization, and text/web mining community in general.
Content-based image clustering, object recognition, and other applications of computer vision in general.
Other potential applications in biological data.
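
The Gibbs sampling approach named in the description above can be illustrated with a minimal collapsed Gibbs sampler. The sketch below is not GibbsLDA++ code; it is a toy pure-Python version that assumes documents are given as lists of integer word ids. The function name `lda_gibbs` and the default values of `alpha`, `beta`, and `iters` are illustrative choices, not settings taken from any of the tools listed here.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    ndk = [[0] * num_topics for _ in docs]
    nkw = [[0] * vocab_size for _ in range(num_topics)]
    nk = [0] * num_topics

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Full conditional p(z = t | everything else), unnormalized.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta)
                    / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and put the token back.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    # Point estimates of topic-word (phi) and document-topic (theta).
    phi = [[(nkw[k][w] + beta) / (nk[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(num_topics)]
    theta = [[(ndk[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in range(num_topics)] for d, doc in enumerate(docs)]
    return phi, theta


if __name__ == "__main__":
    toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 1], [3, 2, 2, 3]]
    phi, theta = lda_gibbs(toy_docs, num_topics=2, vocab_size=4)
    for row in theta:
        print([round(p, 2) for p in row])
```

The sampler keeps only count tables and resamples one token at a time, which is why this family of implementations is simple to write. For large corpora the cost of the inner loop is what the optimized and parallel samplers aim to reduce.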







####Gensim
http://radimrehurek.com/gensim/
Gensim is a free Python library for scalable statistical semantics. It analyzes plain-text documents for semantic structure and retrieves semantically similar documents.






####Stanford Topic Modeling Toolbox
http://nlp.stanford.edu/software/tmt/tmt-0.4/
The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox features the ability to:

Import and manipulate text from cells in Excel and other spreadsheets.
Train topic models (LDA, Labeled LDA, and PLDA) to create summaries of the text.
Select parameters (such as the number of topics) via a data-driven process.
Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data.
The Stanford Topic Modeling Toolbox was written at the Stanford NLP group by Daniel Ramage and Evan Rosen, and was first released in September 2009.




####Matlab Topic Modeling Toolbox 1.4
http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
Installation & Licensing

Download the zipped toolbox (18Mb) from the page above.
NOTE: this toolbox now works with 64 bit compilers. The old version of the toolbox, with the code for 32 bit compilers, is offered as a separate download.
 
The program is free for scientific use. Please contact the authors, if you are planning to use the software for commercial purposes. The software must not be further distributed without prior permission of the author. By using this software, you are agreeing to this license statement.
 
Type 'help function' at command prompt for more information on each function
 
The toolbox's notes on data format describe the input and output formats for the different topic models.
 
Note for Mac and Linux users: some of the Matlab functions are implemented with mex code (C code linked to Matlab). For Windows-based platforms, the DLLs are already provided in the distribution package. For other platforms, please compile the mex functions by executing "compilescripts" at the Matlab prompt.

#
Last of all,
here is a Topic Modeling Bibliography:
http://www.cs.princeton.edu/~mimno/topics.html