Reading Notes on "Mining Text Data": Chapter 1, An Introduction to Text Mining

Original post

sunfoot001

2020-02-23 16:36

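This is a thick English-language e-book on text mining. With a large English tome like this, it is easy to forget earlier material while reading on.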

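1. An Introduction to Text Mining

1.1 Introduction
Three questions about text mining:
a. What are the main algorithmic models? How do they differ from other kinds of data mining?
b. What tools and techniques are available? (Models are the abstract side; techniques are the concrete side.)
c. What are the key application areas?

Characteristics of text mining:
a. Text data is high-dimensional and sparse.
b. Text data can be analyzed at multiple levels: words, sentences, passages, and text collections.
Semantic representations of the text are useful, e.g. named entity recognition (NER).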
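
To make the "high-dimensional and sparse" point concrete, here is a minimal sketch in plain Python. It is not from the book: the sample sentences, the whitespace tokenizer, and the helper names `bag_of_words` and `sparsity` are all assumptions chosen for illustration.

```python
from collections import Counter


def bag_of_words(docs):
    """Build a sorted vocabulary and one term-count vector per document."""
    # Naive whitespace tokenization (an assumption; real systems do more).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for term, count in Counter(toks).items():
            vec[index[term]] = count
        vectors.append(vec)
    return vocab, vectors


def sparsity(vectors):
    """Return the fraction of matrix entries that are zero."""
    total = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / total if total else 0.0


# Hypothetical sample documents.
docs = [
    "text mining finds patterns in text",
    "clustering groups similar documents",
    "topic models describe document collections",
]
vocab, vectors = bag_of_words(docs)
print("dimensions:", len(vocab))
print("fraction of zero entries:", round(sparsity(vectors), 2))
```

Even with three short sentences, every distinct word adds a dimension and most entries of each vector are zero. With a realistic vocabulary the effect is far stronger, which is why sparse representations are the usual choice.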

1.2 算法
This section introduces the topics covered by text mining and their algorithms.
a. Information Extraction from Text Data:
   Information Extraction is one of the key problems of text mining, which serves as a starting
   point for many text mining algorithms.

b. Text Summarization:
   Another common function needed in many text mining applications is to summarize the text documents.

c. Unsupervised Learning Methods from Text Data:
The two main unsupervised learning methods commonly used in the context of text data are clustering and topic modeling.

d. LSI and Dimensionality Reduction for Text Mining:
Representing the underlying data in a compressed format for indexing and retrieval.
This is somewhat similar to text summarization.
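
As a rough illustration of the idea behind LSI, the sketch below projects documents onto a single latent direction of a tiny document-term matrix. It is an assumption-laden toy, not the book's method: the matrix values are made up, only the top singular direction is computed, and plain power iteration stands in for the full SVD routine a real implementation would use.

```python
import math


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    """Return the transpose of A as a list of rows."""
    return [list(col) for col in zip(*A)]


def top_singular_direction(A, iterations=100):
    """Approximate the top right singular vector and singular value of A
    by power iteration on A^T A."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in matvec(A, v)))
    return v, sigma


# Hypothetical document-term counts: rows are documents, columns are terms.
# Documents 0 and 1 share terms 0 and 1; document 2 uses terms 2 and 3.
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 1],
]
v, sigma = top_singular_direction(A)
coords = matvec(A, v)  # one-dimensional latent coordinate per document
print("singular value:", round(sigma, 4))
print("document coordinates:", [round(c, 4) for c in coords])
```

The two documents that share vocabulary receive the same latent coordinate, while the unrelated one lands near zero. Keeping more than one direction would give the usual low-rank LSI representation.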

e. Supervised Learning Methods for Text Data:

f. Transfer Learning with Text Data:
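   Where it is useful: For example, labeled English documents are copious and easy to find. On the other hand, it is much
   harder to obtain labeled Chinese documents. With English resources such as entity databases being so open, there is indeed a great opportunity to transfer them to Chinese.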

g. Probabilistic Techniques for Text Mining:

h. Mining Text Streams:
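Text data arrives as an input stream, much like an audio stream, and must be processed continuously on-line; traditional off-line batch processing no longer applies.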
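
The contrast between on-line and batch processing can be sketched as follows. This is only an illustration under stated assumptions: `StreamingTermCounter` and `document_stream` are hypothetical names, the sample sentences are made up, and running term counts stand in for whatever model a real stream-mining method would maintain.

```python
from collections import Counter


class StreamingTermCounter:
    """Keeps running term counts, updated one document at a time."""

    def __init__(self):
        self.counts = Counter()
        self.docs_seen = 0

    def update(self, doc):
        """Fold a single document into the running statistics."""
        self.counts.update(doc.lower().split())
        self.docs_seen += 1

    def top(self, k):
        """Return the k most frequent terms seen so far."""
        return self.counts.most_common(k)


def document_stream():
    """Stand-in for an unbounded source (hypothetical sample data)."""
    yield "stream mining handles text as it arrives"
    yield "batch mining needs the whole text collection first"
    yield "stream processing updates the model per document"


counter = StreamingTermCounter()
for doc in document_stream():
    # Each document is processed once and then discarded; the full
    # collection is never held in memory.
    counter.update(doc)

print(counter.docs_seen, counter.top(4))
```

The state is usable after every `update` call, so results are available at any point in the stream rather than only after a complete pass over a stored collection.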

i. Cross-Lingual Mining of Text Data:

j. Text Mining in Multimedia Networks:

k. Text Mining in Social Media:

l. Opinion Mining from Text Data:
This is the most common application.

m. Text Mining from Biomedical Data:
This is an application in a specialized domain.

1.3 Future Directions
a. Scalable and robust methods for natural language understanding:
Many current NLP methods are hard to scale to multiple domains, and supervised learning demands too much training data.

b. Domain adaptation and transfer learning
This also addresses the lack of training data for supervised learning.

c. Contextual analysis of text data:

d. Parallel text mining:
