Latent Semantic Analysis (LSA) Tutorial: An Introduction to LSA, Part 1

 

Latent Semantic Analysis (LSA) Tutorial

Translated from: http://www.puffinwarellc.com/index.php/news-and-articles/articles/33.html

WangBen, 2011-09-16, Beijing

        

Introduction to Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA), also known as Latent Semantic Indexing (LSI), literally means analyzing documents to find their underlying meaning or concepts. If each word meant only one concept, and each concept was described by only one word, LSA would be easy, since there would be a simple one-to-one mapping from words to concepts.

 


[Figure: one-to-one mapping between words and concepts]

Unfortunately, this problem is difficult because English has different words that mean the same thing (synonyms), words with multiple meanings, and all sorts of ambiguities that obscure the concepts to the point where even people can have a hard time understanding.

 


 

[Figure: confused mapping between words and concepts]

For example, the word "bank", when used together with "mortgage", "loans", and "rates", probably means a financial institution. When used together with "lures", "casting", and "fish", it probably means a stream or river bank.

 


 

How Latent Semantic Analysis Works


 

Latent Semantic Analysis arose from the problem of how to find relevant documents from search words. The fundamental difficulty arises when we compare words to find relevant documents, because what we really want to do is compare the meanings or concepts behind the words. LSA attempts to solve this problem by mapping both words and documents into a "concept" space and doing the comparison in this space.

 

(Translator's note: in other words, LSA is a dimensionality-reduction technique.)

 

Since authors have a wide choice of words available when they write, the concepts can be obscured due to different word choices from different authors. This essentially random choice of words introduces noise into the word-concept relationship. Latent Semantic Analysis filters out some of this noise and also attempts to find the smallest set of concepts that spans all the documents.

 

(Translator's note: why the smallest set of concepts?)
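This part of the tutorial does not yet say how the mapping into a concept space is computed. A common choice, assumed here, is a truncated singular value decomposition (SVD) of the word-by-document count matrix. The sketch below uses NumPy on a hypothetical six-word, four-document matrix built from the "bank" example; the counts, the variable names, and the choice `k = 2` are all invented for illustration.

```python
import numpy as np

# Hypothetical word-by-document counts (rows: words, columns: documents).
# Documents 0 and 1 are about finance, documents 2 and 3 about fishing.
words = ["mortgage", "loans", "rates", "lures", "casting", "fish"]
counts = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)

k = 2  # number of concepts to keep; picked by hand for this toy data

# Truncated SVD: keep only the k largest singular values and their vectors.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
word_coords = U[:, :k] * s[:k]     # one k-dimensional row per word
doc_coords = Vt[:k, :].T * s[:k]   # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


raw_sim = cosine(counts[:, 0], counts[:, 1])         # compare raw word counts
concept_sim = cosine(doc_coords[0], doc_coords[1])   # compare in concept space
cross_sim = cosine(doc_coords[0], doc_coords[2])     # finance vs. fishing
```

On this toy data the two finance documents share only one word, so their raw-count similarity is 0.5, yet in the two-dimensional concept space their similarity is about 1.0. A finance document and a fishing document score about 0. Discarding the smaller singular values is what plays the role of filtering noise here.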

 

In order to make this difficult problem solvable, LSA introduces some dramatic simplifications.

1.     Documents are represented as "bags of words": the order of the words in a document is ignored, and only the number of times each word appears is kept.

2.     Concepts are represented as patterns of words that usually appear together in documents. For example "leash", "treat", and "obey" might usually appear in documents about dog training.

3.     Words are assumed to have only one meaning. This is clearly not the case (banks could be river banks or financial banks) but it makes the problem tractable.
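The first simplification can be sketched in a few lines of Python. The two documents and the helper name `bag_of_words` are hypothetical, and splitting on whitespace is a deliberately crude stand-in for real tokenization.

```python
from collections import Counter

# Hypothetical toy documents, invented for illustration.
docs = [
    "the dog must obey when the leash is on",
    "give the dog a treat when it will obey",
]


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(text.lower().split())


bags = [bag_of_words(d) for d in docs]

# Shared vocabulary across all documents, in a fixed (sorted) order.
vocab = sorted(set().union(*bags))

# Word-by-document count matrix: one row per word, one column per document.
count_matrix = [[bag[w] for bag in bags] for w in vocab]
```

Because only counts are kept, `bag_of_words("dog obey leash")` and `bag_of_words("leash obey dog")` are equal. A matrix like `count_matrix` is the usual starting point for the rest of the analysis.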

To see a small example of LSA, take a look at the next section.

 

(Translator's note on the third simplification: what drawbacks does assuming a single meaning per word introduce?)

 


 
