Latent Semantic Analysis (LSA) Tutorial, Part 7

WangBen 20110916 Beijing


Advantages, Disadvantages, and Applications of LSA


Latent Semantic Analysis has many nice properties that make it widely applicable to many problems.

1.    First, the documents and words end up being mapped to the same concept space. In this space we can cluster documents, cluster words, and most importantly, see how these clusters coincide so we can retrieve documents based on words and vice versa.

2.    Second, the concept space has vastly fewer dimensions compared to the original matrix. Not only that, but these dimensions have been chosen specifically because they contain the most information and least noise. This makes the new concept space ideal for running further algorithms such as testing different clustering algorithms.

3.    Last, LSA is an inherently global algorithm that looks at trends and patterns from all documents and all words, so it can find things that may not be apparent to a more locally based algorithm. It can also be usefully combined with a more local algorithm such as nearest neighbors to become more useful than either algorithm by itself.
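The shared concept space described in points 1 and 2 can be sketched with a truncated SVD on a toy term-document matrix. The vocabulary, counts, and choice of k=2 below are made up purely for illustration:

```python
import numpy as np

# Toy term-document count matrix (terms x documents); the vocabulary
# and counts here are invented for illustration.
terms = ["book", "investment", "bank", "river"]
A = np.array([
    [2, 0, 1],   # "book" counts in documents 0..2
    [0, 3, 1],   # "investment"
    [1, 2, 0],   # "bank"
    [0, 0, 2],   # "river"
], dtype=float)

# Truncated SVD: A ~ U_k S_k V_k^T keeps only the k strongest "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Terms and documents now live in the same k-dimensional concept space.
term_coords = U_k * s_k        # one row per term
doc_coords = Vt_k.T * s_k      # one row per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieve documents by a word: rank documents by cosine similarity
# to "investment" (row 1) in the shared concept space.
sims = [cosine(term_coords[1], d) for d in doc_coords]
```

Because both mappings share one space, the same `cosine` comparison also works the other way (ranking words against a document) or within one side (clustering documents against documents).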


There are a few limitations that must be considered when deciding whether to use LSA. Some of these are:

1.    LSA assumes a Gaussian distribution and the Frobenius norm, which may not fit all problems. For example, word counts in documents seem to follow a Poisson distribution rather than a Gaussian distribution.

2.    LSA cannot handle polysemy (words with multiple meanings) effectively. It assumes that the same word always means the same concept, which causes problems for words like "bank" whose meaning depends on the context in which they appear.

3.    LSA depends heavily on SVD, which is computationally intensive and hard to update as new documents appear. However, recent work has led to a new efficient algorithm which can update the SVD based on new documents in a theoretically exact sense.
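A common approximate workaround for the SVD-update cost in point 3 is "folding in": projecting a new document into the existing concept space without recomputing the SVD. A minimal sketch, using a made-up term-document matrix:

```python
import numpy as np

# Made-up term-document count matrix; only its shape and the existing
# truncated SVD matter for this sketch.
A = np.array([
    [2., 0., 1.],
    [0., 3., 1.],
    [1., 2., 0.],
    [0., 0., 2.],
])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

# A brand-new document arrives as a raw term-count vector over the
# same vocabulary (values invented for illustration).
d_new = np.array([1., 0., 0., 3.])

# Fold-in: d_hat = d^T U_k (S_k)^-1 places the new document in the
# existing concept space without touching U, S, or V.
d_hat = d_new @ U_k / s_k
```

Folding in costs only one matrix-vector product, but it leaves the concept space itself frozen, so quality drifts as many new documents accumulate; that drift is what exact incremental-SVD updates address.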


In spite of these limitations, LSA is widely used for finding and organizing search results, grouping documents into clusters, spam filtering, speech recognition, patent searches, automated essay evaluation, etc.


As an example, iMetaSearch uses LSA to map search results and words to a "concept" space. Users can then find which results are closest to which words and vice versa. The LSA results are also used to cluster search results together so that you save time when looking for related results.



