Latent Semantic Analysis (LSA) Tutorial (6)

   
   
   
WangBen 20110916 Beijing


Part 4 - Clustering by Color

We can also turn the numbers into colors. For instance, here is a color display that corresponds to the first 3 dimensions of the Titles matrix that we showed above. It contains exactly the same information, except that blue shows negative numbers, red shows positive numbers, and numbers close to 0 are white. For example, Title 9, which is strongly positive in all 3 dimensions, is also strongly red in all 3 dimensions.
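The color rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `color_for` and the `white_band` cutoff for "close to 0" are assumptions, since the tutorial gives no threshold, and the real display presumably shades colors by magnitude rather than returning three fixed names. The sample row is the word "book", whose values are quoted in Part 5; the same rule would apply to each row of the Titles matrix.

```python
def color_for(value, white_band=0.05):
    """Map a number to a display color.

    Positive values are red, negative values are blue, and values
    within white_band of zero are white. white_band is a hypothetical
    cutoff; the tutorial does not state one.
    """
    if abs(value) <= white_band:
        return "white"
    return "red" if value > 0 else "blue"


# Dimension values for the word "book", as quoted in Part 5 of this tutorial.
book = (0.15, -0.27, 0.04)
book_colors = [color_for(v) for v in book]
print(book_colors)  # ['red', 'blue', 'white']
```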

We can use these colors to cluster the titles. We ignore the first dimension for clustering because all titles are red. In the second dimension, we have the following result.

Dim2    Titles
red     6-7, 9
blue    1-5, 8

Using the third dimension, we can split each of these groups again the same way. For example, looking at the third dimension, title 6 is blue, but title 7 and title 9 are still red. Doing this for both groups, we end up with these 4 groups.

Dim2    Dim3    Titles
red     red     7, 9
red     blue    6
blue    red     2, 4-5, 8
blue    blue    1, 3
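The two-step split can also be expressed as grouping titles by their pair of colors in dimensions 2 and 3. The sketch below is one possible way to do that; the per-title colors are transcribed from the two tables in this section, while the names `title_colors` and `cluster_by_color` are invented for the example.

```python
from collections import defaultdict

# (Dim2 color, Dim3 color) for each title, transcribed from the tables above.
title_colors = {
    1: ("blue", "blue"),
    2: ("blue", "red"),
    3: ("blue", "blue"),
    4: ("blue", "red"),
    5: ("blue", "red"),
    6: ("red", "blue"),
    7: ("red", "red"),
    8: ("blue", "red"),
    9: ("red", "red"),
}


def cluster_by_color(colors):
    """Group titles that share the same (Dim2, Dim3) color pair."""
    groups = defaultdict(list)
    for title in sorted(colors):
        groups[colors[title]].append(title)
    return dict(groups)


clusters = cluster_by_color(title_colors)
for (dim2, dim3), titles in clusters.items():
    print(dim2, dim3, titles)
```

Because the first dimension is red for every title, adding it to the key would not change the groups.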

It's interesting to compare this table with what we get when we graph the results in the next section.


Part 5 - Clustering by Value

Leaving out the first dimension, as we discussed, let's graph the second and third dimensions using an XY graph. We'll put the second dimension on the X axis and the third dimension on the Y axis and graph each word and title. It's interesting to compare the XY graph with the table we just created that clusters the documents.

In the graph below, words are represented by red squares and titles are represented by blue circles. For example, the word "book" has dimension values (0.15, -0.27, 0.04). We ignore the first dimension value 0.15 and plot "book" at position (x = -0.27, y = 0.04), as can be seen in the graph. Titles are graphed in the same way.
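As a rough sketch of this plotting step, the code below drops the first dimension and uses the remaining two values as (x, y). The helper names `to_xy` and `plot_points` are made up for this example, and the plotting function assumes matplotlib is installed; it is imported only when the function is called. Only the "book" values come from the tutorial.

```python
def to_xy(dims):
    """Drop the first dimension; use the second as x and the third as y."""
    _, x, y = dims
    return (x, y)


def plot_points(words, titles):
    """Plot words as red squares and titles as blue circles.

    words and titles map a label to its 3 dimension values.
    Assumes matplotlib is available; it is imported lazily here.
    """
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, dims in words.items():
        x, y = to_xy(dims)
        ax.plot(x, y, "rs")  # red square
        ax.annotate(str(label), (x, y))
    for label, dims in titles.items():
        x, y = to_xy(dims)
        ax.plot(x, y, "bo")  # blue circle
        ax.annotate(str(label), (x, y))
    ax.set_xlabel("Dimension 2")
    ax.set_ylabel("Dimension 3")
    return fig


# The word "book" from the text: the first value 0.15 is ignored.
book_xy = to_xy((0.15, -0.27, 0.04))
print(book_xy)  # (-0.27, 0.04)
```

Calling `plot_points({"book": (0.15, -0.27, 0.04)}, {})` would draw the single known point; the remaining rows would come from the word and title matrices shown earlier in the tutorial.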


One advantage of this technique is that both words and titles are placed on the same graph. Not only can we identify clusters of titles, but we can also label the clusters by looking at which words are also in the cluster. For example, the lower left cluster has titles 1 and 3, which are both about stock market investing. The words "stock" and "market" are conveniently located in the cluster, making it easy to see what the cluster is about. Another example is the middle cluster, which has titles 2, 4, 5, and, to a somewhat lesser extent, title 8. Titles 2, 4, and 5 are close to the words "value" and "investing", which summarize those titles quite well.
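The tutorial labels clusters by eye. One way to automate the idea, offered here only as an assumption and not as the author's method, is to take the centroid of a cluster's titles and pick the words closest to it. The function names and every coordinate below are made-up placeholders for illustration, not values from the tutorial's matrices.

```python
import math


def centroid(points):
    """Return the mean (x, y) of a list of 2-D points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def nearest_words(cluster_points, word_points, k=2):
    """Return the k words closest to the centroid of the cluster."""
    center = centroid(cluster_points)
    ranked = sorted(word_points, key=lambda w: math.dist(center, word_points[w]))
    return ranked[:k]


# Toy coordinates, invented purely for illustration.
toy_titles = [(-0.5, -0.4), (-0.4, -0.5)]
toy_words = {
    "word_a": (-0.45, -0.45),
    "word_b": (-0.3, -0.5),
    "word_c": (0.6, 0.2),
}
labels = nearest_words(toy_titles, toy_words, k=2)
print(labels)  # ['word_a', 'word_b']
```

With real coordinates, the title points would be the (x, y) positions of the titles in one cluster, and the word points would be the positions of all indexed words.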

