Programming Collective Intelligence Notes: Making Recommendations

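Recommendation is now a very common technology. When you shop online, Amazon recommends products you might be interested in, and movie and music sites recommend films or songs you might like. These notes look at how such recommendations are implemented.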

Collaborative Filtering

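In everyday life, the simplest way to get a recommendation is to ask friends, especially those you know have good taste or share your preferences. This does not always work, though, because what your friends know is limited. Everyone has agonized at some point over where to eat or which product is worth buying.

This is where a collaborative filtering algorithm comes in. A collaborative filtering algorithm usually works by searching a large group of people and finding a smaller set with tastes similar to yours.

In effect, this widens your circle of friends: the more people there are, the more information is available.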

Collecting Preferences

The first thing you need is a way to represent different people and their preferences.

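As described above, collaborative filtering has to find, among many people, those whose interests are close to yours. The first step is therefore to decide how to represent each person and their interests so that the data can be processed later.

The usual approach is to treat each person as a vector in which every interest (feature) is one dimension. All interests must be quantified, otherwise no computation is possible. For example, mark something you like a lot as 5 and something you find average as 3.

In Python, a dictionary is a convenient way to represent such a vector.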

critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
}

The dictionary above records how much Lisa Rose and Gene Seymour like each movie, on a scale from 1 to 5.
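As a small illustration of working with this structure, the sketch below looks up a single rating and lists the movies that both critics have rated. The helper name shared_items is hypothetical and does not come from the book.

```python
# Nested dictionary: critic -> {movie: rating}, as in the notes above.
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
}


def shared_items(prefs, person1, person2):
    """Return the sorted list of items rated by both people (hypothetical helper)."""
    return sorted(set(prefs[person1]) & set(prefs[person2]))


print(critics['Lisa Rose']['Lady in the Water'])  # 2.5
print(len(shared_items(critics, 'Lisa Rose', 'Gene Seymour')))  # 6
```

Only the shared items matter when two people are compared, which is why the similarity functions later in these notes start by computing this intersection.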

Finding Similar Users

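Each user taking part in collaborative filtering is now represented as a vector. The next question is how to find similar users among them.

Since users are vectors, finding similar users comes down to computing distances between vectors and picking the ones that are closest.

I’ll show you two systems for calculating similarity scores: Euclidean distance and Pearson correlation.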

Euclidean distance

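Euclidean distance is the straight-line distance between two points, which is easy to understand.

>>> from math import sqrt
>>> sqrt(pow(5-4,2)+pow(4-1,2))
3.1622776601683795

The code above computes the distance between the points (5,4) and (4,1).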

However, you need a function that gives higher values for people who are similar.

>>> 1/(1+sqrt(pow(5-4,2)+pow(4-1,2)))
0.2402530733520421
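The same idea can be wrapped in a function that works on the nested preference dictionary. This is a minimal sketch, assuming the similarity is defined as 1/(1 + distance) over the items both people have rated, as in the interactive example above. The toy data and the names A, B, C are made up.

```python
from math import sqrt


def sim_distance(prefs, person1, person2):
    """Euclidean-distance-based similarity in (0, 1]; 0.0 if nothing is shared."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared)
    # Invert the distance so that closer people get a higher score.
    return 1 / (1 + sqrt(sum_of_squares))


# Toy data mirroring the points (5,4) and (4,1) used above.
prefs = {
    'A': {'x': 5, 'y': 4},
    'B': {'x': 4, 'y': 1},
    'C': {'z': 3},
}
print(sim_distance(prefs, 'A', 'B'))  # same value as the interactive example
print(sim_distance(prefs, 'A', 'C'))  # 0.0, no items in common
```

Adding 1 to the denominator avoids a division by zero when two people give identical ratings, in which case the score is exactly 1.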

Pearson correlation

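Euclidean distance is simple, but it has a problem. Ratings are subjective and everyone applies a different standard: some people consistently rate high, others low. An absolute distance cannot account for this.

The Pearson correlation coefficient measures how well two sets of ratings fit a straight line. If one user's ratings rise and fall in step with another's, the two vectors are considered similar, even when one of the users rates consistently higher.

For example, the vectors (1,2) and (4,8) are far apart by Euclidean distance, but their Pearson correlation is 1, a perfect match.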
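A minimal sketch of a Pearson-based similarity over the nested preference dictionary is given below. It uses the standard definition (covariance divided by the product of the standard deviations) on the items both people have rated. The function name and the toy data are assumptions for illustration.

```python
from math import sqrt


def sim_pearson(prefs, person1, person2):
    """Pearson correlation over shared items; 0.0 if it cannot be computed."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    n = len(shared)
    if n == 0:
        return 0.0
    x = [prefs[person1][item] for item in shared]
    y = [prefs[person2][item] for item in shared]
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denominator = sqrt(var_x * var_y)
    if denominator == 0:
        return 0.0  # one person gave the same rating to everything
    return covariance / denominator


# Toy data: A and B are the vectors (1,2) and (4,8) from the text.
prefs = {
    'A': {'m1': 1, 'm2': 2},
    'B': {'m1': 4, 'm2': 8},
    'C': {'m1': 3, 'm2': 3},
}
print(sim_pearson(prefs, 'A', 'B'))  # 1.0
print(sim_pearson(prefs, 'A', 'C'))  # 0.0, C's ratings have no variance
```

Unlike the distance-based score, the result ranges from -1 to 1, where negative values mean the two people tend to disagree.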

There are many other functions such as the Jaccard coefficient or Manhattan distance that you can use as your similarity function.
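For comparison, here is a rough sketch of the two alternatives just mentioned. It assumes Manhattan distance is turned into a similarity with the same 1/(1 + distance) trick used for Euclidean distance, and that the Jaccard coefficient is taken over the sets of items each person has rated, ignoring the rating values. Both choices are assumptions made for this example, not the book's code.

```python
def sim_manhattan(prefs, person1, person2):
    """Similarity from the sum of absolute rating differences on shared items."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    distance = sum(abs(prefs[person1][item] - prefs[person2][item])
                   for item in shared)
    return 1 / (1 + distance)


def sim_jaccard(prefs, person1, person2):
    """Size of the intersection of rated items divided by the size of the union."""
    items1, items2 = set(prefs[person1]), set(prefs[person2])
    union = items1 | items2
    if not union:
        return 0.0
    return len(items1 & items2) / len(union)


# Made-up toy data.
prefs = {
    'A': {'x': 5, 'y': 4, 'z': 2},
    'B': {'x': 4, 'y': 1},
}
print(sim_manhattan(prefs, 'A', 'B'))  # 0.2
print(sim_jaccard(prefs, 'A', 'B'))    # 2 shared items out of 3 overall
```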

Ranking the Critics
Now that you have functions for comparing two people, you can create a function that scores everyone against a given person and finds the closest matches.
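One way to sketch such a ranking function is shown below. It scores every other person against the target and returns the best n matches. To keep the block self-contained it bundles a Euclidean-based similarity, but any function with the same signature could be passed in. The names and toy data are assumptions.

```python
from math import sqrt


def sim_distance(prefs, person1, person2):
    """Euclidean-distance-based similarity over shared items."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0.0
    total = sum((prefs[person1][i] - prefs[person2][i]) ** 2 for i in shared)
    return 1 / (1 + sqrt(total))


def top_matches(prefs, person, n=5, similarity=sim_distance):
    """Return the n (score, name) pairs most similar to person, best first."""
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort(reverse=True)
    return scores[:n]


# Made-up toy data.
prefs = {
    'Toby': {'a': 4, 'b': 1},
    'X': {'a': 4, 'b': 1},
    'Y': {'a': 1, 'b': 5},
    'Z': {'a': 4, 'b': 2},
}
print(top_matches(prefs, 'Toby', n=2))  # [(1.0, 'X'), (0.5, 'Z')]
```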

Recommending Items
Finding a good critic to read is great, but what I really want is a movie recommendation right now.

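By computing distances between vectors we can already find the users most similar to a given user. But the goal is to recommend movies, so what comes next?

Suppose the five similar users below have rated three movies: Night, Lady, and Luck. Let's see how to turn their ratings into recommendations.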

Critic          Similarity  Night  S.xNight  Lady  S.xLady  Luck  S.xLuck
Rose            0.99        3.0    2.97      2.5   2.48     3.0   2.97
Seymour         0.38        3.0    1.14      3.0   1.14     1.5   0.57
Puig            0.89        4.5    4.02                     3.0   2.68
LaSalle         0.92        3.0    2.77      3.0   2.77     2.0   1.85
Matthews        0.66        3.0    1.99      3.0   1.99
Total                              12.89           8.38           8.07
Sim. Sum                           3.84            2.95           3.18
Total/Sim. Sum                     3.35            2.83           2.53

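First, multiply each movie rating by the critic's similarity to get a weighted rating, e.g. Similarity * Night = S.xNight. This gives the ratings of more similar users a higher weight.

Adding up the weighted ratings from all users gives a total score. Using this total directly as the basis for recommendation would favor movies that were rated by more users, so instead divide the total by the sum of the similarities of the users who rated that movie. The result, Total/Sim. Sum, is used to rank the recommendations.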

Not only do you get a ranked list of movies, but you also get a guess at what my rating for each movie would be.
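The weighted-average calculation from the table can be sketched as follows. The function name weighted_recommendations and its two-dictionary interface are assumptions made for this example. The similarity scores and ratings are copied from the table, so the results should land close to its Total/Sim. Sum row, with small differences caused by the rounded similarity values.

```python
def weighted_recommendations(similarities, ratings):
    """Rank items by similarity-weighted average rating.

    similarities: {critic: similarity to the target user}
    ratings:      {critic: {item: rating}}
    """
    totals = {}    # item -> sum of similarity * rating
    sim_sums = {}  # item -> sum of similarities of the critics who rated it
    for critic, sim in similarities.items():
        if sim <= 0:
            continue  # ignore critics who are not similar at all
        for item, rating in ratings.get(critic, {}).items():
            totals[item] = totals.get(item, 0.0) + sim * rating
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    ranked = [(totals[item] / sim_sums[item], item) for item in totals]
    ranked.sort(reverse=True)
    return ranked


similarities = {'Rose': 0.99, 'Seymour': 0.38, 'Puig': 0.89,
                'LaSalle': 0.92, 'Matthews': 0.66}
ratings = {
    'Rose': {'Night': 3.0, 'Lady': 2.5, 'Luck': 3.0},
    'Seymour': {'Night': 3.0, 'Lady': 3.0, 'Luck': 1.5},
    'Puig': {'Night': 4.5, 'Luck': 3.0},
    'LaSalle': {'Night': 3.0, 'Lady': 3.0, 'Luck': 2.0},
    'Matthews': {'Night': 3.0, 'Lady': 3.0},
}
for score, movie in weighted_recommendations(similarities, ratings):
    print(f"{movie}: {score:.2f}")  # Night: 3.35, Lady: 2.83, Luck: 2.53
```

In a full recommender the loop would also skip items the target user has already rated. That step is left out here because the table only lists unseen movies.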

That completes a recommender system. The users and movies can be swapped for any other kind of object to build all sorts of recommenders.

Item-Based Filtering

The way the recommendation engine has been implemented so far requires the use of all the rankings from every user in order to create a dataset. This will probably work well for a few thousand people or items, but a very large site like Amazon has millions of customers and products—comparing a user with every other user and then comparing every product each user has rated can be very slow.

The approach described so far works fine for small datasets, but for a large dataset such as Amazon's it becomes slow, because similarities between pairs of objects have to be computed on every request. This approach is known as user-based collaborative filtering. An alternative is known as item-based collaborative filtering. In cases with very large datasets, item-based collaborative filtering can give better results, and it allows many of the calculations to be performed in advance so that a user needing recommendations can get them more quickly.

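Assume the recommender's job is always to recommend items to a user. The approach above first finds the set of users similar to that user and then recommends the items those users like.

But items have similarities among themselves as well. If we compute the set of similar items for every item in advance, then recommending to a user only requires looking at the items similar to those the user already likes.

The rationale is that comparisons between items will not change as often as comparisons between users.

A user's interests can keep shifting, so relationships between users change constantly, whereas relationships between things are relatively stable. The relationship between two movies, for example, is fairly objective.

So how is the similarity between items computed? Earlier we computed similarity between users. We can transpose that matrix so that each item becomes a vector whose dimensions are the users' ratings, and then compute item similarity in the same way.

With this method, item similarities shift frequently while there are still few user ratings, but once the number of ratings is large enough they become stable. Item similarity can also be computed in other ways. For movies, for instance, you could compare the similarity of synopses or reviews.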
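The transposition and the precomputed item table can be sketched as below. The function names, the bundled Euclidean-based similarity, and the toy data are all assumptions for illustration. Any of the similarity measures from earlier could be substituted.

```python
from math import sqrt


def sim_distance(prefs, key1, key2):
    """Euclidean-distance-based similarity over shared inner keys."""
    shared = [k for k in prefs[key1] if k in prefs[key2]]
    if not shared:
        return 0.0
    total = sum((prefs[key1][k] - prefs[key2][k]) ** 2 for k in shared)
    return 1 / (1 + sqrt(total))


def transform_prefs(prefs):
    """Transpose {user: {item: rating}} into {item: {user: rating}}."""
    result = {}
    for user, user_ratings in prefs.items():
        for item, rating in user_ratings.items():
            result.setdefault(item, {})[user] = rating
    return result


def similar_items(prefs, n=10, similarity=sim_distance):
    """Precompute, for every item, its n most similar items, best first."""
    item_prefs = transform_prefs(prefs)
    table = {}
    for item in item_prefs:
        scores = [(similarity(item_prefs, item, other), other)
                  for other in item_prefs if other != item]
        scores.sort(reverse=True)
        table[item] = scores[:n]
    return table


# Made-up toy data: two users, three movies.
prefs = {
    'Ann': {'m1': 4, 'm2': 4, 'm3': 1},
    'Bob': {'m1': 2, 'm2': 2, 'm3': 5},
}
print(transform_prefs(prefs)['m3'])  # {'Ann': 1, 'Bob': 5}
print(similar_items(prefs)['m1'][0])  # (1.0, 'm2')
```

Because the table only depends on the ratings, it can be rebuilt offline at whatever interval suits the site and reused for every recommendation request in between.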

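Once we have the item similarities, how do we make recommendations?

Suppose a user has rated Snakes, Superman, and Dupree. How do we recommend new movies based on those ratings?

The table below lists the similarity between each rated movie and the other movies, assuming the only others are Night, Lady, and Luck.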

Movie       Rating  Night  R.xNight  Lady   R.xLady  Luck   R.xLuck
Snakes      4.5     0.182  0.818     0.222  0.999    0.105  0.474
Superman    4.0     0.103  0.412     0.091  0.363    0.065  0.258
Dupree      1.0     0.148  0.148     0.4    0.4      0.182  0.182
Total               0.433  1.378     0.713  1.764    0.352  0.914
Normalized                 3.183            2.473           2.598

The calculation works as follows:

Rating * Night = R.xNight

Total of R.xNight / Total of Night = Normalized
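The calculation above can be sketched in code as follows, using the ratings and similarity values from the table. The function name recommend_from_item_table and the layout of the precomputed table (a list of (similarity, other item) pairs per rated item) are assumptions made for this example.

```python
def recommend_from_item_table(user_ratings, item_sims):
    """Rank unseen items using a precomputed item-similarity table.

    user_ratings: {item: rating} for the items the user has rated
    item_sims:    {item: [(similarity, other_item), ...]}
    """
    scores = {}     # candidate -> sum of similarity * rating
    total_sim = {}  # candidate -> sum of similarities
    for item, rating in user_ratings.items():
        for similarity, other in item_sims.get(item, []):
            if other in user_ratings:
                continue  # skip items the user has already rated
            scores[other] = scores.get(other, 0.0) + similarity * rating
            total_sim[other] = total_sim.get(other, 0.0) + similarity
    ranked = [(scores[name] / total_sim[name], name)
              for name in scores if total_sim[name] > 0]
    ranked.sort(reverse=True)
    return ranked


user_ratings = {'Snakes': 4.5, 'Superman': 4.0, 'Dupree': 1.0}
item_sims = {
    'Snakes': [(0.182, 'Night'), (0.222, 'Lady'), (0.105, 'Luck')],
    'Superman': [(0.103, 'Night'), (0.091, 'Lady'), (0.065, 'Luck')],
    'Dupree': [(0.148, 'Night'), (0.4, 'Lady'), (0.182, 'Luck')],
}
for score, movie in recommend_from_item_table(user_ratings, item_sims):
    print(f"{movie}: {score:.3f}")  # Night comes out on top
```

Note that no user-to-user comparison happens at request time. The only inputs are the user's own ratings and the precomputed table, which is where the speed advantage of the item-based approach comes from.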

User-Based or Item-Based Filtering?

Item-based filtering is significantly faster than user-based when getting a list of recommendations for a large dataset, but it does have the additional overhead of maintaining the item similarity table.

Item-based filtering usually outperforms user-based filtering in sparse datasets, and the two perform about equally in dense datasets.
