# Recommender Systems Series: Product Association Analysis

### Product Association Analysis

relevance: mainly used for Internet content and documents, e.g., the relatedness between documents in a search-engine algorithm.

association: used for real-world items, e.g., the degree of association between products on an e-commerce site.

In the classic diapers-and-beer example, a rule is kept only if it clears both thresholds: Support(diapers, beer) ≥ 5% and Confidence(diapers, beer) ≥ 65%.
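As a concrete sketch of those two thresholds (the transaction counts below are invented for illustration), support is the fraction of all transactions containing both items, and confidence is the fraction of the antecedent's transactions that also contain the consequent:

```java
// Support/confidence check for an association rule, with made-up counts:
// 10,000 transactions total, 1,000 contain diapers, 800 contain both.
public class RuleMetrics {

  // support(X, Y) = count(X and Y) / total transactions
  static double support(long bothCount, long totalTransactions) {
    return (double) bothCount / totalTransactions;
  }

  // confidence(X -> Y) = count(X and Y) / count(X)
  static double confidence(long bothCount, long antecedentCount) {
    return (double) bothCount / antecedentCount;
  }

  public static void main(String[] args) {
    double s = support(800, 10_000);   // 0.08 -> clears the 5% threshold
    double c = confidence(800, 1_000); // 0.80 -> clears the 65% threshold
    System.out.println(s >= 0.05 && c >= 0.65); // true
  }
}
```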

### FP-Growth算法

FP-Growth (Frequent Pattern Growth) is an association-analysis algorithm proposed by Jiawei Han in 2000. It takes a divide-and-conquer approach: it compresses the database of frequent itemsets into a frequent-pattern tree (FP-Tree) while still preserving the itemset association information. It differs from the Apriori algorithm in two major ways: first, it generates no candidate sets; second, it needs only two passes over the database, which greatly improves efficiency.
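The two database passes can be sketched as follows. This is only the FP-Tree construction step; the recursive mining of conditional trees is omitted, and the item names and min-support threshold are invented for illustration:

```java
import java.util.*;

// Minimal FP-Tree construction: the two scans that FP-Growth needs.
public class FpTreeSketch {

  static class Node {
    final String item;
    int count;
    final Map<String, Node> children = new HashMap<>();
    Node(String item) { this.item = item; }
  }

  static Node build(List<List<String>> db, int minSupport) {
    // Scan 1: count the support of every single item.
    Map<String, Integer> freq = new HashMap<>();
    for (List<String> tx : db) {
      for (String item : tx) {
        freq.merge(item, 1, Integer::sum);
      }
    }
    // Scan 2: drop infrequent items, sort each transaction by descending
    // frequency, and insert the result into a shared prefix tree.
    Node root = new Node(null);
    for (List<String> tx : db) {
      List<String> kept = new ArrayList<>();
      for (String item : tx) {
        if (freq.get(item) >= minSupport) {
          kept.add(item);
        }
      }
      kept.sort((x, y) -> {
        int byFreq = freq.get(y) - freq.get(x); // descending frequency
        return byFreq != 0 ? byFreq : x.compareTo(y); // tie-break by name
      });
      Node cur = root;
      for (String item : kept) {
        cur = cur.children.computeIfAbsent(item, Node::new);
        cur.count++;
      }
    }
    return root;
  }

  public static void main(String[] args) {
    List<List<String>> db = List.of(
        List.of("milk", "bread", "beer"),
        List.of("milk", "bread"),
        List.of("bread", "beer"),
        List.of("milk", "beer"));
    // Transactions sharing a frequent prefix share nodes, which is how the
    // tree compresses the database while keeping association information.
    Node root = build(db, 2);
    System.out.println(root.children.size());
  }
}
```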

### References

#### LLR

```java
private double doItemSimilarity(long itemID1, long itemID2, long preferring1, long numUsers)
    throws TasteException {
  DataModel dataModel = getDataModel();
  long preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
  if (preferring1and2 == 0) {
    return Double.NaN;
  }
  long preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
  double logLikelihood =
      LogLikelihood.logLikelihoodRatio(preferring1and2,
                                       preferring2 - preferring1and2,
                                       preferring1 - preferring1and2,
                                       numUsers - preferring1 - preferring2 + preferring1and2);
  return 1.0 - 1.0 / (1.0 + logLikelihood);
}
```

```java
long preferring1and2 = dataModel.getNumUsersWithPreferenceFor(itemID1, itemID2);
long preferring1 = dataModel.getNumUsersWithPreferenceFor(itemID1);
long preferring2 = dataModel.getNumUsersWithPreferenceFor(itemID2);
long numUsers = dataModel.getNumUsers();
```

- `k11`: `preferring1and2`
- `k12`: `preferring2 - preferring1and2`
- `k21`: `preferring1 - preferring1and2`
- `k22`: `numUsers - preferring1 - preferring2 + preferring1and2`

|                  | Event A | Everything but A |
|------------------|---------|------------------|
| Event B          | k11     | k12              |
| Everything but B | k21     | k22              |

`LLR = 2 sum(k) (H(k) - H(rowSums(k)) - H(colSums(k)))`

`H = function(k) {N = sum(k); return (sum(k/N * log(k/N + (k==0))))}`

```java
/**
 * Calculates the raw log-likelihood ratio for two events, call them A and B. Then we have:
 * <p/>
 * <table>
 * <tbody><tr><td>&nbsp;</td><td>Event A</td><td>Everything but A</td></tr>
 * <tr><td>Event B</td><td>A and B together (k_11)</td><td>B, but not A (k_12)</td></tr>
 * <tr><td>Everything but B</td><td>A without B (k_21)</td><td>Neither A nor B (k_22)</td></tr></tbody>
 * </table>
 *
 * @param k11 The number of times the two events occurred together
 * @param k12 The number of times the second event occurred WITHOUT the first event
 * @param k21 The number of times the first event occurred WITHOUT the second event
 * @param k22 The number of times something else occurred (i.e., was neither of these events)
 * @return The raw log-likelihood ratio
 *
 * <p/>
 * Credit to http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for the table and the descriptions.
 */
public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
  Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
  // Note that we have counts here, not probabilities, and that the entropy is not normalized.
  double rowEntropy = entropy(k11 + k12, k21 + k22);
  double columnEntropy = entropy(k11 + k21, k12 + k22);
  double matrixEntropy = entropy(k11, k12, k21, k22);
  if (rowEntropy + columnEntropy < matrixEntropy) {
    // round-off error
    return 0.0;
  }
  return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
}
```
```java
/**
 * Merely an optimization for the common two-argument case of {@link #entropy(long...)}
 * @see #logLikelihoodRatio(long, long, long, long)
 */
private static double entropy(long a, long b) {
  return xLogX(a + b) - xLogX(a) - xLogX(b);
}
```
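To make the formula above concrete, here is a self-contained re-implementation of the same entropy-based LLR computation (the class name and the example counts are my own, not from Mahout; the logic mirrors the `xLogX`/`entropy` convention shown above):

```java
public class LlrExample {

  // x * ln(x), with 0 * ln(0) defined as 0 (the same convention as Mahout's xLogX).
  static double xLogX(long x) {
    return x == 0 ? 0.0 : x * Math.log(x);
  }

  // Unnormalized entropy over raw counts: xLogX(sum) - sum of xLogX(count).
  static double entropy(long... counts) {
    long sum = 0;
    double partial = 0.0;
    for (long c : counts) {
      sum += c;
      partial += xLogX(c);
    }
    return xLogX(sum) - partial;
  }

  static double llr(long k11, long k12, long k21, long k22) {
    double rowEntropy = entropy(k11 + k12, k21 + k22);
    double columnEntropy = entropy(k11 + k21, k12 + k22);
    double matrixEntropy = entropy(k11, k12, k21, k22);
    if (rowEntropy + columnEntropy < matrixEntropy) {
      return 0.0; // guard against round-off error
    }
    return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
  }

  public static void main(String[] args) {
    // Made-up counts: 100 users preferred both items, 200 only the second,
    // 300 only the first, and 9,400 neither. Under independence we would
    // expect roughly 400 * 300 / 10,000 = 12 co-preferences, so 100 is a
    // strongly surprising cooccurrence and the LLR comes out large.
    System.out.printf("%.2f%n", llr(100, 200, 300, 9400));
  }
}
```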

### Entropy (information theory)

Mahout on Spark: What’s New in Recommenders, part 2

Here similar means that they were liked by the same people. We’ll use another technique to narrow the items down to ones of the same genre later.

Intro to Cooccurrence Recommenders with Spark

```
rp = recommendations for a given user
hp = history of purchases for a given user
A  = the matrix of all purchases by all users

rp = [A^t A] hp
```

This would produce reasonable recommendations, but is subject to skewed results due to the dominance of popular items. To avoid that, we can apply a weighting called the log likelihood ratio (LLR), which is a probabilistic measure of the importance of a cooccurrence.

The magnitude of the value in the matrix determines the strength of similarity of row item to the column item. We can use the LLR weights as a similarity measure that is nicely immune to unimportant similarities.
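The raw cooccurrence scoring rp = [A^t A] hp can be sketched on a toy dense matrix (the purchase matrix and history vector below are invented; real pipelines use sparse matrices and keep only LLR-significant cooccurrences):

```java
// Toy dense version of rp = [A^t A] hp for a 3-user x 4-item purchase matrix.
public class CooccurrenceSketch {

  // A^t A: entry (i, j) counts users who purchased both item i and item j.
  static double[][] ata(double[][] a) {
    int items = a[0].length;
    double[][] out = new double[items][items];
    for (double[] userRow : a) {
      for (int i = 0; i < items; i++) {
        for (int j = 0; j < items; j++) {
          out[i][j] += userRow[i] * userRow[j];
        }
      }
    }
    return out;
  }

  // Score every item against the user's history vector hp.
  static double[] recommend(double[][] a, double[] hp) {
    double[][] cooccurrence = ata(a);
    double[] rp = new double[cooccurrence.length];
    for (int i = 0; i < rp.length; i++) {
      for (int j = 0; j < rp.length; j++) {
        rp[i] += cooccurrence[i][j] * hp[j];
      }
    }
    return rp;
  }

  public static void main(String[] args) {
    double[][] a = {
        {1, 1, 0, 0},  // user 0 bought items 0 and 1
        {1, 0, 1, 0},  // user 1 bought items 0 and 2
        {0, 1, 1, 1}}; // user 2 bought items 1, 2, 3
    double[] hp = {1, 0, 0, 0}; // the target user bought only item 0
    // Items frequently co-bought with item 0 score highest.
    System.out.println(java.util.Arrays.toString(recommend(a, hp)));
    // -> [2.0, 1.0, 1.0, 0.0]
  }
}
```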

#### ItemSimilarityDriver

Creating the indicator matrix [AtA] is the core of this type of recommender. We have a quick, flexible way to create it from text log files, producing output in an easy-to-digest form. Data prep is greatly streamlined in the Mahout 1.0 snapshot. In the past, users had to do all the data prep themselves: translating their own user and item IDs into Mahout IDs, putting the data into text files with one element per line, and feeding them to the recommender. Out the other end you'd get a Hadoop binary file called a sequence file, and you'd have to translate the Mahout IDs back into something your application could understand. No more.

### MAP

"Building an item recommendation system based on mahout on spark + elastic search"