At my previous company I used LLR (log-likelihood ratio) to compute similar items; since I'm currently job hunting, I'm reviewing it here.

The core of the LLR method is analyzing event counts, in particular the counts of events occurring together. The counts we generally need are:

1. The number of times the two events occurred together (k_11)
2. The number of times each event occurred without the other (k_12, k_21)
3. The number of times neither event occurred (k_22)
 | Event A | Everything but A |
---|---|---|
Event B | A and B together (k_11) | B, but not A (k_12) |
Everything but B | A without B (k_21) | Neither A nor B (k_22) |
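As a concrete illustration, here is a minimal sketch of deriving the four counts from interaction data. The user sets, item names, and `totalUsers` are all made up for the example:

```java
import java.util.Set;

public class CooccurrenceCounts {
    // Returns {k11, k12, k21, k22} for two items, given the set of users
    // who interacted with each item and the total user population size.
    static long[] counts(Set<String> usersOfA, Set<String> usersOfB, long totalUsers) {
        long k11 = usersOfA.stream().filter(usersOfB::contains).count(); // A and B together
        long k12 = usersOfB.size() - k11;        // B, but not A
        long k21 = usersOfA.size() - k11;        // A without B
        long k22 = totalUsers - k11 - k12 - k21; // neither A nor B
        return new long[] {k11, k12, k21, k22};
    }

    public static void main(String[] args) {
        // hypothetical interaction data: which users touched each item
        Set<String> usersOfA = Set.of("u1", "u2", "u3", "u4");
        Set<String> usersOfB = Set.of("u3", "u4", "u5");
        long[] k = counts(usersOfA, usersOfB, 10);
        System.out.println(k[0] + " " + k[1] + " " + k[2] + " " + k[3]); // 2 1 2 5
    }
}
```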
Once we have these counts, computing the log-likelihood ratio score is straightforward:

    LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))

where H denotes Shannon entropy. (Note that the R one-liner below actually computes sum(p * log(p)), i.e. the entropy with its sign flipped and without normalization; the signs in the formula above match that convention.) In R it can be computed as:

    H = function(k) {N = sum(k); return(sum(k/N * log(k/N + (k==0))))}
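To make the formula concrete, here is a sketch of the same computation in Java, using the same sign convention as the R snippet. The counts are made-up example values:

```java
public class LlrFormula {
    // sum over the counts of (k/N) * log(k/N), with 0 * log(0) taken as 0 --
    // the same quantity the R one-liner computes (the negative of the usual
    // Shannon entropy; the formula's signs are chosen accordingly).
    static double h(long... k) {
        long n = 0;
        for (long x : k) {
            n += x;
        }
        double sum = 0.0;
        for (long x : k) {
            if (x > 0) {
                sum += (double) x / n * Math.log((double) x / n);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // made-up counts where A and B co-occur far more than chance predicts
        long k11 = 10, k12 = 100, k21 = 50, k22 = 10000;
        long total = k11 + k12 + k21 + k22;
        // LLR = 2 * sum(k) * (H(k) - H(rowSums(k)) - H(colSums(k)))
        double llr = 2.0 * total * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)     // row sums
                - h(k11 + k21, k12 + k22));   // column sums
        System.out.println("LLR = " + llr);
    }
}
```

For independent events the row and column terms exactly cancel the joint term, so the score is 0; dependence makes it positive.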
Below is the Mahout implementation:
    /**
     * Calculates the Raw Log-likelihood ratio for two events, call them A and B. Then we have:
     * <p/>
     * <table border="1" cellpadding="5" cellspacing="0">
     * <tbody><tr><td> </td><td>Event A</td><td>Everything but A</td></tr>
     * <tr><td>Event B</td><td>A and B together (k_11)</td><td>B, but not A (k_12)</td></tr>
     * <tr><td>Everything but B</td><td>A without B (k_21)</td><td>Neither A nor B (k_22)</td></tr></tbody>
     * </table>
     *
     * @param k11 The number of times the two events occurred together
     * @param k12 The number of times the second event occurred WITHOUT the first event
     * @param k21 The number of times the first event occurred WITHOUT the second event
     * @param k22 The number of times something else occurred (i.e. was neither of these events)
     * @return The raw log-likelihood ratio
     *
     * <p/>
     * Credit to http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html for the table and the descriptions.
     */
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
      Preconditions.checkArgument(k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0);
      // note that we have counts here, not probabilities, and that the entropy is not normalized.
      double rowEntropy = entropy(k11 + k12, k21 + k22);
      double columnEntropy = entropy(k11 + k21, k12 + k22);
      double matrixEntropy = entropy(k11, k12, k21, k22);
      if (rowEntropy + columnEntropy < matrixEntropy) {
        // round off error
        return 0.0;
      }
      return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    private static double xLogX(long x) {
      return x == 0 ? 0.0 : x * Math.log(x);
    }

    /**
     * Merely an optimization for the common two argument case of {@link #entropy(long...)}
     * @see #logLikelihoodRatio(long, long, long, long)
     */
    private static double entropy(long a, long b) {
      return xLogX(a + b) - xLogX(a) - xLogX(b);
    }

    /**
     * Merely an optimization for the common four argument case of {@link #entropy(long...)}
     * @see #logLikelihoodRatio(long, long, long, long)
     */
    private static double entropy(long a, long b, long c, long d) {
      return xLogX(a + b + c + d) - xLogX(a) - xLogX(b) - xLogX(c) - xLogX(d);
    }
References:

http://tdunning.blogspot.hk/2008/03/surprise-and-coincidence.html