These are my notes #2 for CS276 Information Retrieval & Web Search, summarizing probabilistic ranking models in IR systems: BIM and BM25.
Introduction
The core of an IR system is ranking, which is a very intuitive task: for a user query $q$, the IR system wants to give each retrieved document a score and return the documents to the user in descending order of score. The core of the probabilistic models is to model this score with probabilities.
$P(R|q,d)$, where $R$ is a binary random variable: $R=1$ means relevant and $R=0$ means non-relevant; $q$ is the user's query and $d$ is a document.
Since we only care about the rank (the relative order) rather than the absolute probability, the quantity usually used in probabilistic models is the odds:
$$O(R|q,d)=\frac{P(R=1|q,d)}{P(R=0|q,d)}$$
Below we introduce two probabilistic models, BIM and BM25, based on the binary (Bernoulli) distribution and the Poisson distribution respectively.
BIM (Binary Independence Model)
First, vectorize the document as $x=(x_1,\dots,x_i,\dots,x_T)$, where $T$ is the number of terms in the query and $x_i=1 \iff term_i$ appears in document $d$.
Then the score is
$$\begin{aligned}
O(R|q,d) &= O(R|q,x)\\
&= \frac{\Pr(R=1|q,x)}{\Pr(R=0|q,x)}\\
&= \frac{\frac{\Pr(R=1|q)\Pr(x|R=1,q)}{\Pr(x|q)}}{\frac{\Pr(R=0|q)\Pr(x|R=0,q)}{\Pr(x|q)}} \quad (\text{Bayes' rule})\\
&= O(R|q)\frac{\Pr(x|R=1,q)}{\Pr(x|R=0,q)}
\end{aligned}$$
Since $q$ is the same for every document $d$, $O(R|q)$ can be treated as a constant, so what we really care about is
$$\frac{\Pr(x|R=1,q)}{\Pr(x|R=0,q)} \tag{1}$$
The core of the binary independence model is two assumptions:
- each $term_i$ follows a binary (Bernoulli) distribution
- the $term_i$ are mutually independent

Applying the independence assumption to equation $(1)$, we have
$$\begin{aligned}
score(q,d) &=\prod_{i=1}^T \frac{\Pr(x_i|R=1,q)}{\Pr(x_i|R=0,q)}\\
&=\prod_{x_i=1}\frac{\Pr(x_i=1|R=1,q)}{\Pr(x_i=1|R=0,q)}\prod_{x_i=0}\frac{\Pr(x_i=0|R=1,q)}{\Pr(x_i=0|R=0,q)}\\
\mathbf{let}\ p_i&=\Pr(x_i=1|R=1,q),\quad r_i=\Pr(x_i=1|R=0,q)\\
&=\prod_{x_i=1}\frac{p_i}{r_i}\prod_{x_i=0}\frac{1-p_i}{1-r_i}\\
&=\prod_{x_i=1}\frac{p_i}{r_i}\left(\prod_{x_i=1}\frac{1-r_i}{1-p_i}\cdot\frac{1-p_i}{1-r_i}\right)\prod_{x_i=0}\frac{1-p_i}{1-r_i}\\
&=\prod_{x_i=1}\frac{p_i(1-r_i)}{r_i(1-p_i)}\left(\prod_{i=1}^T\frac{1-p_i}{1-r_i}\ \rightarrow\ \text{constant}\right)
\end{aligned}$$
Therefore, what we ultimately need to compute is
$$score(q,x)=\prod_{x_i=1}\frac{p_i(1-r_i)}{r_i(1-p_i)}\tag{2}$$
Retrieval Status Value
Taking the logarithm of equation $(2)$, we get
$$RSV=\sum_{x_i=1}\log\frac{p_i(1-r_i)}{r_i(1-p_i)}\tag{3}$$
Define
$$c_i=\log\frac{p_i(1-r_i)}{r_i(1-p_i)}\tag{4}$$
The next task is to estimate $c_i$.
Estimating $c_i$
Suppose that for the documents in the whole collection we can obtain the following contingency table:
| doc | Relevant | Non-Relevant | Total |
| --- | --- | --- | --- |
| $x_i=1$ | $s$ | $n-s$ | $n$ |
| $x_i=0$ | $S-s$ | $N-n-S+s$ | $N-n$ |
| sum | $S$ | $N-S$ | $N$ |
(Note: the lowercase $s$ is usually hard to estimate.)
$$p_i=\frac{s}{S},\quad r_i=\frac{n-s}{N-S},\quad 1-r_i=\frac{N-S+s-n}{N-S},\quad 1-p_i=\frac{S-s}{S}$$
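These estimates and the resulting weight $c_i$ can be sketched in a few lines of Python. The counts below are hypothetical, and the add-0.5 smoothing is a common convention I've assumed here (not something derived in these notes) to keep the estimates away from 0 and 1:

```python
import math

# Toy counts for one query term (hypothetical values):
# N = total docs, S = relevant docs, n = docs containing the term,
# s = relevant docs containing the term (the hard-to-estimate quantity).
N, S, n, s = 1000, 20, 100, 10

# Add-0.5 smoothing keeps p_i and r_i strictly between 0 and 1.
p_i = (s + 0.5) / (S + 1.0)          # Pr(x_i = 1 | R = 1)
r_i = (n - s + 0.5) / (N - S + 1.0)  # Pr(x_i = 1 | R = 0)

# Term weight c_i from equation (4).
c_i = math.log(p_i * (1 - r_i) / (r_i * (1 - p_i)))
print(round(c_i, 3))
```

A positive $c_i$ means the term is more likely to occur in relevant documents than in non-relevant ones.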
Assuming $N-S\approx N$ (relevant documents are a tiny fraction of the collection),
$$\begin{aligned}
\log\frac{1-r_i}{r_i}&=\log\frac{N-n-S+s}{n-s}\\
&\approx\log\frac{N-n}{n}\\
&\approx\log\frac{N}{n}=IDF!
\end{aligned}$$
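The approximation steps above can be checked numerically. With hypothetical collection statistics where $s$ is small and $S \ll N$, the exact $\log\frac{1-r_i}{r_i}$ is very close to the IDF:

```python
import math

# Hypothetical collection statistics: N docs total, S relevant,
# n containing the term, s relevant docs containing the term.
N, S, n, s = 1_000_000, 50, 2_000, 5

exact = math.log((N - n - S + s) / (n - s))  # log((1 - r_i) / r_i)
idf = math.log(N / n)                        # the IDF approximation

print(round(exact, 4), round(idf, 4))
```

The two values agree to about three decimal places here, which is why dropping $S$ and $s$ is harmless in practice.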
The hardest part is estimating $p_i$. One possible approach is to iterate repeatedly with the system (relevance feedback) to estimate $s$, but this places heavy demands on the system and the data, and it really requires a real production system and environment.
This model has another problem: it does not incorporate term frequency. Generally speaking, a document in which a query term occurs many times tends to be more relevant, and BIM does not model this.
BM25 (Best Match 25)
Let's look at a model based on the Poisson distribution.
BM25 models term frequency, and it also introduces the concept of eliteness.
"Eliteness" means that some terms in the query are special: the whole document is about those terms. This makes perfect sense; think of a paper's keywords.
(Note: unless otherwise stated, all figures in this post are taken from ref 1.)
Eliteness can in fact be viewed as a hidden variable.
We also model $\Pr(TF_i=k)$ as a Poisson distribution.
Based on these assumptions, we can build the following model:
$$RSV^{elite}=\sum_{tf_i>0}c_i^{elite}(tf_i)$$
where
$$c_i^{elite}(tf_i)=\log\frac{\Pr(TF_i=tf_i|R=1)\Pr(TF_i=0|R=0)}{\Pr(TF_i=tf_i|R=0)\Pr(TF_i=0|R=1)}$$
and
$$\Pr(TF_i=tf_i|R)=\Pr(E_i=Elite|R)\Pr(TF_i=tf_i|E_i=Elite)+\Pr(E_i=\overline{Elite}|R)\Pr(TF_i=tf_i|E_i=\overline{Elite})$$
This sets up a two-Poisson model for each term:
$$\Pr(TF_i=k|R)=\pi\frac{\lambda^k}{k!}e^{-\lambda}+(1-\pi)\frac{\mu^k}{k!}e^{-\mu}$$
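The two-Poisson mixture above is straightforward to evaluate. Here is a minimal sketch, with the parameters $\pi$, $\lambda$, $\mu$ chosen as hypothetical values (elite documents using the term at a much higher rate than non-elite ones):

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    """Pr(K = k) for a Poisson distribution with the given rate."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def two_poisson_pmf(k: int, pi: float, lam: float, mu: float) -> float:
    """Mixture: elite with probability pi (rate lam), else non-elite (rate mu)."""
    return pi * poisson_pmf(k, lam) + (1 - pi) * poisson_pmf(k, mu)

# Hypothetical parameters: elite documents use the term ~10x more often.
probs = [two_poisson_pmf(k, pi=0.1, lam=5.0, mu=0.5) for k in range(50)]
print(round(sum(probs), 6))  # the pmf sums to ~1
```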
The parameters $\pi,\mu,\lambda$ are all very hard to estimate, so once again an approximation is used! (PS: I find approximation a truly magical thing.)
Approximation
Observing this function, it has several properties:
- $c_i(0)=0$
- $c_i(tf_i)$ increases monotonically with $tf_i$
- $c_i(tf_i)$ has an asymptote
saturation function
$$sf(tf)=\frac{tf}{k+tf}$$
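The saturating behavior is easy to see numerically (the default $k=1.2$ below is the conventional BM25 choice, assumed here for illustration):

```python
# The saturation function tf / (k + tf) grows quickly for small tf and then
# flattens toward an asymptote of 1; k controls how quickly it saturates.
def sf(tf: float, k: float = 1.2) -> float:
    return tf / (k + tf)

for tf in (1, 2, 5, 10, 100):
    print(tf, round(sf(tf), 3))
```

Going from $tf=1$ to $tf=2$ changes the value much more than going from $tf=10$ to $tf=100$, which is exactly the diminishing-returns behavior we want for term frequency.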
BM25
$$c_i^{BM25v2}=\log\frac{N}{df_i}\cdot\frac{(k_1+1)tf_i}{k_1+tf_i}\tag{5}$$
This formula looks very much like tf-idf, except that the tf term is bounded (the saturation function). The $(k_1+1)$ factor in the numerator carries little meaning; it just makes the tf term equal 1 when $tf_i=1$, presumably so the value does not change too steeply.
Document length normalization
This is based on two intuitions:
- a term is more likely to be observed at all in a long document
- the term frequencies observed in long documents are usually higher
$$dl=\sum_{term_i\in d}tf_i$$
$avdl$: the average document length over the whole collection.
$$B=\left((1-b)+b\frac{dl}{avdl}\right),\quad b\in[0,1]$$
$b$ is the length-normalization factor.
Let $tf_i'=\frac{tf_i}{B}$; substituting into equation $(5)$, we get the real BM25 weight:
$$c_i^{BM25}(tf_i)=c_i^{BM25v2}(tf_i')=\log\frac{N}{df_i}\cdot\frac{(k_1+1)tf_i}{k_1\left((1-b)+b\frac{dl}{avdl}\right)+tf_i}\tag{6}$$
and
$$RSV^{BM25}=\sum_i c_i^{BM25}(tf_i)\tag{7}$$
Other features
Text features:
- Zones: title, author, abstract, body, anchors, …
- Proximity: phrase matching, nearby-term lookup
- …

Non-text features:
- File type
- File age
- PageRank
- …
(PS: these features have to be discovered in real systems and business scenarios, so however much is written here is just armchair theorizing. This is merely a summary to note that these things exist.)
Code: a simple implementation
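Here is a minimal in-memory BM25 sketch implementing equations (6) and (7). This is my illustrative implementation, not reference code from the course: it uses the conventional defaults $k_1=1.2$, $b=0.75$ and the simple $\log\frac{N}{df_i}$ idf form from equation (5):

```python
import math
from collections import Counter

class BM25:
    """Minimal in-memory BM25 scorer (equations (6) and (7))."""

    def __init__(self, docs: list[list[str]], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [Counter(d) for d in docs]        # term frequencies per doc
        self.dl = [sum(c.values()) for c in self.docs]
        self.avdl = sum(self.dl) / len(self.dl)       # average document length
        self.N = len(docs)
        self.df = Counter(t for c in self.docs for t in c)  # document frequency

    def score(self, query: list[str], i: int) -> float:
        """RSV^{BM25} of document i for the query (equation (7))."""
        c, dl = self.docs[i], self.dl[i]
        rsv = 0.0
        for t in query:
            tf = c[t]
            if tf == 0:
                continue
            idf = math.log(self.N / self.df[t])
            # Length-normalized tf denominator: k1 * B + tf (equation (6)).
            norm = self.k1 * ((1 - self.b) + self.b * dl / self.avdl)
            rsv += idf * (self.k1 + 1) * tf / (norm + tf)
        return rsv

    def rank(self, query: list[str]) -> list[int]:
        """Document indices sorted by descending BM25 score."""
        return sorted(range(self.N), key=lambda i: self.score(query, i),
                      reverse=True)

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "information retrieval with probabilistic models".split(),
]
bm25 = BM25(docs)
print(bm25.rank("probabilistic retrieval".split()))
```

The third document is the only one containing the query terms, so it comes first; a production system would of course use an inverted index rather than scoring every document.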
Resources
S. E. Robertson and K. Spärck Jones. 1976. Relevance Weighting of Search Terms. Journal of the American Society for Information Sciences 27(3): 129–146.
C. J. van Rijsbergen. 1979. Information Retrieval. 2nd ed. London: Butterworths, chapter 6. http://www.dcs.gla.ac.uk/Keith/Preface.html
K. Spärck Jones, S. Walker, and S. E. Robertson. 2000. A probabilistic model of information retrieval: Development and comparative experiments. Part 1. Information Processing and Management 779–808.
S. E. Robertson and H. Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3(4): 333-389.
Ref
slides in https://web.stanford.edu/class/cs276/ (Probabilistic IR: the binary independence model, BM25, BM25F | Evaluation methods & NDCG|Learning to rank)
Copyright notice
This is an original article by the author, licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Author: taotao
Non-commercial reposting only; please keep this copyright notice and credit the source.