Building an index is a core problem in information retrieval. This section is a short set of notes on indexing.
All of the material comes from Stanford CS276, Information Retrieval & Web Search.

text preprocessing

Some terminology

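tokenization
1. Turn the text into a token stream; this usually has to handle cases such as "I’m" …
normalization
1. Map text terms and query terms to the same form (e.g. U.S.A -> USA)
stemming
1. Reduce different forms of a word to a common stem (e.g. authorize, authorization); Chinese usually does not have this problem
Stop words
1. Omit words that carry little information, e.g. a, an …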
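
The four steps above can be sketched as a small pipeline. This is only an illustration: the stop list, the suffix list, and all function names are made up for the example, and the stemmer is a crude suffix stripper rather than a real algorithm such as Porter's.

```python
import re

# Hypothetical toy resources, chosen only for this example.
STOP_WORDS = {"a", "an", "the", "of", "to"}
SUFFIXES = ("ization", "ize", "ing", "s")

def tokenize(text):
    # Keep letters, digits, apostrophes and dots together,
    # so "I'm" and "U.S.A" each survive as a single token.
    return re.findall(r"[A-Za-z0-9'’.]+", text)

def normalize(token):
    # Lower-case and drop dots/apostrophes so "U.S.A" and "USA" become the same term.
    return token.lower().replace(".", "").replace("'", "").replace("’", "")

def stem(term):
    # Crude suffix stripping, for illustration only.
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def preprocess(text):
    terms = (stem(normalize(tok)) for tok in tokenize(text))
    # Drop empty strings (e.g. a lone ".") and stop words.
    return [t for t in terms if t and t not in STOP_WORDS]
```

With these toy rules, "authorize" and "authorization" both end up as the same term, and "U.S.A" matches a query for "USA".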

inverted index

term -> (frq,posting_list(docId))

frq = len(posting_list)

simple construction

text -> tokenstream -> list ( term, docId) -> sort->unique -> merge
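
The pipeline above can be sketched in a few lines for a collection that fits in memory. Tokenization here is just lower-casing and whitespace splitting, and `build_inverted_index` is a name invented for the example.

```python
from itertools import groupby

def build_inverted_index(docs):
    """docs: dict mapping docId -> text. Returns term -> (frq, posting_list)."""
    # text -> token stream -> list of (term, docId) pairs
    pairs = [(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()]
    # sort -> unique: one pair per (term, docId), ordered by term and then docId
    pairs = sorted(set(pairs))
    # merge: collapse all pairs of a term into one posting list
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index
```

Because the pairs are sorted before grouping, every posting list comes out in increasing docId order, which is what the merge-based query algorithms rely on.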

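The two core concerns of an IR system are an index that is efficient to query and efficient to store.

For example, this kind of index can handle simple boolean queries.

For a query such as term1 & term2, it is enough to merge the posting lists of the two terms, which takes $O(frq_1+frq_2)$.

More generally, a boolean query over several terms can be evaluated with the same kind of merge in $O(\sum frq_i)$, i.e. linear in the total size of the posting lists involved (treating the number of terms as a constant; a bare NOT is the exception, since its result can be as large as the whole collection).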
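
A minimal sketch of the two-list merge for term1 & term2, assuming both posting lists are sorted by docId. Each step advances at least one pointer, which is where the $O(frq_1+frq_2)$ bound comes from.

```python
def intersect(p1, p2):
    """Intersect two posting lists that are sorted by docId."""
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docId occurs in both lists: part of the answer
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot appear later in p2
        else:
            j += 1  # p2[j] cannot appear later in p1
    return answer
```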

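However, this kind of index runs into trouble with phrase queries.

A simple solution is to build a positional index. It is similar to an inverted index but also stores position information, so it needs more storage.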

positional index

(term, frq[Int], List[(docId, List[pos[Int]])])

term : frq
- docId1 : list(pos in docId1)
- docId2 : list(pos in docId2)

Queries are efficient as well (POSITIONALINTERSECT(p1, p2, k), ref1 P42; in essence it is still a merge over sorted lists)
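
The following is a simplified sketch in the spirit of POSITIONALINTERSECT, not a transcription of the textbook pseudocode. It assumes each posting list is a list of (docId, sorted positions) pairs sorted by docId, and it returns every pair of positions that lie within k of each other in the same document.

```python
def positional_intersect(p1, p2, k):
    """Return (docId, pos1, pos2) triples with |pos1 - pos2| <= k."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        doc1, positions1 = p1[i]
        doc2, positions2 = p2[j]
        if doc1 == doc2:
            start = 0  # positions2[:start] are too far left for all remaining pos1
            for pos1 in positions1:
                while start < len(positions2) and positions2[start] < pos1 - k:
                    start += 1
                m = start
                while m < len(positions2) and positions2[m] <= pos1 + k:
                    answer.append((doc1, pos1, positions2[m]))
                    m += 1
            i += 1
            j += 1
        elif doc1 < doc2:
            i += 1
        else:
            j += 1
    return answer
```

For a two-word phrase query, one would call it with k = 1 and keep only the matches where pos2 == pos1 + 1.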

index construction

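Because an index is usually very large, it cannot simply be kept entirely in memory and has to be stored on disk (the machine model here is still a single PC; clusters are not considered). So besides the efficiency of the in-memory algorithm, the algorithm should also be as disk-friendly as possible (note: disk seeks are expensive, so prefer reading large blocks over many small scattered reads).

A brief summary of a few index construction methods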

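Blocked sort-based indexing (BSBI)

Recall the simple construction method above. Its problem is that the term dict produced from the whole text collection has to be kept in memory. When the collection is large, memory runs out, so secondary storage must be used (called disk from here on, for simplicity).

Constructing the index with BSBI is very simple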

algo

Split the text collection into $n$ blocks, build the inverted index of each block in memory, then write it to disk

Run a multi-way merge over the inverted indexes of all blocks

block process

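The clumsy part of this algorithm is the per-block processing. It uses the simple construction approach described above: first generate List[(termID, docId)] and then sort it. The sort makes this step slow; SPIMI, described later, improves on exactly this point.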

multi-way merge

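The multi-way merge is simple: keep a buffer in memory that holds the posting_list of the current term, and use a priority queue to track the smallest termId of each block. Each step pops the smallest termId from the queue, reads its posting_list from disk, and merges it into the buffer. When the posting_list of the current term is complete, or the buffer reaches a given size, the buffer is written to disk.
(note: this is a very typical bounded-buffer read/write problem)

A simple implementation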

import heapq

# assumes a BSBIIndex base class is already defined
class BSBIIndex(BSBIIndex):
    def merge(self, indices, merged_index, buffer_size=1024*1024):
        """Merges multiple inverted indices into a single index
        
        Parameters
        ----------
        indices: List[InvertedIndexIterator]
            A list of InvertedIndexIterator objects, each representing an
            iterable inverted index for a block
        merged_index: InvertedIndexWriter
            An instance of InvertedIndexWriter object into which each merged 
            postings list is written out one at a time
        buffer_size: int
            Maximum number of postings kept in memory before flushing
        """
        ### Begin your code
        cur_term = None
        buffer = []

        # heapq.merge lazily yields (term, postings) items in term order
        for term, postings in heapq.merge(*indices, key=lambda x: x[0]):
            if term != cur_term:
                # a new term starts: write out what is buffered for the previous one
                if buffer:
                    merged_index.append(cur_term, buffer)
                cur_term = term
                buffer = []
            buffer.extend(postings)

            # bounded buffer: flush early once it grows past buffer_size
            # (append may then be called more than once for the same term)
            if len(buffer) > buffer_size:
                merged_index.append(cur_term, buffer)
                buffer = []

        # write out the postings of the last term
        if buffer:
            merged_index.append(cur_term, buffer)
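
The code above delegates the priority queue to heapq.merge. To make the queue from the description explicit, here is a minimal sketch in which plain Python iterables stand in for the on-disk blocks; `multiway_merge` and its argument layout are assumptions made for the example and are not part of the assignment code.

```python
import heapq

def multiway_merge(blocks):
    """blocks: list of iterables yielding (termId, postings) in termId order.
    Yields (termId, merged postings) in termId order."""
    iterators = [iter(b) for b in blocks]
    heap = []
    for block_no, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            # block_no breaks ties, so postings of earlier blocks come first
            heapq.heappush(heap, (first[0], block_no, first[1]))

    cur_term, buffer = None, []
    while heap:
        term_id, block_no, postings = heapq.heappop(heap)
        if term_id != cur_term:
            if cur_term is not None:
                yield cur_term, buffer  # previous term is complete
            cur_term, buffer = term_id, []
        buffer.extend(postings)
        # refill the queue from the block that was just consumed
        nxt = next(iterators[block_no], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt[0], block_no, nxt[1]))
    if cur_term is not None:
        yield cur_term, buffer
```

The heap never holds more than one entry per block, so its size is bounded by the number of blocks rather than by the size of the index.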

complexity

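Assume there are $T$ terms in total, split into $n$ blocks. Sorting each of the $n$ blocks and then merging them through a priority queue of size $n$ gives

$O(n \cdot \frac{T}{n}\log\frac{T}{n} + T\log n)$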
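
As a worked simplification (assuming a comparison sort within each block and one heap operation per merged item), the two terms collapse into a single one:

$$
\begin{aligned}
n \cdot \frac{T}{n}\log\frac{T}{n} + T\log n
  &= T\,(\log T - \log n) + T\log n \\
  &= T\log T
\end{aligned}
$$

So the total is $O(T\log T)$ regardless of $n$; splitting into blocks changes how much memory is needed, not the asymptotic amount of comparison work.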

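BSBI also uses a map term -> termID, which costs storage as well

Single-pass in-memory indexing (SPIMI)

Honestly, this algorithm feels like the way anyone would naturally write it.

Documents are processed in increasing docId order, so each posting_list is already increasing. When processing a block it is therefore enough to maintain a dict and append each docId to the end of the term's posting_list; once the block is full, it is written to disk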
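
A minimal sketch of this per-block step, assuming the token stream arrives as (term, docId) pairs in non-decreasing docId order. The function name and the `max_postings` limit (standing in for "memory is full") are made up for the example, and blocks are yielded instead of written to disk.

```python
from collections import defaultdict

def spimi_invert(token_stream, max_postings):
    """Yield one block [(term, posting_list), ...] each time max_postings is reached."""
    dictionary = defaultdict(list)
    count = 0
    for term, doc_id in token_stream:
        postings = dictionary[term]
        # docIds arrive in order, so appending keeps the list sorted;
        # skip a docId that was just added for this term
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
            count += 1
        if count >= max_postings:
            # only the terms are sorted, never the postings
            yield sorted(dictionary.items())
            dictionary = defaultdict(list)
            count = 0
    if dictionary:
        yield sorted(dictionary.items())
```

Compared with the sort-based block step, no list of (termID, docId) pairs is ever sorted, and terms are used directly without a term -> termID map.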

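I am not sure what "single-pass" really means here, because the blocks still have to be merged afterwards

One can assume that merging the blocks is roughly linear, since there are only a few blocks
complexity $\Theta(T)$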

index compression

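This topic is roughly about how to encode and decode the index; I am not very interested in it…

Just one technique that was used in the assignment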

variable-byte-code

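It is simple: a number is written in groups of 7 bits, one group per byte, keeping only the significant bytes (all-zero high-order bytes are dropped). The last byte, which holds the most significant group, has its high bit set to 1, and every other byte has its high bit set to 0, so the end of a number can be recognized

e.g.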

11111000011111 -> 1111100 0011111 -> 00011111 11111100
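
A minimal sketch of an encoder and decoder that follow the byte order of the example above (low-order group first, terminator bit on the final byte). The function names are invented here, and this is not the assignment's code.

```python
def vb_encode_number(n):
    """Encode a non-negative int, 7 payload bits per byte, low-order group first."""
    out = []
    while True:
        group = n & 0x7F
        n >>= 7
        if n == 0:
            # final byte (most significant group): set the high bit as terminator
            out.append(group | 0x80)
            return bytes(out)
        out.append(group)

def vb_decode(data):
    """Decode a byte string holding zero or more concatenated encoded numbers."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            # terminator bit seen: the current number is complete
            numbers.append(n)
            n, shift = 0, 0
        else:
            shift += 7
    return numbers
```

Because every number ends with a byte whose high bit is set, encoded numbers can be concatenated without separators and still be decoded unambiguously.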

code

Addendum

On the example from the assignment, this encoding shrinks the index from 57M to 31M

code: a simple implementation

code

ref

Introduction to Information Retrieval, by C. Manning, P. Raghavan, and H. Schütze (Cambridge University Press, 2008).

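Copyright notice

This work is the author's original article, licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Author: taotao

Only non-commercial reposting is allowed; when reposting, please keep this copyright notice and cite the source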
