Building a search engine in under 150 lines of code

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"全文搜索無處不在。在Scribd(一個文檔分享平臺)上搜索一本書,在Netflix上搜索一部電影,在亞馬遜上搜索衛生紙商品,或者通過谷歌搜索東西,你都在搜索大量的非結構化數據。更令人感到驚奇地是,即使你搜索的是數百萬(或數十億)條記錄,也能夠獲得毫秒級的響應體驗。在這篇文章中,我們將探索全文搜索引擎的基本組件,並用它們來構建一個可以搜索數百萬個文檔、根據相關性對文檔進行排名的搜索引擎。我們將用不到150行的Python代碼來開發這個搜索引擎。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這篇文章中所有的代碼都可以在Github上找到("},{"type":"link","attrs":{"href":"https:\/\/github.com\/bartdegoede\/python-searchengine\/?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/bartdegoede\/python-searchengine\/"}]},{"type":"text","text":")。我將在文章中提供代碼片段和鏈接,你可以嘗試自己運行它們。你可以安裝運行示例所需的組件(pip install -r requirements.txt),然後運行python 
run.py("},{"type":"link","attrs":{"href":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/run.py?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/run.py"}]},{"type":"text","text":")。它會下載所有的數據,並運行帶排名和不帶排名的搜索示例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在開始構建搜索引擎之前,我們需要一些非結構化的數據。我們將搜索英文維基百科中的文章摘要。維基百科被打包成一個約785MB的壓縮XML文件包,其中包含了約627萬篇摘要。我寫了一個簡單的函數("},{"type":"link","attrs":{"href":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/download.py?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/download.py"}]},{"type":"text","text":")用來下載XML壓縮包,當然你也可以手動下載這個文件。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"數據準備"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這個文件是一個包含所有摘要的大型XML文件。每一個摘要內容都包含在標籤中,看起來大致如下所示(我省略了我們不感興趣的標籤):"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"\n Wikipedia: London Beer Flood\n <url>https:\/\/en.wikipedia.org\/wiki\/London_Beer_Flood\n <abstract>The London Beer Flood was an accident at Meux & Co's Horse Shoe Brewery, London, on 17 October 1814. 
It took place when one of the wooden vats of fermenting porter burst.\n ...\n\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們感興趣的是title、url和abstract這幾個標籤。爲了方便訪問數據,我們將用Python數據類("},{"type":"link","attrs":{"href":"https:\/\/realpython.com\/python-data-classes\/?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/realpython.com\/python-data-classes\/"}]},{"type":"text","text":")來表示文檔。我們將添加一個屬性來連接標題和摘要內容,代碼可以在這裏找到("},{"type":"link","attrs":{"href":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/search\/documents.py?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/search\/documents.py"}]},{"type":"text","text":")。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"from dataclasses import dataclass\n@dataclass\nclass Abstract:\n \"\"\"Wikipedia abstract\"\"\"\n ID: int\n title: str\n abstract: str\n url: str\n @property\n def fulltext(self):\n return ' '.join([self.title, self.abstract])\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後,我們從XML中提取摘要數據,對其進行解析,並創建Abstract實例。我們將通過流的方式來讀取 
XML,不會將整個文件加載到內存中。我們將按照加載順序爲每個文檔分配一個ID(即第一個文檔ID=1,第二個文檔ID=2,以此類推)。相關代碼可以在這裏找到("},{"type":"link","attrs":{"href":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/load.py?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/bartdegoede\/python-searchengine\/blob\/master\/load.py"}]},{"type":"text","text":")。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import gzip\nfrom lxml import etree\nfrom search.documents import Abstract\ndef load_documents():\n # open a filehandle to the gzipped Wikipedia dump\n with gzip.open('data\/enwiki.latest-abstract.xml.gz', 'rb') as f:\n doc_id = 1\n # iterparse will yield the entire `doc` element once it finds the\n # closing `` tag\n for _, element in etree.iterparse(f, events=('end',), tag='doc'):\n title = element.findtext('.\/title')\n url = element.findtext('.\/url')\n abstract = element.findtext('.\/abstract')\n yield Abstract(ID=doc_id, title=title, url=url, abstract=abstract)\n doc_id += 1\n # the `element.clear()` call will explicitly free up the memory\n # used to store the element\n 
element.clear()\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"建立索引"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們將把這些數據保存成“倒排索引”。我們可以把它想象成一本書後面的索引,它有一個按字母順序排列的單詞和概念的清單,讀者可以在相應的頁碼上找到它們。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/6e\/03\/6ec764bb39b59b523e5b31255a5a0e03.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們需要創建一個字典,將語料庫所有的單詞與它們所在文檔ID映射起來,看起來是這樣的:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"{\n ...\n \"london\": [5245250, 2623812, 133455, 3672401, ...],\n \"beer\": [1921376, 4411744, 684389, 2019685, ...],\n \"flood\": [3772355, 2895814, 3461065, 5132238, ...],\n 
...\n}\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"注意,在上面的例子中,字典中的單詞都是小寫的。在構建索引之前,我們需要把原始文本分解爲單詞。我們首先將文本分解爲單詞,然後對每個單詞應用零個或多個過濾器(如小寫或詞幹篩選),以提高查詢與文本匹配的機率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/61\/91\/61fb7d8d6cdbb6c10ab06ee629a66391.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"解析"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們將進行非常簡單的解析,只根據空格來拆分文本。然後,我們對每個單詞進行過濾:將單詞轉成小寫,移除標點符號,移除英語中最常見的25個單詞(包括“維基百科”這個單詞,因爲它出現在每一個標題和摘要中),並提取詞幹(確保不同形式的同一個單詞映射到相同的詞幹,如brewery和breweries)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"分解和小寫過濾器非常簡單:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import Stemmer\nSTEMMER = Stemmer.Stemmer('english')\ndef tokenize(text):\n return text.split()\ndef lowercase_filter(text):\n return [token.lower() for token in tokens]\ndef stem_filter(tokens):\n return 
STEMMER.stemWords(tokens)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"使用正則表達式來移除標點符號:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"import re\nimport string\nPUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))\ndef punctuation_filter(tokens):\n return [PUNCTUATION.sub('', token) for token in tokens]\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"停頓詞是非常常見的單詞,(幾乎)會出現在語料庫的每一篇文檔中。因此,當我們在搜索它們時,它們不會對搜索有多大貢獻(因爲幾乎每個文檔都會匹配到),它們只會佔用更多的空間,所以我們會在進行索引時過濾掉它們。維基百科摘要語料庫的每個標題中都包含“Wikipedia”一詞,因此我們將把這個詞也添加到停頓詞清單中。我們還去掉了英語中最常見的25個單詞。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"# top 25 most common words in English and \"wikipedia\":\n# https:\/\/en.wikipedia.org\/wiki\/Most_common_words_in_English\nSTOPWORDS = set(['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',\n 'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',\n 'do', 'at', 'this', 'but', 'his', 'by', 'from', 'wikipedia'])\ndef stopword_filter(tokens):\n return [token for token in tokens if token not in 
STOPWORDS]\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"將所有這些過濾器組合在一起變成analyze函數,它將處理每個摘要中的文本,將文本拆分爲單詞(更確切地說是節點),然後對每一個單詞進行過濾。順序很重要,因爲我們使用了一個沒有經過提取詞幹的停頓詞清單,所以應該在stem_filter之前應用stopword_filter。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"def analyze(text):\n tokens = tokenize(text)\n tokens = lowercase_filter(tokens)\n tokens = punctuation_filter(tokens)\n tokens = stopword_filter(tokens)\n tokens = stem_filter(tokens)\n return [token for token in tokens if token]\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"爲語料庫建立索引"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們將創建一個Index類來存儲index和documents。documents字典按照ID來存儲數據類,index的鍵就是單詞,值就是單詞所在的文檔的ID:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"class Index:\n def __init__(self):\n self.index = {}\n self.documents = {}\n def index_document(self, document):\n if document.ID not in self.documents:\n self.documents[document.ID] = document\n for token in analyze(document.fulltext):\n if token not in self.index:\n self.index[token] = set()\n 
self.index[token].add(document.ID)\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"搜索"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"現在我們已經爲所有單詞建立了索引,接下來的搜索就要用到分析文檔時所使用的分析器,這樣就可以得到與索引中的單詞相匹配的單詞。對於每個單詞,我們將在字典中進行查找,查找單詞所在的文檔ID。每一個單詞都要這麼做,然後找出在所有這些集合中都存在的文檔ID(也就是說,目標文檔需要包含所有的查詢單詞)。然後,我們將獲取文檔ID結果列表,並從documents中獲取實際的數據。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"def _results(self, analyzed_query):\n return [self.index.get(token, set()) for token in analyzed_query]\ndef search(self, query):\n \"\"\"\n Boolean search; this will return documents that contain all words from the\n query, but not rank them (sets are fast, but unordered).\n \"\"\"\n analyzed_query = analyze(query)\n results = self._results(analyzed_query)\n documents = [self.documents[doc_id] for doc_id in set.intersection(*results)]\n return documents\nIn [1]: index.search('London Beer Flood')\nsearch took 0.16307830810546875 milliseconds\nOut[1]:\n[Abstract(ID=1501027, title='Wikipedia: Horse Shoe Brewery', abstract='The Horse Shoe Brewery was an English brewery in the City of Westminster that was established in 1764 and became a major producer of porter, from 1809 as Henry Meux & Co. It was the site of the London Beer Flood in 1814, which killed eight people after a porter vat burst.', url='https:\/\/en.wikipedia.org\/wiki\/Horse_Shoe_Brewery'),\n Abstract(ID=1828015, title='Wikipedia: London Beer Flood', abstract=\"The London Beer Flood was an accident at Meux & Co's Horse Shoe Brewery, London, on 17 October 1814. 
It took place when one of the wooden vats of fermenting porter burst.\", url='https:\/\/en.wikipedia.org\/wiki\/London_Beer_Flood')]\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣會讓搜索非常精確,特別是在使用較長的字符串進行搜索時(搜索包含的單詞越多,文檔中包含所有這些單詞的可能性就越小)。我們可以優化搜索函數,允許用戶指定當有一個單詞匹配時就算匹配整個搜索,以提高召回率(recall)而不是精確度:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"def search(self, query, search_type='AND'):\n \"\"\"\n Still boolean search; this will return documents that contain either all words\n from the query or just one of them, depending on the search_type specified.\n We are still not ranking the results (sets are fast, but unordered).\n \"\"\"\n if search_type not in ('AND', 'OR'):\n return []\n analyzed_query = analyze(query)\n results = self._results(analyzed_query)\n if search_type == 'AND':\n # all tokens must be in the document\n documents = [self.documents[doc_id] for doc_id in set.intersection(*results)]\n if search_type == 'OR':\n # only one token has to be in the document\n documents = [self.documents[doc_id] for doc_id in set.union(*results)]\n return documents\nIn [2]: index.search('London Beer Flood', search_type='OR')\nsearch took 0.02816295623779297 seconds\nOut[2]:\n[Abstract(ID=5505026, title='Wikipedia: Addie Pryor', abstract='| birth_place = London, England', url='https:\/\/en.wikipedia.org\/wiki\/Addie_Pryor'),\n Abstract(ID=1572868, title='Wikipedia: Tim Steward', abstract='|birth_place = London, United Kingdom', url='https:\/\/en.wikipedia.org\/wiki\/Tim_Steward'),\n Abstract(ID=5111814, title='Wikipedia: 1877 Birthday Honours', abstract='The 1877 Birthday Honours were appointments by Queen Victoria to various orders and honours to reward and highlight good works by citizens of the British Empire. 
The appointments were made to celebrate the official birthday of the Queen, and were published in The London Gazette on 30 May and 2 June 1877.', url='https:\/\/en.wikipedia.org\/wiki\/1877_Birthday_Honours'),\n ...\nIn [3]: len(index.search('London Beer Flood', search_type='OR'))\nsearch took 0.029065370559692383 seconds\nOut[3]: 49627\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"相關度"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們已經用Python實現了一個非常快的搜索引擎,但還少了個東西,那就是相關度。現在我們只返回一個無序的文檔列表,並由用戶來確定他們真正感興趣的是哪些文檔。如果返回的是一個大型的結果集,那將是一件很痛苦的事情,或者根本不可能確定哪些是用戶真正感興趣的(在我們的OR示例中,返回將近50000個結果)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相關度的概念是這樣來的:我們可以給每個文檔分配一個分數,表示它與查詢的匹配度,並根據這個分數進行排序。給文檔分配分數的一種簡單的方法是計算文檔出現檢索詞的頻率。畢竟,文檔出現某個檢索詞的次數越多,它就越有可能與我們的搜索相關!"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"檢索詞頻率"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們將Abstract數據類擴展一下,在建立索引時計算並存儲它的檢索詞頻率。這樣,當我們想對無序列表中的文檔進行排序時,就可以很容易地使用這些數字:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"# in documents.py\nfrom collections import Counter\nfrom .analysis import analyze\n@dataclass\nclass Abstract:\n # snip\n def analyze(self):\n # Counter will create a dictionary counting the unique values in an array:\n # {'london': 12, 'beer': 3, ...}\n self.term_frequencies = Counter(analyze(self.fulltext))\n def term_frequency(self, term):\n return self.term_frequencies.get(term, 
0)\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們要確保在建立索引時生成這些頻率計數:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"# in index.py we add `document.analyze()\ndef index_document(self, document):\n if document.ID not in self.documents:\n self.documents[document.ID] = document\n document.analyze()\n"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來我們要修改搜索函數,以便對結果集中的文檔進行排名。我們先從索引和文檔存儲中獲取文檔,對於結果集中的每個文檔,我們簡單地將每個檢索詞在該文檔中出現的頻率相加起來。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"def search(self, query, search_type='AND', rank=True):\n # snip\n if rank:\n return self.rank(analyzed_query, documents)\n return documents\ndef rank(self, analyzed_query, documents):\n results = []\n if not documents:\n return results\n for document in documents:\n score = sum([document.term_frequency(token) for token in analyzed_query])\n results.append((document, score))\n return sorted(results, key=lambda doc: doc[1], 
reverse=True)\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"逆文本頻率"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這樣已經好多了,但仍然有一些明顯的不足。在評估搜索相關度時,我們認爲所有的搜索條件都是等價的。但實際上,某些檢索詞可能只有很小的識別度,甚至沒有。例如,如果一個文檔集合大都包含了“啤酒”這個單詞,“啤酒”這個單詞會經常出現在幾乎每個文檔中(我們已經試圖通過從索引中移除25個最常見的英語單詞來解決這個問題)。對於這種情況,搜索“啤酒”實質上是進行了另一種隨機的排序。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"爲了解決這個問題,我們將在評分算法中添加另一個組件,以減少索引中出現頻率較高的檢索詞對最終分數的影響。我們可以使用檢索詞集合頻率(即這個檢索詞在所有文檔中出現的頻率),但實際上使用的是逆文本頻率(即索引中有多少個文檔包含這個檢索詞)。因爲我們要對文檔進行排序,所以需要文檔級別的統計信息。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們用索引中的文檔數量(N)除以包含檢索詞的文檔數量,並對其取對數,得出檢索詞的逆文本頻率。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.infoq.cn\/resource\/image\/70\/47\/70c798ce308250a2207947314be1b847.png","alt":null,"title":"","style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":"","fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然後,在進行排名時,我們將檢索詞頻率與逆文本頻率相乘,這樣語料庫中出現較少的檢索詞將對相關度得分有更大的影響。我們很容易就能根據索引中可用的數據計算出逆文本頻率:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"plain"},"content":[{"type":"text","text":"# index.py\nimport math\ndef document_frequency(self, token):\n return len(self.index.get(token, set()))\ndef inverse_document_frequency(self, token):\n # Manning, Hinrich and Schütze use log10, so we do too, 
even though it\n # doesn't really matter which log we use anyway\n # https:\/\/nlp.stanford.edu\/IR-book\/html\/htmledition\/inverse-document-frequency-1.html\n return math.log10(len(self.documents) \/ self.document_frequency(token))\ndef rank(self, analyzed_query, documents):\n results = []\n if not documents:\n return results\n for document in documents:\n score = 0.0\n for token in analyzed_query:\n tf = document.term_frequency(token)\n idf = self.inverse_document_frequency(token)\n score += tf * idf\n results.append((document, score))\n return sorted(results, key=lambda doc: doc[1], reverse=True)\n"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"未來的工作"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這是一個很基礎的搜索引擎,只需要幾行Python代碼!你可以在Github上找到所有的代碼("},{"type":"link","attrs":{"href":"https:\/\/github.com\/bartdegoede\/python-searchengine?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/github.com\/bartdegoede\/python-searchengine"}]},{"type":"text","text":"),我還提供了一個實用函數,可以下載維基百科摘要並構建索引。安裝好必要的組件,在Python控制檯中運行它,並享受數據結構和搜索給你帶來的樂趣吧。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當然,這個項目是爲了解釋搜索的概念,以及搜索爲何會如此之快(我可以用Python這樣的“慢”語言在我的筆記本電腦上搜索627萬個文檔,並進行排名)。一些開源項目(如Lucene)利用了非常高效的數據結構,甚至優化了磁盤搜索,還有一些項目(如Elasticsearch和Solr)將Lucene擴展到可以運行在數百臺甚至數千臺機器上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"我們也可以考慮對這個只具有基本功能的搜索引擎做一些擴展。例如,我們假設文檔中的每個字段對相關度都有相同的貢獻,而標題中的檢索詞匹配應該比摘要中的檢索詞匹配具有更大的權重。另外,我們也可以對解析器進行擴展,既可以匹配所有檢索詞,有可以匹配單個檢索詞。我們也可以忽略某些檢索詞,或者支持檢索詞之間的AND或OR關係。我們也可以將索引持久化到磁盤上,打破單檯筆記本電腦內存的限制。"}]},{"type":"paragraph","attrs":{"indent":0,"number"
:0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"原文鏈接:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/bart.degoe.de\/building-a-full-text-search-engine-150-lines-of-code\/?fileGuid=4icAb1em6vATdAA7","title":"","type":null},"content":[{"type":"text","text":"https:\/\/bart.degoe.de\/building-a-full-text-search-engine-150-lines-of-code\/"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}</abstract></url>