Xapian Search Engine: Introduction and Getting Started

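Xapian is an open-source search engine library written in C++, with bindings that allow it to be used from many programming languages. It is a highly adaptable toolkit that lets developers add advanced indexing and search capabilities to their own applications with little effort. Xapian supports several weighting models and a rich set of boolean query operators. At the time of writing, the latest stable release is 1.4.24, published on November 6, 2023.

Xapian has been open source for some 20 years and is stable overall. It offers fewer features than Lucene, but it is mature and usable. Its one drawback is the GPL v2 license, whose copyleft terms can complicate use in closed-source products.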

Installation

Building and installing the core

Download the latest tarball, extract it, then build and install:

tar xf xapian-core-1.4.24.tar.xz 
cd xapian-core-1.4.24/
./configure --prefix=/opt
make
make install

Installing the language bindings

First download xapian-bindings-1.4.24, then extract and build it:

tar xf xapian-bindings-1.4.24.tar.xz 
cd xapian-bindings-1.4.24/
./configure XAPIAN_CONFIG=/data/xapian-core-1.4.24/xapian-config --with-java --with-python3
make
make install

When running configure, XAPIAN_CONFIG must point to the xapian-config script from the core build above.
--with-java --with-python3 builds only the Java and Python 3 bindings.

Usage

C++ usage

In the core source directory, create a demo directory and add src/main.cpp:

#include <iostream>
#include <string>
#include "xapian.h"

const std::string index_data_path = "./index_data";
const std::string doc_id1 = "doc1";
const std::string doc_title1 = "如何 構建 搜索引擎 搜索 引擎";
const std::string doc_content1 = "how to build search engine";
const std::string doc_id2 = "doc2";
const std::string doc_title2 = "搜索 是 一個 基本 技能";
const std::string doc_content2 = "search is a basic skill";

const int DOC_ID_FIELD = 101;

void build_index()
{
	std::cout << "--- build_index" << std::endl;

	Xapian::WritableDatabase db(index_data_path, Xapian::DB_CREATE_OR_OPEN);

	Xapian::TermGenerator indexer;

	Xapian::Document doc1;
	doc1.add_value(DOC_ID_FIELD, doc_id1); // custom property
	doc1.set_data(doc_content1); // payload
	indexer.set_document(doc1);
	indexer.index_text(doc_title1); // accepts space-separated text, e.g. pre-tokenized terms or a whole article
	db.add_document(doc1);

	Xapian::Document doc2;
	doc2.add_value(DOC_ID_FIELD, doc_id2); // custom property
	doc2.set_data(doc_content2);
	indexer.set_document(doc2);
	indexer.index_text(doc_title2);
	db.add_document(doc2);

	db.commit();
}

void search_op_or()
{
	std::cout << "--- search_op_or" << std::endl;

	Xapian::Database db(index_data_path);

	Xapian::Enquire enquire(db);
	Xapian::QueryParser qp;

	// std::string query_str = "search engine";
	// Xapian::Query query = qp.parse_query(query_str);
	Xapian::Query term1("搜索");
	Xapian::Query term2("引擎");
	Xapian::Query query = Xapian::Query(Xapian::Query::OP_OR, term1, term2);

	std::cout << "query is: " << query.get_description() << std::endl;

	enquire.set_query(query);

	Xapian::MSet matches = enquire.get_mset(0, 10); // find top 10 results
	std::cout << matches.get_matches_estimated() << " results found" << std::endl;
	std::cout << "matches 1-" << matches.size() << std::endl;

	for (Xapian::MSetIterator it = matches.begin(); it != matches.end(); ++it)
	{
		Xapian::Document doc = it.get_document();
		std::string doc_id = doc.get_value(DOC_ID_FIELD);
		std::cout << "rank: " << it.get_rank() + 1 << ", weight: " << it.get_weight() << ", match_ratio: " << it.get_percent() << "%, match_no: " << *it << ", doc_id: " << doc_id << ", doc content: [" << doc.get_data() << "]\n" << std::endl;
	}
}

void search_op_and()
{
	std::cout << "--- search_op_and" << std::endl;

	Xapian::Database db(index_data_path);

	Xapian::Enquire enquire(db);
	Xapian::QueryParser qp;

	Xapian::Query term1("搜索");
	Xapian::Query term2("技能");
	Xapian::Query query = Xapian::Query(Xapian::Query::OP_AND, term1, term2);

	std::cout << "query is: " << query.get_description() << std::endl;

	enquire.set_query(query);

	Xapian::MSet matches = enquire.get_mset(0, 10); // first page: offset 0, up to 10 results
	std::cout << matches.get_matches_estimated() << " results found" << std::endl;
	std::cout << "matches 1-" << matches.size() << std::endl;

	for (Xapian::MSetIterator it = matches.begin(); it != matches.end(); ++it)
	{
		Xapian::Document doc = it.get_document();
		std::string doc_id = doc.get_value(DOC_ID_FIELD);
		std::cout << "rank: " << it.get_rank() + 1 << ", weight: " << it.get_weight() << ", match_ratio: " << it.get_percent() << "%, match_no: " << *it << ", doc_id: " << doc_id << ", doc content: [" << doc.get_data() << "]\n" << std::endl;
	}
}

int main(int argc, char** argv)
{
	std::cout << "hello xapian" << std::endl;

	build_index();
	search_op_or();
	search_op_and();

	return 0;
}

CMake file

cmake_minimum_required(VERSION 3.24)

project(xapian_demo)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include_directories(
    ../include
)

link_directories(
    ../.libs
)

file(GLOB SRC
    src/*.h
    src/*.cpp
)

add_executable(${PROJECT_NAME} ${SRC})

target_link_libraries(${PROJECT_NAME}
    xapian uuid
)

Build and test:

#cmake .
-- Configuring done
-- Generating done
-- Build files have been written to: /data/xapian-core-1.4.24/demo

#make
Consolidate compiler generated dependencies of target xapian_demo
[ 50%] Building CXX object CMakeFiles/xapian_demo.dir/src/main.cpp.o
[100%] Linking CXX executable xapian_demo
[100%] Built target xapian_demo

#./xapian_demo 
hello xapian
--- build_index
--- search_op_or
query is: Query((搜索 OR 引擎))
2 results found
matches 1-2
rank: 1, weight: 0.500775, match_ratio: 100%, match_no: 1, doc_id: doc1, doc content: [how to build search engine]

rank: 2, weight: 0.0953102, match_ratio: 19%, match_no: 2, doc_id: doc2, doc content: [search is a basic skill]

--- search_op_and
query is: Query((搜索 AND 技能))
1 results found
matches 1-1
rank: 1, weight: 0.500775, match_ratio: 100%, match_no: 2, doc_id: doc2, doc content: [search is a basic skill]
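
To see why the OR query returns both documents while the AND query returns only doc2, here is a toy inverted index in Python. It is only a sketch of the boolean matching step: it does no ranking and is not how Xapian stores its postings, and the function names search_or and search_and are made up for this illustration.

```python
from collections import defaultdict

# The two demo titles, already split into terms by spaces.
docs = {
    1: "如何 構建 搜索引擎 搜索 引擎",
    2: "搜索 是 一個 基本 技能",
}

# Inverted index: term -> set of document numbers containing it.
index = defaultdict(set)
for docno, title in docs.items():
    for term in title.split():
        index[term].add(docno)


def search_or(*terms):
    """Documents containing at least one of the terms."""
    return sorted(set().union(*(index.get(t, set()) for t in terms)))


def search_and(*terms):
    """Documents containing every one of the terms."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


print(search_or("搜索", "引擎"))   # [1, 2]
print(search_and("搜索", "技能"))  # [2]
```

The document numbers line up with the match_no values in the output above: OR takes the union of the posting lists, AND takes their intersection.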

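Python usage

The C++ test above used only a couple of documents; with Python, let's apply some load.
The data source is XML containing more than a million records, in the sphinxxml format used by Manticore: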

<sphinx:document id="3669513577616591688"><domain_rank><![CDATA[0]]></domain_rank><page_rank><![CDATA[0]]></page_rank><author_rank><![CDATA[0]]></author_rank><update_ts><![CDATA[1671120000000]]></update_ts><crawl_ts><![CDATA[1702765056760]]></crawl_ts><index_ts><![CDATA[1703141806692]]></index_ts><freq><![CDATA[0]]></freq><pv><![CDATA[0]]></pv><comment><![CDATA[0]]></comment><forward><![CDATA[0]]></forward><up><![CDATA[0]]></up><title_lac><![CDATA[南充市 首席 風水 大師   羅 李華   百科 詞典]]></title_lac><title_jieba><![CDATA[南充市 首席 風水 大師   羅李華   百科詞典]]></title_jieba><summary_lac><![CDATA[百科 詞典 ， 主要 收錄 知名 人物 、 企業 、 行業 相關 詞條 爲主 ， 是 由 各 大網民 申請 供稿 ， 由 專職 人員 嚴格 審覈 編輯 而成 ， 力求 做到 每一個 詞條 權威 、 真實 、 客觀 、 專業 ， 旨在 打造 一個 值得 大家 信賴 的 權威 百科 平臺 。]]></summary_lac><summary_jieba><![CDATA[百科詞典 ， 主要 收錄 知名 人物 、 企業 、 行業 相關 詞條 爲主 ， 是 由 各大 網民 申請 供稿 ， 由 專職人員 嚴格 審覈 編輯 而成 ， 力求 做到 每 一個 詞條 權威 、 真實 、 客觀 、 專業 ， 旨在 打造 一個 值得 大家 信賴 的 權威 百科 平臺 。]]></summary_jieba><url><![CDATA[https://www.baikecidian.cn/h-nd-9709.html]]></url><domain><![CDATA[www.baikecidian.cn]]></domain><keywords_lac><![CDATA[]]></keywords_lac><image_link><![CDATA[0]]></image_link><post_ts><![CDATA[1538215160000]]></post_ts></sphinx:document>

So we first write a reader:

import xmltodict

def read_sphinx_xml(file_path):
    # Accumulate lines until the closing tag, then parse one record at a time.
    xml_str = ''
    end_tag = '</sphinx:document>'
    # The with block closes the file even if the generator is abandoned early.
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            xml_str = xml_str + line
            if end_tag in line:
                try:
                    xml_dict = xmltodict.parse(xml_str)
                    yield xml_dict['sphinx:document']
                except Exception as e:
                    # Report the bad record and keep going.
                    print(xml_str)
                    print(e)
                xml_str = ''
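
To make the shape of each yielded record concrete, here is a standard-library-only sketch that parses a single record into a similar flat dict, with the document ID under '@id' as the indexing code expects. It is not the reader used in this article: parse_sphinx_document and the namespace URI are hypothetical, and the sample record is a shortened, made-up one. Because the data never declares the sphinx prefix, the sketch wraps the record in a root element that binds it.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the data never declares the "sphinx" prefix,
# so we bind it ourselves to keep the XML parser from rejecting the record.
SPHINX_NS = "urn:example:sphinx"


def parse_sphinx_document(xml_str):
    """Parse one <sphinx:document> record into a flat dict."""
    wrapped = f'<root xmlns:sphinx="{SPHINX_NS}">{xml_str}</root>'
    doc = ET.fromstring(wrapped)[0]
    fields = {"@id": doc.get("id")}
    for child in doc:
        # An empty CDATA section yields None rather than an empty string.
        fields[child.tag] = child.text
    return fields


sample = (
    '<sphinx:document id="42">'
    "<title_jieba><![CDATA[搜索 引擎]]></title_jieba>"
    "<keywords_lac><![CDATA[]]></keywords_lac>"
    "</sphinx:document>"
)
fields = parse_sphinx_document(sample)
print(fields["@id"], fields["title_jieba"], fields["keywords_lac"])
```

Empty fields come back as None, which is why code consuming these records should check for None before using a title or summary.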

Then we call the Xapian bindings to build the index:

import os
import xapian

def list_files(path):
    return [item for item in os.listdir(path) if ".txt" in item]

DOC_ID_FIELD = 101
DOC_TITLE_FIELD = 102

### Start of example code.
def index(datapath, dbpath):
    # Create or open the database we're going to be writing to.
    db = xapian.WritableDatabase(dbpath, xapian.DB_CREATE_OR_OPEN)
    termgenerator = xapian.TermGenerator()
    count = 0
    for file in list_files(datapath):
        print(f'start load data from {file}')
        for fields in read_sphinx_xml(f"{datapath}/{file}"):
            title = fields.get('title_jieba', '')
            summary = fields.get('summary_jieba', '')
            identifier = fields.get('@id', '')
            
            if summary is None:
                summary = ''
            if title is None:
                continue
            
            count = count + 1

            doc = xapian.Document()
            termgenerator.set_document(doc)

            # Weight the title 5x by repeating it (space-separated so tokens do not fuse)
            termgenerator.index_text((title + ' ') * 5 + summary)
            # Store field values
            doc.add_value(DOC_ID_FIELD, identifier)
            doc.add_value(DOC_TITLE_FIELD, title)
            doc.set_data(identifier + ' ' + title)
  
            # Use the ID as a unique boolean term so re-indexing replaces the old document.
            idterm = u"Q" + identifier
            doc.add_boolean_term(idterm)
            db.replace_document(idterm, doc)
            if count % 10000 == 0:
                print(f'loaded {count}')

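Notes:

Xapian's support for fields is limited; fields have to be implemented with term prefixes. For this test, the title is simply repeated five times and mixed with the summary before indexing.
doc.add_value stores a field value, which can be read back later with doc.get_value.
doc.set_data stores the document's full information for display; this data is kept in a separate document data file.
add_boolean_term together with replace_document lets a document with the same ID overwrite the earlier one.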
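
The title weighting and the unique ID term described in these notes can be sketched as two small helpers. This is a minimal illustration in plain Python, assuming whitespace-tokenized text; boosted_text and id_term are hypothetical names, not part of the Xapian API. Note that the repeated copies need a separator, otherwise the last token of one copy fuses with the first token of the next.

```python
def boosted_text(title, summary, factor=5):
    """Repeat the title `factor` times, space-separated, then append the summary."""
    parts = [title] * factor
    if summary:
        parts.append(summary)
    return " ".join(parts)


def id_term(identifier):
    """Unique boolean term for a document: "Q" followed by its ID."""
    return "Q" + identifier


print(boosted_text("搜索 引擎", "how to build", factor=2))
# 搜索 引擎 搜索 引擎 how to build
print(id_term("42"))
# Q42
```

The string from boosted_text is what would be handed to the term generator, and the string from id_term is what would be passed to both add_boolean_term and replace_document.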

Now let's look at querying:

#!/usr/bin/env python

import json
import sys
import xapian
import support  # helper module from the Xapian getting-started examples; provides log_matches
import time

def search(dbpath, querystring, offset=0, pagesize=10):
    # offset - defines starting point within result set
    # pagesize - defines number of records to retrieve

    # Open the database we're going to search.
    db = xapian.Database(dbpath)

    # Set up a QueryParser (no stemmer or prefixes are configured here)
    queryparser = xapian.QueryParser()

    query = queryparser.parse_query(querystring)
    print(query)
    # Use an Enquire object on the database to run the query
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    start_time = time.time()
    # And print out something about each match
    matches = []
    for match in enquire.get_mset(offset, pagesize):
        print(f'rank: {match.rank}  weight: {match.weight} docid: {match.document.get_value(101).decode("utf-8")} title: {match.document.get_value(102).decode("utf-8")}')
        # print(match.document.get_data().decode('utf8'))
        matches.append(match.docid)
    print(f'cost time {1000 * (time.time() - start_time)}ms')
    # Finally, make sure we log the query and displayed results
    support.log_matches(querystring, offset, pagesize, matches)

if len(sys.argv) < 3:
    print("Usage: %s DBPATH QUERYTERM..." % sys.argv[0])
    sys.exit(1)

search(dbpath = sys.argv[1], querystring = " ".join(sys.argv[2:]))

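Explanation:

xapian.QueryParser() parses the query string; + and - can be used to require or exclude terms, and the default operator is OR.
The search still goes through an xapian.Enquire object, with results fetched via get_mset.
Stored field values are read with document.get_value and the stored document data with get_data; both return bytes, so decode('utf8') is needed before display.

Now let's test queries against an index of more than 3.3 million documents, searching for 21 世紀十大奇蹟都有哪些 ("What are the ten great wonders of the 21st century?").

The default OR query takes 46 ms: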

(base) [root@dev demo]#python3 py_search.py ./test_index_2/ '21 世紀 十大 奇蹟 都 有 哪些'
Query((21@1 OR 世紀@2 OR 十大@3 OR 奇蹟@4 OR 都@5 OR 有@6 OR 哪些@7))
rank: 0  weight: 36.96501079176272 docid: 270926605591973127 title: 21 世紀 的 十大 奇蹟 ( 王金寶 )
rank: 1  weight: 26.66735387825444 docid: 1202595084889677840 title: 淮安 十大 裝修 公司 排行榜 都 有 哪些
rank: 2  weight: 26.637435058757113 docid: 4515279401098254828 title: 十大 輕奢 首飾 品牌 耳環 ( 十大 輕奢 首飾 品牌 耳環 排名 )
rank: 3  weight: 25.896035383457647 docid: 2734857435606641662 title: 中國 十大 奇蹟 都 是 什麼
rank: 4  weight: 25.705459264178575 docid: 7786914994161493217 title: 每個 民族 都 有 傷痕 和 血淚 ( 二 ) , 再說 說 曾經 創造 奇蹟 的 蒙古 帝國 !
rank: 5  weight: 25.5095343276925 docid: 1500823194476917788 title: 真正 復古 的 奇蹟 手遊安卓 下載 2022   十大 真正 復古 的 奇蹟 手遊 推薦   ...
rank: 6  weight: 25.47914915723924 docid: 868651613852701914 title: 21 世紀 有 哪些 著名 的 科學家 有 哪些 ? 急 ?
rank: 7  weight: 25.41860730241055 docid: 7128947999947583631 title: 西安 臨潼區 必玩 十大 景區 , 西安 臨潼區 有 哪些 景點 推薦 、 旅遊 ...
rank: 8  weight: 25.16026635261191 docid: 6074515952166234396 title: 世界 建築史 上 堪稱 逆天 的 十大 工程 , 個個 都 是 奇蹟 !
rank: 9  weight: 24.89609264689645 docid: 5578567283356182005 title: 20 世紀 的 科技 發明 有 哪些   20 世紀 有 哪些 重大 科學 發現 和 科學   ...
cost time 46.19002342224121ms
'21 世紀 十大 奇蹟 都 有 哪些'[0:10] = 461487 2291460 457410 1416736 3245773 1156355 3030607 2498966 2025338 254698

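How can we reduce the query time? We can start with a prediction: 十大 and 奇蹟 are the core terms here, so we can require them to appear. The query string then becomes: 21 世紀 +十大 +奇蹟 都 有 哪些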

(base) [root@dev demo]#python3 py_search.py ./test_index_2/ '21 世紀 +十大 +奇蹟 都 有 哪些'
Query(((十大@3 AND 奇蹟@4) AND_MAYBE (21@1 OR 世紀@2 OR (都@5 OR 有@6 OR 哪些@7))))
rank: 0  weight: 36.96293887882541 docid: 270926605591973127 title: 21 世紀 的 十大 奇蹟 ( 王金寶 )
rank: 1  weight: 25.89233097995836 docid: 2734857435606641662 title: 中國 十大 奇蹟 都 是 什麼
rank: 2  weight: 25.505700206213298 docid: 1500823194476917788 title: 真正 復古 的 奇蹟 手遊安卓 下載 2022   十大 真正 復古 的 奇蹟 手遊 推薦   ...
rank: 3  weight: 25.41629259671702 docid: 7128947999947583631 title: 西安 臨潼區 必玩 十大 景區 , 西安 臨潼區 有 哪些 景點 推薦 、 旅遊 ...
rank: 4  weight: 25.156904086936752 docid: 6074515952166234396 title: 世界 建築史 上 堪稱 逆天 的 十大 工程 , 個個 都 是 奇蹟 !
rank: 5  weight: 24.62510506307912 docid: 193253728534326320 title: 十大 兇夢有 哪些 ? 十大 兇夢 列表 !   觀音 靈籤 算命網
rank: 6  weight: 23.192754028779266 docid: 7179285817750982899 title: 十大 電腦 恐怖 遊戲 排行   好玩 的 恐怖 遊戲 有 哪些
rank: 7  weight: 23.14557703440898 docid: 8499116988738957144 title: 十大 爆火 的 奇蹟 類手遊 排行榜   最火 的 奇蹟 類手遊 排名 前十   特 ...
rank: 8  weight: 22.274870321417836 docid: 1134007698166133600 title: 世界 十大 著名 建築物   感受 人類 的 輝煌 奇蹟   建築   第一 排行榜
rank: 9  weight: 22.214192030795594 docid: 7678030174605825797 title: 世界 十大 奇蹟 動物 : 愛爾蘭 大鹿 死而復生   世界 十大 建築 奇蹟
cost time 2.651214599609375ms
'21 世紀 +十大 +奇蹟 都 有 哪些'[0:10] = 461487 1416736 1156355 2498966 2025338 173861 448901 723659 1029533 1830781

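The query now takes under 3 ms (2.65 ms versus 46 ms), and the results are better. The required terms limit matching to documents that contain both 十大 and 奇蹟, which is most likely why far fewer candidates have to be scored.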
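
The query rewrite used above can be sketched as a small helper that prefixes the chosen core terms with +. This is only an illustration: mark_required is a hypothetical name, and deciding which terms are core is assumed to happen elsewhere (here they are simply passed in).

```python
def mark_required(tokens, core_terms):
    """Prefix every token found in core_terms with '+' and join with spaces."""
    return " ".join("+" + t if t in core_terms else t for t in tokens)


tokens = "21 世紀 十大 奇蹟 都 有 哪些".split()
print(mark_required(tokens, {"十大", "奇蹟"}))
# 21 世紀 +十大 +奇蹟 都 有 哪些
```

The resulting string can be passed to the search script unchanged, since the query parser already understands the + prefix.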

Summary

That concludes this introduction to Xapian; later articles will dig into its internals.
