RAG evaluation dataset construction is still at an early stage, and professional datasets tailored to specific domains and scenarios are scarce. The widely used MS MARCO and BEIR datasets have limited coverage, and real-world effectiveness may not match benchmark scores. The most authoritative retrieval leaderboard today is HuggingFace MTEB. In this article we learn how to use MTEB, and use it to evaluate the recall performance of an in-house model.
MTEB (Massive Text Embedding Benchmark) is a broad benchmark for text embeddings. It offers dozens of datasets across many languages, covering NLP tasks such as text classification, clustering, retrieval, and semantic textual similarity. MTEB maintains a public leaderboard where researchers can submit results and track progress, and it exposes a simple API that makes it easy to compare a model against the benchmark.
Installation
pip install mteb
Getting Started
- The simplest usage is to write Python code directly (see scripts/run_mteb_english.py and the mteb/mtebscripts repo for more):
from mteb import MTEB
from sentence_transformers import SentenceTransformer
# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")
- Alternatively, use the official CLI:
mteb --available_tasks
mteb -m average_word_embeddings_komninos \
-t Banking77Classification \
--output_folder results/average_word_embeddings_komninos \
--verbosity 3
Advanced Usage
Selecting evaluation datasets
MTEB lets you restrict which datasets are evaluated, in any of the following ways:
- By task type, e.g. clustering or classification:
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
- By category, e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph):
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence datasets
- By text language:
evaluation = MTEB(task_langs=["en", "de"]) # Only select datasets which are "en", "de" or "en-de"
You can also select specific language subsets of a dataset:
from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining
evaluation = MTEB(tasks=[
    AmazonReviewsClassification(langs=["en", "fr"]),  # Only load "en" and "fr" subsets of Amazon Reviews
    BUCCBitextMining(langs=["de-en"]),  # Only load "de-en" subset of BUCC
])
Presets are also available for common task collections:
from mteb import MTEB_MAIN_EN
evaluation = MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])
Custom evaluation splits
Some datasets have multiple splits, and evaluating all of them can be slow. You can pass eval_splits to cut evaluation time; the example below evaluates only the test split.
evaluation.run(model, eval_splits=["test"])
Custom evaluation models
To evaluate a custom model, define a class that implements a single encode function: it takes a list of sentences and returns a list of embeddings (np.array, torch.tensor, etc.). The mteb/mtebscripts repo has reference implementations.
class MyModel():
    def encode(self, sentences, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding
        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
If queries and documents need different encoding methods, provide separate encode_queries and encode_corpus methods instead; a sketch for handling the dict-shaped corpus entries follows the class below.
class MyModel():
    def encode_queries(self, queries, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given queries.
        Args:
            queries (`List[str]`): List of queries to encode
            batch_size (`int`): Batch size for the encoding
        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given queries
        """
        pass

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        """
        Returns a list of embeddings for the given corpus.
        Args:
            corpus (`List[str]` or `List[Dict[str, str]]`): List of documents to encode
                or list of dictionaries with keys "title" and "text"
            batch_size (`int`): Batch size for the encoding
        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given documents
        """
        pass
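Corpus entries may arrive either as plain strings or as {"title": ..., "text": ...} dicts, so encode_corpus usually normalizes them first. A minimal sketch of that normalization (our own illustration, not MTEB code):

def _corpus_entry_to_text(entry):
    # Corpus items are raw strings or dicts with "title" and "text" keys;
    # join the two fields when the dict form is used.
    if isinstance(entry, dict):
        return (entry.get("title", "") + " " + entry["text"]).strip()
    return entry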
Custom evaluation tasks (datasets)
To add a new task, implement a new class that inherits from the AbsTask subclass matching the task type (e.g. AbsTaskReranking for reranking tasks). The supported task types can be found in the mteb repository.
For example, a custom reranking task:
from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer

class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
Source Code Analysis
Retrieval (recall) evaluation
Retrieval evaluation is implemented by the RetrievalEvaluator class.
def __init__(
    self,
    queries: Dict[str, str],  # qid => query
    corpus: Dict[str, str],  # cid => doc
    relevant_docs: Dict[str, Set[str]],  # qid => Set[cid]
    corpus_chunk_size: int = 50000,
    mrr_at_k: List[int] = [10],
    ndcg_at_k: List[int] = [10],
    accuracy_at_k: List[int] = [1, 3, 5, 10],
    precision_recall_at_k: List[int] = [1, 3, 5, 10],
    map_at_k: List[int] = [100],
    show_progress_bar: bool = False,
    batch_size: int = 32,
    name: str = "",
    score_functions: List[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = {
        "cos_sim": cos_sim,
        "dot": dot_score,
    },  # Score function, higher=more similar
    main_score_function: str = None,
    limit: int = None,
    **kwargs
):
    super().__init__(**kwargs)
    self.queries_ids = []
    for qid in queries:
        if qid in relevant_docs and len(relevant_docs[qid]) > 0:
            self.queries_ids.append(qid)
            if limit and len(self.queries_ids) >= limit:
                break

    self.queries = [queries[qid] for qid in self.queries_ids]
    self.corpus_ids = list(corpus.keys())
    self.corpus = [corpus[cid] for cid in self.corpus_ids]
    self.relevant_docs = relevant_docs
    self.corpus_chunk_size = corpus_chunk_size
    self.mrr_at_k = mrr_at_k
    self.ndcg_at_k = ndcg_at_k
    self.accuracy_at_k = accuracy_at_k
    self.precision_recall_at_k = precision_recall_at_k
    self.map_at_k = map_at_k
    self.show_progress_bar = show_progress_bar
    self.batch_size = batch_size
    self.name = name
    self.score_functions = score_functions
    self.score_function_names = sorted(list(self.score_functions.keys()))
    self.main_score_function = main_score_function
The constructor's most important parameters:
- queries: Dict[str, str], mapping each qid to its query text
- corpus: Dict[str, str], mapping each cid to its document text
- relevant_docs: Dict[str, Set[str]], mapping each qid to the set of relevant cids
So a custom evaluation task has to supply exactly these three structures, as in the toy sketch below.
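For illustration, a minimal toy-data sketch (hypothetical ids and texts) of the three structures:

queries = {"q1": "what is MTEB"}  # qid => query text
corpus = {
    "d1": "MTEB is a massive text embedding benchmark...",
    "d2": "an unrelated document",
}  # cid => doc text
relevant_docs = {"q1": {"d1"}}  # qid => set of relevant cids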
The actual evaluation happens in compute_metrics:
def compute_metrics(self, model, corpus_model=None, corpus_embeddings: torch.Tensor = None) -> Dict[str, float]:
    if corpus_model is None:
        corpus_model = model

    max_k = max(
        max(self.mrr_at_k),
        max(self.ndcg_at_k),
        max(self.accuracy_at_k),
        max(self.precision_recall_at_k),
        max(self.map_at_k),
    )

    # Compute embedding for the queries
    logger.info("Encoding the queries...")
    # We don't know if encode has the kwargs show_progress_bar
    kwargs = {
        "show_progress_bar": self.show_progress_bar
    } if "show_progress_bar" in inspect.signature(model.encode).parameters else {}
    query_embeddings = np.asarray(model.encode(self.queries, batch_size=self.batch_size, **kwargs))

    queries_result_list = {}
    for name in self.score_functions:
        queries_result_list[name] = [[] for _ in range(len(query_embeddings))]

    # Iterate over chunks of the corpus
    logger.info("Encoding chunks of corpus, and computing similarity scores with queries...")
    for corpus_start_idx in trange(
        0,
        len(self.corpus),
        self.corpus_chunk_size,
        desc="Corpus Chunks",
        disable=not self.show_progress_bar,
    ):
        # Encode chunk of corpus
        if corpus_embeddings is None:
            corpus_end_idx = min(corpus_start_idx + self.corpus_chunk_size, len(self.corpus))
            sub_corpus_embeddings = np.asarray(corpus_model.encode(
                self.corpus[corpus_start_idx:corpus_end_idx],
                batch_size=self.batch_size,
            ))
        else:
            corpus_end_idx = min(corpus_start_idx + self.corpus_chunk_size, len(corpus_embeddings))
            sub_corpus_embeddings = corpus_embeddings[corpus_start_idx:corpus_end_idx]

        # Compute cosine similarites
        for name, score_function in self.score_functions.items():
            pair_scores = score_function(query_embeddings, sub_corpus_embeddings)

            # Get top-k values
            pair_scores_top_k_values, pair_scores_top_k_idx = torch.topk(
                pair_scores,
                min(max_k, len(pair_scores[0])),
                dim=1,
                largest=True,
                sorted=False,
            )
            pair_scores_top_k_values = pair_scores_top_k_values.cpu().tolist()
            pair_scores_top_k_idx = pair_scores_top_k_idx.cpu().tolist()

            for query_itr in range(len(query_embeddings)):
                for sub_corpus_id, score in zip(
                    pair_scores_top_k_idx[query_itr],
                    pair_scores_top_k_values[query_itr],
                ):
                    corpus_id = self.corpus_ids[corpus_start_idx + sub_corpus_id]
                    queries_result_list[name][query_itr].append({"corpus_id": corpus_id, "score": score})

    # Compute scores
    logger.info("Computing metrics...")
    scores = {name: self._compute_metrics(queries_result_list[name]) for name in self.score_functions}

    return scores
Its parameters: model is the query embedding model, and corpus_model is passed when documents use a separate embedding model (otherwise the query model is reused). The flow is:
- First, compute the query embeddings:
query_embeddings = np.asarray(model.encode(self.queries, batch_size=self.batch_size, **kwargs))
- Then compute corpus_embeddings, one corpus chunk at a time
- For each score_function, take the top-k scores and collect the results in queries_result_list
- Finally, compute metrics from the retrieval results: _compute_metrics produces "mrr@k", "ndcg@k", "accuracy@k", "precision_recall@k", "map@k" and related scores; see the toy MRR sketch after this list
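To make the metrics concrete, here is a toy sketch of MRR@k over one query's ranked corpus ids (our own illustration of the formula, not the library's _compute_metrics source):

def mrr_at_k(ranked_corpus_ids, relevant_ids, k=10):
    # Reciprocal rank of the first relevant hit within the top k, else 0.
    for rank, cid in enumerate(ranked_corpus_ids[:k], start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0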
Reranking evaluation
Reranking is implemented by the RerankingEvaluator class.
class RerankingEvaluator(Evaluator):
    """
    This class evaluates a SentenceTransformer model for the task of re-ranking.
    Given a query and a list of documents, it computes the score [query, doc_i] for all possible
    documents and sorts them in decreasing order. Then, MRR@10 and MAP is compute to measure the quality of the ranking.
    :param samples: Must be a list and each element is of the form:
        - {'query': '', 'positive': [], 'negative': []}. Query is the search query, positive is a list of positive
          (relevant) documents, negative is a list of negative (irrelevant) documents.
        - {'query': [], 'positive': [], 'negative': []}. Where query is a list of strings, which embeddings we average
          to get the query embedding.
    """

    def __init__(
        self,
        samples,
        mrr_at_k: int = 10,
        name: str = "",
        similarity_fct=cos_sim,
        batch_size: int = 512,
        use_batched_encoding: bool = True,
        limit: int = None,
        **kwargs,
    ):
Given a query and a list of documents, the model scores every document, sorts them in descending order, and computes MRR@10 and MAP to measure ranking quality.
The __init__ method takes the following parameters:
- samples: must be a list where each element takes one of two forms:
  - {'query': '', 'positive': [], 'negative': []}: query is the search query, positive is a list of relevant documents, and negative is a list of irrelevant documents.
  - {'query': [], 'positive': [], 'negative': []}: query is a list of strings whose embeddings are averaged to form the query embedding.
- mrr_at_k: defaults to 10, the number of top results considered when computing MRR.
- name: defaults to an empty string, the evaluator's name.
- similarity_fct: defaults to cos_sim, the function used to compute similarity.
Scores are computed in compute_metrics_batched, again as cosine similarity between embeddings, so this effectively measures the ranking ability of the embedding model itself. To measure a cross-encoder's ranking ability, the default code does not apply and needs to be customized.
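As a sketch of the customization meant here, one could score each (query, doc) pair with a cross-encoder instead of embedding cosine similarity. This assumes a CrossEncoder-style model exposing a predict method, as in sentence_transformers:

from sentence_transformers import CrossEncoder

def rerank_with_cross_encoder(model_name, query, docs):
    # Score every (query, doc) pair directly, then sort in descending order.
    cross_encoder = CrossEncoder(model_name)
    pairs = [(query, doc) for doc in docs]
    scores = cross_encoder.predict(pairs)
    return sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)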
Evaluation in Practice
With all that background, let's get to the actual goals:
- Evaluate the in-house model's retrieval performance: requires a custom model
- Compare open-source models with the in-house model on a custom test set: requires a custom evaluation task
Evaluating the in-house model's retrieval performance
We first evaluate the model's retrieval quality. The trained model was exported to ONNX, so inference runs through onnxruntime. Start by defining the custom model:
from mteb import MTEB
import onnxruntime as ort
from paddlenlp.transformers import AutoTokenizer
import math
from tqdm import tqdm

# Model paths
model_path = "onnx/fp16_model.onnx"
tokenizer_path = "model_520000"

class MyModel():
    def __init__(self, use_gpu=True):
        providers = ['CUDAExecutionProvider'] if use_gpu else ['CPUExecutionProvider']
        sess_options = ort.SessionOptions()
        self.predictor = ort.InferenceSession(
            model_path, sess_options=sess_options, providers=providers)
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def encode(self, sentences, batch_size=64, **kwargs):
        all_embeddings = []
        # Round up to get the number of batches
        batch_count = math.ceil(len(sentences) / batch_size)
        for i in tqdm(range(batch_count)):
            # Slice out the current batch
            sub_sentences = sentences[i * batch_size : min(len(sentences), (i + 1) * batch_size)]
            features = self.tokenizer(sub_sentences, max_seq_len=128,
                                      pad_to_max_seq_len=True, truncation_strategy="longest_first")
            vecs = self.predictor.run(None, features.data)
            all_embeddings.extend(vecs[0])
        return all_embeddings
Because the sentences passed in cover the entire dataset, we split them into batches of batch_size, compute embeddings batch by batch, accumulate them in all_embeddings, and return the full list.
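A quick sanity check of the custom model (hypothetical inputs; the number of embeddings must match the number of sentences):

model = MyModel(use_gpu=False)
vecs = model.encode(["你好世界", "hello world"], batch_size=2)
assert len(vecs) == 2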
Custom retrieval evaluation task
As noted in the source-code analysis above, a custom retrieval task must supply the queries, the docs, and each query's relevant docs. Suppose our custom test set is in JSON Lines format, one object per line containing a query and its candidate docs (the answer field carries relevance labels such as "完全相關", i.e. fully relevant):
{
"query": "《1984》是什麼",
"data": [
{
"title": "《1984》介紹-知乎",
"summary": "《1984》是僞裝成小說的政治思想...",
"url": "",
"id": 5031622209044687985,
"answer": "完全相關",
"accuracy": "無錯",
"result": "good"
}
]
}
We can then write the custom retrieval task:
import json

from datasets import DatasetDict
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval

class SSRetrieval(AbsTaskRetrieval):
    @property
    def description(self):
        return {
            'name': 'SSRetrieval',
            'description': 'SSRetrieval is a retrieval test set built by the S R&D department test team',
            'type': 'Retrieval',
            'category': 's2p',
            'json_path': '/data/xapian-core-1.4.24/demo/result.json',
            'eval_splits': ['dev'],
            'eval_langs': ['zh'],
            'main_score': 'recall_at_10',
        }

    def load_data(self, **kwargs):
        if self.data_loaded:
            return
        self.corpus = {}  # doc_id => doc
        self.queries = {}  # qid => query
        self.relevant_docs = {}  # qid => Set[doc_id]

        query_index = 1
        with open(self.description['json_path'], 'r', encoding='utf-8') as f:
            for line in f:
                if "完全相關" not in line:
                    continue
                line = json.loads(line)
                query = line['query']
                query_id = str(query_index)
                self.queries[query_id] = query
                query_index = query_index + 1

                for doc in line['data']:
                    doc_id = str(doc['id'])
                    self.corpus[doc_id] = {"title": doc["title"], "text": doc["summary"]}
                    if doc['answer'] == "完全相關":
                        if query_id not in self.relevant_docs:
                            self.relevant_docs[query_id] = {}
                        self.relevant_docs[query_id][doc_id] = 1

                # For debugging only:
                # if query_index == 100:
                #     break

        self.queries = DatasetDict({"dev": self.queries})
        self.corpus = DatasetDict({"dev": self.corpus})
        self.relevant_docs = DatasetDict({"dev": self.relevant_docs})
        self.data_loaded = True
Finally, evaluate the custom task with the custom model:
if __name__ == '__main__':
    model = MyModel()
    # task_names = [t.description["name"] for t in MTEB(task_types='Retrieval',
    #                                                   task_langs=['zh', 'zh-CN']).tasks]
    tasks = [SSRetrieval()]  # Custom tasks are passed as instances, as in the reranking example above
    for task in tasks:
        model.query_instruction_for_retrieval = None
        evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])
        evaluation.run(model, output_folder="zh_results/256_model", batch_size=64)
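After the run, MTEB writes one JSON result file per task into the output folder. A quick way to inspect the scores (the file name below assumes MTEB's task-name convention):

import json

with open("zh_results/256_model/SSRetrieval.json") as f:
    print(json.load(f))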
Summary
As a benchmark for embedding retrieval quality, MTEB is an authoritative leaderboard, and the framework itself is highly extensible, making it easy for developers to plug in custom models and custom evaluation tasks.