BEIR v2.1.0 Released: Evaluation Support for the Latest Embedding Models

2025-06-26 09:35:12 Author: 温玫谨Lighthearted

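Project Overview

BEIR is an open-source library for evaluating information retrieval systems. It gives researchers and developers a standardized evaluation framework and a set of datasets. BEIR supports evaluating several kinds of retrieval models, including dense, sparse, and hybrid retrieval. Its unified evaluation interface makes it much simpler to compare different retrieval models on the same datasets.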

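Core Updates in This Release

1. Support for evaluating the latest embedding models

The most notable improvement in BEIR v2.1.0 is evaluation support for current state-of-the-art embedding models: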

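HuggingFace model support: the new models.HuggingFace module makes it easy to evaluate E5-series models, PEFT models fine-tuned with Tevatron (such as RepLLAMA), and any custom embedding model hosted on HuggingFace. The module supports three pooling techniques: mean pooling (mean), CLS pooling, and EOS pooling.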

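SentenceTransformer enhancements: the updated models.SentenceTransformer module now supports recent features such as prompts and prompt names (prompt_names), which makes it possible to evaluate newer embedding models, including LLM-based decoder models, such as Stella and modernBERT-gte-base. In addition, all sentence-transformer models can now be evaluated in multi-GPU environments.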

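Dedicated NVEmbed support: the new models.NVEmbed module is designed specifically for evaluating NVIDIA's NV-Embed-v2 model, although it currently requires a specific version of the transformers library.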

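LLM2Vec integration: the new models.LLM2Vec module supports evaluating the LLM2Vec family of embedding models developed by the McGill-NLP team, which adapt decoder-only LLMs to use bidirectional attention.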
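
To make the three pooling modes listed for models.HuggingFace more concrete, here is a minimal pure-Python sketch of what mean, CLS, and EOS pooling typically compute over per-token embeddings. It is an illustration only, not BEIR's implementation: the function names, the `POOLERS` mapping, and the toy vectors are hypothetical, and it assumes right-padded sequences with a 0/1 attention mask.

```python
from typing import Dict, List

Vector = List[float]


def mean_pool(token_embs: List[Vector], mask: List[int]) -> Vector:
    """Average the embeddings of all non-padding tokens."""
    kept = [emb for emb, m in zip(token_embs, mask) if m == 1]
    dim = len(token_embs[0])
    return [sum(emb[d] for emb in kept) / len(kept) for d in range(dim)]


def cls_pool(token_embs: List[Vector], mask: List[int]) -> Vector:
    """Use the first token's embedding (the [CLS] position)."""
    return list(token_embs[0])


def eos_pool(token_embs: List[Vector], mask: List[int]) -> Vector:
    """Use the last non-padding token's embedding (the EOS position when right-padded)."""
    last = max(i for i, m in enumerate(mask) if m == 1)
    return list(token_embs[last])


# Hypothetical name-to-function mapping, mirroring the "mean" / "cls" / "eos" options.
POOLERS = {"mean": mean_pool, "cls": cls_pool, "eos": eos_pool}

# Toy sequence: three real tokens followed by one padding token.
token_embs = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

pooled: Dict[str, Vector] = {name: fn(token_embs, mask) for name, fn in POOLERS.items()}
for name, vec in pooled.items():
    print(name, vec)
```

Decoder-style models are usually paired with EOS pooling because, under causal attention, only the final token has attended to the whole input; encoder models more commonly use mean or CLS pooling.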

2. Evaluation utility enhancements

The new release introduces two practical utility functions:

util.save_runfile() saves the retrieval results as a run file in the standard TREC format, which is useful for later re-ranking analysis.

util.save_results() saves the evaluation metrics (including nDCG, MAP, Recall, and Precision) in JSON format, which makes later analysis and comparison easier.
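
For readers unfamiliar with these output formats, the sketch below shows roughly what such files look like, using only the Python standard library. It is not the code behind util.save_runfile() or util.save_results(): the helper names `write_trec_run` and `write_metrics_json`, the run tag, and all values are hypothetical toy inputs. A TREC run file has one line per retrieved document in the form `qid Q0 docid rank score tag`.

```python
import json
import os
import tempfile
from typing import Dict


def write_trec_run(path: str, results: Dict[str, Dict[str, float]], run_tag: str = "beir") -> None:
    """Write {query_id: {doc_id: score}} as TREC run lines, best score first."""
    with open(path, "w", encoding="utf-8") as f:
        for qid, doc_scores in results.items():
            ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                f.write(f"{qid} Q0 {doc_id} {rank} {score} {run_tag}\n")


def write_metrics_json(path: str, **metric_groups: Dict[str, float]) -> None:
    """Dump named metric dictionaries into a single JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metric_groups, f, indent=2)


# Toy inputs (made-up values, for illustration only).
results = {"q1": {"d1": 0.2, "d2": 0.9}, "q2": {"d3": 0.5}}
ndcg = {"NDCG@10": 0.61}
recall = {"Recall@10": 0.75}

out_dir = tempfile.mkdtemp()
run_path = os.path.join(out_dir, "toy.run.trec")
json_path = os.path.join(out_dir, "toy.json")
write_trec_run(run_path, results)
write_metrics_json(json_path, ndcg=ndcg, recall=recall)

with open(run_path, encoding="utf-8") as f:
    run_lines = f.read().splitlines()
with open(json_path, encoding="utf-8") as f:
    saved_metrics = json.load(f)

print(run_lines)
print(saved_metrics)
```

Keeping the run file separate from the metrics file means a re-ranker can consume the ranked lists later without repeating first-stage retrieval.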

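3. Tech stack upgrade

The project raises its minimum Python version from 3.6 to 3.9+, adopts the more modern code formatting tool ruff, and restructures the project so that it is managed through pyproject.toml. These changes standardize project maintenance and are intended to improve code quality.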

Technical Details

Model evaluation example

Taking the E5-Mistral-7B model as an example, developers can configure the evaluation as follows:

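from beir.retrieval import models

# Task instruction used as the query prompt; passages get an empty prompt.
query_prompt = "Given a query on respiratory diseases, retrieve documents that answer the query"
passage_prompt = ""
dense_model = models.HuggingFace(
    model="intfloat/e5-mistral-7b-instruct",
    max_length=512,
    append_eos_token=True,  # append the EOS token to every input
    pooling="eos",  # embed each text with its EOS-token representation
    normalize=True,  # normalize the output embeddings
    prompts={"query": query_prompt, "passage": passage_prompt},
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    torch_dtype="bfloat16",
)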

Evaluating a PEFT model (such as RepLLAMA) is configured in much the same way:

from beir.retrieval import models

dense_model = models.HuggingFace(
    model="meta-llama/Llama-2-7b-hf",  # base model
    peft_model_path="castorini/repllama-v1-7b-lora-passage",  # LoRA adapter applied on top of the base model
    max_length=512,
    append_eos_token=True,
    pooling="eos",
    normalize=True,
    prompts={"query": "query: ", "passage": "passage: "},
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
)
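
Both configurations above set normalize=True. Assuming this means L2-normalizing each output embedding (a common convention, not confirmed by the release notes), the practical effect is that a plain dot product between a query and a passage embedding equals their cosine similarity. The sketch below checks this on toy vectors; the helper names are hypothetical.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy query and passage embeddings.
query = [3.0, 4.0]
passage = [4.0, 3.0]

dot_after_norm = dot(l2_normalize(query), l2_normalize(passage))
cos = cosine(query, passage)
print(dot_after_norm, cos)
```

With normalized embeddings, dot-product and cosine scoring produce the same ranking, so the choice of scoring function no longer affects the results.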

Saving evaluation results

The new release simplifies saving evaluation results:

import os
import pathlib

from beir import util

# retriever, qrels, results, and dataset are assumed to come from the earlier retrieval step.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
mrr = retriever.evaluate_custom(qrels, results, retriever.k_values, metric="mrr")

# Write all outputs to a "results" directory next to this script.
results_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "results")
os.makedirs(results_dir, exist_ok=True)

util.save_runfile(os.path.join(results_dir, f"{dataset}.run.trec"), results)
util.save_results(os.path.join(results_dir, f"{dataset}.json"), ndcg, _map, recall, precision, mrr)
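
The snippet above requests MRR through metric="mrr". As a reminder of what that metric measures, here is a small self-contained sketch of mean reciprocal rank at several cutoffs. It follows the textbook definition and is not taken from BEIR's source, so details such as tie handling may differ; `mrr_at_k` and the toy data are hypothetical.

```python
from typing import Dict, List


def mrr_at_k(
    qrels: Dict[str, Dict[str, int]],
    results: Dict[str, Dict[str, float]],
    k_values: List[int],
) -> Dict[str, float]:
    """Average over queries of 1 / rank of the first relevant document within the top k."""
    totals = {k: 0.0 for k in k_values}
    for qid, doc_scores in results.items():
        # Rank documents by descending retrieval score.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
        relevant = {d for d, rel in qrels.get(qid, {}).items() if rel > 0}
        for k in k_values:
            for rank, doc_id in enumerate(ranked[:k], start=1):
                if doc_id in relevant:
                    totals[k] += 1.0 / rank
                    break
    n = max(len(results), 1)  # avoid division by zero on empty input
    return {f"MRR@{k}": totals[k] / n for k in k_values}


# Toy relevance judgments and retrieval scores.
qrels = {"q1": {"d1": 1}, "q2": {"d4": 1}}
results = {
    "q1": {"d1": 0.3, "d2": 0.9},             # relevant d1 is ranked 2nd
    "q2": {"d3": 0.8, "d4": 0.1, "d5": 0.5},  # relevant d4 is ranked 3rd
}

mrr = mrr_at_k(qrels, results, [1, 2, 3])
print(mrr)
```

Unlike nDCG or Recall, MRR only rewards the position of the first relevant hit, so it is a useful complement when each query has a single expected answer.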