A Complete Guide to Evaluating Models on Custom Datasets in LitGPT

2025-05-19 20:50:59 Author: 裘旻烁

Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

Project repository: https://gitcode.com/gh_mirrors/li/lit-gpt

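In a machine learning project, evaluation is how you verify that a model actually performs as intended. This article walks through evaluating a model on a custom dataset with LitGPT, from generating predictions to scoring them.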

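Evaluation workflow overview

The complete evaluation workflow consists of three main steps:

Load the trained model
Generate predictions on the test set
Score the predictions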

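Detailed steps

1. Loading the model and generating predictions

First, load the trained model and generate a prediction for each example in the test set: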

from litgpt import LLM
from tqdm import tqdm

# Load the trained model
llm = LLM.load("path/to/your/model")

# Generate a prediction for every test example and store it on the record
for entry in tqdm(test_data):
    entry["response"] = llm.generate(format_input(entry))

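This code iterates over every example in the test set, generates a prediction with the model, and stores it under the `response` key of the same record. It assumes that `test_data` is a list of dictionaries and that `format_input` is a helper that turns a record into a prompt; neither is defined in this snippet.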
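The snippet above relies on two names it does not define, `test_data` and `format_input`. As a minimal sketch, assuming Alpaca-style records with `instruction`, `input`, and `output` fields (an assumption; only the `output` key appears later in this guide), they might look like the following. The prompt template is illustrative and should match whatever format the model was fine-tuned on.

```python
# Hypothetical test records; the "instruction" and "input" field names are assumptions.
test_data = [
    {
        "instruction": "Rewrite the sentence in passive voice.",
        "input": "The chef cooked the meal.",
        "output": "The meal was cooked by the chef.",
    },
    {
        "instruction": "Name the capital of France.",
        "input": "",
        "output": "The capital of France is Paris.",
    },
]


def format_input(entry):
    """Build an Alpaca-style prompt from one record (illustrative template)."""
    instruction_text = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # Only add the input section when the record actually carries an input.
    input_text = f"\n\n### Input:\n{entry['input']}" if entry.get("input") else ""
    return instruction_text + input_text
```

With these definitions in place, `format_input(test_data[0])` returns the prompt string that is passed to `llm.generate` in the loop above.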

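2. Saving the results

After generating the predictions, save them to a JSON file so they can be analyzed later without rerunning the model: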

import json

with open("test_with_response.json", "w") as json_file:
    json.dump(test_data, json_file, indent=4)
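The scoring step reads the predictions from a variable named `json_data`, which this guide never shows being populated. A minimal sketch of the full round trip, assuming the same file name as above, could look like this. The sample record and the temporary directory are illustrative only, and `ensure_ascii=False` is an optional choice that keeps non-ASCII text readable in the file.

```python
import json
import tempfile
from pathlib import Path

# Illustrative record standing in for the real test set with model responses.
test_data = [
    {"instruction": "Say hi.", "input": "", "output": "Hi!", "response": "Hello!"}
]

# A temporary directory keeps this sketch from touching real files.
out_path = Path(tempfile.mkdtemp()) / "test_with_response.json"

# Write the predictions; ensure_ascii=False leaves non-ASCII characters unescaped.
with open(out_path, "w", encoding="utf-8") as json_file:
    json.dump(test_data, json_file, indent=4, ensure_ascii=False)

# Read the predictions back; this is the object the scoring step expects.
with open(out_path, "r", encoding="utf-8") as json_file:
    json_data = json.load(json_file)
```

Reloading from disk also means the scoring step can run in a separate session, after the evaluated model has been released from memory.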

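3. Evaluating model performance

For the evaluation itself, you can use another LLM (such as Llama 3) as a judge that scores each prediction. In the code below, `llm` refers to that judge model: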

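def generate_model_scores(json_data, json_key):
    # NOTE: `llm` must refer to the judge model (e.g. Llama 3) here,
    # not the model under evaluation; load it with LLM.load() first.
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and the correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}` "
            "on a scale from 0 to 100, where 100 is the best score. "
            "Respond with the integer number only."
        )
        score = llm.generate(prompt, max_new_tokens=50)
        try:
            scores.append(int(score))
        except ValueError:
            # Skip responses that are not a bare integer
            continue
    return scores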

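# Load the predictions saved in step 2
with open("test_with_response.json", "r") as json_file:
    json_data = json.load(json_file)

scores = generate_model_scores(json_data, "response")
print("\nEvaluation results")
print(f"Valid scores: {len(scores)} of {len(json_data)}")
if scores:  # avoid dividing by zero when no response could be parsed
    print(f"Average score: {sum(scores)/len(scores):.2f}\n")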
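Judge models often wrap the number in extra text (for example `Score: 85`), in which case `int(score)` raises `ValueError` and the entry is silently dropped from the average. A hedged sketch of a more tolerant parser follows. `parse_score` and `summarize` are hypothetical helper names, not part of LitGPT, and rejecting values outside 0-100 is an assumption about how out-of-range answers should be treated.

```python
import re
from statistics import mean


def parse_score(text):
    """Return the first integer found in `text` if it lies in 0-100, else None."""
    match = re.search(r"\d+", text)
    if match is None:
        return None
    value = int(match.group())
    return value if 0 <= value <= 100 else None


def summarize(raw_responses):
    """Parse judge outputs and report how many were usable plus their mean."""
    scores = [s for s in map(parse_score, raw_responses) if s is not None]
    average = mean(scores) if scores else None  # no division by zero on empty input
    return {"valid": len(scores), "total": len(raw_responses), "average": average}


# Made-up judge outputs, for illustration only.
print(summarize(["85", "Score: 70", "n/a", " 100\n"]))
```

Taking the first integer is a simple heuristic; it will misread answers such as "8 out of 10", so it is worth spot-checking a few raw judge responses before trusting the average.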