SimpleRL-reason Open-Source Project Tutorial

2026-01-30 05:02:36 Author: 瞿蔚英Wynne

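Project Overview

SimpleRL-reason is an open-source project based on reinforcement learning (RL), aimed specifically at improving the performance of large language models on mathematical reasoning tasks. It uses a simple yet effective RL recipe: with only a rule-based reward and the PPO (Proximal Policy Optimization) algorithm, it significantly improves a model's reasoning ability on complex math problems.

Core Highlights

Minimal design: no supervised fine-tuning (SFT) and no reward model, using only 8K math examples
Striking results: on a 7B model, performance comparable to baselines that use more than 50 times as much data and more complex components
Efficient training: built on the OpenRLHF framework, with support for distributed training and vLLM acceleration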

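Environment Setup

Hardware Requirements

Configuration	GPUs	GPU type	Memory requirement (total GPU memory)	Training time
Minimum	6	A100-80G	480GB	Not tested
Recommended	32	A100-80G	2.56TB	About 1.5 days
Single node	8	A100-80G	640GB	About 2-3 days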
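
The memory figures in the table match the GPU count multiplied by 80 GB per A100-80G, so they appear to describe aggregate GPU memory rather than host RAM. A quick check of that reading (the function name is illustrative, not part of the project):

```python
def total_gpu_memory_gb(num_gpus: int, per_gpu_gb: int = 80) -> int:
    """Aggregate GPU memory for a homogeneous set of GPUs."""
    return num_gpus * per_gpu_gb


# GPU counts taken from the hardware table
for name, gpus in [("minimum", 6), ("recommended", 32), ("single node", 8)]:
    print(f"{name}: {gpus} x 80 GB = {total_gpu_memory_gb(gpus)} GB")
```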

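Software Dependencies

First, install the base dependencies:

# Clone the project
git clone https://gitcode.com/gh_mirrors/si/simpleRL-reason.git
cd simpleRL-reason/train

# Install OpenRLHF
pip install -e .

# Install vLLM acceleration (optional); quoted so shells such as zsh do not expand the brackets
pip install "openrlhf[vllm]"

# Install the math evaluation dependencies
cd ../eval
pip install -r requirements.txt
pip install vllm==0.5.1 --no-build-isolation
pip install transformers==4.42.3

# Install the LaTeX-to-SymPy converter
cd latex2sympy
pip install -e .
cd ..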

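Training Workflow in Detail

1. Ray Cluster Deployment

SimpleRL-reason uses Ray for distributed training, so the first step is to start a Ray cluster:

# Start Ray on the head node
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

# Join the cluster from each of the other nodes
ray start --address {MASTER-NODE-ADDRESS}:6379 --num-gpus 8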
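
Before submitting a job, it can help to confirm that every node has joined and that the cluster exposes the expected number of GPUs. The sketch below uses Ray's `ray.init(address="auto")` and `ray.cluster_resources()`; the helper names `gpus_missing` and `check_cluster` are illustrative and not part of the project.

```python
def gpus_missing(resources: dict, required_gpus: int) -> int:
    """Return how many GPUs are still missing (0 when the cluster is large enough)."""
    available = int(resources.get("GPU", 0))
    return max(required_gpus - available, 0)


def check_cluster(required_gpus: int) -> int:
    """Attach to the running Ray cluster and report how many GPUs are missing."""
    import ray  # imported lazily so gpus_missing() works without Ray installed

    ray.init(address="auto")  # connect to the cluster started with `ray start`
    try:
        return gpus_missing(ray.cluster_resources(), required_gpus)
    finally:
        ray.shutdown()
```

On the head node, `check_cluster(8)` should return 0 for the single-node setup once Ray is running with `--num-gpus 8`.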

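2. Training Configuration

The project provides two training scripts; only the single-node one is reproduced here.

Single-node training script (train_ppo_qwen_base_math_lv35_1_node.sh):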

#!/bin/bash
HDFS_HOME=TO_BE_DEFINED
RUN_NAME=Qwen2.5-Math-7B_ppo_from_base_math_lv35

python3 openrlhf/cli/train_ppo_ray_box.py \
    --ref_num_nodes 1 \
    --ref_num_gpus_per_node 2 \
    --reward_num_nodes 0 \
    --reward_num_gpus_per_node 0 \
    --critic_num_nodes 1 \
    --critic_num_gpus_per_node 2 \
    --actor_num_nodes 1 \
    --actor_num_gpus_per_node 2 \
    --vllm_num_engines 2 \
    --vllm_tensor_parallel_size 1 \
    --colocate_actor_ref \
    --pretrain $HDFS_HOME/model_hub/models--Qwen--Qwen2.5-Math-7B/snapshots/b101308fe89651ea5ce025f25317fea6fc07e96e \
    --save_path $HDFS_HOME/checkpoints/$RUN_NAME \
    --micro_train_batch_size 2 \
    --train_batch_size 128 \
    --micro_rollout_batch_size 2 \
    --rollout_batch_size 1024 \
    --temperature 0.6 \
    --n_samples_per_prompt 8 \
    --max_samples 100000 \
    --max_epochs 1 \
    --num_episodes 20 \
    --prompt_max_len 1024 \
    --generate_max_len 3000 \
    --zero_stage 3 \
    --bf16 \
    --actor_learning_rate 5e-7 \
    --critic_learning_rate 9e-6 \
    --init_kl_coef 0.01 \
    --prompt_data data/math_level3to5_data_processed_with_qwen_prompt.json \
    --input_key input \
    --normalize_reward \
    --flash_attn \
    --adam_offload \
    --gradient_checkpointing \
    --save_steps 4 \
    --load_checkpoint \
    --use_wandb YOUR_WANDB_KEY \
    --wandb_run_name $RUN_NAME \
    --ckpt_path $HDFS_HOME/checkpoints/$RUN_NAME \
    --max_ckpt_num 20000
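
As a rough sanity check on the GPU flags in this script, the sketch below adds up the per-role GPU counts. It assumes that `--colocate_actor_ref` places the actor and the reference model on the same GPUs, as upstream OpenRLHF does; this has not been verified against train_ppo_ray_box.py, and the function name is illustrative.

```python
def gpu_budget(actor: int, ref: int, critic: int, reward: int,
               vllm_engines: int, vllm_tp: int, colocate_actor_ref: bool) -> int:
    """Estimate the number of GPUs a launch configuration occupies."""
    # Assumption: colocated actor and reference models share one set of GPUs
    actor_ref = max(actor, ref) if colocate_actor_ref else actor + ref
    return actor_ref + critic + reward + vllm_engines * vllm_tp


# Values from the single-node script: 2 GPUs each for actor, ref and critic,
# no reward model, and 2 vLLM engines with tensor parallel size 1
print(gpu_budget(2, 2, 2, 0, 2, 1, colocate_actor_ref=True))
```

Under that assumption the script occupies 6 GPUs, which is consistent with the 6-GPU minimum listed in the hardware table; without colocation the same flags would need 8.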

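3. Key Parameters

Parameter	Description	Recommended value
`--pretrain`	Path to the base model	Qwen2.5-Math-7B
`--micro_train_batch_size`	Micro-batch size	2
`--train_batch_size`	Training batch size	128
`--n_samples_per_prompt`	Number of samples generated per prompt	8
`--actor_learning_rate`	Actor learning rate	5e-7
`--critic_learning_rate`	Critic learning rate	9e-6
`--temperature`	Sampling temperature	0.6
`--init_kl_coef`	KL divergence coefficient	0.01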
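
To see how the batch-size settings relate to one another, the arithmetic below uses the values from the single-node script. It assumes that `--rollout_batch_size` counts prompts, that each prompt yields `--n_samples_per_prompt` responses, and that the resulting samples are consumed once in batches of `--train_batch_size`; these are assumptions about the trainer rather than statements from the project documentation.

```python
def rollout_arithmetic(rollout_batch_size: int, n_samples_per_prompt: int,
                       train_batch_size: int) -> tuple:
    """Return (responses per rollout, optimizer steps per rollout) under the stated assumptions."""
    samples = rollout_batch_size * n_samples_per_prompt
    updates = samples // train_batch_size
    return samples, updates


# 1024 prompts x 8 samples = 8192 responses, consumed in 64 batches of 128
print(rollout_arithmetic(1024, 8, 128))  # (8192, 64)
```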

4. Data Format

The training data is in JSON format and contains the math problem and its reference answer:

{
  "input": "<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\nLet $a$ and $b$ be the two real values of $x$ for which\\[\\sqrt[3]{x} + \\sqrt[3]{20 - x} = 2\\]The smaller of the two values can be expressed as $p - \\sqrt{q}$, where $p$ and $q$ are integers. Compute $p + q$.<|im_end|>\n<|im_start|>assistant",
  "answer": "118",
  "gt_answer": "118",
  "subject": "Intermediate Algebra",
  "level": 5,
  "question": "Let $a$ and $b$ be the two real values of $x$ for which\\[\\sqrt[3]{x} + \\sqrt[3]{20 - x} = 2\\]The smaller of the two values can be expressed as $p - \\sqrt{q}$, where $p$ and $q$ are integers. Compute $p + q$.",
  "ground_truth_answer": "118",
  "target": "118"
}
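
A small validator can catch malformed records before a long training run. The sketch below is only an illustration: the `input` key is confirmed by `--input_key input` in the training script, but requiring the answer fields to agree with each other is an assumption drawn from the sample record, and the record in the code is a toy example.

```python
import json

# Answer-like fields seen in the sample record
ANSWER_KEYS = ("answer", "gt_answer", "ground_truth_answer", "target")


def check_record(record: dict) -> list:
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    prompt = record.get("input")
    if not isinstance(prompt, str):
        problems.append("missing or non-string 'input'")
    elif not prompt.endswith("<|im_start|>assistant"):
        problems.append("'input' does not end with the assistant turn marker")
    answers = {str(record[k]) for k in ANSWER_KEYS if k in record}
    if not answers:
        problems.append("no answer field present")
    elif len(answers) > 1:
        problems.append(f"answer fields disagree: {sorted(answers)}")
    return problems


# Toy record in the same shape as the sample above
sample = json.loads(r'''
{
  "input": "<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\nCompute $1 + 1$.<|im_end|>\n<|im_start|>assistant",
  "answer": "2",
  "gt_answer": "2",
  "target": "2"
}
''')
print(check_record(sample))  # []
```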

Evaluation Workflow

1. Evaluation Datasets

The project supports several math evaluation datasets:

graph TD
    A[Math evaluation datasets] --> B[AIME 2024]
    A --> C[MATH 500]
    A --> D[AMC]
    A --> E[Minerva Math]
    A --> F[OlympiadBench]
    A --> G[GSM8K]
    A --> H[Other datasets]

2. Evaluation Command

Run the evaluation with the following commands:

# Set the evaluation parameters
PROMPT_TYPE="qwen25-math-cot"
export CUDA_VISIBLE_DEVICES="0"
MODEL_NAME_OR_PATH="Qwen/Qwen2.5-Math-7B-Instruct"
OUTPUT_DIR="Qwen2.5-Math-7B-Instruct-Math-Eval"

# Run the evaluation
bash sh/eval.sh $PROMPT_TYPE $MODEL_NAME_OR_PATH $OUTPUT_DIR

3. Evaluation Metrics

The project uses pass@1 accuracy as its main evaluation metric:

Dataset	Qwen2.5-Math-7B-Base	Qwen2.5-7B-SimpleRL-Zero	Improvement (percentage points)
AIME 2024	16.7%	33.3%	+16.6
MATH 500	52.4%	77.2%	+24.8
AMC	52.5%	62.5%	+10.0
Minerva Math	12.9%	33.5%	+20.6
OlympiadBench	16.4%	37.6%	+21.2
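
With a single answer per problem, pass@1 reduces to the share of problems answered correctly. A minimal sketch (the function name and inputs are illustrative): 10 correct answers out of the 30 AIME 2024 problems would give the 33.3% shown in the table.

```python
def pass_at_1(correct_flags) -> float:
    """Percentage of problems whose single answer was judged correct."""
    flags = list(correct_flags)
    if not flags:
        raise ValueError("no problems to score")
    return 100.0 * sum(flags) / len(flags)


# 10 correct out of 30 problems
print(round(pass_at_1([True] * 10 + [False] * 20), 1))  # 33.3
```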

Hands-On Example: Training from Scratch

Step 1: Data Preparation

Prepare the 8K MATH dataset in the following format:

# Data preprocessing example
def preprocess_math_data(question, answer):
    # Literal braces are doubled ({{}}) because the prompt is an f-string
    return {
        "input": f"<|im_start|>system\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant",
        "answer": str(answer),
        "gt_answer": str(answer),
        "target": str(answer)
    }

Step 2: Model Configuration

Choose a suitable base model and training parameters:

# Example training configuration
model: Qwen2.5-Math-7B
learning_rate: 5e-7
batch_size: 128
num_episodes: 20
max_length: 3000
temperature: 0.6

Step 3: Run Training

Submit the training job:

ray job submit --address="http://127.0.0.1:8265" \
    --runtime-env-json='{
        "pip": ["ray==2.12.0", "latex2sympy2", "timeout_decorator"]
    }' -- /bin/bash examples/script/train_ppo_qwen_base_math_lv35_1_node.sh

Step 4: Monitor Training

Use WandB to monitor the training process:

# Metrics to monitor
- Reward
- KL divergence
- Loss
- Quality of generated samples

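Performance Optimization Tips

1. Memory Optimization

# Enable gradient checkpointing
--gradient_checkpointing

# Use BF16 precision
--bf16

# Offload the Adam optimizer state to the CPU
--adam_offload

# Use ZeRO Stage 3
--zero_stage 3

2. Training Speedups

# Use vLLM to accelerate generation
--vllm_num_engines 2
--vllm_tensor_parallel_size 1

# Enable Flash Attention
--flash_attn

# Pack samples
--packing_samples

3. Stability Improvements

# Normalize rewards
--normalize_reward

# Use a suitable KL coefficient
--init_kl_coef 0.01

# Tune the sampling temperature
--temperature 0.6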

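Common Problems and Solutions

1. Out-of-Memory Errors

Symptom: OOM (out of memory) errors occur during training

Solutions:

Reduce --micro_train_batch_size
Enable --gradient_checkpointing
Use --adam_offload
Lower --generate_max_len

2. Unstable Training

Symptom: the reward fluctuates widely and model performance declines

Solutions:

Adjust --init_kl_coef (0.01-0.1 suggested)
Lower the learning rate
Increase --num_episodes

3. Evaluation Failures

Symptom: LaTeX parsing errors occur during evaluation

Solutions:

Make sure latex2sympy2 is installed
Check the format of the evaluation data
Verify the model's output format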

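Advanced Usage

1. Custom Reward Functions

You can implement your own rule-based reward function:

def count_reasoning_steps(response):
    # Placeholder heuristic (an assumption): one step per non-empty line
    return sum(1 for line in response.splitlines() if line.strip())

def custom_reward_function(response, ground_truth):
    # Base reward for answer correctness. This compares the whole string,
    # so pass in the final answer already extracted from the model output.
    if response == ground_truth:
        base_reward = 1.0
    else:
        base_reward = -1.0

    # Extra reward for reasoning steps, capped at 0.5
    reasoning_steps = count_reasoning_steps(response)
    step_reward = min(reasoning_steps * 0.1, 0.5)

    return base_reward + step_reward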
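
Because the prompt asks the model to put its final answer within \boxed{}, a rule-based reward normally compares only that final answer with the reference. The sketch below shows one way to pull out the last boxed expression with simple brace matching. The function names are illustrative, the project's own reward code may differ, and plain string equality is stricter than a symbolic comparison.

```python
def extract_boxed_answer(response: str):
    """Return the content of the last \\boxed{...} in the response, or None."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 0  # nesting depth of braces inside the boxed content
    for i in range(content_start, len(response)):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return response[content_start:i].strip()
            depth -= 1
    return None  # unbalanced braces


def correctness_reward(response: str, ground_truth: str) -> float:
    """+1.0 if the boxed answer equals the reference string exactly, else -1.0."""
    answer = extract_boxed_answer(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else -1.0


print(correctness_reward("So $p + q = 118$. The answer is \\boxed{118}.", "118"))  # 1.0
```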

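2. Multi-Task Training

Several math domains can be trained on at the same time:

# Multi-task data mixture
datasets = {
    "algebra": "data/algebra.json",
    "geometry": "data/geometry.json",
    "number_theory": "data/number_theory.json"
}

# Mix the datasets in the given proportions during training
--prompt_data_probs 0.4,0.3,0.3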
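
The effect of proportional mixing can be illustrated with weighted sampling. The sketch below is only an illustration of the idea; how the training code actually applies `--prompt_data_probs` is not shown in this tutorial, and the function name is hypothetical.

```python
import random
from collections import Counter


def sample_sources(names, probs, n, seed=0):
    """Draw n dataset names, each chosen with its mixing probability."""
    if len(names) != len(probs):
        raise ValueError("names and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.choices(names, weights=probs, k=n)


draws = sample_sources(["algebra", "geometry", "number_theory"], [0.4, 0.3, 0.3], 10_000)
counts = Counter(draws)
print({name: counts[name] / len(draws) for name in counts})
```

With enough draws, the observed shares approach the requested 0.4, 0.3 and 0.3.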