Configuring GRPO Training on NVIDIA RTX 4090 GPUs in the Swift Project

2025-05-31 09:43:32 Author: 滑思眉Philip

Swift is the LLM training/inference framework of the ModelScope community. It supports models such as LLaMA, Qwen, ChatGLM, and Baichuan, and training methods such as LoRA, ResTuning, and NEFTune.

Project repository: https://gitcode.com/GitHub_Trending/swift1/swift

Hardware compatibility is a common source of trouble in deep learning training. This article explains how to configure NVIDIA RTX 4090 GPUs correctly for GRPO (Group Relative Policy Optimization) training in the Swift project.

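Communication limits of the RTX 4090

RTX 40-series GPUs have specific restrictions when communicating through NCCL (NVIDIA Collective Communications Library):

No support for fast P2P (Peer-to-Peer) communication
No support for high-bandwidth communication over IB (InfiniBand)

When distributed training is launched directly on these cards, a NotImplementedError is raised, telling the user to disable both communication modes.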
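
Whether a machine is affected can be checked before launching. The following is a minimal sketch, assuming that affected cards report a name containing the substring `RTX 40` through `nvidia-smi`. The helper name `needs_nccl_workaround` is hypothetical and is not part of Swift.

```shell
# Hypothetical helper: succeeds when the GPU name looks like an RTX 40-series card.
# Matching on the substring "RTX 40" is an assumption, not an official detection rule.
needs_nccl_workaround() {
    case "$1" in
        *"RTX 40"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the first GPU's name only when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
    if needs_nccl_workaround "$gpu_name"; then
        echo "$gpu_name: NCCL P2P/IB restrictions likely apply"
    else
        echo "$gpu_name: no RTX 40-series restriction detected"
    fi
fi
```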

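Solution

To work around this limitation, set the following two environment variables when launching the training script:

NCCL_P2P_DISABLE="1"
NCCL_IB_DISABLE="1"

Their effects are:

NCCL_P2P_DISABLE="1": disables NCCL's P2P transport, that is, direct GPU-to-GPU access over NVLink or PCIe
NCCL_IB_DISABLE="1": prevents NCCL from using the InfiniBand transport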
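
Instead of prefixing a single command, the variables can also be exported once for the whole shell session. This is a small sketch of that alternative; it only sets and prints the variables and does not start any training.

```shell
# Export once so that every later command in this shell inherits the settings.
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

# Show the NCCL-related variables that child processes will see.
env | grep '^NCCL_' | sort
```

Prefixing the variables to one command limits them to that command, while `export` keeps them active until the shell exits, which matters if the same session is later used on hardware without this restriction.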

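Complete training launch command

Combined with the requirements of GRPO training in Swift, a complete example launch command looks like this: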

CUDA_VISIBLE_DEVICES=0,1,2,3,4 \
NCCL_P2P_DISABLE="1" \
NCCL_IB_DISABLE="1" \
NPROC_PER_NODE=4 \
swift rlhf \
    --rlhf_type grpo \
    --model joshuaHe/tcm_qwen2.5-1.5b-sft \
    --model_type qwen2_5 \
    --dataset '/path/to/data' \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs TCMSDAccuracy format \
    --use_vllm true \
    --vllm_device auto \
    --vllm_gpu_memory_utilization 0.9 \
    --vllm_max_model_len 4096 \
    --train_type lora \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --torch_dtype bfloat16 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 3 \
    --per_device_eval_batch_size 3 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 1 \
    --logging_steps 10 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --num_generations 6 \
    --temperature 0.9 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true
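
The command above exposes five GPUs but starts only four training processes. The split can be derived rather than hard-coded. The sketch below assumes that one visible GPU is reserved for vLLM generation and the rest run training processes; the variable names are illustrative only.

```shell
# Illustrative values: the same five GPU indices as in the command above.
visible_gpus="0,1,2,3,4"

# Count the comma-separated entries to get the number of visible GPUs.
total_gpus=$(printf '%s\n' "$visible_gpus" | awk -F',' '{print NF}')

# Assumption: reserve one GPU for vLLM, use the remaining ones for training.
nproc=$(( total_gpus - 1 ))

echo "visible GPUs: $total_gpus, training processes: $nproc"
```

The resulting values would then be passed as `CUDA_VISIBLE_DEVICES="$visible_gpus"` and `NPROC_PER_NODE="$nproc"` in front of the `swift rlhf` command.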

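Key parameters

GPU configuration:
- CUDA_VISIBLE_DEVICES: the indices of the GPUs made visible to the job
- NPROC_PER_NODE: the number of training processes per node. In this example it is one less than the number of visible GPUs (4 processes for 5 GPUs), so that one GPU remains available for vLLM generation
Model configuration:
- --model_type qwen2_5: the model architecture type
- --train_type lora: fine-tune with LoRA
- --lora_rank 8 and --lora_alpha 32: the LoRA rank and scaling factor
Training parameters:
- --per_device_train_batch_size 3: the training batch size on each device
- --gradient_accumulation_steps 4: the number of gradient accumulation steps
- --deepspeed zero2: use DeepSpeed ZeRO stage 2, which shards optimizer states and gradients across processes
vLLM configuration:
- --use_vllm true: enable the vLLM inference framework for generation
- --vllm_gpu_memory_utilization 0.9: the fraction of GPU memory vLLM is allowed to use (90% here)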
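
As a rough sanity check, the batch-related values from the launch command can be multiplied out. This sketch rests on an assumption: that the per-step global batch is the per-device batch size times the number of training processes, and that it should divide evenly by `--num_generations`. The exact rule may differ between Swift versions.

```shell
# Values copied from the launch command.
per_device_batch=3
nproc=4
grad_accum=4
num_generations=6

# Assumed definitions: completions per forward step and per optimizer update.
per_step=$(( per_device_batch * nproc ))
effective=$(( per_step * grad_accum ))

echo "completions per step: $per_step"
echo "completions per optimizer update: $effective"

# Check that each step holds a whole number of generation groups.
if [ $(( per_step % num_generations )) -eq 0 ]; then
    echo "per-step batch is divisible by num_generations"
else
    echo "per-step batch is NOT divisible by num_generations"
fi
```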