Analyzing a Multi-GPU Parallel Task Problem in the Chai-Lab Project

2025-07-10 18:57:25 Author: 董灵辛Dennis

Background

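When running parallel tasks across multiple GPUs in the Chai-Lab project, users have reported a common problem: switching the model from the default cuda:0 device to another GPU (such as cuda:1) raises a runtime error with the message "Expected all tensors to be on the same device, but found at least two devices".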

Root Cause Analysis

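The core of the problem is PyTorch's strict device-consistency requirement: when different parts of a model, or the model and its input data, sit on different GPUs, PyTorch refuses to run the operation. In Chai-Lab, the problem shows up in the following ways: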

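The exported model may carry device information that was baked in at export time
Tensors in the state dict are not moved to the target device
Device arguments on computation-graph nodes are not updated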
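
To make the first and third points concrete, here is a minimal sketch of how a literal device ends up recorded on a graph node. It assumes a toy module (Toy is hypothetical, not Chai-Lab code) and uses torch.fx.symbolic_trace as a lightweight stand-in for torch.export; both produce a torch.fx graph whose nodes expose kwargs in the same way.

```python
import torch
import torch.fx


class Toy(torch.nn.Module):
    """Hypothetical stand-in for an exported component (not Chai-Lab code)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The device is written as a literal, so tracing records it on the node.
        offset = torch.zeros_like(x, device="cpu")
        return x + offset


def nodes_with_device_kwarg(graph: torch.fx.Graph) -> list[tuple[str, object]]:
    """Return (node name, device) for every node that carries a device kwarg."""
    return [(n.name, n.kwargs["device"]) for n in graph.nodes if "device" in n.kwargs]


traced = torch.fx.symbolic_trace(Toy())
pinned = nodes_with_device_kwarg(traced.graph)
print(pinned)  # one entry: the zeros_like call, pinned to "cpu"
```

Moving such a module with .to(...) relocates its parameters and buffers but leaves the recorded kwarg untouched, which is why the graph itself has to be rewritten.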

Solutions in Detail

Workaround: Use CUDA_VISIBLE_DEVICES

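For most users, the simplest solution is to set the CUDA_VISIBLE_DEVICES environment variable, which controls which GPUs a process can see. This approach requires no change to the device specified in the code: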

# In the first terminal
CUDA_VISIBLE_DEVICES=0 python ./examples/predict_structure.py

# In the second terminal
CUDA_VISIBLE_DEVICES=1 python ./examples/predict_structure.py

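With this approach the code can keep referring to cuda:0. Each process numbers its visible devices starting from 0, so cuda:0 maps to a different physical GPU in each process.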
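
The two shell commands above can also be driven from a single script. The following is a minimal sketch using only the Python standard library; the helper names env_for_gpu and launch_per_gpu are hypothetical and are not part of Chai-Lab.

```python
import os
import subprocess
import sys


def env_for_gpu(gpu_id: int) -> dict[str, str]:
    """Copy the current environment and expose only one physical GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def launch_per_gpu(command: list[str], gpu_ids: list[int]) -> list[int]:
    """Start one copy of `command` per GPU, wait for all, return exit codes."""
    procs = [subprocess.Popen(command, env=env_for_gpu(g)) for g in gpu_ids]
    return [p.wait() for p in procs]
```

For example, launch_per_gpu([sys.executable, "./examples/predict_structure.py"], [0, 1]) would start both runs at once, assuming GPUs 0 and 1 exist. In practice each run would probably also need its own input and output paths, which this sketch does not handle.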

Fundamental Fix: Modify the Model Loading Logic

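For scenarios that need to target a specific GPU directly (such as distributed training), the model loading logic has to be modified. The key points are: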

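Walk the nodes of the exported program's computation graph and update every device argument
Move every tensor in the state dict to the target device
Make sure the whole model ends up on the target device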

The following modification has been verified to work:

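import torch

# Assumed import location of Chai-Lab's path helper.
from chai_lab.utils.paths import chai1_component


def load_exported(comp_key: str, device: torch.device) -> torch.nn.Module:
    """Load an exported component and move all of it to `device`."""
    local_path = chai1_component(comp_key)
    exported_program = torch.export.load(local_path)

    # Update the device argument of every graph node that carries one.
    for node in exported_program.graph.nodes:
        if "device" in node.kwargs:
            kwargs = dict(node.kwargs)  # work on a copy, then reassign
            kwargs["device"] = device
            node.kwargs = kwargs

    # Move every tensor in the state dict to the target device.
    # Iterate over a snapshot because the dict is updated inside the loop.
    for k, v in list(exported_program.state_dict.items()):
        if isinstance(v, torch.nn.Parameter):
            # Re-wrap as a Parameter and keep its requires_grad flag.
            exported_program._state_dict[k] = torch.nn.Parameter(
                v.to(device), requires_grad=v.requires_grad
            )
        else:
            exported_program._state_dict[k] = v.to(device)

    module = exported_program.module()
    return module.to(device)  # Key step: ensure the whole module is on the target device.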
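
After loading, one way to sanity-check the result is to list any parameters or buffers that are still on another device. This is a minimal sketch with a hypothetical helper, tensors_off_device, demonstrated on a small CPU model so that it runs without a GPU. It does not inspect device arguments stored in the graph, so it complements the node rewrite rather than replacing it.

```python
import torch


def tensors_off_device(module: torch.nn.Module, device: torch.device) -> list[str]:
    """List the names of parameters and buffers that are not on `device`."""
    named = list(module.named_parameters()) + list(module.named_buffers())
    return [name for name, t in named if t.device != device]


# Toy model used only for illustration; it has both parameters and buffers.
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
stray = tensors_off_device(model, torch.device("cpu"))
print(stray)  # empty when everything is on the CPU
```

When checking a GPU, pass a device with an explicit index such as torch.device("cuda:1"), because tensors report their device with an index and torch.device("cuda") would not compare equal to it.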