Preserving the Autograd Graph in Distributed Training with TorchMetrics: A Technical Walkthrough

2025-07-03 12:23:40 Author: 苗圣禹Peter

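Background

When a custom TorchMetrics Metric is used as a loss function in PyTorch Lightning, developers often run into gradient propagation problems under distributed data parallel (DDP) training. In particular, with dist_sync_on_step=True, gradient information is lost by default during the all_gather operation, so the value computed from the synchronized state can no longer be backpropagated correctly and training suffers.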

The Underlying Problem

In DDP mode, a Metric's forward() method triggers the following chain of synchronization calls:

forward() calls _forward_reduce_state_update()
which calls compute(), wrapped by _wrap_compute()
which runs sync()
which finally calls _sync_dist()

This synchronization path uses torchmetrics.utilities.distributed.gather_all_tensors, and its helper _simple_gather_all_tensors returns tensors that no longer carry the autograd graph of the original tensor.
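The loss of graph information can be reproduced without a process group. The sketch below is a single-process simulation, not the library's code: fake_all_gather is a hypothetical stand-in that, like the real collective, only writes values into preallocated buffers and records nothing for autograd.

```python
import torch


def fake_all_gather(buffers, local, remote_values):
    """Hypothetical single-process stand-in for torch.distributed.all_gather."""
    with torch.no_grad():
        buffers[0].copy_(local)  # slot of the simulated local rank 0
        for buf, value in zip(buffers[1:], remote_values):
            buf.copy_(value)  # slots of the simulated remote ranks


w = torch.tensor(3.0, requires_grad=True)
local_result = w * 2.0  # attached to the graph through w

# Same allocation pattern as the gather helper: fresh buffers, no graph.
buffers = [torch.zeros_like(local_result) for _ in range(2)]
fake_all_gather(buffers, local_result, [torch.tensor(5.0)])

print(local_result.requires_grad)  # True
print(buffers[0].requires_grad)    # False
print(buffers[0].grad_fn)          # None
```

Any value computed only from these buffers has no path back to w, so calling backward() on it would fail.
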

Solution

The core of the problem is how _simple_gather_all_tensors is implemented. The original implementation is:

from typing import Any, List
import torch
from torch import Tensor

def _simple_gather_all_tensors(result: Tensor, group: Any, world_size: int) -> List[Tensor]:
    gathered_result = [torch.zeros_like(result) for _ in range(world_size)]
    torch.distributed.all_gather(gathered_result, result, group)
    return gathered_result

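With this implementation, the autograd graph of the input tensor result is lost: all_gather only writes values into the freshly allocated buffers, and none of them has a grad_fn. The fix is to explicitly assign the current process's original result back into its slot after the all_gather: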

def _simple_gather_all_tensors(result: Tensor, group: Any, world_size: int) -> List[Tensor]:
    gathered_result = [torch.zeros_like(result) for _ in range(world_size)]
    torch.distributed.all_gather(gathered_result, result, group)
    # Put the original, graph-attached tensor back into this rank's slot
    gathered_result[torch.distributed.get_rank(group)] = result
    return gathered_result
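For readers who prefer not to edit the installed library, the same idea can be written as a standalone function. The sketch below is an assumption-laden illustration: gather_keep_local_grad is a hypothetical name, it assumes every process holds a tensor of the same shape, and it falls back to returning the input when no process group is initialized.

```python
from typing import Any, List, Optional

import torch
import torch.distributed as dist
from torch import Tensor


def gather_keep_local_grad(result: Tensor, group: Optional[Any] = None) -> List[Tensor]:
    """Hypothetical all-gather helper that keeps the local autograd graph.

    Assumes all processes pass tensors of identical shape.
    """
    if not (dist.is_available() and dist.is_initialized()):
        return [result]  # single process: nothing to gather
    world_size = dist.get_world_size(group)
    gathered = [torch.zeros_like(result) for _ in range(world_size)]
    dist.all_gather(gathered, result, group)
    # Replace this rank's copy with the original tensor, which still has its graph.
    gathered[dist.get_rank(group)] = result
    return gathered
```

Metric accepts a dist_sync_fn keyword argument for a custom gather function, so a helper of this shape could plausibly be passed when constructing a metric, for example MyMetric(dist_sync_on_step=True, dist_sync_fn=gather_keep_local_grad), where MyMetric is a placeholder. Whether this fits a given TorchMetrics version should be checked against its documentation.
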

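How It Works

The key points of this change are:

all_gather collects result from every process into gathered_result
the collected tensors, however, carry no information about the original computation graph
the current process's original result, which still holds its autograd graph, is explicitly assigned back to its position in gathered_result
in later computation, gradients can then flow back through the local result, while entries received from other processes remain constants that contribute no gradient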
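The points above can be checked with a small single-process simulation. It is only a sketch: the second rank is simulated by a constant tensor, and rank = 0 is an assumed value for the current process.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
local_result = w * 2.0             # this rank's state, attached to the graph
remote_result = torch.tensor(5.0)  # value received from a simulated second rank

# What the collective leaves behind: plain buffers that hold values only.
gathered = [local_result.detach().clone(), remote_result.clone()]

# The fix: put the original, graph-attached tensor back into this rank's slot.
rank = 0  # assumed rank of the current process
gathered[rank] = local_result

loss = torch.stack(gathered).sum()
loss.backward()
print(w.grad)  # tensor(2.)
```

The gradient reaches w only through the local entry. The simulated remote entry adds to the loss value but contributes nothing to the gradient on this rank.
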

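Use Cases

This technique is particularly useful when:

batch sizes are uneven across processes
the original mathematical definition of the loss must be preserved exactly
a custom Metric serves as the loss function in distributed training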
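The first two cases are related, as a short worked example shows. The numbers below are made up for illustration: with uneven batch sizes, averaging the per-process means gives a different value from the mean over all samples, which is why gathering the underlying sums and counts matters when the loss definition must stay exact.

```python
# Per-sample losses on two processes with uneven batch sizes (made-up numbers).
rank_losses = [[1.0, 2.0, 3.0, 4.0], [10.0]]

# Averaging the per-process means weights every process equally.
per_rank_means = [sum(x) / len(x) for x in rank_losses]
mean_of_means = sum(per_rank_means) / len(per_rank_means)

# Gathering sums and counts first reproduces the mean over all samples.
total = sum(sum(x) for x in rank_losses)
count = sum(len(x) for x in rank_losses)
global_mean = total / count

print(mean_of_means)  # 6.25
print(global_mean)    # 4.0
```
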

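Performance Considerations

Although this change preserves the autograd graph, a few costs remain:

memory use increases because the original computation graph must be kept alive
the overhead of the synchronization itself is still there
gradient computation may cause an additional memory peak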

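Summary

Preserving the autograd graph during distributed training is a common requirement in TorchMetrics, especially when a custom Metric is used as the loss function. Modifying _simple_gather_all_tensors resolves the gradient propagation problem while keeping distributed training correct, and it gives a dependable way to handle harder cases such as uneven batch sizes.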

torchmetrics

Machine learning metrics for distributed, scalable PyTorch applications.

Project page: https://gitcode.com/gh_mirrors/to/torchmetrics
