Adding Timestamps to PiperTTS Training Logs: A Practical Walkthrough

2025-05-26 21:08:51 Author: 董灵辛Dennis

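Background

When training a PiperTTS model, logs are essential for monitoring progress and troubleshooting. By default, the log messages emitted by PyTorch Lightning, the framework PiperTTS uses, carry no timestamps, which makes it hard to manage training time and analyze performance. This article describes how to modify the relevant code so that the training logs include timestamps accurate to the second.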

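Problem Analysis

PiperTTS training involves three key log output points:

The checkpoint-loading message when training resumes
The checkpoint-saving messages during training
The completion notice when training ends

By default these messages are terse and carry no time information, so it is difficult to tell how long each stage takes.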

Solution

1. Modifying the checkpoint connector

In PyTorch Lightning's checkpoint_connector.py, we changed the log output produced when training resumes as follows:

from datetime import datetime  # newly added import

# Replaces the original log call
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
rank_zero_info(f"\n\n[{timestamp}]\nRestored all states from the checkpoint file at \n{self.resume_checkpoint_path}\n------------------------------------------\nStarting Training...\n------------------------------------------")

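This change:

Adds a timestamp accurate to the second
Uses separator lines to improve readability
Clearly marks the point where training starts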
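The banner construction above can be tried outside the library with a small standalone sketch. The function name `format_restore_banner` and its `now` parameter are hypothetical and exist only for illustration; they are not part of PyTorch Lightning. In the actual patch the resulting string is passed to `rank_zero_info`.

```python
from datetime import datetime
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def format_restore_banner(checkpoint_path: str, now: Optional[datetime] = None) -> str:
    """Build a timestamped 'restored checkpoint' banner (hypothetical helper)."""
    # Accept an explicit time so the output can be checked deterministically.
    current = now if now is not None else datetime.now()
    timestamp = current.strftime(TIMESTAMP_FORMAT)
    rule = "-" * 42  # separator width is an arbitrary choice for this sketch
    return (
        f"\n\n[{timestamp}]\n"
        f"Restored all states from the checkpoint file at \n{checkpoint_path}\n"
        f"{rule}\nStarting Training...\n{rule}"
    )


banner = format_restore_banner(
    "/path/to/checkpoint.ckpt", now=datetime(2024, 11, 30, 15, 8, 50)
)
print(banner)
```

Passing the time in as an argument keeps the formatting logic testable without touching the clock.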

2. Modifying the training loop

In fit_loop.py, we changed the notice that is logged when training ends:

from datetime import datetime  # newly added import

# Replaces the original log call
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
rank_zero_info(f"\n---------------------------------------------------------------------\n[{timestamp}] - `Trainer.fit` stopped: `max_epochs={self.max_epochs!r}` reached.\n---------------------------------------------------------------------")

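This change:

Records the exact time training ended
Uses prominent separator lines to make the end-of-training message stand out
Keeps the format consistent with the start-of-training message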

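3. Modifying local file operations

In local.py (the module that logs under the name fsspec.local), we added a timestamp to the debug message logged for each checkpoint save: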

from datetime import datetime  # newly added import

# Modified log call; the timestamp is passed as a logging argument
timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
logger.debug(" [%s] open: %s", timestamp, path)

This change:

Records the exact time of each checkpoint save
Keeps the original debug information while adding a time dimension
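For comparison, a similar prefix can be produced without editing library files by attaching a formatter from Python's standard `logging` module to the logger in question. The following is only a sketch under assumptions: the helper `add_timestamp_handler` is hypothetical, the logger name `fsspec.local` is taken from the log output shown in this article, and the logger names used by PyTorch Lightning vary by version, so they are not assumed here.

```python
import io
import logging


def add_timestamp_handler(logger_name: str, stream=None) -> logging.Handler:
    """Attach a handler that prefixes each record with a timestamp (hypothetical helper)."""
    handler = logging.StreamHandler(stream)  # defaults to sys.stderr when stream is None
    handler.setFormatter(
        logging.Formatter(
            fmt="%(levelname)s:%(name)s: [%(asctime)s] %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger = logging.getLogger(logger_name)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    # Stop propagation so the root logger does not emit a second, untimestamped copy.
    logger.propagate = False
    return handler


# Demonstration: write to an in-memory buffer instead of stderr.
buffer = io.StringIO()
add_timestamp_handler("fsspec.local", stream=buffer)
logging.getLogger("fsspec.local").debug("open: %s", "/path/to/checkpoint.ckpt")
print(buffer.getvalue(), end="")
```

The trade-off is that a formatter timestamps every message from that logger uniformly, whereas the in-place edits described above allow custom banners and separator lines.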

Before and After

Typical log output before the changes:

DEBUG:fsspec.local:open file: /path/to/checkpoint.ckpt
Restored all states from the checkpoint file at /path/to/checkpoint.ckpt
`Trainer.fit` stopped: `max_epochs=800` reached.

Typical log output after the changes:

DEBUG:fsspec.local: [2024-11-30 15:08:50] open: /path/to/checkpoint.ckpt

[2024-11-30 15:08:50]
Restored all states from the checkpoint file at
/path/to/checkpoint.ckpt
------------------------
Starting Training...
------------------------

------------------------------------------------------------------------------
[2024-11-30 15:14:54] - `Trainer.fit` stopped: `max_epochs=950` reached.
------------------------------------------------------------------------------
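
With timestamps in place, the duration of a run can be derived from the start and end markers. The following is a minimal sketch that parses the two timestamps from the sample output above; the function name `elapsed_seconds` is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    started = datetime.strptime(start, TIMESTAMP_FORMAT)
    ended = datetime.strptime(end, TIMESTAMP_FORMAT)
    return (ended - started).total_seconds()


duration = elapsed_seconds("2024-11-30 15:08:50", "2024-11-30 15:14:54")
minutes, seconds = divmod(int(duration), 60)
print(f"{duration:.0f} s ({minutes} min {seconds} s)")  # 364 s (6 min 4 s)
```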