StreamSpeech Open-Source Project Tutorial

2024-09-14 22:26:29 Author: 盛欣凯Ernestine

1. Project Introduction

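StreamSpeech is an "All in One" seamless model designed for offline and simultaneous speech recognition, speech translation, and speech synthesis. Built on a multi-task learning framework, it handles all three tasks within a single model, which makes it well suited to real-time communication. Besides offline processing, StreamSpeech supports simultaneous processing: it can start producing target speech while the source speech is still being received, which reduces the delay a listener experiences in real-time communication.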

2. Quick Start

Environment Setup

Make sure your environment meets the following requirements:

Python == 3.10
PyTorch == 2.0.1
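
The version requirements above can be checked before installing anything else. The following is a minimal sketch, not part of StreamSpeech: `version_matches` is a hypothetical helper that compares a version string against a required prefix, and the PyTorch check only runs if `torch` is already importable.

```shell
# version_matches is a hypothetical helper (not part of StreamSpeech).
# It succeeds when the actual version equals the wanted prefix, or starts
# with the wanted prefix followed by "." or "+" (e.g. "2.0.1+cu117").
version_matches() {
  case "$1" in
    "$2"|"$2".*|"$2"+*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the Python interpreter version (e.g. "3.10.12").
py_version="$(python3 -c 'import platform; print(platform.python_version())')"
if version_matches "$py_version" "3.10"; then
  echo "Python $py_version: OK"
else
  echo "Python $py_version: expected 3.10.x"
fi

# Check the PyTorch version only if torch can be imported.
if torch_version="$(python3 -c 'import torch; print(torch.__version__)' 2>/dev/null)"; then
  if version_matches "$torch_version" "2.0.1"; then
    echo "PyTorch $torch_version: OK"
  else
    echo "PyTorch $torch_version: expected 2.0.1"
  fi
else
  echo "PyTorch is not installed in this environment"
fi
```

The prefix match is deliberate: PyTorch builds often carry a local suffix such as a CUDA tag, which an exact string comparison would reject.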

Install Dependencies

First, clone the project to your local machine:

git clone https://github.com/ictnlp/StreamSpeech.git
cd StreamSpeech

Install fairseq and SimulEval:

cd fairseq
pip install --editable ./ --no-build-isolation
cd ../SimulEval
pip install --editable ./
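
Once both editable installs finish, it may be worth confirming that the packages import cleanly. This is a minimal sketch assuming the import names are `fairseq` and `simuleval`; `can_import` is a hypothetical helper, not something either package provides.

```shell
# can_import is a hypothetical helper: it succeeds when the given module
# can be imported by the current python3 interpreter.
can_import() {
  python3 -c "import importlib, sys; importlib.import_module(sys.argv[1])" "$1" 2>/dev/null
}

for module in fairseq simuleval; do
  if can_import "$module"; then
    echo "$module: importable"
  else
    echo "$module: NOT importable, re-run the pip install step"
  fi
done
```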

Model Download

Download a StreamSpeech model and the pretrained HiFi-GAN vocoder:

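# Download the StreamSpeech model
# Example: the offline model for the Fr-En language pair
# (check the project README for the current download links)
wget https://huggingface.co/streamspeech/offline/fr-en/pt/model.pt

# Download the HiFi-GAN vocoder
# The vocoder checkpoint is also named model.pt; -P saves the vocoder files
# into their own directory so wget does not write it as model.pt.1
wget -P vocoder https://huggingface.co/streamspeech/vocoder/hifigan/fr-en/config.json
wget -P vocoder https://huggingface.co/streamspeech/vocoder/hifigan/fr-en/model.pt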
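
A failed or interrupted download can leave a missing or zero-byte file that only surfaces later as a confusing load error. The sketch below checks for that up front. `require_file` is a hypothetical helper, and the file list is an assumption: adjust it to the paths your wget commands actually wrote.

```shell
# require_file is a hypothetical helper: it succeeds only when the path
# is a regular file and is not empty.
require_file() {
  if [ -f "$1" ] && [ -s "$1" ]; then
    echo "found: $1"
  else
    echo "missing or empty: $1" >&2
    return 1
  fi
}

# Assumed file list; edit to match where the downloads were saved.
for f in model.pt config.json; do
  require_file "$f" || echo "re-download $f before running inference"
done
```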

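Data Preparation

Prepare the test data as two files, one source audio path per line and one reference text per line, in matching order: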

# wav_list.txt
/path/to/source_speech1.wav
/path/to/source_speech2.wav

# target.txt
reference_text1
reference_text2
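
Because the two files are paired line by line, a mismatch in length or a wrong audio path is easy to introduce. The following sketch checks both. `check_pairs` is a hypothetical helper, and it assumes that blank lines are not meaningful entries in either file.

```shell
# check_pairs is a hypothetical helper: it verifies that the wav list and
# the reference file contain the same number of non-empty lines, and that
# every listed audio path exists.
check_pairs() {
  local wav_list="$1" target="$2" n_wav n_tgt wav status=0
  n_wav="$(grep -c . "$wav_list")"
  n_tgt="$(grep -c . "$target")"
  if [ "$n_wav" -ne "$n_tgt" ]; then
    echo "line count mismatch: $n_wav wav paths vs $n_tgt references" >&2
    return 1
  fi
  # Read each path; the extra test keeps a last line without a newline.
  while IFS= read -r wav || [ -n "$wav" ]; do
    [ -z "$wav" ] && continue
    if [ ! -f "$wav" ]; then
      echo "missing audio file: $wav" >&2
      status=1
    fi
  done < "$wav_list"
  return "$status"
}
```

A typical call would be `check_pairs example/wav_list.txt example/target.txt && echo "data looks consistent"`.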

Run Inference

Run inference with SimulEval:

export CUDA_VISIBLE_DEVICES=0
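ROOT=/path/to/StreamSpeech              # cloned StreamSpeech repository
PRETRAIN_ROOT=/path/to/pretrain_models  # directory holding the downloaded models
# Unit-based HiFi-GAN vocoder checkpoint and its config
VOCODER_CKPT=$PRETRAIN_ROOT/unit-based_HiFi-GAN_vocoder/mHuBERT_layer11_km1000_en/g_00500000
VOCODER_CFG=$PRETRAIN_ROOT/unit-based_HiFi-GAN_vocoder/mHuBERT_layer11_km1000_en/config.json
LANG=fr                                 # source language of the Fr-En pair
# Must resolve to the downloaded StreamSpeech checkpoint file
file=streamspeech_simultaneous_$LANG-en_pt
output_dir=$ROOT/res/streamspeech_simultaneous_$LANG-en/simul-s2st
chunk_size=320                          # source segment size in ms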

PYTHONPATH=$ROOT/fairseq simuleval --data-bin $ROOT/configs/$LANG-en \
  --user-dir $ROOT/researches/ctc_unity --agent-dir $ROOT/agent \
  --source example/wav_list.txt --target example/target.txt \
  --model-path $file \
  --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \
  --agent $ROOT/agent/speech_to_speech_streamspeech_agent.py \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG --dur-prediction \
  --output $output_dir/chunk_size=$chunk_size \
  --source-segment-size $chunk_size \
  --quality-metrics ASR_BLEU --target-speech-lang en --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks DiscontinuitySum DiscontinuityAve DiscontinuityNum RTF \
  --device gpu --computation-aware \
  --output-asr-translation True
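
Since the chunk size sets how much source audio is read before each step, it can be useful to repeat the run for several values and compare the results. The loop below is a dry-run sketch only: it prints the per-run output directory, following the `chunk_size=` naming used in the command above, and does not call simuleval. `run_dir_for` and `SRC_LANG` are hypothetical names (the latter avoids reusing `LANG`, which is also the standard locale variable), and the list of chunk sizes is an illustrative assumption.

```shell
ROOT="${ROOT:-/path/to/StreamSpeech}"
SRC_LANG="${SRC_LANG:-fr}"   # hypothetical name, stands in for LANG above

# run_dir_for is a hypothetical helper: it prints the output directory
# for a given chunk size in milliseconds.
run_dir_for() {
  echo "$ROOT/res/streamspeech_simultaneous_${SRC_LANG}-en/simul-s2st/chunk_size=$1"
}

# Illustrative chunk sizes; replace the echo with the real simuleval call.
for chunk_size in 320 640 960; do
  out="$(run_dir_for "$chunk_size")"
  echo "would run simuleval with --source-segment-size $chunk_size --output $out"
done
```

Keeping one output directory per chunk size means earlier results are not overwritten when a later run starts.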