No server required! Real-time speech-to-text in the browser in 5 minutes: Whisper in practice with transformers.js

2026-02-04 05:01:22 Author: 卓炯娓

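Is the cost of cloud speech recognition holding you back? Are you worried about the privacy of user audio sent over the network? This article shows how to build real-time speech-to-text that runs locally in the browser at no service cost: there is no backend server, and all processing happens on the user's device. By the end, you will know how to use transformers.js and the Whisper model to build a private, efficient speech recognition app.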

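Technical overview

transformers.js is a machine learning library optimized for the web that lets developers run pretrained models directly in the browser. Whisper is a general-purpose speech recognition model developed by OpenAI that transcribes speech in many languages into text.

Combining the two gives speech recognition that runs entirely on the client. The main advantages of this approach are: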

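Traditional cloud service	transformers.js approach
Requires a backend server	Runs entirely in the browser
Risk of exposing user audio	Audio never leaves the user's device
Affected by network latency	Works offline once the model has been downloaded and cached
Billed per request	No usage fees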

Implementation steps

1. Preparation

First, get the example code from the project repository:

git clone https://gitcode.com/GitHub_Trending/tr/transformers.js
cd transformers.js/examples/webgpu-whisper
npm install

The example project already includes all required configuration and dependencies; the code lives in the examples/webgpu-whisper/ directory.

2. How it works

The sequence diagram below shows the real-time speech recognition workflow:

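sequenceDiagram
    participant User
    participant Browser
    participant Web Audio API
    participant Whisper model
    participant Result display

    User->>Browser: Grant microphone access
    Browser->>Web Audio API: Obtain audio stream
    Web Audio API->>Web Audio API: Process and resample audio
    Web Audio API->>Whisper model: Audio data
    Whisper model->>Whisper model: Run speech recognition
    Whisper model->>Result display: Recognized text
    Result display->>User: Show text in real time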

3. Key code walkthrough

3.1 Initializing the Worker

To avoid blocking the UI, model loading and speech recognition run in a background thread via a Web Worker:

// [examples/webgpu-whisper/src/App.jsx](https://gitcode.com/GitHub_Trending/tr/transformers.js/blob/1538e3a1544a93ef323e41c4e3baef6332f4e557/examples/webgpu-whisper/src/App.jsx?utm_source=gitcode_repo_files#L37-L45)
useEffect(() => {
  if (!worker.current) {
    // Create the Worker only once; later renders reuse the same instance
    worker.current = new Worker(new URL('./worker.js', import.meta.url), {
      type: 'module'
    });
  }
  // ...
}, []);
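
The elided part of the effect is where the app would listen for messages coming back from the Worker. As a rough sketch of that idea, the reducer below folds Worker messages into UI state. The function name `reduceWorkerMessage`, the state shape, and the `status` values (`loading`, `ready`, `update`, `complete`, `error`) are assumptions made for illustration, not necessarily the ones the example project uses.

```javascript
// Hypothetical UI state for the transcription view.
const initialState = { status: 'idle', text: '', error: null };

// Pure function: takes the previous state and one Worker message,
// returns the next state. The status names are illustrative assumptions.
function reduceWorkerMessage(state, message) {
  switch (message.status) {
    case 'loading': // model files are being fetched or initialized
      return { ...state, status: 'loading' };
    case 'ready': // model is loaded and can accept audio
      return { ...state, status: 'ready' };
    case 'update': // partial transcription while decoding is in progress
      return { ...state, text: message.output };
    case 'complete': // final transcription for the current audio window
      return { ...state, status: 'ready', text: message.output };
    case 'error':
      return { ...state, status: 'error', error: message.error };
    default: // ignore messages this sketch does not know about
      return state;
  }
}

// In the browser this could be wired up roughly as:
//   worker.current.addEventListener('message', (e) =>
//     setState((prev) => reduceWorkerMessage(prev, e.data)));
```

Keeping the message handling in a pure function makes it straightforward to test without a browser or a real Worker.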

3.2 Loading the Whisper model

When the user clicks the "Load model" button, the app sends a load request to the Worker:

// [examples/webgpu-whisper/src/App.jsx](https://gitcode.com/GitHub_Trending/tr/transformers.js/blob/1538e3a1544a93ef323e41c4e3baef6332f4e557/examples/webgpu-whisper/src/App.jsx?utm_source=gitcode_repo_files#L211-L214)
<button
  onClick={() => {
    worker.current.postMessage({ type: 'load' });
    setStatus('loading');
  }}
>
  Load model
</button>
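
On the Worker side, something has to receive that `load` message. The sketch below shows one way such a dispatcher could look; it is not the example project's worker.js. `createWorkerHandler`, `loadModel`, `transcribe`, `post`, and the reply `status` values are hypothetical names, and the model loader is injected so that the caching logic can be shown on its own.

```javascript
// Builds a message handler for the Worker. All three dependencies are
// injected (hypothetical names):
//   loadModel()             -> Promise resolving to a model object
//   transcribe(model, data) -> Promise resolving to the recognized text
//   post(message)           -> sends a message back to the main thread
function createWorkerHandler({ loadModel, transcribe, post }) {
  let modelPromise = null; // cached so the model is loaded at most once

  function getModel() {
    if (modelPromise === null) {
      modelPromise = loadModel();
    }
    return modelPromise;
  }

  return async function handle(message) {
    switch (message.type) {
      case 'load':
        post({ status: 'loading' });
        await getModel();
        post({ status: 'ready' });
        break;
      case 'generate': {
        const model = await getModel();
        const output = await transcribe(model, message.data);
        post({ status: 'complete', output });
        break;
      }
      default:
        post({ status: 'error', error: `Unknown message type: ${message.type}` });
    }
  };
}

// Inside a real Worker this could be wired up roughly as:
//   const handle = createWorkerHandler({
//     loadModel,  // e.g. a wrapper around the transformers.js model loading call
//     transcribe,
//     post: (msg) => self.postMessage(msg),
//   });
//   self.addEventListener('message', (e) => handle(e.data));
```

Caching the promise rather than the resolved model means that a second `load` or an early `generate` waits for the download already in flight instead of starting another one.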

3.3 Requesting microphone access and recording

Request access to the user's microphone, then record audio with the MediaRecorder API:

// [examples/webgpu-whisper/src/App.jsx](https://gitcode.com/GitHub_Trending/tr/transformers.js/blob/1538e3a1544a93ef323e41c4e3baef6332f4e557/examples/webgpu-whisper/src/App.jsx?utm_source=gitcode_repo_files#L120-L127)
navigator.mediaDevices.getUserMedia({ audio: true })
  .then(stream => {
    setStream(stream);
    recorderRef.current = new MediaRecorder(stream);
    audioContextRef.current = new AudioContext({ sampleRate: WHISPER_SAMPLING_RATE });
    // ...
  })
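
The elided part of this callback is where the recorder's output would be collected. As a minimal sketch, assuming the chunks are simply handed to a callback, the helper below attaches a `dataavailable` handler to a recorder. `attachRecorder` and `onChunk` are hypothetical names, and the recorder class is passed in as a parameter so the helper can be exercised outside a browser.

```javascript
// Creates a recorder for the given stream and forwards every non-empty
// chunk to onChunk. RecorderClass is expected to behave like the browser's
// MediaRecorder (constructed from a stream, exposes ondataavailable).
function attachRecorder(stream, RecorderClass, onChunk) {
  const recorder = new RecorderClass(stream);
  recorder.ondataavailable = (event) => {
    // A dataavailable event can carry an empty Blob; skip those.
    if (event.data && event.data.size > 0) {
      onChunk(event.data);
    }
  };
  return recorder;
}

// In the browser this could be used roughly as:
//   const chunks = [];
//   const recorder = attachRecorder(stream, MediaRecorder, (c) => chunks.push(c));
//   recorder.start();
```

Creating the AudioContext with `sampleRate: WHISPER_SAMPLING_RATE`, as the excerpt above does, means that audio decoded through that context later comes out at the rate the model expects.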

3.4 Audio processing and recognition

Decode the recorded audio and send it to the Worker for recognition:

// [examples/webgpu-whisper/src/App.jsx](https://gitcode.com/GitHub_Trending/tr/transformers.js/blob/1538e3a1544a93ef323e41c4e3baef6332f4e557/examples/webgpu-whisper/src/App.jsx?utm_source=gitcode_repo_files#L171-L180)
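fileReader.onloadend = async () => {
  // Decode the recorded data; the AudioContext resamples it to its own sample rate
  const arrayBuffer = fileReader.result;
  const decoded = await audioContextRef.current.decodeAudioData(arrayBuffer);
  // Use the first channel as the mono input for the model
  let audio = decoded.getChannelData(0);
  if (audio.length > MAX_SAMPLES) {
    // Keep only the most recent MAX_SAMPLES samples
    audio = audio.slice(-MAX_SAMPLES);
  }
  worker.current.postMessage({ type: 'generate', data: { audio, language } });
};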
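
The excerpt relies on `WHISPER_SAMPLING_RATE` and `MAX_SAMPLES`, which are defined elsewhere in the file. The sketch below shows plausible definitions together with the truncation step as a standalone function. The concrete values are assumptions based on Whisper taking 16 kHz audio in windows of up to 30 seconds, and `keepLatestSamples` is a hypothetical helper name.

```javascript
// Assumed values: Whisper expects 16 kHz mono audio and works on
// windows of up to 30 seconds.
const WHISPER_SAMPLING_RATE = 16000; // samples per second
const MAX_AUDIO_LENGTH = 30; // seconds
const MAX_SAMPLES = WHISPER_SAMPLING_RATE * MAX_AUDIO_LENGTH; // 480000

// Returns the trailing maxSamples samples of the audio, or the audio
// itself when it is already short enough.
function keepLatestSamples(audio, maxSamples = MAX_SAMPLES) {
  return audio.length > maxSamples ? audio.slice(-maxSamples) : audio;
}

// Example: a 40-second recording is cut down to its last 30 seconds.
const recording = new Float32Array(40 * WHISPER_SAMPLING_RATE);
const windowed = keepLatestSamples(recording);
console.log(windowed.length / WHISPER_SAMPLING_RATE); // 30
```

Keeping the tail rather than the head of the buffer favors the most recent speech, which is what a live transcription view should show.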