Implementing Parallel Image Recognition Correctly with Tesseract.js

2025-05-03 02:13:44 Author: 戚魁泉Nursing

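When using Tesseract.js for OCR, many developers find that parallel processing does not deliver the expected speedup. This article analyzes the cause and presents ways to fix it.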

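Symptoms

When developers use the Tesseract.js Scheduler to process multiple images in parallel, they often find that execution is still serial. The symptoms are:

Multiple Workers are created
Images are nevertheless processed one at a time, in order
System resource utilization stays low
Total processing time does not drop noticeably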

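Root Cause

The core of the problem is that the code uses the await keyword to wait for each recognition job to finish before submitting the next one. The key problematic part of the example code: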

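for (let i = 0; i < imageArr.length; i++) {
  const imagePath = imageArr[i];
  // await pauses the loop until this job finishes, so only one job is ever queued
  const out = await scheduler.addJob('recognize', imagePath);
  // Further processing...
}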

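Although this code uses the scheduler, the await inside the loop means the scheduler never holds more than one job, so only one Worker is busy at a time. The sequence becomes:

Start the first recognition job
Wait for the first job to finish
Only then start the second job
And so on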
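
The difference between the two submission patterns can be observed without running OCR at all. The following is a minimal sketch that assumes a hypothetical stand-in, createFakeScheduler, which imitates only the addJob(action, payload) call used in this article and records the peak number of jobs running at once. It is not the Tesseract.js scheduler itself, and the file names are placeholders.

```javascript
// Hypothetical stand-in for a scheduler: runs at most `workerCount` jobs
// at once and records the peak number of jobs in flight.
function createFakeScheduler(workerCount, jobMs = 20) {
  let running = 0;
  let peak = 0;
  const queue = [];

  const next = () => {
    if (running >= workerCount || queue.length === 0) return;
    const { payload, resolve } = queue.shift();
    running += 1;
    peak = Math.max(peak, running);
    setTimeout(() => {
      running -= 1;
      resolve({ data: { text: `text of ${payload}` } });
      next(); // a worker became idle, so start the next queued job
    }, jobMs);
  };

  return {
    addJob(_action, payload) {
      return new Promise((resolve) => {
        queue.push({ payload, resolve });
        next();
      });
    },
    peakConcurrency: () => peak,
  };
}

// Pattern from the problematic loop: each await blocks the next submission.
async function runSerial(scheduler, images) {
  const out = [];
  for (const image of images) {
    out.push(await scheduler.addJob('recognize', image));
  }
  return out;
}

// All jobs are submitted first, then awaited together.
async function runParallel(scheduler, images) {
  return Promise.all(images.map((image) => scheduler.addJob('recognize', image)));
}

async function demo() {
  const images = ['a.png', 'b.png', 'c.png', 'd.png', 'e.png', 'f.png'];
  const serial = createFakeScheduler(4);
  await runSerial(serial, images);
  const parallel = createFakeScheduler(4);
  await runParallel(parallel, images);
  return {
    serialPeak: serial.peakConcurrency(),
    parallelPeak: parallel.peakConcurrency(),
  };
}

demo().then((r) => console.log(r)); // { serialPeak: 1, parallelPeak: 4 }
```

With four simulated workers, the awaited loop never has more than one job running, while submitting everything up front keeps all four busy.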

Correct Implementation

To get real parallelism, use one of the following approaches:

Approach 1: Run jobs in parallel with Promise.all

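// Requires Node's built-in path module, e.g. const path = require('path');
// Submit every job up front; the scheduler hands each one to the next idle Worker.
const recognitionPromises = imageArr.map(async (imagePath) => {
  const out = await scheduler.addJob('recognize', imagePath);
  return {
    imageName: path.basename(imagePath),
    words: out.data.words.map(word => ({
      text: word.text,
      confidence: word.confidence.toFixed(2), // note: toFixed returns a string
      bbox: word.bbox,
    }))
  };
});

// Wait for all jobs; results come back in the same order as imageArr.
const results = await Promise.all(recognitionPromises);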
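
One caveat with Promise.all is that it rejects as soon as any single job rejects, so one unreadable image can hide the results of all the others. The sketch below shows one possible variant using the standard Promise.allSettled. The function name recognizeAllSettled and the stubScheduler object are hypothetical and exist only for illustration; the function assumes nothing about the scheduler beyond the addJob(action, payload) call and the data.text result field.

```javascript
// Sketch: collect a per-image outcome so one failed job does not reject the rest.
// `scheduler` is anything exposing addJob(action, payload) -> Promise.
async function recognizeAllSettled(scheduler, imagePaths) {
  const settled = await Promise.allSettled(
    imagePaths.map((imagePath) => scheduler.addJob('recognize', imagePath))
  );
  return settled.map((outcome, i) =>
    outcome.status === 'fulfilled'
      ? { imagePath: imagePaths[i], ok: true, text: outcome.value.data.text }
      : { imagePath: imagePaths[i], ok: false, error: String(outcome.reason) }
  );
}

// Hypothetical stub used only to exercise the function without running OCR.
const stubScheduler = {
  addJob(_action, imagePath) {
    return imagePath.endsWith('.png')
      ? Promise.resolve({ data: { text: `text of ${imagePath}` } })
      : Promise.reject(new Error(`cannot read ${imagePath}`));
  },
};

recognizeAllSettled(stubScheduler, ['a.png', 'broken.txt'])
  .then((outcomes) => console.log(outcomes));
```

Whether to fail fast with Promise.all or to collect partial results this way depends on how the caller wants to treat a bad input file.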

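Approach 2: Limit the number of concurrent jobs

With a large number of images, you can limit how many are in flight at once to avoid exhausting resources: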

// Requires Node's built-in path module, e.g. const path = require('path');
const concurrentLimit = 5; // process 5 images at a time
const batches = Math.ceil(imageArr.length / concurrentLimit);
const results = []; // accumulates the output of every batch

for (let i = 0; i < batches; i++) {
  const batch = imageArr.slice(i * concurrentLimit, (i + 1) * concurrentLimit);
  const batchPromises = batch.map(imagePath =>
    scheduler.addJob('recognize', imagePath)
      .then(out => ({
        imageName: path.basename(imagePath),
        words: out.data.words.map(word => ({
          text: word.text,
          confidence: word.confidence.toFixed(2),
          bbox: word.bbox,
        }))
      }))
  );
  // Wait for the whole batch before starting the next one.
  const batchResults = await Promise.all(batchPromises);
  results.push(...batchResults);
}
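
Batching as above waits for the slowest image in each batch before starting the next batch. A possible alternative is a sliding window, where a fixed number of runners each pick up the next image as soon as they finish one. The following is a minimal sketch under that assumption; mapWithLimit and demoLimit are hypothetical names, and fakeRecognize merely simulates a recognition job with a timer.

```javascript
// Sketch of a sliding-window limiter: `limit` runners pull from a shared index,
// so a slow item never holds up the rest of a batch.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let nextIndex = 0;

  async function runner() {
    while (nextIndex < items.length) {
      const i = nextIndex++; // claimed synchronously, so runners never share an index
      results[i] = await task(items[i], i);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results; // same order as `items`
}

// Demo with a simulated job that tracks how many calls overlap.
async function demoLimit() {
  let running = 0;
  let peak = 0;
  const fakeRecognize = async (name) => {
    running += 1;
    peak = Math.max(peak, running);
    await new Promise((resolve) => setTimeout(resolve, 10));
    running -= 1;
    return name.toUpperCase();
  };
  const results = await mapWithLimit(['a', 'b', 'c', 'd', 'e'], 2, fakeRecognize);
  return { results, peak };
}

demoLimit().then((r) => console.log(r));
```

With the scheduler from this article, the call would look something like mapWithLimit(imageArr, 5, (imagePath) => scheduler.addJob('recognize', imagePath)), with the per-word mapping applied to each result afterwards.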