An Analysis of the Structured Document Extraction Feature in the Kreuzberg Project

2025-07-08 08:18:29 Author: 董宙帆

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via the CLI, REST API, or MCP server.

Project repository: https://gitcode.com/gh_mirrors/kr/kreuzberg

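Extracting information from documents has become a key step in many business processes. Kreuzberg's newly introduced structured document extraction feature gives developers an efficient, type-safe way to pull structured data out of many kinds of documents. This article walks through how the feature is implemented and the technical details behind it.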

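Feature Overview

Kreuzberg's structured extraction lets users define an explicit data model and then extract data conforming to that model directly from document content. It is particularly suited to invoice processing, legal document analysis, and contract parsing, where it can substantially reduce manual data entry.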

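Technical Architecture

Core Dependencies

The feature relies on three key libraries:

msgspec: a high-performance serialization library with fast validation of structured data
Pydantic v2: a popular data validation library with flexible data model definitions
LiteLLM: a unified interface for calling AI models, with support for a range of vision models

These dependencies are organized as an optional feature group, so users install them only when they need structured extraction.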
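
One common way to support an optional feature group is to check for the packages lazily and fail with a clear message when they are absent. The sketch below is illustrative only: the helper `require_modules` and the local `MissingDependencyError` class are hypothetical stand-ins, not Kreuzberg's actual API.

```python
import importlib.util


class MissingDependencyError(Exception):
    """Hypothetical stand-in for the error raised when an optional package is absent."""


def require_modules(feature: str, *modules: str) -> None:
    """Raise MissingDependencyError if any top-level module needed by `feature` is missing."""
    # find_spec returns None for a top-level module that cannot be located.
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    if missing:
        raise MissingDependencyError(
            f"{feature} requires missing packages: {', '.join(missing)}"
        )
```

A structured-extraction entry point could call something like `require_modules("structured extraction", "msgspec", "pydantic", "litellm")` before doing any work, so that users who never use the feature never pay for the imports.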

Configuration Extensions

The project extends the existing ExtractionConfig class with several new parameters for structured extraction:

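from dataclasses import dataclass, field
from typing import Any

import msgspec
from pydantic import BaseModel

@dataclass(unsafe_hash=True)
class ExtractionConfig:
    # Only the fields added for structured extraction are shown.
    # Note: hash() on an instance raises TypeError while the dict field is present.
    output_type: type[msgspec.Struct] | type[BaseModel] | None = None
    extraction_model: str | None = None
    extraction_model_config: dict[str, Any] = field(default_factory=dict)
    max_extraction_retries: int = 3
    include_error_in_retry: bool = True
    extraction_temperature: float = 0.1
    strict_validation: bool = True
    schema_in_prompt: bool = True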

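These options give users fine-grained control over the extraction process, including model selection, retry strategy, and validation strictness.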

Data Model Definition

Kreuzberg supports two mainstream ways to define a data model:

msgspec Struct

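from datetime import date
from typing import Optional

import msgspec

class LineItem(msgspec.Struct):
    description: str
    quantity: int
    unit_price: float
    total: float

# omit_defaults=True leaves default-valued fields out of the encoded output.
class Invoice(msgspec.Struct, omit_defaults=True):
    invoice_number: str
    date: date
    total: float
    line_items: list[LineItem]
    notes: Optional[str] = None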

msgspec offers high-performance validation, which makes it a good fit for performance-sensitive workloads.

Pydantic BaseModel

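class LineItemPydantic(BaseModel):  # Pydantic needs its own line-item model
    description: str
    quantity: int
    unit_price: float
    total: float

class InvoicePydantic(BaseModel):
    invoice_number: str
    date: date
    total: float
    line_items: list[LineItemPydantic]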

The Pydantic approach is a better fit for projects that need complex validation logic or that already use Pydantic.

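Extraction Flow

The core of structured extraction consists of the following steps:

Prompt construction: generate a prompt that embeds the structure of the data model
Model call: invoke the specified vision model through LiteLLM
Result validation: check that the model output conforms to the data model
Error retry: retry automatically when validation fails, optionally including error feedback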
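
The prompt-construction step can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Kreuzberg's implementation: `schema_for` and `build_prompt` are hypothetical helpers, the prompt wording is invented, and plain dataclasses stand in for the real models. With the real libraries, `msgspec.json.schema(Invoice)` or `InvoicePydantic.model_json_schema()` would produce the schema instead.

```python
import dataclasses
import json
import typing


@dataclasses.dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: float


@dataclasses.dataclass
class Invoice:
    invoice_number: str
    line_items: list[LineItem]


_PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def schema_for(tp: typing.Any) -> dict:
    """Build a small JSON-Schema-like description of a type (hypothetical helper)."""
    if tp in _PRIMITIVES:
        return {"type": _PRIMITIVES[tp]}
    if typing.get_origin(tp) is list:
        (item_type,) = typing.get_args(tp)
        return {"type": "array", "items": schema_for(item_type)}
    if dataclasses.is_dataclass(tp):
        hints = typing.get_type_hints(tp)
        return {
            "type": "object",
            "properties": {name: schema_for(hint) for name, hint in hints.items()},
            "required": list(hints),
        }
    raise TypeError(f"unsupported type: {tp!r}")


def build_prompt(output_type: type, schema_in_prompt: bool = True) -> str:
    """Compose the instruction text, optionally embedding the schema."""
    prompt = "Extract the document's data and reply with a single JSON object."
    if schema_in_prompt:
        prompt += "\nThe JSON must match this schema:\n"
        prompt += json.dumps(schema_for(output_type), indent=2)
    return prompt
```

Calling `build_prompt(Invoice)` yields the instruction followed by the nested schema, while `schema_in_prompt=False` mirrors the configuration flag of the same name and sends the bare instruction.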

Validation

Validation picks the validator that matches the type of data model in use:

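def validate_extraction(output: str, output_type: type) -> Any:
    # msgspec decodes and validates the JSON in a single pass.
    if issubclass(output_type, msgspec.Struct):
        return msgspec.json.decode(output, type=output_type, strict=True)
    # Pydantic v2 parses and validates the JSON string directly.
    if issubclass(output_type, BaseModel):
        return output_type.model_validate_json(output)
    # Without this, an unsupported type would silently return None.
    raise TypeError(f"unsupported output type: {output_type!r}")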

The msgspec validator reports a detailed error path, which helps with debugging and error handling.
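
To show what an error path looks like, here is a standard-library sketch that converts decoded JSON into dataclasses and reports where the first mismatch occurred. It imitates the `$.field[index]` path style that msgspec uses in its error messages, but the `convert` and `validate_json` helpers and the local exception class are assumptions made for illustration.

```python
import dataclasses
import json
import typing


class ExtractionValidationError(ValueError):
    """Local stand-in for the validation error named in the article."""


@dataclasses.dataclass
class LineItem:
    description: str
    quantity: int


@dataclasses.dataclass
class Invoice:
    invoice_number: str
    line_items: list[LineItem]


def convert(value: typing.Any, tp: typing.Any, path: str = "$") -> typing.Any:
    """Convert decoded JSON into `tp`, reporting the path of the first mismatch."""
    if dataclasses.is_dataclass(tp):
        if not isinstance(value, dict):
            raise ExtractionValidationError(f"expected object at {path}")
        kwargs = {}
        for name, hint in typing.get_type_hints(tp).items():
            if name not in value:
                raise ExtractionValidationError(f"missing field at {path}.{name}")
            kwargs[name] = convert(value[name], hint, f"{path}.{name}")
        return tp(**kwargs)
    if typing.get_origin(tp) is list:
        if not isinstance(value, list):
            raise ExtractionValidationError(f"expected array at {path}")
        (item_type,) = typing.get_args(tp)
        return [convert(v, item_type, f"{path}[{i}]") for i, v in enumerate(value)]
    # bool is a subclass of int in Python, so reject it explicitly for non-bool fields.
    if not isinstance(value, tp) or (tp is not bool and isinstance(value, bool)):
        raise ExtractionValidationError(
            f"expected {tp.__name__}, got {type(value).__name__} at {path}"
        )
    return value


def validate_json(output: str, output_type: type) -> typing.Any:
    """Parse the model output and validate it against `output_type`."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        raise ExtractionValidationError(f"invalid JSON: {exc}") from exc
    return convert(data, output_type)
```

A path such as `$.line_items[0].quantity` tells both a developer and, when fed back in a retry, the model exactly which field to correct.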

Retry Mechanism

When validation fails, the system retries automatically and can optionally feed the error back to the model:

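async def extract_with_retry(
    image: Image,
    output_type: type,
    max_retries: int = 3,
    include_error: bool = True,
):
    # Track the previous attempt so it can be fed back to the model.
    last_output = None
    last_error = None
    for attempt in range(max_retries):
        try:
            messages = build_extraction_messages(
                image=image,
                output_type=output_type,
                previous_attempt=last_output if include_error else None,
                previous_error=str(last_error) if include_error and last_error else None,
            )
            response = await litellm.acompletion(...)
            # LiteLLM returns OpenAI-style responses.
            last_output = response.choices[0].message.content
            # Assumes validation failures surface as ExtractionValidationError.
            return validate_extraction(last_output, output_type)
        except ExtractionValidationError as e:
            last_error = e
            if attempt == max_retries - 1:
                raise ExtractionError(...) from e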

This mechanism can noticeably improve the extraction success rate, especially on complex documents.
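
The retry-with-feedback loop can be exercised end to end without any model by substituting a fake. Everything in this sketch is hypothetical scaffolding: `fake_model` replaces the LiteLLM call, `validate` is a toy check, and the exception classes are local stand-ins for the ones named in this article.

```python
import asyncio
import json


class ExtractionValidationError(ValueError):
    """Local stand-in: the model output did not match the expected shape."""


class ExtractionError(RuntimeError):
    """Local stand-in: extraction failed after all retries."""


def validate(output: str) -> dict:
    """Toy validator: the output must be a JSON object with an integer 'total'."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        raise ExtractionValidationError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("total"), int):
        raise ExtractionValidationError("field 'total' must be an integer")
    return data


async def fake_model(messages: list[str]) -> str:
    """Hypothetical model: answers correctly only after seeing error feedback."""
    if any("Previous error" in m for m in messages):
        return '{"total": 42}'
    return '{"total": "forty-two"}'


async def extract_with_retry(max_retries: int = 3, include_error: bool = True) -> dict:
    last_output = None
    last_error = None
    for _ in range(max_retries):
        messages = ["Extract the invoice total as JSON."]
        # Feed the failed attempt and its error back to the model, if enabled.
        if include_error and last_error is not None:
            messages.append(f"Previous output: {last_output}")
            messages.append(f"Previous error: {last_error}")
        last_output = await fake_model(messages)
        try:
            return validate(last_output)
        except ExtractionValidationError as exc:
            last_error = exc
    raise ExtractionError(f"gave up after {max_retries} attempts: {last_error}")
```

With feedback enabled, `asyncio.run(extract_with_retry())` succeeds on the second attempt. With `include_error=False` this fake model never sees its mistake, so all attempts fail and `ExtractionError` is raised, which illustrates why the feedback option exists.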

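Error Handling Design

Kreuzberg defines a dedicated set of error types:

ExtractionError: base class for all extraction errors
ExtractionValidationError: raised when validation fails
MissingDependencyError: raised when a required dependency is missing

Every error carries rich context information to help with debugging and locating problems.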
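
A hierarchy like this, with context attached to each error, might look roughly as follows. The `context` attribute, the constructor signature, and the choice to derive `MissingDependencyError` from `ExtractionError` are assumptions for illustration; the article does not show the actual class definitions.

```python
from typing import Any, Optional


class ExtractionError(Exception):
    """Base class; carries a context dict alongside the message (assumed design)."""

    def __init__(self, message: str, context: Optional[dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.context = context or {}

    def __str__(self) -> str:
        base = super().__str__()
        # Append the context only when there is something to show.
        return f"{base} (context: {self.context})" if self.context else base


class ExtractionValidationError(ExtractionError):
    """The model output failed schema validation."""


class MissingDependencyError(ExtractionError):
    """A required optional package is not installed."""
```

Callers can then catch `ExtractionError` to handle every failure mode at once, or a subclass to react to one specifically, and inspect `err.context` for details such as the attempt number or the offending output.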

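Performance Optimizations

The project includes several performance-oriented choices:

Efficient msgspec validation: several times faster than conventional JSON parsing
omit_defaults option: leaves default-valued fields out of the encoded output, reducing unnecessary data transfer
strict_validation control: avoids processing irrelevant fields
Type hints throughout: better IDE support and more robust code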
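
The effect of omitting defaults is easy to see in a standard-library sketch. With msgspec the behavior comes from declaring the struct with `omit_defaults=True`; the helper `encode_omitting_defaults` below is a hypothetical dataclass-based imitation, included only to make the idea concrete.

```python
import dataclasses
import json
from typing import Any, Optional


@dataclasses.dataclass
class Invoice:
    invoice_number: str
    total: float
    notes: Optional[str] = None


def encode_omitting_defaults(obj: Any) -> str:
    """Serialize a dataclass to JSON, skipping fields still at their default value."""
    payload = {}
    for f in dataclasses.fields(obj):
        value = getattr(obj, f.name)
        # Fields without a default are always written.
        if f.default is not dataclasses.MISSING and value == f.default:
            continue
        payload[f.name] = value
    return json.dumps(payload)
```

An invoice without notes is encoded with only `invoice_number` and `total`, so the unset `notes` field never appears in the payload.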

Usage Example

Once the data model is defined, usage is straightforward:

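import asyncio

from kreuzberg import ExtractionConfig, extract_file

async def main() -> None:
    config = ExtractionConfig(
        output_type=Invoice,
        extraction_model="gpt-4-vision-preview",
        max_extraction_retries=3,
    )
    result = await extract_file("invoice.pdf", config=config)
    invoice = result.structured_data  # type-safe access
    print(invoice)

asyncio.run(main())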