解决Pandas-AI项目中Bedrock Claude模型的JSON解析问题

2025-05-11 16:14:03作者：平淮齐Percy

问题背景

在Pandas-AI项目中使用Bedrock Claude模型时，开发者遇到了一个常见的技术问题：当模型生成包含额外文本的响应时，会导致JSON解析失败并抛出InvalidLLMOutputType("Response validation failed!")错误。这个问题特别容易出现在需要模型返回结构化数据（如JSON数组）的场景中。

问题分析

Bedrock Claude模型在响应时，有时会在JSON数据前后添加解释性文本或格式化标记。例如，一个典型的响应可能如下：

Based on the query "what was the max and min", here are some potential clarification questions a senior data scientist might ask:

[
  "QuestionA?",
  "QuestionB?"
]

这种响应格式虽然对人类阅读友好，但直接进行JSON解析时会失败，因为：

包含非JSON格式的前导文本
可能包含Markdown代码块标记(json和)
整体不符合严格的JSON格式要求

解决方案

1. 响应预处理

在验证方法中添加预处理步骤，去除无关文本和标记：

def validate(self, output) -> bool:
    try:
        # 移除Markdown代码块标记
        output = output.replace("```json", "").replace("```", "")
        # 提取JSON部分（假设JSON在最后）
        json_start = output.find('[')
        if json_start != -1:
            output = output[json_start:]
        json_data = json.loads(output)
        return isinstance(json_data, list)
    except json.JSONDecodeError:
        return False

2. 模型参数优化

通过调整模型调用参数，可以引导模型生成更规范的JSON响应：

params = {
    "anthropic_version": "bedrock-2023-05-31",
    "system": "你是一个JSON生成器，请直接输出有效的JSON数组，不要包含任何解释性文字或标记。",
    "messages": messages,
    "response_format": {"type": "json_object"}
}

3. 完整的BedrockClaude类实现

以下是经过优化的完整实现，解决了合并冲突并增强了健壮性：

from __future__ import annotations
import json
from typing import TYPE_CHECKING, Any, Dict, Optional
from ..exceptions import APIKeyNotFoundError, UnsupportedModelError
from ..helpers import load_dotenv
from ..prompts.base import BasePrompt
from .base import LLM

load_dotenv()

class BedrockClaude(LLM):
    """Bedrock Claude LLM实现"""
    
    _supported_models = [
        "anthropic.claude-3-opus-20240229-v1:0",
        "anthropic.claude-3-5-sonnet-20240620-v1:0",
        "anthropic.claude-3-sonnet-20240229-v1:0",
        "anthropic.claude-3-haiku-20240307-v1:0",
    ]
    
    def __init__(self, bedrock_runtime_client, **kwargs):
        # 初始化代码...
        
    def call(self, instruction: BasePrompt, context=None) -> str:
        # 构建请求参数...
        response = self.client.invoke_model(modelId=self.model, body=body)
        response_body = json.loads(response.get("body").read())
        
        # 响应后处理
        raw_output = response_body["content"][0]["text"]
        return self._clean_json_output(raw_output)
    
    def _clean_json_output(self, raw_output: str) -> str:
        """清理模型输出中的非JSON内容"""
        # 实现清理逻辑...
        return cleaned_json