Diagnosing and Fixing Five Major System Failures: Claude Code Router Operations in Practice

2026-03-10 04:21:04 Author: 咎岭娴Homer

Use Claude Code as the foundation for coding infrastructure, allowing you to decide how to interact with the model while enjoying updates from Anthropic.

Project URL: https://gitcode.com/GitHub_Trending/cl/claude-code-router

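Claude Code Router is a tool that routes Claude Code requests to other LLM providers, so users can use Claude Code's features without an Anthropic account. Failures of various kinds can occur in day-to-day operation. This article takes a systematic approach to help you locate problems quickly, analyze root causes, apply fixes, and put effective prevention in place.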

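System health diagnosis flowchart

flowchart TD
    A[System anomaly] --> B{Initial checks}
    B -->|Service status| C[Check that the service is running]
    B -->|Log analysis| D[Review error logs]
    B -->|Config validation| E[Check the config file]
    
    C --> F{Service state}
    F -->|Running| G[Monitor performance metrics]
    F -->|Not running| H[Handle startup failure]
    
    D --> I{Error type}
    I -->|API error| J[Check API connectivity]
    I -->|Config error| K[Validate config keys]
    I -->|Routing error| L[Debug routing logic]
    
    E --> M{Config state}
    M -->|Valid| N[Advanced troubleshooting]
    M -->|Invalid| O[Config repair procedure]
    
    G & J & K & L & N & O --> P[Problem resolved]
    P --> Q[System recovered]
    Q --> R[Apply preventive measures]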

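Basic troubleshooting and fixes

1. Service startup failure: from failing to start to running normally

Symptom: after running ccr start, the command hangs or exits immediately and the service does not come up.

Root cause analysis: common causes are a port conflict, insufficient permissions, missing dependencies, or a corrupted configuration file.

Quick diagnostic commands:

# Check service status (assumes CCR is managed as a systemd unit)
systemctl status claude-code-router

# See what is using the port
ss -tulpn | grep 3456

# Check the most recent error logs
journalctl -u claude-code-router -n 50 --no-pager

# Validate the config file syntax
jq empty ~/.claude-code-router/config.json

# Start manually and watch the output
ccr start --debug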

Solution:

A Python script that resolves port conflicts automatically:

import subprocess
import time

def find_and_kill_port(port):
    """Find and terminate every process using the given port."""
    try:
        # lsof -t prints one PID per line; several processes may share a port
        result = subprocess.run(
            ["lsof", "-t", f"-i:{port}"],
            capture_output=True,
            text=True
        )
        pids = result.stdout.split()

        if not pids:
            print(f"Port {port} is not in use")
            return False

        print(f"Processes using port {port}: {', '.join(pids)}")

        # Force-terminate them (SIGKILL gives a process no chance to clean up)
        subprocess.run(["kill", "-9", *pids], check=True)
        print(f"Terminated: {', '.join(pids)}")
        return True
    except Exception as e:
        print(f"Error while resolving the port conflict: {e}")
        return False

def start_ccr_with_fallback_port(default_port=3456):
    """Try to start CCR on the default port; fall back to the next port on failure."""
    if find_and_kill_port(default_port):
        # Give the system a moment to release the port
        time.sleep(2)

    # Try to start the service
    try:
        subprocess.run(
            ["ccr", "start", "--port", str(default_port)],
            check=True
        )
        print(f"CCR started on port {default_port}")
        return default_port
    except subprocess.CalledProcessError:
        # Startup failed; try the fallback port
        fallback_port = default_port + 1
        print(f"Failed to start on port {default_port}, trying fallback port {fallback_port}")
        subprocess.run(
            ["ccr", "start", "--port", str(fallback_port)],
            check=True
        )
        print(f"CCR started on port {fallback_port}")
        return fallback_port

if __name__ == "__main__":
    start_ccr_with_fallback_port()
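Killing whatever holds the port is a blunt instrument. As a gentler first step, the sketch below only checks whether something is listening and looks for a free port nearby. It uses the standard socket module; the function names and the 10-port search window are illustrative choices, not part of CCR.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. a listener exists
        return sock.connect_ex((host, port)) == 0


def first_free_port(start: int = 3456, attempts: int = 10) -> int:
    """Return the first port in [start, start + attempts) with no listener."""
    for port in range(start, start + attempts):
        if not is_port_in_use(port):
            return port
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")
```

Calling first_free_port() before starting the service lets you pick a port without terminating an unrelated process that happens to use 3456.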

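2. API call failures: from failed connections to stable communication

Symptom: the service is running but cannot reach the LLM provider's API; requests time out or fail authentication.

Root cause analysis: network connectivity problems, an invalid API key, a misconfigured proxy, or a malformed request.

Quick diagnostic commands:

# Test network connectivity
curl -I -m 5 https://api.openai.com/v1/chat/completions

# Check environment variables
printenv | grep -i "OPENAI\|DEEPSEEK\|API"

# Watch the API call log
tail -f ~/.claude-code-router/logs/ccr-api.log | grep -i "error\|fail"

# Verify the proxy settings
curl -x http://your-proxy:port https://api.openai.com/v1/models

# Test whether the API key is valid (listing models is a GET request; 200 means the key was accepted)
python -c "import os, requests; print(requests.get('https://api.openai.com/v1/models', headers={'Authorization': 'Bearer ' + os.environ['OPENAI_API_KEY']}).status_code)"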

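Solution:

A script that checks and repairs the API configuration:

import json
import os
import requests
from pathlib import Path
from urllib.parse import urlsplit

def validate_api_key(provider, api_key, base_url):
    """Return True if the provider accepts the API key."""
    headers = {"Authorization": f"Bearer {api_key}"}
    
    # Validation endpoint per provider. All entries assume an OpenAI-compatible
    # /v1/models route with Bearer auth; adjust for providers that differ.
    endpoints = {
        "openai": "/v1/models",
        "deepseek": "/v1/models",
        "anthropic": "/v1/models",
        "gemini": "/v1/models"
    }
    
    endpoint = endpoints.get(provider, "/v1/models")
    # api_base_url may be a full path such as .../v1/chat/completions,
    # so keep only the scheme and host before appending the endpoint
    parts = urlsplit(base_url)
    url = f"{parts.scheme}://{parts.netloc}{endpoint}"
    try:
        response = requests.get(url, headers=headers, timeout=10)
        # 200 means the key was accepted; 401/403 mean it was rejected
        return response.status_code == 200
    except requests.RequestException as e:
        print(f"API connection test failed: {e}")
        return False

def fix_api_config(config_path):
    """Check the API configuration and repair what can be repaired."""
    config_path = Path(config_path)
    
    if not config_path.exists():
        print(f"Config file not found: {config_path}")
        return False
    
    with open(config_path, 'r') as f:
        try:
            config = json.load(f)
        except json.JSONDecodeError as e:
            print(f"Config file is not valid JSON: {e}")
            return False
    
    changed = False   # True only when the config was actually modified
    issues = False    # True when any problem was reported
    
    # Make sure the Providers section exists
    if "Providers" not in config:
        print("Providers section is missing; adding an empty list")
        config["Providers"] = []
        changed = issues = True
    
    for provider in config["Providers"]:
        name = provider.get("name", "unknown")
        # Check required fields
        required_fields = ["name", "api_base_url", "api_key", "models"]
        for field in required_fields:
            if field not in provider:
                print(f"Provider {name} is missing required field: {field}")
                issues = True
                env_var = f"{name.upper()}_API_KEY"
                if field == "api_key" and env_var in os.environ:
                    # Reference the environment variable instead of storing the key
                    provider[field] = f"${env_var}"
                    changed = True
                    print(f"Set {field} to reference ${env_var}")
        
        # Validate keys that reference an environment variable
        api_key_ref = provider.get("api_key", "")
        if api_key_ref.startswith("$"):
            env_var = api_key_ref[1:]
            if env_var not in os.environ:
                print(f"Environment variable {env_var} is not set")
                issues = True
            elif "api_base_url" in provider and not validate_api_key(
                    name, os.environ[env_var], provider["api_base_url"]):
                print(f"API key for provider {name} was rejected")
                issues = True
    
    if changed:
        with open(config_path, 'w') as f:
            json.dump(config, f, indent=2)
        print("Config file updated; restart the service")
    elif not issues:
        print("API configuration check passed")
    return True

if __name__ == "__main__":
    fix_api_config(os.path.expanduser("~/.claude-code-router/config.json"))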
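When a provider does answer, the HTTP status code usually narrows the root cause down quickly. The sketch below maps status codes to the causes listed in this section. The mapping follows general HTTP conventions and is an assumption; individual providers may use some codes differently, and classify_api_failure is an illustrative name.

```python
def classify_api_failure(status_code: int) -> str:
    """Map an HTTP status code from a provider to a likely cause."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 401:
        return "invalid or missing API key"
    if status_code == 403:
        return "key accepted but not permitted for this resource"
    if status_code == 404:
        return "wrong endpoint path or unknown model"
    if status_code == 407:
        return "proxy requires authentication"
    if status_code == 429:
        return "rate limit or quota exceeded"
    if 400 <= status_code < 500:
        return "malformed request"
    if 500 <= status_code < 600:
        return "provider-side error"
    return "unexpected status"
```

A timeout or connection error never reaches this function, which by itself points at the network or proxy rather than the key.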

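Advanced troubleshooting and fixes

1. Configuration parsing errors: from a messy configuration to an orderly one

Symptom: the service behaves abnormally after startup and the logs show configuration-related errors.

Root cause analysis: JSON syntax errors, missing configuration keys, bad environment variable references, or path permission problems.

Quick diagnostic commands:

# Validate JSON syntax
cat ~/.claude-code-router/config.json | jq .

# List environment variable references
grep -oE '\$\w+' ~/.claude-code-router/config.json | sort -u

# Verify permissions along the path
namei -l ~/.claude-code-router/config.json

# Compute the config file checksum (compare it against a known-good copy)
md5sum ~/.claude-code-router/config.json

# Diff the config against the template
diff ~/.claude-code-router/config.json ~/.claude-code-router/config.example.json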

Solution:

A config file validation tool:

import json
import os
import re
from pathlib import Path

class ConfigValidator:
    def __init__(self, config_path):
        self.config_path = Path(config_path)
        self.config = None
        self.errors = []
        
    def load_config(self):
        """Load the config file."""
        try:
            with open(self.config_path, 'r') as f:
                self.config = json.load(f)
            return True
        except json.JSONDecodeError as e:
            self.errors.append(f"JSON syntax error: {e}")
            return False
        except Exception as e:
            self.errors.append(f"Failed to load config: {e}")
            return False
    
    def validate_env_vars(self):
        """Check that every $VAR reference points at a set environment variable."""
        if not self.config:
            return False
            
        config_str = json.dumps(self.config)
        # Deduplicate so each missing variable is reported once
        env_vars = sorted(set(re.findall(r'\$\w+', config_str)))
        
        for var in env_vars:
            var_name = var[1:]  # strip the leading $
            if var_name not in os.environ:
                self.errors.append(f"Environment variable not set: {var_name}")
        
        return len(self.errors) == 0
    
    def validate_required_fields(self):
        """Check that required config keys are present."""
        if not self.config:
            return False
            
        required_fields = ["Providers", "Router"]
        for field in required_fields:
            if field not in self.config:
                self.errors.append(f"Missing required config key: {field}")
        
        # Validate the Providers entries
        if "Providers" in self.config:
            for provider in self.config["Providers"]:
                if not isinstance(provider, dict):
                    self.errors.append("Each item in Providers must be an object")
                    continue
                    
                provider_required = ["name", "api_base_url", "models"]
                for pr in provider_required:
                    if pr not in provider:
                        self.errors.append(f"Provider entry is missing required key: {pr}")
        
        return len(self.errors) == 0
    
    def validate_paths(self):
        """Check that paths referenced in the config exist."""
        if not self.config:
            return False
            
        # Check the custom router file path
        if "Router" in self.config and "custom_router_path" in self.config["Router"]:
            path = Path(self.config["Router"]["custom_router_path"])
            if not path.exists():
                self.errors.append(f"Custom router file does not exist: {path}")
            elif not path.is_file():
                self.errors.append(f"Custom router path is not a file: {path}")
                
        return len(self.errors) == 0
    
    def validate_all(self):
        """Run every validation and collect all errors."""
        if not self.load_config():
            return False
            
        self.validate_env_vars()
        self.validate_required_fields()
        self.validate_paths()
        
        return len(self.errors) == 0
    
    def print_errors(self):
        """Print all collected errors."""
        if not self.errors:
            print("Config validation passed")
            return
            
        print(f"Found {len(self.errors)} config problem(s):")
        for i, error in enumerate(self.errors, 1):
            print(f"{i}. {error}")

if __name__ == "__main__":
    validator = ConfigValidator(os.path.expanduser("~/.claude-code-router/config.json"))
    if validator.validate_all():
        print("Config file validation passed")
    else:
        validator.print_errors()
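For the JSON syntax errors named in the root cause analysis, a bare error message is often less helpful than seeing the offending line. The following sketch uses the lineno, colno and msg attributes of json.JSONDecodeError to print the surrounding lines; describe_json_error is an illustrative helper, not part of the validator above.

```python
import json


def describe_json_error(text: str, context: int = 1) -> str:
    """Return a readable report for the first JSON syntax error, or '' if valid."""
    try:
        json.loads(text)
        return ""
    except json.JSONDecodeError as err:
        lines = text.splitlines()
        first = max(err.lineno - 1 - context, 0)
        last = min(err.lineno + context, len(lines))
        report = [f"{err.msg} at line {err.lineno}, column {err.colno}"]
        for number in range(first, last):
            # Mark the line the parser stopped on
            marker = ">>" if number + 1 == err.lineno else "  "
            report.append(f"{marker} {number + 1}: {lines[number]}")
        return "\n".join(report)
```

Note that the parser reports where it gave up, which for a missing comma is the line after the one that needs fixing.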

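2. Routing logic failures: from broken routing to intelligent dispatch

Symptom: requests are not routed to the expected model, or routing rules have no effect.

Root cause analysis: incorrectly defined routing rules, a faulty custom router script, failed model availability checks, or conflicting priority settings.

Quick diagnostic commands:

# Enable debug logging
export LOG_LEVEL=debug && ccr restart

# Test a routing rule
curl -X POST http://localhost:3456/v1/debug/route \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "test routing"}]}'

# Syntax-check the custom router script
node -c ~/.claude-code-router/custom-router.js

# View routing decisions in the log
grep -A 10 "Router decision" ~/.claude-code-router/logs/ccr-router.log

# Verify model availability
curl -X GET http://localhost:3456/v1/models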

Solution:

A routing rule test tool:

import requests
import argparse

def test_route(config, model, content):
    """Test the routing decision for a given model and message content."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": content}]
    }
    
    try:
        response = requests.post(
            f"http://localhost:{config.get('port', 3456)}/v1/debug/route",
            headers={"Content-Type": "application/json"},
            json=payload,
            timeout=10
        )
        
        if response.status_code == 200:
            result = response.json()
            print("Routing test result:")
            print(f"Requested model: {model}")
            print(f"Routed to: {result.get('provider')}/{result.get('model')}")
            print(f"Reason: {result.get('reason')}")
            print(f"Alternatives: {result.get('alternatives', 'none')}")
            return result
        else:
            print(f"Routing test failed: HTTP {response.status_code}")
            print(response.text)
            return None
    except Exception as e:
        print(f"Error during routing test: {e}")
        return None

def batch_test_routes(config, test_cases):
    """Run several routing scenarios in a batch."""
    print(f"Starting batch routing test with {len(test_cases)} test case(s)\n")
    
    results = []
    for i, case in enumerate(test_cases, 1):
        print(f"Test case {i}/{len(test_cases)}:")
        print(f"Model: {case['model']}, content preview: {case['content'][:50]}...")
        
        result = test_route(config, case['model'], case['content'])
        results.append({
            "test_case": case,
            "result": result
        })
        
        print("-" * 50)
    
    # Summarize the run
    success_count = sum(1 for r in results if r['result'] is not None)
    print(f"\nDone: {success_count}/{len(test_cases)} succeeded")
    
    # List the failed test cases
    failed = [i+1 for i, r in enumerate(results) if r['result'] is None]
    if failed:
        print(f"Failed test cases: {', '.join(map(str, failed))}")
    
    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Claude Code Router routing test tool")
    parser.add_argument("--port", type=int, default=3456, help="CCR service port")
    args = parser.parse_args()
    
    # Test cases
    test_cases = [
        {
            "model": "gpt-4",
            "content": "Write a Python function that processes JSON data"
        },
        {
            "model": "claude-3-opus",
            "content": "Analyze the performance bottlenecks in this code and suggest optimizations"
        },
        {
            "model": "codellama",
            "content": "Explain how this C++ template metaprogramming implementation works"
        },
        {
            "model": "unknown-model",
            "content": "This is a test request for an unknown model"
        }
    ]
    
    batch_test_routes({"port": args.port}, test_cases)
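Incorrectly defined routing rules can often be caught offline, before any request is sent. The sketch below cross-checks Router entries against the Providers list. It assumes route targets are written as "provider,model" strings and that each provider lists its models under "models"; treat both as assumptions about the config layout and adapt them to your actual file. find_unresolved_routes is an illustrative name.

```python
def find_unresolved_routes(config: dict) -> list[str]:
    """Return messages for Router entries that point at unknown providers or models."""
    models_by_provider = {
        p.get("name"): set(p.get("models", []))
        for p in config.get("Providers", [])
        if isinstance(p, dict)
    }
    problems = []
    for rule, target in config.get("Router", {}).items():
        # Assumption: route targets are "provider,model" strings; skip other values
        if not isinstance(target, str) or "," not in target:
            continue
        provider, model = (part.strip() for part in target.split(",", 1))
        if provider not in models_by_provider:
            problems.append(f"{rule}: unknown provider '{provider}'")
        elif model not in models_by_provider[provider]:
            problems.append(f"{rule}: provider '{provider}' does not list model '{model}'")
    return problems
```

An empty result means every rule resolves to a configured provider and model; it does not prove the provider will actually serve that model.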

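Failure case studies

Case 1: Tracing the root cause of frequent crashes in production

Symptom: under high load, the Claude Code Router service crashed every 2-3 hours, and the logs showed "JavaScript heap out of memory".

Investigation:

Locating the problem:
- Checked the system logs to confirm the OOM (out of memory) error
- Monitored memory with pm2 monit and saw usage growing steadily
- Reviewed recent code changes and dependency updates
Root cause analysis:
- Started the service with --inspect and profiled memory in Chrome DevTools
- Found a memory leak caused by a closure in the request handler
- Confirmed the problem began after a new transformer module was introduced
Fix:
- Fixed the closure in the transformer module to avoid circular references
- Added automatic resource cleanup after each request
- Added memory monitoring and an automatic restart policy
Prevention:
- Add automated tests that detect memory leaks
- Run code reviews with particular attention to memory management
- Configure per-process memory threshold alerts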
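The "memory growing steadily" observation can be turned into a simple automated check. The sketch below is one possible heuristic, not the detection used in this incident: given periodic memory samples in MB, it flags a suspected leak when nearly every interval rises and the total growth is meaningful. The 80% ratio and the 50 MB floor are arbitrary assumptions to tune for your workload.

```python
def is_memory_growing(samples_mb: list[float], min_samples: int = 6,
                      min_growth_mb: float = 50.0) -> bool:
    """Heuristic: True if memory rose in nearly every interval and by a meaningful total."""
    if len(samples_mb) < min_samples:
        return False
    deltas = [b - a for a, b in zip(samples_mb, samples_mb[1:])]
    rising = sum(1 for d in deltas if d > 0)
    total_growth = samples_mb[-1] - samples_mb[0]
    # Require at least 80% of intervals to rise, so GC sawtooth patterns are ignored
    return rising >= 0.8 * len(deltas) and total_growth >= min_growth_mb
```

Feeding it samples taken every few minutes gives an early warning well before the heap limit is reached.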

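Key code fix:

// Before: transformer code with the memory problem
// (hash() is assumed to be a helper defined elsewhere)
function createTransformer(config) {
  // The closure keeps a deep copy of the entire config alive
  const largeConfig = JSON.parse(JSON.stringify(config));
  
  return function transformer(input) {
    // Every call creates a new callback and re-hashes the full config for each item
    return input.map(item => {
      return {
        ...item,
        processed: true,
        timestamp: Date.now(),
        configHash: hash(largeConfig)
      };
    });
  };
}

// After: reduced memory footprint
function createTransformer(config) {
  // Keep only the config fields that are actually needed
  const { essentialConfig1, essentialConfig2 } = config;
  // Compute the hash once, when the transformer is created
  const configHash = hash({ essentialConfig1, essentialConfig2 });
  
  // Define the per-item function once instead of on every call
  function processItem(item) {
    return {
      ...item,
      processed: true,
      timestamp: Date.now(),
      configHash: configHash
    };
  }
  
  return function transformer(input) {
    // Wrap the call so map's index and array arguments are not passed through
    return input.map(item => processItem(item));
  };
}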

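Case 2: Service outage caused by API key rotation

Symptom: after the API key was rotated, every request returned 401 Unauthorized, even though the key had been verified in the test environment.

Investigation:

Locating the problem:
- Checked the application logs to confirm the authentication failures
- Compared the old and new key formats and found that the new key contained special characters
- Checked how the key was referenced in the config file
Root cause analysis:
- The new key contained a dollar sign ($), which was misinterpreted in the shell environment
- The config file stored the key value directly instead of referencing an environment variable
- Special characters in the key were not escaped correctly
Fix:
- Store the key in an environment variable and reference that variable from the config file
- Implement a key rotation script that handles special characters correctly
- Add a step that validates the key before it is put into use
Prevention:
- Establish a key management policy that mandates environment variables
- Run automated tests before each key rotation
- Build a monitoring panel for key health checks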
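A pre-validation step for this class of problem can be as small as scanning the key for characters that a shell, a sed replacement, or "$VAR"-style config interpolation might reinterpret. The sketch below does that with the standard shlex module; the RISKY_CHARS set is an illustrative choice rather than an exhaustive list.

```python
import shlex

# Characters that a shell, a sed replacement, or "$VAR" interpolation may reinterpret
RISKY_CHARS = set("$`\\\"'&|/ !")


def risky_characters(secret: str) -> list[str]:
    """Return the sorted distinct characters in the secret that need careful quoting."""
    return sorted(set(secret) & RISKY_CHARS)


def shell_safe(secret: str) -> str:
    """Quote the secret so a POSIX shell passes it through unchanged."""
    return shlex.quote(secret)
```

For the key in this case, risky_characters("sk-abc$def123456") returns ["$"], which would have flagged the rotation before it reached production.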

Key configuration fix:

// Before: the key is stored directly in the config (insecure and error-prone); note the special character $ in the key
{
  "Providers": [
    {
      "name": "openai",
      "api_base_url": "https://api.openai.com/v1/chat/completions",
      "api_key": "sk-abc$def123456",
      "models": ["gpt-4", "gpt-3.5-turbo"]
    }
  ]
}

// After: the config references an environment variable
{
  "Providers": [
    {
      "name": "openai",
      "api_base_url": "https://api.openai.com/v1/chat/completions",
      "api_key": "$OPENAI_API_KEY",
      "models": ["gpt-4", "gpt-3.5-turbo"]
    }
  ]
}

Key rotation script:

#!/bin/bash
# Safe API key rotation script

set -euo pipefail

# Configuration
PROVIDER_NAME="openai"
SERVICE_NAME="claude-code-router"
CONFIG_PATH="$HOME/.claude-code-router/config.json"
ENV_FILE="/etc/environment"
ENV_VAR="${PROVIDER_NAME^^}_API_KEY"

# 1. Validate the new key
# -r keeps backslashes literal; -s hides the key while it is typed
read -rsp "Enter the new $PROVIDER_NAME API key: " NEW_API_KEY
echo

echo "Validating the new key..."
if ! curl -s -o /dev/null -w "%{http_code}" \
  -H "Authorization: Bearer $NEW_API_KEY" \
  "https://api.openai.com/v1/models" | grep -q "200"; then
  echo "Error: the new key failed validation"
  exit 1
fi

# 2. Update the environment variable
echo "Updating the environment variable..."
# Escape characters that are special in a sed replacement: \, & and the | delimiter
ESCAPED_KEY=$(printf '%s' "$NEW_API_KEY" | sed 's/[\\&|]/\\&/g')
if grep -q "^${ENV_VAR}=" "$ENV_FILE"; then
  sudo sed -i "s|^${ENV_VAR}=.*|${ENV_VAR}=${ESCAPED_KEY}|" "$ENV_FILE"
else
  # Discard tee's stdout so the key is not echoed to the terminal
  printf '%s=%s\n' "$ENV_VAR" "$NEW_API_KEY" | sudo tee -a "$ENV_FILE" > /dev/null
fi

# 3. Restart the service
echo "Restarting the service..."
sudo systemctl restart "$SERVICE_NAME"

# 4. Verify the service state
echo "Verifying the service state..."
if ! systemctl is-active --quiet "$SERVICE_NAME"; then
  echo "Error: the service failed to restart"
  exit 1
fi

# 5. Test the API connection
echo "Testing the API connection..."
if ! curl -s -o /dev/null -w "%{http_code}" \
  -X POST "http://localhost:3456/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "test"}]}' | grep -q "200"; then
  echo "Warning: the API test failed, but the service has been restarted"
  exit 1
fi

echo "API key rotation completed successfully"

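System status monitoring dashboard

To get a complete view of how Claude Code Router is running, build a monitoring dashboard that covers the following metrics:

Category	Key metric	Normal range	Alert threshold	Risk level
Service health	Service status	Running	Not Running	High
	Response time	<500ms	>2000ms	Medium
	Error rate	<1%	>5%	High
Resource usage	Memory usage	<500MB	>1GB	Medium
	CPU usage	<30%	>80%	Medium
	Disk space	>2GB free	<500MB free	High
API performance	API success rate	>99%	<95%	High
	Average API response time	<2s	>5s	Medium
	API timeout rate	<0.1%	>1%	Medium
Routing performance	Routing decision time	<100ms	>500ms	Low
	Routing failure rate	<0.1%	>1%	Medium
	Fallback route trigger rate	<1%	>5%	Low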
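The thresholds in the table translate directly into a small lookup that a monitoring job could evaluate. The sketch below encodes four of the rows. The table does not say what to call values that fall between the normal range and the alert threshold, so the intermediate "warning" state is an assumption, as are the metric keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    normal: float                 # boundary of the normal range
    alert: float                  # boundary of the alert range
    higher_is_worse: bool = True  # False for metrics such as success rate


# A few rows from the table, in the units used there
THRESHOLDS = {
    "response_time_ms": Threshold(500, 2000),
    "error_rate_pct": Threshold(1, 5),
    "memory_mb": Threshold(500, 1024),
    "api_success_rate_pct": Threshold(99, 95, higher_is_worse=False),
}


def classify(metric: str, value: float) -> str:
    """Return 'normal', 'warning' (between the two bounds) or 'alert'."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.alert:
            return "alert"
        return "normal" if value < t.normal else "warning"
    if value < t.alert:
        return "alert"
    return "normal" if value > t.normal else "warning"
```

Keeping the thresholds in one table-like structure makes it easy to keep the dashboard and the alerting rules in agreement.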

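Failure prevention

Routine maintenance checklist

Daily checks
- Verify service status
- Review error logs
- Test API connectivity
Weekly maintenance
- Back up the config file
- Check for dependency updates
- Analyze trends in performance metrics
Monthly maintenance
- Archive all logs
- Apply system security updates
- Run a full performance review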
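The weekly config backup in the checklist is easy to automate. The sketch below copies the config into a backup directory under a timestamped name and prunes older copies. The naming scheme and the default of eight retained backups are assumptions; pick whatever fits your retention policy.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_path: str, backup_dir: str, keep: int = 8) -> Path:
    """Copy the config into backup_dir with a timestamp and prune old copies."""
    source = Path(config_path)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = target_dir / f"{source.stem}-{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves the modification time

    # Timestamped names sort chronologically, so the oldest backups come first
    backups = sorted(target_dir.glob(f"{source.stem}-*{source.suffix}"))
    for old in backups[:-keep]:
        old.unlink()
    return target
```

Running this from a weekly cron job, with config_path pointing at ~/.claude-code-router/config.json, gives you a known-good copy to diff against when a configuration error appears.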

Suggested automated checks

#!/usr/bin/env python3
# CCR system health check script

import os
import subprocess
import time
from datetime import datetime
import smtplib
from email.mime.text import MIMEText

class CCRHealthChecker:
    def __init__(self):
        self.config = {
            "service_name": "claude-code-router",
            "port": 3456,
            "log_path": os.path.expanduser("~/.claude-code-router/logs"),
            "max_log_age_days": 30,
            "alert_email": "admin@example.com",
            "check_interval": 300,  # 5 minutes
            "cpu_threshold": 80,  # percent
            "memory_threshold": 1024,  # MB
            "error_threshold": 5  # matching lines in the newest log file
        }
        self.problems = []
    
    def check_service_status(self):
        """Check whether the service is running."""
        try:
            result = subprocess.run(
                ["systemctl", "is-active", self.config['service_name']],
                capture_output=True,
                text=True
            )
            status = result.stdout.strip()
            
            if status != "active":
                self.problems.append(f"Service state is abnormal: {status}")
                return False
            return True
        except Exception as e:
            self.problems.append(f"Failed to check service status: {e}")
            return False
    
    def check_api_health(self):
        """Check the API health endpoint."""
        try:
            result = subprocess.run(
                ["curl", "-s", "-o", "/dev/null", "-w", "%{http_code}",
                 f"http://localhost:{self.config['port']}/health"],
                capture_output=True,
                text=True
            )
            status_code = result.stdout.strip()
            
            if status_code != "200":
                self.problems.append(f"API health check failed: HTTP {status_code}")
                return False
            return True
        except Exception as e:
            self.problems.append(f"API health check error: {e}")
            return False
    
    def check_resource_usage(self):
        """Check CPU and memory usage of the service process."""
        try:
            # Find the process ID. Passing a list (no shell) keeps pgrep from
            # matching the wrapper shell's own command line.
            pids = subprocess.run(
                ["pgrep", "-f", self.config['service_name']],
                capture_output=True,
                text=True
            ).stdout.split()
            
            if not pids:
                self.problems.append("Service process not found")
                return False
            pid = pids[0]  # pgrep may match several processes; inspect the first
            
            # ps prints %CPU and resident memory (RSS, in KiB) without headers
            fields = subprocess.run(
                ["ps", "-o", "%cpu=", "-o", "rss=", "-p", pid],
                capture_output=True,
                text=True
            ).stdout.split()
            cpu_usage = float(fields[0])
            memory_kb = float(fields[1])
            
            if cpu_usage > self.config['cpu_threshold']:
                self.problems.append(f"CPU usage too high: {cpu_usage}%")
            
            if memory_kb > self.config['memory_threshold'] * 1024:
                self.problems.append(f"Memory usage too high: {memory_kb/1024:.2f}MB")
            
            return True
        except Exception as e:
            self.problems.append(f"Resource check error: {e}")
            return False
    
    def check_recent_errors(self):
        """Count error lines in the most recently modified log file."""
        try:
            log_dir = self.config['log_path']
            # Pick the newest log by modification time, not by file name
            log_files = sorted(
                [os.path.join(log_dir, f) for f in os.listdir(log_dir)
                 if f.startswith('ccr-') and f.endswith('.log')],
                key=os.path.getmtime,
                reverse=True
            )
            
            if not log_files:
                self.problems.append("No log files found")
                return False
            
            # This counts matches over the whole file, not a time window;
            # narrowing it to recent entries requires parsing log timestamps
            with open(log_files[0], errors="replace") as f:
                error_count = sum(
                    1 for line in f
                    if "error" in line.lower() or "fail" in line.lower()
                )
            
            if error_count > self.config['error_threshold']:
                self.problems.append(f"Too many errors: {error_count} error line(s)")
            
            return True
        except Exception as e:
            self.problems.append(f"Log check error: {e}")
            return False
    
    def clean_old_logs(self):
        """Delete log files older than max_log_age_days."""
        try:
            cutoff_time = time.time() - (self.config['max_log_age_days'] * 86400)
            
            for filename in os.listdir(self.config['log_path']):
                file_path = os.path.join(self.config['log_path'], filename)
                if os.path.isfile(file_path) and filename.endswith('.log'):
                    file_mtime = os.path.getmtime(file_path)
                    if file_mtime < cutoff_time:
                        os.remove(file_path)
                        print(f"Removed old log: {filename}")
            
            return True
        except Exception as e:
            print(f"Log cleanup error: {e}")
            return False
    
    def send_alert(self):
        """Send an alert email listing the detected problems."""
        if not self.problems:
            return True
            
        subject = f"[ALERT] Claude Code Router system anomaly ({datetime.now().strftime('%Y-%m-%d %H:%M:%S')})"
        body = "The following problems were detected:\n\n" + "\n".join([f"- {p}" for p in self.problems])
        
        msg = MIMEText(body)
        msg['Subject'] = subject
        msg['From'] = "monitor@example.com"
        msg['To'] = self.config['alert_email']
        
        try:
            with smtplib.SMTP('localhost') as server:
                server.send_message(msg)
            print("Alert email sent")
            return True
        except Exception as e:
            print(f"Failed to send alert email: {e}")
            return False
    
    def run_full_check(self):
        """Run every check and report the result."""
        self.problems = []
        print(f"Starting system health check: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        
        self.check_service_status()
        self.check_api_health()
        self.check_resource_usage()
        self.check_recent_errors()
        
        if self.problems:
            print("Problems found:")
            for problem in self.problems:
                print(f"- {problem}")
            self.send_alert()
        else:
            print("System health check passed")
        
        # Clean up logs on Sundays (weekday() == 6)
        if datetime.now().weekday() == 6:
            self.clean_old_logs()
        
        return len(self.problems) == 0

if __name__ == "__main__":
    checker = CCRHealthChecker()
    checker.run_full_check()

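Incident response process

Detection and reporting
- The automated monitoring system detects an anomaly and raises an alert
- The first responder confirms that the alert is genuine
- An incident ticket is opened and the relevant people are notified
Assessment and escalation
- Make an initial assessment of the incident's scope and severity
- Decide whether to escalate based on that scope
- Notify affected users or teams
Diagnosis and repair
- Locate the problem using the troubleshooting procedures in this document
- Apply a temporary workaround to restore service
- Deploy the permanent fix
Recovery and verification
- Confirm that the service is running normally again
- Verify that all functionality works
- Monitor the system for a period to make sure it stays stable
Review and improvement
- Hold an incident review meeting
- Record the root cause and the fix
- Update preventive measures and documentation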

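Recommended third-party tools

PM2 - process manager

Purpose: service startup, monitoring, automatic restarts, and log management
Example configuration: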

{
  "name": "claude-code-router",
  "script": "dist/server.js",
  "instances": "max",
  "exec_mode": "cluster",
  "watch": false,
  "max_memory_restart": "1G",
  "env": {
    "NODE_ENV": "production",
    "LOG_LEVEL": "info"
  },
  "log_date_format": "YYYY-MM-DD HH:mm:ss Z"
}

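Prometheus + Grafana - monitoring and visualization platform
- Purpose: collect system metrics, build custom dashboards, and configure alerts
- Integration: expose a metrics endpoint through the Node.js client
ELK Stack - log management and analysis
- Purpose: centralized log collection, search, and visualization
- Configuration advice: create a dedicated index pattern and set up alerts on error logs
Postman/Newman - API testing tools
- Purpose: build automated API test suites that verify service functionality on a schedule
- Use cases: pre-deployment verification and periodic health checks
Ansible - operations automation
- Purpose: configuration management, service deployment and updates, and automated failure recovery
- Example uses: automated deployment of new versions, config backups, and key rotation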
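For the Prometheus integration mentioned above, it helps to know what a metrics endpoint actually returns. The sketch below renders values in the Prometheus text exposition format (HELP and TYPE comment lines followed by a sample line). It only illustrates the format; in the service itself the Node.js client would produce this output, and the metric names in the usage note are hypothetical.

```python
def render_metrics(metrics: dict[str, tuple[str, str, float]]) -> str:
    """Render metrics in the Prometheus text exposition format.

    metrics maps a metric name to (help text, metric type, value).
    """
    lines = []
    for name, (help_text, metric_type, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    # The format is line-oriented and the last line must end with a newline
    return "\n".join(lines) + "\n"
```

For example, render_metrics({"ccr_up": ("Whether the router is running", "gauge", 1)}) yields three lines that Prometheus can scrape from a /metrics style endpoint.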