Troubleshooting and Fixing Workload Identity Authentication Failures for Velero on Azure

2025-05-25 10:24:13 Author: 魏侃纯Zoe

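Background

When backing up Kubernetes clusters with Velero, many users deploy Velero on Azure Kubernetes Service (AKS) and authenticate through Azure Workload Identity. In practice, the deployment can run into authentication failures: Velero cannot retrieve the storage account properties, or it cannot validate the backup storage location.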

Symptoms

A typical error log contains the following message:

failed to retrieve the storage account properties: ManagedIdentityCredential: ManagedIdentityCredential: Get "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=&resource=https%3A%2F%2Fmanagement.core.windows.net%2F": context deadline exceeded

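This shows that Velero tried to obtain a token through the traditional managed identity path (IMDS) and timed out, instead of following the Workload Identity authentication flow. Note that the client_id parameter in the request URL is empty.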
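To check quickly whether a running Velero instance is hitting this fallback, the log can be scanned for the signature above. The following is a minimal sketch: the function name count_imds_fallbacks is hypothetical, and the usage comment assumes the Deployment is named velero in the velero namespace.

```shell
# Count log lines (read from stdin) in which ManagedIdentityCredential
# called the IMDS endpoint at 169.254.169.254.
# Hypothetical helper, not part of Velero.
count_imds_fallbacks() {
  # grep -c exits non-zero when nothing matches; "|| true" keeps the
  # function usable under "set -e". The count is still printed.
  grep -c 'ManagedIdentityCredential.*169\.254\.169\.254' || true
}

# Usage (assumes a Deployment named "velero" in namespace "velero"):
#   kubectl logs deployment/velero -n velero | count_imds_fallbacks
```

A non-zero count suggests the Pod is not picking up Workload Identity and is falling back to IMDS.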

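Root Cause Analysis

Investigation shows that this problem is usually caused by one or more of the following:

Workload Identity add-on not enabled: the cluster has an OIDC issuer configured, but the Workload Identity add-on itself may not have been activated.
Wrong authentication flow selected: Velero falls back to the traditional managed identity flow instead of using Workload Identity.
Service account misconfiguration: the federated token file that Workload Identity requires may not have been injected into the Pod.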

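Solution

1. Make sure the Workload Identity add-on is enabled

On an AKS cluster, the Workload Identity feature must be enabled explicitly; configuring the OIDC issuer alone is not enough. The following command enables both: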

az aks update --enable-oidc-issuer --enable-workload-identity --name <cluster-name> --resource-group <resource-group>
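
After the update, the cluster state can be read back to confirm that both settings took effect. This is a sketch that wraps az aks show in a hypothetical function, check_workload_identity; the query paths oidcIssuerProfile.issuerUrl and securityProfile.workloadIdentity.enabled are the property names as I understand them from the AKS resource model, so treat them as assumptions and adjust if your CLI version reports different fields.

```shell
# check_workload_identity <cluster-name> <resource-group>
# Prints the OIDC issuer URL and the Workload Identity flag of the cluster.
# Hypothetical helper; property paths are assumptions (see lead-in).
check_workload_identity() {
  az aks show \
    --name "$1" \
    --resource-group "$2" \
    --query "{oidcIssuer: oidcIssuerProfile.issuerUrl, workloadIdentity: securityProfile.workloadIdentity.enabled}" \
    --output json
}

# Usage:
#   check_workload_identity my-cluster my-resource-group
```

If workloadIdentity comes back as null or false while oidcIssuer is set, the cluster matches the first root cause listed above.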

2. Verify the service account configuration

Make sure Velero's service account carries the Workload Identity annotation:

serviceAccount:
  server:
    create: true
    name: "velero-sa"
    annotations:
      azure.workload.identity/client-id: <managed-identity-client-id>
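
Once deployed, the annotation can be read back from the live object to confirm it was applied. A minimal sketch, assuming the service account name velero-sa from the values above and the velero namespace; sa_client_id is a hypothetical helper name.

```shell
# sa_client_id <service-account> [namespace]
# Prints the value of the azure.workload.identity/client-id annotation.
# Dots inside the annotation key are escaped for kubectl's JSONPath.
sa_client_id() {
  kubectl get serviceaccount "$1" -n "${2:-velero}" \
    -o jsonpath='{.metadata.annotations.azure\.workload\.identity/client-id}'
}

# Usage:
#   sa_client_id velero-sa velero
```

Empty output means the annotation is missing, in which case no client ID can be injected into the Pod.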

3. Check the Pod labels

The Velero Pod must carry the following label for Workload Identity to be applied to it:

podLabels:
  azure.workload.identity/use: "true"
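
A label selector is a simple way to confirm that the running Pods actually carry this label. The sketch below assumes Velero runs in the velero namespace; list_wi_pods is a hypothetical helper name.

```shell
# list_wi_pods [namespace]
# Lists Pods labeled azure.workload.identity/use=true in the namespace.
list_wi_pods() {
  kubectl get pods -n "${1:-velero}" \
    -l azure.workload.identity/use=true \
    -o name
}

# Usage:
#   list_wi_pods velero
```

If the Velero Pod is not in the output, the label did not reach the Pod template. Since the token is injected when the Pod is created, the Pod needs to be recreated after the label is added.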

4. Verify the federated token file

Check that the federated token file exists inside the Velero Pod:

kubectl exec -it <velero-pod> -n velero -- ls -la /var/run/secrets/azure/tokens

You should see a file such as azure-identity-token.
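
If the container image has no ls binary, the same check can be made from the Pod spec, because the Workload Identity webhook also injects environment variables alongside the token volume. The sketch below assumes the webhook sets AZURE_CLIENT_ID, AZURE_TENANT_ID, AZURE_FEDERATED_TOKEN_FILE and AZURE_AUTHORITY_HOST; missing_wi_env is a hypothetical helper name.

```shell
# missing_wi_env "<space-separated env var names>"
# Prints each expected Workload Identity variable that is absent from the
# given list, one per line. Prints nothing when all are present.
missing_wi_env() {
  for name in AZURE_CLIENT_ID AZURE_TENANT_ID \
              AZURE_FEDERATED_TOKEN_FILE AZURE_AUTHORITY_HOST; do
    case " $1 " in
      *" $name "*) ;;          # present
      *) echo "$name" ;;       # absent
    esac
  done
}

# Usage (replace VELERO_POD with the actual Pod name):
#   missing_wi_env "$(kubectl get pod "$VELERO_POD" -n velero \
#     -o jsonpath='{.spec.containers[*].env[*].name}')"
```

Any name printed here points to the webhook not having mutated the Pod, which usually traces back to steps 1 to 3.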

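Recommended Permissions

For Workload Identity to work, the managed identity needs appropriate role assignments:

Storage account level:
- Storage Blob Data Contributor
- Reader
Resource group level:
- Contributor
- Velero custom role (configured as described in the documentation)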
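
As an illustration, one of these role assignments can be created with the Azure CLI. This is a sketch only: assign_blob_role is a hypothetical helper, and the caller supplies the identity's client or principal ID and the full resource ID of the storage account.

```shell
# assign_blob_role <assignee-id> <storage-account-resource-id>
# Grants "Storage Blob Data Contributor" on the given storage account.
assign_blob_role() {
  az role assignment create \
    --assignee "$1" \
    --role "Storage Blob Data Contributor" \
    --scope "$2"
}

# Usage (both values are placeholders to fill in):
#   assign_blob_role "$IDENTITY_CLIENT_ID" "$STORAGE_ACCOUNT_ID"
```

The other roles in the list can be assigned the same way by changing the --role and --scope values.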

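Best Practices

Avoid mixing authentication methods: do not use Pod Identity and Workload Identity in the same cluster at the same time, as this can lead to conflicts.
Check network policies: make sure the cluster nodes can reach the Azure Instance Metadata Service (IMDS) endpoint, even though Workload Identity does not primarily depend on it.
Version compatibility: make sure the Velero plugin version in use supports Workload Identity.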