Using Runtime Parameters (runtime_params) Correctly in Kedro Projects

2025-05-22 16:45:01 Author: 伍希望

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

Project repository: https://gitcode.com/GitHub_Trending/ke/kedro

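Overview

In Kedro projects, runtime parameters (runtime_params) let users override configuration values dynamically when running a pipeline. In practice, however, many developers run into parameter resolution failures. This article looks at how runtime_params works, the common pitfalls, and best practices.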

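Basic usage of runtime_params

runtime_params is a resolver that Kedro provides for overriding values from configuration files at run time. To mark a value as overridable, declare it in a parameters file in the form ${runtime_params:<parameter name>}.

A typical parameters.yml looks like this: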

model:
  name: "${runtime_params:model_name}"
  identifier: "${runtime_params:model_identifier}"

The values are then supplied on the CLI at run time:

kedro run --params model_name=llama,model_identifier=meta-llama/Llama-3.1-8
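
To make the CLI form concrete, the sketch below shows roughly how a comma-separated key=value string maps to a dictionary of runtime parameters. It is a simplified illustration written in plain Python, not Kedro's actual parser: `parse_params` is a hypothetical helper, all values stay strings, and nested keys or type conversion are not handled.

```python
def parse_params(raw: str) -> dict[str, str]:
    """Turn 'a=1,b=2' into {'a': '1', 'b': '2'} (simplified; values stay strings)."""
    params: dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first "=" only, so a value may itself contain "=".
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"Expected key=value, got {item!r}")
        params[key.strip()] = value.strip()
    return params


# The same string that follows --params in the command above.
runtime_params = parse_params(
    "model_name=llama,model_identifier=meta-llama/Llama-3.1-8"
)
print(runtime_params)
```

The resulting dictionary is what the ${runtime_params:...} placeholders are looked up against.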

Common problems

Symptom

Developers frequently hit the following error:

InterpolationResolutionError: Runtime parameter 'model_name' not found and no default value provided.

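Root cause

Limitation of manual loading: when a developer instantiates OmegaConfigLoader by hand to load parameters, that loader has no knowledge of the runtime parameters passed on the CLI unless they are handed to it explicitly.
Timing of configuration loading: Kedro merges the parameters from the configuration files with the runtime parameters when it creates the session, and manual loading bypasses this mechanism.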
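
The two causes above can be illustrated with a small stand-in for the resolver. This is a sketch in plain Python, not Kedro's or OmegaConf's implementation: `resolve` and `MissingRuntimeParam` are hypothetical names, and the optional `, default` form handled by the regex is an assumption based on the wording of the error message, so check the Kedro documentation for the exact syntax.

```python
import re

# Matches "${runtime_params:key}" and, as an assumption, "${runtime_params:key, default}".
_PLACEHOLDER = re.compile(
    r"^\$\{runtime_params:\s*([^,}\s]+)\s*(?:,\s*([^}]*?)\s*)?\}$"
)


class MissingRuntimeParam(LookupError):
    """Stand-in for the InterpolationResolutionError shown above."""


def resolve(value, runtime_params):
    """Recursively replace runtime_params placeholders in a nested dict."""
    if isinstance(value, dict):
        return {k: resolve(v, runtime_params) for k, v in value.items()}
    if isinstance(value, str):
        match = _PLACEHOLDER.match(value)
        if match:
            key, default = match.group(1), match.group(2)
            if key in runtime_params:
                return runtime_params[key]
            if default is not None:
                return default
            raise MissingRuntimeParam(
                f"Runtime parameter '{key}' not found and no default value provided."
            )
    return value


parameters = {
    "model": {
        "name": "${runtime_params:model_name}",
        "identifier": "${runtime_params:model_identifier}",
    }
}

# Path 1: the CLI values reach the loader, so every placeholder resolves.
cli_params = {"model_name": "llama", "model_identifier": "meta-llama/Llama-3.1-8"}
print(resolve(parameters, cli_params))

# Path 2: a manually created loader sees no runtime parameters, so the lookup fails.
try:
    resolve(parameters, {})
except MissingRuntimeParam as exc:
    print(exc)
```

The point of the second path is that the placeholder itself is fine; what is missing is the dictionary of runtime values, which only the session-driven loading path supplies.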

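Best practices

1. Pass parameters as pipeline inputs

The correct approach is to pass parameters to nodes as pipeline inputs rather than loading them manually: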

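from kedro.pipeline import node, pipeline

# process_model is the project's own node function, defined elsewhere.
base_pipeline = pipeline(
    [
        node(
            func=process_model,
            inputs=["params:model"],  # the params: prefix passes the whole "model" parameter group
            outputs="processed_data",
        )
    ]
)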
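
The source does not show the node function itself. As a minimal sketch, a hypothetical `process_model` could look like the following; the function body and the returned keys are assumptions made for illustration. What matters is that the function receives the already resolved `model` group as an ordinary dictionary and never touches a config loader.

```python
def process_model(model: dict) -> dict:
    """Hypothetical node function: receives the resolved 'model' parameter group."""
    name = model["name"]
    identifier = model["identifier"]
    # No config loading here; Kedro has already resolved the placeholders.
    return {
        "model_name": name,
        "source": identifier,
        "label": f"{name} ({identifier})",
    }


# What the node would receive for the CLI call shown earlier.
resolved = {"name": "llama", "identifier": "meta-llama/Llama-3.1-8"}
processed_data = process_model(resolved)
print(processed_data["label"])
```

Because the function only depends on its argument, it can be unit-tested with a plain dictionary and no Kedro session.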

2. Dynamic dataset configuration

When a dataset needs to be configured dynamically, the same resolver can be used in catalog.yml:

HFTokenizer:
  type: custom.datasets.HFTokenizer
  model_identifier: "${runtime_params:model_identifier}"
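
The custom dataset class is not shown in the source. The sketch below assumes that the keys of a catalog entry other than `type` are passed to the dataset's constructor as keyword arguments; the class here is a plain-Python stand-in with a hypothetical `load` body, whereas a real `custom.datasets.HFTokenizer` would subclass Kedro's dataset base class and load an actual tokenizer.

```python
class HFTokenizer:
    """Hypothetical stand-in for custom.datasets.HFTokenizer."""

    def __init__(self, model_identifier: str) -> None:
        self._model_identifier = model_identifier

    def load(self) -> str:
        # A real dataset would load a tokenizer here; the sketch only reports which one.
        return f"tokenizer for {self._model_identifier}"


# The catalog entry after the runtime_params placeholder has been resolved.
entry = {
    "type": "custom.datasets.HFTokenizer",
    "model_identifier": "meta-llama/Llama-3.1-8",
}

# Everything except "type" becomes a constructor argument in this sketch.
init_kwargs = {k: v for k, v in entry.items() if k != "type"}
dataset = HFTokenizer(**init_kwargs)
print(dataset.load())
```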

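3. Avoid loading configuration manually

Unless there is a specific need, avoid loading configuration manually inside pipeline creation functions. The Kedro framework merges and resolves parameters automatically.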

Handling advanced scenarios

When a pipeline has to be built dynamically based on parameters, consider the following pattern:

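from kedro.pipeline import Pipeline, node, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    # kwargs holds only what the caller passes in (for example from
    # pipeline_registry.py); Kedro does not inject parameters here by itself.
    # model_params can drive structural decisions about which nodes to build.
    model_params = kwargs.get("params", {}).get("model", {})

    return pipeline(
        nodes=[
            node(
                func=process_model,
                # Parameter values still reach the node through params:model.
                inputs={"model_config": "params:model"},
                outputs="result",
            )
        ]
    )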