Using PostprocessGermlineCNVCalls Correctly in the GATK GermlineCNVCaller Workflow

2025-07-08 03:52:04 Author: 吴年前Myrtle

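Background

When running copy number variant analysis with the GATK germline CNV caller workflow, many users hit a KeyError at the PostprocessGermlineCNVCalls step. The error typically reports that the tool cannot find the sample name, even though every input file exists and every path resolves.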

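Cause of the error

The root cause is an incorrect path for the --contig-ploidy-calls argument. A common mistake is to point it at one specific SAMPLE_x folder, when it should point at the parent directory that contains all of the SAMPLE_x folders.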
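The mistake described above can be caught before GATK is ever launched. The following is a minimal sketch, not part of GATK: `resolve_ploidy_calls_dir` is a hypothetical helper name, and it assumes the per-sample folders are named SAMPLE_ followed by digits. If the given path ends in such a folder, it prints the parent directory; otherwise it prints the path unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a GATK command): if the path is a single
# SAMPLE_<n> folder, print its parent; otherwise print it unchanged.
resolve_ploidy_calls_dir() {
    local path="${1%/}"              # drop one trailing slash, if any
    local name
    name="$(basename "$path")"
    case "$name" in
        SAMPLE_|SAMPLE_*[!0-9]*)     # no digits, or a non-digit suffix
            printf '%s\n' "$path" ;;
        SAMPLE_*)                    # SAMPLE_ followed only by digits
            dirname "$path" ;;
        *)
            printf '%s\n' "$path" ;;
    esac
}
```

The printed value can then be passed to --contig-ploidy-calls, so a path copied one level too deep is corrected rather than producing the KeyError.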

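Correct configuration

Key parameters

--contig-ploidy-calls: points to the calls directory written by DetermineGermlineContigPloidy. That directory contains multiple subdirectories whose names start with SAMPLE_.
--calls-shard-path: points to the calls directory written by GermlineCNVCaller.
--model-shard-path: points to the model directory written by GermlineCNVCaller.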

Correct command example

# Note: --contig-ploidy-calls is the parent directory that holds the SAMPLE_x folders.
gatk PostprocessGermlineCNVCalls \
    --calls-shard-path /path/to/germlinecnvcaller-calls \
    --model-shard-path /path/to/model \
    --sample-index 0 \
    --autosomal-ref-copy-number 2 \
    --allosomal-contig chrX \
    --allosomal-contig chrY \
    --contig-ploidy-calls /path/to/determine_ploidy-calls \
    --output-genotyped-intervals /path/to/genotyped_intervals.vcf \
    --output-genotyped-segments /path/to/genotyped_segments.vcf \
    --output-denoised-copy-ratios /path/to/genotyped_denoised_copy_ratios.tsv
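For a cohort, the same command has to be repeated once per sample index. The sketch below wraps it in a loop over the SAMPLE_x folders. The function name, the `GATK` override variable, and the output file naming are all assumptions made for illustration, and it assumes a single calls/model shard and the same chrX/chrY contig names as the example above.

```shell
#!/usr/bin/env bash
# Sketch: run PostprocessGermlineCNVCalls once per SAMPLE_<n> folder.
# postprocess_all_samples, GATK and the output names are hypothetical.
postprocess_all_samples() {
    local calls_dir="$1" model_dir="$2" ploidy_dir="$3" out_dir="$4"
    local sample_dir idx
    for sample_dir in "$ploidy_dir"/SAMPLE_*/; do
        [ -d "$sample_dir" ] || continue     # glob matched nothing
        idx="${sample_dir%/}"                # strip the trailing slash
        idx="${idx##*/SAMPLE_}"              # keep only the index suffix
        # The parent directory, not "$sample_dir", goes to --contig-ploidy-calls.
        "${GATK:-gatk}" PostprocessGermlineCNVCalls \
            --calls-shard-path "$calls_dir" \
            --model-shard-path "$model_dir" \
            --sample-index "$idx" \
            --autosomal-ref-copy-number 2 \
            --allosomal-contig chrX \
            --allosomal-contig chrY \
            --contig-ploidy-calls "$ploidy_dir" \
            --output-genotyped-intervals "$out_dir/sample_${idx}_genotyped_intervals.vcf" \
            --output-genotyped-segments "$out_dir/sample_${idx}_genotyped_segments.vcf" \
            --output-denoised-copy-ratios "$out_dir/sample_${idx}_denoised_copy_ratios.tsv" \
            || return 1                      # stop at the first failure
    done
}
```

Setting `GATK=echo` before calling the function prints each command instead of running it, which is a cheap way to review the paths first.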

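Workflow overview

DetermineGermlineContigPloidy: run this tool first to determine the ploidy of each sample. Its output directory should contain multiple SAMPLE_x subdirectories.
GermlineCNVCaller: run this tool next to call CNVs. It produces the directory passed as --calls-shard-path.
PostprocessGermlineCNVCalls: run this tool last for post-processing. It must reference the output directories of the two previous steps correctly.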
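Since the last step depends on the outputs of the first two, a quick consistency check can help before post-processing. The sketch below counts the SAMPLE_x folders in each directory and compares them. Both function names are hypothetical, and it assumes the GermlineCNVCaller calls directory also holds one SAMPLE_x folder per sample, as the ploidy calls directory does.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare per-sample folders across the two inputs.
count_sample_dirs() {
    local dir="$1" n=0 d
    for d in "$dir"/SAMPLE_*/; do
        [ -d "$d" ] && n=$((n + 1))   # skip the unexpanded glob pattern
    done
    printf '%s\n' "$n"
}

# Usage: check_sample_counts_match <ploidy-calls-dir> <gcnv-calls-dir>
check_sample_counts_match() {
    local ploidy_n calls_n
    ploidy_n="$(count_sample_dirs "$1")"
    calls_n="$(count_sample_dirs "$2")"
    if [ "$ploidy_n" -eq 0 ] || [ "$ploidy_n" -ne "$calls_n" ]; then
        echo "mismatch: $ploidy_n ploidy samples vs $calls_n call samples" >&2
        return 1
    fi
}
```

A count of zero on the ploidy side usually means the path points at a SAMPLE_x folder itself rather than at its parent.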

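Best practices

Always use absolute paths for input and output directories.
Before running PostprocessGermlineCNVCalls, check that the directory given to --contig-ploidy-calls contains the expected SAMPLE_x subdirectories.
Make sure the --sample-index value matches the index of the sample you intend to process.
For batch processing, consider writing a script that checks the directory structure automatically.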
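The directory and index checks above can be scripted. The following is a minimal sketch with a hypothetical function name. It verifies that SAMPLE_<index> exists under the ploidy calls directory and, if a sample_name.txt file is present in that folder (an assumption about the output layout), prints its contents so the index can be matched to a sample name.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for --contig-ploidy-calls and --sample-index.
# Usage: check_sample_index <ploidy-calls-dir> <sample-index>
check_sample_index() {
    local ploidy_dir="${1%/}" idx="$2"
    local sample_dir="$ploidy_dir/SAMPLE_$idx"
    if [ ! -d "$sample_dir" ]; then
        echo "error: $sample_dir not found; is the ploidy path the parent directory?" >&2
        return 1
    fi
    # Assumption: each SAMPLE_<n> folder may carry a sample_name.txt file.
    if [ -f "$sample_dir/sample_name.txt" ]; then
        cat "$sample_dir/sample_name.txt"
    fi
}
```

Running it with the same directory and index that will be passed to PostprocessGermlineCNVCalls gives a clear message up front instead of a KeyError later.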