Testing Offline-Host Metadata Cleanup in the Dragonfly2 Scheduler

2025-06-30 18:13:19 Author: 尤辰城Agatha

Delivers efficient, stable, and secure data distribution and acceleration powered by P2P technology, with an optional content‑addressable filesystem that accelerates OCI container launch.

Project address: https://gitcode.com/gh_mirrors/dr/Dragonfly2

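In Dragonfly2, a distributed file distribution system, the scheduler must manage peer (host) metadata efficiently for the system to stay stable. This article describes how end-to-end (E2E) tests can verify that the scheduler cleans up the metadata of hosts that exit, whether gracefully or abnormally. Such cleanup reclaims scheduler resources and prevents "zombie nodes" from lingering in the scheduler's view of the cluster.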

Core Mechanisms

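Dragonfly2 uses two complementary mechanisms to track host state:

Active notification
When a host exits gracefully, it calls the LeaveHost() RPC to tell the scheduler to clean up its metadata. This synchronous cleanup takes effect immediately, so scheduler resources are released quickly.
Passive reclamation
A garbage collection (GC) module scans periodically and detects abnormal hosts that have not reported a heartbeat for more than twice the announce host interval. This asynchronous path is the fault-tolerant fallback: it ensures eventual consistency in failure scenarios such as network partitions.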
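
The two paths can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Dragonfly2's actual implementation: the `HostRegistry` class, its method names, and the 30-second interval are hypothetical and chosen for illustration. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List

ANNOUNCE_INTERVAL = 30.0  # seconds; assumed value, for illustration only


@dataclass
class HostRegistry:
    """Toy model of scheduler-side host metadata (hypothetical, not Dragonfly2 code)."""
    announce_interval: float = ANNOUNCE_INTERVAL
    last_seen: Dict[str, float] = field(default_factory=dict)

    def announce(self, host_id: str, now: float) -> None:
        # Record or refresh the time of the host's latest announcement.
        self.last_seen[host_id] = now

    def leave_host(self, host_id: str) -> None:
        # Active path: a graceful exit removes the metadata immediately.
        self.last_seen.pop(host_id, None)

    def gc(self, now: float) -> List[str]:
        # Passive path: evict hosts silent for more than 2x the announce interval.
        expired = [h for h, t in self.last_seen.items()
                   if now - t > 2 * self.announce_interval]
        for h in expired:
            del self.last_seen[h]
        return expired


registry = HostRegistry()
registry.announce("peer-a", now=0.0)
registry.announce("peer-b", now=0.0)
registry.leave_host("peer-a")        # graceful exit: removed right away
registry.announce("peer-c", now=50.0)
evicted = registry.gc(now=61.0)      # peer-b has been silent for 61 s > 60 s
```

In this model `peer-a` disappears as soon as it leaves, while `peer-b`, which stopped announcing without notice, is only removed by the first GC pass that runs after the threshold has elapsed.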

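Test Design

Test Environment

The test cluster needs the following components:

At least 2 peer nodes (1 as the host under test, 1 as a baseline reference)
1 scheduler instance
A monitoring component (used to verify metadata changes)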

Graceful Exit Test Case

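# Declared async so that the await on assert_eventually() below is valid
async def test_normal_exit():
    # Record the initial number of active hosts
    init_count = get_active_hosts_count()

    # Start the host under test
    test_host = start_host()
    assert get_active_hosts_count() == init_count + 1

    # Simulate a graceful exit
    test_host.graceful_shutdown()

    # Verify that the metadata is cleaned up
    await assert_eventually(
        lambda: get_active_hosts_count() == init_count,
        timeout=2 * HEARTBEAT_INTERVAL,
    )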
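
The test above relies on an `assert_eventually` helper that the article does not define. A minimal sketch of such a helper, assuming it simply polls a condition until a deadline, might look like the following; the `interval` parameter and the demo around it are hypothetical additions for illustration.

```python
import asyncio
import time
from typing import Callable


async def assert_eventually(condition: Callable[[], bool],
                            timeout: float,
                            interval: float = 0.05) -> None:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(f"condition not met within {timeout}s")
        await asyncio.sleep(interval)


async def _demo() -> bool:
    # Stand-in for scheduler state: one host that is cleaned up shortly after.
    state = {"hosts": 1}

    async def cleanup() -> None:
        await asyncio.sleep(0.1)
        state["hosts"] = 0

    task = asyncio.create_task(cleanup())
    await assert_eventually(lambda: state["hosts"] == 0, timeout=2.0)
    await task
    return True


demo_ok = asyncio.run(_demo())
```

Polling returns as soon as the condition holds, so a graceful exit that is cleaned up immediately does not make the test wait for the full timeout.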

Abnormal Exit Test Case

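import signal
import time

def test_abnormal_exit():
    # Record the initial number of active hosts
    init_count = get_active_hosts_count()

    # Start the host under test
    test_host = start_host()
    assert get_active_hosts_count() == init_count + 1

    # Simulate a process crash with SIGKILL (the equivalent of kill -9)
    test_host.kill(signal.SIGKILL)

    # Wait for GC to run (2x the heartbeat interval, plus a buffer)
    time.sleep(2 * HEARTBEAT_INTERVAL + BUFFER_TIME)

    # The crashed host's metadata should have been reclaimed
    assert get_active_hosts_count() == init_count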
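
Because GC runs periodically, a crashed host is removed at the first scan after the 2x threshold has passed, not at the threshold itself. The sketch below works through that arithmetic under a simplified model in which scans happen at a fixed `gc_interval`; the function name, its parameters, and the example values are hypothetical, and the real scan schedule depends on the scheduler's configuration.

```python
import math


def eviction_time(last_announce: float, announce_interval: float,
                  gc_interval: float, first_scan: float = 0.0) -> float:
    """Time of the first periodic GC scan that sees silence > 2 * announce_interval."""
    threshold = last_announce + 2 * announce_interval
    # Scans occur at first_scan + k * gc_interval for k = 0, 1, 2, ...
    # Pick the smallest k whose scan time is strictly after the threshold.
    k = max(0, math.floor((threshold - first_scan) / gc_interval) + 1)
    return first_scan + k * gc_interval


# Example: last announcement at t=5 s, 30 s announce interval, 60 s GC interval.
evicted_at = eviction_time(last_announce=5.0, announce_interval=30.0, gc_interval=60.0)
delay = evicted_at - 5.0
upper_bound = 2 * 30.0 + 60.0  # threshold plus one full GC interval
```

Under this model the delay between the last announcement and eviction never exceeds twice the announce interval plus one GC interval, which suggests choosing BUFFER_TIME to cover at least one GC scan period.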