Apache HugeGraph HStore: Analyzing and Fixing NaN Values in JRaft Histogram Metrics

2025-06-29 16:16:41 Author: 钟日瑜

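Background

In the HStore component of Apache HugeGraph, Histogram-type JRaft metrics come back as NaN when they are read through the Spring Actuator endpoint. Specifically, every metric in the jraft_append_logs_bytes_* family is reported as NaN, even though JRaft's own internal statistics hold valid values for them.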

Symptoms

The metrics returned by the monitoring endpoint look like this:

jraft_append_logs_bytes_summary_count{group="0",handle="data",hg="store",} NaN
jraft_append_logs_bytes_mean{group="0",handle="data",hg="store",} NaN
jraft_append_logs_bytes_min{group="0",handle="data",hg="store",} NaN
...

JRaft's internal statistics, however, show that the same metric does have valid values:

append-logs-bytes
             count = 67710
               min = 110
               max = 110
              mean = 110.00
            stddev = 0.00
            median = 110.00
              75% <= 110.00
              95% <= 110.00

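Analysis

Closer inspection points to the implementation of the HistogramWrapper class. The current implementation caches the snapshot and refreshes it only once every 30 seconds: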

private static class HistogramWrapper {
    private final com.codahale.metrics.Histogram histogram;
    private Snapshot snapshot;
    private long ts = System.currentTimeMillis();

    HistogramWrapper(com.codahale.metrics.Histogram histogram) {
        this.histogram = histogram;
        this.snapshot = this.histogram.getSnapshot();
    }

    Snapshot getSnapshot() {
        if (System.currentTimeMillis() - this.ts > 30_000) {
            this.snapshot = this.histogram.getSnapshot();
            this.ts = System.currentTimeMillis();
        }
        return this.snapshot;
    }
}

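This design has two potential problems:

Data freshness: with a 30-second cache interval, the snapshot returned to the caller can be up to 30 seconds old, and the lag is more noticeable when the system is under heavy load.
Initialization: the first snapshot is taken in the constructor, right after the system starts or the metric is created. A snapshot taken at that point may not have enough data points behind it, which can result in NaN values.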
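The freshness problem can be reproduced without JRaft or the metrics library. The sketch below is a stand-in, not the HStore code: TtlCache is a hypothetical class that mirrors the caching logic of HistogramWrapper, and the clock is injected (an assumption made purely so the behavior can be shown deterministically). The value captured at construction keeps being served until the interval has elapsed, even though the underlying source has already changed.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical stand-in for HistogramWrapper: caches a value for ttlMillis. */
final class TtlCache<T> {
    private final Supplier<T> source;   // plays the role of histogram::getSnapshot
    private final LongSupplier clock;   // injected clock, in milliseconds
    private final long ttlMillis;       // refresh interval (30_000 in the original)
    private T cached;
    private long ts;

    TtlCache(Supplier<T> source, LongSupplier clock, long ttlMillis) {
        this.source = source;
        this.clock = clock;
        this.ttlMillis = ttlMillis;
        this.cached = source.get();     // first value is captured at construction
        this.ts = clock.getAsLong();
    }

    synchronized T get() {
        long now = clock.getAsLong();
        if (now - ts > ttlMillis) {     // same comparison as the original wrapper
            cached = source.get();
            ts = now;
        }
        return cached;
    }
}

class TtlCacheDemo {
    public static void main(String[] args) {
        long[] now = {0L};              // fake clock
        long[] count = {0L};            // fake metric, empty at startup

        TtlCache<Long> cache = new TtlCache<>(() -> count[0], () -> now[0], 30_000L);

        count[0] = 67_710L;             // data arrives after the wrapper was built
        now[0] = 10_000L;
        System.out.println(cache.get()); // prints 0: still the construction-time value

        now[0] = 30_001L;
        System.out.println(cache.get()); // prints 67710: refreshed after the interval
    }
}
```

Making the interval a constructor parameter also shows what tuning the cache amounts to: the staleness window shrinks, but it never disappears.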

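Solutions

The following options can be considered:

Option 1: Tune the caching strategy

Adjust the cache interval to a value that suits the workload. For metrics that change frequently, for example, it can be shortened to 10 seconds: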

Snapshot getSnapshot() {
    if (System.currentTimeMillis() - this.ts > 10_000) {
        this.snapshot = this.histogram.getSnapshot();
        this.ts = System.currentTimeMillis();
    }
    return this.snapshot;
}

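Option 2: Fetch the snapshot in real time

Where performance requirements are not especially strict, the wrapper can simply return a fresh snapshot on every call: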

Snapshot getSnapshot() {
    return this.histogram.getSnapshot();
}

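Note that this approach adds overhead, particularly under high concurrency.

Option 3: Add data validation

Check the snapshot for validity before returning it, so that NaN values are not passed on: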

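// Assumes an additional field on the wrapper: private Snapshot lastValidSnapshot;
Snapshot getSnapshot() {
    Snapshot snapshot = this.histogram.getSnapshot();
    if (Double.isNaN(snapshot.getMean())) {
        // Return a default value or the last valid snapshot
        // (lastValidSnapshot is null until a valid snapshot has been seen)
        return this.lastValidSnapshot;
    }
    this.lastValidSnapshot = snapshot;
    return snapshot;
}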
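The same fallback idea can be tried in isolation with a plain double reading. This is a minimal sketch under stated assumptions: LastValidGauge and its fallback parameter are hypothetical names, and the naive mean (sum divided by count) is used only to show how an empty data set produces NaN in Java; it is not a claim about how the metrics library computes its statistics.

```java
import java.util.function.DoubleSupplier;

/** Hypothetical helper: reports the last valid reading when the source yields NaN. */
final class LastValidGauge {
    private final DoubleSupplier source;
    private final double fallback;   // reported until a first valid reading exists
    private double lastValid;
    private boolean hasValid;

    LastValidGauge(DoubleSupplier source, double fallback) {
        this.source = source;
        this.fallback = fallback;
    }

    synchronized double value() {
        double v = source.getAsDouble();
        if (Double.isNaN(v)) {
            // Invalid reading: reuse the last valid one, or the default if none yet
            return hasValid ? lastValid : fallback;
        }
        lastValid = v;
        hasValid = true;
        return v;
    }
}

class LastValidGaugeDemo {
    public static void main(String[] args) {
        long[] sum = {0L};
        long[] count = {0L};
        // Naive mean: with count == 0 this is 0.0 / 0.0, which is NaN in Java
        LastValidGauge mean = new LastValidGauge(() -> (double) sum[0] / count[0], 0.0);

        System.out.println(mean.value()); // prints 0.0: no data yet, default is used

        sum[0] = 220L;
        count[0] = 2L;
        System.out.println(mean.value()); // prints 110.0: valid reading

        sum[0] = 0L;
        count[0] = 0L;
        System.out.println(mean.value()); // prints 110.0: NaN replaced by last valid value
    }
}
```

Supplying an explicit default avoids handing a null or NaN to the exporter before the first valid reading arrives. The trade-off is that a reused value hides the fact that the source is currently empty.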

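Recommendations

In practice, choose the option that fits the scenario:

Where performance matters and data freshness is less critical, use Option 1 and adjust the cache interval as needed.
Where data accuracy matters most, use Option 2, but evaluate the performance impact first.
Where both performance and accuracy matter, use Option 3, which guarantees that only valid data is reported. As written it takes a fresh snapshot on every call, so it should be combined with the caching from Option 1 if the overhead needs to be kept down.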

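Summary

The NaN values in JRaft Histogram metrics in Apache HugeGraph HStore are a typical metrics-collection problem. Analysis of the underlying implementation traces the cause to the snapshot caching strategy and the lack of a validity check on the data. The three options proposed here each have trade-offs, and developers can pick the one that matches their requirements.

The approach is not specific to HugeGraph and can serve as a reference for other distributed systems that use a similar monitoring mechanism. The key is to balance how fresh the collected data is against the cost to the system, while keeping the data accurate and reliable.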

Project repository: https://gitcode.com/gh_mirrors/in/hugegraph