Differences in how geom_curve and geom_segment handle missing values in ggplot2

2025-06-02 22:10:26 · Author: 俞予舒Fleming

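In data visualization, we often need line segments or curves to connect the points in a scatter plot. The ggplot2 package provides two main geoms for this: geom_segment and geom_curve. However, the two were recently found to handle missing values inconsistently, which can lead to unexpected errors when using geom_curve.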

The problem

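When geom_segment is used to connect data points, it still works when some values are missing (NA): it skips any segment that contains an NA and warns about the removed rows. In the example below, the data itself is complete; the NA comes from lead(), which has no next value to return for the last row: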

library(ggplot2)
library(dplyr)  # provides lead()

dtc <- data.frame(
  node = c("A", "B", "C"),
  x_connect = c(60, 32, 80),
  y_connect = c(39, 88, 110)
)

# geom_segment works: the incomplete last segment is dropped
ggplot(dtc) +
  geom_point(aes(x = x_connect, y = y_connect), size = 5) +
  geom_segment(aes(x = x_connect, y = y_connect,
                   xend = lead(x_connect), yend = lead(y_connect)))
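To see where the missing end point comes from, it can help to look at what lead() returns on its own. The snippet below is a small illustration using the same x_connect values as the example; it assumes dplyr is installed.

```r
library(dplyr)

x_connect <- c(60, 32, 80)

# lead() shifts the vector forward by one position; the last element
# has no successor, so it becomes NA.
x_next <- lead(x_connect)
print(x_next)
# [1] 32 80 NA
```

The third point therefore has no end point, and the layer receives one row with NA in xend and yend.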

However, attempting the same thing with geom_curve throws the error "end points must not be identical":

# geom_curve throws an error (lead() requires dplyr)
ggplot(dtc) +
  geom_point(aes(x = x_connect, y = y_connect), size = 5) +
  geom_curve(aes(x = x_connect, y = y_connect,
                 xend = lead(x_connect), yend = lead(y_connect)))

Root cause

Examining the ggplot2 source code and its internal data structures shows that the inconsistency comes down to the following:

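Different data preprocessing: before drawing, geom_segment calls remove_missing() to filter out records that contain NA values, while geom_curve does not perform this step.
Underlying drawing mechanism: geom_curve ultimately relies on the grid package's curve-drawing function (curveGrob), which checks its inputs more strictly and raises an error when it encounters NA values.
Timing of data validation: geom_curve does not fully validate the data before handing it to the grid drawing system, so invalid data triggers the low-level error.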
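The filtering step in the first point can be sketched with the exported helper remove_missing(). This is only an illustration of what that helper does, not ggplot2's internal code: the segs data frame is a hypothetical stand-in for the data a segment layer would receive, with column names chosen to mirror the x, y, xend and yend aesthetics.

```r
library(ggplot2)

# Hypothetical stand-in for the data a segment layer receives;
# the last row has no end point, as happens after lead().
segs <- data.frame(
  x    = c(60, 32, 80),
  y    = c(39, 88, 110),
  xend = c(32, 80, NA),
  yend = c(88, 110, NA)
)

# Drop rows with NA in any of the listed columns.
# With na.rm = FALSE the rows are still removed, but a warning is issued.
kept <- remove_missing(
  segs,
  na.rm = FALSE,
  vars  = c("x", "y", "xend", "yend"),
  name  = "geom_segment"
)

nrow(kept)  # 2
```

A geom that skips this step passes all three rows, including the incomplete one, on to the drawing code.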

Technical details

Inspecting the layer data gives a closer look at the problem:

# Get the layer data for geom_curve (layer 2 of the last plot)
curve_data <- layer_data(last_plot(), 2)
print(curve_data)

The output shows that geom_curve does receive records containing NA values, and that, unlike geom_segment, it does not filter them out automatically.
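The same check can be made without relying on last_plot(), by building the plot object explicitly and pulling out only the incomplete rows. This is a minimal sketch that reuses the article's data; the names p, curve_data and coords are introduced here for illustration.

```r
library(ggplot2)
library(dplyr)  # provides lead()

dtc <- data.frame(
  x_connect = c(60, 32, 80),
  y_connect = c(39, 88, 110)
)

p <- ggplot(dtc) +
  geom_point(aes(x = x_connect, y = y_connect), size = 5) +
  geom_curve(aes(x = x_connect, y = y_connect,
                 xend = lead(x_connect), yend = lead(y_connect)))

# Layer 2 is the geom_curve layer. Building the layer data does not
# draw the plot, so this step does not hit the drawing error.
curve_data <- layer_data(p, 2)

# Keep only the coordinate columns and show rows with a missing value.
coords <- curve_data[, c("x", "y", "xend", "yend")]
coords[!complete.cases(coords), ]
```

The row this returns is the one that reaches the curve-drawing code unfiltered.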

Workaround

Until the issue is fixed upstream, users can apply the following workaround:

Filter out the NA values manually:

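# Requires dplyr for %>%, filter() and lead().
# The formula passed to `data` drops the row whose end point would be NA.
ggplot(dtc) +
  geom_point(aes(x = x_connect, y = y_connect), size = 5) +
  geom_curve(aes(x = x_connect, y = y_connect,
                 xend = lead(x_connect), yend = lead(y_connect)),
             data = ~ .x %>% filter(!is.na(lead(x_connect))))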
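A variation on the same idea is to compute the end points once, in a separate data frame, and drop the incomplete rows before plotting. The sketch below assumes dplyr is available; edges, xend and yend are names introduced here for illustration.

```r
library(ggplot2)
library(dplyr)

dtc <- data.frame(
  node = c("A", "B", "C"),
  x_connect = c(60, 32, 80),
  y_connect = c(39, 88, 110)
)

# Compute each point's successor, then keep only complete connections.
edges <- dtc %>%
  mutate(xend = lead(x_connect), yend = lead(y_connect)) %>%
  filter(!is.na(xend), !is.na(yend))

p <- ggplot(dtc, aes(x = x_connect, y = y_connect)) +
  geom_point(size = 5) +
  geom_curve(data = edges, aes(xend = xend, yend = yend))

print(p)
```

Keeping the connections in their own data frame makes the filtering explicit and lets the same edges be reused for geom_segment or for labels.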