Disk I/O Optimization in Changedetection.io: Statically Caching History Record Lookups

2025-05-08 14:15:56 Author: 范靓好Udolf

The best and simplest free open source website change detection, website watcher, restock monitor and notification service. Restock Monitor, change detection. Designed for simplicity - Simply monitor which websites had a text change for free. Free Open source web page change detection, Website defacement monitoring, Price change notification

Project address: https://gitcode.com/GitHub_Trending/ch/changedetection.io

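While working with the web page monitoring tool Changedetection.io, we found an optimization that can noticeably improve performance. One of the tool's core features is monitoring web pages for content changes and storing the history of each change in local files.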

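Problems with the Current Implementation

In the existing implementation, whenever the current content has to be compared against the history, the system repeatedly reads the history index file from disk. Specifically:

Every call to the get_history_snapshot method reloads the history.txt index file
The repeated reads are especially noticeable in the logic behind the "only show unique lines" feature
Logs show the same history file being read several times within a short period

Modern SSDs read quickly, but these repeated I/O operations still add unnecessary overhead, particularly when there are many history records or when changes are checked frequently.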
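
To make the repeated reads concrete, here is a minimal sketch of the uncached pattern described above. The helper names `load_history_index` and `get_history_snapshot_uncached`, the read counter, and the `timestamp,filename` line format are assumptions made for illustration; they are not taken from the project's code.

```python
import os
import tempfile

read_count = 0  # how many times the index file has been opened


def load_history_index(path):
    """Parse an index file of 'timestamp,filename' lines (assumed format)."""
    global read_count
    read_count += 1
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            timestamp, filename = line.split(",", 1)
            index[timestamp] = filename
    return index


def get_history_snapshot_uncached(path, timestamp):
    # Hypothetical uncached lookup: re-reads the whole index on every call.
    return load_history_index(path)[timestamp]


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "history.txt")
    with open(index_path, "w", encoding="utf-8") as f:
        f.write("1700000000,a.txt\n1700000600,b.txt\n")

    # Three lookups of the same entry trigger three full reads of the index.
    for _ in range(3):
        get_history_snapshot_uncached(index_path, "1700000000")

print(read_count)  # prints 3
```

Each lookup costs one full read and parse of the index, so the cost grows with both the number of lookups and the length of the history.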

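Optimization Approach

This process can be optimized by introducing a static in-memory cache:

Cache the history index: keep a static cache in memory that stores history indexes that have already been loaded
Cache invalidation strategy: when a history record is updated, invalidate the corresponding cache entry
Load on demand: read from disk only when the cache entry is missing or has been invalidated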

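This optimization is particularly well suited to the following scenarios:

When there are many history records
When history records need to be compared frequently
In environments with limited system resources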
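
The "load only when the cache is missing or invalidated" rule can also be enforced without an explicit invalidation call, by comparing file metadata before trusting the cached copy. The sketch below is one possible variant, not the approach the project uses: `load_index` and `_index_cache` are hypothetical names, and treating a changed modification time or file size as "stale" is an assumption. It still costs one `os.stat` call per lookup, which is much cheaper than re-reading and re-parsing the file.

```python
import os
import tempfile

# Module-level cache: path -> ((mtime_ns, size), parsed lines)
_index_cache = {}


def load_index(path):
    """Return the lines of the index file, re-reading only if it changed."""
    stat = os.stat(path)
    signature = (stat.st_mtime_ns, stat.st_size)
    cached = _index_cache.get(path)
    if cached is not None and cached[0] == signature:
        return cached[1]  # cache entry is still valid
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    _index_cache[path] = (signature, lines)
    return lines


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "history.txt")
    with open(index_path, "w", encoding="utf-8") as f:
        f.write("1700000000,a.txt\n")

    first = load_index(index_path)   # read from disk
    second = load_index(index_path)  # served from the cache (same object)

    # Appending a record changes the file size, so the entry becomes stale.
    with open(index_path, "a", encoding="utf-8") as f:
        f.write("1700000600,b.txt\n")
    third = load_index(index_path)   # read from disk again
```

A metadata check like this also covers the case where another process modifies the file, which explicit invalidation inside a single process cannot detect.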

Implementation Details

The core of the optimization is a refactoring of the get_history_snapshot method:

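def get_history_snapshot(self, key):
    # self._history_cache is the in-memory cache (a dict keyed by history key)
    if key not in self._history_cache:
        # Cache miss: load from disk and store the result in the cache
        content = self._load_history_from_disk(key)
        self._history_cache[key] = content
    # Cache hit, or freshly loaded: serve the content from memory
    return self._history_cache[key]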

A cache invalidation mechanism is also needed for when a history record is updated:

def update_history(self, key, content):
    # Write the updated history record to disk first
    self._save_history_to_disk(key, content)
    # Invalidate the cache entry so the next lookup reloads from disk
    self._history_cache.pop(key, None)
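
The two methods above can be combined into a small, self-contained sketch to check the behaviour end to end. The `HistoryStore` class, the one-file-per-key layout, and the `disk_reads` counter are assumptions made for this example; only the method names `get_history_snapshot`, `update_history`, `_load_history_from_disk`, and `_save_history_to_disk` come from the snippets above.

```python
import os
import tempfile


class HistoryStore:
    # Class-level ("static") cache shared by all instances: key -> content.
    # This assumes keys are unique across every HistoryStore instance.
    _history_cache = {}

    def __init__(self, directory):
        self.directory = directory
        self.disk_reads = 0  # counts real disk reads, for demonstration only

    def _path(self, key):
        return os.path.join(self.directory, f"{key}.txt")

    def _load_history_from_disk(self, key):
        self.disk_reads += 1
        with open(self._path(key), encoding="utf-8") as f:
            return f.read()

    def _save_history_to_disk(self, key, content):
        with open(self._path(key), "w", encoding="utf-8") as f:
            f.write(content)

    def get_history_snapshot(self, key):
        if key not in self._history_cache:
            # Cache miss: load from disk and remember the result
            self._history_cache[key] = self._load_history_from_disk(key)
        return self._history_cache[key]

    def update_history(self, key, content):
        self._save_history_to_disk(key, content)
        # Drop the stale entry; the next lookup reloads it from disk
        self._history_cache.pop(key, None)


with tempfile.TemporaryDirectory() as tmp:
    store = HistoryStore(tmp)
    store.update_history("watch-1", "first snapshot")
    first = store.get_history_snapshot("watch-1")   # disk read
    again = store.get_history_snapshot("watch-1")   # served from the cache
    store.update_history("watch-1", "second snapshot")
    latest = store.get_history_snapshot("watch-1")  # disk read after invalidation
    reads = store.disk_reads

print(reads)  # prints 2
```

Three lookups result in only two disk reads, and the lookup after the update returns the new content rather than a stale cached copy. If lookups and updates can run on different threads, access to the shared dictionary would likely need a lock, which this sketch leaves out.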