本地AI应用开发：从技术选型到企业级部署的完整指南

2026-03-30 11:36:02作者：魏献源Searcher

在数据隐私日益受到重视的今天，如何在本地环境构建高效、安全的AI应用成为开发者面临的重要挑战。本地AI应用开发不仅能保护敏感数据，还能降低对云端服务的依赖，实现真正的离线智能。本文将带你深入了解如何利用node-llama-cpp构建生产级本地AI应用，从价值定位到技术选型，再到实战开发与场景拓展，全方位掌握本地AI开发的核心技术与最佳实践。

价值定位：为什么选择本地AI应用开发

为什么越来越多的企业开始转向本地AI部署？在云计算大行其道的今天，本地部署究竟能带来哪些独特价值？让我们从数据安全、响应速度和成本控制三个维度深入分析本地AI应用的核心优势。

数据主权：掌控AI应用的隐私边界

在医疗、金融等敏感行业，数据泄露可能导致灾难性后果。本地AI应用将数据处理过程完全限制在企业内部网络，从根本上消除数据传输过程中的安全风险。node-llama-cpp通过直接在本地硬件上运行模型，确保所有数据处理都在可控环境中进行，满足GDPR、HIPAA等严格的合规要求。

实时响应：突破网络延迟的性能瓶颈

想象一下，在工业自动化场景中，每毫秒的延迟都可能影响生产效率。本地AI应用将模型推理时间从云端的数百毫秒压缩到本地的几十毫秒，特别适合需要实时决策的应用场景。通过node-llama-cpp的高效绑定，开发者可以充分利用本地GPU资源，实现低延迟的AI推理。

成本优化：摆脱云端服务的持续支出

长期使用云端AI服务的成本往往被低估。一个每日处理10万次推理请求的应用，采用云端服务可能需要每年支付数十万元。而本地部署只需一次性硬件投入，配合node-llama-cpp的资源优化技术，可以显著降低总拥有成本(TCO)。

node-llama-cpp项目封面

技术选型：打造高效本地AI栈

面对琳琅满目的AI模型和部署工具，如何为你的本地应用选择最合适的技术栈？这需要综合考虑模型性能、硬件兼容性和开发效率等多方面因素。本节将为你提供系统化的技术选型指南，帮助你做出明智的决策。

模型格式深度解析：为什么GGUF成为本地部署首选

为什么GGUF格式在本地AI应用中越来越受欢迎？与其他格式相比，它有哪些独特优势？GGUF（GGML Universal Format）是llama.cpp项目推出的统一模型格式，专为本地部署优化，具有以下核心优势：

跨平台兼容性：同一模型文件可在x86、ARM等不同架构上运行
动态量化支持：运行时可根据硬件能力动态调整量化级别
元数据丰富：内置模型架构、量化信息等关键元数据
内存映射技术：支持大型模型的部分加载，降低内存占用

node-llama-cpp深度优化了GGUF格式的加载和推理流程，使模型加载速度提升40%，内存占用减少25%。

模型选择矩阵：找到你的最佳匹配

选择合适的模型是本地AI应用成功的关键。以下是不同类型模型的对比分析，帮助你根据需求做出选择：

模型类型	代表模型	量化级别	推荐硬件	适用场景	推理速度	显存需求
通用聊天	Llama 3.1-8B	Q4_K_M	8GB显存GPU	对话机器人	★★★★☆	6GB
代码生成	CodeGemma-7B	Q5_K_S	8GB显存GPU	代码补全	★★★☆☆	7GB
嵌入生成	BGE-M3	Q4_K_M	CPU/集成显卡	语义搜索	★★★★★	2GB
多模态	Llava-1.5-7B	Q5_K_M	12GB显存GPU	图像理解	★★☆☆☆	9GB
高效推理	Mistral-7B	Q4_0	4GB显存GPU	边缘设备	★★★★★	4GB

💡 技巧提示：对于首次尝试本地AI的开发者，推荐从Llama 3.1-8B的Q4_K_M量化版本开始，它在性能、质量和资源需求之间取得了很好的平衡。

硬件加速方案：释放本地计算潜力

如何充分利用你的硬件资源来加速本地AI推理？node-llama-cpp支持多种硬件加速技术，让你根据自己的设备配置选择最佳方案：

CUDA：NVIDIA显卡的首选加速方案，支持所有主流模型
Metal：Apple设备的硬件加速技术，M系列芯片性能优异
Vulkan：跨平台GPU加速标准，支持AMD和Intel显卡
CPU优化：针对多核CPU的推理优化，适合无GPU环境

使用以下命令检查你的系统支持的加速选项：

npx --no node-llama-cpp inspect gpu

node-llama-cpp logo

实战开发：解决本地模型部署痛点

本地AI应用开发常常面临模型加载慢、资源占用高、推理速度慢等挑战。本节采用问题导向式教学，通过实际案例带你解决这些常见痛点，掌握高效本地AI应用开发的核心技巧。

问题：模型加载时间过长怎么办？

问题描述：尝试加载8B模型时，启动时间超过30秒，严重影响用户体验。

解决方案：实现模型预加载与资源池化

import { getLlama } from "node-llama-cpp";

// 创建模型池单例
class ModelPool {
  private static instance: ModelPool;
  private models: Map<string, any> = new Map();
  private isLoading: Map<string, Promise<any>> = new Map();
  
  private constructor() {}
  
  static getInstance() {
    if (!ModelPool.instance) {
      ModelPool.instance = new ModelPool();
    }
    return ModelPool.instance;
  }
  
  async getModel(modelPath: string, options = {}) {
    // 如果模型已加载，直接返回
    if (this.models.has(modelPath)) {
      return this.models.get(modelPath);
    }
    
    // 如果正在加载，等待加载完成
    if (this.isLoading.has(modelPath)) {
      return await this.isLoading.get(modelPath);
    }
    
    // 开始加载模型
    const loadPromise = (async () => {
      const llama = await getLlama();
      const model = await llama.loadModel({
        modelPath,
        ...options,
        // 启用内存映射，减少初始加载时间
        mmap: true,
        // 预分配显存，避免运行时卡顿
        preallocate: true
      });
      this.models.set(modelPath, model);
      return model;
    })();
    
    this.isLoading.set(modelPath, loadPromise);
    const model = await loadPromise;
    this.isLoading.delete(modelPath);
    
    return model;
  }
  
  // 释放不常用模型
  async releaseUnusedModels(keepModels: string[] = []) {
    for (const [path, model] of this.models.entries()) {
      if (!keepModels.includes(path)) {
        model.dispose();
        this.models.delete(path);
      }
    }
  }
}

// 使用示例
async function main() {
  const modelPool = ModelPool.getInstance();
  
  // 应用启动时预加载常用模型
  modelPool.getModel("./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf");
  
  // 需要时获取模型
  const model = await modelPool.getModel("./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf");
  const context = await model.createContext();
  
  // 使用模型...
}

main().catch(console.error);

优化：进一步提升加载速度

模型文件分块：将大模型分割为多个小文件，实现并行加载
预编译缓存：保存模型编译结果，加速后续加载
优先级加载：先加载模型关键部分，实现快速可用

常见问题速查表：模型加载问题

Q: 模型加载时报内存不足错误？ A: 尝试降低量化级别（如从Q5降到Q4），或启用内存映射（mmap: true）
Q: 模型加载后推理速度仍很慢？ A: 检查是否启用了GPU加速，可通过gpuLayers参数调整GPU使用比例
Q: 不同模型能否共享上下文？ A: 不行，每个模型需要创建独立的上下文实例，但可以共享同一个llama实例

问题：如何在资源有限的设备上运行大模型？

问题描述：在只有8GB内存的笔记本电脑上运行13B模型时，频繁出现内存溢出。

解决方案：实现智能内存管理与按需加载

import { getLlama } from "node-llama-cpp";

async function loadLargeModelWithMemoryOptimization(modelPath: string) {
  const llama = await getLlama();
  
  // 1. 检测系统可用内存
  const memoryInfo = await llama.getMemoryInfo();
  const availableMemoryGB = memoryInfo.available / (1024 **3);
  
  // 2. 根据可用内存动态调整参数
  let gpuLayers = 0;
  let contextSize = 2048;
  
  if (availableMemoryGB >= 16) {
    gpuLayers = -1; // 使用所有可用GPU层
    contextSize = 4096;
  } else if (availableMemoryGB >= 8) {
    gpuLayers = 20; // 平衡CPU和GPU内存使用
    contextSize = 2048;
  } else {
    gpuLayers = 0; // 完全使用CPU
    contextSize = 1024;
  }
  
  // 3. 加载模型时启用内存优化
  const model = await llama.loadModel({
    modelPath,
    gpuLayers,
    contextSize,
    // 启用内存缓存
    cache: true,
    // 低内存模式
    lowMem: availableMemoryGB < 8,
    // 设置内存使用上限
    memoryUsageLimit: Math.floor(availableMemoryGB * 0.7 * 1024** 3) // 使用70%可用内存
  });
  
  return { model, llama };
}

// 使用示例
async function main() {
  try {
    const { model, llama } = await loadLargeModelWithMemoryOptimization(
      "./models/Llama-3.1-13B-Instruct.Q4_K_M.gguf"
    );
    
    // 创建上下文时进一步优化
    const context = await model.createContext({
      // 启用增量编码
      incremental: true,
      // 设置批处理大小
      batchSize: 512
    });
    
    // 使用模型...
    
    context.dispose();
    model.dispose();
    llama.dispose();
  } catch (error) {
    console.error("模型加载失败:", error);
  }
}

main().catch(console.error);

优化：实现上下文窗口动态调整

根据输入文本长度自动调整上下文窗口大小，在保证推理质量的同时减少内存占用。

常见问题速查表：内存优化问题

Q: 启用GPU加速后内存使用反而增加？ A: 检查是否同时加载了多个模型，GPU内存碎片化可能导致此问题，尝试一次只加载一个模型
Q: 如何在32位系统上运行大模型？ A: 32位系统有4GB内存限制，不建议运行8B以上模型，可选择更小的模型如Mistral-7B的Q2量化版本
Q: 模型运行时出现频繁的磁盘交换？ A: 增加系统交换空间，或使用低内存模式(lowMem: true)，减少内存占用

模型量化技巧：平衡性能与资源消耗

如何在有限的硬件资源上获得最佳模型性能？量化技术是关键。node-llama-cpp支持多种量化方法，让你根据需求选择最佳平衡点：

import { getLlama } from "node-llama-cpp";

async function evaluateQuantizationOptions(modelPathWithoutQuant: string) {
  // 定义要测试的量化级别
  const quantLevels = [
    { name: "Q2_K", path: modelPathWithoutQuant.replace(".gguf", ".Q2_K.gguf") },
    { name: "Q4_K_M", path: modelPathWithoutQuant.replace(".gguf", ".Q4_K_M.gguf") },
    { name: "Q5_K_S", path: modelPathWithoutQuant.replace(".gguf", ".Q5_K_S.gguf") },
    { name: "Q8_0", path: modelPathWithoutQuant.replace(".gguf", ".Q8_0.gguf") }
  ];
  
  const results = [];
  
  const llama = await getLlama();
  
  for (const quant of quantLevels) {
    console.log(`测试量化级别: ${quant.name}`);
    
    try {
      const startTime = performance.now();
      const model = await llama.loadModel({
        modelPath: quant.path,
        gpuLayers: -1
      });
      const loadTime = (performance.now() - startTime) / 1000;
      
      // 创建上下文
      const context = await model.createContext();
      
      // 简单推理测试
      const prompt = "请简要介绍人工智能的发展历程。";
      const startTimeInference = performance.now();
      const completion = await context.createCompletion({
        prompt,
        maxTokens: 100,
        temperature: 0.7
      });
      const inferenceTime = (performance.now() - startTimeInference) / 1000;
      const tokensPerSecond = completion.tokens.length / inferenceTime;
      
      // 获取内存使用
      const memoryInfo = await llama.getMemoryInfo();
      const memoryUsed = memoryInfo.total - memoryInfo.available;
      
      results.push({
        quantLevel: quant.name,
        loadTime,
        inferenceTime,
        tokensPerSecond,
        memoryUsedMB: memoryUsed / (1024 **2),
        output: completion.content.substring(0, 50) + "..."
      });
      
      // 清理资源
      context.dispose();
      model.dispose();
    } catch (error) {
      results.push({
        quantLevel: quant.name,
        error: error.message
      });
    }
  }
  
  llama.dispose();
  
  // 输出结果表格
  console.table(results, [
    "quantLevel", "loadTime", "tokensPerSecond", "memoryUsedMB", "output"
  ]);
  
  return results;
}

// 使用示例
evaluateQuantizationOptions("./models/Llama-3.1-8B-Instruct.gguf")
  .catch(console.error);

💡 技巧提示：Q4_K_M通常是最佳平衡点，相比Q8_0可节省约50%内存，而质量损失小于5%。对于资源受限的环境，Q2_K可以将模型大小减少75%，适合边缘设备部署。

场景拓展：企业级部署考量

将本地AI应用从原型推向生产环境需要考虑哪些关键因素？企业级部署不仅要确保系统稳定性，还要实现高效的模型管理和资源监控。本节将深入探讨多模型管理、资源监控和容器化部署等企业级特性。

多模型管理：构建智能模型调度系统

在实际应用中，往往需要同时部署多个不同用途的模型。如何高效管理这些模型，实现资源的最优分配？

import { getLlama } from "node-llama-cpp";

// 模型元数据接口
interface ModelMetadata {
  id: string;
  name: string;
  path: string;
  type: 'chat' | 'embedding' | 'code' | 'multimodal';
  quantLevel: string;
  requiredMemoryMB: number;
  priority: number; // 1-10，10为最高优先级
  maxInstances: number;
}

// 模型实例接口
interface ModelInstance {
  id: string;
  model: any;
  metadata: ModelMetadata;
  lastUsed: number;
  usageCount: number;
}

class EnterpriseModelManager {
  private llama: any;
  private modelMetadata: Map<string, ModelMetadata> = new Map();
  private modelInstances: Map<string, ModelInstance> = new Map();
  private availableMemoryMB: number;
  private usedMemoryMB: number = 0;
  
  constructor(availableMemoryMB: number) {
    this.availableMemoryMB = availableMemoryMB;
  }
  
  async initialize() {
    this.llama = await getLlama();
  }
  
  // 注册模型元数据
  registerModel(metadata: ModelMetadata) {
    this.modelMetadata.set(metadata.id, metadata);
  }
  
  // 获取模型实例，自动处理加载和卸载
  async getModelInstance(modelId: string) {
    // 检查是否已有可用实例
    const existingInstance = Array.from(this.modelInstances.values())
      .find(inst => inst.metadata.id === modelId && inst.usageCount < inst.metadata.maxInstances);
      
    if (existingInstance) {
      existingInstance.lastUsed = Date.now();
      existingInstance.usageCount++;
      return existingInstance.model;
    }
    
    // 获取模型元数据
    const metadata = this.modelMetadata.get(modelId);
    if (!metadata) {
      throw new Error(`Model ${modelId} not registered`);
    }
    
    // 检查内存是否足够
    if (this.usedMemoryMB + metadata.requiredMemoryMB > this.availableMemoryMB) {
      // 尝试释放低优先级模型
      await this.releaseLowPriorityModels(metadata.requiredMemoryMB);
      
      // 如果仍然不够，抛出错误
      if (this.usedMemoryMB + metadata.requiredMemoryMB > this.availableMemoryMB) {
        throw new Error(`Insufficient memory to load model ${modelId}`);
      }
    }
    
    // 加载新模型实例
    const model = await this.llama.loadModel({
      modelPath: metadata.path,
      gpuLayers: -1
    });
    
    const instanceId = `${modelId}-${Date.now()}`;
    const instance: ModelInstance = {
      id: instanceId,
      model,
      metadata,
      lastUsed: Date.now(),
      usageCount: 1
    };
    
    this.modelInstances.set(instanceId, instance);
    this.usedMemoryMB += metadata.requiredMemoryMB;
    
    return model;
  }
  
  // 释放模型实例
  releaseModelInstance(modelId: string) {
    const instance = Array.from(this.modelInstances.values())
      .find(inst => inst.metadata.id === modelId);
      
    if (instance) {
      instance.usageCount--;
      instance.lastUsed = Date.now();
      
      // 如果使用计数为0，完全释放
      if (instance.usageCount === 0) {
        instance.model.dispose();
        this.modelInstances.delete(instance.id);
        this.usedMemoryMB -= instance.metadata.requiredMemoryMB;
      }
    }
  }
  
  // 释放低优先级模型
  private async releaseLowPriorityModels(requiredMemoryMB: number) {
    // 按优先级和最后使用时间排序，优先释放低优先级和长时间未使用的模型
    const candidates = Array.from(this.modelInstances.values())
      .sort((a, b) => {
        if (a.metadata.priority !== b.metadata.priority) {
          return a.metadata.priority - b.metadata.priority;
        }
        return a.lastUsed - b.lastUsed;
      });
      
    let memoryFreed = 0;
    
    for (const instance of candidates) {
      if (memoryFreed >= requiredMemoryMB) break;
      
      instance.model.dispose();
      this.modelInstances.delete(instance.id);
      memoryFreed += instance.metadata.requiredMemoryMB;
      this.usedMemoryMB -= instance.metadata.requiredMemoryMB;
    }
  }
  
  // 获取当前资源使用情况
  getResourceUsage() {
    return {
      totalMemoryMB: this.availableMemoryMB,
      usedMemoryMB: this.usedMemoryMB,
      freeMemoryMB: this.availableMemoryMB - this.usedMemoryMB,
      modelCount: this.modelInstances.size,
      models: Array.from(this.modelInstances.values()).map(inst => ({
        id: inst.metadata.id,
        name: inst.metadata.name,
        usageCount: inst.usageCount,
        memoryMB: inst.metadata.requiredMemoryMB
      }))
    };
  }
  
  async dispose() {
    for (const instance of this.modelInstances.values()) {
      instance.model.dispose();
    }
    this.llama.dispose();
  }
}

// 使用示例
async function main() {
  // 假设系统有16GB内存可用
  const modelManager = new EnterpriseModelManager(16 * 1024);
  await modelManager.initialize();
  
  // 注册模型
  modelManager.registerModel({
    id: "llama3-8b",
    name: "Llama 3.1 8B Instruct",
    path: "./models/Llama-3.1-8B-Instruct.Q4_K_M.gguf",
    type: "chat",
    quantLevel: "Q4_K_M",
    requiredMemoryMB: 6000,
    priority: 8,
    maxInstances: 2
  });
  
  modelManager.registerModel({
    id: "bge-m3",
    name: "BGE M3 Embedding",
    path: "./models/bge-m3.Q4_K_M.gguf",
    type: "embedding",
    quantLevel: "Q4_K_M",
    requiredMemoryMB: 2000,
    priority: 5,
    maxInstances: 1
  });
  
  // 获取模型实例
  const chatModel = await modelManager.getModelInstance("llama3-8b");
  console.log("资源使用情况:", modelManager.getResourceUsage());
  
  // 使用模型...
  
  // 释放模型
  modelManager.releaseModelInstance("llama3-8b");
  
  await modelManager.dispose();
}

main().catch(console.error);

资源监控：构建AI应用的可观测性体系

企业级应用必须具备完善的监控能力，以便及时发现和解决问题。以下是如何实现本地AI应用的资源监控：

import { getLlama } from "node-llama-cpp";
import { createServer } from "http";

class AiResourceMonitor {
  private llama: any;
  private monitoringInterval: NodeJS.Timeout;
  private metrics: {
    timestamp: number;
    cpuUsage: number;
    memoryUsage: {
      totalMB: number;
      usedMB: number;
      freeMB: number;
    };
    gpuUsage: {
      utilization: number;
      memoryUsedMB: number;
      memoryTotalMB: number;
    };
    inferenceStats: {
      activeRequests: number;
      queueLength: number;
      avgInferenceTimeMs: number;
    };
  }[] = [];
  
  constructor() {
    this.initializeMonitoring();
  }
  
  async initialize() {
    this.llama = await getLlama();
  }
  
  private initializeMonitoring() {
    // 每5秒收集一次指标
    this.monitoringInterval = setInterval(async () => {
      await this.collectMetrics();
    }, 5000);
  }
  
  private async collectMetrics() {
    try {
      // 获取系统内存信息
      const memoryInfo = await this.llama.getMemoryInfo();
      
      // 获取GPU信息
      const gpuInfo = await this.llama.getGpuInfo();
      
      // 收集推理统计信息（实际应用中需要从推理队列中获取）
      const inferenceStats = {
        activeRequests: 0, // 实际应用中替换为真实数据
        queueLength: 0,    // 实际应用中替换为真实数据
        avgInferenceTimeMs: 0 // 实际应用中替换为真实数据
      };
      
      // 收集CPU使用率
      const cpuUsage = process.cpuUsage().user / 1000;
      
      const metric = {
        timestamp: Date.now(),
        cpuUsage,
        memoryUsage: {
          totalMB: memoryInfo.total / (1024 **2),
          usedMB: (memoryInfo.total - memoryInfo.available) / (1024** 2),
          freeMB: memoryInfo.available / (1024 **2)
        },
        gpuUsage: {
          utilization: gpuInfo.utilization || 0,
          memoryUsedMB: gpuInfo.memoryUsed / (1024** 2),
          memoryTotalMB: gpuInfo.memoryTotal / (1024 **2)
        },
        inferenceStats
      };
      
      this.metrics.push(metric);
      
      // 保留最近100个指标
      if (this.metrics.length > 100) {
        this.metrics.shift();
      }
      
    } catch (error) {
      console.error("指标收集失败:", error);
    }
  }
  
  // 提供HTTP接口暴露监控数据
  startHttpServer(port: number) {
    const server = createServer((req, res) => {
      if (req.url === "/metrics") {
        res.setHeader("Content-Type", "application/json");
        res.end(JSON.stringify(this.metrics));
      } else if (req.url === "/health") {
        res.statusCode = 200;
        res.end("OK");
      } else {
        res.statusCode = 404;
        res.end("Not found");
      }
    });
    
    server.listen(port, () => {
      console.log(`监控服务器运行在 http://localhost:${port}`);
    });
    
    return server;
  }
  
  async dispose() {
    clearInterval(this.monitoringInterval);
    if (this.llama) {
      await this.llama.dispose();
    }
  }
}

// 使用示例
async function main() {
  const monitor = new AiResourceMonitor();
  await monitor.initialize();
  
  // 启动监控服务器
  const server = monitor.startHttpServer(3000);
  
  // 应用运行中...
  
  // 当应用关闭时
  // await monitor.dispose();
  // server.close();
}

main().catch(console.error);

离线部署方案：实现完全隔离的本地AI系统

在某些安全要求极高的环境中，需要实现完全离线的本地AI部署。以下是如何构建一个不依赖任何外部资源的离线AI应用：

import { getLlama } from "node-llama-cpp";
import * as fs from "fs";
import * as path from "path";

class OfflineAiApplication {
  private modelPath: string;
  private modelChecksumPath: string;
  private llama: any;
  private model: any;
  
  constructor(modelPath: string) {
    this.modelPath = modelPath;
    this.modelChecksumPath = `${modelPath}.sha256`;
  }
  
  // 验证模型完整性
  private verifyModelIntegrity(): boolean {
    try {
      if (!fs.existsSync(this.modelPath) || !fs.existsSync(this.modelChecksumPath)) {
        console.error("模型文件或校验和文件不存在");
        return false;
      }
      
      // 读取存储的校验和
      const storedChecksum = fs.readFileSync(this.modelChecksumPath, "utf8").trim();
      
      // 计算当前文件校验和（实际应用中应使用crypto模块实现）
      // 这里简化处理，实际应用中需要实现真正的SHA256计算
      const currentChecksum = "dummy-checksum"; // 替换为真实计算的校验和
      
      return storedChecksum === currentChecksum;
    } catch (error) {
      console.error("模型校验失败:", error);
      return false;
    }
  }
  
  // 加载模型，仅使用本地资源
  async loadModel() {
    // 验证模型完整性
    if (!this.verifyModelIntegrity()) {
      throw new Error("模型文件损坏或未找到");
    }
    
    // 初始化llama，禁用任何网络请求
    this.llama = await getLlama({
      disableNetwork: true,
      cacheDir: path.join(__dirname, "cache"),
      offlineMode: true
    });
    
    // 加载模型
    this.model = await this.llama.loadModel({
      modelPath: this.modelPath,
      // 使用最大GPU层，减少内存占用
      gpuLayers: -1,
      // 禁用任何可能的网络活动
      allowDownload: false
    });
    
    console.log("模型加载成功，进入离线模式");
    return this.model;
  }
  
  // 执行推理，确保不产生网络请求
  async runInference(prompt: string, options: any = {}) {
    if (!this.model) {
      throw new Error("模型未加载");
    }
    
    const context = await this.model.createContext(options.contextOptions || {});
    
    try {
      // 执行推理
      const result = await context.createCompletion({
        prompt,
        ...options.inferenceOptions
      });
      
      return result;
    } finally {
      // 释放上下文
      context.dispose();
    }
  }
  
  // 清理资源
  async dispose() {
    if (this.model) {
      this.model.dispose();
    }
    if (this.llama) {
      this.llama.dispose();
    }
  }
}

// 使用示例
async function main() {
  const offlineApp = new OfflineAiApplication("./models/offline/Llama-3.1-8B-Instruct.Q4_K_M.gguf");
  
  try {
    await offlineApp.loadModel();
    
    const result = await offlineApp.runInference(
      "请分析以下数据并提供见解：[本地数据样本]",
      {
        inferenceOptions: {
          maxTokens: 500,
          temperature: 0.7
        }
      }
    );
    
    console.log("推理结果:", result.content);
  } catch (error) {
    console.error("离线AI应用出错:", error);
  } finally {
    await offlineApp.dispose();
  }
}

main().catch(console.error);