DeepSeek-R1模型W8A16量化实践与问题解决

2025-04-28 04:14:57作者：尤峻淳Whitney

在大型语言模型的实际部署中，模型量化是降低计算资源需求、提高推理效率的重要手段。本文将分享在DeepSeek-R1模型上实施W8A16（权重8位、激活16位）量化的完整过程，以及遇到的技术问题及其解决方案。

量化流程概述

DeepSeek-R1是一个参数规模达到671B的MoE（混合专家）模型。我们的量化目标是从原始FP8版本转换为W8A16格式，主要步骤如下：

从官方仓库获取FP8版本的DeepSeek-R1模型
使用专用转换脚本将FP8转换为BF16格式作为中间步骤
应用llm-compressor工具进行W8A16量化

量化实施细节

量化过程使用了llm-compressor工具库中的量化模块。核心代码配置如下：

from transformers import AutoTokenizer
from modeling_deepseek import DeepseekV3ForCausalLM
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "/data/models/DeepSeek-R1-bf16/DeepSeek-R1-bf16/"
OUTPUT_DIR = "/data/models/DeepSeek-R1-w8a16"

model = DeepseekV3ForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

recipe = QuantizationModifier(
    targets="Linear", 
    scheme="W8A16", 
    ignore=["lm_head", "re:.*mlp.gate$"]
)

oneshot(
    model=model,
    recipe=recipe,
    tokenizer=AutoTokenizer.from_pretrained(MODEL_ID),
    output_dir=OUTPUT_DIR,
)

这段代码实现了对模型中所有Linear层的W8A16量化，同时排除了lm_head和特定模式的MLP门控层。

关键技术问题与解决

在量化后的模型部署阶段，我们遇到了一个关键错误：

AttributeError: Layer 'ColumnParallelLinear(in_features=512, output_features=2048, bias=False, tp_size=16, gather_output=False)' has neither weight nor qweight

经过分析，发现问题根源在于vLLM框架的权重获取逻辑不兼容量化后的权重命名方式。量化后的模型权重被命名为"weight_packed"，而vLLM框架默认只检查"weight"和"qweight"属性。

解决方案是修改vLLM框架中的权重获取函数，增加对"weight_packed"属性的支持：

def get_layer_weight(layer):
    if hasattr(layer, "weight"):
        return layer.weight
    elif hasattr(layer, "qweight"):
        return layer.qweight
    elif hasattr(layer, "weight_packed"):
        return layer.weight_packed
    else:
        raise AttributeError(
            f"Layer '{layer}' has neither weight nor qweight")

分布式部署实践

在2节点、每节点8块A100 GPU的环境下，我们采用Ray框架实现多节点分布式部署。关键部署步骤如下：

主节点启动Ray服务：

NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME="ens81f0" GLOO_SOCKET_IFNAME="ens81f0" ray start --head --dashboard-host 0.0.0.0

工作节点加入集群：

NCCL_IB_DISABLE=1 NCCL_SOCKET_IFNAME="ens81f0" GLOO_SOCKET_IFNAME="ens81f0" ray start --address='<your-ip>:<port>'

设置环境变量：

export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')

启动vLLM服务：

NCCL_IB_DISABLE=1 NCCL_DEBUG=INFO python -m vllm.entrypoints.openai.api_server \
--model /data/models/DeepSeek-R1-w8a16 \
--trust-remote-code \
--served-model-name deepseek-r1-w8a16 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--gpu-memory-utilization 0.85 \
--uvicorn-log-level debug \
--max-model-len 16000 \
--host 0.0.0.0 \
--port 11000 \
--dtype float16