RuntimeError: RuntimeError when making fake tensor call

快速结论：此报错发生在 vLLM 使用 DiffusionGemma 模型启动多 GPU 服务时，tensor-parallel 策略下采样器中的 soft-embedding 计算因词汇表维度不匹配而崩溃。优先排查是否启用了 TP>1 且 vLLM 版本低于 0.23.1rc1 的特定 nightly 构建。

问题场景

用户在 vLLM 中部署 nvidia/diffusiongemma-26B-A4B-IT-NVFP4 或 aidendle94/diffusiongemma-26B-A4B-it-INT8-dynamic 模型时，使用 tensor-parallel size >1 或 pipeline-parallel >1 启动服务。单 GPU 运行正常，但多 GPU 配置在引擎预热阶段崩溃。官方文档仅描述单 80 GiB GPU（H100/H200）部署，未提示 TP/PP 不兼容。

报错原文

RuntimeError when making fake tensor call
  call_function matmul(
      FakeTensor(size=(s21, (s88//s21), s3), dtype=torch.bfloat16),
      Parameter(FakeTensor(size=(131072, 2816), dtype=torch.bfloat16))
  ): got RuntimeError('a and b must have same reduction dim, but got [s21*((s88//s21)), s3] X [131072, 2816].')

或类似（取决于词汇表大小）：

RuntimeError: a and b must have same reduction dim, but got [s88, s3] X [65536, 2816].

PP 模式下额外报错：

assert sampled_token_ids.dtype == torch.int64
AssertionError

原因分析

可能原因：DiffusionGemma 的自条件化软嵌入（self-conditioning soft-embedding）在采样器中计算 torch.matmul(probs, embed_weight) 时，probs 基于完整词汇表大小（get_vocab_size()），但 embed_weight 来自 embed_tokens.weight（VocabParallelEmbedding 分片），在 TP>1 时分片尺寸为 [vocab/tp, hidden]，导致矩阵乘法约减维度不匹配。PP 模式下则因扩散采样器输出为 float 类型而非 int64，与 PP 广播断言冲突。

环境排查

vLLM 版本：确认是否低于 0.23.1rc1 之后的特定 nightly（如 0.23.1rc1.dev20+g0d8097964 包含该 bug）
PyTorch 版本：如 2.11.0+cu130
CUDA 版本：如 cu130（取决于 wheel）
GPU 配置：检查 TP/PP 参数是否 >1，以及显卡显存是否不足以单卡加载
模型量化类型：NVFP4 或 INT8-dynamic，确认是否使用官方适配格式
启动参数：检查 --tensor-parallel-size、--pipeline-parallel-size、--diffusion-config 设置

解决步骤

升级 vLLM 至最新 nightly 或包含修复的版本：Issue 讨论指出 PR #46177 已合并至 nightly（2026年6月26日合并），该修复在采样器前将分片的 embed_tokens.weight 全收集（all-gather）为完整权重再传入 DiffusionSampler。可尝试从 wheels.vllm.ai 获取最新 wheel 或从源码编译对应 commit。
若仍使用旧版本：可尝试临时限制为 TP=1 和 PP=1（单 GPU），或使用显存更大的单 GPU 部署。多 GPU 场景下 TP>1 问题已修复，但 PP>1 问题在讨论时仍存在。
针对 PP>1 的额外步骤：目前 PP 模式下采样器输出类型问题尚未被修复，除非 vLLM 后续更新处理 pp_utils.broadcast 的 int64 断言。可关注 https://github.com/vllm-project/vllm/pull/46212 或相关后续 PR。
如使用 Docker 镜像：检查镜像标签是否包含修复。讨论中提到的 Docker 镜像 vllm/vllm-openai:gemma-x86_64-cu129 可能不包含该修复。

验证方法

以 TP=2 启动命令（参考 Issue 复现步骤）：

vllm serve google/diffusiongemma-26B-A4B-it \
    --tensor-parallel-size 2 \
    --max-num-seqs 4 \
    --hf-overrides '{"diffusion_sampler":"entropy_bound","diffusion_entropy_bound":0.1}' \
    --diffusion-config '{"canvas_length": 256}'

若服务启动成功且无上述 matmul 报错，且能正常处理推理请求，则 TP 问题已解决。PP 问题需等待后续修复。

参考来源

vllm-project/vllm #45719

AI 工具推荐

想把多个 AI 模型放在一个入口？

GamsGo AI 集成 ChatGPT、DeepSeek、Gemini、Claude、Midjourney、Veo 等常用模型，适合写作、绘图、视频和日常 AI 工作流。

了解 GamsGo AI

推广链接：通过此链接购买，我可能获得佣金，不影响你的价格。

RuntimeError: RuntimeError when making fake tensor call