[Bug]: test_flashinfer_cutlass_mxfp4_fused_moe accuracy mismatch on H20 (sm90) — 89% mismatch vs 20% threshold

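Quick summary: this failure occurs in the OCP MXFP4 MoE unit tests of vLLM 0.23.0. On an NVIDIA H20 (sm90), all 8 parametrized cases report a mismatch rate of about 89% (0.8997), far above the 20% threshold. Investigation shows that this is not a bug in the flashinfer kernel itself: the test's weight and scale-factor preprocessing is outdated for sm90. Switching to the dedicated interleave_moe_weights_for_sm90_mixed_gemm and interleave_moe_scales_for_sm90_mixed_gemm interfaces fixes it.

Problem scenario

The failure is triggered when running vLLM's built-in unit test test_flashinfer_cutlass_mxfp4_fused_moe on a single NVIDIA H20 (sm90). The test is collected under the gate HOPPER_MXFP4_BF16_AVAILABLE (sm90 and flashinfer present). The reporter installed vLLM 0.23.0 from the official PyPI wheel, with transformers 4.57.1 or 5.12.1, CUDA 13.0, and PyTorch 2.11.0+cu130.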
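
To make the gate condition concrete, here is a minimal sketch of a check in the same spirit (sm90 plus an importable flashinfer). The function name hopper_mxfp4_gate and its arguments are hypothetical and exist only for illustration; the actual definition of HOPPER_MXFP4_BF16_AVAILABLE in vLLM may differ.

```python
import importlib.util


def hopper_mxfp4_gate(capability, flashinfer_installed=None):
    """Return True when the device is sm90 and flashinfer is importable.

    Hypothetical helper for illustration; not vLLM's actual gate.
    `capability` is a (major, minor) tuple, or None when no GPU is present.
    """
    if flashinfer_installed is None:
        # find_spec returns None when the package cannot be located.
        flashinfer_installed = importlib.util.find_spec("flashinfer") is not None
    return (
        capability is not None
        and tuple(capability) == (9, 0)
        and bool(flashinfer_installed)
    )


# On a real machine the tuple would typically come from
# torch.cuda.get_device_capability(); here it is passed in directly.
print(hopper_mxfp4_gate((9, 0), flashinfer_installed=True))
print(hopper_mxfp4_gate((8, 0), flashinfer_installed=True))
```

Both H20 and H100 report compute capability 9.0, so a gate of this shape collects the test on either card.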

Error message

E   Exception: Mismatch percentage is 0.8997 for rtol 0.3 (threshold: 0.2000)
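
The message can be read as "the fraction of elements outside the relative tolerance exceeded the allowed fraction". The following is a minimal pure-Python sketch of such a check, assuming a per-element criterion of |actual - expected| > atol + rtol * |expected|. The function names and the atol parameter are assumptions, and the helper vLLM's test actually uses may compute this differently.

```python
def mismatch_fraction(actual, expected, rtol, atol=0.0):
    """Fraction of elements outside the tolerance band (assumed criterion)."""
    if len(actual) != len(expected) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    bad = sum(
        1
        for a, e in zip(actual, expected)
        if abs(a - e) > atol + rtol * abs(e)
    )
    return bad / len(actual)


def check_accuracy(actual, expected, rtol=0.3, threshold=0.2):
    """Raise when more than `threshold` of the elements mismatch."""
    frac = mismatch_fraction(actual, expected, rtol)
    if frac > threshold:
        raise Exception(
            f"Mismatch percentage is {frac:.4f} for rtol {rtol} "
            f"(threshold: {threshold:.4f})"
        )


# 9 of 10 elements are off by a factor of 2, so the check raises.
try:
    check_accuracy([1.0] + [2.0] * 9, [1.0] * 10)
except Exception as exc:
    print(exc)
```

Under this reading, a value of 0.8997 means roughly nine out of ten output elements fall outside the 30% relative tolerance, which points to a systematic layout or scaling problem rather than ordinary quantization noise.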

All 8 parametrized test cases fail:

test_flashinfer_cutlass_mxfp4_fused_moe[1.0-1.0-None-3072-3072-1-32-1]
test_flashinfer_cutlass_mxfp4_fused_moe[1.0-1.0-None-3072-3072-1-32-4]
test_flashinfer_cutlass_mxfp4_fused_moe[1.0-1.0-None-3072-3072-128-32-1]
test_flashinfer_cutlass_mxfp4_fused_moe[1.0-1.0-None-3072-3072-128-32-4]
test_flashinfer_cutlass_mxfp4_fused_moe[1.702-1.0-7.0-3072-3072-1-32-1]
test_flashinfer_cutlass_mxfp4_fused_moe[1.702-1.0-7.0-3072-3072-1-32-4]
test_flashinfer_cutlass_mxfp4_fused_moe[1.702-1.0-7.0-3072-3072-128-32-1]
test_flashinfer_cutlass_mxfp4_fused_moe[1.702-1.0-7.0-3072-3072-128-32-4]

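Cause analysis

Investigation confirmed that the problem is not in the flashinfer kernel itself (the kernel computes correctly on sm90). The way the test preprocesses the weights and scale factors is outdated.

Root cause: the test hand-rolls custom interleave logic such as _interleave_scales_lastdim_by4 instead of using the dedicated interfaces vLLM provides for sm90. After switching to interleave_moe_weights_for_sm90_mixed_gemm(quant_type="fp4") and interleave_moe_scales_for_sm90_mixed_gemm, all tests pass.
Not the cause: the transformers version is irrelevant (4.57.1 and 5.12.1 were verified to behave identically), and the flashinfer version is not the direct problem either (0.6.12 runs correctly once the preprocessing is right).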
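
To see why a layout mismatch alone can produce such a high mismatch rate, consider this toy pure-Python sketch of block-scaled dequantization. It assumes the OCP MX convention that 32 elements share one power-of-two scale. The reordering in transpose_groups is a hypothetical stand-in for "a layout the kernel does not expect"; it is not the layout used by _interleave_scales_lastdim_by4 or by the sm90 interfaces.

```python
BLOCK = 32  # MX formats: 32 elements share one scale


def dequantize(values, scale_exps):
    """Multiply each element by 2**exponent of the block it belongs to."""
    return [v * 2.0 ** scale_exps[i // BLOCK] for i, v in enumerate(values)]


def transpose_groups(scales, group=4):
    """Hypothetical reordering: view scales as rows of `group`, then transpose."""
    rows = [scales[i:i + group] for i in range(0, len(scales), group)]
    return [row[c] for c in range(group) for row in rows]


# Non-zero E2M1 magnitudes, repeated to fill 8 blocks of 32 elements.
magnitudes = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
values = [magnitudes[i % len(magnitudes)] for i in range(8 * BLOCK)]
scale_exps = list(range(8))  # one exponent per block

reference = dequantize(values, scale_exps)
wrong_layout = dequantize(values, transpose_groups(scale_exps))

# Same criterion as the test's error message, with rtol = 0.3.
bad = sum(1 for a, e in zip(wrong_layout, reference) if abs(a - e) > 0.3 * abs(e))
fraction = bad / len(reference)
print(f"mismatch fraction: {fraction:.4f}")
```

The quantized values and the scale values are all correct here; only the order in which the scales are consumed differs, and most elements still end up off by a factor of two or more. That matches the pattern in this issue: a correct kernel fed inputs in the wrong layout.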

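Environment

vLLM version: 0.23.0 (PyPI release); also reproducible on vllm-project/vllm@main
flashinfer: the version that ships with the vLLM 0.23.0 wheel
transformers: 4.57.1 and 5.12.1 are both affected
GPU: NVIDIA H20 (sm_90), single card
CUDA version: 13.0
PyTorch version: 2.11.0+cu130
Other hardware: also reproducible on 1×H100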

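Resolution steps

Locate the test function test_flashinfer_cutlass_mxfp4_fused_moe in the test file tests/kernels/moe/test_ocp_mx_moe.py.
Replace the hand-written interleave implementation (such as _interleave_scales_lastdim_by4) in the weight preprocessing logic with the dedicated interfaces provided by vLLM:
- Weight interleave: use interleave_moe_weights_for_sm90_mixed_gemm(quant_type="fp4")
- Scale-factor interleave: use interleave_moe_scales_for_sm90_mixed_gemm
When submitting the change, reference the related fix PR (for example #46915, which corresponds to this issue).
(Optional) If other environment differences are a concern, consider adding a version or sm restriction to the test gate. Based on the current analysis, though, changing the weight preprocessing alone resolves the problem.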
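
The replacement step can be sketched as below. Because the import path and full signatures of the two sm90 interfaces are not given in the source, the sketch takes them as arguments instead of importing them. It assumes the weight interface accepts a tensor plus a quant_type keyword and the scale interface accepts a tensor; check this against the real definitions before reuse. The wrapper name prepare_sm90_mxfp4_inputs and the dictionary keys are hypothetical.

```python
def prepare_sm90_mxfp4_inputs(weights, scales, interleave_weights, interleave_scales):
    """Route every weight and scale tensor through the sm90 interfaces.

    `weights` and `scales` map a name to a tensor. `interleave_weights` and
    `interleave_scales` stand in for interleave_moe_weights_for_sm90_mixed_gemm
    and interleave_moe_scales_for_sm90_mixed_gemm (assumed call shapes).
    """
    prepared_weights = {
        name: interleave_weights(w, quant_type="fp4") for name, w in weights.items()
    }
    prepared_scales = {name: interleave_scales(s) for name, s in scales.items()}
    return prepared_weights, prepared_scales


# Stand-in callables so the sketch runs without a GPU or vLLM installed.
def fake_interleave_weights(w, quant_type):
    return ("weights", quant_type, w)


def fake_interleave_scales(s):
    return ("scales", s)


w, s = prepare_sm90_mxfp4_inputs(
    {"gate_up": [1, 2], "down": [3, 4]},
    {"gate_up": [5], "down": [6]},
    fake_interleave_weights,
    fake_interleave_scales,
)
print(w["gate_up"], s["down"])
```

Passing the interfaces in as callables keeps the hand-written interleave out of the test body entirely, so a later layout change only has to be made in one place.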

Verification

Re-run the test command on an H20 or H100 (sm90):

pytest -v tests/kernels/moe/test_ocp_mx_moe.py::test_flashinfer_cutlass_mxfp4_fused_moe

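If all 8 cases pass (instead of the earlier 8 failures), the problem is fixed. Note: the 89% figure in the original issue is the per-case mismatch rate taken from the error message; the pass/fail tally is 8 failures before the fix and 8 passes after.

Reference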

vllm-project/vllm #46585
