[Intel XPU] `gemv_4bit` with NF4 has huge `bfloat16` error compared with `float16` on Intel Arc A770

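Quick summary: this issue appears on the Intel Arc A770 when bitsandbytes NF4 quantization is combined with the gemv_4bit operation. In bfloat16 the numerical error is more than three orders of magnitude larger than in float16. First check whether the SYCL kernel path (rather than the Triton path) is in use. If it is, try temporarily renaming libbitsandbytes_xpu.dll to force the Triton path.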

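Problem scenario

The user runs Windows 11 (25H2 26200) on an Intel Arc A770 with bitsandbytes 0.49.2, PyTorch 2.11.0+xpu, and triton-xpu 3.7.0. The problem is triggered by calling bitsandbytes.functional.gemv_4bit(...) or by running inference on a model with NF4-quantized weights and bnb_4bit_compute_dtype=torch.bfloat16. Results are normal on the Triton path. Only the SYCL kernel path (the GEMV inference kernel) is affected.

Original output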

torch=2.11.0+xpu
bitsandbytes=0.49.2
device=xpu:0
device_name=Intel(R) Arc(TM) A770 Graphics
torch.float16: gemv_4bit mae=0.008388, max_error=0.041985
torch.bfloat16: gemv_4bit mae=12.890043, max_error=53.826134
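
The `mae` and `max_error` figures above are most naturally read as the mean and maximum absolute difference between the gemv_4bit result and a higher-precision reference. The issue's own script is not reproduced here, so the following is only a minimal sketch of that assumed definition; `error_stats` is a hypothetical helper name, not part of bitsandbytes.

```python
from typing import Sequence


def error_stats(result: Sequence[float], reference: Sequence[float]) -> tuple[float, float]:
    """Return (mean absolute error, max absolute error) of result vs. reference.

    Assumed definition of the mae / max_error columns; the issue's script may differ.
    """
    if len(result) != len(reference) or len(result) == 0:
        raise ValueError("result and reference must be non-empty and the same length")
    abs_err = [abs(r - t) for r, t in zip(result, reference)]
    return sum(abs_err) / len(abs_err), max(abs_err)


# Tiny worked example: the errors are 0.0, 0.5 and 1.0.
mae, max_error = error_stats([1.0, 2.0, 4.0], [1.0, 2.5, 3.0])
print(f"mae={mae:.6f}, max_error={max_error:.6f}")
```

On real tensors the same two numbers would come from an element-wise absolute difference followed by a mean and a max.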

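Cause analysis

Possible cause: in the SYCL GEMV kernel used on the Intel Arc A770, the multiplication is carried out in bfloat16 and the result is converted to float32 only just before accumulation, which would cost precision. For comparison, an NVIDIA RTX 4090 (CUDA kernel) gives a bfloat16 error of only mae=0.068209, max_error=0.342186, which is in the normal range, and the Triton path gives a normal bfloat16 error for the same weights. A developer verified that the Arc B60 and the Data Center GPU Max 1550 do not show the problem, so it may be limited to the Intel Arc A770 on particular Windows setups (a difference in the SYCL kernel implementation).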
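
To make the suspected mechanism concrete, here is a small pure-Python sketch of what happens when each product is rounded to bfloat16 before it is accumulated, compared with keeping the product at full width. It is not the SYCL kernel; `to_bf16` is a hypothetical helper that emulates bfloat16 rounding by truncating a float32 bit pattern with round-to-nearest-even. The example shows only the direction of the effect and does not reproduce the magnitude reported in the issue.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Emulation only: NaN, infinity and overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the low 16 bits
    bits &= 0xFFFF0000                   # keep sign, exponent and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_bf16_products(a, b):
    """Each product is rounded to bfloat16, then accumulated in a wide type."""
    return sum(to_bf16(to_bf16(x) * to_bf16(y)) for x, y in zip(a, b))


def dot_wide_products(a, b):
    """Inputs are bfloat16, but each product keeps its full width before accumulation."""
    return sum(to_bf16(x) * to_bf16(y) for x, y in zip(a, b))


n = 4096
a = [1.0078125] * n  # 1 + 2**-7, exactly representable in bfloat16
b = [1.0078125] * n

narrow = dot_bf16_products(a, b)
wide = dot_wide_products(a, b)
print(narrow, wide, wide - narrow)  # 4160.0 4160.25 0.25
```

Every product loses the same 2**-14 term when rounded to bfloat16, so the error is systematic and grows linearly with the length of the dot product instead of averaging out.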

Environment

OS: Windows 11 Pro 25H2 26200
GPU: Intel(R) Arc(TM) A770 Graphics
GPU driver: 32.0.101.8509
Python: 3.13.11
PyTorch: 2.11.0+xpu
bitsandbytes: 0.49.2
triton-xpu: 3.7.0
intel-sycl-rt / intel-opencl-rt / dpcpp-cpp-rt: 2025.3.2
numpy: 2.4.3

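Resolution steps

Confirm the current GEMV path: run the reproduction script from the issue and check whether the bfloat16 error is very large (mae around 12.89).
Force the Triton path: find libbitsandbytes_xpu.dll in the bitsandbytes installation directory and temporarily rename it to libbitsandbytes_xpu.dll.bak. Restart the Python environment and run the reproduction script again.
Compare the output: if the bfloat16 error is now back in the normal range (for example mae=0.048875, max_error=0.245483), the problem is indeed caused by the SYCL kernel path.
Follow-up:
- Until the SYCL kernel is fixed upstream, use bnb_4bit_compute_dtype=torch.float16 on the Intel Arc A770 (float16 results have been verified as correct).
- If bfloat16 is required, the renaming method above can temporarily force the Triton path. Note that this path may not cover every feature, and compatibility may change in future updates.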
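
The rename step can be scripted. The sketch below assumes the DLL lives somewhere under the installed bitsandbytes package directory; `find_package_dir` and `toggle_native_lib` are hypothetical helper names, and only standard-library calls are used. Run it in a fresh Python process, before bitsandbytes is imported, because Windows will not rename a DLL that is currently loaded.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional

LIB_NAME = "libbitsandbytes_xpu.dll"  # file name taken from the steps above


def find_package_dir(package: str = "bitsandbytes") -> Optional[Path]:
    """Locate an installed package's directory without importing it."""
    spec = find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def toggle_native_lib(package_dir: Path, restore: bool = False) -> Optional[Path]:
    """Rename the DLL to *.bak, or back again when restore=True.

    Returns the new path, or None if the expected file was not found.
    """
    src_name = LIB_NAME + ".bak" if restore else LIB_NAME
    dst_name = LIB_NAME if restore else LIB_NAME + ".bak"
    for src in package_dir.rglob(src_name):
        dst = src.with_name(dst_name)
        src.rename(dst)
        return dst
    return None


# Usage (commented out so that nothing is renamed by accident):
# pkg_dir = find_package_dir()
# if pkg_dir is not None:
#     print(toggle_native_lib(pkg_dir))                 # force the Triton path
#     print(toggle_native_lib(pkg_dir, restore=True))   # undo
```

Keeping the original file as a `.bak` copy next to its old location makes the workaround easy to undo once a fixed build is available.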

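Verification

Run the reproduction code from the issue and look at the mae and max_error values on the torch.bfloat16 line. A normal result is roughly mae<0.1 and max_error<0.5 (compare with the Triton path result). If the values drop into that range, the problem is resolved.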
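
This check can be automated by parsing the script's output. The sketch below assumes the output lines keep the format shown in the original output above and uses the rough thresholds from this section; `check_output` is a hypothetical helper name.

```python
import re

# Matches lines such as "torch.bfloat16: gemv_4bit mae=12.890043, max_error=53.826134"
LINE_RE = re.compile(
    r"^(?P<dtype>torch\.\w+): gemv_4bit mae=(?P<mae>[\d.]+), max_error=(?P<max>[\d.]+)$"
)


def check_output(text: str, mae_limit: float = 0.1, max_limit: float = 0.5) -> dict:
    """Map each dtype found in the output to True if both errors are under the limits."""
    results = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            mae = float(match["mae"])
            max_error = float(match["max"])
            results[match["dtype"]] = mae < mae_limit and max_error < max_limit
    return results


sample = """torch.float16: gemv_4bit mae=0.008388, max_error=0.041985
torch.bfloat16: gemv_4bit mae=12.890043, max_error=53.826134"""

print(check_output(sample))  # {'torch.float16': True, 'torch.bfloat16': False}
```

The limits are the approximate values quoted in this section, not official tolerances, so adjust them if the reproduction script changes.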

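References

bitsandbytes-foundation/bitsandbytes #1932

Key comments: the user obtained normal results after testing the Triton path, and a developer confirmed that the Arc B60 and the Data Center GPU Max 1550 (PVC) do not show the problem.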
