_prefill: cache.get_seq_length() returns device tensor, breaks slice arithmetic for static cache + torch.compile on non-CUDA backends

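Quick summary: this failure occurs when a static cache is combined with torch.compile on a non-CUDA device (for example, CPU with the Inductor backend) across multiple model.generate() calls. First check whether cache.get_seq_length() returns a torch.Tensor rather than an int, and convert it to a Python int inside _prefill.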

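Problem scenario

The user loads GPT-2 (or another CausalLM model) with the HuggingFace Transformers library, sets generation_config.cache_implementation = "static", and wraps the forward pass with torch.compile(model.forward, backend="inductor"). The crash is triggered by the second call to model.generate() when Dynamo is not reset manually between the calls.

The same setup works on Transformers 5.2.0 and first fails on 5.5.3, after _prefill was introduced.

Note: on the CUDA backend, CUDA graph replay can mask the bug. To reproduce it on CUDA, call torch._dynamo.reset() between the two generate() calls to force a re-trace.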
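The setup described above can be sketched as follows. This is not the reproduction script from the issue, which is not quoted here; it is a minimal reconstruction under assumptions: the function name reproduce, the prompt text, and the max_new_tokens value are all made up for illustration, and the heavy imports sit inside the function so that merely defining it needs neither PyTorch nor a model download.

```python
def reproduce(model_name: str = "gpt2", max_new_tokens: int = 8):
    """Call generate() twice with a static cache kept on generation_config.

    Hypothetical helper: returns the output shapes of both calls, or lets the
    exception from the second call propagate on an affected install.
    """
    # Imported lazily so that defining this function has no heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    # The two settings named in the problem description.
    model.generation_config.cache_implementation = "static"
    model.forward = torch.compile(model.forward, backend="inductor")

    inputs = tokenizer("Hello, my name is", return_tensors="pt")  # prompt is arbitrary
    shapes = []
    for _ in range(2):  # no torch._dynamo.reset() between the calls
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
        shapes.append(tuple(output.shape))
    return shapes
```

Calling reproduce() on CPU should return two shapes on an unaffected install; on an affected one, the second iteration is where the failure is expected to surface.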

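Error output

# The secondary symptom is a slice-index anomaly (a tensor is used as the slice bound)
# The failure sits in the core call chain and has no standalone error message; key context: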
#
# cache_utils.py line 383:
#   self.cumulative_length = torch.tensor([0], dtype=int)
# cache_utils.py line 461:
#   get_seq_length() returns this tensor directly (annotation says -> int but actual return is torch.Tensor)
# generation/utils.py _prefill():
#   next_sequence_length = input_ids.shape[1] - past_length  → produces tensor
# generation/utils.py line 532:
#   input_ids[:, -next_sequence_length:]  → tensor-as-slice-index produces wrong (0-length) result

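Root cause analysis

The root cause is that the _prefill function introduced in Transformers 5.5.x does not handle the type returned by cache.get_seq_length() correctly when a static cache is in use. Specifically:

Static cache design constraint: cache_utils.py line 383 stores cumulative_length as a tensor (the comment says "avoid recompiling"), but get_seq_length() (line 461) returns that tensor directly, which contradicts its declared int return type.
Type break: in _prefill, past_length = cache.get_seq_length() is a tensor, so next_sequence_length = input_ids.shape[1] - past_length is a tensor as well, violating the next_sequence_length: int | None type contract.
Slice error: input_ids[:, -next_sequence_length:] uses a tensor as the slice bound, which yields an unexpected 0-length result and causes the downstream failure.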
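The type propagation in the first two points can be seen in isolation with plain eager PyTorch. The sketch below is illustrative only: ToyStaticLayer is a hypothetical stand-in, not the Transformers class, and it copies just the initialisation quoted from cache_utils.py. It does not attempt to reproduce the compiled-graph slicing failure.

```python
import torch


class ToyStaticLayer:
    """Hypothetical stand-in for a static cache layer (not the real class)."""

    def __init__(self) -> None:
        # Same initialisation as the line quoted from cache_utils.py.
        self.cumulative_length = torch.tensor([0], dtype=int)

    def get_seq_length(self) -> int:
        # The annotation promises int, but nothing enforces it at runtime.
        return self.cumulative_length


layer = ToyStaticLayer()
layer.cumulative_length += 3  # pretend three tokens are already cached
input_ids = torch.zeros((1, 5), dtype=torch.long)

past_length = layer.get_seq_length()
next_sequence_length = input_ids.shape[1] - past_length  # int - Tensor -> Tensor

print(type(past_length).__name__)           # Tensor
print(type(next_sequence_length).__name__)  # Tensor
```

Python type annotations are not checked at runtime, so the tensor flows through every later arithmetic step until something explicitly converts it.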

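On the CUDA backend, get_seq_length() may return an int (a different branch handles it), so the bug is not always triggered. It appears only when the static cache is kept on generation_config and reused across calls; passing cache_implementation="static" as an argument on every generate() call, which creates a new cache each time, may avoid it.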

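Environment checklist

Transformers version: 5.5.3 or later (the 5.5.x to 5.6.x range); 5.2.0 is not affected
PyTorch version: 2.10.0 (the version in the report)
Python version: 3.10.12
Operating system: Linux, Ubuntu 22.04
Device: CPU or another non-CUDA device, compiled with a backend such as Inductor
Model: GPT-2 or any CausalLM model (CUDA is not required)
Key dependency: static cache + torch.compile

Note: the issue reporter mentions that the latest version may already contain a fix (possibly related to the merged PR #46802). If you cannot reproduce the problem, check whether you are running the latest Transformers main branch.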
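A quick way to compare an install against this checklist is sketched below, using only the standard library. The helper names are hypothetical, and the range encoded in possibly_affected is simply the 5.5.3 through 5.6.x window stated above, which is itself tentative; treat a True result as a hint to investigate, not as a diagnosis.

```python
import re
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Extract the leading numeric release, e.g. '5.6.0.dev0' -> (5, 6, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def possibly_affected(version: str) -> bool:
    """Assumption: the affected window is 5.5.3 through 5.6.x, as listed above."""
    return (5, 5, 3) <= release_tuple(version) < (5, 7)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("transformers", "torch"):
    found = installed_version(name)
    if found is None:
        print(f"{name}: not installed")
    elif name == "transformers":
        print(f"{name}: {found} (possibly affected: {possibly_affected(found)})")
    else:
        print(f"{name}: {found}")
```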

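Resolution steps

Identify the returned type: in _prefill or in the generation loop, check whether the value returned by cache.get_seq_length() is a torch.Tensor immediately after fetching it.
Convert to a Python int: if it is a tensor, convert it explicitly with .item():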
```
past_length = cache.get_seq_length()
if isinstance(past_length, torch.Tensor):
    past_length = int(past_length.item())
```
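This converts only the local variable and leaves the tensor stored in the underlying cache untouched, so it does not invalidate torch.compile's compiled-graph cache.
Apply the fix: place the conversion in _prefill before past_length is first used (the approach suggested in the issue).
Try this first if you prefer not to patch the source: as a temporary workaround, pass cache_implementation="static" as an argument to generate() instead of setting it on generation_config, so that each generation uses a new cache.
Upgrade Transformers: check whether you are on the latest release or on a build that includes the merge of #46802, which may already fix the problem.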
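The no-patch workaround from the steps above can be wrapped in a small helper so that call sites cannot forget the argument. This is a sketch under assumptions: generate_with_fresh_static_cache is a hypothetical name, inputs is assumed to be a mapping such as a tokenizer's output, and generation_config.cache_implementation is assumed to be left unset so that no cache is carried across calls.

```python
def generate_with_fresh_static_cache(model, inputs, **generate_kwargs):
    """Request the static cache per call rather than via generation_config.

    Hypothetical wrapper: `model` only needs a generate(**kwargs) method and
    `inputs` is a mapping of keyword arguments (e.g. input_ids, attention_mask).
    """
    # Force the per-call setting even if the caller passed something else.
    generate_kwargs["cache_implementation"] = "static"
    return model.generate(**inputs, **generate_kwargs)
```

A call then looks like generate_with_fresh_static_cache(model, inputs, max_new_tokens=8). Whether this fully sidesteps the bug on a given version is not confirmed here; the source only says it may.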

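Verification

Write a test script: run the reproduction code from the issue (without torch._dynamo.reset()) on CPU with the Inductor backend.
Confirm that the second model.generate() call no longer fails and returns an output of the expected shape.
Check the type: add a print statement or step through with a debugger at the fix location, and confirm that past_length is always a Python int.
Compare with the old version: run the same script on Transformers 5.2.0 to establish a known-good baseline.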
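The checks above could be automated with two small helpers, sketched here with hypothetical names. check_repeated_generate assumes the same prompt and greedy decoding on every call, so equal output shapes are a reasonable expectation; assert_python_int is meant to be dropped in temporarily at the fix location.

```python
def check_repeated_generate(generate_fn, calls: int = 2):
    """Call generate_fn several times and return the output shapes.

    Any exception raised by generate_fn propagates unchanged, so a crash on
    the second call shows up with its original traceback.
    """
    shapes = []
    for _ in range(calls):
        output = generate_fn()
        shapes.append(tuple(output.shape))
    if len(set(shapes)) != 1:
        raise AssertionError(f"output shapes differ across calls: {shapes}")
    return shapes


def assert_python_int(value, name: str = "past_length") -> None:
    """Fail unless value is exactly a built-in int (tensors and bools are rejected)."""
    if type(value) is not int:
        raise TypeError(f"{name} should be int, got {type(value).__name__}")
```

For example, check_repeated_generate(lambda: model.generate(**inputs, max_new_tokens=8, do_sample=False)) exercises the two-call scenario, where model and inputs are whatever the test script already defines.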

References

huggingface/transformers #46858

