Eval bug: EAGLE3 with Qwen3.6 (qwen3_5 hybrid) target — missing t_layer_inp hooks, and llama_decode(ctx_dft) rc=-1 once context exceeds ~700

The first request works. Then I asked “Can you write a Python script that prints Hello World?” and it responded correctly.
…
So far so good, all working. But then I asked it to write a longer Python script, and it died.
…
```
E process: common/speculative.cpp:618 – draft decode failed: llama_decode(ctx_dft) = -1 (n_tokens=9, ubatch_pos[0]=873, n_seq_tgt=1, n_ctx_tgt=2, ubatch_pos[874])
1.42.685.817 I srv update_slots: id 0 | task 0 | draft-token count: 649
1.42.685.867 I srv update_slots: id 0 | task 0 | context checkpoints, no compression, free slots 0, unused nodes 0, used nodes 2, allocated 2, reused 0
1.42.685.867 I srv fatal_error: fatal error in server, please reach out to us at https://github.com/ggml-org/llama.cpp/pull/20277: E process: common/speculative.cpp:618 – draft decode failed: llama_decode(ctx_dft) = -1
```

Based on deeper analysis from the comments, the problem is in speculative.cpp, in the draft-model decode path: `ubatch_pos` is incorrect for `ctx_dft`. When context checkpointing is enabled, the positions in `ctx_dft` start counting at ` + 1`, which is an offset. When `ubatch_pos` is then used for `ctx_dft`, it is incorrect. I think this can be fixed with an offset variable.

To clarify the terms “offset” and “start from”: when using `--context-checkpoints`, the `pos` variable of `ctx_tgt` starts from `0`, but the `pos` variable of `ctx_dft` starts from some other number (like `128`). When the draft model sends a batch to `ctx_dft`, `ubatch_pos` uses `n_past_dft`, which can be lower than `pos` (for `ctx_dft`). This causes `llama_decode(ctx_dft) == -1`.

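Quick summary: this error occurs when using EAGLE3 speculative decoding with a target model whose architecture is qwen3_5 (hybrid). Check two things first: whether the target model's layer loop is missing the t_layer_inp[il] assignment, and whether ubatch_pos for ctx_dft is offset incorrectly when context checkpoints are enabled.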

Scenario

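The user runs EAGLE3 speculative decoding through llama-server (llama.cpp). The target model is unsloth/Qwen3.6-27B-MTP-GGUF (architecture qwen3_5, GDN hybrid), and the draft model is a GGUF converted from Ex0bit/Qwen3.6-27B-PRISM-EAGLE3. The problem has been reproduced on both the Vulkan backend (2× AMD RDNA4) and CUDA (Tesla V100), so it is not tied to specific hardware.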

Error output

src/llama-graph.cpp:956: GGML_ASSERT(t_layer_inp[il] != nullptr && "layer input tensor is null") failed

E process: llama_decode(ctx_dft) failed rc=-1 (n_tokens=4, ubatch_pos[0]=736)
tools/server/server-context.cpp:3185: fatal error - please provide logs and repro in https://github.com/ggml-org/llama.cpp/pull/20277

common/speculative.cpp:618 - draft decode failed: llama_decode(ctx_dft) = -1 (n_tokens=9, ubatch_pos[0]=873, n_seq_tgt=1, n_ctx_tgt=2, ubatch_pos[874])

Root cause analysis

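Part 1 (assert failure): the layer loop in qwen35.cpp does not assign res->t_layer_inp[il] the way implementations such as qwen3.cpp do, so EAGLE3 hits a null pointer when it extracts intermediate layer inputs from the target model.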
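The missing hook can be illustrated with a toy model of a layer loop. This is only a sketch: `Tensor`, `GraphResult`, `build_layers`, and `all_layer_inputs_recorded` are hypothetical stand-ins invented for illustration and are not the llama.cpp types or functions. Only the assignment inside the loop mirrors the one-line hook described in the issue.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph tensor; not the llama.cpp type.
struct Tensor {
    int id;
};

// Hypothetical stand-in for the graph result: one hook slot per layer.
struct GraphResult {
    std::vector<const Tensor *> t_layer_inp;
};

// acts[il] plays the role of the input to layer il.
// With record_inputs == false the loop mimics the behaviour described for
// qwen35.cpp: the per-layer hook is never filled in and stays nullptr.
GraphResult build_layers(const std::vector<Tensor> & acts, bool record_inputs) {
    GraphResult res;
    res.t_layer_inp.assign(acts.size(), nullptr);
    for (std::size_t il = 0; il < acts.size(); ++il) {
        const Tensor * inpL = &acts[il];
        if (record_inputs) {
            res.t_layer_inp[il] = inpL; // the hook that the fix adds
        }
        // ... the layer's attention / FFN work would go here ...
    }
    return res;
}

// Mirrors the intent of the failing assertion: every hook must be non-null
// before a consumer such as a draft model reads the layer inputs.
bool all_layer_inputs_recorded(const GraphResult & res) {
    for (const Tensor * t : res.t_layer_inp) {
        if (t == nullptr) {
            return false;
        }
    }
    return true;
}
```

In this toy model, a result built with `record_inputs == false` fails the check for every layer, which corresponds to the `t_layer_inp[il] != nullptr` assertion firing as soon as the draft path asks for a layer input.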

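Part 2 (decode failure): this is a deeper problem that only surfaces once the t_layer_inp hooks are in place. When --context-checkpoints is enabled, the pos variable of ctx_dft (the draft model context) does not start from 0; it starts counting from an offset after the compressed checkpoint. The ubatch_pos that common/speculative.cpp passes to ctx_dft, however, uses n_past_dft (a count that starts from 0). The mismatch makes llama_decode(ctx_dft) return -1 (invalid input batch).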
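The mismatch can be reproduced in a toy model. The sketch below assumes that a decode context rejects any batch whose positions do not continue directly from the last position it has stored. `ToyContext` and `make_positions` are hypothetical names made up for this illustration, not the llama.cpp API, and the value 128 is just the example offset mentioned in the issue.

```cpp
#include <vector>

// Toy stand-in for a decode context; NOT the llama.cpp API.
struct ToyContext {
    int last_pos = -1; // last position stored for the sequence, -1 when empty

    // Returns 0 on success and -1 when the batch does not continue the
    // sequence, i.e. when its positions are not last_pos + 1, last_pos + 2, ...
    int decode(const std::vector<int> & ubatch_pos) {
        if (ubatch_pos.empty()) {
            return -1;
        }
        int expected = last_pos + 1;
        for (int p : ubatch_pos) {
            if (p != expected) {
                return -1; // inconsistent positions: reject, leave state untouched
            }
            ++expected;
        }
        last_pos = ubatch_pos.back();
        return 0;
    }
};

// Builds n consecutive positions starting at `start`.
std::vector<int> make_positions(int start, int n) {
    std::vector<int> pos;
    for (int i = 0; i < n; ++i) {
        pos.push_back(start + i);
    }
    return pos;
}
```

If the toy context already holds positions up to 127, a batch built from a zero-based counter (`make_positions(0, 4)`) is rejected with -1, while a batch that starts at 128 is accepted. That is the same shape of failure as the one reported for ctx_dft.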

Environment checklist

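llama.cpp version: reproduced on b9611 (release) and on master 02182fc. Note that some users still hit the same problem after PR #24593 was merged, so it is worth checking even if your build includes that patch.
Target model: unsloth/Qwen3.6-27B-MTP-GGUF, architecture qwen3_5 (GDN hybrid).
Draft model: converted with convert_hf_to_gguf.py from Ex0bit/Qwen3.6-27B-PRISM-EAGLE3 (compressed/ directory), using --outtype f16 and --target-model-dir pointing at the target model directory.
Backend and GPU: both Vulkan (2× AMD RDNA4) and CUDA (e.g. Tesla V100) are affected.
Example launch command: llama-server -m Qwen3.6-27B-UD-Q5_K_XL-MTP.gguf -ngl 99 -sm layer -fa on -c 32768 --jinja -np 1 --spec-type draft-eagle3 --spec-draft-model Qwen3.6-27B-EAGLE3-PRISM-f16.gguf. Note that the value of -c (e.g. 32768 or 65535) does not affect reproduction.
Context checkpoint setting: enabled by default (controlled via --context-checkpoints). The conflict between this mechanism and the pos calculation of ctx_dft is the key to the Part 2 problem.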

Fix steps

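Fix the assert failure: at the start of the layer loop in src/models/qwen35.cpp, add res->t_layer_inp[il] = inpL;, porting the same logic from other architectures (such as qwen3.cpp). This patch was verified in the issue to resolve the assert. (It has been merged via PR #24593; apply it manually if your self-built version does not include it.)
Fix the decode failure (rc=-1): the root cause is the mismatch between the pos of ctx_dft and ubatch_pos. The issue proposes a direction for the fix: in the EAGLE3 draft path of common/speculative.cpp, introduce an offset variable so that ubatch_pos is expressed relative to the actual starting position of ctx_dft, rather than using n_past_dft counted from 0. The implementation must handle the case where, with --context-checkpoints enabled, the pos of ctx_dft starts from a non-zero value (such as 128). This fix is speculative; try related community PRs or commits first.
Temporary workarounds: for troubleshooting, try disabling context checkpoints via --context-checkpoints (the issue does not explicitly test whether this fully avoids the problem), or reduce -c so that the total token count stays below the threshold of roughly 700.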
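The offset idea from the second step can be sketched as a small helper. This is a minimal illustration under stated assumptions, not the actual patch: `draft_positions` and `dft_pos_offset` are hypothetical names, and how the real code in common/speculative.cpp would obtain the starting position of ctx_dft is not specified in the issue.

```cpp
#include <vector>

// Hypothetical helper: builds the positions for a draft batch relative to
// where ctx_dft actually starts, instead of from a zero-based counter.
//   n_past_dft     - zero-based count of tokens already fed to the draft context
//   dft_pos_offset - assumed first position of ctx_dft (0 without an offset,
//                    non-zero such as 128 in the example from the issue)
//   n_tokens       - number of tokens in the batch
std::vector<int> draft_positions(int n_past_dft, int dft_pos_offset, int n_tokens) {
    std::vector<int> pos;
    if (n_tokens <= 0) {
        return pos; // nothing to decode
    }
    pos.reserve(static_cast<std::vector<int>::size_type>(n_tokens));
    for (int i = 0; i < n_tokens; ++i) {
        pos.push_back(dft_pos_offset + n_past_dft + i);
    }
    return pos;
}
```

With an offset of 0 the helper degenerates to the current zero-based behaviour, so a change of this shape would only alter the positions when the draft context really does start at a non-zero position.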

Verification

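After the fix, send llama-server a request whose total context exceeds 700 tokens (for example curl http://127.0.0.1:8085/completion -d '{"prompt":"Write a short essay of about 800 tokens.","n_predict":1024}'; n_predict must be large enough for the generation to pass the roughly 700-token mark). The GGML_ASSERT and llama_decode(ctx_dft) rc=-1 errors should no longer appear, and the draft acceptance rate should be reported normally. In addition, send several long-context requests (for example with a gradually increasing prompt length) to confirm that the server runs stably.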
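When checking captured server logs after such a test run, the two failure signatures can be searched for mechanically. The sketch below assumes the log has already been read into a string; `LogCheck` and `check_server_log` are hypothetical names, and the two search strings are taken from the error output quoted earlier in this article.

```cpp
#include <string>

// Result of scanning a captured server log for the two known failures.
struct LogCheck {
    bool null_layer_input = false;    // GGML_ASSERT on t_layer_inp[il] seen
    bool draft_decode_failed = false; // a llama_decode(ctx_dft) failure line seen

    bool clean() const {
        return !null_layer_input && !draft_decode_failed;
    }
};

// Looks for the substrings that appear in the error messages quoted above.
// Both quoted variants of the decode failure ("failed rc=-1" and "= -1")
// contain "llama_decode(ctx_dft)", so a single search covers them.
LogCheck check_server_log(const std::string & log) {
    LogCheck result;
    result.null_layer_input =
        log.find("t_layer_inp[il] != nullptr") != std::string::npos;
    result.draft_decode_failed =
        log.find("llama_decode(ctx_dft)") != std::string::npos;
    return result;
}
```

A log for which `clean()` returns true only shows that neither signature was printed; it does not by itself confirm that drafting is working, so the acceptance rate still needs to be checked separately.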

References

ggml-org/llama.cpp #24541
