Eval bug: DFlash Speculative Decoding Crash: GGML_ASSERT(buffer) Failure

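Quick summary: this failure is triggered when DFlash speculative decoding (--spec-type draft-dflash) is combined with a non-default KV cache quantization type (for example --cache-type-k q8_0). First check whether the build lacks the null-buffer guard. Updating to a version that includes the fix, or merging PR #25215 manually, should resolve it.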

Problem scenario

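The user runs llama.cpp's llama-server with DFlash speculative decoding enabled (--spec-type draft-dflash) and with both --cache-type-k and --cache-type-v set to q8_0 (a non-default value). The server crashes during the first draft decode. The crash was reported as a segmentation fault, but the backtrace below shows the process aborting through ggml_abort after a failed assertion. The hardware is an AMD Radeon RX 9070 and the backend is Vulkan.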

Original error output

#0  0x00007f3c9fa9fff2 in ?? () from /usr/lib/libc.so.6
#1  0x00007f3c9fa9403c in ?? () from /usr/lib/libc.so.6
#2  0x00007f3c9fa94084 in ?? () from /usr/lib/libc.so.6
#3  0x00007f3c9fb0494f in wait4 () from /usr/lib/libc.so.6
#4  0x00007f3ca0151c3b in ggml_print_backtrace () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libggml-base.so.0
#5  0x00007f3ca0151dd2 in ggml_abort () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libggml-base.so.0
#6  0x00007f3ca0169670 in ggml_backend_buffer_get_type () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libggml-base.so.0
#7  0x00007f3ca01696fd in ggml_backend_buffer_is_host () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libggml-base.so.0
#8  0x00007f3c9f12e1d7 in llama_kv_cache::set_input_k_rot(ggml_tensor*) const () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama.so.0
#9  0x00007f3c9f117d9e in llm_graph_input_attn_kv_iswa::set_input(llama_ubatch const*) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama.so.0
#10 0x00007f3c9f11bd00 in llm_graph_result::set_inputs(llama_ubatch const*) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama.so.0
#11 0x00007f3c9f0e7941 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama.so.0
#12 0x00007f3c9f0ee60a in llama_context::decode(llama_batch const&) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama.so.0
#13 0x00007f3c9f0efe30 in llama_decode () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama.so.0
#14 0x00007f3c9f6d7ce7 in common_speculative_impl_draft_dflash::process(llama_batch const&) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama-common.so.0
#15 0x00007f3c9f6cbdbd in common_speculative_process(common_speculative*, llama_batch const&) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama-common.so.0
#16 0x00007f3ca039381b in server_context_impl::decode(int&, int, llama_batch&) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama-server-impl.so
#17 0x00007f3ca0395b1d in server_context_impl::update_slots() () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama-server-impl.so
#18 0x00007f3ca033e2f1 in server_queue::start_loop(long) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama-server-impl.so
#19 0x00007f3ca02ddc12 in llama_server(int, char**) () from /home/bestbug/Documents/llama-vulkan/llama-b9849/libllama-server-impl.so

Root cause analysis

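The core problem is a missing null check on the tensor's buffer in the code path that leads into set_input_k_rot / set_input_v_rot. When a graph stores K/V but does not run attention over them (the KV-injection pass in DFlash), no buffer is allocated for the self_k_rot tensor, so its buffer member is NULL even though the tensor pointer itself is not. The only check in place is if (self_k_rot), which passes. ggml_backend_buffer_is_host(dst->buffer) is then called with a NULL buffer, and GGML_ASSERT(buffer) fails.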
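The difference between the two checks can be shown with a small self-contained model. This is only a sketch: mock_buffer, mock_tensor and the helper functions are hypothetical stand-ins written for illustration, not the real ggml or llama.cpp types, and the assert merely imitates the precondition that GGML_ASSERT(buffer) enforces.

```cpp
#include <cassert>

// Hypothetical stand-ins for ggml_backend_buffer and ggml_tensor (not the real types).
struct mock_buffer {
    bool is_host;
};

struct mock_tensor {
    mock_buffer * buffer; // stays nullptr when no buffer was allocated for the tensor
};

// Imitates the precondition behind GGML_ASSERT(buffer): a null buffer is a hard error.
bool mock_buffer_is_host(const mock_buffer * buf) {
    assert(buf != nullptr && "buffer must not be null");
    return buf->is_host;
}

// Unguarded check, analogous to `if (self_k_rot)`: passes for an unallocated tensor.
bool should_write_unguarded(const mock_tensor * t) {
    return t != nullptr;
}

// Guarded check, analogous to `if (self_k_rot && self_k_rot->buffer)`.
bool should_write_guarded(const mock_tensor * t) {
    return t != nullptr && t->buffer != nullptr;
}
```

With a tensor whose buffer is nullptr, should_write_unguarded returns true and a subsequent call to mock_buffer_is_host would trip the assertion, whereas should_write_guarded returns false and the write is skipped.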

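Likely cause: the design did not anticipate that, in the graph topologies produced by DFlash speculative decoding, some tensors may have no buffer. The problem is not specific to DFlash. It is a generic missing guard, and the neighboring kq_mask input already handles the same situation with an && buffer check.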

Environment

llama.cpp version: b9849 (commit 799fcc04a)
Operating system: Linux x86_64
GGML backend: Vulkan (AMD Radeon RX 9070)
Compiler: GNU 11.4.0
Relevant command-line arguments: --spec-type draft-dflash + --cache-type-k q8_0 + --cache-type-v q8_0 (plus the corresponding --cache-type-k-draft q8_0)
Models: the draft model is unsloth/Qwen3.6-27B-Q4_K_M, and the DFlash adapter model is williamliao/Qwen3.6-27B-DFlash-IQ4_XS

Resolution steps

Preferred fix: upgrade to a version that includes PR #25215. That PR adds an && buffer guard to the four k_rot/v_rot inputs in llm_graph_input_attn_kv::set_input() and llm_graph_input_attn_kv_iswa::set_input(), matching how the kq_mask input is already handled.
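Manual patch: if upgrading is not possible, edit the llama.cpp source directly. The report names the file as llm_graph_input_attn_kv_iswa.cpp (or the corresponding header); if no file with that name exists in your tree, search the sources for set_input_k_rot to find the set_input functions. In those functions, locate the four checks of the form if (self_k_rot) and change each to the form if (self_k_rot && self_k_rot->buffer).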
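Temporary workaround: until the fix is applied, drop the --cache-type-k q8_0 and --cache-type-v q8_0 arguments and use the default KV cache type. This increases KV cache memory use and does not address the root cause.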
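Confirmed effect of the patch: PR #25215 has been tested in the user's environment. Before the patch, the first draft decode crashed. With the patch applied, inference completes normally, and ordinary causal models that do not use DFlash are unaffected.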
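The shape of the guarded set_input described above can be sketched as follows. This is an illustrative model under stated assumptions, not the code from PR #25215: mock_buffer, mock_tensor and mock_attn_kv_iswa_input are hypothetical types, and the field names self_k_rot_swa and self_v_rot_swa are assumed names for the two sliding-window counterparts of self_k_rot and self_v_rot.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stand-ins; not the real ggml / llama.cpp types.
struct mock_buffer {
    int writes = 0; // counts how many inputs were written into this buffer
};

struct mock_tensor {
    mock_buffer * buffer = nullptr; // nullptr when the tensor was never allocated
};

// Models an ISWA attention input that carries four rotation tensors.
// The *_swa field names are assumptions made for this sketch.
struct mock_attn_kv_iswa_input {
    mock_tensor * self_k_rot     = nullptr;
    mock_tensor * self_v_rot     = nullptr;
    mock_tensor * self_k_rot_swa = nullptr;
    mock_tensor * self_v_rot_swa = nullptr;

    // Writes only the rotation inputs that exist AND have a buffer.
    // Returns the number of inputs actually written.
    int set_input() const {
        int written = 0;
        for (mock_tensor * t : {self_k_rot, self_v_rot, self_k_rot_swa, self_v_rot_swa}) {
            if (t && t->buffer) { // the added `&& buffer` guard
                t->buffer->writes++;
                written++;
            }
        }
        return written;
    }
};
```

In this model, a tensor that is present but has no buffer is skipped rather than passed on to a helper that requires a non-null buffer, which is the behavior the four guards are meant to provide.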

Verification

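Run llama-server with exactly the same command-line arguments that produced the crash (including --cache-type-k q8_0 --cache-type-v q8_0 --spec-type draft-dflash). If the GGML_ASSERT(buffer) crash no longer occurs and speculative decoding completes normally, the problem is fixed. Also test a non-DFlash model to check for regressions.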