Eval bug: Step 3.7 Flash gets stuck in reasoning trying to make tool calls (autoparser)

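Quick summary: this problem occurs when llama.cpp uses the autoparser to parse output from the Step 3.7 Flash model with tool calls enabled. During reasoning, the model repeatedly falls into a "propose a search, then retract it" loop and cannot complete the tool call. Check two things first: the model quantization (the older Q4_K_M quantization is reported to be more stable) and how the autoparser parses the reasoning delimiters.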

Problem scenario

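The user runs llama-cli from llama.cpp (version 9524, 59917d392) on macOS (Apple M1, Metal backend), loads the Step 3.7 Flash model, and uses the pi agent with tool calling enabled. During reasoning the model falls into a repetitive pattern: it appears to be about to make a tool call (for example "Let me search for …"), then immediately contradicts itself with "Actually…", and repeats this cycle, sometimes without ever breaking out.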

Original output

If the server computes probabilities for each token and stores them, this could grow. Let me check what completion_token_output is.        
                                                                                                                                            
 Let me search for it.                                                                                                                      
                                                                                                                                            
 Also, I want to check if there's something in the llama_decode or graph evaluation that allocates per-token.                               
                                                                                                                                            
 Actually, let me look at the llama-context.cpp decode function more carefully.                                                             
                                                                                                                                            
 Wait, I think I found something very relevant. Let me look at the llama_decode function and the llama_batch allocation.                    
                                                                                                                                            
 Actually, let me look at llama-context.cpp around line 3451 where llama_decode is defined.                                                 
                                                                                                                                            
 Actually, let me take a different approach. Let me look at the ggml_backend_sched code to see if it reallocates buffers dynamically.       
                                                                                                                                            
 Let me search for the scheduler implementation.                                                                                            
                                                                                                                                            
 Also, let me look at the llama_graph_result and llm_graph_result structures. These might allocate memory per inference step.               
                                                                                                                                            
 Actually, looking at llama-context.cpp line 409:

Cause analysis

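The root cause points to how the autoparser handles reasoning delimiters. In StepFun's generation prompt, the reasoning block is wrapped in <think>\n and \n</think>\n. When parsing, the autoparser extracts only the stripped forms (<think> and </think>), whereas the reasoning-budget sampler needs the original marks, newlines included, to recognize the reasoning state correctly.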

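As a result, the sampler cannot detect that the reasoning block is already open (pre-opened), so <tool_call> tokens generated during reasoning are not suppressed. They can trigger the lazy tool grammar prematurely and break the normal flow from reasoning to tool call.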
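The mismatch can be illustrated with a minimal Python sketch. It is not llama.cpp code: the function names, the sample prompt, and the suffix-comparison rule are all assumptions made for illustration. Only the delimiter strings come from the analysis above.

```python
# Illustrative sketch only. Nothing here is llama.cpp API; the names and the
# suffix-matching rule are assumptions used to show the effect of a stripped mark.

RAW_OPEN = "<think>\n"       # delimiter as written in the generation prompt
STRIPPED_OPEN = "<think>"    # what a whitespace-stripping parser would keep


def is_preopened(generation_prompt: str, open_mark: str) -> bool:
    """Assumed rule: the block is pre-opened if the prompt ends with the mark."""
    return generation_prompt.endswith(open_mark)


def tool_call_suppressed(generation_prompt: str, open_mark: str) -> bool:
    """In this sketch, <tool_call> is held back only while reasoning is open."""
    return is_preopened(generation_prompt, open_mark)


# Hypothetical prompt tail that ends with the raw opening delimiter.
prompt = "assistant\n<think>\n"

raw_result = tool_call_suppressed(prompt, RAW_OPEN)
stripped_result = tool_call_suppressed(prompt, STRIPPED_OPEN)
print("raw mark:", raw_result)
print("stripped mark:", stripped_result)
```

With the raw mark the sketch reports the reasoning block as open. With the stripped mark the trailing newline defeats the comparison, so suppression never engages, which is the kind of failure the analysis describes.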

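Possible causes: the behavior may be a regression introduced by the autoparser refactor (bisect points to commit 566059a26b0ce8faec4ea053605719d399c64cc5). The model quantization may also be a factor: the user reports that the newer IQ4_XS quantization has noticeably higher perplexity/KLD, while the older Q4_K_M quantization almost never loops indefinitely in the same scenario.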

Environment check

llama-cli version: 9524 (59917d392)
Operating system: macOS (Darwin arm64)
Hardware: Apple M1 GPU
GGML backend: Metal
Model: Step 3.7 Flash (GGUF format)
Quantizations used: IQ4_XS, Q4_K_M, and others
Related components: autoparser, lazy tool grammar implementation

Resolution steps

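Try first: switch to the older Q4_K_M quantization (such as the early quantization made by @AesSedai), which did not loop indefinitely in testing.
Try first: apply the autoparser opt-out patch for Step 3.7 Flash (the user-provided patch adds a dedicated handler, common_chat_params_init_stepfun_3_7, in chat.cpp to bypass the autoparser's incorrect parsing of the reasoning delimiters).
If you cannot modify the source code, try passing enable_thinking=false explicitly in the request, or adjusting the reasoning budget parameter, and observe whether the problem still reproduces.
Wait for an official fix: the issue is currently labeled bug-unconfirmed, so follow it for a later fix to how the autoparser parses thinking tags.
Roll back to a known-good commit: 3dadc88b589ca43b8fca0e1beb22d4b78a09b4dd, at which Step 3.5 Flash runs normally.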

Verification

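Repeatedly run a conversation that includes tool calls and check that the model completes the tool call within a reasonable number of steps and then answers normally, without the "Let me search" followed by "Actually" loop. Compare the behavior with the output of the official StepFun API to confirm that reasoning flows equally smoothly.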
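To make this check repeatable, the loop pattern can be counted mechanically in captured reasoning text. The following is a rough sketch, not part of llama.cpp or the original report: the function names, the "Actually"/"Wait" heuristic, and the default threshold are assumptions chosen for illustration and would need tuning against real transcripts.

```python
import re

# Heuristic (an assumption, not from the issue): a line that opens with
# "Actually" or "Wait" counts as the model reversing its previous plan.
REVERSAL = re.compile(r"^\s*(Actually|Wait)\b")


def count_reversals(reasoning: str) -> int:
    """Count lines of the reasoning text that begin with a reversal word."""
    return sum(1 for line in reasoning.splitlines() if REVERSAL.match(line))


def looks_stuck(reasoning: str, max_reversals: int = 3) -> bool:
    """Flag a transcript whose reversal count exceeds an arbitrary threshold."""
    return count_reversals(reasoning) > max_reversals


# Short sample modeled on the output quoted in this report.
sample = "\n".join([
    "Let me search for it.",
    "Actually, let me look at the llama-context.cpp decode function more carefully.",
    "Wait, I think I found something very relevant.",
    "Actually, let me take a different approach.",
    "Let me search for the scheduler implementation.",
])

print(count_reversals(sample), looks_stuck(sample, max_reversals=2))
```

Running the same prompt several times before and after a change, and comparing how often looks_stuck fires, gives a coarse signal of whether the loop has gone away.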

References

ggml-org/llama.cpp #24181
