serve: Gemma 4 non-thinking responses returned as reasoning_content with empty content

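Quick summary: this issue occurs when a Gemma 4 model is served with transformers serve and thinking mode is disabled. The returned content field is empty, and the full generated text is misclassified as reasoning_content. First check whether your transformers version already includes the fix (PR #45847).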

Scenario

The user serves the google/gemma-4-31B-it model with transformers serve and sends a non-streaming chat completion request via curl, with thinking not enabled in the prompt.
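For readers who prefer to reproduce the request from Python rather than curl, the following is a minimal sketch using only the standard library. The base URL (http://localhost:8000) and the /v1/chat/completions route are assumptions about a typical local deployment, not values taken from the issue; build_request and send are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: the server listens on localhost:8000 and exposes an
# OpenAI-compatible /v1/chat/completions route. Adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_request(prompt, model="google/gemma-4-31B-it", base_url=BASE_URL):
    """Build a non-streaming chat completion request (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # the issue was reported for non-streaming requests
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling send(build_request("What is the capital of France?")) against an affected server should return a body whose message.content is an empty string.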

Observed response

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"","role":"assistant"}}], "usage":{"completion_tokens":28,"prompt_tokens":33,"total_tokens":61}}

Note: although reasoning_content is stripped from the serialized JSON, the underlying ChatCompletionMessage object shows that the text was classified as reasoning content.

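Root cause

When thinking is disabled, the Gemma 4 chat template prefills an empty, already-closed thinking block at the end of the prompt. The final prompt tokens are [..., 105, 4368, 107, 100, 45518, 107, 101], where 100, 45518, 107 are the tokens of the thinking start marker <|channel>thought\n and 101 is the token for <channel|> (the closing tag).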

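The _starts_in_thinking() function (in src/transformers/cli/serving/utils.py) checks whether the prompt ends with the thinking start marker, and tolerates one trailing token (to accommodate templates such as DeepSeek-R1's, which prefill <think>\n). With Gemma 4, however, that tolerated trailing token happens to be 101 (the closing tag), so the heuristic incorrectly returns start_in_thinking=True.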
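The false positive can be illustrated with a simplified stand-in for the heuristic. This is a sketch reconstructed from the description above, not the library's actual code; starts_in_thinking_buggy is a hypothetical name, and only the token IDs come from the issue.

```python
# Token IDs quoted in the analysis (Gemma 4 tokenizer).
START_IDS = [100, 45518, 107]  # <|channel>thought\n
END_ID = 101                   # <channel|>


def starts_in_thinking_buggy(input_ids, start_ids=START_IDS):
    """Simplified stand-in for the pre-fix heuristic (not the library code)."""
    n = len(start_ids)
    # Case 1: the prompt ends exactly with the thinking start marker.
    if input_ids[-n:] == start_ids:
        return True
    # Case 2: the start marker is followed by exactly one trailing token,
    # which is tolerated no matter what that token is.
    return len(input_ids) > n and input_ids[-n - 1:-1] == start_ids


prompt_tail = [105, 4368, 107, 100, 45518, 107, 101]
print(starts_in_thinking_buggy(prompt_tail))  # True
```

Case 2 matches because the slice before the last token equals the start marker; the check never looks at whether the tolerated token is the closing tag.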

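Because the model output contains no thinking markers, parse_reasoning() takes the branch meant for "a prefilled start marker whose block was cut off before being closed" and reclassifies the entire generation as reasoning content, returning content="".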
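The effect of the wrong flag on the final split can be sketched as follows. This operates on decoded text for readability and is only an assumption about the shape of the logic; split_reasoning_sketch is a hypothetical function, not the signature of parse_reasoning().

```python
def split_reasoning_sketch(text, start_in_thinking,
                           start_tag="<|channel>thought\n",
                           end_tag="<channel|>"):
    """Return (reasoning, content). Illustrative only, not the library code."""
    if end_tag in text:
        # Normal case: everything before the closing tag is reasoning.
        head, _, tail = text.partition(end_tag)
        return head.replace(start_tag, "", 1).strip(), tail.strip()
    if start_in_thinking:
        # The prompt supposedly opened a thinking block and the output never
        # closed it, so the whole output is treated as reasoning.
        return text.strip(), ""
    # No thinking block anywhere: the whole output is the answer.
    return "", text.strip()


answer = "Paris is the capital of France."
print(split_reasoning_sketch(answer, start_in_thinking=True))
print(split_reasoning_sketch(answer, start_in_thinking=False))
```

With the flag wrongly set to True, the answer lands in the reasoning slot and content is empty, which matches the observed response.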

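Environment checks

transformers version: confirm whether you are on the main branch (commit acc2cda7d9) or a newer version
Model: whether you are using google/gemma-4-31B-it or another Gemma 4 variant
Operating system: Linux (confirmed)
Python version: any Python 3.x
CUDA environment: unrelated to this issue, but make sure inference itself works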
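A quick way to see which transformers build is installed is to query the package metadata. This is a minimal sketch; installed_version is a hypothetical helper, and the version string alone does not tell you whether a given commit or PR is included in a source install.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("transformers:", installed_version("transformers"))
```

For a source checkout, comparing the checked-out commit against acc2cda7d9 with git is more reliable than the version string.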

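Resolution steps

Update transformers: this issue is fixed in PR #45847. Upgrading to a version that includes the patch is sufficient. If you are on the main branch, make sure you have pulled the latest code.
If you cannot upgrade, patch manually: in the _starts_in_thinking() function in src/transformers/cli/serving/utils.py, add a check for whether the last prompt token is the thinking end marker, and return False when input_ids[-1] == end_id. See the diff in the issue body for the exact patch.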
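The idea of the manual patch can be sketched on a simplified version of the heuristic. This is not the diff from the issue; starts_in_thinking_patched is a hypothetical name, and the DeepSeek-style token IDs in the usage lines are made up for illustration.

```python
def starts_in_thinking_patched(input_ids, start_ids, end_id):
    """Simplified heuristic with the end-marker guard (not the library code)."""
    # Guard: a prompt that ends with the closing tag has already left the
    # thinking block, so generation does not start inside one.
    if input_ids and input_ids[-1] == end_id:
        return False
    n = len(start_ids)
    # The prompt ends exactly with the start marker.
    if input_ids[-n:] == start_ids:
        return True
    # Start marker plus one tolerated trailing token (e.g. a newline).
    return len(input_ids) > n and input_ids[-n - 1:-1] == start_ids


# Gemma 4 tail from the issue: empty, closed thinking block.
print(starts_in_thinking_patched(
    [105, 4368, 107, 100, 45518, 107, 101], [100, 45518, 107], 101))  # False

# Hypothetical IDs for a template that prefills an open tag plus a newline.
print(starts_in_thinking_patched([5, 9001, 198], [9001], 9002))  # True
```

The guard runs before the trailing-token tolerance, so templates that prefill an open thinking tag followed by a newline keep their existing behavior.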

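Verification

After applying the fix, rerun the curl request and confirm that message.content in the returned JSON contains the generated text (a non-empty string) and that the reasoning_content field is absent.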
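The check can be automated on the decoded response body. This is a minimal sketch; check_response is a hypothetical helper, and it treats a missing, null, or empty reasoning_content as "absent".

```python
import json


def check_response(body):
    """Return a list of problems found in a chat completion response body."""
    message = body["choices"][0]["message"]
    problems = []
    if not (message.get("content") or "").strip():
        problems.append("message.content is empty")
    if message.get("reasoning_content"):
        problems.append("reasoning_content is populated")
    return problems


# The affected response quoted earlier in this write-up.
buggy = json.loads(
    '{"choices":[{"finish_reason":"stop","index":0,'
    '"message":{"content":"","role":"assistant"}}],'
    '"usage":{"completion_tokens":28,"prompt_tokens":33,"total_tokens":61}}'
)
print(check_response(buggy))  # ['message.content is empty']
```

An empty list means the response passes both checks.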

References

huggingface/transformers #46561 (contains the full analysis and the fix patch)
