[Gemma 4] Gemma4UnifiedForConditionalGeneration text-only inference produces degenerate output (token repetition collapse)

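Quick take: This symptom typically appears when running raw-text inference on an instruction-tuned model such as google/gemma-4-12B-it. It is degenerate output rather than an actual error. First check whether the input was formatted with apply_chat_template; applying the correct chat template resolves it.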

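Problem scenario

When a Google Gemma 4 instruction-tuned model (for example google/gemma-4-12B-it) is loaded with the Hugging Face Transformers library (transformers) for text-only inference, the output degenerates into a single repeated token (for example "111111111111"). The problem reproduces on both CPU and MPS devices, in both float32 and bfloat16 precision, and with both the eager and flash attention implementations.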

Reported output

Sample output (this is degenerate output, not an error message):
"The capital of France is111111111111"

Diagnostic evidence:
- With greedy decoding (do_sample=False), the top-5 next-token candidates are ['1', '-', '.', '0', '_'].
- The cross-entropy loss on the prompt is about 17.9, whereas a healthy 12B model should land in the 2-3 range.
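
To put the 17.9 figure in perspective, the sketch below computes the loss of a model that guesses uniformly over the vocabulary, plus a small pure-Python version of the mean next-token cross-entropy. It assumes the reported loss is in nats (natural log, as PyTorch's cross-entropy uses), and the vocabulary size is a hypothetical placeholder that does not come from the issue.

```python
import math


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (nats) of a model giving every token equal probability."""
    return math.log(vocab_size)


def mean_next_token_nll(logits, token_ids):
    """Mean negative log-likelihood of token_ids[t + 1] under logits[t].

    logits: one list of floats per position (length = vocabulary size).
    token_ids: the tokenized prompt, same length as logits.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to score a next-token prediction")
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[token_ids[t + 1]]
    return total / (len(token_ids) - 1)


# Hypothetical vocabulary size (2**18); read tok.vocab_size for the real value.
ASSUMED_VOCAB_SIZE = 262_144
baseline = uniform_baseline_loss(ASSUMED_VOCAB_SIZE)
print(f"uniform baseline: {baseline:.2f} nats, reported prompt loss: 17.9")
```

Under that assumption the uniform baseline is about 12.5 nats, so a loss of 17.9 is worse than random guessing: the model is confidently predicting the wrong tokens, which fits an out-of-distribution prompt better than a merely weak model.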

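Root cause analysis

Root cause: Instruction-tuned (-it) models are trained on a specific conversation format (the chat template). Tokenizing a raw string such as "The capital of France is" and feeding it straight to the model is out-of-distribution (OOD) input, and the model responds with severe degeneration (a single repeated token) rather than merely lower-quality text.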

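Additional note: The issue author initially suspected a bug in the Transformers forward path, because inference with the same model's Q4_K_M GGUF weights worked correctly under llama.cpp. With help from the community, the cause was confirmed to be an input-format mismatch rather than a defect in the model weights or the inference code.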
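
One way to see the mismatch directly is to render the chat template as a string and compare it with the raw prompt. The helper below is a minimal sketch; compare_prompts is a hypothetical name, and it assumes tok is a Hugging Face tokenizer that ships a chat template.

```python
def compare_prompts(tok, user_text):
    """Return (raw_prompt, templated_prompt) for side-by-side inspection.

    tok is expected to be a Hugging Face tokenizer with a chat template.
    """
    chat = [{"role": "user", "content": user_text}]
    templated = tok.apply_chat_template(
        chat,
        tokenize=False,              # return the rendered string, not token ids
        add_generation_prompt=True,  # append the marker that opens the model's turn
    )
    return user_text, templated
```

With tok = AutoTokenizer.from_pretrained("google/gemma-4-12B-it"), printing both values from compare_prompts(tok, "The capital of France is") should show the role markers and turn delimiters that the raw string lacks.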

Environment check

Transformers version: 5.10.0.dev0 (main branch, 2026-06-08)
PyTorch version: 2.7.0
Python version: 3.12
Platform: macOS (Apple M4 Pro, MPS/CPU) / other devices (CUDA)
Model ID: google/gemma-4-12B-it (and other Gemma 4 instruction-tuned variants)
Key dependency: the apply_chat_template method of AutoTokenizer in transformers
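
To compare a local setup against the versions listed above, a short standard-library check is usually enough. This is only a sketch: installed_versions is a hypothetical helper name, and the package names passed to it are the usual PyPI distribution names.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(["transformers", "torch"])
report["python"] = platform.python_version()
print(report)
```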

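Resolution steps

Format the input with the chat template (tokenizer.apply_chat_template):

Do not tokenize the raw string directly. Instead, build a chat list in the Hugging Face conversation format and pass it to the tokenizer's apply_chat_template method.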

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-12B-it"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Build a conversation with a single user turn
chat = [
  {"role": "user", "content": "The capital of France is"},
]
# Apply the template with thinking mode on (disable enable_thinking if you do not need it)
ids = tok.apply_chat_template(chat, tokenize=True, return_dict=True, add_generation_prompt=True, return_tensors="pt", enable_thinking=True).to(model.device)

# Generate from the correctly formatted input
out = model.generate(**ids, max_new_tokens=50, do_sample=False)
print(tok.batch_decode(out))

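(Optional) Switch to the base model for raw-text evaluation:

If your use case requires perplexity computation on raw text, or evaluation without a chat format, use the base model google/gemma-4-12B (the non -it variant). It handles raw input better and does not degenerate when no template is applied.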

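Verification

Run model.generate() again on input processed with apply_chat_template. If the previously degenerate output (such as "1111111…") becomes meaningful continuous text (for example "The capital of France is Paris." or a longer response that includes the thinking process), the problem is solved. It is also worth computing the prompt's cross-entropy loss inside a with torch.no_grad(): block; if the value drops from 17.9 into the 2-5 range, that confirms the model's forward path is healthy and the format problem is gone.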

Reference

huggingface/transformers #46531
