LlamaConfig rejects explicit head_dim when hidden_size is not divisible by num_attention_heads

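Quick take: when LlamaConfig is given an explicit head_dim but hidden_size is not divisible by num_attention_heads, configuration validation wrongly rejects the config. First check whether head_dim is set explicitly, then make sure you are running a transformers version that includes the fix, or apply the patch described in the resolution steps.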

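Problem scenario

The error is triggered when a model is initialized through LlamaConfig with transformers 5.8.0.dev0, or the main branch at commit a0fb01c. A typical case is a custom script that passes a custom head_dim for LlamaForCausalLM while hidden_size is not divisible by num_attention_heads.

Error message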

ValueError: The hidden_size (512) is not a multiple of the number of attention heads (9).

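Root cause analysis

The problem lies in the validate_architecture method in configuration_llama.py. The method unconditionally checks that hidden_size is divisible by num_attention_heads, even when the user has supplied head_dim explicitly and the modeling code in LlamaAttention already sizes its projections from num_attention_heads * head_dim. The validation runs after __init__ through the @strict decorator. By that point head_dim has already been filled in with its default (hidden_size // num_attention_heads) whenever it was omitted, so a plain head_dim is not None check inside the validator cannot tell what the user intended. Root cause: the validation logic does not distinguish a head_dim passed explicitly by the user from a derived default.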
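
To make the mismatch concrete, here is a minimal sketch of the arithmetic using the values from the reproduction in this post. The helper name attention_projection_shapes is hypothetical and is not part of transformers; it only assumes the num_heads * head_dim sizing described above.

```python
def attention_projection_shapes(hidden_size, num_attention_heads,
                                num_key_value_heads, head_dim):
    """Return (in_features, out_features) for each attention projection.

    Hypothetical helper for illustration only. It assumes projections are
    sized from num_heads * head_dim, as the analysis above describes.
    """
    q_out = num_attention_heads * head_dim
    kv_out = num_key_value_heads * head_dim
    return {
        "q_proj": (hidden_size, q_out),
        "k_proj": (hidden_size, kv_out),
        "v_proj": (hidden_size, kv_out),
        "o_proj": (q_out, hidden_size),
    }


# Values from the reproduction: hidden_size=512, 9 heads, 1 KV head, head_dim=56.
shapes = attention_projection_shapes(512, 9, 1, 56)

# The divisibility test that the validator applies regardless of head_dim.
divisible = 512 % 9 == 0

print(shapes["q_proj"], divisible)
```

Under these assumptions every projection has a well-defined shape (q_proj maps 512 to 504 features and o_proj maps 504 back to 512) even though 512 is not a multiple of 9, which is why the unconditional check is stricter than the modeling code needs.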

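Environment checks

transformers version: check whether you are on 5.8.0.dev0 or earlier (main branch at or before commit a0fb01c).
Python version: not directly related to the error, but worth noting (3.13.12 in the Issue).
Dependency versions: make sure torch and transformers are installed correctly.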
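
The checks above can be collected in one place with the standard library. This is only a sketch: installed_version and environment_report are names made up for this example, and the package list is an assumption based on the dependencies mentioned here.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def environment_report(packages=("transformers", "torch")):
    """Collect the Python version and the versions of the given packages."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        report[name] = installed_version(name)
    return report


print(environment_report())
```

A value of None for transformers or torch means the package is not installed in the interpreter that ran the script, which is worth ruling out before looking at the validation error itself.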

Resolution steps

Confirm the cause: run the following code to check whether the error reproduces:

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=99,
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=1,
    num_attention_heads=9,
    num_key_value_heads=1,
    head_dim=56,
)
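# On affected versions the ValueError is expected; because validation runs
# right after __init__, it may already be raised by LlamaConfig(...) above.
model = LlamaForCausalLM(config)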

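Temporary workaround: make hidden_size divisible by num_attention_heads (for example, change num_attention_heads to 8 or 16). This does not help when you actually need a non-standard head_dim.
Recommended fix (manual patch): edit the __post_init__ and validate_architecture methods in configuration_llama.py. In __post_init__, add a private attribute _head_dim_was_explicit that records whether head_dim was passed explicitly, then check that flag in the validation logic. The changes are as follows (taken from the fix proposed in the Issue discussion):
Add to __post_init__: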

def __post_init__(self):
    self._head_dim_was_explicit = self.head_dim is not None  # capture user intent before the default is filled in
    if self.head_dim is None:
        self.head_dim = self.hidden_size // self.num_attention_heads

Change the condition in validate_architecture:

def validate_architecture(self):
    if self.hidden_size % self.num_attention_heads != 0 and not self._head_dim_was_explicit:
        raise ValueError(...)

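Check JSON save/load compatibility: _head_dim_was_explicit is a private attribute (not a dataclass field), so it is not serialized to JSON. When a config is loaded from JSON, head_dim is present in the file, so __post_init__ correctly sets _head_dim_was_explicit to True and there is no round-trip problem. A derived default that was written to JSON is likewise treated as explicit on reload, which should be harmless because such a config had already passed the divisibility check.
Wait for the PR to merge: a developer has already implemented this fix on their branch and it passes local tests, so keep an eye on the upstream repository for the merge. Until an official release ships, try the manual patch above first.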
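
The flag-based patch and the JSON round trip can be tried in isolation with a small stand-in class. This is a sketch under stated assumptions, not the transformers implementation: ToyConfig, to_json, and from_json are hypothetical, and validation is called directly from __post_init__ as a simplification of the @strict mechanism.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyConfig:
    """Hypothetical stand-in for LlamaConfig; not the transformers class."""

    hidden_size: int
    num_attention_heads: int
    head_dim: Optional[int] = None

    def __post_init__(self):
        # Record intent before the default overwrites None.
        self._head_dim_was_explicit = self.head_dim is not None
        if self.head_dim is None:
            self.head_dim = self.hidden_size // self.num_attention_heads
        # Simplification: validate right away instead of via a decorator.
        self.validate_architecture()

    def validate_architecture(self):
        not_divisible = self.hidden_size % self.num_attention_heads != 0
        if not_divisible and not self._head_dim_was_explicit:
            raise ValueError(
                f"The hidden_size ({self.hidden_size}) is not a multiple of "
                f"the number of attention heads ({self.num_attention_heads})."
            )

    def to_json(self):
        # asdict() only sees declared fields, so the private flag is omitted.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))


# An explicit head_dim is accepted and survives a save/load round trip.
explicit = ToyConfig(hidden_size=512, num_attention_heads=9, head_dim=56)
reloaded = ToyConfig.from_json(explicit.to_json())
print(reloaded.head_dim, reloaded._head_dim_was_explicit)

# Without head_dim, the non-divisible configuration is still rejected.
try:
    ToyConfig(hidden_size=512, num_attention_heads=9)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Keeping the flag as a plain instance attribute rather than a dataclass field is what keeps it out of the serialized output in this sketch; the reload path rebuilds it from the presence of head_dim in the JSON.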

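Verification

After applying the fix or the workaround, rerun the reproduction code above. It should no longer raise ValueError, and the model should build normally. You can go one step further and test saving the config and loading it back: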

config.save_pretrained("./test_config")
loaded_config = LlamaConfig.from_pretrained("./test_config")
model = LlamaForCausalLM(loaded_config)  # should load without errors

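You can also run the test suite in tests/models/llama/test_modeling_llama.py to confirm there are no regressions (everything is expected to pass apart from a few unrelated FSDP/TP failures).

References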

huggingface/transformers #46082
