RuntimeError: convolution(): expected the second dimension of the weight tensor

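Quick summary: this error occurs when Hugging Face Transformers loads a Qwen3.5 hybrid-attention model with tp_plan="auto" and a tensor-parallel degree TP>1. The cause is that base_model_tp_plan has no entries for the linear_attn projection layers. Check first whether base_model_tp_plan contains a parallel strategy for every linear_attn.* submodule.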

Problem scenario

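The user runs Transformers 5.12.1 inside an experimental framework, loads a Qwen3.5 model (for example Qwen3.5-9B), and hits the error at tensor-parallel degree TP=4. Qwen3.5 is a hybrid-attention model in which roughly 75% of the decoder layers use linear_attention (implemented by Qwen3_5GatedDeltaNet, i.e. Gated Delta Net / GDA). With tp_plan="auto", the weights of the linear_attn layers are not sharded correctly, which leads to out-of-memory failures and shape-mismatch errors.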

Error message

RuntimeError: convolution(): expected the second dimension of the weight tensor
of shape [8192, 1, 4] to be 0 (2048 in channels divided by 8192 groups), got 1

Additional error: OOM when loading the model at TP>1.
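The numbers in the message follow from how a grouped convolution sizes its weight: the second weight dimension has to equal in_channels // groups. The message reports an input with 2048 channels against 8192 groups. A minimal sketch of that arithmetic in plain Python (the helper name is hypothetical and is not a PyTorch function):

```python
def expected_weight_dim1(in_channels: int, groups: int) -> int:
    """Second dimension a grouped-convolution weight is expected to have."""
    return in_channels // groups


# Values taken from the error message: the convolution input has 2048 channels,
# while the layer still uses 8192 groups and a weight of shape [8192, 1, 4].
weight_shape = (8192, 1, 4)
sharded = expected_weight_dim1(in_channels=2048, groups=8192)
full = expected_weight_dim1(in_channels=8192, groups=8192)

print(sharded, weight_shape[1])  # the two values differ: the reported error
print(full, weight_shape[1])     # the two values agree: a full-width input fits
```

Integer division of 2048 by 8192 gives 0, which is why the message says the dimension was expected "to be 0" while the weight actually has 1.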

Root cause analysis

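Likely cause: the base_model_tp_plan defined in Qwen3_5TextConfig is incomplete. The plan covers only the self_attn.* and mlp.* layers and has no entries at all for the linear_attn.* submodules. For the hybrid Qwen3.5 model, this means the linear_attn projection matrices, which sit in roughly 75% of the decoder layers, take no part in tensor-parallel sharding. That causes two problems: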

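OOM: each TP rank holds a full copy of those weights, so with TP=4, for example, loading Qwen3.5-9B runs out of memory.
Shape error: even after adding a plain "colwise"/"rowwise" plan, the Conv1d layer in Qwen3_5GatedDeltaNet.forward expects 8192 channels (conv_dim = key_dim*2 + value_dim), whereas plain "colwise" sharding reduces this to 2048 channels per rank, so torch.split and the convolution no longer match.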
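The shape problem can be sketched without PyTorch. The snippet below assumes key_dim = 2048 and value_dim = 4096. These are hypothetical values chosen only so that conv_dim = key_dim*2 + value_dim equals the 8192 seen in the error message; the real model's values may differ.

```python
TP_SIZE = 4
KEY_DIM = 2048    # assumption for illustration only
VALUE_DIM = 4096  # assumption for illustration only

conv_dim = KEY_DIM * 2 + VALUE_DIM           # full channel count
split_sizes = [KEY_DIM, KEY_DIM, VALUE_DIM]  # sizes the forward pass splits into

# Plain "colwise" sharding leaves each rank with 1/TP_SIZE of the output channels.
channels_per_rank = conv_dim // TP_SIZE


def can_split(total: int, sizes: list[int]) -> bool:
    """A split into fixed sizes is only valid when the sizes sum to the total."""
    return sum(sizes) == total


print(conv_dim, channels_per_rank)
print(can_split(conv_dim, split_sizes))           # full tensor
print(can_split(channels_per_rank, split_sizes))  # per-rank shard
```

The split sizes are derived from the unsharded configuration, so they only add up when the tensor keeps its full channel count.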

Environment check

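Transformers version: 5.12.1
Python version: 3.12.13
Huggingface_hub version: 1.20.1
Safetensors version: 0.8.0
Accelerate version: 1.14.0
PyTorch version: 2.12.1 (CUDA acceleration not in use)
Check whether a distributed or parallel setup is in use (reported as not used in this case)
Confirm whether the model is loaded with tp_plan="auto" and a tensor-parallel degree TP>1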
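The versions above can be collected in one pass with the standard library. This is a small sketch using importlib.metadata; the package list mirrors the one above, and the function name is made up for this example.

```python
from importlib import metadata

PACKAGES = ["transformers", "huggingface_hub", "safetensors", "accelerate", "torch"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```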

Resolution steps

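Locate the file: edit src/transformers/models/qwen3_5/modular_qwen3_5.py (note that configuration_qwen3_5.py is auto-generated; its header says "Do NOT edit this file manually", and CI regenerates it from the modular source).
Modify base_model_tp_plan: in addition to the existing self_attn and mlp entries, add every linear_attn projection layer with the "colwise_gather_output" strategy. For example: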

base_model_tp_plan = {
    # ... existing self_attn and mlp entries stay unchanged ...
    "layers.*.linear_attn.in_proj_qkv": "colwise_gather_output",
    "layers.*.linear_attn.in_proj_z":   "colwise_gather_output",
    "layers.*.linear_attn.in_proj_b":   "colwise_gather_output",
    "layers.*.linear_attn.in_proj_a":   "colwise_gather_output",
    "layers.*.linear_attn.out_proj":    "colwise_gather_output",
}

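Understanding "colwise_gather_output": this strategy corresponds to ColwiseParallel(gather_output=True). It shards the weight matrix across the TP ranks and gathers the output activations back into a full tensor before the rest of the forward pass runs. This avoids changing the subsequent Conv1d, torch.split, dt_bias and A_log operations, which depend on the full channel count (for example 8192).
Possible alternative (speculative): for more efficient end-to-end parallelism in the future, a "colwise" + "colwise_conv" + "rowwise" plan could be considered, but the forward code would also have to change to support the sharded grouped convolution and torch.split. For now, "colwise_gather_output" has been merged as the quick fix.
Merged fix: see PR #46847, linked from the issue. It implements the change above and has been merged.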
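The idea behind sharding by output column and then gathering can be shown with plain Python lists. This is a toy sketch, not the Transformers implementation: it uses the mathematical x @ W convention with made-up 2x4 weights, and both helper names are hypothetical.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def colwise_shards(w, tp_size):
    """Split the output columns of w into tp_size equal, contiguous shards."""
    step = len(w[0]) // tp_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(tp_size)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]  # 2 input features, 4 output features

full = matmul(x, w)
# Each simulated rank computes only its own slice of the output features.
partial = [matmul(x, shard) for shard in colwise_shards(w, tp_size=2)]
# The gather step concatenates the slices back into the full-width output.
gathered = [value for part in partial for value in part]

print(full)
print(gathered)
```

Because the gathered output has the same width as the unsharded one, code that runs afterwards and assumes the full channel count does not need to change. The price is that the activations are replicated on every rank after the gather.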

Verification

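After the change, load a Qwen3.5 model with TP>1 (for example with tp_plan="auto" and tp_size=4) and confirm that loading no longer runs out of memory and that model.generate() no longer raises the RuntimeError: convolution() error. Running a full inference pass to confirm that the output is correct is also recommended.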
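Before launching a multi-GPU run, a cheap static check is to confirm that the plan actually lists the linear_attn projections. The sketch below works on any plan dictionary. The function name and the two sample plans are hypothetical, and the self_attn/mlp entries shown are placeholders rather than the real plan contents.

```python
REQUIRED_LINEAR_ATTN = ["in_proj_qkv", "in_proj_z", "in_proj_b", "in_proj_a", "out_proj"]


def missing_linear_attn_entries(tp_plan):
    """Return the linear_attn projections that have no entry in tp_plan."""
    return [
        name
        for name in REQUIRED_LINEAR_ATTN
        if f"layers.*.linear_attn.{name}" not in tp_plan
    ]


# Placeholder plans for illustration only.
old_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
new_plan = {
    **old_plan,
    **{f"layers.*.linear_attn.{name}": "colwise_gather_output"
       for name in REQUIRED_LINEAR_ATTN},
}

print(missing_linear_attn_entries(old_plan))  # every projection is missing
print(missing_linear_attn_entries(new_plan))  # nothing is missing
```

In a real check the dictionary would presumably come from the base_model_tp_plan attribute of the loaded config rather than from a hand-written sample. An empty result only shows that the entries exist; the TP>1 load and generate() run are still needed to confirm the fix.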