[Bug]: as_query_engine(streaming=True) buffers entire response into a single-item generator when using local LLM

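Quick summary: This problem typically appears when LlamaIndex's as_query_engine(streaming=True) is used with a local LLM such as Ollama. The resulting response_gen yields the complete response as a single string element instead of streaming it token by token. First check whether the installed llama-index-core predates the fix merged in May 2026, then upgrade to a release that includes PR #21758.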

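Scenario

The user is on macOS with LlamaIndex 0.14 and wraps a local 8B model (llama3.1:8b-instruct-q4_K_M) through Ollama. The problem is triggered by calling index.as_query_engine(streaming=True). The dependencies include llama-index (>=0.14.22,<0.15.0) and llama-index-llms-ollama (>=0.10.1,<0.11.0). Calling the LLM's .stream() method directly streams token by token as expected.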

Original report

response.response_gen blocks during the entire inference duration and returns a single-element list containing the full, completed paragraph:
['A purely peer-to-peer version of electronic cash that allows online payments to be sent directly from one party to another without going through a financial institution. It uses digital signatures and a proof-of-work chain to prevent double-spending and ensure the integrity of transactions.']

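Root cause

This is a known bug in LlamaIndex's CompactAndRefine (and Refine) ResponseSynthesizer. As confirmed in PR #21758 (merged May 2026), the synthesizer concatenated the entire LLM stream into one string before handing it to response_gen, which produces the single-element generator behavior. The fix adds direct streaming pass-through methods (stream_answer() / astream_answer()) to DefaultRefineProgram, so tokens flow incrementally from the LLM to the caller without buffering.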
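The difference between buffering and pass-through can be illustrated with plain Python generators. This is only a conceptual sketch, not the LlamaIndex source: fake_llm_stream, buffered, and passthrough are hypothetical names invented for the illustration, and the token values are made up.

```python
from typing import Iterable, Iterator

# Hypothetical stand-in for an LLM token stream (not a LlamaIndex API).
TOKENS = ["A purely ", "peer-to-peer ", "version ", "of electronic cash."]


def fake_llm_stream() -> Iterator[str]:
    """Yield tokens one at a time, the way a streaming LLM would."""
    yield from TOKENS


def buffered(stream: Iterable[str]) -> Iterator[str]:
    """Consume the whole stream first, then yield one joined string.

    This mirrors the reported symptom: a generator with a single item.
    """
    yield "".join(stream)


def passthrough(stream: Iterable[str]) -> Iterator[str]:
    """Forward each token as soon as it arrives, with no buffering."""
    yield from stream


print(list(buffered(fake_llm_stream())))     # one element: the full text
print(list(passthrough(fake_llm_stream())))  # one element per token
```

Both generators produce the same text when joined. Only the second lets the caller see output before inference finishes, which is the behavior the fix is described as restoring.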

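Environment checks

Python version: not stated explicitly in the issue, but LlamaIndex 0.14 is generally compatible with Python 3.9+.
LlamaIndex core version: confirm whether the installed llama-index-core was released after the May 2026 fix (for example, 0.14.22 or a later patch release).
Ollama integration version: make sure llama-index-llms-ollama includes the fix from PR #21303 (merged April 2026), which addressed streamed data possibly being lost when Ollama sends chunks with content=None.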

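Resolution steps

Upgrade the core packages: run pip install --upgrade llama-index-core llama-index-llms-ollama so that the fixes merged from May 2026 onward are included.
Check the version constraint: confirm that the llama-index >=0.14.22,<0.15.0 constraint has actually pulled in the latest patch release (such as 0.14.23 or higher).
Avoid unsupported response modes: ResponseMode.COMPACT_ACCUMULATE and ACCUMULATE explicitly do not support streaming. If you use either, switch to COMPACT or REFINE.
Test streaming: after upgrading, test directly whether response_gen yields output token by token.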
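The version checks in the steps above can be scripted with the standard library. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the simple parser only handles plain dotted versions, and the 0.14.23 threshold is the example figure mentioned above rather than a confirmed minimum fixed release.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.14.22' into (0, 14, 22); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_at_least(version_text: str, minimum_text: str) -> bool:
    """Compare two dotted version strings numerically."""
    return parse_version(version_text) >= parse_version(minimum_text)


# Assumption: 0.14.23 is only the example threshold from the steps above.
EXAMPLE_MINIMUM = "0.14.23"

for name in ("llama-index", "llama-index-core", "llama-index-llms-ollama"):
    found = installed_version(name)
    if found is None:
        print(f"{name}: not installed")
    elif name == "llama-index":
        verdict = "ok" if is_at_least(found, EXAMPLE_MINIMUM) else "older than example"
        print(f"{name}: {found} ({verdict})")
    else:
        print(f"{name}: {found}")
```

The script only reports what is installed. Whether a given release actually contains PR #21758 or PR #21303 still has to be confirmed against the project's release notes.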

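Verification

Run the following snippet and confirm that response_gen yields multiple string elements (rather than one complete paragraph) and that output appears without blocking while the model is generating: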

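# Assumes query_engine was created with index.as_query_engine(streaming=True)
response = query_engine.query("What is bitcoin?")
# Each chunk should print as soon as the model produces it
for chunk in response.response_gen:
    print(chunk, end="", flush=True)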

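If the output arrives incrementally, token by token (for example, collecting the chunks into a list gives something like ['A purely ', 'peer-to-peer ', 'version ', 'of...']), the problem is resolved.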
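Counting chunks and timing the first one gives a more objective check than watching the terminal. The helper below is a hypothetical sketch (summarize_stream is not part of LlamaIndex) and works on any iterable of strings, so it is demonstrated here with a stand-in generator rather than a real query engine.

```python
import time
from typing import Iterable


def summarize_stream(chunks: Iterable[str]) -> dict:
    """Consume a stream and record chunk count and time to first chunk."""
    start = time.perf_counter()
    seconds_to_first_chunk = None
    pieces = []
    for chunk in chunks:
        if seconds_to_first_chunk is None:
            seconds_to_first_chunk = time.perf_counter() - start
        pieces.append(chunk)
    return {
        "chunk_count": len(pieces),
        "seconds_to_first_chunk": seconds_to_first_chunk,
        "total_seconds": time.perf_counter() - start,
        "text": "".join(pieces),
    }


def demo_stream():
    """Stand-in for response.response_gen; yields three made-up chunks."""
    for token in ["A purely ", "peer-to-peer ", "version"]:
        time.sleep(0.01)  # simulate per-token latency
        yield token


report = summarize_stream(demo_stream())
print(report["chunk_count"], report["text"])
```

With a real engine, the call would be summarize_stream(response.response_gen). A chunk_count of 1 together with a seconds_to_first_chunk close to total_seconds matches the buffered symptom described in the issue, while many chunks and an early first chunk indicate that streaming works.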

References

run-llama/llama_index #22183

PR #21758: fix for CompactAndRefine streaming buffering

PR #21303: fix for Ollama streaming dropping chunks
