[Bug]: websearch_interception silently truncates streaming response on /v1/messages — follow-up call always uses stream=False

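Quick summary: this issue occurs when LiteLLM Proxy has websearch_interception enabled and a client such as Claude Code calls the /v1/messages endpoint with stream=True. First check whether _execute_agentic_loop in the websearch_interception handler explicitly passes the stream parameter.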

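Problem scenario

The user configures websearch_interception and a search tool (for example Tavily) in LiteLLM Proxy, then connects Claude Code to the proxy by setting ANTHROPIC_BASE_URL. When a message that triggers a web search is sent (for example "What are the latest AI news today?"), the tool call succeeds, but the final LLM response is silently truncated partway through, with no error log and no non-200 status code.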

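Error output

# The client (Claude Code) expects the response as SSE streaming chunks,
# but the follow-up call returns a non-streaming object (not an async iterable),
# so the client receives only part of the response body and the response is silently truncated partway through.
# No explicit error log, no non-200 HTTP status code.
# Messages that do not trigger a search (ordinary conversation) stream normally.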

Root cause analysis

The bug lives in litellm/integrations/websearch_interception/handler.py and results from three steps acting together:

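Step 1 (pre-hook converts stream): in the pre-hook, if kwargs.get("stream") is True, the code sets it to False and marks _websearch_interception_converted_stream = True so that the streaming response can be rebuilt later.
Step 2 (internal flags are stripped): when _prepare_followup_kwargs builds the kwargs for the follow-up call, it filters out every key that starts with _websearch_interception. This removes the _websearch_interception_converted_stream flag, but stream=False stays because it has no special prefix.
Step 3 (follow-up does not pass stream): _execute_agentic_loop calls anthropic_messages.acreate without explicitly passing the stream parameter, so the call uses the stream=False left over in kwargs. The flag that anthropic_messages would otherwise use to compute original_stream has already been stripped, so the follow-up always runs in non-streaming mode. The client expects SSE chunks, receives a non-streaming object instead, and the response is truncated.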
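
The interaction of the three steps can be reproduced in isolation. The sketch below is not LiteLLM code: pre_hook, prepare_followup_kwargs and followup_call are hypothetical stand-ins that only mimic the behavior described above, assuming the prefix filter and the stream conversion work as the issue describes.

```python
INTERNAL_PREFIX = "_websearch_interception"


def pre_hook(kwargs: dict) -> dict:
    """Step 1 stand-in: turn streaming off and record that the caller asked for it."""
    out = dict(kwargs)
    if out.get("stream"):
        out["stream"] = False
        out[f"{INTERNAL_PREFIX}_converted_stream"] = True
    return out


def prepare_followup_kwargs(kwargs: dict) -> dict:
    """Step 2 stand-in: drop internal keys; 'stream' has no prefix, so it survives."""
    return {k: v for k, v in kwargs.items() if not k.startswith(INTERNAL_PREFIX)}


def followup_call(**kwargs) -> str:
    """Step 3 stand-in: report which mode the follow-up call would run in."""
    return "streaming" if kwargs.get("stream") else "non-streaming"


client_kwargs = {"model": "example-model", "stream": True}  # what the client sent
after_hook = pre_hook(client_kwargs)
followup_kwargs = prepare_followup_kwargs(after_hook)
mode = followup_call(**followup_kwargs)
print(mode)
```

Although the client asked for stream=True, the follow-up ends up in non-streaming mode, because the only record of the original intent was the prefixed flag removed in Step 2.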

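Environment checks

LiteLLM version (confirm that it contains the affected code path, namely _execute_agentic_loop in websearch_interception/handler.py).
Search tool provider in use (for example Tavily) and its configuration.
Client tool (for example Claude Code) version and ANTHROPIC_BASE_URL setting.
Trigger condition: the problem appears only when a message triggers a web search tool call; ordinary conversation streams normally.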

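Resolution steps

The fix below is based on the root cause confirmed in the issue discussion. It has not been merged into the main branch and should be treated as a fix worth trying first.

Open litellm/integrations/websearch_interception/handler.py and locate the _execute_agentic_loop function (around lines 758-764).

When calling anthropic_messages.acreate, explicitly pass the stream argument so that it overrides the stream=False left over in kwargs. For example: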

original_stream = kwargs.get("_websearch_interception_converted_stream", False) or stream

return await anthropic_messages.acreate(
    max_tokens=max_tokens,
    messages=request_patch.messages,
    model=request_patch.model or model,
    stream=original_stream,   # explicitly pass the caller's original stream intent
    **optional_params,
    **request_patch.kwargs,
)

Alternatively, exclude the stream key when constructing kwargs_for_followup and rely only on the explicitly passed stream=stream argument.
Save the change and restart LiteLLM Proxy.
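
One detail worth checking when combining the two approaches: in Python, passing stream explicitly while a stream key is also present in an unpacked kwargs dict raises a TypeError. The sketch below illustrates this with fake_acreate, a hypothetical stand-in for the real call (it is not the LiteLLM signature), and shows the effect of dropping the leftover key first.

```python
def fake_acreate(*, stream: bool = False, **kwargs) -> dict:
    """Hypothetical stand-in for the follow-up call; echoes what it received."""
    return {"stream": stream, **kwargs}


# kwargs as they might look after the pre-hook, with the stale stream=False.
leftover = {"stream": False, "temperature": 0.2}

duplicate_error = None
try:
    # Explicit stream plus a 'stream' key inside **leftover is a duplicate keyword.
    fake_acreate(stream=True, **leftover)
except TypeError as exc:
    duplicate_error = str(exc)

# Dropping the stale key first leaves the explicit argument as the only source.
followup = {k: v for k, v in leftover.items() if k != "stream"}
result = fake_acreate(stream=True, **followup)
print(duplicate_error)
print(result)
```

If the real follow-up kwargs can still contain stream, removing that key before unpacking avoids both the stale value and the duplicate-keyword error.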

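Verification

Repeat the trigger steps: send a message through Claude Code that is expected to trigger a web search (for example "What are the latest AI news today?") and check whether the final LLM response is returned in full without truncation. Also check the logs to confirm there are no non-200 status codes or error records. If ordinary conversation still streams normally, the fix has not introduced a regression.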
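
For a check that does not rely on eyeballing the client output, the raw response body can be captured (for example with a curl request against the proxy) and inspected for the Anthropic streaming event sequence. The helper below is a minimal sketch under that assumption: it treats a body as complete only if it contains a message_start event and ends with a message_stop event. The function names and sample bodies are illustrative, not part of LiteLLM.

```python
def sse_event_names(raw: str) -> list[str]:
    """Collect the names from 'event:' lines of a raw SSE body."""
    names = []
    for line in raw.splitlines():
        if line.startswith("event:"):
            names.append(line[len("event:"):].strip())
    return names


def looks_complete(raw: str) -> bool:
    """True if the body has a message_start and its last event is message_stop."""
    names = sse_event_names(raw)
    return "message_start" in names and bool(names) and names[-1] == "message_stop"


# Illustrative bodies; the data payloads are shortened placeholders.
complete_body = (
    "event: message_start\ndata: {}\n\n"
    "event: content_block_delta\ndata: {}\n\n"
    "event: message_stop\ndata: {}\n\n"
)
truncated_body = (
    "event: message_start\ndata: {}\n\n"
    "event: content_block_delta\ndata: {}\n\n"
)
non_streaming_body = '{"type": "message", "content": []}'

print(looks_complete(complete_body))
print(looks_complete(truncated_body))
print(looks_complete(non_streaming_body))
```

A body that is plain JSON, or that stops before message_stop, suggests the follow-up call is still not being streamed back to the client.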