[BUG] `kickoff` hangs when LLM call fails

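Quick takeaway: when an LLM call fails (for example, because of an invalid API key), kickoff does not raise and exit; it hangs indefinitely. Start by checking whether the iteration counter in _invoke_loop fails to increment on errors.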

Problem scenario

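When a user runs a Crew task with kickoff() in the CrewAI framework and an LLM call (for example, a request to the OpenAI API) raises an error because of an authentication failure or another exception, the whole run hangs instead of exiting cleanly or raising an exception.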

Error output

ERROR:root:LiteLLM call failed: litellm.AuthenticationError: AuthenticationError: OpenAIException - Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-fake. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True'.

Root cause analysis

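The analysis in the issue points to a flaw in the implementation of the _invoke_loop method:

self._process_llm_response(answer) increments the iteration counter, but if self._get_llm_response() raises, the counter is never incremented, so the loop can never reach its maximum iteration count and hangs indefinitely.
The exception handler does not propagate the exception upward, so the framework cannot tell that the LLM call failed and keeps waiting for the next call.
Possible cause: this is a cascading-failure pattern, in which a failed LLM call is not followed by error handling, a retry timeout, or silent-failure detection.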
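
The failure mode described above can be reproduced in a toy model. The sketch below is not the CrewAI source: ToyExecutor, max_iter, attempts, and safety_cap are hypothetical names invented for illustration, and the safety cap exists only so that the demo terminates. It shows that when the counter is incremented only on the success path and the handler swallows the error, the loop condition never changes.

```python
# Simplified model of the loop described in the issue; NOT the actual CrewAI code.
# Every name here (ToyExecutor, max_iter, attempts, safety_cap) is hypothetical.

class ToyExecutor:
    def __init__(self, max_iter=3):
        self.iterations = 0   # the counter the loop condition depends on
        self.max_iter = max_iter
        self.attempts = 0     # counts every pass through the loop (demo only)

    def _get_llm_response(self):
        # Simulate an LLM call that always fails, e.g. with a bad API key.
        raise RuntimeError("401: invalid API key")

    def _process_llm_response(self, answer):
        self.iterations += 1  # only reached when the LLM call succeeded
        return answer

    def invoke_loop(self, safety_cap=1000):
        while self.iterations < self.max_iter:
            self.attempts += 1
            if self.attempts >= safety_cap:
                break  # demo-only exit; without it this loop never ends
            try:
                answer = self._get_llm_response()
                self._process_llm_response(answer)
            except Exception:
                continue  # error swallowed, counter untouched
        return self.iterations, self.attempts


iterations, attempts = ToyExecutor().invoke_loop()
print(iterations, attempts)  # the counter stays at 0 while attempts hits the cap
```

In this model the loop only stops because of the artificial cap; the iteration counter never moves, which matches the hang described in the issue.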

Environment

Python 3.10
CrewAI 0.98.0
Ubuntu 20.04

Resolution steps

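Try first: set litellm.set_verbose=True (a module-level flag in LiteLLM, as the log message above suggests) before running the Crew to get more detailed debug output, and confirm whether an authentication problem such as a wrong API key is the cause.
Inspect the implementation of _invoke_loop (see src/crewai/agents/crew_agent_executor.py, around lines 110-150) and make sure that the iteration counter is incremented when an exception occurs, or that the exception handler re-raises to abort the run.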

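Following a suggestion from the issue comments, implement a resilient LLM wrapper with a timeout and retries, for example (llm.agenerate here stands for whatever async call your LLM client exposes):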

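import asyncio

async def resilient_llm_call(llm, prompt, max_retries=3, timeout=60):
    # Retry with a per-attempt timeout and exponential backoff (1s, 2s, 4s, ...).
    for attempt in range(max_retries):
        try:
            return await asyncio.wait_for(llm.agenerate(prompt), timeout)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: propagate so the caller sees the failure
            await asyncio.sleep(2 ** attempt)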

Upgrade to a newer version of CrewAI, or watch for the official patch for this issue.
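
As a rough sketch of the "raise in the exception handler" option from the steps above, the loop can be bounded independently of whether a call succeeds, and a failed LLM call can abort the run immediately. This is an illustration under stated assumptions, not the official patch: LLMCallError, invoke_loop, get_response, and process_response are hypothetical names, and process_response is assumed to return None when another iteration is needed.

```python
class LLMCallError(RuntimeError):
    """Hypothetical wrapper error raised when the LLM call itself fails."""


def invoke_loop(get_response, process_response, max_iter=3):
    """Bounded loop: a failed LLM call aborts instead of spinning.

    get_response and process_response are hypothetical stand-ins for the
    executor's LLM call and response handling. process_response returns a
    final answer, or None to request another iteration.
    """
    for _ in range(max_iter):  # the bound holds whether or not a call succeeds
        try:
            answer = get_response()
        except Exception as exc:
            # Propagate instead of swallowing, so the caller sees the failure.
            raise LLMCallError(f"LLM call failed: {exc}") from exc
        result = process_response(answer)
        if result is not None:
            return result
    raise RuntimeError(f"no final answer after {max_iter} iterations")


def failing_call():
    # Simulated authentication failure.
    raise PermissionError("401: invalid API key")


try:
    invoke_loop(failing_call, lambda answer: answer)
except LLMCallError as err:
    print("aborted:", err)
```

Using a for loop over range(max_iter) removes the dependency on a counter that only the success path updates, and chaining with "from exc" keeps the original error available for debugging.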

Verification

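Minimal reproduction: initialize an Agent with an invalid API key (such as sk-fake) and call kickoff(). The expected behavior is that an exception is raised and the program exits, rather than hanging with no response. If the program fails fast after the fix, the problem is resolved.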
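
To tell "fails fast" from "hangs" without waiting by hand, the check can be wrapped in a small watchdog. The sketch below is one possible approach using only the standard library; finishes_within, fails_fast, and hangs are hypothetical helpers, and the two stand-in functions simulate the fixed and the broken behavior. In a real check you would pass your own Crew's kickoff method as fn, which is assumed here and not shown.

```python
import threading
import time


def finishes_within(fn, timeout_s):
    """Run fn in a daemon thread and report (finished, error).

    finished is False if fn is still running after timeout_s seconds.
    error is the exception fn raised, or None.
    """
    outcome = {}

    def target():
        try:
            outcome["result"] = fn()
        except Exception as exc:
            outcome["error"] = exc

    # A daemon thread does not block interpreter exit if fn never returns.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return not worker.is_alive(), outcome.get("error")


def fails_fast():
    # Stand-in for the fixed behavior: the error surfaces immediately.
    raise PermissionError("401: invalid API key")


def hangs():
    # Stand-in for the buggy behavior: the call does not return in time.
    time.sleep(2)


print(finishes_within(fails_fast, 0.5))
print(finishes_within(hangs, 0.2))
```

A result of finished=True together with a non-None error corresponds to the expected post-fix behavior; finished=False corresponds to the hang.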

References

crewAIInc/crewAI #1934
