**Experiment Goal:** Determine whether LLMs can autonomously perform root cause analysis (RCA) on live application anomalies.
Four LLMs were given access to OpenTelemetry data from a demo application:
* They were prompted with a naive instruction: "Identify the issue, root cause, and suggest solutions." (A minimal sketch of this setup follows the list.)
* Four distinct anomalies were used, each with a known root cause established through manual investigation.
* Performance was measured along four axes: accuracy, amount of guidance required, token usage, and investigation time.
* Models: Claude Sonnet 4, OpenAI o3, OpenAI GPT-4.1, Gemini 2.5 Pro
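To make the setup concrete, here is a minimal sketch of issuing the naive prompt and recording the metrics above. It is an illustration, not the experiment's actual harness: the model ID, system prompt, and per-token rates are assumptions, and it passes telemetry as inline context rather than giving the model tool access to the observability stack.

```python
import time
from openai import OpenAI  # assumes the `openai` Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The naive instruction used in the experiment, sent verbatim.
PROMPT = "Identify the issue, root cause, and suggest solutions."

# Placeholder per-million-token rates for illustration only; real pricing
# differs by model and changes over time.
INPUT_RATE_USD, OUTPUT_RATE_USD = 2.00, 8.00

def run_trial(model: str, telemetry_context: str) -> dict:
    """Send one naive RCA prompt and record time, tokens, and rough cost."""
    start = time.monotonic()
    response = client.chat.completions.create(
        model=model,
        messages=[
            # Hypothetical system prompt; the original wording is not given.
            {"role": "system", "content": "You are an SRE investigating a live anomaly."},
            {"role": "user", "content": f"{telemetry_context}\n\n{PROMPT}"},
        ],
    )
    elapsed = time.monotonic() - start
    usage = response.usage
    cost = (usage.prompt_tokens * INPUT_RATE_USD
            + usage.completion_tokens * OUTPUT_RATE_USD) / 1e6
    return {
        "model": model,
        "seconds": round(elapsed, 1),
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "approx_cost_usd": round(cost, 4),
        "answer": response.choices[0].message.content,
    }
```

A fuller harness would loop this over all four anomalies and every model, and would expose query tools so the model can investigate the telemetry on its own rather than receiving it inline.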
**Key Findings:**
* **Autonomous RCA is not yet reliable.** The LLMs generally fell short of replacing SREs; even GPT-5, though not explicitly tested, is implied as a benchmark that would not outperform the others.
* **LLMs are useful as assistants.** They can help summarize findings, draft updates, and suggest next steps.
* **A fast, searchable observability stack (like ClickStack) is crucial.** LLMs need quick access to good data to be effective (see the example query after this list).
* **Models varied in performance:**
* Claude Sonnet 4 and OpenAI o3 were the most successful, often identifying the root cause with minimal guidance.
* GPT-4.1 and Gemini 2.5 Pro required more prompting and struggled to query data independently.
* **Models can get stuck in reasoning loops.** They may focus on one aspect of the problem and miss other important clues.
* **Token usage and cost varied significantly.**
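As a sketch of the kind of lookup a fast, searchable stack makes cheap (the data-access point above), the following pulls recent error spans from a ClickHouse-backed OpenTelemetry store. The connection details, table name (`otel_traces`), and column names follow the OpenTelemetry ClickHouse exporter's common defaults and are assumptions here, not details confirmed by the experiment.

```python
import clickhouse_connect  # assumes the `clickhouse-connect` driver is installed

# Hypothetical connection details; adjust for your ClickStack deployment.
client = clickhouse_connect.get_client(host="localhost", username="default", password="")

# Fetch the most recent error spans for one service, so an investigation
# (human or LLM) starts from concrete evidence instead of guesses.
# Note: the StatusCode value ('Error' vs. 'STATUS_CODE_ERROR') depends on
# the exporter version.
result = client.query(
    """
    SELECT Timestamp, ServiceName, SpanName, StatusMessage, Duration
    FROM otel_traces
    WHERE ServiceName = %(service)s
      AND StatusCode = 'Error'
      AND Timestamp > now() - INTERVAL 1 HOUR
    ORDER BY Timestamp DESC
    LIMIT 50
    """,
    parameters={"service": "payment"},
)

for ts, service, span, status, duration_ns in result.result_rows:
    print(ts, service, span, status, duration_ns)
```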
**Specific Anomaly Results (briefly):**
* **Anomaly 1 (Payment Failure):** Claude Sonnet 4 and OpenAI o3 solved it on the first prompt. GPT-4.1 and Gemini 2.5 Pro needed guidance.
* **Anomaly 2 (Recommendation Cache Leak):** Claude Sonnet 4 identified the service restart issue but missed the cache problem initially. OpenAI o3 identified the memory leak. GPT-4.1 and Gemini 2.5 Pro struggled.