**Experiment Goal:** Determine if LLMs can autonomously perform root cause analysis (RCA) on live application telemetry.
Four LLMs were given access to OpenTelemetry data from a demo application (a minimal harness sketch follows this list):
*   They were prompted with a naive instruction: "Identify the issue, root cause, and suggest solutions."
*   Four distinct anomalies were used, each with a known root cause established through manual investigation.
*   Performance was measured by accuracy, the amount of guidance required, token usage, and investigation time.
*   Models: Claude Sonnet 4, OpenAI o3, OpenAI GPT-4.1, and Gemini 2.5 Pro.
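To ground the setup, here is a minimal sketch of what such a harness could look like. It is illustrative, not the experiment's actual code: the `otel_traces` table and its columns assume the default schema of the OpenTelemetry ClickHouse exporter, and `run_llm` is a hypothetical stand-in for whichever model API is under test.

```python
import clickhouse_connect

# Connect to the ClickHouse instance backing the observability stack
# (ClickStack stores OTel data in ClickHouse).
client = clickhouse_connect.get_client(host="localhost", username="default", password="")

NAIVE_PROMPT = "Identify the issue, root cause, and suggest solutions."

def query_telemetry(sql: str) -> list:
    """Tool exposed to the model: run read-only SQL over the OTel tables."""
    return client.query(sql).result_rows

# The kind of aggregate a model asks for early in an investigation:
# error counts per service over the last hour.
errors_by_service = query_telemetry("""
    SELECT ServiceName, count() AS errors
    FROM otel_traces
    WHERE StatusCode = 'STATUS_CODE_ERROR'  -- status value per default exporter schema
      AND Timestamp > now() - INTERVAL 1 HOUR
    GROUP BY ServiceName
    ORDER BY errors DESC
""")

# Hypothetical model call, with query_telemetry registered as a tool:
# answer = run_llm(NAIVE_PROMPT, tools=[query_telemetry])
```

The deliberately thin tool matches the naive-prompt design: the model, not the harness, has to decide which queries to run.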
**Key Findings:**
*   **Autonomous RCA is not yet reliable.** The LLMs generally fell short of replacing SREs, and nothing in the results suggests a newer model such as GPT-5 (not tested here) would fare meaningfully better.
*   **LLMs are useful as assistants.** They can help summarize findings, draft updates, and suggest next steps.
*   **A fast, searchable observability stack (like ClickStack) is crucial.**  LLMs need access to good data to be effective.
*   **Models varied in performance:**
    *   Claude Sonnet 4 and OpenAI o3 were the most successful, often identifying the root cause with minimal guidance.
    *   GPT-4.1 and Gemini 2.5 Pro required more prompting and struggled to query data independently.
*   **Models can get stuck in reasoning loops.** They may focus on one aspect of the problem and miss other important clues.
*   **Token usage and cost varied significantly.** (A budget guardrail is sketched after this list.)
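The last two findings argue for simple guardrails: cap the number of investigation turns and track cumulative token spend, so a model that fixates on one hypothesis is stopped rather than left to burn budget. A minimal sketch, assuming a hypothetical `run_llm_step` wrapper that performs one model turn and returns the reply plus the token count the API reports:

```python
MAX_STEPS = 15          # hard cap on investigation turns
TOKEN_BUDGET = 200_000  # abort once cumulative usage crosses this

def investigate(prompt: str) -> str:
    history, tokens_used = [prompt], 0
    for step in range(MAX_STEPS):
        # run_llm_step is hypothetical: one model turn, returning the reply
        # and the prompt+completion token count reported by the API.
        reply, step_tokens = run_llm_step(history)
        tokens_used += step_tokens
        history.append(reply)
        if "ROOT CAUSE:" in reply:  # model signals it is done
            return reply
        if tokens_used > TOKEN_BUDGET:
            return f"Aborted after {step + 1} steps ({tokens_used} tokens)."
    return "Step limit reached without a root cause; human review needed."
```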
**Specific Anomaly Results (briefly):**
*   **Anomaly 1 (Payment Failure):** Claude Sonnet 4 and OpenAI o3 solved it on the first prompt. GPT-4.1 and Gemini 2.5 Pro needed guidance.
*   **Anomaly 2 (Recommendation Cache Leak):**  Claude Sonnet 4 identified the service restart issue but missed the cache problem initially. OpenAI o3 identified the memory leak. GPT-4.1 and Gemini 2.5 Pro struggled.
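For flavor, here is the kind of check that separates Anomaly 2 from a transient failure: a memory gauge that climbs steadily for one service, resetting only on restarts, points at a leak. A sketch reusing the `query_telemetry` helper from the first block; the `otel_metrics_gauge` table assumes the OpenTelemetry ClickHouse exporter's default metrics schema, and the metric name is hypothetical.

```python
# Hourly average memory per service; a monotonic climb for one service
# (reset only by restarts) is the signature of Anomaly 2's leak.
memory_trend = query_telemetry("""
    SELECT
        ResourceAttributes['service.name'] AS service,
        toStartOfHour(TimeUnix) AS hour,
        avg(Value) AS avg_memory_bytes
    FROM otel_metrics_gauge
    WHERE MetricName = 'process.memory.usage'  -- hypothetical metric name
    GROUP BY service, hour
    ORDER BY service, hour
""")
```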