klotz: are*

0 bookmark(s) - Sort by: Date ↓ / Title / - Bookmarks from other users for this tag

  1. **Experiment Goal:** Determine if LLMs can autonomously perform root cause analysis (RCA) on live application

    Five LLMs were given access to OpenTelemetry data from a demo application,:
    * They were prompted with a naive instruction: "Identify the issue, root cause, and suggest solutions."
    * Four distinct anomalies were used, each with a known root cause established through manual investigation.
    * Performance was measured by: accuracy, guidance required, token usage, and investigation time.
    * Models: Claude Sonnet 4, OpenAI GPT-o3, OpenAI GPT-4.1, Gemini 2.5 Pro

    * **Autonomous RCA is not yet reliable.** The LLMs generally fell short of replacing SREs. Even GPT-5 (not explicitly tested, but implied as a benchmark) wouldn't outperform the others.
    * **LLMs are useful as assistants.** They can help summarize findings, draft updates, and suggest next steps.
    * **A fast, searchable observability stack (like ClickStack) is crucial.** LLMs need access to good data to be effective.
    * **Models varied in performance:**
    * Claude Sonnet 4 and OpenAI o3 were the most successful, often identifying the root cause with minimal guidance.
    * GPT-4.1 and Gemini 2.5 Pro required more prompting and struggled to query data independently.
    * **Models can get stuck in reasoning loops.** They may focus on one aspect of the problem and miss other important clues.
    * **Token usage and cost varied significantly.**

    **Specific Anomaly Results (briefly):**

    * **Anomaly 1 (Payment Failure):** Claude Sonnet 4 and OpenAI o3 solved it on the first prompt. GPT-4.1 and Gemini 2.5 Pro needed guidance.
    * **Anomaly 2 (Recommendation Cache Leak):** Claude Sonnet 4 identified the service restart issue but missed the cache problem initially. OpenAI o3 identified the memory leak. GPT-4.1 and Gemini 2.5 Pro struggled.

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: Tags: are

About - Propulsed by SemanticScuttle