Tags: gemini* + claude*


  1. **Experiment Goal:** Determine whether LLMs can autonomously perform root cause analysis (RCA) on live applications.

    Five LLMs were given access to OpenTelemetry data from a demo application:
    * They were prompted with a naive instruction: "Identify the issue, root cause, and suggest solutions."
    * Four distinct anomalies were used, each with a known root cause established through manual investigation.
    * Performance was measured by: accuracy, guidance required, token usage, and investigation time.
    * Models: Claude Sonnet 4, OpenAI o3, OpenAI GPT-4.1, Gemini 2.5 Pro

    **Key Findings:**

    * **Autonomous RCA is not yet reliable.** The LLMs generally fell short of replacing SREs. Even GPT-5 (not explicitly tested, but implied as a benchmark) is not expected to outperform the others.
    * **LLMs are useful as assistants.** They can help summarize findings, draft updates, and suggest next steps.
    * **A fast, searchable observability stack (like ClickStack) is crucial.** LLMs need access to good data to be effective.
    * **Models varied in performance:**
        * Claude Sonnet 4 and OpenAI o3 were the most successful, often identifying the root cause with minimal guidance.
        * GPT-4.1 and Gemini 2.5 Pro required more prompting and struggled to query data independently.
    * **Models can get stuck in reasoning loops.** They may focus on one aspect of the problem and miss other important clues.
    * **Token usage and cost varied significantly.**

    **Specific Anomaly Results (briefly):**

    * **Anomaly 1 (Payment Failure):** Claude Sonnet 4 and OpenAI o3 solved it on the first prompt. GPT-4.1 and Gemini 2.5 Pro needed guidance.
    * **Anomaly 2 (Recommendation Cache Leak):** Claude Sonnet 4 identified the service restart issue but missed the cache problem initially. OpenAI o3 identified the memory leak. GPT-4.1 and Gemini 2.5 Pro struggled.
  2. An analysis of how well different AI systems perform in describing images and answering questions about them. The article compares ChatGPT, Gemini, Llama, and Claude using four images: a hand, a bottle of wine, a piece of pastry, and a flower.


SemanticScuttle - klotz.me: tagged with "gemini+claude"
