klotz: gemini* + openai*


  1. This article presents findings from a survey of over 900 software engineers regarding their use of AI tools. Key findings include the dominance of Claude Code, the mainstream adoption of AI in software engineering (95% weekly usage), the increasing use of AI agents (especially among staff+ engineers), and the influence of company size on tool choice. The survey also reveals which tools engineers love, with Claude Code being particularly favored, and provides demographic information about the respondents. A longer, 35-page report with additional details is available for full subscribers.
  2. Google is accusing others of cloning its Gemini AI, despite its own history of scraping data without permission to train its models. This raises questions of hypocrisy as companies compete to protect their AI investments and differentiate their offerings, facing challenges like model distillation and the potential for smaller entities to compete.
  3. Simon Willison’s annual review of the major trends, breakthroughs, and cultural moments in the large language model ecosystem in 2025, covering reasoning models, coding agents, CLI tools, Chinese open‑weight models, image editing, academic competition wins, and the rise of AI‑enabled browsers.
  4. This tutorial explores implementing the LLM Arena-as-a-Judge approach to evaluate large language model outputs using head-to-head comparisons. It demonstrates using OpenAI’s GPT-4.1 and Gemini 2.5 Pro, judged by GPT-5, in a customer support scenario.
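The head-to-head setup can be sketched as a pairwise judging loop: present both candidate answers to a judge model and parse its verdict. This is a minimal sketch, not the tutorial's code; `build_judge_prompt` and `parse_verdict` are illustrative names, and the sample question and answers are invented.

```python
# Hypothetical sketch of the Arena-as-a-Judge setup: two candidate answers
# are placed side by side and a third "judge" model picks a winner.

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a head-to-head comparison prompt for the judge model."""
    return (
        "You are an impartial judge of customer-support replies.\n"
        f"Customer question:\n{question}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}\n\n"
        "Reply with exactly 'A' or 'B' for the better answer, or 'TIE'."
    )

def parse_verdict(judge_reply: str) -> str:
    """Normalize the judge model's raw reply to 'A', 'B', or 'TIE'."""
    token = judge_reply.strip().upper()
    return token if token in {"A", "B", "TIE"} else "TIE"

prompt = build_judge_prompt(
    "My order arrived damaged. What should I do?",
    "Please contact support.",  # e.g. one model's draft
    "Sorry to hear that! Reply with a photo and your order number "
    "and we'll ship a replacement.",  # e.g. the other model's draft
)
# `prompt` would be sent to the judge (GPT-5 in the tutorial); here we
# just simulate a raw judge reply and normalize it:
print(parse_verdict(" b \n"))  # → B
```

In the tutorial's scenario the two answers come from GPT-4.1 and Gemini 2.5 Pro, and the `prompt` string is what gets sent to the GPT-5 judge.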
  5. **Experiment Goal:** Determine whether LLMs can autonomously perform root cause analysis (RCA) on a live application.

    Five LLMs were given access to OpenTelemetry data from a demo application:
    * They were prompted with a naive instruction: "Identify the issue, root cause, and suggest solutions."
    * Four distinct anomalies were used, each with a known root cause established through manual investigation.
    * Performance was measured by accuracy, guidance required, token usage, and investigation time.
    * Models: Claude Sonnet 4, OpenAI o3, OpenAI GPT-4.1, Gemini 2.5 Pro

    **Key findings:**

    * **Autonomous RCA is not yet reliable.** The LLMs generally fell short of replacing SREs; the author suggests that even GPT-5 (not explicitly tested, but implied as a benchmark) would not outperform the others.
    * **LLMs are useful as assistants.** They can help summarize findings, draft updates, and suggest next steps.
    * **A fast, searchable observability stack (like ClickStack) is crucial.** LLMs need access to good data to be effective.
    * **Models varied in performance:**
      * Claude Sonnet 4 and OpenAI o3 were the most successful, often identifying the root cause with minimal guidance.
      * GPT-4.1 and Gemini 2.5 Pro required more prompting and struggled to query data independently.
    * **Models can get stuck in reasoning loops.** They may focus on one aspect of the problem and miss other important clues.
    * **Token usage and cost varied significantly.**

    **Specific Anomaly Results (briefly):**

    * **Anomaly 1 (Payment Failure):** Claude Sonnet 4 and OpenAI o3 solved it on the first prompt. GPT-4.1 and Gemini 2.5 Pro needed guidance.
    * **Anomaly 2 (Recommendation Cache Leak):** Claude Sonnet 4 identified the service restart issue but missed the cache problem initially. OpenAI o3 identified the memory leak. GPT-4.1 and Gemini 2.5 Pro struggled.
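The measurement loop described above (naive prompt first, then guidance as needed, tracking tokens) can be sketched as follows. This is an illustrative reconstruction, not the article's code: `ask_model` is a hypothetical stub standing in for a real LLM call over OpenTelemetry data, and the model names, anomaly label, and token figures are invented.

```python
# Sketch of the experiment's per-model, per-anomaly trial loop.

NAIVE_PROMPT = "Identify the issue, root cause, and suggest solutions."

def ask_model(model: str, anomaly: str, hints: int) -> dict:
    """Hypothetical stub for a real LLM call against observability data.
    Pretends stronger models solve on the first prompt, weaker ones
    only after repeated guidance."""
    solved = model in {"claude-sonnet-4", "o3"} or hints >= 2
    return {"solved": solved, "tokens": 1500 + 800 * hints}

def run_trial(model: str, anomaly: str, max_hints: int = 3) -> dict:
    """Re-prompt with extra guidance until the root cause is found,
    accumulating token usage along the way."""
    tokens, hints = 0, 0
    while hints <= max_hints:
        reply = ask_model(model, anomaly, hints)
        tokens += reply["tokens"]
        if reply["solved"]:
            break
        hints += 1
    return {"model": model, "anomaly": anomaly,
            "hints": hints, "tokens": tokens}

# A model that needs guidance accumulates more hint turns and tokens:
result = run_trial("gpt-4.1", "payment-failure")
print(result["hints"], result["tokens"])  # → 2 6900
```

The `hints` and `tokens` fields correspond to the article's "guidance required" and "token usage" metrics; accuracy and wall-clock investigation time would be recorded per trial the same way.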
