At GrafanaCON 2026, Grafana Labs announced significant updates including the launch of Grafana 13 and a major architectural overhaul for Loki. The new Loki design moves away from replication-at-ingestion toward using Kafka as a durability layer to reduce data duplication and improve query performance. Additionally, the company introduced GCX, a new CLI tool in public preview designed to integrate observability data directly into agentic development environments like Claude Code and Cursor, allowing engineers to resolve production issues without leaving their coding tools.
Key points:
- Loki rearchitected with Kafka to reduce storage overhead and improve query speed.
- Introduction of GCX CLI for seamless observability integration within AI coding agents.
- Launch of Grafana 13 featuring dynamic dashboards and expanded data source support.
- New AI Observability product in public preview for monitoring LLM applications.
STCLab's SRE team shares their experience building an AI-driven investigation pipeline to automate the triage of Kubernetes alerts. Using HolmesGPT, they implemented a ReAct pattern that lets an LLM autonomously select tools like Prometheus, Loki, and kubectl based on each alert's context. The core finding was that high-quality markdown runbooks containing exclusion rules mattered more to successful investigations than the underlying AI model itself.
Key points:
* Implementation of HolmesGPT using the ReAct agent pattern for autonomous troubleshooting.
* Integration with Robusta to manage Slack routing, deduplication, and thread matching.
* The vital role of runbooks in narrowing search spaces and reducing wasted tool calls.
* Comparison between self-hosted models via KubeAI and managed API approaches.
* Significant reduction in manual triage time from 20 minutes to under two minutes per investigation.
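The ReAct loop described above can be sketched as follows. This is an illustrative stand-in, not HolmesGPT's actual implementation: the tool stubs and the `llm_decide` heuristic are hypothetical placeholders for real tool integrations and a real model call.

```python
# Minimal ReAct-style investigation loop (illustrative only; the tool
# registry and decision function are stand-ins, not HolmesGPT internals).

def query_prometheus(q): return f"metric data for {q}"   # stub tool
def query_loki(q): return f"log lines matching {q}"      # stub tool
def kubectl_describe(obj): return f"status of {obj}"     # stub tool

TOOLS = {
    "prometheus": query_prometheus,
    "loki": query_loki,
    "kubectl": kubectl_describe,
}

def llm_decide(alert, observations):
    """Stand-in for the model: pick the next tool, or finish.
    A real agent would send the alert, the runbook, and prior
    observations to an LLM and parse its chosen action."""
    if not observations:
        return ("kubectl", alert["resource"])
    if len(observations) == 1:
        return ("loki", alert["resource"])
    return ("finish", "pod is CrashLooping due to OOM")

def investigate(alert, max_steps=5):
    observations = []
    for _ in range(max_steps):
        action, arg = llm_decide(alert, observations)  # Reason
        if action == "finish":
            return {"alert": alert["name"], "conclusion": arg,
                    "evidence": observations}
        observations.append((action, TOOLS[action](arg)))  # Act, then Observe
    return {"alert": alert["name"], "conclusion": "inconclusive",
            "evidence": observations}

result = investigate({"name": "KubePodCrashLooping", "resource": "pod/api-7f9c"})
```

The runbook's role in the article maps onto `llm_decide`: exclusion rules shrink the set of tools the model even considers, which is why runbook quality dominated model choice.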
As AI agents evolve from autocomplete tools to active contributors (opening PRs, managing infrastructure), DevOps must adapt. This playbook outlines the shift through these key strategic pillars:
* **Foundational Prerequisites:** Robust CI/CD, automated testing, and Infrastructure as Code are essential for agentic workflows.
* **Evolving Engineering Roles:** Engineers transition from code producers to system designers, agent operators, and quality stewards.
* **Structured Collaboration:** Integration across IDEs, PRs, pipelines, and production environments is required.
* **Repository Design:** Repositories must act as explicit interfaces using skill profiles and instruction files.
* **Development Methodology:** Shift from ephemeral prompt engineering to durable, specification-driven development.
* **Governance & Security:** Establish frameworks that keep custom agents consistent and auditable, and turn CI/CD pipelines into active verifiers of semantic intent and security.
* **New Success Metrics:** Move from volume-based productivity counts to outcome-based and trust-boundary measurements.
Prove AI is developing an observability-first foundation designed for production generative AI systems. Their mission is to enable engineering teams to understand, diagnose, and remediate failures within complex AI pipelines, including LLM inference, retrieval processes, and agent orchestration.
The current release, v0.1, provides an opinionated observability pipeline specifically for generative AI workloads through:
- A containerized, OpenTelemetry-based telemetry pipeline.
- Preconfigured collection of traces, metrics, and logs tailored for AI systems.
- Instrumentation patterns for RAG pipelines, embeddings, LLM inference, and agent-based systems.
- Compatibility with standard backends like Prometheus.
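To make the trace shape concrete, here is a hand-rolled sketch of the span tree one RAG request might produce. This is not the Prove AI pipeline or the OpenTelemetry SDK; the attribute names loosely follow OpenTelemetry's GenAI semantic conventions (e.g. `gen_ai.request.model`), and the specific values are invented for illustration.

```python
import time
import uuid

def make_span(name, attributes, parent_id=None):
    """Hand-rolled span record illustrating the trace shape an
    OpenTelemetry-based pipeline would export (not the real SDK)."""
    return {"span_id": uuid.uuid4().hex[:16], "parent_id": parent_id,
            "name": name, "start": time.time(), "attributes": attributes}

# One RAG request becomes a small tree of spans:
root = make_span("rag.request", {"user.query": "refund policy?"})
retrieval = make_span("rag.retrieval",
                      {"db.system": "vector", "retrieval.top_k": 4},
                      parent_id=root["span_id"])
inference = make_span("gen_ai.inference",
                      {"gen_ai.request.model": "gpt-4o",   # convention-style attrs
                       "gen_ai.usage.input_tokens": 812,
                       "gen_ai.usage.output_tokens": 96},
                      parent_id=root["span_id"])

trace = [root, retrieval, inference]
# A collector would batch spans like these and derive metrics
# (latency, token counts) for a backend such as Prometheus.
```

The value of preconfigured collection is exactly this structure: retrieval and inference are separate, correlated spans, so a slow answer can be attributed to the vector store or the model rather than to "the AI" as a whole.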
This article examines the development of Microsoft’s Azure SRE Agent, designed to mitigate operational toil in mission-critical environments. By utilizing an "agentic workflow" of specialized AI agents, Microsoft has integrated automation across the entire software development lifecycle. This human-AI partnership has autonomously resolved over 35,000 incidents and saved more than 50,000 developer hours, accelerating root cause analysis and mitigation while maintaining rigorous governance and human oversight.
This LLMOps tooling roundup focuses on orchestration, observability, and evaluation:
* **PydanticAI:** type-safe outputs for LLMs, supporting multiple models and complex workflows for more reliable software-like behavior.
* **Bifrost:** gateway for multiple models/providers, offering a single API with features like failover, load balancing, and observability.
* **Traceloop / OpenLLMetry:** instruments LLM applications with OpenTelemetry for standardized tracing.
* **Promptfoo:** automated evaluation of prompts and models, with checks that run in CI/CD pipelines.
* **Invariant Guardrails:** runtime rules between applications and LLMs/tools, enforcing constraints without code changes.
* **Letta:** version-controlled memory for agents, tracking interactions like a Git repository for debugging and rollback.
* **OpenPipe:** continuous model improvement through logging, data export, evaluation, and fine-tuning within a single platform.
* **Argilla:** human feedback and data curation for tasks like annotation and error analysis, improving model performance.
* **KitOps:** Packages models, datasets, prompts, and configurations into versioned artifacts for clean deployments and reproducibility.
* **Composio:** authentication, permissions, and execution for agents interacting with hundreds of external applications.
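The idea behind type-safe outputs (the PydanticAI entry above) can be shown with the standard library alone. This is not PydanticAI's API; it is a stdlib-only sketch of the underlying pattern: parse the model's JSON reply and validate it against a declared schema before any downstream code touches it. The `Incident` schema and the sample reply are invented for illustration.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Incident:
    service: str
    severity: str
    summary: str

def parse_llm_output(raw: str, schema=Incident):
    """Validate a model's JSON reply against a dataclass schema --
    the core idea behind type-safe LLM outputs (stdlib-only sketch)."""
    data = json.loads(raw)
    expected = {f.name for f in fields(schema)}
    missing = expected - data.keys()
    if missing:
        raise ValueError(f"model output missing fields: {sorted(missing)}")
    # Drop any extra keys the model hallucinated, keep only schema fields.
    return schema(**{k: data[k] for k in expected})

reply = '{"service": "checkout", "severity": "high", "summary": "p99 latency spike"}'
incident = parse_llm_output(reply)
```

Libraries in this space add retries, coercion, and multi-model support on top, but the contract is the same: malformed output fails loudly at the boundary instead of propagating into application logic.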
This article details how HPE is addressing operational fatigue and burnout in IT teams through the introduction of agentic AI operations. HPE's new system utilizes skills-based AI agents that work alongside human operators to reduce alert noise, improve response times, and cut root cause analysis time by at least half, according to early adopters.
The focus is on augmenting human capabilities rather than replacing them, with a strong emphasis on auditability, transparency, and human oversight in AI-driven actions. The system aims to break down data silos and provide proactive insights to prevent issues before they escalate.
This article discusses how AI is changing infrastructure as code (IaC) and the challenges it presents. Spacelift's co-founder, Marcin Wyszynski, explains that while AI tools can democratize infrastructure provisioning, a lack of understanding of the generated code poses risks. He draws a parallel to learning a foreign language: AI can produce the code, but teams need to comprehend it to avoid potentially disastrous infrastructure changes.
Spacelift's solution, Intent, focuses on deterministic guardrails and integration with tools like Open Policy Agent to ensure safe and controlled AI-driven infrastructure management. The core challenge is balancing speed and control in a rapidly evolving landscape.
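Intent's internals are not public, but the "deterministic guardrail" idea can be sketched as a policy check over a Terraform plan, the kind of rule Open Policy Agent policies typically encode. This is an assumption-laden sketch, not Spacelift's implementation; the field names follow the `terraform show -json` plan format, and the sample plan is invented.

```python
# Deterministic guardrail sketch: flag any plan that deletes resources.
# Field names follow Terraform's JSON plan output (resource_changes,
# change.actions); the policy and sample plan are illustrative.

FORBIDDEN_ACTIONS = {"delete"}

def check_plan(plan: dict) -> list[str]:
    """Return the addresses of resources whose planned actions
    include a forbidden (destructive) action."""
    violations = []
    for rc in plan.get("resource_changes", []):
        actions = set(rc["change"]["actions"])
        if actions & FORBIDDEN_ACTIONS:
            violations.append(rc["address"])
    return violations

plan = {"resource_changes": [
    {"address": "aws_db_instance.main",
     "change": {"actions": ["delete"]}},      # destructive: flagged
    {"address": "aws_s3_bucket.assets",
     "change": {"actions": ["update"]}},      # in-place change: allowed
]}

blocked = check_plan(plan)
```

Because the check runs on the plan rather than on the AI's prose, it holds regardless of how the code was generated, which is the point of putting deterministic gates around nondeterministic authors.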
Three vendors, Cohesity, ServiceNow, and Datadog, have partnered to create a recoverability service designed to address the risks of agentic AI operations (AIOps). The service aims to restore systems to a "trusted state" by identifying and recovering files and data corrupted by AI errors or malicious attacks.
The companies anticipate increased adoption of agentic AI for system operation but recognize the potential for errors and vulnerabilities. Their solution focuses on preserving immutable snapshots of AI environments, enabling point-in-time recovery of agents, data, and infrastructure components, including vector stores and agent memory.
ServiceNow and Datadog provide control and observability platforms to detect anomalies, triggering API-driven restorations when problems are identified. This offering competes with Rubrik's similar tool and native rollback capabilities from vendors like Cisco. Gartner predicts a significant increase in the integration of task-specific agents in enterprise applications, while Forrester emphasizes the need for guardrails and strong oversight in agentic AI development.
An account of how a developer, Alexey Grigorev, accidentally deleted 2.5 years of data from his AI Shipping Labs and DataTalks.Club websites using Claude Code and Terraform. Grigorev intended to migrate his website to AWS, but a missing state file and subsequent actions by Claude Code led to a complete wipe of the production setup, including the database and snapshots. The data was ultimately restored with help from Amazon Business support. The article highlights the importance of backups, careful permissions management, and manual review of potentially destructive actions performed by AI agents.