At GrafanaCON 2026, Grafana Labs announced significant updates including the launch of Grafana 13 and a major architectural overhaul for Loki. The new Loki design moves away from replication-at-ingestion toward using Kafka as a durability layer to reduce data duplication and improve query performance. Additionally, the company introduced GCX, a new CLI tool in public preview designed to integrate observability data directly into agentic development environments like Claude Code and Cursor, allowing engineers to resolve production issues without leaving their coding tools.
Key points:
- Loki rearchitected with Kafka to reduce storage overhead and improve query speed.
- Introduction of GCX CLI for seamless observability integration within AI coding agents.
- Launch of Grafana 13 featuring dynamic dashboards and expanded data source support.
- New AI Observability product in public preview for monitoring LLM applications.
STCLab's SRE team shares their experience building an AI-driven investigation pipeline to automate the triage of Kubernetes alerts. Using HolmesGPT, they implemented a ReAct pattern that lets an LLM autonomously select tools such as Prometheus, Loki, and kubectl based on the alert's context. Their core finding: high-quality markdown runbooks containing exclusion rules mattered more to successful investigations than the choice of underlying AI model.
Key points:
* Implementation of HolmesGPT using the ReAct agent pattern for autonomous troubleshooting.
* Integration with Robusta to manage Slack routing, deduplication, and thread matching.
* The vital role of runbooks in narrowing search spaces and reducing wasted tool calls.
* Comparison between self-hosted models via KubeAI and managed API approaches.
* Significant reduction in manual triage time from 20 minutes to under two minutes per investigation.
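The think-act-observe loop behind the ReAct pattern described above can be sketched in a few lines. This is a minimal, scripted stand-in: the tool names and queries are hypothetical, and in a real HolmesGPT deployment the (thought, tool, query) triples come from the model, constrained by the runbook's exclusion rules.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical tool registry; real toolsets call live backends.
TOOLS: dict[str, Callable[[str], str]] = {
    "prometheus_query": lambda q: f"metric result for {q}",
    "loki_logs": lambda q: f"log lines matching {q}",
    "kubectl_describe": lambda q: f"pod description for {q}",
}

@dataclass
class Step:
    thought: str
    tool: str
    observation: str

def react_investigate(alert: str, plan: list[tuple[str, str, str]]) -> list[Step]:
    """Run a scripted stand-in for the agent's think -> act -> observe loop.

    `plan` is hard-coded here to show the control flow only; an actual
    agent generates each step from the prior observations.
    """
    steps = []
    for thought, tool, query in plan:
        observation = TOOLS[tool](query)  # "act", then capture the "observation"
        steps.append(Step(thought, tool, observation))
    return steps

steps = react_investigate(
    "KubePodCrashLooping",
    [("Check restart count", "kubectl_describe", "payments-7f9c"),
     ("Look for OOM kills", "loki_logs", '{pod="payments-7f9c"} |= "OOMKilled"')],
)
```

A runbook's exclusion rules would prune candidate steps before execution, which is what cuts the wasted tool calls the article highlights.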
As AI agents evolve from autocomplete tools to active contributors (opening PRs, managing infrastructure), DevOps must adapt. This playbook outlines the shift through these key strategic pillars:
* **Foundational Prerequisites:** Robust CI/CD, automated testing, and Infrastructure as Code are essential for agentic workflows.
* **Evolving Engineering Roles:** Engineers transition from code producers to system designers, agent operators, and quality stewards.
* **Structured Collaboration:** Integration across IDEs, PRs, pipelines, and production environments is required.
* **Repository Design:** Repositories must act as explicit interfaces using skill profiles and instruction files.
* **Development Methodology:** Shift from ephemeral prompt engineering to durable, specification-driven development.
* **Governance & Security:** Establish frameworks that keep custom agents consistent and auditable, and turn CI/CD pipelines into active verifiers of semantic intent and security.
* **New Success Metrics:** Move from volume-based productivity counts to outcome-based and trust-boundary measurements.
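The "Repository Design" pillar above can be made concrete with an instruction file checked into the repo root. The filename, sections, and paths below are illustrative; conventions vary by agent (e.g. AGENTS.md for some tools, CLAUDE.md for Claude Code).

```markdown
# AGENTS.md (illustrative)

## Skill profile
- Language: Go 1.22; services under /cmd, shared libraries under /internal
- Tests: `make test` must pass before any PR is opened

## Constraints
- Never edit generated files under /gen
- Schema changes require a matching migration in /migrations

## Escalation
- Tag @platform-oncall on any change touching /internal/auth
```

The point is that the repository itself becomes the interface: agents read these files the way a new hire reads onboarding docs, so the rules survive across sessions instead of living in ephemeral prompts.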
Prove AI is developing an observability-first foundation designed for production generative AI systems. Their mission is to enable engineering teams to understand, diagnose, and remediate failures within complex AI pipelines, including LLM inference, retrieval processes, and agent orchestration.
The current release, v0.1, provides an opinionated observability pipeline specifically for generative AI workloads through:
- A containerized, OpenTelemetry-based telemetry pipeline.
- Preconfigured collection of traces, metrics, and logs tailored for AI systems.
- Instrumentation patterns for RAG pipelines, embeddings, LLM inference, and agent-based systems.
- Compatibility with standard backends like Prometheus.
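A containerized, OpenTelemetry-based pipeline like the one described typically centers on a Collector config. The fragment below is a generic sketch of that shape, not Prove AI's actual v0.1 configuration: OTLP in, batched, metrics scraped by Prometheus, traces to a debug exporter.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheus:        # exposes a scrape endpoint for a standard Prometheus backend
    endpoint: 0.0.0.0:8889
  debug: {}          # stand-in; a real deployment would export traces to a tracing backend

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```

The AI-specific value then comes from the instrumentation layer above this: spans for retrieval, embedding, and inference steps flowing in over OTLP with consistent attributes.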
This article details how HPE is addressing operational fatigue and burnout in IT teams through the introduction of agentic AI operations. HPE's new system utilizes skills-based AI agents that work alongside human operators to reduce alert noise, improve response times, and cut root cause analysis time by at least half, according to early adopters.
The focus is on augmenting human capabilities rather than replacing them, with a strong emphasis on auditability, transparency, and human oversight in AI-driven actions. The system aims to break down data silos and provide proactive insights to prevent issues before they escalate.
This article discusses how AI is changing infrastructure as code (IaC) and the challenges it presents. Spacelift's co-founder, Marcin Wyszynski, explains that while AI tools can democratize infrastructure provisioning, teams that don't understand the generated code are taking on real risk. He draws a parallel to learning a foreign language: AI can produce the code, but teams need to comprehend it to avoid potentially disastrous infrastructure changes.
Spacelift's solution, Intent, focuses on deterministic guardrails and integration with tools like Open Policy Agent to ensure safe and controlled AI-driven infrastructure management. The core challenge is balancing speed and control in a rapidly evolving landscape.
Three vendors – Cohesity, ServiceNow, and Datadog – have partnered to create a recoverability service designed to address the risks of agentic AI in IT operations (AIOps). The service aims to restore systems to a "trusted state" by identifying and recovering files and data corrupted by AI errors or malicious attacks.
The companies anticipate increased adoption of agentic AI for system operation but recognize the potential for errors and vulnerabilities. Their solution focuses on preserving immutable snapshots of AI environments, enabling point-in-time recovery of agents, data, and infrastructure components, including vector stores and agent memory.
ServiceNow and Datadog provide control and observability platforms to detect anomalies, triggering API-driven restorations when problems are identified. This offering competes with Rubrik's similar tool and native rollback capabilities from vendors like Cisco. Gartner predicts a significant increase in the integration of task-specific agents in enterprise applications, while Forrester emphasizes the need for guardrails and strong oversight in agentic AI development.
This tutorial explores how to use LLM embeddings as features in time series forecasting models. It covers generating embeddings from time series descriptions, preparing data, and evaluating the performance of models with and without LLM embeddings.
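The feature-construction step the tutorial describes can be sketched as follows. Everything here is illustrative: the toy hash-based "embedding" stands in for a real LLM embedding call, and the function names are hypothetical.

```python
def toy_embed(text: str, dim: int = 4) -> list[float]:
    # Stand-in for an LLM embedding endpoint: a deterministic-shape,
    # fixed-dimension vector derived from the series description.
    return [((hash((text, i)) % 1000) / 1000.0) for i in range(dim)]

def build_features(series: list[float], description: str, n_lags: int = 3):
    """One feature row per forecastable point: [lag_1 .. lag_n] + embedding.

    The embedding is constant per series; it lets a single model learn
    across many series by conditioning on what each series describes.
    """
    emb = toy_embed(description)
    rows, targets = [], []
    for t in range(n_lags, len(series)):
        rows.append(series[t - n_lags:t] + emb)
        targets.append(series[t])
    return rows, targets

X, y = build_features([10, 12, 13, 15, 16, 18], "daily active users, SaaS app")
```

Evaluating with and without the embedding columns (on the same train/test split) is then what shows whether the text description carries forecasting signal.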
This article details how Google SREs are leveraging Gemini 3 and Gemini CLI to accelerate incident response, root cause analysis, and postmortem creation, ultimately reducing Mean Time To Mitigation (MTTM) and improving system reliability.
> When deployed strategically, agents can empower SREs to offload low-risk, toilsome tasks so they can focus on the most critical matters.
Agents in practice include:
* **Contextual Information:** Providing SREs with details from previously resolved incidents involving the same service, including responder notes.
* **Root Cause Analysis:** Suggesting potential origins of an issue and identifying recent configuration changes that might be responsible.
* **Automated Remediation:** Handling low-risk, well-defined issues without human intervention, with SRE review of after-action reports.
* **Diagnostic Suggestions:** Nudging SREs towards running specific diagnostics for partially understood incidents and supplying them automatically.
* **Runbook Generation:** Automatically creating and updating runbooks based on successful remediation steps, preventing recurring issues.