SemanticScuttle - klotz.me » klotz: holmesgpt

Auto-diagnosing Kubernetes alerts with HolmesGPT and CNCF tools

STCLab's SRE team shares their experience building an AI-driven investigation pipeline that automates the triage of Kubernetes alerts. Using HolmesGPT, they implemented a ReAct pattern in which the LLM autonomously selects tools such as Prometheus, Loki, and kubectl based on the context of each alert. The core finding was that high-quality markdown runbooks containing exclusion rules mattered more for successful investigations than the underlying AI model.
Key points:
* Implementation of HolmesGPT using the ReAct agent pattern for autonomous troubleshooting.
* Integration with Robusta to manage Slack routing, deduplication, and thread matching.
* The vital role of runbooks in narrowing search spaces and reducing wasted tool calls.
* Comparison between self-hosted models via KubeAI and managed API approaches.
* Significant reduction in manual triage time from 20 minutes to under two minutes per investigation.
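
The ReAct loop and the runbook's narrowing effect can be illustrated with a minimal sketch. This is not HolmesGPT's API or STCLab's implementation: the tool functions, the `RUNBOOK_EXCLUSIONS` table, the alert name, and the `choose_action` stub (standing in for the LLM's reasoning step) are all hypothetical, and the tools return canned strings rather than querying real systems.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for real tools; none of these contact
# Prometheus, Loki, or a cluster. They return canned observations.
def query_prometheus(alert: str) -> str:
    return "memory working set close to the container limit"

def query_loki(alert: str) -> str:
    return "log line: OOMKilled"

def kubectl_describe(alert: str) -> str:
    return "Last State: Terminated (OOMKilled)"

TOOLS: Dict[str, Callable[[str], str]] = {
    "prometheus": query_prometheus,
    "loki": query_loki,
    "kubectl": kubectl_describe,
}

# Hypothetical runbook data: for a given alert, tools that are
# known to be unhelpful and should not be called.
RUNBOOK_EXCLUSIONS: Dict[str, List[str]] = {
    "KubePodCrashLooping": ["prometheus"],
}

def choose_action(allowed: List[str],
                  observations: List[Tuple[str, str]]) -> str:
    """Stub for the LLM 'reason' step: pick the next allowed tool
    that has not been used yet, or 'finish' when none remain."""
    used = {tool for tool, _ in observations}
    for tool in allowed:
        if tool not in used:
            return tool
    return "finish"

def investigate(alert: str, max_steps: int = 5) -> List[Tuple[str, str]]:
    """Reason -> act -> observe loop, bounded by max_steps."""
    excluded = RUNBOOK_EXCLUSIONS.get(alert, [])
    # The runbook narrows the search space before any tool is called.
    allowed = [name for name in TOOLS if name not in excluded]
    observations: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action = choose_action(allowed, observations)
        if action == "finish":
            break
        observations.append((action, TOOLS[action](alert)))
    return observations

if __name__ == "__main__":
    for tool, result in investigate("KubePodCrashLooping"):
        print(f"{tool}: {result}")
```

In this sketch the exclusion rule removes one tool call for the alert that has a runbook entry, while an alert without an entry falls back to trying every tool. In a real agent the stubbed `choose_action` would be a model call that sees the alert, the runbook text, and the observations gathered so far.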

2026-04-24 Tags: kubernetes, holmesgpt, sre, production engineering, observability, cncf, prometheus, robusta, llm by klotz
