STCLab's SRE team describes how it built an AI-driven investigation pipeline to automate the triage of Kubernetes alerts. Using HolmesGPT, the team implemented a ReAct pattern in which the LLM autonomously selects tools such as Prometheus, Loki, and kubectl based on the context of each alert. The core finding was that high-quality markdown runbooks containing exclusion rules mattered more to successful investigations than the underlying AI model.
Key points:
* Implementation of HolmesGPT using the ReAct agent pattern for autonomous troubleshooting.
* Integration with Robusta to manage Slack routing, deduplication, and thread matching.
* The vital role of runbooks in narrowing search spaces and reducing wasted tool calls.
* Comparison between self-hosted models via KubeAI and managed API approaches.
* Significant reduction in manual triage time, from 20 minutes to under 2 minutes per investigation.
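
To make the ReAct idea and the effect of runbook exclusion rules more concrete, here is a minimal Python sketch. It is not HolmesGPT's or Robusta's actual API: the tool functions, the `Runbook` class, and `choose_next_tool` are hypothetical stand-ins, and the LLM's reasoning step is replaced by a trivial rule so the example runs on its own. It also assumes that an exclusion rule can be modeled as "skip this tool", which is a simplification of whatever the team's runbooks actually contain.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


# Hypothetical stand-ins for real integrations; they return canned text.
def query_prometheus(alert: str) -> str:
    return f"prometheus: memory usage for {alert} is near its limit"


def query_loki(alert: str) -> str:
    return f"loki: OOMKilled messages found for {alert}"


def run_kubectl(alert: str) -> str:
    return f"kubectl: pod for {alert} restarted 4 times"


TOOLS: Dict[str, Callable[[str], str]] = {
    "prometheus": query_prometheus,
    "loki": query_loki,
    "kubectl": run_kubectl,
}


@dataclass
class Runbook:
    """Hypothetical parsed form of a markdown runbook for one alert type."""
    alert_name: str
    excluded_tools: Set[str] = field(default_factory=set)


def choose_next_tool(observations: List[str], allowed: List[str]) -> Optional[str]:
    """Stub for the LLM's reasoning step: pick the first allowed, unused tool."""
    used = {obs.split(":", 1)[0] for obs in observations}
    for name in allowed:
        if name not in used:
            return name
    return None  # nothing left to check, so the investigation ends


def investigate(alert: str, runbook: Runbook, max_steps: int = 5) -> List[str]:
    """Run a reason -> act -> observe loop, bounded by max_steps."""
    # Exclusion rules shrink the search space before any tool call is made.
    allowed = [name for name in TOOLS if name not in runbook.excluded_tools]
    observations: List[str] = []
    for _ in range(max_steps):
        tool = choose_next_tool(observations, allowed)  # reason
        if tool is None:
            break
        observations.append(TOOLS[tool](alert))  # act, then record the observation
    return observations


if __name__ == "__main__":
    runbook = Runbook(alert_name="KubePodCrashLooping", excluded_tools={"loki"})
    for line in investigate("KubePodCrashLooping", runbook):
        print(line)
```

In this sketch the runbook removes one tool from consideration, so the loop makes two tool calls instead of three. In a real pipeline the choice of the next tool would come from the model, and the runbook text would be part of the prompt rather than a hard filter.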