STCLab's SRE team shares its experience building an AI-driven investigation pipeline to automate the triage of Kubernetes alerts. Using HolmesGPT, the team implemented a ReAct pattern that lets the LLM autonomously select tools such as Prometheus, Loki, and kubectl based on the context of each alert. The core finding was that high-quality markdown runbooks containing exclusion rules mattered more for successful investigations than the underlying AI model itself.
Key points:
* Implementation of HolmesGPT using the ReAct agent pattern for autonomous troubleshooting.
* Integration with Robusta to manage Slack routing, deduplication, and thread matching.
* The vital role of runbooks in narrowing search spaces and reducing wasted tool calls.
* Comparison between self-hosted models via KubeAI and managed API approaches.
* Significant reduction in manual triage time from 20 minutes to under two minutes per investigation.
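The interplay between a ReAct loop and runbook exclusion rules can be illustrated with a minimal sketch. This is not HolmesGPT's API or STCLab's implementation: the tool functions, the `RUNBOOK_EXCLUSIONS` table, the alert name, and the `choose_action` stub (which stands in for the LLM's reasoning step) are all hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-ins for real tool integrations (Prometheus, Loki, kubectl).
def query_prometheus(target: str) -> str:
    return f"prometheus: memory usage for {target} is near its limit"

def query_loki(target: str) -> str:
    return f"loki: no error logs found for {target}"

def kubectl_describe(target: str) -> str:
    return f"kubectl: last state of {target} was OOMKilled"

TOOLS: Dict[str, Callable[[str], str]] = {
    "prometheus": query_prometheus,
    "loki": query_loki,
    "kubectl": kubectl_describe,
}

# Hypothetical runbook data: tools that are known to be unhelpful for an alert.
RUNBOOK_EXCLUSIONS: Dict[str, Set[str]] = {
    "KubePodOOMKilled": {"loki"},
}

def choose_action(observations: List[Tuple[str, str]], allowed: List[str]) -> str:
    """Stub for the LLM reasoning step: pick the next unused tool, else finish."""
    used = {name for name, _ in observations}
    for name in allowed:
        if name not in used:
            return name
    return "finish"

def investigate(alert_name: str, target: str, max_steps: int = 5) -> List[Tuple[str, str]]:
    """Run a reason/act/observe loop, skipping tools the runbook excludes."""
    excluded = RUNBOOK_EXCLUSIONS.get(alert_name, set())
    allowed = [name for name in TOOLS if name not in excluded]
    observations: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action = choose_action(observations, allowed)  # reason
        if action == "finish":
            break
        result = TOOLS[action](target)                 # act
        observations.append((action, result))          # observe
    return observations

if __name__ == "__main__":
    for tool, output in investigate("KubePodOOMKilled", "pod/api-0"):
        print(tool, "->", output)
```

In this sketch the exclusion rule removes one tool call from the investigation before the loop starts, which is the sense in which a runbook narrows the search space. A real agent would replace `choose_action` with a model call that sees the prior observations.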
A malicious release of litellm version 1.82.8 was published to PyPI on March 24, 2026.
The package contains a hidden .pth file that executes on every Python interpreter startup, spawning a subprocess that triggers the same .pth again, creating an exponential fork bomb.
The malware harvests credentials (SSH keys, cloud provider tokens, Kubernetes configs, environment variables, etc.), encrypts them with a hard-coded RSA key, and exfiltrates them to a malicious domain.
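The startup hook works because Python's `site` module reads `.pth` files in site-packages directories and executes any line that begins with `import` followed by a space or tab. A rough way to audit an environment for such lines is sketched below. The function name `find_executable_pth_lines` is hypothetical, and a match is not proof of compromise, since some legitimate packages also ship executable `.pth` lines. This is not a detector for this specific malware.

```python
import os
import site
from typing import List, Tuple

def find_executable_pth_lines(directory: str) -> List[Tuple[str, int, str]]:
    """Return (path, line number, text) for .pth lines the site module would execute."""
    findings: List[Tuple[str, int, str]] = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".pth"):
            continue
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                # site.py executes lines starting with "import" plus a space or tab;
                # all other lines are treated as paths to add to sys.path.
                if line.startswith(("import ", "import\t")):
                    findings.append((path, lineno, line.rstrip()))
    return findings

if __name__ == "__main__":
    for directory in site.getsitepackages():
        if os.path.isdir(directory):
            for path, lineno, text in find_executable_pth_lines(directory):
                print(f"{path}:{lineno}: {text[:120]}")
```

Each reported line should be reviewed by hand. A long one-liner that spawns a subprocess or decodes an embedded payload deserves far more suspicion than a short import of a known package.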
OpenShell is a safe, private runtime environment designed for autonomous AI agents. It provides sandboxed execution with declarative YAML policies to control file access, data exfiltration, and network activity. Built with an agent-first approach, OpenShell offers pre-built skills for tasks like cluster debugging and policy generation.
Currently in alpha, it focuses on single-player mode and aims to expand to multi-tenant enterprise deployments. OpenShell uses a containerized K3s Kubernetes cluster for isolation and enforces security across filesystem, network, process, and inference layers. It supports agents like Claude, OpenCode, and Copilot, managing credentials securely.
The Model Context Protocol (MCP) is becoming a key component in the agentic AI space, enabling models to interact with external tools and data. The project's 2026 roadmap focuses on addressing challenges for production deployment. Key priorities include improving scalability by evolving the transport and session model, clarifying agent communication and task lifecycle management, maturing governance structures for wider community contribution, and preparing for enterprise requirements like audit trails and authentication. The roadmap also highlights ongoing exploration of areas like event-driven updates and security.
Agentic workflows are rapidly accelerating the volume of pull requests, and validation is quickly becoming the most critical bottleneck. Teams using service meshes like Istio are well-positioned to solve it in ephemeral environments.
This article discusses the author's experience setting up reverse proxies for self-hosted services, finding the process surprisingly straightforward despite extensive and often overwhelming documentation. It compares several popular options like Nginx, Traefik, Caddy, Envoy, SWAG, and HAProxy, ultimately recommending Caddy for its simplicity and features. It also touches on the relative ease of reverse proxy setup compared to configuring the services they front.
This article explores the emerging category of AI-powered operations agents, comparing AI DevOps engineers and AI SRE agents, how cloud providers are responding, and what engineers should consider when evaluating these tools.
Over the last year, the Model Context Protocol (MCP) rose to popularity faster than all but a few other standards or technologies. This article traces that unlikely rise and MCP's path to becoming a generally accepted standard for AI connectivity.
This article describes an effort to create a fully functional Kubernetes cluster with 1 million active nodes. It details the challenges and solutions for scaling Kubernetes to this size, covering networking, state management (etcd), and the scheduler.
Plural is bringing AI into the DevOps lifecycle with a new release that leverages a unified GitOps platform as a RAG engine. This provides AI-powered troubleshooting, natural language infrastructure querying, autonomous upgrade assistance, and agentic workflows for infrastructure modification, all with enterprise-grade guardrails.