AWS has announced the general availability of its DevOps Agent, a generative AI assistant designed to automate incident investigation and operational tasks. Built on Amazon Bedrock AgentCore, the tool integrates with observability platforms, code repositories, and CI/CD pipelines to autonomously triage issues and correlate telemetry data. New capabilities include support for investigating applications in Azure and on-premises environments, custom agent skills, and personalized reporting.
Key highlights:
* Autonomous incident investigation triggered by webhooks from sources like CloudWatch or PagerDuty.
* Integration with major tools including Datadog, Grafana, Splunk, GitHub, and GitLab.
* Reported performance improvements of up to 75% lower mean time to resolution (MTTR) during the preview period.
* Per-second pricing based on the cumulative time the agent spends on operational tasks.
Airbnb's observability engineering team has transitioned from a legacy StatsD and proprietary Veneur-based aggregation pipeline to a modern, open-source stack utilizing OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The new system handles over 100 million samples per second in production while reducing costs by roughly an order of magnitude.
Key technical highlights include:
* Migration strategy using dual-emitting metrics to bridge legacy StatsD libraries with OTLP adoption.
* Performance improvements, including a reduction in JVM CPU time spent on metrics processing from 10% to under 1%.
* Use of vmagent for streaming aggregation and horizontal sharding to manage high-cardinality data.
* Implementation of a zero-injection technique within the vmagent tier to solve Prometheus counter-reset edge cases.
* A two-layer architecture consisting of stateless router pods and stateful aggregator pods.
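The counter-reset problem behind the zero-injection technique is easy to see in miniature: when a new series appears (for example, after a pod restart), rate calculations have no earlier sample to diff against, so the series' initial increase is lost. A minimal sketch, assuming a simplified sample format and a one-second back-dating interval that are illustrative inventions, not Airbnb's actual implementation:

```python
# Illustrative zero-injection for Prometheus-style counters: the first time a
# series is seen, emit a synthetic zero sample just before its first real
# sample so downstream rate() calculations count the initial increase.
# The (series, timestamp, value) tuple format is an assumption for this sketch.

def inject_zeros(samples, seen):
    out = []
    for series, ts, value in samples:
        if series not in seen:
            seen.add(series)
            out.append((series, ts - 1, 0.0))  # back-dated synthetic zero
        out.append((series, ts, value))
    return out

seen = set()
batch = [("http_requests_total{pod='a'}", 100, 7.0),
         ("http_requests_total{pod='a'}", 160, 12.0)]
result = inject_zeros(batch, seen)
# The series' first sample is now preceded by a zero at t=99, so a rate
# calculation sees the full increase of 7 rather than starting mid-flight.
```

In a real pipeline this logic would live in the stateful aggregator tier, where per-series state can be kept across batches.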
This article examines the development of Microsoft’s Azure SRE Agent, designed to mitigate operational toil in mission-critical environments. By utilizing an "agentic workflow" of specialized AI agents, Microsoft has integrated automation across the entire software development lifecycle. This human-AI partnership has autonomously resolved over 35,000 incidents and saved more than 50,000 developer hours, accelerating root cause analysis and mitigation while maintaining rigorous governance and human oversight.
This article details how HPE is addressing operational fatigue and burnout in IT teams through the introduction of agentic AI operations. HPE's new system utilizes skills-based AI agents that work alongside human operators to reduce alert noise, improve response times, and cut root cause analysis time by at least half, according to early adopters.
The focus is on augmenting human capabilities rather than replacing them, with a strong emphasis on auditability, transparency, and human oversight in AI-driven actions. The system aims to break down data silos and provide proactive insights to prevent issues before they escalate.
This article discusses how AI is changing infrastructure as code (IaC) and the challenges it presents. Spacelift co-founder Marcin Wyszynski argues that while AI tools can democratize infrastructure provisioning, teams that do not understand the generated code take on real risk. He draws a parallel to learning a foreign language: AI can produce the code, but teams need to comprehend it to avoid potentially disastrous infrastructure changes.
Spacelift's solution, Intent, focuses on deterministic guardrails and integration with tools like Open Policy Agent to ensure safe and controlled AI-driven infrastructure management. The core challenge is balancing speed and control in a rapidly evolving landscape.
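The idea of a deterministic guardrail can be sketched in a few lines. Everything below is a hypothetical stand-in, not Spacelift's or OPA's actual schema: real setups would evaluate Rego policies against a Terraform plan JSON, but the shape of the check is the same.

```python
# Hypothetical guardrail: reject any AI-generated plan that deletes resources
# or touches production without explicit approval. The plan schema here is
# invented for illustration.

def check_plan(plan, approved=False):
    violations = []
    for change in plan.get("resource_changes", []):
        if "delete" in change.get("actions", []):
            violations.append(f"destructive change to {change['address']}")
        if change.get("environment") == "production" and not approved:
            violations.append(f"unapproved production change to {change['address']}")
    return violations

plan = {"resource_changes": [
    {"address": "aws_s3_bucket.logs", "actions": ["delete"], "environment": "staging"},
    {"address": "aws_rds_instance.main", "actions": ["update"], "environment": "production"},
]}
issues = check_plan(plan)
# Both changes are flagged: one destructive, one an unapproved production update.
```

The point of determinism is that the same plan always produces the same verdict, regardless of what the AI that generated the plan "intended".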
The Model Context Protocol (MCP) is becoming a key component in the agentic AI space, enabling models to interact with external tools and data. The project's 2026 roadmap focuses on addressing challenges for production deployment. Key priorities include improving scalability by evolving the transport and session model, clarifying agent communication and task lifecycle management, maturing governance structures for wider community contribution, and preparing for enterprise requirements like audit trails and authentication. The roadmap also highlights ongoing exploration of areas like event-driven updates and security.
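At the wire level, MCP messages are JSON-RPC 2.0; the `tools/call` method and its params shape follow the MCP specification, while the tool name and arguments below are invented for illustration:

```python
import json

# Sketch of an MCP tool-invocation request. MCP rides on JSON-RPC 2.0;
# "tools/call" is the spec's method for invoking a tool. The tool name
# "query_metrics" and its arguments are hypothetical.

def tools_call_request(request_id, tool_name, arguments):
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

msg = tools_call_request(1, "query_metrics", {"service": "checkout", "window": "5m"})
```

The roadmap items above (transport evolution, session model, task lifecycle) all concern what happens around messages like this one, not the message format itself.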
The New Stack encourages its readers to contribute to Towards Data Science, a leading platform for data science and AI. Recognizing the increasing convergence of cloud infrastructure, DevOps, and AI engineering, the article invites practitioners to share their experiences with building and deploying AI systems. Successful TDS submissions are technically detailed, timely, and specific. Authors can also benefit from editorial support, promotion, and potential payment opportunities, while building their reputation within the AI community.
Agentic workflows are rapidly increasing the volume of pull requests, and validation is quickly becoming the most critical bottleneck. Teams already running service meshes like Istio are well-positioned to solve it with ephemeral preview environments.
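One common mesh pattern for this: route requests carrying a PR identifier header to an ephemeral per-PR deployment via an Istio VirtualService, while all other traffic stays on the stable service. The sketch below generates such a manifest as a plain dict; the `x-pr-id` header name and the hostnames are assumptions for illustration, not a standard convention:

```python
# Build an Istio VirtualService (as a plain dict, ready to serialize to YAML)
# that sends requests bearing a hypothetical "x-pr-id" header to an ephemeral
# per-PR deployment, and everything else to the stable service.

def preview_virtual_service(service, pr_id):
    return {
        "apiVersion": "networking.istio.io/v1",
        "kind": "VirtualService",
        "metadata": {"name": f"{service}-preview-{pr_id}"},
        "spec": {
            "hosts": [service],
            "http": [
                {   # PR traffic, matched by header, goes to the preview pods
                    "match": [{"headers": {"x-pr-id": {"exact": str(pr_id)}}}],
                    "route": [{"destination": {"host": f"{service}-pr-{pr_id}"}}],
                },
                {   # default route: everything else hits the stable service
                    "route": [{"destination": {"host": service}}],
                },
            ],
        },
    }

vs = preview_virtual_service("checkout", 1234)
```

Because the routing rule is data, a CI job can stamp one out per pull request and delete it when the branch merges.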
The article details Amazon Web Services' (AWS) recent decision to lay off roughly 40% of its DevOps workforce, specifically staff involved in managing and maintaining its own internal infrastructure. It frames the move not as downsizing or a retreat from DevOps, but as a strategic shift toward fully embracing a "platform engineering" approach, backed by automation and a new internal tool called 'Dahlia'. The piece examines the reasons behind the layoffs, the tools enabling the transition, and the potential impact on the future of DevOps.
* **Shift to Platform Engineering:** AWS is building internal "developer platforms" – self-service tools and standardized components – to empower application development teams to manage their own infrastructure and deployments with less reliance on centralized DevOps teams.
* **Key Tools Driving the Change:** The article highlights three main tools enabling this transition:
* **Pulumi:** An Infrastructure-as-Code (IaC) tool allowing developers to define infrastructure using familiar programming languages (Python, JavaScript, Go, etc.).
* **Crossplane:** An open-source Kubernetes add-on that extends Kubernetes to manage infrastructure across multiple cloud providers.
* **Backstage:** A developer portal created by Spotify, now open-source, that provides a centralized interface for developers to discover, create, and manage software components and infrastructure.
* **Impact of the Layoffs:** The layoffs were concentrated in teams traditionally responsible for manual infrastructure provisioning and maintenance. The remaining DevOps staff are being re-focused on building and maintaining the internal developer platforms.
* **Wider Industry Trend:** This move by AWS reflects a broader trend in the industry towards platform engineering, driven by the need for faster innovation, increased developer productivity, and reduced operational overhead.
In essence, AWS is automating away much of the traditional DevOps work, allowing developers to self-serve their infrastructure needs through these platform tools. This is a strategic move to scale its internal development efforts and accelerate innovation.
A recent article by Google Cloud SREs describes how they use the AI-powered Gemini CLI internally to resolve real-world outages. By integrating intelligent reasoning directly into terminal-based operational tooling, the approach improves reliability in critical infrastructure operations and shortens incident response times.