AWS has announced the general availability of its DevOps Agent, a generative AI assistant designed to automate incident investigation and operational tasks. Built on Amazon Bedrock AgentCore, the tool integrates with observability platforms, code repositories, and CI/CD pipelines to autonomously triage issues and correlate telemetry data. New capabilities include support for investigating applications in Azure and on-premises environments, custom agent skills, and personalized reporting.
Key highlights:
* Autonomous incident investigation triggered by webhooks from sources like CloudWatch or PagerDuty.
* Integration with major tools including Datadog, Grafana, Splunk, GitHub, and GitLab.
* Reported performance improvements of up to 75% lower mean time to resolution (MTTR) during the preview period.
* Per-second pricing based on the cumulative time the agent spends on operational tasks.
Airbnb's observability engineering team has transitioned from a legacy StatsD and proprietary Veneur-based aggregation pipeline to a modern, open-source stack utilizing OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The new system handles over 100 million samples per second in production while reducing costs by roughly an order of magnitude.
Key technical highlights include:
* Migration strategy using dual-emitting metrics to bridge legacy StatsD libraries with OTLP adoption.
* Performance improvements, including a reduction in JVM CPU time spent on metrics processing from 10% to under 1%.
* Use of vmagent for streaming aggregation and horizontal sharding to manage high-cardinality data.
* Implementation of a zero-injection technique within the vmagent tier to solve Prometheus counter-reset edge cases.
* A two-layer architecture consisting of stateless router pods and stateful aggregator pods.
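The counter-reset problem behind the zero-injection technique is easy to see in miniature: when a new series appears (for example, after a pod restart), rate calculations have no earlier sample to diff against, so the series' initial increase is lost. A minimal sketch, assuming a simplified sample format and a one-second back-dating interval that are illustrative inventions, not Airbnb's actual implementation:

```python
# Illustrative zero-injection for Prometheus-style counters: the first time a
# series is seen, emit a synthetic zero sample just before its first real
# sample so downstream rate() calculations count the initial increase.
# The (series, timestamp, value) tuple format is an assumption for this sketch.

def inject_zeros(samples, seen):
    out = []
    for series, ts, value in samples:
        if series not in seen:
            seen.add(series)
            out.append((series, ts - 1, 0.0))  # back-dated synthetic zero
        out.append((series, ts, value))
    return out

seen = set()
batch = [("http_requests_total{pod='a'}", 100, 7.0),
         ("http_requests_total{pod='a'}", 160, 12.0)]
result = inject_zeros(batch, seen)
# The series' first sample is now preceded by a zero at t=99, so a rate
# calculation sees the full increase of 7 rather than starting mid-flight.
```

In a real pipeline this logic would live in the stateful aggregator tier, where per-series state can be kept across batches.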
This article examines the development of Microsoft’s Azure SRE Agent, designed to mitigate operational toil in mission-critical environments. By utilizing an "agentic workflow" of specialized AI agents, Microsoft has integrated automation across the entire software development lifecycle. This human-AI partnership has autonomously resolved over 35,000 incidents and saved more than 50,000 developer hours, accelerating root cause analysis and mitigation while maintaining rigorous governance and human oversight.
This article details how HPE is addressing operational fatigue and burnout in IT teams through the introduction of agentic AI operations. HPE's new system utilizes skills-based AI agents that work alongside human operators to reduce alert noise, improve response times, and cut root cause analysis time by at least half, according to early adopters.
The focus is on augmenting human capabilities rather than replacing them, with a strong emphasis on auditability, transparency, and human oversight in AI-driven actions. The system aims to break down data silos and provide proactive insights to prevent issues before they escalate.
This article discusses how AI is changing infrastructure as code (IaC) and the challenges it presents. Spacelift co-founder Marcin Wyszynski argues that while AI tools can democratize infrastructure provisioning, teams that do not understand the generated code take on real risk. He draws a parallel to learning a foreign language: AI can produce the code, but teams need to comprehend it to avoid potentially disastrous infrastructure changes.
Spacelift's solution, Intent, focuses on deterministic guardrails and integration with tools like Open Policy Agent to ensure safe and controlled AI-driven infrastructure management. The core challenge is balancing speed and control in a rapidly evolving landscape.
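The idea of a deterministic guardrail can be sketched in a few lines. Everything below is a hypothetical stand-in, not Spacelift's or OPA's actual schema: real setups would evaluate Rego policies against a Terraform plan JSON, but the shape of the check is the same.

```python
# Hypothetical guardrail: reject any AI-generated plan that deletes resources
# or touches production without explicit approval. The plan schema here is
# invented for illustration.

def check_plan(plan, approved=False):
    violations = []
    for change in plan.get("resource_changes", []):
        if "delete" in change.get("actions", []):
            violations.append(f"destructive change to {change['address']}")
        if change.get("environment") == "production" and not approved:
            violations.append(f"unapproved production change to {change['address']}")
    return violations

plan = {"resource_changes": [
    {"address": "aws_s3_bucket.logs", "actions": ["delete"], "environment": "staging"},
    {"address": "aws_rds_instance.main", "actions": ["update"], "environment": "production"},
]}
issues = check_plan(plan)
# Both changes are flagged: one destructive, one an unapproved production update.
```

The point of determinism is that the same plan always produces the same verdict, regardless of what the AI that generated the plan "intended".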
The Model Context Protocol (MCP) is becoming a key component in the agentic AI space, enabling models to interact with external tools and data. The project's 2026 roadmap focuses on addressing challenges for production deployment. Key priorities include improving scalability by evolving the transport and session model, clarifying agent communication and task lifecycle management, maturing governance structures for wider community contribution, and preparing for enterprise requirements like audit trails and authentication. The roadmap also highlights ongoing exploration of areas like event-driven updates and security.
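At the wire level, MCP messages are JSON-RPC 2.0; the `tools/call` method and its params shape follow the MCP specification, while the tool name and arguments below are invented for illustration:

```python
import json

# Sketch of an MCP tool-invocation request. MCP rides on JSON-RPC 2.0;
# "tools/call" is the spec's method for invoking a tool. The tool name
# "query_metrics" and its arguments are hypothetical.

def tools_call_request(request_id, tool_name, arguments):
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

msg = tools_call_request(1, "query_metrics", {"service": "checkout", "window": "5m"})
```

The roadmap items above (transport evolution, session model, task lifecycle) all concern what happens around messages like this one, not the message format itself.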
The New Stack encourages its readers to contribute to Towards Data Science, a leading platform for data science and AI. Recognizing the increasing convergence of cloud infrastructure, DevOps, and AI engineering, the article invites practitioners to share their experiences with building and deploying AI systems. Successful TDS submissions are technically detailed, timely, and specific. Authors can also benefit from editorial support, promotion, and potential payment opportunities, while building their reputation within the AI community.
Agentic workflows are rapidly increasing the volume of pull requests, and validation is quickly becoming the most critical bottleneck. Teams already running service meshes like Istio are well-positioned to solve it with ephemeral preview environments.
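One common mesh pattern for this: route requests carrying a PR identifier header to an ephemeral per-PR deployment via an Istio VirtualService, while all other traffic stays on the stable service. The sketch below generates such a manifest as a plain dict; the `x-pr-id` header name and the hostnames are assumptions for illustration, not a standard convention:

```python
# Build an Istio VirtualService (as a plain dict, ready to serialize to YAML)
# that sends requests bearing a hypothetical "x-pr-id" header to an ephemeral
# per-PR deployment, and everything else to the stable service.

def preview_virtual_service(service, pr_id):
    return {
        "apiVersion": "networking.istio.io/v1",
        "kind": "VirtualService",
        "metadata": {"name": f"{service}-preview-{pr_id}"},
        "spec": {
            "hosts": [service],
            "http": [
                {   # PR traffic, matched by header, goes to the preview pods
                    "match": [{"headers": {"x-pr-id": {"exact": str(pr_id)}}}],
                    "route": [{"destination": {"host": f"{service}-pr-{pr_id}"}}],
                },
                {   # default route: everything else hits the stable service
                    "route": [{"destination": {"host": service}}],
                },
            ],
        },
    }

vs = preview_virtual_service("checkout", 1234)
```

Because the routing rule is data, a CI job can stamp one out per pull request and delete it when the branch merges.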
The article details Amazon Web Services' (AWS) recent decision to lay off roughly 40% of its DevOps workforce, specifically staff involved in managing and maintaining its own internal infrastructure. It frames the move not as downsizing or a retreat from DevOps, but as a strategic shift toward fully embracing a "platform engineering" approach, backed by automation and a new internal tool called 'Dahlia'. The piece examines the reasons behind the layoffs, the tools enabling the transition, and the potential impact on the future of DevOps.
* **Shift to Platform Engineering:** AWS is building internal "developer platforms" – self-service tools and standardized components – to empower application development teams to manage their own infrastructure and deployments with less reliance on centralized DevOps teams.
* **Key Tools Driving the Change:** The article highlights three main tools enabling this transition:
* **Pulumi:** An Infrastructure-as-Code (IaC) tool allowing developers to define infrastructure using familiar programming languages (Python, JavaScript, Go, etc.).
* **Crossplane:** An open-source Kubernetes add-on that extends Kubernetes to manage infrastructure across multiple cloud providers.
* **Backstage:** A developer portal created by Spotify, now open-source, that provides a centralized interface for developers to discover, create, and manage software components and infrastructure.
* **Impact of the Layoffs:** The layoffs were concentrated in teams traditionally responsible for manual infrastructure provisioning and maintenance. The remaining DevOps staff are being re-focused on building and maintaining the internal developer platforms.
* **Wider Industry Trend:** This move by AWS reflects a broader trend in the industry towards platform engineering, driven by the need for faster innovation, increased developer productivity, and reduced operational overhead.
In essence, AWS is automating away much of the traditional DevOps work, allowing developers to self-serve their infrastructure needs through these platform tools. This is a strategic move to scale its internal development efforts and accelerate innovation.
A recent article by Google Cloud SREs describes how they use the AI-powered Gemini CLI internally to resolve real-world outages. By integrating intelligent reasoning directly into terminal-based operational tooling, the approach improves reliability in critical infrastructure operations and shortens incident response times.