klotz: devops*


  1. AWS has launched a public preview of OpenTelemetry (OTel) metrics support in Amazon CloudWatch, enabling developers to send metrics directly via the OpenTelemetry Protocol (OTLP). With this update, CloudWatch supports logs, traces, and metrics through open standards.

    - Support for high-cardinality metrics with up to 150 labels per metric.
    - Integration of PromQL, allowing users to use Prometheus query language within the CloudWatch console and Managed Grafana.
    - Automatic enrichment of ingested metrics with AWS resource metadata such as account ID, Region, and resource tags.
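To make the OTLP ingestion described above more concrete, here is a minimal sketch of the OTLP/JSON body for a single gauge data point, built with the Python standard library only. The function name `otlp_gauge_payload`, the metric name, and the scope name are hypothetical. The CloudWatch endpoint and authentication details are not given in the summary, so no network call is made. The 150-label check mirrors the limit quoted above; whether resource attributes count toward that limit is an assumption left open here.

```python
import json
import time
from typing import Dict, Optional

MAX_LABELS = 150  # limit quoted in the summary above


def _kv(attrs: Dict[str, str]) -> list:
    """Encode a dict as OTLP KeyValue entries with string values."""
    return [{"key": k, "value": {"stringValue": str(v)}} for k, v in attrs.items()]


def otlp_gauge_payload(
    name: str,
    value: float,
    attributes: Dict[str, str],
    resource_attributes: Dict[str, str],
    time_unix_nano: Optional[int] = None,
) -> dict:
    """Build an OTLP/JSON metrics request body holding one gauge data point."""
    if len(attributes) > MAX_LABELS:
        raise ValueError(f"{len(attributes)} labels exceeds the limit of {MAX_LABELS}")
    ts = time.time_ns() if time_unix_nano is None else time_unix_nano
    return {
        "resourceMetrics": [{
            "resource": {"attributes": _kv(resource_attributes)},
            "scopeMetrics": [{
                "scope": {"name": "example.scope"},  # hypothetical scope name
                "metrics": [{
                    "name": name,
                    "gauge": {"dataPoints": [{
                        "asDouble": float(value),
                        # 64-bit integers are written as strings in OTLP/JSON
                        "timeUnixNano": str(ts),
                        "attributes": _kv(attributes),
                    }]},
                }],
            }],
        }]
    }


if __name__ == "__main__":
    body = otlp_gauge_payload(
        "queue.depth", 3, {"queue": "orders"}, {"service.name": "checkout"}
    )
    print(json.dumps(body, indent=2))
```

In practice an OpenTelemetry SDK or the OpenTelemetry Collector would normally produce and send this payload. The sketch only shows the shape of the data on the wire.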
  2. At GrafanaCON 2026, Grafana Labs announced significant updates including the launch of Grafana 13 and a major architectural overhaul for Loki. The new Loki design moves away from replication-at-ingestion toward using Kafka as a durability layer to reduce data duplication and improve query performance. Additionally, the company introduced GCX, a new CLI tool in public preview designed to integrate observability data directly into agentic development environments like Claude Code and Cursor, allowing engineers to resolve production issues without leaving their coding tools.
    Key highlights:
    - Loki rearchitected with Kafka to reduce storage overhead and improve query speed.
    - Introduction of GCX CLI for seamless observability integration within AI coding agents.
    - Launch of Grafana 13 featuring dynamic dashboards and expanded data source support.
    - New AI Observability product in public preview for monitoring LLM applications.
  3. AWS has made its DevOps Agent generally available. The agent is a generative AI assistant designed to automate incident investigation and operational tasks. Built on Amazon Bedrock AgentCore, it integrates with observability platforms, code repositories, and CI/CD pipelines to autonomously triage issues and correlate telemetry data. New capabilities include support for investigating applications in Azure and on-premises environments, custom agent skills, and personalized reporting.
    Key highlights:
    * Autonomous incident investigation triggered by webhooks from sources like CloudWatch or PagerDuty.
    * Integration with major tools including Datadog, Grafana, Splunk, GitHub, and GitLab.
    * Reported performance improvements of up to 75% lower MTTR during preview.
    * Pricing billed per second, based on the cumulative time the agent spends on operational tasks.
  4. Airbnb's observability engineering team has transitioned from a legacy StatsD and proprietary Veneur-based aggregation pipeline to a modern, open-source stack utilizing OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The new system handles over 100 million samples per second in production while reducing costs by roughly an order of magnitude.
    Key technical highlights include:
    * Migration strategy using dual-emitting metrics to bridge legacy StatsD libraries with OTLP adoption.
    * Performance improvements, including a reduction in JVM CPU time spent on metrics processing from 10% to under 1%.
    * Use of vmagent for streaming aggregation and horizontal sharding to manage high-cardinality data.
    * Implementation of a zero injection technique within the vmagent tier to solve Prometheus counter reset edge cases.
    * A two-layer architecture consisting of stateless router pods and stateful aggregator pods.
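The zero injection idea mentioned above can be illustrated with a small sketch. This is one plausible reading of the technique, not Airbnb's or vmagent's actual implementation: when a cumulative counter series first appears with a non-zero value, a Prometheus-style increase calculation has no earlier sample to compare against and misses the initial jump, so a synthetic zero sample is placed just before the first real one. The names `increase` and `inject_zero` and the `step` parameter are hypothetical, and `increase` is a simplified version without Prometheus's extrapolation.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative counter value)


def increase(samples: List[Sample]) -> float:
    """Sum the growth between consecutive samples, treating a drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value is the growth.
        total += cur - prev if cur >= prev else cur
    return total


def inject_zero(samples: List[Sample], step: float = 1.0) -> List[Sample]:
    """Prepend a synthetic zero sample if the series starts at a non-zero value."""
    if not samples or samples[0][1] == 0:
        return list(samples)
    first_ts = samples[0][0]
    return [(first_ts - step, 0.0)] + list(samples)


if __name__ == "__main__":
    series = [(10.0, 5.0), (20.0, 7.0)]  # new series whose first observed value is 5
    print(increase(series))               # the first 5 events are invisible
    print(increase(inject_zero(series)))  # the initial jump is now counted
```

With the two-sample series in the usage example, the plain calculation sees only the growth from 5 to 7, while the zero-injected series also accounts for the first five events.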
  5. This article examines the development of Microsoft’s Azure SRE Agent, designed to mitigate operational toil in mission-critical environments. By utilizing an "agentic workflow" of specialized AI agents, Microsoft has integrated automation across the entire software development lifecycle. This human-AI partnership has autonomously resolved over 35,000 incidents and saved more than 50,000 developer hours, accelerating root cause analysis and mitigation while maintaining rigorous governance and human oversight.
  6. This article details how HPE is addressing operational fatigue and burnout in IT teams through the introduction of agentic AI operations. HPE's new system utilizes skills-based AI agents that work alongside human operators to reduce alert noise, improve response times, and cut root cause analysis time by at least half, according to early adopters.
    The focus is on augmenting human capabilities rather than replacing them, with a strong emphasis on auditability, transparency, and human oversight in AI-driven actions. The system aims to break down data silos and provide proactive insights to prevent issues before they escalate.
  7. This article discusses how AI is changing infrastructure as code (IaC) and the challenges it presents. Spacelift's co-founder, Marcin Wyszynski, explains that while AI tools can democratize infrastructure provisioning, the lack of understanding of the generated code poses risks. He draws a parallel to learning a foreign language – AI can produce the code, but teams need to comprehend it to avoid potentially disastrous infrastructure changes.
    Spacelift's solution, Intent, focuses on deterministic guardrails and integration with tools like Open Policy Agent to ensure safe and controlled AI-driven infrastructure management. The core challenge is balancing speed and control in a rapidly evolving landscape.
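As a rough illustration of what a deterministic guardrail means here, the sketch below applies a fixed rule to a planned set of infrastructure changes before anything is applied. It is not Spacelift Intent and not an Open Policy Agent policy. The plan shape is an assumption modeled loosely on Terraform's JSON plan output (`resource_changes` entries with `change.actions`), and the `PROTECTED_TYPES` set and the `violations` function are hypothetical.

```python
from typing import Dict, List

# Hypothetical set of resource types that must never be deleted automatically.
PROTECTED_TYPES = {"aws_db_instance", "aws_s3_bucket"}


def violations(plan: Dict) -> List[str]:
    """Return one message per planned deletion of a protected resource type."""
    found = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if "delete" in actions and rc.get("type") in PROTECTED_TYPES:
            found.append(f"{rc.get('address')}: deleting {rc.get('type')} is not allowed")
    return found


if __name__ == "__main__":
    plan = {
        "resource_changes": [
            {"address": "aws_db_instance.main", "type": "aws_db_instance",
             "change": {"actions": ["delete"]}},
            {"address": "aws_instance.web", "type": "aws_instance",
             "change": {"actions": ["create"]}},
        ]
    }
    for message in violations(plan):
        print(message)
```

The same plan always yields the same verdict, regardless of whether a person or an AI tool generated the change. That is the property the summary contrasts with unreviewed AI-written code.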
  8. The Model Context Protocol (MCP) is becoming a key component in the agentic AI space, enabling models to interact with external tools and data. The project's 2026 roadmap focuses on addressing challenges for production deployment. Key priorities include improving scalability by evolving the transport and session model, clarifying agent communication and task lifecycle management, maturing governance structures for wider community contribution, and preparing for enterprise requirements like audit trails and authentication. The roadmap also highlights ongoing exploration of areas like event-driven updates and security.
  9. The New Stack encourages its readers to contribute to Towards Data Science, a leading platform for data science and AI. Recognizing the increasing convergence of cloud infrastructure, DevOps, and AI engineering, the article invites practitioners to share their experiences with building and deploying AI systems. Successful TDS submissions are technically detailed, timely, and specific. Authors can also benefit from editorial support, promotion, and potential payment opportunities, while building their reputation within the AI community.
  10. Agentic workflows are rapidly accelerating the volume of pull requests, and validation is quickly becoming the most critical bottleneck. Teams using service meshes like Istio are well-positioned to solve it in ephemeral environments.
