This article explores how OpenTelemetry Semantic Conventions for Generative AI provide deep visibility into LLM-powered applications by standardizing the recording of model calls, tool invocations, and token exchanges. It offers a practical walkthrough of exporting telemetry from tools like VS Code Copilot and visualizing traces, metrics, and chat-style conversations in the Aspire Dashboard.
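To make the conventions concrete, here is a minimal Python sketch of a chat-completion span carrying GenAI semantic-convention attributes. The model name, token counts, and the commented-out model call are illustrative stand-ins, not details from the article.

```python
# Minimal sketch: recording a chat completion as a span following the
# OpenTelemetry GenAI semantic conventions.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo.genai")

# Span name convention: "{operation} {model}".
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # response = call_llm(...)  # hypothetical model call
    span.set_attribute("gen_ai.usage.input_tokens", 42)    # taken from the response
    span.set_attribute("gen_ai.usage.output_tokens", 128)
```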
AWS has launched the public preview of OpenTelemetry (OTel) metrics support in Amazon CloudWatch, enabling developers to send metrics directly over OTLP. This update completes CloudWatch's open-standards support across logs, traces, and metrics. Key capabilities include (an exporter sketch follows the list):
- Support for high-cardinality metrics with up to 150 labels per metric.
- Integration of PromQL, allowing users to use Prometheus query language within the CloudWatch console and Managed Grafana.
- Automatic enrichment of ingested metrics with AWS resource metadata such as account ID, Region, and resource tags.
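As a rough illustration of sending metrics directly over OTLP from application code, here is a minimal sketch using the OpenTelemetry Python SDK. The endpoint URL is a placeholder; CloudWatch's actual ingestion endpoint and SigV4 authentication are not shown, so treat this as a sketch of the export path rather than AWS-specific configuration.

```python
# Hedged sketch: exporting a counter over OTLP/HTTP with the OTel Python SDK.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

exporter = OTLPMetricExporter(
    endpoint="https://<cloudwatch-otlp-endpoint>/v1/metrics",  # placeholder, not the real endpoint
)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("demo.app")
requests_total = meter.create_counter("app.requests", unit="1")
requests_total.add(1, {"route": "/checkout", "status": "200"})  # labeled sample
```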
At GrafanaCON 2026, Grafana Labs announced significant updates including the launch of Grafana 13 and a major architectural overhaul for Loki. The new Loki design moves away from replication-at-ingestion toward using Kafka as a durability layer to reduce data duplication and improve query performance. Additionally, the company introduced GCX, a new CLI tool in public preview designed to integrate observability data directly into agentic development environments like Claude Code and Cursor, allowing engineers to resolve production issues without leaving their coding tools.
Key announcements include:
- Loki rearchitected with Kafka to reduce storage overhead and improve query speed.
- Introduction of GCX CLI for seamless observability integration within AI coding agents.
- Launch of Grafana 13 featuring dynamic dashboards and expanded data source support.
- New AI Observability product in public preview for monitoring LLM applications.
STCLab's SRE team shares their experience building an AI-driven investigation pipeline to automate the triage of Kubernetes alerts. Using HolmesGPT, they implemented a ReAct pattern that lets the LLM autonomously select tools such as Prometheus, Loki, and kubectl based on the alert's context. The core finding was that high-quality markdown runbooks containing exclusion rules mattered more to successful investigations than the choice of underlying AI model.
Key points:
* Implementation of HolmesGPT using the ReAct agent pattern for autonomous troubleshooting (sketched after this list).
* Integration with Robusta to manage Slack routing, deduplication, and thread matching.
* The vital role of runbooks in narrowing search spaces and reducing wasted tool calls.
* Comparison between self-hosted models via KubeAI and managed API approaches.
* Significant reduction in manual triage time from 20 minutes to under two minutes per investigation.
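To sketch what the ReAct loop looks like in code: the model alternates between choosing a tool, observing its output, and reasoning again until it can answer. The llm_step() helper and tool registry below are hypothetical mocks, not the HolmesGPT API.

```python
# Illustrative ReAct loop: the LLM picks tools, observes results, and iterates.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "prometheus_query": lambda q: f"<metric results for {q!r}>",
    "loki_query":       lambda q: f"<log lines matching {q!r}>",
    "kubectl":          lambda q: f"<kubectl output for {q!r}>",
}

def llm_step(transcript: str) -> dict:
    """Mock model call; a real agent would send the transcript to an LLM
    and parse its Thought/Action output."""
    if "Observation:" not in transcript:
        return {"tool": "prometheus_query", "input": "container_memory_usage"}
    return {"final_answer": "Pod OOMKilled; memory limit set too low."}

def investigate(alert: str, runbook: str, max_steps: int = 8) -> str:
    # The runbook (with its exclusion rules) goes into the prompt so the model
    # skips known-benign causes and wastes fewer tool calls.
    transcript = f"Alert: {alert}\nRunbook:\n{runbook}\n"
    for _ in range(max_steps):
        step = llm_step(transcript)                        # Reason
        if "final_answer" in step:
            return step["final_answer"]
        observation = TOOLS[step["tool"]](step["input"])   # Act
        transcript += f"\nObservation: {observation}"      # Observe, then loop
    return "Investigation inconclusive; escalate to on-call."

print(investigate("HighMemoryUsage pod=checkout-7d9f", "Ignore spikes during deploys."))
```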
Airbnb's observability engineering team has transitioned from a legacy StatsD and proprietary Veneur-based aggregation pipeline to a modern, open-source stack utilizing OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The new system handles over 100 million samples per second in production while reducing costs by roughly an order of magnitude.
Key technical highlights include:
* Migration strategy using dual-emitting metrics to bridge legacy StatsD libraries with OTLP adoption.
* Performance improvements, including a reduction in JVM CPU time spent on metrics processing from 10% to under 1%.
* Use of vmagent for streaming aggregation and horizontal sharding to manage high-cardinality data.
* Implementation of a zero-injection technique in the vmagent tier to solve Prometheus counter-reset edge cases (illustrated after this list).
* A two-layer architecture consisting of stateless router pods and stateful aggregator pods.
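The zero-injection idea is worth unpacking: when a counter series first appears with, say, value 37, PromQL's rate() cannot tell whether it rose from 0 or merely started being reported at 37, so the initial increase is lost. The sketch below illustrates the concept only; it is not Airbnb's vmagent implementation.

```python
# Conceptual illustration of zero injection: prepend a synthetic 0 sample
# when a counter series is seen for the first time, so rate()/increase()
# can count the initial rise instead of discarding it.
seen_series: set[tuple] = set()

def with_zero_injection(series_key: tuple, timestamp_ms: int, value: float):
    """Yield (timestamp_ms, value) samples, prepending a zero for new series."""
    if series_key not in seen_series:
        seen_series.add(series_key)
        yield (timestamp_ms - 1, 0.0)  # synthetic sample just before the real one
    yield (timestamp_ms, value)

# First sample of a new series arrives at value 37:
samples = list(with_zero_injection(("http_requests_total", 'pod="a"'),
                                   1_700_000_000_000, 37.0))
# -> [(1699999999999, 0.0), (1700000000000, 37.0)]
```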
Prove AI is developing an observability-first foundation designed for production generative AI systems. Their mission is to enable engineering teams to understand, diagnose, and remediate failures within complex AI pipelines, including LLM inference, retrieval processes, and agent orchestration.
The current release, v0.1, provides an opinionated observability pipeline specifically for generative AI workloads through:
- A containerized, OpenTelemetry-based telemetry pipeline.
- Preconfigured collection of traces, metrics, and logs tailored for AI systems.
- Instrumentation patterns for RAG pipelines, embeddings, LLM inference, and agent-based systems (see the sketch after this list).
- Compatibility with standard backends like Prometheus.
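For a sense of what such instrumentation patterns can look like, here is a hypothetical sketch of a RAG pipeline traced with nested OpenTelemetry spans. The span and attribute names are illustrative, not Prove AI's schema, and a configured TracerProvider is assumed.

```python
# Illustrative RAG pipeline instrumentation with nested OTel spans.
from opentelemetry import trace

tracer = trace.get_tracer("demo.rag")

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.pipeline"):
        with tracer.start_as_current_span("rag.embed"):
            _query_vector = [0.0] * 768           # placeholder embedding call
        with tracer.start_as_current_span("rag.retrieve") as retrieve:
            documents = ["doc-1", "doc-2"]        # placeholder vector search
            retrieve.set_attribute("retrieval.documents.count", len(documents))
        with tracer.start_as_current_span("chat gpt-4o") as generation:
            generation.set_attribute("gen_ai.request.model", "gpt-4o")
            return "<generated answer>"           # placeholder LLM call
```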
A Python package designed to provide production-ready templates for Generative AI agents on Google Cloud. It allows developers to focus on agent logic by automating the surrounding infrastructure, including CI/CD pipelines, observability, security, and deployment via Cloud Run or Agent Engine.
Key features and offerings include:
- Pre-built agent templates such as ReAct, RAG (Retrieval-Augmented Generation), multi-agent systems, and real-time multimodal agents using Gemini.
- Automated CI/CD integration with Google Cloud Build and GitHub Actions.
- Data pipelines for RAG using Terraform, supporting Vertex AI Search and Vector Search.
- Support for various frameworks including Google's Agent Development Kit (ADK) and LangGraph.
- Integration with the Gemini CLI for architectural guidance directly in the terminal.
Infinite Monitor is an AI-powered dashboard builder: users describe the widget they want in plain English, and an AI agent writes, builds, and deploys it in real time. Each widget is a full React app running in an isolated iframe, offering flexibility and customization. Users can drag, resize, and organize these widgets on an infinite canvas for applications such as cybersecurity, OSINT, trading, and prediction markets.
The project supports multiple AI providers and offers features like dashboard awareness, live web search, and a widget marketplace. It prioritizes security with local-first storage and threat scanning.
"Prove AI is a self-hosted solution designed to accelerate GenAI performance monitoring. It allows AI engineers to capture, customize, and monitor GenAI metrics on their own terms, without vendor lock-in. Built on OpenTelemetry, Prove AI connects to existing OpenTelemetry pipelines and surfaces meaningful metrics quickly.
Key features include a unified web-based interface for consolidating performance metrics like token throughput, latency distributions, and service health. It enables faster debugging, improved time-to-metric, and better measurement of GenAI ROI. The platform is open-source, free to deploy, and offers full control over telemetry data."
Distributed tracing is crucial for modern observability, offering richer context than logs. However, the volume of tracing data can be overwhelming. Sampling addresses this by selectively retaining data, with two main approaches: head sampling (deciding upfront) and tail sampling (deciding after collecting all spans). Head sampling is simpler but can miss localized issues. Tail sampling, while more accurate, is complex to implement at scale, requiring buffering, stateful processing, and potentially impacting system resilience. Furthermore, sampling inherently affects the accuracy of RED metrics (request rate, error rate, duration), necessitating metric materialization *before* sampling.
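To make the head/tail distinction concrete, here is a minimal head-sampling sketch with the OpenTelemetry Python SDK (the ratio and setup are illustrative). Tail sampling, by contrast, typically lives in the Collector's tail_sampling processor, since it must buffer every span of a trace before deciding.

```python
# Head sampling: the keep/drop decision is made at the root span, upfront.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of traces; ParentBased makes children follow the root's decision
# so sampled traces stay complete.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```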