Tags: opentelemetry*


  1. Airbnb's observability engineering team has transitioned from a legacy StatsD and proprietary Veneur-based aggregation pipeline to a modern, open-source stack utilizing OpenTelemetry Protocol (OTLP), the OpenTelemetry Collector, and VictoriaMetrics' vmagent. The new system handles over 100 million samples per second in production while reducing costs by roughly an order of magnitude.
    Key technical highlights include:
    * Migration strategy using dual-emitting metrics to bridge legacy StatsD libraries with OTLP adoption.
    * Performance improvements, including a reduction in JVM CPU time spent on metrics processing from 10% to under 1%.
    * Use of vmagent for streaming aggregation and horizontal sharding to manage high-cardinality data.
    * Implementation of a zero-injection technique within the vmagent tier to handle Prometheus counter-reset edge cases.
    * A two-layer architecture consisting of stateless router pods and stateful aggregator pods.
  2. Prove AI is developing an observability-first foundation designed for production generative AI systems. Their mission is to enable engineering teams to understand, diagnose, and remediate failures within complex AI pipelines, including LLM inference, retrieval processes, and agent orchestration.
    The current release, v0.1, provides an opinionated observability pipeline specifically for generative AI workloads through:
    - A containerized, OpenTelemetry-based telemetry pipeline.
    - Preconfigured collection of traces, metrics, and logs tailored for AI systems.
    - Instrumentation patterns for RAG pipelines, embeddings, LLM inference, and agent-based systems.
    - Compatibility with standard backends like Prometheus.
  3. "Prove AI is a self-hosted solution designed to accelerate GenAI performance monitoring. It allows AI engineers to capture, customize, and monitor GenAI metrics on their own terms, without vendor lock-in. Built on OpenTelemetry, Prove AI connects to existing OpenTelemetry pipelines and surfaces meaningful metrics quickly.
    Key features include a unified web-based interface for consolidating performance metrics like token throughput, latency distributions, and service health. It enables faster debugging, improved time-to-metric, and better measurement of GenAI ROI. The platform is open-source, free to deploy, and offers full control over telemetry data."
  4. Distributed tracing is crucial for modern observability, offering richer context than logs. However, the volume of tracing data can be overwhelming. Sampling addresses this by selectively retaining data, with two main approaches: head sampling (deciding when the trace starts) and tail sampling (deciding after all spans of a trace have been collected). Head sampling is simpler but can miss localized issues. Tail sampling is more accurate but complex to implement at scale: it requires buffering and stateful processing, and it can affect system resilience. Furthermore, sampling inherently distorts RED metrics (request rate, error rate, duration), so those metrics must be materialized *before* sampling.
  5. This article details building end-to-end observability for LLM applications using FastAPI and OpenTelemetry. It emphasizes a code-first approach, manually designing traces, spans, and semantic attributes to capture the full lifecycle of LLM-powered requests. The guide advocates for a structured approach to tracing RAG workflows, focusing on clear span boundaries, safe metadata capture (hashing prompts/responses), token usage tracking, and integration with observability backends like Jaeger, Grafana Tempo, or specialized LLM platforms. It highlights the importance of understanding LLM behavior beyond traditional infrastructure metrics.
  6. Agoda engineers developed API Agent, a system that lets a single Model Context Protocol (MCP) server connect to internal REST or GraphQL APIs with zero code and zero deployments. It is designed to reduce the operational overhead of managing multiple APIs with distinct schemas and authentication methods, so teams can query services through AI assistants without building an individual MCP server for each API.
  7. Google Cloud has announced native support for the OpenTelemetry Protocol (OTLP) in its Cloud Trace service, allowing developers to send trace data directly using OTLP and eliminating the need for vendor-specific exporters. This includes increased storage limits for attributes and spans.
  8. The company's transition from fragmented observability tools to a unified system using OpenTelemetry and OneUptime dramatically improved incident response times, reducing MTTR from 41 to 9 minutes. By correlating logs, metrics, and traces through structured logging and intelligent sampling, they eliminated much of the noise and confusion that previously slowed root cause analysis. The shift also reduced the number of dashboards engineers needed to check per incident and significantly lowered the percentage of incidents with unknown causes.

    Key practices included instrumenting once with OpenTelemetry, enforcing cardinality limits, and archiving raw data for future analysis. The move away from 100% trace capture and over-instrumentation helped manage data volume while maintaining visibility into anomalies. This transformation emphasized that effective observability isn't about collecting more data, but about designing correlated signals that support intentional diagnosis and reduce cognitive load.
  9. A guide to building a robust logging system in Python, covering structured logging, log levels, handlers, formatters, filters, and integrating logging with modern observability practices.
  10. This article compares three telemetry pipeline solutions – Cribl, Edge Delta, and DIY OpenTelemetry – based on scalability, performance, data management, intelligence, and cost. It details the strengths and weaknesses of each approach to help organizations choose the best solution for their observability and security data needs.
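
The head versus tail sampling distinction summarized in item 4 can be illustrated with a small, library-free sketch. Everything here is an assumption made for illustration: the `Span` shape, the function names, and the "keep errors and slow traces" policy are hypothetical and are not taken from the bookmarked article or from any OpenTelemetry API. The sketch also counts requests and errors before sampling, which is the reason RED metrics stay accurate.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical minimal span record used only for this sketch.
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide up front, from the trace ID alone, whether to keep a trace."""
    # Hash the ID to a stable 64-bit integer so every service that sees
    # the same trace ID reaches the same decision.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(ratio * 2**64)


def tail_sample(spans: list[Span], slow_ms: float) -> bool:
    """Decide after a trace is fully buffered: keep errors and slow traces."""
    return any(s.is_error or s.duration_ms >= slow_ms for s in spans)


def red_counts(traces: dict[str, list[Span]]) -> dict[str, int]:
    """Count requests and errors over all traces, before any sampling."""
    errors = sum(any(s.is_error for s in spans) for spans in traces.values())
    return {"requests": len(traces), "errors": errors}


traces = {
    "t1": [Span("t1", "GET /a", 12.0)],
    "t2": [Span("t2", "GET /b", 950.0)],
    "t3": [Span("t3", "GET /c", 20.0, is_error=True)],
}

metrics = red_counts(traces)  # computed on 100% of traffic
kept = [tid for tid, spans in traces.items() if tail_sample(spans, slow_ms=500.0)]
```

In this toy data the tail sampler keeps only the slow trace and the failing trace, while `metrics` still reflects all three requests. A real tail sampler also has to buffer spans until a trace is complete, which is where the statefulness and resilience concerns come from.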
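
Item 5 mentions hashing prompts and responses and tracking token usage. A minimal sketch of that idea, using only the standard library, might look like the following. The attribute keys (`app.prompt.sha256_16` and so on) and both function names are hypothetical placeholders, not names from the article or from the OpenTelemetry semantic conventions; in a real service the resulting dictionary would be attached to a span through whatever tracing API is in use.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a short, stable SHA-256 fingerprint of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def llm_span_attributes(
    prompt: str, response: str, input_tokens: int, output_tokens: int
) -> dict:
    """Build a flat attribute dict that never contains the raw text."""
    return {
        "app.prompt.sha256_16": fingerprint(prompt),      # hypothetical key
        "app.response.sha256_16": fingerprint(response),  # hypothetical key
        "app.tokens.input": input_tokens,
        "app.tokens.output": output_tokens,
        "app.tokens.total": input_tokens + output_tokens,
    }


attrs = llm_span_attributes(
    "What is OTLP?", "OTLP is a telemetry protocol.", input_tokens=5, output_tokens=42
)
```

Because the fingerprint is deterministic, identical prompts can still be grouped in a tracing backend without storing their content. Truncating the digest to 16 hex characters is a readability choice in this sketch, not a recommendation.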

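The pieces listed in item 9 (structured logging, levels, handlers, formatters, filters) can be combined in a few lines with Python's standard `logging` module. This is only a sketch under assumptions: the `JsonFormatter` and `DropHealthChecks` classes, the `/healthz` path, the `demo.app` logger name, and the `trace_id` field are all hypothetical and not taken from the bookmarked guide.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["trace_id"] = trace_id
        return json.dumps(payload)


class DropHealthChecks(logging.Filter):
    """Suppress noisy health-check messages."""

    def filter(self, record: logging.LogRecord) -> bool:
        return "/healthz" not in record.getMessage()


stream = io.StringIO()  # stands in for stdout or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
handler.addFilter(DropHealthChecks())

logger = logging.getLogger("demo.app")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep output out of the root logger

logger.info("GET /healthz")                 # dropped by the filter
logger.debug("cache miss")                  # dropped by the INFO level
logger.info("order placed", extra={"trace_id": "abc123"})

lines = stream.getvalue().splitlines()
```

Only the last call produces output, as one JSON line carrying the `trace_id`. Including a trace identifier in each log line is one way to correlate logs with traces in an observability backend.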

SemanticScuttle - klotz.me: tagged with "opentelemetry"
