Ship measurable improvements in your GenAI systems with Opik, your open-source LLM observability and agent optimization platform. Trusted by over 150,000 developers and thousands of companies.
Elastic's new Streams feature uses AI to transform noisy logs into actionable insights, helping SREs diagnose and resolve issues faster. The article discusses how AI is poised to become the primary tool for incident diagnosis and to help address skill shortages in IT infrastructure management.
Here's a breakdown of the technical details:
* **Problem:** Modern IT (especially Kubernetes) generates massive amounts of log data (30-50GB/day per cluster) making manual analysis for root cause identification slow, costly, and prone to errors. Existing observability tools often treat logs as a last resort.
* **Elastic's Solution (Streams):**
    * **AI-powered Parsing & Partitioning:** Automatically extracts relevant fields from raw logs, reducing manual effort.
    * **Anomaly Detection:** Surfaces critical errors and anomalies from logs, providing early warnings.
    * **Automated Remediation:** Aims to not only identify issues but also suggest or automatically implement fixes.
* **Workflow Shift:** Streams aims to move away from the traditional observability workflow (metrics -> alerts -> dashboards -> traces -> logs) to a log-centric approach where AI proactively processes logs to create actionable insights.
* **Future Direction:** The article highlights the potential of **Large Language Models (LLMs)** to further automate observability, including generating automated runbooks and playbooks for remediation. LLMs could also help address the shortage of skilled SREs by augmenting their expertise.
* **Integration:** Streams is integrated into Elastic Observability.
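The parsing step is easiest to picture with a small example. Below is a minimal Python sketch, not Elastic's implementation, of the kind of field extraction Streams automates: turning a raw log line into structured fields that downstream anomaly detection can aggregate over. The log format and field names are illustrative assumptions.

```python
import re

# A raw Kubernetes-style log line (illustrative format, not a fixed Elastic schema).
raw = '2024-06-01T12:03:55Z ERROR checkout-svc pod=checkout-7d9f request_id=abc123 msg="payment timeout after 30s"'

# Streams infers patterns like this automatically; here we hand-write one regex
# with named groups to show what "parsing & partitioning" produces.
pattern = re.compile(
    r'(?P<timestamp>\S+)\s+(?P<level>\w+)\s+(?P<service>\S+)\s+'
    r'pod=(?P<pod>\S+)\s+request_id=(?P<request_id>\S+)\s+msg="(?P<message>[^"]*)"'
)

match = pattern.match(raw)
if match:
    event = match.groupdict()
    # Structured output: each field is now queryable and aggregatable,
    # e.g. counting ERROR events per service to surface anomalies early.
    print(event)
```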
This article details how Nubank built its own in-house logging platform to address issues of cost, scalability, and control over their logging infrastructure. Initially reliant on a vendor solution, they found costs rising unpredictably and experienced limitations in observability and data retention.
To solve this, Nubank divided the project into two major steps: **The Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).
* **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
* **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.
The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with a 45-day retention, and handles almost 15,000 queries daily. Nubank reports the platform costs 50% less than comparable market solutions while providing them with greater control, scalability, and the ability to customize features. The project underscored Nubank's value of challenging the status quo and leveraging a combination of open-source and in-house development.
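To make the query side concrete, here is a short sketch using the Trino Python client to run an ad-hoc search over Parquet log data in S3. The host, catalog, and table/column names are assumptions for illustration, not Nubank's actual schema.

```python
import trino

# Connect to a Trino coordinator (host/catalog/schema are illustrative placeholders).
conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=8080,
    user="log-analyst",
    catalog="hive",   # Hive-compatible catalog over Parquet files in S3
    schema="logs",
)

cur = conn.cursor()
# Typical needle-in-a-haystack query: errors for one service in a time window.
# Filtering on a date partition column keeps the S3/Parquet scan cheap.
cur.execute("""
    SELECT log_timestamp, pod_name, message
    FROM app_logs
    WHERE log_date = DATE '2024-06-01'
      AND service = 'payments'
      AND level = 'ERROR'
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)
```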
This article explores how prompt engineering can be used to improve time-series analysis with Large Language Models (LLMs), covering core strategies, preprocessing, anomaly detection, and feature engineering. It provides practical prompts and examples for various tasks.
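As a concrete illustration of the preprocessing-plus-prompting pattern the article describes, here is a minimal Python sketch; the sample series, downsampling choices, and prompt wording are my own, not taken from the article.

```python
# Minimal sketch of prompt construction for LLM-based anomaly detection.
readings = [101, 99, 102, 100, 98, 103, 240, 101, 99, 100]  # hourly latency (ms)

# Preprocess: label each point and keep the series compact so it fits the
# context window; LLMs handle short, explicitly-indexed series far better
# than raw numeric dumps.
series = ", ".join(f"t{i}={v}ms" for i, v in enumerate(readings))

prompt = f"""You are a time-series analyst.
Hourly p95 latency readings: {series}
1. Identify any anomalous points and their indices.
2. Briefly explain why each point is anomalous relative to the baseline.
3. Suggest one plausible cause worth investigating."""

print(prompt)  # Send this to any chat-completion API of your choice.
```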
TraceRoot.AI is an AI-native observability platform that helps developers fix production bugs faster by analyzing structured logs and traces. It offers SDK integration, AI agents for root cause analysis, and a platform for comprehensive visualizations.
TraceRoot accelerates the debugging process with AI-powered insights. It integrates seamlessly into your development workflow, providing real-time trace and log analysis, code context understanding, and intelligent assistance. It offers both a cloud and self-hosted version, with SDKs available for Python and JavaScript/TypeScript.
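As a generic illustration (deliberately not the TraceRoot SDK, whose API is documented in the project itself), here is the shape of input such a platform consumes: structured JSON logs carrying a trace ID, which lets an AI agent correlate log lines with the spans they belong to.

```python
import json
import logging
import uuid

# Generic illustration (not the TraceRoot SDK): emit structured JSON logs
# that carry a trace ID, the kind of data an AI debugger can correlate.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

trace_id = uuid.uuid4().hex  # one ID per request ties logs and spans together
logger.info("charge failed: card declined", extra={"trace_id": trace_id})
```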
The article discusses the emergence of 'agentic traffic' (outbound API calls made by autonomous AI agents) and the need for a new infrastructure layer, an 'AI Gateway', to govern and secure this traffic. It outlines the components of an AI Gateway and the importance of security, compliance, and observability in managing agentic AI.
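To ground the idea, here is a minimal Python sketch of one AI Gateway responsibility, policy enforcement plus audit logging on outbound agent calls. The policy shape, agent names, and field names are assumptions for illustration, not taken from the article.

```python
import json
import time
from urllib.parse import urlparse

# Illustrative allowlist policy: which hosts each agent may call.
POLICY = {
    "billing-agent": {"allowed_hosts": {"api.stripe.com"}},
}
AUDIT_LOG = []

def gateway_check(agent_id: str, url: str) -> bool:
    """Return True if the agent's outbound call passes policy; audit either way."""
    host = urlparse(url).netloc
    policy = POLICY.get(agent_id)
    allowed = policy is not None and host in policy["allowed_hosts"]
    # Every decision is recorded: this audit trail is the observability
    # and compliance layer the article calls for.
    AUDIT_LOG.append({"ts": time.time(), "agent": agent_id, "host": host, "allowed": allowed})
    return allowed

print(gateway_check("billing-agent", "https://api.stripe.com/v1/charges"))  # True
print(gateway_check("billing-agent", "https://evil.example.com/exfil"))     # False
print(json.dumps(AUDIT_LOG, indent=2))
```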
The company's transition from fragmented observability tools to a unified system using OpenTelemetry and OneUptime dramatically improved incident response times, reducing MTTR from 41 to 9 minutes. By correlating logs, metrics, and traces through structured logging and intelligent sampling, they eliminated much of the noise and confusion that previously slowed root cause analysis. The shift also reduced the number of dashboards engineers needed to check per incident and significantly lowered the percentage of incidents with unknown causes.
Key practices included instrumenting once with OpenTelemetry, enforcing cardinality limits, and archiving raw data for future analysis. The move away from 100% trace capture and over-instrumentation helped manage data volume while maintaining visibility into anomalies. This transformation emphasized that effective observability isn't about collecting more data, but about designing correlated signals that support intentional diagnosis and reduce cognitive load.
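A minimal sketch of the "instrument once" plus head-sampling pattern using the OpenTelemetry Python SDK; the service name, 10% sampling ratio, and attribute values are illustrative, not the company's actual configuration.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Instrument once: a single provider with head sampling keeps trace volume
# manageable (the article describes moving away from 100% capture).
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout"}),
    sampler=ParentBased(TraceIdRatioBased(0.10)),  # keep ~10% of root traces
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle-checkout") as span:
    # Low-cardinality attribute (route template, not the raw URL) respects
    # the cardinality limits the article recommends enforcing.
    span.set_attribute("http.route", "/checkout/{id}")
```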
**Experiment Goal:** Determine if LLMs can autonomously perform root cause analysis (RCA) on live application telemetry.
Five LLMs were given access to OpenTelemetry data from a demo application:
* They were prompted with a naive instruction: "Identify the issue, root cause, and suggest solutions."
* Four distinct anomalies were used, each with a known root cause established through manual investigation.
* Performance was measured by: accuracy, guidance required, token usage, and investigation time.
* Models: Claude Sonnet 4, OpenAI o3, OpenAI GPT-4.1, Gemini 2.5 Pro
* **Autonomous RCA is not yet reliable.** The LLMs generally fell short of replacing SREs, and the write-up suggests that even newer models such as GPT-5 (not explicitly tested) would be unlikely to change that conclusion.
* **LLMs are useful as assistants.** They can help summarize findings, draft updates, and suggest next steps.
* **A fast, searchable observability stack (like ClickStack) is crucial.** LLMs need access to good data to be effective.
* **Models varied in performance:**
* Claude Sonnet 4 and OpenAI o3 were the most successful, often identifying the root cause with minimal guidance.
* GPT-4.1 and Gemini 2.5 Pro required more prompting and struggled to query data independently.
* **Models can get stuck in reasoning loops.** They may focus on one aspect of the problem and miss other important clues.
* **Token usage and cost varied significantly.**
**Specific Anomaly Results (briefly):**
* **Anomaly 1 (Payment Failure):** Claude Sonnet 4 and OpenAI o3 solved it on the first prompt. GPT-4.1 and Gemini 2.5 Pro needed guidance.
* **Anomaly 2 (Recommendation Cache Leak):** Claude Sonnet 4 identified the service restart issue but missed the cache problem initially. OpenAI o3 identified the memory leak. GPT-4.1 and Gemini 2.5 Pro struggled.
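For flavor, here is a minimal sketch of the naive-prompt setup using the OpenAI Python client. The telemetry excerpt and model choice are placeholders, and the article's actual harness (including how models queried the data themselves) is not reproduced here.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder telemetry excerpt; in the experiment, the models queried live
# OpenTelemetry data rather than receiving a pre-digested summary like this.
telemetry = """
service=payment error_rate=14% (baseline 0.2%), spike started 14:05 UTC
top error: charge failed: upstream timeout (card-processor)
"""

# The naive instruction quoted in the write-up, applied to the telemetry.
resp = client.chat.completions.create(
    model="gpt-4.1",  # one of the tested models
    messages=[
        {"role": "system", "content": "You are an SRE performing root cause analysis."},
        {"role": "user", "content": f"{telemetry}\n\nIdentify the issue, root cause, and suggest solutions."},
    ],
)
print(resp.choices[0].message.content)
```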
Real-time observability and analytics platform for local LLMs, with dashboard and API.