klotz: data engineering* + production engineering*


  1. This article details how Nubank built its own in-house logging platform to address cost, scalability, and control issues in its logging infrastructure. Initially reliant on a vendor solution, Nubank saw costs rise unpredictably and ran into limits on observability and data retention.

    To solve this, Nubank divided the project into two major steps: **The Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).

    * **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
    * **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.

    The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with a 45-day retention, and handles almost 15,000 queries daily. Nubank reports the platform costs 50% less than comparable market solutions while providing greater control, scalability, and the ability to customize features. The project underscored Nubank's commitment to challenging the status quo and combining open-source components with in-house development.
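    The Data Buffer Service's micro-batching step can be sketched roughly as follows. This is a minimal illustration, not Nubank's actual code: class and parameter names are hypothetical, and a real service would also flush on a background timer rather than only on arrival of new records.

    ```python
    import time
    from typing import Callable, List, Optional

    class MicroBatcher:
        """Buffer incoming log records and flush them in micro-batches,
        either when the batch is full or when a time window elapses."""

        def __init__(self, flush: Callable[[List[dict]], None],
                     max_records: int = 500, max_wait_s: float = 2.0):
            self.flush = flush                # downstream sink, e.g. the Filter & Process step
            self.max_records = max_records    # size-based flush threshold
            self.max_wait_s = max_wait_s      # latency-based flush threshold
            self._buf: List[dict] = []
            self._first_ts: Optional[float] = None

        def add(self, record: dict) -> None:
            """Accept one record; flush if either threshold is reached."""
            if self._first_ts is None:
                self._first_ts = time.monotonic()
            self._buf.append(record)
            if (len(self._buf) >= self.max_records
                    or time.monotonic() - self._first_ts >= self.max_wait_s):
                self._flush_now()

        def _flush_now(self) -> None:
            if self._buf:
                self.flush(self._buf)
                self._buf = []
                self._first_ts = None
    ```

    Batching like this trades a little latency for far fewer, larger writes downstream, which is what makes columnar storage in Parquet on S3 economical at this ingest volume.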
  2. An article discussing the role of data orchestrators in managing complex data workflows, their evolution, and various tools available for orchestration.
  3. Keboola MCP Server enables AI-powered data pipeline creation and management. It allows users to build, ship, and govern data workflows using natural language and AI assistants, integrating with tools like Claude and Cursor. It's free to use, with costs based on standard Keboola usage.
  4. Amazon S3 Batch Operations lets you process millions or even billions of S3 objects efficiently. You can copy objects, set tags, restore objects from Glacier, or invoke an AWS Lambda function on each object, all without writing custom iteration code.
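    A Batch Operations job reads its target objects from a manifest, either an S3 Inventory report or a headerless CSV of `bucket,key` rows. A minimal sketch of generating such a CSV manifest (bucket and key names here are hypothetical):

    ```python
    import csv
    import io
    from typing import List

    def build_manifest(bucket: str, keys: List[str]) -> str:
        """Build an S3 Batch Operations CSV manifest: one 'bucket,key'
        row per target object, with no header row."""
        out = io.StringIO()
        writer = csv.writer(out)
        for key in keys:
            writer.writerow([bucket, key])
        return out.getvalue()

    manifest = build_manifest("my-bucket", ["logs/2024/a.gz", "logs/2024/b.gz"])
    ```

    You would then upload the manifest to S3 and reference its location (and ETag) when creating the job, for example via the `create_job` call of the boto3 S3 Control client.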
  5. Arize Phoenix is an open-source observability library for AI experimentation, evaluation, and troubleshooting, built by Arize AI.
  6. An article on building an AI agent to interact with Apache Airflow using PydanticAI and Gemini 2.0, providing a structured and reliable method for managing DAGs through natural language queries.

    - Agent interacts with Apache Airflow via the Airflow REST API.
    - Agent can understand natural language queries about workflows, fetch real-time status updates, and return structured data.
    - Sample DAGs are implemented for demonstration purposes.
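    The "structured data" step of such an agent can be sketched as parsing the JSON body of Airflow's stable REST API (`GET /api/v1/dags/{dag_id}/dagRuns`) into typed records. This is an illustrative sketch only; the `DagRunStatus` name is hypothetical, and the PydanticAI and Gemini wiring from the article is omitted:

    ```python
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class DagRunStatus:
        """One DAG run, reduced to the fields an agent reasons over."""
        dag_id: str
        run_id: str
        state: str

    def parse_dag_runs(payload: dict) -> List[DagRunStatus]:
        """Convert the JSON body returned by Airflow's
        GET /api/v1/dags/{dag_id}/dagRuns endpoint into typed records."""
        return [
            DagRunStatus(run["dag_id"], run["dag_run_id"], run["state"])
            for run in payload.get("dag_runs", [])
        ]

    # Example payload shaped like an Airflow 2.x REST API response:
    sample = {"dag_runs": [
        {"dag_id": "etl_daily", "dag_run_id": "scheduled__2024-01-01", "state": "success"},
        {"dag_id": "etl_daily", "dag_run_id": "scheduled__2024-01-02", "state": "running"},
    ]}
    runs = parse_dag_runs(sample)
    ```

    Returning typed records rather than raw JSON is what lets the agent give reliable, structured answers to natural-language questions about workflow state.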
  7. Breser stands for Business Rules & Expression Syntax for Easy Retrieval. It is a powerful and flexible query language designed for efficient log processing and structured data filtering.
  8. How to ensure data quality and integrity using open-source tools for observability in data pipelines.
  9. Data pipelines are essential for connecting data across systems and platforms. This article provides a deep dive into how data pipelines are implemented, their use cases, and how they're evolving with generative AI.
  10. A guide to tracking in MLOps, covering code, data, and machine learning model tracking.



Propulsed by SemanticScuttle