SemanticScuttle - klotz.me

How Nubank Built its in-house log platform

This article details how Nubank built an in-house logging platform to regain control over the cost and scalability of its logging infrastructure. While relying on a vendor solution, the company saw costs rise unpredictably and ran into limits on observability and data retention.

To solve this, Nubank divided the project into two major steps: **The Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).

* **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
* **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.

The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with a 45-day retention, and handles almost 15,000 queries daily. Nubank reports the platform costs 50% less than comparable market solutions while providing them with greater control, scalability, and the ability to customize features. The project underscored Nubank's value of challenging the status quo and leveraging a combination of open-source and in-house development.
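
The micro-batching role attributed to the Data Buffer Service above can be illustrated with a minimal sketch. This is not Nubank's implementation: the `MicroBatcher` class, its `max_records` threshold, and the in-memory `batches` sink are all hypothetical names chosen for illustration, and a real service would presumably also flush on a timer and write to durable storage.

```python
from typing import Callable, List


class MicroBatcher:
    """Hypothetical helper: buffers records and hands them off in small batches."""

    def __init__(self, flush: Callable[[List[str]], None], max_records: int = 3):
        self._flush = flush              # called with each completed batch
        self._max_records = max_records  # assumed size-based flush threshold
        self._buffer: List[str] = []

    def add(self, record: str) -> None:
        """Buffer one record and flush once the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered records to the sink, if there are any."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush(batch)


# Usage: collect seven log lines into batches of at most three.
batches: List[List[str]] = []
batcher = MicroBatcher(batches.append, max_records=3)
for i in range(7):
    batcher.add(f"log line {i}")
batcher.flush()  # drain the partial final batch
```

Batching trades a little latency for fewer, larger writes, which generally suits columnar formats such as Parquet on object storage.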

2025-10-28 Tags: logging, nubank, observability, trino, aws s3, parquet, data, data engineering, production engineering, observability bus by klotz

13 clever APIs for capturing every kind of data

APIs let you get at fascinating and useful treasure troves of data. Here’s a look at the wide world of APIs for finding and manipulating data in your applications.

2025-10-11 Tags: api, data, content by klotz

Command Line Utility | Embedding Atlas

This page details the command-line utility for the Embedding Atlas, a tool for exploring large text datasets with metadata. It covers installation, data loading (local and Hugging Face), visualization of embeddings using SentenceTransformers and UMAP, and usage instructions with available options.

2025-08-13 Tags: embedding, text, data, visualization, umap, sentence transformers, command line, hugging face, parquet, duckdb by klotz

Website-Crawler

Extract data from websites in LLM-ready JSON or CSV format. Crawl or scrape an entire website with Website Crawler.

2025-09-05 Tags: json, crawler, data, scraper, github, java, llm by klotz

Data Visualization And Aggregation: Time Series Databases, Grafana And More

An article discussing the importance of time series databases and data visualization tools like Grafana for managing and interpreting streams of data in various applications.

The author mentions several time series databases (TSDs) and visualization tools, focusing on their features, advantages, and some limitations. The article also provides an example of a Building Management and Control (BMaC) project that uses InfluxDB and Grafana for data visualization.

| Database | Description | Notable Features |
|-------------------|-------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
| InfluxDB | Partially open source, with version 3 being an edge data collector. | Shard-based storage, compaction levels, time series index, optional retention. |
| Apache Kudu | Column-based database optimized for multidimensional OLAP workloads. | Part of the Apache Hadoop ecosystem. |
| Prometheus | Developed at SoundCloud for metrics monitoring. | Written in Go, similar to InfluxDB v1 and v2. |
| RRDTool | All-in-one package with a circular buffer TSD that also does graphing. | Language bindings for various programming languages. |
| Graphite | Similar to RRDTool but uses a Django web-based application to render graphs. | Web-based graphing. |
| TimescaleDB | Extends PostgreSQL, supporting typical SQL queries with TSD functionality and optimizations. | Supports all typical SQL queries. |

The article also discusses Grafana as a popular tool for creating dashboards to visualize time series data, mentioning its compatibility with multiple TSDs and SQL databases. It concludes by highlighting the importance of understanding one's specific needs before choosing a TSD and visualization solution.
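
The circular-buffer idea mentioned for RRDTool in the table can be sketched in a few lines. This is only an illustration of fixed-size, overwrite-the-oldest storage, not RRDTool's actual file format or API; `RoundRobinSeries` and its methods are hypothetical names.

```python
from collections import deque
from typing import List, Tuple


class RoundRobinSeries:
    """Hypothetical fixed-size time series: new points overwrite the oldest."""

    def __init__(self, slots: int):
        # deque with maxlen silently drops the oldest item when full
        self._points: deque = deque(maxlen=slots)

    def record(self, timestamp: int, value: float) -> None:
        """Store one (timestamp, value) sample."""
        self._points.append((timestamp, value))

    def points(self) -> List[Tuple[int, float]]:
        """Return the retained samples, oldest first."""
        return list(self._points)

    def average(self) -> float:
        """Mean of the retained values."""
        if not self._points:
            raise ValueError("no samples recorded")
        return sum(value for _, value in self._points) / len(self._points)


# Usage: five samples into three slots keeps only the last three.
series = RoundRobinSeries(slots=3)
for t, v in enumerate([10.0, 20.0, 30.0, 40.0, 50.0]):
    series.record(t, v)
```

Because the buffer never grows, storage use stays constant no matter how long data is collected, at the price of losing older samples.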

2025-06-30 Tags: hackaday, time series, database, grafana, data, visualization, influxdb, prometheus, rrdtool, graphite, timescaledb, splunkon by klotz

Starting With DuckDB and Python (Overview)

This video course introduces DuckDB, an open-source database for data analytics in Python. It covers creating databases from files (Parquet, CSV, JSON), querying with SQL and the Python API, concurrent access, and integration with pandas and Polars.

2025-06-25 Tags: duckdb, python, database, olap, sql, pandas, polars, data, analytics, csv, json, parquet by klotz

Building A Modern Dashboard with Python and Taipy

A guide to building a front-end data application using Taipy, comparing it to Streamlit and Gradio, and providing a step-by-step implementation of a sales performance dashboard.

2025-06-24 Tags: data science, data, visualization, python, taipy, dashboard, streamlit, gradio, shrunk, hallux by klotz

Orchestration

An article discussing the role of data orchestrators in managing complex data workflows, their evolution, and various tools available for orchestration.
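
At its core, an orchestrator runs tasks in an order that respects their dependencies. The sketch below shows only that core idea using Python's standard-library `graphlib`; the task names and the `run_pipeline` helper are hypothetical, and real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline: Dict[str, Set[str]] = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def run_pipeline(dag: Dict[str, Set[str]]) -> List[str]:
    """Run every task after its dependencies and return the execution order."""
    executed: List[str] = []
    for task in TopologicalSorter(dag).static_order():
        # A real orchestrator would invoke the task here; this sketch only logs it.
        executed.append(task)
    return executed


order = run_pipeline(pipeline)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies form a cycle, which is the same class of error an orchestrator has to reject before running anything.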

2025-06-21 Tags: data, orchestration, dagster, prefect, airflow, data pipelines, ssp.sh, data engineering, production engineering by klotz

Keboola MCP Server: Build production-grade data pipelines with just a prompt

Keboola MCP Server enables AI-powered data pipeline creation and management. It allows users to build, ship, and govern data workflows using natural language and AI assistants, integrating with tools like Claude and Cursor. The server itself is free to use; costs are based on standard Keboola usage.

2025-06-14 Tags: data, pipeline, llm, data engineering, mcp, keboola, automation, etl, production engineering by klotz

An anomaly detection framework anyone can use

PhD student Sarah Alnegheimish is developing Orion, an open-source, user-friendly machine learning framework for detecting anomalies in large-scale industrial and operational settings. She focuses on making machine learning systems accessible, transparent, and trustworthy, and is exploring repurposing pre-trained models for anomaly detection.

2025-05-29 Tags: emacs, machine learning, anomaly detection, open-source, orion, artificial intelligence, data, systems design, algorithms, sarah alnegheimish, mit by klotz

