This article details how Nubank built its own in-house logging platform to address the cost, scalability, and control problems of its logging infrastructure. Initially reliant on a vendor solution, the company found costs rising unpredictably and ran into limitations in observability and data retention.
To solve this, Nubank divided the project into two major steps: the **Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).
* **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
* **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.
The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with a 45-day retention, and handles almost 15,000 queries daily. Nubank reports that the platform costs 50% less than comparable market solutions while giving the company greater control, scalability, and the ability to customize features. The project underscored Nubank's value of challenging the status quo and combining open-source components with in-house development.
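The micro-batching role attributed to the Data Buffer Service can be illustrated with a minimal sketch. This is not Nubank's implementation: the `MicroBatcher` class, its thresholds, and the `flush` callback are all hypothetical names chosen for illustration. The idea is only that records accumulate in memory and are handed downstream once a size or age limit is reached.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Hypothetical buffer that groups records into small batches."""

    def __init__(
        self,
        flush: Callable[[List[str]], None],
        max_records: int = 3,
        max_age_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush              # downstream sink (assumed)
        self._max_records = max_records  # size threshold (assumed)
        self._max_age = max_age_seconds  # age threshold (assumed)
        self._clock = clock
        self._buffer: List[str] = []
        self._first_at: Optional[float] = None

    def add(self, record: str) -> None:
        # Remember when the current batch started so it can age out.
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(record)
        if self._should_flush():
            self.flush()

    def _should_flush(self) -> bool:
        too_many = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._first_at >= self._max_age
        return too_many or too_old

    def flush(self) -> None:
        # Hand the current batch downstream and start a new one.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._first_at = None
            self._flush(batch)


batches: List[List[str]] = []
batcher = MicroBatcher(flush=batches.append, max_records=3, max_age_seconds=60.0)
for i in range(7):
    batcher.add(f"log line {i}")
batcher.flush()  # drain whatever is left at shutdown
```

In this sketch the age check only runs when a new record arrives; a real service would likely also flush on a timer so that a quiet stream does not hold records indefinitely.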
The Model Context Protocol (MCP) is a new open protocol that allows AI models to interact with external systems in a standardized, extensible way. In this tutorial, you’ll install MCP, explore its client-server architecture, and work with its core concepts: prompts, resources, and tools.
Google has introduced LangExtract, an open-source Python library designed to help developers extract structured information from unstructured text using large language models such as the Gemini models. The library simplifies the process of converting free-form text into structured data, offering features like controlled generation, text chunking, parallel processing, and integration with various LLMs.
An article discussing the role of data orchestrators in managing complex data workflows, their evolution, and various tools available for orchestration.
This article is part 4 of a crash course on the Model Context Protocol (MCP). It focuses on resources and prompts, explaining their mechanics, distinctions, and implementation, and how they differ from tools. It covers resource types, discovery mechanisms, and application-controlled access patterns.
Keboola MCP Server enables AI-powered data pipeline creation and management. It allows users to build, ship, and govern data workflows using natural language and AI assistants, integrating with tools like Claude and Cursor. The server itself is free to use; costs are based on standard Keboola usage.
Apache Spark 4.0 marks a major milestone, with enhancements to the SQL language, Spark Connect, reliability, Python capabilities, and structured streaming. It's designed to be more powerful, ANSI-compliant, and user-friendly while maintaining compatibility.
The article discusses how Visa leverages retrieval-augmented generation (RAG) and deep learning to enhance operations. It describes Visa's 'Secure ChatGPT,' which offers a multi-model interface for secure internal use, and how RAG improves policy-related data retrieval. The article also explores Visa's data infrastructure and AI's role in fraud prevention.
This article describes a workflow using Large Language Models (LLMs) to automate the process of normalising spreadsheet data, making it tidy and machine-readable for easier analysis and insights.
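To make "tidy" concrete, here is a minimal sketch of the kind of reshaping such a workflow aims for: a wide sheet with one column per year becomes one row per observation. It covers only the target transformation, not the LLM step the article describes, and the sample data, the `melt` helper, and the column names are assumptions made for illustration.

```python
import csv
import io

# Hypothetical wide-format sheet: one column per year.
wide_csv = """region,2022,2023
North,10,12
South,7,9
"""


def melt(rows, id_column, var_name, value_name):
    """Turn each non-id column into its own (id, variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {id_column: row[id_column], var_name: column, value_name: int(value)}
            )
    return tidy


rows = list(csv.DictReader(io.StringIO(wide_csv)))
tidy_rows = melt(rows, id_column="region", var_name="year", value_name="sales")

for r in tidy_rows:
    print(r)
```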
A deep dive into the structure and performance benefits of Parquet files, including columnar storage, partitioning strategies, and row groups.
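The interplay of row groups and columnar storage can be pictured with a small pure-Python model. This is a conceptual sketch, not the Parquet file format or any library's API: the `to_row_groups` and `scan_ts_range` helpers and the sample records are hypothetical. It shows rows split into groups, each group stored column by column with min/max statistics, so that a range filter can skip whole groups.

```python
# Hypothetical log records.
rows = [{"ts": i, "level": "ERROR" if i % 4 == 0 else "INFO"} for i in range(8)]


def to_row_groups(rows, group_size):
    """Split rows into groups and store each group column by column."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        # Per-group statistics let a reader skip groups that cannot match.
        stats = {"ts_min": min(columns["ts"]), "ts_max": max(columns["ts"])}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan_ts_range(groups, lo, hi):
    """Return matching ts values and how many groups were actually read."""
    hits, scanned = [], 0
    for group in groups:
        if group["stats"]["ts_max"] < lo or group["stats"]["ts_min"] > hi:
            continue  # pruned using statistics alone
        scanned += 1
        # Only the 'ts' column is touched; 'level' is never read.
        hits.extend(t for t in group["columns"]["ts"] if lo <= t <= hi)
    return hits, scanned


groups = to_row_groups(rows, group_size=4)
hits, scanned = scan_ts_range(groups, lo=5, hi=6)
```

Here the filter reads one of the two groups and only the `ts` column within it, which is the intuition behind both row-group pruning and column projection.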