klotz: data*

0 bookmark(s) - Sort by: Date ↓ / Title / - Bookmarks from other users for this tag

  1. This article details how Nubank built its own in-house logging platform to address issues of cost, scalability, and control over their logging infrastructure. Initially reliant on a vendor solution, they found costs rising unpredictably and experienced limitations in observability and data retention.

    To solve this, Nubank divided the project into two major steps: **The Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).

    * **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
    * **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.

    The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with a 45-day retention, and handles almost 15,000 queries daily. Nubank reports the platform costs 50% less than comparable market solutions while providing them with greater control, scalability, and the ability to customize features. The project underscored Nubank's value of challenging the status quo and leveraging a combination of open-source and in-house development.
  2. APIs let you get at fascinating and useful treasure troves of data. Here’s a look at the wide world of APIs for finding and manipulating data in your applications.
    2025-10-11 Tags: , , by klotz
  3. This page details the command-line utility for the Embedding Atlas, a tool for exploring large text datasets with metadata. It covers installation, data loading (local and Hugging Face), visualization of embeddings using SentenceTransformers and UMAP, and usage instructions with available options.
  4. Extract data from websites in LLM ready JSON or CSV format. Crawl or Scrape entire website with Website Crawler
    2025-09-05 Tags: , , , , , , by klotz
  5. An article discussing the importance of time series databases and data visualization tools like Grafana for managing and interpreting streams of data in various applications.

    The author mentions several time series databases (TSDs) and visualization tools, focusing on their features, advantages, and some limitations. The article also provides an example of a Building Management and Control (BMaC) project that uses InfluxDB and Grafana for data visualization.

    | Database | Description | Notable Features |
    |-------------------|-------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------|
    | InfluxDB | Partially open source, with version 3 being an edge data collector. | Shard-based storage, compaction levels, time series index, optional retention. |
    | Apache Kudu | Column-based database optimized for multidimensional OLAP workloads. | Part of the Apache Hadoop ecosystem. |
    | Prometheus | Developed at SoundCloud for metrics monitoring. | Written in Go, similar to InfluxDB v1 and v2. |
    | RRDTool | All-in-one package with a circular buffer TSD that also does graphing. | Language bindings for various programming languages. |
    | Graphite | Similar to RRDTool but uses a Django web-based application to render graphs. | Web-based graphing. |
    | TimescaleDB | Extends PostgreSQL, supporting typical SQL queries with TSD functionality and optimizations. | Supports all typical SQL queries. |

    The article also discusses Grafana as a popular tool for creating dashboards to visualize time series data, mentioning its compatibility with multiple TSDs and SQL databases. It concludes by highlighting the importance of understanding one's specific needs before choosing a TSD and visualization solution.
  6. This video course introduces DuckDB, an open-source database for data analytics in Python. It covers creating databases from files (Parquet, CSV, JSON), querying with SQL and the Python API, concurrent access, and integration with pandas and Polars.
  7. A guide to building a front-end data application using Taipy, comparing it to Streamlit and Gradio, and providing a step-by-step implementation of a sales performance dashboard.
  8. An article discussing the role of data orchestrators in managing complex data workflows, their evolution, and various tools available for orchestration.
  9. Keboola MCP Server enables AI-powered data pipeline creation and management. It allows users to build, ship, and govern data workflows using natural language and AI assistants, integrating with tools like Claude and Cursor. It's free to use, with costs based on standard Keboola usage.
  10. PhD student Sarah Alnegheimish is developing Orion, an open-source, user-friendly machine learning framework for detecting anomalies in large-scale industrial and operational settings. She focuses on making machine learning systems accessible, transparent, and trustworthy, and is exploring repurposing pre-trained models for anomaly detection.

Top of the page

First / Previous / Next / Last / Page 1 of 0 SemanticScuttle - klotz.me: Tags: data

About - Propulsed by SemanticScuttle