klotz: parquet*


  1. This article details how Nubank built an in-house logging platform to address cost, scalability, and control issues with its logging infrastructure. Initially reliant on a vendor solution, the company saw costs rise unpredictably and ran into limits on observability and data retention.

    To solve this, Nubank divided the project into two major steps: **The Observability Stream** (ingestion and processing) and the **Query & Log Platform** (storage and querying).

    * **Observability Stream:** Fluent Bit for data collection, a Data Buffer Service for micro-batching, and an in-house Filter & Process Service.
    * **Query & Log Platform:** Trino as the query engine, AWS S3 for storage, and Parquet for data format.

    The new platform currently ingests 1 trillion logs daily, stores 45 PB of searchable data with 45-day retention, and handles almost 15,000 queries daily. Nubank reports the platform costs 50% less than comparable market solutions while giving it greater control, scalability, and the ability to customize features. The project reflects Nubank's stated value of challenging the status quo by combining open-source components with in-house development.
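    As a rough sketch of what querying such a platform might look like, here is a hypothetical Trino query over a Parquet-backed log table in S3 using the `trino` Python client (the host, catalog, schema, table, and column names are illustrative, not from the article):

        import trino

        # Connect to a Trino coordinator (connection details are hypothetical)
        conn = trino.dbapi.connect(
            host="trino.internal", port=8080, user="analyst",
            catalog="hive", schema="logs",
        )
        cur = conn.cursor()

        # Aggregate error counts per service from a Parquet-backed table in S3
        cur.execute(
            "SELECT service, count(*) AS errors "
            "FROM app_logs "
            "WHERE log_date = DATE '2024-01-15' AND level = 'ERROR' "
            "GROUP BY service ORDER BY errors DESC LIMIT 10"
        )
        for service, errors in cur.fetchall():
            print(service, errors)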
  2. This page details the command-line utility for the Embedding Atlas, a tool for exploring large text datasets with metadata. It covers installation, data loading (local and Hugging Face), visualization of embeddings using SentenceTransformers and UMAP, and usage instructions with available options.
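    The embed-and-project step the tool automates can be sketched directly with the libraries it names; a minimal example, not the tool's own code (model name and texts are illustrative):

        import umap
        from sentence_transformers import SentenceTransformer

        texts = ["a note about payments", "a note about auth", "a note on retries"]

        # Embed the texts with a sentence-transformer model (384-dim vectors)
        model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = model.encode(texts)

        # Project the embeddings to 2-D with UMAP for plotting
        projection = umap.UMAP(n_components=2, n_neighbors=2).fit_transform(embeddings)
        print(projection.shape)  # (len(texts), 2)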
  3. This video course introduces DuckDB, an open-source database for data analytics in Python. It covers creating databases from files (Parquet, CSV, JSON), querying with SQL and the Python API, concurrent access, and integration with pandas and Polars.
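    For example, DuckDB's Python API can query a Parquet file with plain SQL and hand the result to pandas (file and column names are illustrative):

        import duckdb

        # Query a Parquet file directly with SQL; no import step needed
        df = duckdb.sql("""
            SELECT category, avg(price) AS avg_price
            FROM 'sales.parquet'
            GROUP BY category
        """).df()  # materialize the result as a pandas DataFrame

        print(df.head())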
  4. A deep dive into the structure and performance benefits of Parquet files, including columnar storage, partitioning strategies, and row groups.
    2025-03-14 by klotz
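    The row-group structure the article describes can be inspected directly with pyarrow (file name illustrative):

        import pyarrow.parquet as pq

        pf = pq.ParquetFile("data.parquet")
        meta = pf.metadata

        # A Parquet file is split into row groups, each storing every column
        # as a separate compressed, encoded column chunk
        print("row groups:", meta.num_row_groups)
        for i in range(meta.num_row_groups):
            rg = meta.row_group(i)
            print(f"row group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")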
  5. PyStore is a simple (yet powerful) datastore for Pandas dataframes, designed with storing timeseries data in mind. It leverages Pandas, Numpy, Dask, and Parquet (via pyarrow) for efficient data handling.
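    A minimal usage sketch based on PyStore's documented API (store, collection, and item names are illustrative):

        import pandas as pd
        import pystore

        pystore.set_path("~/pystore")            # where datastores live on disk
        store = pystore.store("mydatastore")     # create/open a datastore
        collection = store.collection("NASDAQ")  # create/open a collection

        # Write a timeseries DataFrame as a Parquet-backed item
        df = pd.DataFrame({"close": [101.2, 102.5, 99.8]},
                          index=pd.date_range("2024-01-01", periods=3))
        collection.write("AAPL", df, metadata={"source": "example"})

        # Read it back as pandas
        out = collection.item("AAPL").to_pandas()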
  6. // Write usersDF as ORC with a Bloom filter on favorite_color,
    // a dictionary-encoding threshold, and direct encoding for "name"
    usersDF.write.format("orc")
      .option("orc.bloom.filter.columns", "favorite_color")
      .option("orc.dictionary.key.threshold", "1.0")
      .option("orc.column.encoding.direct", "name")
      .save("users_with_options.orc")
    Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo
    2021-12-01 by klotz
  7. // set Parquet file block size and page size values
    int blockSize = 256 * 1024 * 1024;  // 256 MB row group (block) size
    int pageSize = 64 * 1024;           // 64 KB page size

    // apply them to a MapReduce job via ParquetOutputFormat (sketch)
    Job job = Job.getInstance();
    ParquetOutputFormat.setBlockSize(job, blockSize);
    ParquetOutputFormat.setPageSize(job, pageSize);
     
    2017-02-15 by klotz
  8. 2017-02-13 by klotz
  9. 2014-02-19 by klotz
  10. 2014-02-19 by klotz
