A deep dive into the structure and performance benefits of Parquet files, including columnar storage, partitioning strategies, and row groups.
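As a rough illustration of why row groups and columnar storage help, the sketch below builds a toy in-memory model: rows are split into row groups, each column is stored as a separate chunk, and each chunk records its min/max values so that a scan can skip whole groups. This is not Parquet's actual on-disk format or any library's API. The names `ColumnChunk`, `build_row_groups`, and `scan` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """One column's values within a single row group, plus min/max statistics."""
    values: list
    min_value: object
    max_value: object


def build_row_groups(rows, row_group_size):
    """Split row-oriented records into row groups, storing each column separately."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        batch = rows[start:start + row_group_size]
        group = {}
        for name in batch[0]:
            column = [row[name] for row in batch]
            group[name] = ColumnChunk(column, min(column), max(column))
        groups.append(group)
    return groups


def scan(groups, filter_column, lower_bound, project_column):
    """Return project_column values where filter_column >= lower_bound.

    Row groups whose max statistic is below lower_bound are skipped
    without reading any of their values.
    """
    selected = []
    skipped = 0
    for group in groups:
        chunk = group[filter_column]
        if chunk.max_value < lower_bound:
            skipped += 1
            continue
        for i, value in enumerate(chunk.values):
            if value >= lower_bound:
                selected.append(group[project_column].values[i])
    return selected, skipped


# Ten toy timeseries rows, split into row groups of four rows each.
rows = [{"ts": i, "price": 100.0 + i} for i in range(10)]
groups = build_row_groups(rows, row_group_size=4)
prices, skipped = scan(groups, "ts", 8, "price")
```

Only the `ts` and `price` chunks are touched during the scan, and the first two row groups are ruled out from their statistics alone. Real Parquet readers apply the same idea using the statistics stored in the file footer.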
PyStore is a simple (yet powerful) datastore for Pandas dataframes, designed with storing timeseries data in mind. It leverages Pandas, Numpy, Dask, and Parquet (via pyarrow) for efficient data handling.
usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .option("orc.column.encoding.direct", "name")
  .save("users_with_options.orc")

Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
// set Parquet file block size and page size values
int blockSize = 256 * 1024 * 1024;
int pageSize = 64 * 1024;