This video course introduces DuckDB, an open-source database for data analytics in Python. It covers creating databases from files (Parquet, CSV, JSON), querying with SQL and the Python API, concurrent access, and integration with pandas and Polars.
A deep dive into the structure and performance benefits of Parquet files, including columnar storage, partitioning strategies, and row groups.
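To make the row-group idea concrete, here is a minimal pure-Python sketch of a columnar layout with per-chunk min/max statistics. It is a conceptual model only, not Parquet's actual on-disk format or any library's API; the names `ColumnChunk`, `to_row_groups`, and `scan_greater_than` are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Values of one column within one row group, plus min/max statistics."""
    values: list
    min_value: object
    max_value: object


def to_row_groups(rows, row_group_size):
    """Split row-oriented records into row groups, each stored column by column."""
    groups = []
    for start in range(0, len(rows), row_group_size):
        batch = rows[start:start + row_group_size]
        group = {}
        for name in batch[0]:
            column = [row[name] for row in batch]
            group[name] = ColumnChunk(column, min(column), max(column))
        groups.append(group)
    return groups


def scan_greater_than(groups, filter_column, threshold, select_column):
    """Return select_column values where filter_column > threshold.

    Row groups whose statistics rule out any match are skipped entirely.
    """
    result = []
    groups_read = 0
    for group in groups:
        chunk = group[filter_column]
        if chunk.max_value <= threshold:
            continue  # statistics prove that no row in this group can match
        groups_read += 1
        selected = group[select_column].values
        result.extend(s for v, s in zip(chunk.values, selected) if v > threshold)
    return result, groups_read


# Eight illustrative rows split into two row groups of four rows each.
rows = [{"id": i, "price": i * 10} for i in range(1, 9)]
groups = to_row_groups(rows, row_group_size=4)
ids, groups_read = scan_greater_than(groups, "price", 50, "id")
```

In this toy data set the first row group holds prices 10 to 40, so the filter `price > 50` skips it using the statistics alone, and only the `price` and `id` columns of the second group are touched.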
PyStore is a simple (yet powerful) datastore for pandas DataFrames, designed with time-series data in mind. It builds on pandas, NumPy, Dask, and Parquet (via pyarrow) for efficient data handling.
usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .option("orc.column.encoding.direct", "name")
  .save("users_with_options.orc")
Find the full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
// Set the Parquet block (row group) size to 256 MiB and the page size to 64 KiB
int blockSize = 256 * 1024 * 1024;
int pageSize = 64 * 1024;
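As a rough sanity check on these two values, the arithmetic below estimates how many pages fit in one block. It is a back-of-the-envelope sketch that assumes both thresholds are hit exactly and that all columns are the same size; real writers treat these sizes as targets, so actual counts will differ. The helper `pages_per_column_chunk` and the 16-column example are hypothetical.

```python
MIB = 1024 * 1024
KIB = 1024

block_size = 256 * MIB  # same value as blockSize above
page_size = 64 * KIB    # same value as pageSize above


def pages_per_column_chunk(block_size, page_size, num_columns):
    """Estimate pages per column chunk, assuming equally sized columns."""
    chunk_size = block_size // num_columns
    return chunk_size // page_size


# Upper bound on pages in one block if every page is exactly page_size.
total_pages = block_size // page_size

# Hypothetical table with 16 equally sized columns.
per_chunk = pages_per_column_chunk(block_size, page_size, num_columns=16)
```

Under these assumptions one block holds 4096 pages in total, or 256 pages per column chunk for the 16-column case.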