SemanticScuttle - klotz.me » klotz: spark

klotz: spark*

Spark is an open-source, distributed computing framework for large-scale data processing, originally developed at UC Berkeley's AMPLab. It is designed to be fast and general enough to handle a wide variety of workloads, including ETL, machine learning, streaming, and graph processing. It runs on Hadoop YARN or other cluster managers and provides a programming interface together with an ecosystem of libraries for machine learning, graph processing, and streaming. Spark is used in cloud data engineering and machine learning because it processes large amounts of data quickly and efficiently. It is written in Scala and offers APIs for Python, Java, and R that are suitable for production-level applications. It integrates with Kubernetes and cloud providers for scalability and management.

1.5 Years of Spark Knowledge in 8 Tips

2023-12-24 Tags: spark, databricks by klotz

llama-2 on spark

2023-08-03 Tags: llama-2, llm, spark by klotz

Beam Programming Guide

2023-04-25 Tags: apache, beam, python, spark, streaming by klotz

Design Your Pipeline: Merging Multiple Sources

2023-04-25 Tags: apache, beam, spark, python, join, streaming by klotz

Google announces updates for BigQuery data warehouse • The Register

2023-01-07 Tags: google, bigquery, spark, data engineering by klotz

Spark on EMR — Cost Optimization. First-hand experience of cost-saving… | by Amit Singh Rathore | May, 2022 | Medium

2022-05-16 Tags: emr, eks, spark, aws, cost, optimization, data engineering by klotz

Modern Data Stack: Which Place for Spark ? | by Furcy Pin | Jan, 2022 | Towards Data Science

2022-01-31 Tags: spark, bigquery, snowflake, data engineering by klotz

Run Pandas as Fast as Spark. Why the Pandas API on Spark is a total… | by Adrián González Carpintero | Nov, 2021 | Towards Data Science

2021-12-06 Tags: pandas, spark, data science, python by klotz

Generic Load/Save Functions - Spark 3.2.0 Documentation

usersDF.write.format("orc")
  .option("orc.bloom.filter.columns", "favorite_color")
  .option("orc.dictionary.key.threshold", "1.0")
  .option("orc.column.encoding.direct", "name")
  .save("users_with_options.orc")
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo

2021-12-01 Tags: spark, orc, bloom filter, parquet, hadoop by klotz

How to Execute Pandas Workloads in a Distributed Manner With Apache Spark - The Databricks Blog

2021-10-05 Tags: pandas, spark, data, frame, parallelism, data engineering by klotz
