Tags: spark*

Spark is an open-source, distributed computing framework for large-scale data processing, originally developed at the UC Berkeley AMPLab. It is designed to be fast and general enough to handle a wide variety of workloads, including ETL, machine learning, streaming, and graph processing. It runs on Hadoop, YARN, or other substrates, and provides a unified programming interface together with an ecosystem of libraries for machine learning, graph processing, and streaming. Spark is used in cloud engineering and machine-learning work for its ability to process large amounts of data quickly and efficiently. It is written in Scala and can be used from Python, Java, and R for production-level applications, and it integrates with Kubernetes and cloud providers for scalability and management.


  1. 2021-04-21 by klotz
  2. 2021-04-21 by klotz
  3. 2021-04-21 by klotz
  4. Apache logfile parser with Spark
    2021-04-01 by klotz
  5. 2021-04-01 by klotz
  6. Extract the 11 elements from each log

    import re

    def map_log(line):
        # Extract the 11 log fields. The regexes are a best-effort repair:
        # backslashes were stripped from the original post.
        match = re.search(r'^(\S+) (\S+) (\S+) (\S+) \[\S+ [-+](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*(\S+)?\s*"* (\d{3}) (\S+)', line)
        if match is None:
            # Fallback pattern for request fields containing whitespace
            match = re.search(r'^(\S+) (\S+) (\S+) (\S+) \[\S+ [-+](\d{4})\] "(\S+)\s*(\S+)\s*([\w/\s.]+)\s*(\S+)\s*(\d{3})\s*(\S+)', line)
        return match.groups()

    # parse_log2 (defined elsewhere in the post) appears to return
    # (parsed_line, 1) on success, so the filter keeps only parsed lines.
    parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line: line[0])
    parsed_rdd2 = parsed_rdd.map(lambda line: map_log(line))
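    To see what the 11 extracted elements look like, the first pattern (with backslashes restored, a best-effort reconstruction) can be exercised on a synthetic access-log line; both the sample line and the `access_log_pattern` name are invented for this illustration:

    ```python
    import re

    # Reconstructed 11-group pattern from the snippet above (hedged repair).
    access_log_pattern = (r'^(\S+) (\S+) (\S+) (\S+) \[\S+ [-+](\d{4})\] '
                          r'"(\S+)\s*(\S+)\s*(\S+)\s*(\S+)?\s*"* (\d{3}) (\S+)')

    # Synthetic log line, not taken from the original post.
    sample = ('example.com 127.0.0.1 - frank [01/Apr/2021:10:00:00 -0700] '
              '"GET /index.html HTTP/1.1" 200 2326')

    fields = re.search(access_log_pattern, sample).groups()
    # 11 fields: host names, timezone offset, request method/path/protocol,
    # an optional extra token, HTTP status, and response size.
    ```

    On this sample, `fields[5]` is the request method, `fields[9]` the status code, and `fields[10]` the byte count.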
    2021-04-01 by klotz
  7. 2021-04-01 by klotz
  8. 2021-03-18 by klotz
  9. 2021-03-17 by klotz


SemanticScuttle - klotz.me: tagged with "spark"

About - Propulsed by SemanticScuttle