SemanticScuttle - klotz.me » klotz: cdf+apache+spark

Spark RDDs Vs DataFrames vs SparkSQL – Part 3 : Web Server Log Analysis | DataScience+

Extract the 11 elements from each log line

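import re

def map_log(line):
    # Primary pattern. The character class in the optional ninth group was lost from this excerpt; [/\w\.\s*] is an assumed reconstruction.
    match = re.search(r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*(\S+)\s*(\S+)\s*([/\w\.\s*]+)?\s*"* (\d{3}) (\S+)', line)
    if match is None:
        # Fallback for irregular request fields. The class [/\w\.] in the seventh group is likewise assumed.
        match = re.search(r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] "(\S+)\s*([/\w\.]+)>*([\w/\s\.]+)\s*(\S+)\s*(\d{3})\s*(\S+)', line)
    return match.groups()

# parse_log2 is not shown in this excerpt; judging by its use here, it returns a (line, flag) pair with flag 1 for parseable lines.
parsed_rdd = rdd.map(lambda line: parse_log2(line)).filter(lambda line: line[1] == 1).map(lambda line: line[0])
parsed_rdd2 = parsed_rdd.map(lambda line: map_log(line))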
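
To see what the extraction step yields without a Spark cluster, here is a minimal plain-Python sketch. It assumes a pattern of the same shape as the one in the excerpt; the character class in the optional ninth group, the helper name `try_map_log`, and the sample lines are all assumptions made for illustration, not taken from the article. Unlike `map_log`, the helper returns `None` for a non-matching line instead of raising, so the filter can come after the parse.

```python
import re

# Assumed pattern: same shape as the excerpt's first regex. The character
# class in the optional ninth group is a guess, not confirmed by the source.
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[(\S+) [-](\d{4})\] '
    r'"(\S+)\s*(\S+)\s*(\S+)\s*([/\w\.\s*]+)?\s*"* (\d{3}) (\S+)'
)


def try_map_log(line):
    """Return the 11 captured fields, or None when the line does not match."""
    match = LOG_PATTERN.search(line)
    return match.groups() if match else None


# Hypothetical sample input: one well-formed access-log line and one junk line.
sample_lines = [
    'in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
    '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839',
    'this line is not a log record',
]

# Plain-Python stand-in for rdd.map(...).filter(...): parse, then drop failures.
parsed = [fields for fields in map(try_map_log, sample_lines) if fields is not None]

# Unpack the fields of interest; the protocol and optional group land in _rest.
host, _, _, timestamp, tz, method, path, *_rest, status, size = parsed[0]
print(len(parsed[0]), host, timestamp, tz, method, path, status, size)
```

Because the pattern hard-codes `[-]` before the four-digit offset, a line with a positive UTC offset would be dropped by this sketch; that may or may not matter for a given log source.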

2021-04-01 Tags: spark, apache, rdd, cdf, logs, pyspark by klotz
