Be cautious about including legacy resources in the big data system

Very often, organizations insist on involving legacy resources (e.g., applications, data storage) in the big data system. On the one hand, this can accelerate and ease the implementation of a big data use case; on the other hand, it can create a bottleneck in the workflow that becomes problematic in the long term.

If the monolithic applications process data faster than their distributed counterparts, they can be included. Otherwise, replace them with a big data solution.

For example, in the benchmark below, R data.table and Python pandas are worth including, but gawk/mawk/perl are not.

–Elapsed time for aggregating and sorting 100 million rows (2.4 GB) on 16 CPUs, 114 GB RAM–
<<Monolithic, single core>>
– gawk: GNU awk, the base awk = 3'57"
– mawk: a faster, independent awk implementation = 2'33"
– perl: another text-processing language = 4'43"
– Python pandas = 0'53"
<<Monolithic, multiple cores>>
– R data.table = 0'4"
<<Distributed, multiple cores>>
– Spark = 1'42"
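The benchmark assumes a two-column CSV at /scratch/awahyudi/d.csv (a key in the first column, a numeric value in the second). The post does not show how that file was built; the sketch below generates a small stand-in with the same assumed "key,value" shape (scale the loop bound up to approximate the 100-million-row case):

```shell
# Sketch: build a small d.csv with the assumed "key,value" layout.
# 1000 rows here for illustration; the benchmark above used 100 million.
awk 'BEGIN {
  srand(42)                                  # fixed seed for repeatability
  for (i = 1; i <= 1000; i++)
    printf "%d,%.4f\n", int(rand() * 50), rand() * 100
}' > d.csv
wc -l d.csv                                  # prints the row count
```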

The commands are as follows.

time awk 'BEGIN {FS=",";OFS=","}{A[$1]+=$2;I[$1]++} END { for(i in A) print i,A[i]/I[i],I[i]}' d.csv | sort -t "," -k2 -rn | head -n 5 > d2.csv
time /scratch/awahyudi/mawk/bin/mawk 'BEGIN {FS=",";OFS=","}{A[$1]+=$2;I[$1]++} END { for(i in A) print i,A[i]/I[i],I[i]}' d.csv | sort -t "," -k2 -rn | head -n 5 > d2.csv
time perl -lanF, -e '$H{$F[0]} += $F[1]; $I{$F[0]}++; END { print $_.",".$H{$_}/$I{$_} for(keys %H) }' d.csv | sort -t "," -k2 -rn | head -n 5 > d6.csv
time Rscript -e 'library(data.table); d = fread("/scratch/awahyudi/d.csv"); setnames(d, c("x", "y")); print(head(d[, list(ym=mean(y)), by=x][order(-ym)],5))'
time python -c 'import pandas as pd; d = pd.read_csv("/scratch/awahyudi/d.csv", names=["x", "y"]); output = d.groupby("x").agg("mean").sort_values("y", ascending=False).head(5); print(output)'
time echo 'sc.setLogLevel("FATAL"); val d = spark.read.csv("/scratch/awahyudi/d.csv"); d.createOrReplaceTempView("d"); spark.sql("SELECT _c0, avg(_c1) as avg_c1 FROM d GROUP BY _c0 ORDER BY avg_c1 DESC LIMIT 5").collect()' | spark-shell
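Before running the awk pipeline on the full 2.4 GB file, it is worth sanity-checking it on a tiny input. The sketch below uses a made-up three-row sample (the file name and values are for illustration only):

```shell
# Keys a and b; per-key average and count, sorted descending by the average.
printf 'a,2\na,4\nb,9\n' > sample.csv
awk 'BEGIN {FS=",";OFS=","} {A[$1]+=$2; I[$1]++}
     END {for (i in A) print i, A[i]/I[i], I[i]}' sample.csv |
  sort -t, -k2 -rn | head -n 5
# prints:
# b,9,1
# a,3,2
```

Note the `-n` on sort: the second field is a numeric average, so a plain lexical sort would order "9" and "10" incorrectly.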

