Organizations often insist on involving legacy resources (e.g., applications, data storage) in their big data systems. On the one hand, this can accelerate and ease the implementation of a big data use case; on the other hand, it can create a bottleneck in the workflow that becomes problematic in the long term.
If a monolithic application processes data faster than its distributed counterpart, it can be kept; otherwise, it should be replaced with a big data solution.
For example, in the benchmark below, R data.table and Python pandas are worth keeping, while gawk/mawk/perl should be replaced.
Elapsed time for aggregating and sorting 100 million rows (2.4 GB) on a machine with 16 CPUs and 114 GB RAM:
<<Monolithic, single core>>
– gawk: GNU awk, the default awk on most systems = 3'57"
– mawk: a faster, independent awk implementation = 2'33"
– perl: another text-processing language = 4'43"
– Python pandas = 0'53"
<<Monolithic, multiple cores>>
– R data.table = 0'04"
<<Distributed, multiple cores>>
– Spark = 1'42"
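For reference, the task being timed is: group the rows by the first column, compute the mean of the second column per group, sort the groups by that mean in descending order, and keep the top five. A minimal plain-Python sketch of the same computation (the two-column "key,value" schema of d.csv is inferred from the commands below):

# Plain-Python version of the benchmarked task, for illustration only.
# Assumes d.csv has two columns: a group key and a numeric value.
import csv
from collections import defaultdict

sums = defaultdict(float)   # running sum of values per key
counts = defaultdict(int)   # number of rows per key

with open("d.csv", newline="") as f:
    for key, value in csv.reader(f):
        sums[key] += float(value)
        counts[key] += 1

# Mean per key, sorted in descending order; print the top five.
means = {k: sums[k] / counts[k] for k in sums}
for k in sorted(means, key=means.get, reverse=True)[:5]:
    print(k, means[k], counts[k])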
The commands are as follows.
# GAWK
time awk 'BEGIN {FS=",";OFS=","}{A[$1]+=$2;I[$1]++} END { for(i in A) print i,A[i]/I[i],I[i]}' d.csv | sort -rn -t "," -k2 | head -n 5 > d2.csv

# MAWK
time /scratch/awahyudi/mawk/bin/mawk 'BEGIN {FS=",";OFS=","}{A[$1]+=$2;I[$1]++} END { for(i in A) print i,A[i]/I[i],I[i]}' d.csv | sort -rn -t "," -k2 | head -n 5 > d2.csv

# PERL
time perl -lanF, -e '$H{$F[0]} += $F[1]; $I{$F[0]}++; END { print $_.",".$H{$_}/$I{$_} for(keys %H) }' d.csv | sort -rn -t "," -k2 | head -n 5 > d6.csv

# R DATA.TABLE
time Rscript -e 'library(data.table); d = fread("/scratch/awahyudi/d.csv"); setnames(d, c("x", "y")); print(head(d[, list(ym=mean(y)), by=x][order(-ym)],5))'

# PYTHON PANDAS
time python -c 'import pandas as pd; d = pd.read_csv("/scratch/awahyudi/d.csv", names=["x", "y"]); output = d.groupby("x").agg("mean").sort_values("y", ascending=False).head(5); print(output)'

# SPARK
time echo 'sc.setLogLevel("FATAL"); val d = spark.read.csv("/scratch/awahyudi/d.csv"); d.createOrReplaceTempView("d"); spark.sql("SELECT _c0,avg(_c1) as avg_c1 FROM d GROUP BY _c0 ORDER BY avg_c1 DESC LIMIT 5").collect()' | spark-shell
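The input file d.csv itself is not shown. A minimal sketch for generating a comparable file, assuming a small integer group key and a random numeric value per row (the exact key range and value format are guesses, so the resulting file size will not match the 2.4 GB above exactly):

# Hypothetical generator for d.csv; the "key,value" schema is assumed.
import random

N_ROWS = 100_000_000   # row count used in the benchmark
N_KEYS = 100           # assumed number of distinct groups

with open("d.csv", "w") as f:
    for _ in range(N_ROWS):
        f.write("%d,%f\n" % (random.randint(1, N_KEYS), random.random()))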