Be cautious about including legacy resources as part of the big data system

Organizations often insist on incorporating legacy resources (e.g., applications, data storage) into the big data system. On the one hand, this can accelerate and simplify the implementation of a big data use case; on the other hand, it creates a bottleneck in the workflow that becomes problematic in the long term.

If a monolithic application processes data faster than its distributed counterpart, it can be included. Otherwise, replace it with a big data solution.

For example, in the benchmark below, R data.table and Python pandas could be included, but gawk/mawk/perl should not.

Elapsed time for aggregating and sorting 100 million rows (2.4 GB) on 16 CPUs, 114 GB RAM:
<<Monolithic, single core>>
– gawk (GNU awk, the baseline awk): 3'57"
– mawk (a faster awk implementation): 2'33"
– perl (another text-processing language): 4'43"
<<Monolithic, multiple cores>>
– R data.table: 0'04"
– Python pandas: 0'53"
<<Distributed, multiple cores>>
– Spark: 1'42"
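
The commands below assume an input file d.csv with two columns: a group key and a numeric value. The original data set is not shown, so the following is only a hypothetical sketch for generating a comparable test file (the number of distinct keys and the value range are assumptions):

# generate_d.py -- hypothetical data generator, not part of the original benchmark
import numpy as np

N = 100_000_000      # total rows (the benchmark used 100 million)
KEYS = 1_000         # number of distinct group keys (assumption)
CHUNK = 1_000_000    # rows written per batch to keep memory usage modest

rng = np.random.default_rng(42)
with open("d.csv", "w") as f:
    for _ in range(N // CHUNK):
        x = rng.integers(0, KEYS, size=CHUNK)    # column 1: group key
        y = rng.random(CHUNK) * 100               # column 2: numeric value
        f.write("\n".join(f"{a},{b:.4f}" for a, b in zip(x, y)) + "\n")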

The commands are as follows.

# GAWK: group by column 1, average column 2, then take the top 5 groups by average
time awk 'BEGIN {FS=",";OFS=","}{A[$1]+=$2;I[$1]++} END { for(i in A) print i,A[i]/I[i],I[i]}' d.csv | sort -r -n -t "," -k2 | head -n 5 > d2.csv
# MAWK
time /scratch/awahyudi/mawk/bin/mawk 'BEGIN {FS=",";OFS=","}{A[$1]+=$2;I[$1]++} END { for(i in A) print i,A[i]/I[i],I[i]}' d.csv | sort -r -n -t "," -k2 | head -n 5 > d2.csv
# PERL
time perl -lanF, -e '$H{$F[0]} += $F[1]; $I{$F[0]}++; END { print $_.",".$H{$_}/$I{$_} for(keys %H) }' d.csv | sort -r -n -t "," -k2 | head -n 5 > d6.csv
# R DATA.TABLE
time Rscript -e 'library(data.table); d = fread("/scratch/awahyudi/d.csv"); setnames(d, c("x", "y")); print(head(d[, list(ym=mean(y)), by=x][order(-ym)],5))'
# PYTHON PANDAS
time python -c 'import pandas as pd; d = pd.read_csv("/scratch/awahyudi/d.csv", names=["x", "y"]); output = d.groupby("x").agg("mean").sort_values("y", ascending=False).head(5); print(output)'
# SPARK
time echo 'sc.setLogLevel("FATAL"); val d = spark.read.csv("/scratch/awahyudi/d.csv"); d.createOrReplaceTempView("d"); spark.sql("SELECT _c0,avg(_c1) as avg_c1 FROM d GROUP BY _c0 ORDER BY avg_c1 DESC LIMIT 5").collect()' | spark-shell
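
For reference, the same Spark job can also be expressed in Python. The sketch below is a hypothetical PySpark equivalent of the spark-shell command above, not part of the original benchmark:

# PYSPARK (hypothetical equivalent of the spark-shell command above)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregate-d-csv").getOrCreate()
spark.sparkContext.setLogLevel("FATAL")

d = spark.read.csv("/scratch/awahyudi/d.csv")   # columns are read as _c0, _c1
result = (d.groupBy("_c0")
            .agg(F.avg("_c1").alias("avg_c1"))
            .orderBy(F.desc("avg_c1"))
            .limit(5))
print(result.collect())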
