On the last post, I mentioned that aggregating & sorting 100 million rows dataset (~ 2.4 GB) using monolithic approach takes 4 seconds to 5 minutes (R data.table, ptyhon pandas, awk, perl) to complete. Spark, a distributed platform that could be horizontally paralleled, takes almost 2 minutes. I extend the trial using Spark atop YARN (i.e., processing data stored in HDFS managed by YARN using Spark) which runs 3.3 minutes for completion.
The command I used is as follows.
time echo 'sc.setLogLevel("FATAL"); val d = spark.read.csv("/user/awahyudi/d.csv"); d.createOrReplaceTempView("d"); spark.sql("SELECT _c0,avg(_c1) as avg_c1 FROM d GROUP BY _c0 ORDER BY avg_c1 DESC LIMIT 5").collect()' | spark-shell --master yarn --deploy-mode client
Why hassle ourselves to use the distributed approach? Well, the size of today’s datasets does not fit into a single machine’s RAM anymore. Although we can pour more disks & RAMs, the expansion has a limitation, i.e., at a certain point, it will reach the maximum RAM & disk a machine can provide. It hinders us to operate such a dataset.
Therefore, people came out with the distributed approach, i.e., scaling out resources (computing, RAM, disks) horizontally. Actually, the idea is not new. In the academic & research environment, HPC (high-performance computing) has been used. However, specialized hardware is needed, i.e. data from disk clusters need to be transferred to computing clusters using a very high-speed bus such as InfiniBand.
Hadoop then came in 2003 with an idea: distributing data & processing in commodity hardware. Therefore, it differs from HPC by having data locality, i.e., processing the data locally where it stores. Embarrassingly massive parallelism is its popular name in academic society. Hadoop allows us processing unlimited size of data, distributed across multiple computers (nodes). However, doing operation on disks is very slow. Spark, then came in 2012, moves data from local disk to local RAM to accelerate the process. It also provides multiple high-level abstractions (SQL, machine learning, stream) using various languages (Scala, R, python).
However, the game changes recently because cloud operators such as AWS and Google Cloud offer a very huge-resource machine with an affordable price such as AWS’ u-12tb1.metal (448 CPUs, ~12 TB RAMs).
As a consequence, it raises a question to the need for distributed processing. If the data fits in RAMs, why we bother ourselves with the complexity of the distributed approach?
As a summary, it is important to understand your data objectives in order to select the right data processing platform. If the data fits into RAM, then monolithic solution using R or python is suitable. However, for batch processing huge amount of data, distributed approach such as Hadoop and Spark is the only solution you have.