Apache Hadoop: What is that & how to install and use it? (Part 2)

Part 2: How to install a standalone Hadoop Now, we are going to install a standalone Hadoop. The easiest way is to use VM sandbox provided by vendors such as Hortonworks/Cloudera and MapR. However, since the sandbox has many components (not only Hadoop, but also HBase, Spark, Hive, Oozie, etc.), it requires substantial resources (4 … Read moreApache Hadoop: What is that & how to install and use it? (Part 2)

Why we need a big data platform such as Hadoop & Spark?

On the last post, I mentioned that aggregating & sorting 100 million rows dataset (~ 2.4 GB) using monolithic approach takes 4 seconds to 5 minutes (R data.table, ptyhon pandas, awk, perl) to complete. Spark, a distributed platform that could be horizontally paralleled, takes almost 2 minutes. I extend the trial using Spark atop YARN … Read moreWhy we need a big data platform such as Hadoop & Spark?