Apache Hadoop: What is that & how to install and use it? (Part 2)

Part 2: How to install a standalone Hadoop

Now we are going to install a standalone Hadoop. The easiest way is to use a VM sandbox provided by a vendor such as Hortonworks, Cloudera, or MapR. However, since the sandbox bundles many components (not only Hadoop, but also HBase, Spark, Hive, Oozie, etc.), it requires substantial resources (4 …

Apache Hadoop: What is that & how to install and use it? (Part 1)

Next: How to install a standalone Hadoop

Part 1: Understanding Apache Hadoop as a Big Data Distributed Processing & Storage Cluster

In the last post, I discussed the occasions on which we prefer a distributed approach such as Hadoop and Spark over the monolithic approach. In this article, I will discuss Apache Hadoop in more detail. This …

Why we need a big data platform such as Hadoop & Spark?

In the last post, I mentioned that aggregating and sorting a 100-million-row dataset (~2.4 GB) with a monolithic approach (R data.table, Python pandas, awk, Perl) takes anywhere from 4 seconds to 5 minutes to complete. Spark, a distributed platform that can be parallelized horizontally, takes almost 2 minutes. I extend the trial using Spark atop YARN …

Be cautious to include legacy resources as part of the big data system

Very often, organizations insist on bringing legacy resources (e.g., applications, data storage) into the big data system. On the one hand, this can accelerate and ease the implementation of a big data use case; on the other hand, it creates a bottleneck in the workflow that becomes problematic in the long term. If the monolithic applications …