Polyglot data science applications

There is no “Swiss Army knife” tool; every tool has its advantages in particular circumstances. For example, R has the most comprehensive statistical packages, but it lacks scalability support. Python, on the other hand, has such a large and active community that finding a solution to a problem is usually trivial. What if …

Why do we need a big data platform such as Hadoop & Spark?

In the last post, I mentioned that aggregating and sorting a 100-million-row dataset (~2.4 GB) using a monolithic approach (R data.table, Python pandas, awk, Perl) takes between 4 seconds and 5 minutes to complete. Spark, a distributed platform that can be scaled horizontally, takes almost 2 minutes. I extend the trial using Spark atop YARN …
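As a rough illustration of the kind of workload benchmarked above, here is a minimal pandas sketch of the group-aggregate-sort pattern. The column names and tiny inline data are placeholders, not the actual 100-million-row dataset from the post:

```python
import pandas as pd

# Miniature stand-in for the benchmark workload:
# group by a key column, sum the values, then sort by the aggregate.
df = pd.DataFrame({
    "key": ["a", "b", "a", "c", "b", "a"],
    "value": [10, 5, 3, 8, 2, 1],
})

agg = (
    df.groupby("key", as_index=False)["value"]
      .sum()
      .sort_values("value", ascending=False)
      .reset_index(drop=True)
)
print(agg)
```

The same logical pipeline translates almost directly to Spark's DataFrame API (`groupBy`, `agg`, `orderBy`); the difference is that Spark partitions the data and runs the aggregation across executors, which is what makes it viable at the 100-million-row scale.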

Be cautious about including legacy resources as part of the big data system

Very often, organizations insist on involving legacy resources (e.g., applications, data storage) in a big data system. On the one hand, this can accelerate and ease the implementation of a big data use case; on the other, it creates a bottleneck in the workflow that becomes problematic in the long term. If the monolithic applications …