Dataiku: flexible data science tools

As discussed in the previous post, the flexibility offered by data science tools can greatly reduce performance, i.e., execution speed.

Fortunately, Dataiku, a data science tool, provides multiple ways to aggregate big data:
1) using the built-in building blocks;
2) using a custom R script with the built-in I/O blocks; or
3) using an independent custom R script (data processing with the data.table package).

3 ways of doing the task

The R script embedded in the block is as follows. The data can be downloaded from here.

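  library(data.table)
  # fread() reads large CSV files much faster than base read.csv()
  d <- fread("/scratch/awahyudi/d.csv")
  # Name the two columns: x is the grouping key, y is the value to average
  setnames(d, c("x", "y"))
  # Mean of y per x, sorted in descending order of that mean; print the top 5 groups
  print(head(d[, list(ym = mean(y)), by = x][order(-ym)], 5))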
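
For readers without access to the original file, the same aggregation can be tried on synthetic data. This is only a rough sketch: the row count, the number of groups, and the value distribution are assumptions chosen for illustration, not properties of the dataset used in this post, so the timing it prints is not comparable to the results reported here.

```r
library(data.table)

set.seed(42)   # make the synthetic data reproducible
n <- 1e6       # assumed size, far smaller than the 100 million rows used in the post

# Hypothetical stand-in for d.csv: a grouping key x and a numeric value y
d <- data.table(x = sample(1:1000, n, replace = TRUE), y = runif(n))

# Time the same grouped mean and descending sort as in the script
elapsed <- system.time(
  top5 <- head(d[, list(ym = mean(y)), by = x][order(-ym)], 5)
)[["elapsed"]]

print(top5)
cat("elapsed seconds:", elapsed, "\n")
```

Increasing n toward the size of a real dataset gives a feel for how the data.table approach scales on a given machine.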

Here are the results of aggregating a 100-million-row dataset on a machine with 16 CPUs and 114 GB of RAM:
1) ~ 27 minutes
2) ~ 18 minutes
3) 9 seconds
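
To put the gap in perspective, the three timings can be converted to seconds and compared. The sketch below treats the approximate figures as exact, and the variable names are purely illustrative:

```r
# Timings from the list above, in seconds (the "~" values are taken at face value)
timings_sec <- c(
  builtin_blocks   = 27 * 60,
  r_with_io_blocks = 18 * 60,
  standalone_r     = 9
)

# How many times slower each approach is than the standalone data.table script
speedup <- timings_sec / timings_sec[["standalone_r"]]
print(speedup)
```

Under that assumption, the standalone script is roughly 180 times faster than the built-in blocks and roughly 120 times faster than the R script using the built-in I/O blocks.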

Combining custom code with the rich UI and features of a platform such as Dataiku offers both benefits: flexibility and performance.
