Dataiku: flexible data science tools - .:: Data Sains Lab ::.

In the previous post, the flexibility given by data science tools greatly reduces the performance, i.e., the execution speed.

Fortunately, Dataiku, a data science tool, provides multiple ways to aggregate big data:
1) using the built-in building blocks;
2) using a custom R script with the built-in I/O blocks; or
3) using an independent custom R script (data processing using data.table package).

The R script embedded in the block is as follow. The data could be downloaded from here.

```
library(data.table)
```
```
d = fread("/scratch/awahyudi/d.csv")
```
```
setnames(d, c("x", "y"))
```

print(head(d[, list(ym=mean(y)), by=x][order(-ym)],5))

Here is the result of aggregating 100 million rows dataset on 16 CPUs, 114 GB RAM machine:
1) ~ 27 minutes
2) ~ 18 minutes
3) 9 seconds

Combining the inclusion of custom codes and the richness of UI & features in a platform such as Dataiku gets both benefits (flexibility & performance).

Share this:

Related Posts

Leave a Reply Cancel reply