Data science tools such as Rapidminer, Dataiku, and KNIME offer so much flexibility and provide easy-to-understand building blocks that abstract data processing functions. It allows data analysts implementing a business case quickly. However, it comes with a price: slowing down the execution speed due to variable transfer between tasks.
Here is the trial.
Aggregating 100 million rows data on 16 cores Xeon 2.4 GHz & RAM 144 GB.
1) transfer data to RAM;
2) set index;
4) sort descending;
5) sample five rows.
1) Rapidminer workflow: ~5 minute
2) R data.table workflow: 4 second