Syncsort DMX-h & IBM SPSS Modeler

DMX-h vs. SPSS Modeler

Two other popular data processing platforms in the IT world are explored here: DMX-h and SPSS Modeler.

1) DMX-h: I was an extensive user of this beast of a software package in 2009-2012. It is an amazing ETL platform. I used it to process terabytes of chunked files, and the jobs finished in a short time compared to a relational DB. On a number of occasions we scaled up the resources vertically, i.e., by adding more RAM and disks. Its secret recipe is to use disk as virtual RAM, so we never ran into an out-of-memory state. Today it also allows horizontal scalability by encapsulating Hadoop & Spark processes underneath. Moreover, connectors to cloud services such as S3, GCS, and Salesforce are provided. Note that it is not a data science tool, i.e., no machine learning, statistics, or visualization blocks are provided.
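To make the "disk as virtual RAM" idea concrete, here is a minimal external merge sort sketch in Python. It only illustrates the general technique of spilling sorted chunks to disk and merging them back; it is not DMX-h's actual implementation, which is not described here. The function name `external_sort` and the `chunk_size` parameter are hypothetical names chosen for this example.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield integers in sorted order, holding at most chunk_size of them in memory.

    Illustrative sketch only: sorted chunks are spilled to temporary files,
    then streamed back and merged.
    """
    chunk_paths = []

    def spill(items):
        # Sort one chunk in memory and write it to a temporary file on disk.
        items.sort()
        fd, path = tempfile.mkstemp(suffix=".chunk")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in items)
        chunk_paths.append(path)

    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            spill(chunk)
            chunk = []
    if chunk:
        spill(chunk)

    files = [open(p) for p in chunk_paths]
    try:
        # heapq.merge reads the sorted files lazily, one pending value per file.
        streams = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)


print(list(external_sort([5, 3, 9, 1, 7, 2], chunk_size=2)))
# prints [1, 2, 3, 5, 7, 9]
```

The memory footprint is bounded by `chunk_size` no matter how large the input is, which is the trade-off the paragraph above describes: slower disk I/O in exchange for never running out of RAM.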

2) SPSS Modeler: It is a full-fledged data science tool offering a rich library of statistics and machine learning. Numerous cloud connectors are also available, and custom scripts (R, Python) can be embedded. It allows ingesting HDFS data and processing it locally using Hive or Spark.

— The trial —
Aggregating & sorting 100 million rows on 4 CPUs and 32 GB of RAM
1) R data.table = 117 sec (baseline)
2) DMX-h = 186 sec
3) Modeler = 326 sec

The R script used is as follows.

time Rscript -e 'library(data.table); d = fread("/scratch/awahyudi/d.csv"); setnames(d, c("x", "y")); print(head(d[, list(ym=mean(y)), by=x][order(-ym)],5))'
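For readers less familiar with data.table syntax, the one-liner groups rows by `x`, computes the mean of `y` per group, sorts the groups by that mean in descending order, and prints the top five. A rough plain-Python sketch of the same logic on a tiny made-up sample is shown below. The function name `top_groups_by_mean` and the sample rows are hypothetical, and this is not the code that was benchmarked.

```python
from collections import defaultdict


def top_groups_by_mean(rows, n=5):
    """Return the n (x, mean of y) pairs with the largest means."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in rows:
        # Accumulate a running sum and count per group key x.
        sums[x] += y
        counts[x] += 1
    means = [(x, sums[x] / counts[x]) for x in sums]
    # Sort by mean, largest first, mirroring order(-ym) in the R command.
    means.sort(key=lambda pair: pair[1], reverse=True)
    return means[:n]


# Made-up sample data, only to show the shape of the result.
sample = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("c", 10.0), ("b", 6.0)]
print(top_groups_by_mean(sample, n=2))
# prints [('c', 10.0), ('b', 5.0)]
```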
