InfluxDB cluster: Setup & Installation

In any good IT platform, the main features must be planned up front, by design. Scalability and high availability are two must-have capabilities in any big data system today. Service outages, inaccessible or busy servers, and long response times are issues that organizations have tried to avoid for years. Impressed by the performance of InfluxDB compared to …

Battle of ML/DL framework on stand-alone vs. distributed platform

Will steep improvements in algorithms, combined with falling hardware costs (CPU, memory, disk), make the distributed approach irrelevant? IMHO, at this time the winner is h2oai, which gives impressive performance in stand-alone mode and also supports distributed platforms (i.e., atop Spark using H2O Sparkling Water). I was so surprised that Stanford’s statistics maestro, Tibshirani & …

Flexibility vs. Speed

Data science tools such as RapidMiner, Dataiku, and KNIME offer a great deal of flexibility and provide easy-to-understand building blocks that abstract data processing functions. They allow data analysts to implement a business case quickly. However, this comes at a price: slower execution due to variable transfer between tasks. Here is the trial. Aggregating 100 …

CPU vs. GPU

Inspired by Matt Dowle's benchmark (https://h2oai.github.io/db-benchmark/), I compared his results with a GPU (details: https://lnkd.in/e7iHg7N). For processing big data, a K20 GPU with 2 GB of memory is slightly better than a 20-core 2.6 GHz Xeon CPU with 125.8 GB of RAM, and even much better in some tests 🙂 Of course, the performance comes at a price. Thanks to OmniSci …

GPU Database

I think one of the most promising technologies of the next couple of years is the use of GPUs to accelerate all kinds of jobs. One company following this direction is OmniSci (formerly MapD). They have a live demo showing how fast a GPU processes almost 400 million tweets and visualizes them geographically in less …

Syncsort DMX-h & IBM SPSS Modeler

Two other popular data processing platforms in the IT world are explored here, i.e., DMX-h and SPSS Modeler. 1) DMX-h: I was a heavy user of this beast of a software package in 2009-2012. It is an amazing ETL platform; I used it to process terabytes of chunked files, which completed in a short time (compared to a relational …