Home Lab

I like to learn something new every day, whether or not it is related to my PhD research (big data value creation). I have investigated and learned many big data artifacts across multiple layers: OpenStack, OpenFlow, Cumulus VX, vSphere, HPC, etc. in the technology/infrastructure layer; MPP/distributed solutions such as Hadoop, Spark, Elasticsearch, Kafka, MapD, etc. in …

Benchmark Python’s Dataframe: Pandas vs. Datatable vs. PySpark SQL

Setup
Machine: 16-thread Xeon 2.6 GHz, 32 GB RAM, NVMe PCIe x16
System: Ubuntu 16.04, Spark 2.4.4, Python 3.7.4, Pandas 0.25.1, Datatable 0.10.1
Data: 100 million rows of generated CSV (1.6 GB gzip-compressed)
Operation:
  Create a dataframe from a compressed file
  Group the dataframe by 3 columns
  Aggregate 2 different columns with 2 different functions (group …
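
As a rough illustration of the benchmarked operation, here is a minimal Pandas sketch that times the three steps on a tiny synthetic file. It is not the post's actual benchmark script: the column names (k1, k2, k3, v1, v2), the sum and mean aggregations, and the helper names make_csv_gz, run_benchmark and demo are all assumptions chosen for illustration, since the excerpt does not say which columns or functions were used.

```python
import gzip
import os
import tempfile
import time

import pandas as pd


def make_csv_gz(path, n_rows=1000):
    """Write a small synthetic gzip-compressed CSV (hypothetical schema)."""
    with gzip.open(path, "wt", newline="") as f:
        f.write("k1,k2,k3,v1,v2\n")
        for i in range(n_rows):
            f.write(f"{i % 2},{i % 3},{i % 5},{i},{i * 0.5}\n")


def run_benchmark(path):
    """Time load + group-by + aggregate; returns (result, seconds)."""
    start = time.perf_counter()
    # Step 1: create a dataframe from a compressed file.
    df = pd.read_csv(path, compression="gzip")
    # Steps 2 and 3: group by 3 columns, then aggregate 2 columns
    # with 2 different functions (sum and mean are assumed here).
    result = (
        df.groupby(["k1", "k2", "k3"])
        .agg({"v1": "sum", "v2": "mean"})
        .reset_index()
    )
    elapsed = time.perf_counter() - start
    return result, elapsed


def demo(n_rows=1000):
    """Generate a temporary file and run the timed operation on it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.csv.gz")
        make_csv_gz(path, n_rows)
        return run_benchmark(path)


if __name__ == "__main__":
    result, elapsed = demo()
    print(result.head())
    print(f"elapsed: {elapsed:.4f} s")
```

The timing wraps the load and the aggregation together, so it measures the end-to-end operation rather than each step separately. Timings from a file this small say nothing about behaviour at 100 million rows.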