PySpark up to 150X faster than Pandas, trumping both Pandas & Koalas in a simple benchmark test.
While Pandas is an easy-to-use and powerful tool, once we start working with large datasets, it may no longer be the best solution.
I ran a comparison test of Pandas, Koalas and PySpark on my 2015 MacBook (2.7 GHz Dual-Core Intel Core i5, 8 GB 1867 MHz DDR3). Per Koalas’ documentation, Koalas implements “the pandas DataFrame API on top of Apache Spark.” Per PySpark’s documentation, “PySpark is the Python API for Spark.”
To do the test, you’ll need to install both PySpark and Koalas. You’ll also need to install the Java JDK if you haven’t already.
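If you’re using pip, the installation might look something like this (both packages are on PyPI):

```
pip install pyspark koalas
```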
Import the packages per below:
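A minimal set of imports for the three libraries might look like the following (Koalas lives under the databricks namespace; time is used for the benchmark timings later on):

```python
import time

import pandas as pd
import databricks.koalas as ks
from pyspark.sql import SparkSession
```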
For our test, we’ll use the “all_stocks_5yr.csv” from Kaggle:
https://www.kaggle.com/camnugent/sandp500
Let’s look at the data in each of the packages. Of note, Koalas has many functions similar to Pandas and even looks like Pandas.
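As a sketch, loading the same CSV into each framework could look like this; the variable names and the local Spark session are my own choices for illustration:

```python
# A local Spark session backs both the Koalas and PySpark dataframes
spark = SparkSession.builder.appName("benchmark").getOrCreate()

# Read the same file into each of the three frameworks
pdf = pd.read_csv("all_stocks_5yr.csv")   # Pandas
kdf = ks.read_csv("all_stocks_5yr.csv")   # Koalas
sdf = spark.read.csv("all_stocks_5yr.csv", header=True, inferSchema=True)  # PySpark

# The Koalas API mirrors Pandas almost exactly
print(pdf.head())
print(kdf.head())
sdf.show(5)
```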
To do a performance test, we’re going to do:
1. A Group By
2. Concat (Pandas and Koalas) / Union (PySpark) the dataframe with itself to make a larger dataframe double the original size
3. Repeat 1 & 2 with the larger dataframe
Our dataframe from Kaggle sits at 619,040 rows. We’ll set how many times we want to concat our data with the variable “num_iter”. For this example, we’ll concat 5 times, doubling the dataframe each time, so each iteration is more computationally intensive than the last. See code below for Pandas, Koalas and PySpark.
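Here is a minimal sketch of what the timing loop could look like. The column names (“Name”, “close”) come from the Kaggle file, but the structure of the loop and the variable names are my own reconstruction; note that Koalas and PySpark evaluate lazily, so each timed operation ends with an action (to_pandas(), collect()) to force the computation:

```python
num_iter = 5
results = []

# Run num_iter + 1 passes so the final Group By is measured on
# 619,040 * 2**5 = 19,809,280 rows
for i in range(num_iter + 1):
    rows = len(pdf)  # all three dataframes stay the same size

    # 1. Group By, timed per framework
    start = time.time()
    pdf.groupby("Name")["close"].mean()
    pandas_gb = time.time() - start

    start = time.time()
    kdf.groupby("Name")["close"].mean().to_pandas()  # action forces Koalas to compute
    koalas_gb = time.time() - start

    start = time.time()
    sdf.groupBy("Name").mean("close").collect()      # action forces PySpark to compute
    pyspark_gb = time.time() - start

    # 2. Concat / Union the dataframe with itself to double its size
    start = time.time()
    pdf = pd.concat([pdf, pdf])
    pandas_cc = time.time() - start

    start = time.time()
    kdf = ks.concat([kdf, kdf])
    koalas_cc = time.time() - start

    start = time.time()
    sdf = sdf.union(sdf)
    pyspark_cc = time.time() - start

    results.append([rows, pandas_gb, koalas_gb, pyspark_gb,
                    pandas_cc, koalas_cc, pyspark_cc])
```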
Results dataframe below.
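One way to collect the timings into that dataframe (the column names here are illustrative, not necessarily the original code’s):

```python
results_df = pd.DataFrame(
    results,
    columns=["rows",
             "pandas_groupby", "koalas_groupby", "pyspark_groupby",
             "pandas_concat", "koalas_concat", "pyspark_concat"],
)
print(results_df)
```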
Group By results show that, across the board, PySpark was the winner. We see that at 19,809,280 rows, the Group By speed of PySpark is 153X faster than Pandas (1.454501/0.009491). It is interesting to see that Pandas beat Koalas when the dataframe was smaller.
Concat/Union shows similar results: across the board, PySpark was the winner.
I’d be excited to see any other results people get from benchmarking.
Code can be found on my GitHub at: https://github.com/chrisrichgruber/pandas_koalas_pyspark