PySpark up to 150X faster than Pandas, trumping both Pandas & Koalas in a simple benchmark test.
While Pandas is an easy-to-use and powerful tool, once we start working with large datasets, it may no longer be the best solution.
I ran a comparison test of Pandas, Koalas and PySpark on my 2015 MacBook (2.7 GHz Dual-Core Intel Core i5, 8 GB 1867 MHz DDR3). Per Koalas’ documentation, Koalas implements “the pandas DataFrame API on top of Apache Spark.” Per PySpark’s documentation, “PySpark is the Python API for Spark.”
To do the test, you’ll need to install both PySpark and Koalas. You’ll also need to install the Java JDK if you haven’t already.
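If you’re using pip, the installation might look something like this (both packages are on PyPI):

```
pip install pyspark koalas
```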
Import the packages per below:
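A minimal set of imports for the three libraries might look like the following (Koalas lives under the databricks namespace; time is used for the benchmark timings later on):

```python
import time

import pandas as pd
import databricks.koalas as ks
from pyspark.sql import SparkSession
```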
For our test, we’ll use the “all_stocks_5yr.csv” from Kaggle:
https://www.kaggle.com/camnugent/sandp500
Let’s look at the data in each of the packages. Of note, Koalas has many functions similar to Pandas and even looks like Pandas.
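As a sketch, loading the same CSV into each framework could look like this; the variable names and the local Spark session are my own choices for illustration:

```python
# A local Spark session backs both the Koalas and PySpark dataframes
spark = SparkSession.builder.appName("benchmark").getOrCreate()

# Read the same file into each of the three frameworks
pdf = pd.read_csv("all_stocks_5yr.csv")   # Pandas
kdf = ks.read_csv("all_stocks_5yr.csv")   # Koalas
sdf = spark.read.csv("all_stocks_5yr.csv", header=True, inferSchema=True)  # PySpark

# The Koalas API mirrors Pandas almost exactly
print(pdf.head())
print(kdf.head())
sdf.show(5)
```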
To do a performance test, we’re going to do:
1. A Group By
2. Concat (Pandas and Koalas) / Union (PySpark) the dataframe with itself to make a larger dataframe double the original size
3. Repeat 1 & 2 with the larger dataframe
Our dataframe from Kaggle sits at 619,040 rows. We’ll set how many times we want to concat our data with the variable “num_iter”. For this example, we’ll concat 5 times, doubling the dataframe each time, so each iteration is more computationally intensive than the last. See code below for Pandas, Koalas and PySpark.
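Here is a minimal sketch of what the timing loop could look like. The column names (“Name”, “close”) come from the Kaggle file, but the structure of the loop and the variable names are my own reconstruction; note that Koalas and PySpark evaluate lazily, so each timed operation ends with an action (to_pandas(), collect()) to force the computation:

```python
num_iter = 5
results = []

# Run num_iter + 1 passes so the final Group By is measured on
# 619,040 * 2**5 = 19,809,280 rows
for i in range(num_iter + 1):
    rows = len(pdf)  # all three dataframes stay the same size

    # 1. Group By, timed per framework
    start = time.time()
    pdf.groupby("Name")["close"].mean()
    pandas_gb = time.time() - start

    start = time.time()
    kdf.groupby("Name")["close"].mean().to_pandas()  # action forces Koalas to compute
    koalas_gb = time.time() - start

    start = time.time()
    sdf.groupBy("Name").mean("close").collect()      # action forces PySpark to compute
    pyspark_gb = time.time() - start

    # 2. Concat / Union the dataframe with itself to double its size
    start = time.time()
    pdf = pd.concat([pdf, pdf])
    pandas_cc = time.time() - start

    start = time.time()
    kdf = ks.concat([kdf, kdf])
    koalas_cc = time.time() - start

    start = time.time()
    sdf = sdf.union(sdf)
    pyspark_cc = time.time() - start

    results.append([rows, pandas_gb, koalas_gb, pyspark_gb,
                    pandas_cc, koalas_cc, pyspark_cc])
```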
Results dataframe below.
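One way to collect the timings into that dataframe (the column names here are illustrative, not necessarily the original code’s):

```python
results_df = pd.DataFrame(
    results,
    columns=["rows",
             "pandas_groupby", "koalas_groupby", "pyspark_groupby",
             "pandas_concat", "koalas_concat", "pyspark_concat"],
)
print(results_df)
```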
Group By results show that, across the board, PySpark was the winner. We see that at 19,809,280 rows, the Group By speed of PySpark is 153X faster than Pandas (1.454501/0.009491). It is interesting to see that Pandas beat Koalas when the dataframe was smaller.
Concat/Union shows similar results: across the board, PySpark was the winner.
I’d be excited to see any other results people get from benchmarking.
Code can be found on my GitHub at: https://github.com/chrisrichgruber/pandas_koalas_pyspark