PySpark up to 150X faster than Pandas & trumps both Pandas & Koalas on simple benchmark test.

Christopher Richgruber
Dec 28, 2020

Pandas is an easy-to-use and powerful tool, but once we start working with large datasets, it may not be the best solution.

I ran a comparison test of Pandas, Koalas and PySpark on my 2015 MacBook (2.7 GHz Dual-Core Intel Core i5, 8 GB 1867 MHz DDR3). Per Koalas’ documentation, Koalas implements “the pandas DataFrame API on top of Apache Spark.” Per PySpark’s documentation, “PySpark is the Python API for Spark.”

To do the test, you’ll need to install both PySpark and Koalas. You’ll also need to install the Java JDK if you haven’t already.

Install PySpark and Koalas
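Both libraries are published on PyPI, so `pip install pyspark koalas` is the usual way to get them. As a quick sanity check before running the benchmark, the sketch below reports which pieces are present. The helper name `is_installed` is my own, and the check only confirms that the modules and a `java` executable can be found, not that their versions are compatible.

```python
import importlib.util
import shutil


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "databricks") is missing
        return False


status = {
    "pandas": is_installed("pandas"),
    "koalas": is_installed("databricks.koalas"),  # Koalas is imported as databricks.koalas
    "pyspark": is_installed("pyspark"),
    "java": shutil.which("java") is not None,  # Spark needs a Java runtime on the PATH
}

for name, found in status.items():
    print(f"{name:8s} {'found' if found else 'missing'}")
```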

Import Packages per below

Import Packages

For our test, we’ll use the “all_stocks_5yr.csv” from Kaggle:
https://www.kaggle.com/camnugent/sandp500

Import Stock CSV
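A minimal sketch of loading the same CSV with each library follows. The function names, the file location and the Spark application name are assumptions of mine; the imports sit inside the functions so the block still loads when only some of the libraries are installed.

```python
CSV_PATH = "all_stocks_5yr.csv"  # assumed to sit in the working directory


def load_pandas(path: str = CSV_PATH):
    """Read the CSV into a Pandas DataFrame."""
    import pandas as pd
    return pd.read_csv(path)


def load_koalas(path: str = CSV_PATH):
    """Read the CSV into a Koalas DataFrame (Pandas-style API on Spark)."""
    import databricks.koalas as ks
    return ks.read_csv(path)


def load_pyspark(path: str = CSV_PATH):
    """Read the CSV into a PySpark DataFrame."""
    from pyspark.sql import SparkSession
    # The application name is arbitrary; getOrCreate reuses a running session
    spark = SparkSession.builder.appName("pandas_koalas_pyspark").getOrCreate()
    # header=True takes column names from the first row,
    # inferSchema=True lets Spark guess numeric types instead of reading strings
    return spark.read.csv(path, header=True, inferSchema=True)
```

With all three libraries installed, `load_pandas().head()`, `load_koalas().head()` and `load_pyspark().show(5)` give a first look at each dataframe.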

Let’s look at the data for the different packages. Of note, Koalas has many functions similar to those in Pandas, and its dataframes even look like Pandas dataframes.

Pandas, Koalas and PySpark Dataframes

For the performance test, we’re going to:
1. Run a Group By
2. Concat (Pandas and Koalas) / Union (PySpark) the dataframe with itself to make a dataframe double the original size
3. Repeat 1 & 2 with the larger dataframe
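The steps above can be sketched as a small library-agnostic timing loop. This is my own illustration, not the original benchmark code: `run_benchmark`, `timed` and the callables passed in are hypothetical names, and a plain Python list stands in for the dataframe so the loop can be run anywhere. I also assume one final Group By after the last doubling so that the largest size gets measured.

```python
import time
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_benchmark(data, group_by, double, count_rows, num_iter: int) -> List[Dict]:
    """Time a group by, double the data, and repeat num_iter times."""
    results = []
    for i in range(num_iter + 1):
        _, group_secs = timed(lambda: group_by(data))
        row = {"rows": count_rows(data), "group_by_s": group_secs, "concat_s": None}
        if i < num_iter:  # no doubling after the final group by
            data, row["concat_s"] = timed(lambda: double(data))
        results.append(row)
    return results


def mean_close_by_name(rows):
    """Stand-in group by: average the second field per name."""
    totals = defaultdict(lambda: [0.0, 0])
    for name, close in rows:
        totals[name][0] += close
        totals[name][1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


# Tiny made-up sample in place of the stock data
sample = [("AAL", 10.0), ("AAL", 14.0), ("AAPL", 100.0), ("AAPL", 102.0)]
results = run_benchmark(sample, mean_close_by_name, lambda d: d + d, len, num_iter=3)
for r in results:
    print(r)
```

Passing the operations in as callables keeps the loop identical for all three libraries; only the group by, doubling and row-count functions change.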

Our dataframe from Kaggle sits at 619,040 rows. We’ll set how many times we want to concat our data with the variable “num_iter”. For this example, we’ll concat 5 times, doubling the dataframe each time up to 19,809,280 rows, so the comparison gets more computationally intensive with each iteration. See code below for Pandas, Koalas and PySpark.
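Since every concat doubles the dataframe, the row count after each iteration is the starting size times a power of two. A short check of that arithmetic (variable names other than `num_iter` are my own):

```python
base_rows = 619_040  # rows in all_stocks_5yr.csv, per the article
num_iter = 5         # number of doublings

# Row count before any concat and after each of the num_iter doublings
sizes = [base_rows * 2 ** i for i in range(num_iter + 1)]

for i, rows in enumerate(sizes):
    print(f"after {i} concats: {rows:,} rows")
```

The last entry, 619,040 × 2^5, is the 19,809,280 rows used for the largest comparison.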

Pandas
Koalas
PySpark
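For readers without the original snippets at hand, here is a hedged sketch of what the two timed operations could look like in each library. Grouping by `Name` and averaging `close` is an assumption on my part, as are the function names; the original code may group or aggregate differently. Imports are kept inside the functions so the block loads even when only some libraries are installed.

```python
GROUP_COL = "Name"   # assumed grouping column (ticker symbol)
VALUE_COL = "close"  # assumed column to average


def pandas_group_by(df):
    """Mean of VALUE_COL per GROUP_COL in Pandas."""
    return df.groupby(GROUP_COL)[VALUE_COL].mean()


def pandas_double(df):
    """Stack the dataframe on itself with pd.concat."""
    import pandas as pd
    return pd.concat([df, df], ignore_index=True)


def koalas_group_by(kdf):
    """Same call as Pandas, executed on Spark through Koalas."""
    return kdf.groupby(GROUP_COL)[VALUE_COL].mean()


def koalas_double(kdf):
    """Stack the dataframe on itself with the Koalas concat."""
    import databricks.koalas as ks
    return ks.concat([kdf, kdf], ignore_index=True)


def pyspark_group_by(sdf):
    """Mean of VALUE_COL per GROUP_COL in PySpark."""
    # Spark evaluates lazily: this returns a new DataFrame describing the
    # aggregation. Add an action such as .collect() or .count() if the
    # timing should include the aggregation itself.
    return sdf.groupBy(GROUP_COL).mean(VALUE_COL)


def pyspark_double(sdf):
    """Stack the dataframe on itself with union (also lazy)."""
    return sdf.union(sdf)
```

Whether the original benchmark forced Spark to execute each step is not shown in this text, so treat the comments on lazy evaluation as a note on how Spark behaves rather than a description of the author's code.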

Results dataframe below.

Results data

Group By results show that PySpark was the winner across the board. At 19,809,280 rows, PySpark’s Group By is 153X faster than Pandas (1.454501/0.009491 ≈ 153). Interestingly, Pandas beat Koalas when the dataframe was smaller.
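The 153X figure is simply the ratio of the two Group By timings quoted above. A quick check, with variable names of my own choosing:

```python
pandas_time = 1.454501   # Pandas Group By timing at 19,809,280 rows, from the results
pyspark_time = 0.009491  # PySpark Group By timing at the same size

speedup = pandas_time / pyspark_time
print(f"PySpark is about {speedup:.0f}X faster on this Group By")
```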

Group By Comparison

Concat/Union shows similar results: across the board, PySpark was the winner.

Concat/Union Comparison

I’d be excited to see any other results people get from benchmarking.

Code can be found on my GitHub at: https://github.com/chrisrichgruber/pandas_koalas_pyspark
