
Measuring The Speed of Recent Pandas 2.0 Against Polars and Datatable — Still Not Good Enough


Image by author via Midjourney

People have been complaining about Pandas’ speed ever since they tried reading their first gigabyte-sized dataset with read_csv and realized they had to wait for – gasp – five seconds. And yes, I was one of those complainers.

Five seconds may not sound like a lot, but when loading the dataset alone takes that long, subsequent operations often take just as long. And since speed is one of the most important things in quick-and-dirty data exploration, you can get very frustrated.

That’s why the folks at PyData recently announced the planned release of Pandas 2.0 with a freshly minted PyArrow backend. For those totally unaware, PyArrow, on its own, is a nifty little library designed for high-performance, memory-efficient manipulation of arrays.
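
To give a quick flavor of what that means, here is a tiny, stand-alone illustration (not part of the benchmark):

import pyarrow as pa

# An Arrow array is typed, columnar, and has first-class support for nulls
arr = pa.array([1, 2, None, 4])
print(arr.type)        # int64
print(arr.null_count)  # 1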

People sincerely hope the new backend will bring considerable speed-ups over vanilla Pandas. This article will test that glimmer of hope by comparing the PyArrow backend against two of the fastest DataFrame libraries: Datatable and Polars.

Haven’t people already done this?

What’s the point of doing this benchmark when H2O already runs the popular db-benchmark, which measures the computation speed of nearly 15 libraries on three data manipulation operations over three different dataset sizes? My benchmark couldn’t possibly be as complete.

Well, for one, the benchmark didn’t include Pandas with the PyArrow backend and was last updated in 2021, which was ages ago.

Secondly, that benchmark was run on a monster of a machine with 40 CPU cores, 128 GB of RAM, and a 20 GB GPU to boot (cuDF, anyone?). The general public doesn’t usually have access to such machines, so it is important to see the differences between the libraries on everyday hardware like mine: a modest CPU with a dozen cores and 32 gigs of RAM.

Lastly, I advocate for total transparency in the process, so I will explain the benchmark code in detail and present it as a GitHub Gist that you can run on your own machine.

Installation and setup

We start by installing the RC (release candidate) of Pandas 2.0 along with the latest versions of PyArrow, Datatable, and Polars.

pip install -U "pandas==2.0.0rc0" pyarrow datatable polars

import datatable as dt
import pandas as pd
import polars as pl

dt.__version__  # '1.0.0'
pd.__version__  # '2.0.0rc0'
pl.__version__  # '0.16.14'

I created a synthetic dataset with the NumPy and Faker libraries to simulate typical features of a census dataset and saved it in CSV and Parquet formats. Here are the paths to the files:

from pathlib import Path

data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

Take a look at this GitHub gist to see the code that generated the data.

There are 50 million rows with seven features, bringing the file size to about 2.5 GB.
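
The gist has the exact generation code; purely for illustration, a rough sketch of the approach could look something like the following. The column names, value ranges, and the pool-and-sample trick are my assumptions, not the gist’s actual contents:

import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
n = 50_000_000  # 50 million rows

# Build small pools of fake values and sample from them with NumPy,
# since calling Faker 50 million times would take forever
names = [fake.name() for _ in range(1_000)]
jobs = [fake.job() for _ in range(500)]
states = [fake.state() for _ in range(50)]

df = pd.DataFrame(
    {
        "name": np.random.choice(names, n),
        "occupation": np.random.choice(jobs, n),
        "state": np.random.choice(states, n),
        "age": np.random.randint(18, 90, n),
        "income": np.random.normal(50_000, 15_000, n).round(2),
        "household_size": np.random.randint(1, 7, n),
        "is_married": np.random.choice([True, False], n),
    }
)

df.to_csv("data/census_data.csv", index=False)
df.to_parquet("data/census_data.parquet", index=False)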

Benchmark results

Before showing the code, let’s look at the good stuff — the benchmark results:

[Bar chart of benchmark runtimes per task and library. Image by author]

Right off the bat, we can see that PyArrow Pandas comes in last (or second to last in groupby) across all categories.

Please don’t mistake the missing bars in the Parquet reading and writing categories for zero runtimes; those operations simply aren’t supported in Datatable.

In the other categories, Datatable and Polars share the top spot, with Polars having a slight edge.

Writing to CSV has always been a slow process for Pandas, and I guess a new backend isn’t enough to change that.

Should you switch?

So, time for the million-dollar question — should you switch to the faster Polars or Datatable?

And the answer is the one I so hate: “it depends.” Are you willing to sacrifice Pandas’ almost two-decade maturity and, let’s admit it, stupidly simple and familiar syntax for superior speed?

If so, keep in mind that the time you spend learning the syntax of a new library may cancel out its performance gains.

But if all you do is work with massive datasets, learning either of these fast libraries may well be worth the effort in the long run.

If you decide to stick with Pandas, give the Enhancing Performance page of the Pandas user guide a thorough, attentive read. It outlines tips and tricks for adding extra fuel to the Pandas engine without resorting to third-party libraries.
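
As a taste of what that page covers, one such trick is offloading expression evaluation to the numexpr engine via eval. A small sketch, assuming numexpr is installed and using made-up column names:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

# Evaluate the whole expression in one pass with numexpr
# instead of materializing an intermediate array per operation
df["c"] = df.eval("a * b + 1", engine="numexpr")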

Also, if you are stuck with a large CSV file and still want to use Pandas, you should memorize the following code snippet:

import datatable as dt
import pandas as pd

df = dt.fread("data.csv").to_pandas()

It reads the file with the speed of Datatable, and the conversion to a Pandas DataFrame is nearly instantaneous.

Benchmark code

OK, let’s finally see the code.

The first thing to do after importing the libraries is to define a DataFrame to store the benchmark results. This will make things much easier during plotting:

import time

import datatable as dt
import pandas as pd
import polars as pl

# Define a DataFrame to store the results
results_df = pd.DataFrame(
    columns=["Function", "Library", "Runtime (s)"]
)

It has three columns: one for the task name, another for the library name, and another for storing the runtime.

Then, we define a timer decorator that performs the following tasks:

  1. Measures the runtime of the decorated function.
  2. Extracts the function’s name and the value of its library parameter.
  3. Stores the runtime, function name, and library name into the passed results DataFrame.

def timer(results: pd.DataFrame):
    """
    A decorator to measure the runtime of the passed function.
    It stores the runtime, the function name, and the passed
    function's "library" parameter into the `results` DataFrame
    as a single row.
    """

The idea is to define a single general function, like read_csv, that reads CSV files with any of the three libraries, controlled by a parameter like library:

# Task 1: Reading CSVs
@timer(results_df)
def read_csv(path, library):
    if library == "pandas":
        return pd.read_csv(path, engine="pyarrow")
    elif library == "polars":
        return pl.read_csv(path)
    elif library == "datatable":
        return dt.fread(str(path))

Notice how we’re decorating the function with timer(results_df).

We define functions for the rest of the tasks in a similar way (see the function bodies in the Gist):

# Task 2: Writing to CSVs
@timer(results_df)
def write_to_csv(df, path, library):
...

# Task 3: Reading Parquet
@timer(results_df)
def read_parquet(path, library):
...

# Task 4: Writing to Parquet
@timer(results_df)
def write_to_parquet(df, path, library):
...

# Task 5: Sort
@timer(results_df)
def sort(df, column, library):
...

# Task 6: Groupby
@timer(results_df)
def groupby(df, library):
...
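
The Gist holds the authoritative bodies; to show the pattern, here is a sketch of two of these stubs that reuses the same dispatch-on-library approach as read_csv (the exact calls in the benchmark may differ):

@timer(results_df)
def write_to_csv(df, path, library):
    if library == "pandas":
        df.to_csv(path, index=False)
    elif library == "polars":
        df.write_csv(str(path))
    elif library == "datatable":
        df.to_csv(str(path))

@timer(results_df)
def sort(df, column, library):
    if library == "pandas":
        return df.sort_values(column)
    elif library == "polars":
        return df.sort(column)
    elif library == "datatable":
        return df.sort(column)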

Then, we run the functions for each of the libraries:

from pathlib import Path

# Define the file paths
data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

# libraries = ["pandas", "polars", "datatable"]
l = "datatable"

# Task 3/4
df = read_parquet(data_parquet, library=l)
write_to_parquet(df, data_parquet, library=l)

# Task 1/2
df = read_csv(data_csv, library=l)
write_to_csv(df, data_csv, library=l)

# Task 5/6
sort(df, "age", library=l)
groupby(df, library=l)

To avoid memory errors, I stayed away from loops and ran the benchmark in a Jupyter Notebook three times, changing the l variable each time.

Then, we create the benchmark figure with the following simple bar chart in lovely Seaborn:

import seaborn as sns

g = sns.catplot(
    data=results_df,
    kind="bar",
    x="Function",
    y="Runtime (s)",
    hue="Library",
)
[The resulting benchmark chart, shown earlier. Image by author]

Things are changing

For years now, Pandas has stood on the shoulders of NumPy as it boomed in popularity. NumPy was kind enough to lend its features for fast computation and array manipulation.

But this approach was limited by NumPy’s poor support for text and missing values. Pandas couldn’t fall back on native Python data types like lists and dictionaries either, because those would be laughably slow at scale.

So, Pandas has been quietly moving away from NumPy for a few years now. For example, it introduced a dedicated string dtype back in 2020, with a PyArrow-backed version following shortly after. It has also been relying on extensions written in lower-level languages such as C and C++ for other complex data types like dates with time zones or categoricals.

Now, Pandas 2.0 gets a fully-fledged backend that supports all data types via Apache Arrow’s PyArrow implementation. Apart from the apparent speed improvements, it provides much better support for missing values, interoperability, and a wider range of data types.
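
In Pandas 2.0, readers such as read_csv and read_parquet accept a dtype_backend argument that makes the resulting DataFrame use Arrow-backed dtypes end to end:

import pandas as pd

# Load the Parquet file straight into Arrow-backed dtypes
df = pd.read_parquet("data/census_data.parquet", dtype_backend="pyarrow")

print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow], ...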

So, even though the backend will still be slower than the other DataFrame libraries, I’m eagerly awaiting its official release. Thanks for reading!

Here are a few pages to learn more about Pandas 2.0 and the PyArrow backend:

If you enjoyed this article and, let’s face it, its bizarre writing style, consider supporting me by signing up to become a Medium member. Membership costs $4.99 a month and gives you unlimited access to all my stories and hundreds of thousands of articles written by more experienced folks. If you sign up through this link, I’ll earn a small commission at no extra cost to you.
