Measuring The Speed of Latest Pandas 2.0 Against Polars and Datatable — Still Not Good Enough
Image by author via Midjourney

People have been complaining about Pandas’ speed ever since they tried reading their first gigabyte-sized dataset with read_csv and realized they had to wait for – gasp – five seconds. And yes, I was one of those complainers.

Five seconds may not sound like much, but if loading the dataset alone takes that long, subsequent operations will often take just as long. And since speed is one of the most important things in quick-and-dirty data exploration, you can get very frustrated.

For this reason, the folks at PyData recently announced the planned release of Pandas 2.0 with a freshly minted PyArrow backend. For those totally unaware, PyArrow on its own is a nifty little library designed for high-performance, memory-efficient manipulation of arrays.

People sincerely hope the new backend will bring considerable speed-ups over vanilla Pandas. This article will test that glimmer of hope by comparing the PyArrow backend against two of the fastest DataFrame libraries: Datatable and Polars.

Haven’t people already done this?

What’s the point of doing this benchmark when H2O already runs the popular database-like ops benchmark, which measures the computation speed of nearly 15 libraries on three data manipulation operations over three different dataset sizes? My benchmark couldn’t possibly be as complete.

Well, for one, the benchmark didn’t include Pandas with the PyArrow backend and was last updated in 2021, which was ages ago.

Secondly, the benchmark was run on a monster of a machine with 40 CPU cores hopped up on 128 GB of RAM, plus a 20 GB GPU to boot (cuDF, anyone?). The general populace doesn’t usually have access to such machines, so it is important to see the differences between the libraries on everyday devices like mine, which features a modest CPU with a dozen cores and 32 gigs of RAM.

Lastly, I advocate for total transparency in the process, so I will explain the benchmark code in detail and present it as a GitHub Gist you can run on your own machine.

Installation and setup

We start by installing the RC (release candidate) of Pandas 2.0, along with the latest versions of PyArrow, Datatable, and Polars.

pip install -U "pandas==2.0.0rc0" pyarrow datatable polars

import datatable as dt
import pandas as pd
import polars as pl

dt.__version__
'1.0.0'

pd.__version__
'2.0.0rc0'

pl.__version__
'0.16.14'

I created a synthetic dataset with the NumPy and Faker libraries to simulate typical features in a census dataset and saved it in CSV and Parquet formats. Here are the paths to the files:

from pathlib import Path

data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

Check out this GitHub gist to see the code that generated the data.

There are 50 million rows with seven features, bringing the file size to about 2.5 GB.
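The Gist itself is not reproduced here, but a generation script along those lines might look roughly like this (the column names, distributions, and seed below are my own guesses rather than what the Gist actually uses):

import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
rng = np.random.default_rng(42)
n = 50_000_000  # 50 million rows

# Only a few of the seven features are shown; the names, distributions,
# and category pools are illustrative, not necessarily those in the Gist
states = [fake.state() for _ in range(50)]
jobs = [fake.job() for _ in range(100)]

df = pd.DataFrame(
    {
        "age": rng.integers(18, 95, n),
        "income": rng.normal(50_000, 15_000, n).round(2),
        "state": rng.choice(states, n),
        "occupation": rng.choice(jobs, n),
    }
)

df.to_csv("data/census_data.csv", index=False)
df.to_parquet("data/census_data.parquet", index=False)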

Benchmark results

Before showing the code, let’s see the good stuff — the benchmark results:

[Benchmark results: runtime of each task per library. Image by author]

Right off the bat, we can see that PyArrow Pandas comes in last (or second to last in groupby) across all categories.

Please don’t mistake the nonexistent bars in the Parquet reading and writing categories for zero runtimes: those operations simply aren’t supported in Datatable.

In the other categories, Datatable and Polars share the top spot, with Polars having a slight edge.

Writing to CSVs has always been a slow process for Pandas, and I guess a new backend isn’t enough to change that.

Should you switch?

So, time for the million-dollar question — should you switch to the faster Polars or Datatable?

And the answer is the one I so hate: "it depends." Are you willing to sacrifice Pandas’ almost two decades of maturity and, let’s admit it, stupidly easy and familiar syntax for superior speed?

If so, keep in mind that the time you spend learning the syntax of a new library may cancel out its performance gains.

But if all you do is work with massive datasets, learning either of these fast libraries may be well worth the effort in the long run.

If you decide to stick with Pandas, give the Enhancing Performance page of the Pandas user guide a thorough, attentive read. It outlines tips and tricks for adding extra fuel to the Pandas engine without resorting to third-party libraries.

Also, if you are stuck with a large CSV file and still want to use Pandas, you should memorize the following code snippet:

import datatable as dt
import pandas as pd

df = dt.fread("data.csv").to_pandas()

It reads the file at Datatable’s speed, and the conversion to a Pandas DataFrame is almost instantaneous.

Benchmark code

OK, let’s finally see the code.

The first thing to do after importing the libraries is to define a DataFrame to store the benchmark results. This will make things much easier during plotting:

import time

import datatable as dt
import pandas as pd
import polars as pl

# Define a DataFrame to store the results
results_df = pd.DataFrame(
    columns=["Function", "Library", "Runtime (s)"]
)

It has three columns: one for the task name, another for the library name, and another for storing the runtime.

Then, we define a timer decorator that performs the following tasks:

  1. Measures the runtime of the decorated function.
  2. Extracts the function’s name and the value of its library parameter.
  3. Stores the runtime, function name, and library name as a single row in the passed results DataFrame.

def timer(results: pd.DataFrame):
    """
    A decorator to measure the runtime of the passed function.
    It stores the runtime, the function name, and the passed
    function's "library" parameter into the `results` DataFrame
    as a single row.
    """

The idea is to define a single general function like read_csv that reads CSV files with any of the three libraries, controlled by a parameter like library:

# Task 1: Reading CSVs
@timer(results_df)
def read_csv(path, library):
    if library == "pandas":
        return pd.read_csv(path, engine="pyarrow")
    elif library == "polars":
        return pl.read_csv(path)
    elif library == "datatable":
        return dt.fread(str(path))

Notice how we’re decorating the function with timer(results_df).

We define functions for the rest of the tasks in a similar way (see the function bodies in the Gist):

# Task 2: Writing to CSVs
@timer(results_df)
def write_to_csv(df, path, library):
    ...

# Task 3: Reading Parquet
@timer(results_df)
def read_parquet(path, library):
    ...

# Task 4: Writing to Parquet
@timer(results_df)
def write_to_parquet(df, path, library):
    ...

# Task 5: Sort
@timer(results_df)
def sort(df, column, library):
    ...

# Task 6: Groupby
@timer(results_df)
def groupby(df, library):
    ...
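To make the dispatch pattern concrete, here is one way the sort task could be filled in (an illustrative sketch, not necessarily the exact Gist code):

@timer(results_df)
def sort(df, column, library):
    if library == "pandas":
        return df.sort_values(column)   # pandas DataFrame
    elif library == "polars":
        return df.sort(column)          # polars DataFrame
    elif library == "datatable":
        return df.sort(column)          # datatable Frame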

Then, we run the functions for each of the libraries:

from pathlib import Path

# Define the file paths
data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

# libraries = ["pandas", "polars", "datatable"]
l = "datatable"

# Task 3/4
df = read_parquet(data_parquet, library=l)
write_to_parquet(df, data_parquet, library=l)

# Task 1/2
df = read_csv(data_csv, library=l)
write_to_csv(df, data_csv, library=l)

# Task 5/6
sort(df, "age", library=l)
groupby(df, library=l)

To avoid memory errors, I stayed away from loops and ran the benchmark in a Jupyter Notebook three times, changing the l variable each time.

Then, we create the benchmark figure with the following simple bar chart in lovely Seaborn:

import seaborn as sns

g = sns.catplot(
    data=results_df,
    kind="bar",
    x="Function",
    y="Runtime (s)",
    hue="Library",
)
[The resulting bar chart of benchmark runtimes. Image by author]

Things are changing

For years now, Pandas has stood on the shoulders of NumPy as it boomed in popularity. NumPy was kind enough to lend its features for fast computations and array manipulations.

But this approach was limited by NumPy’s poor support for text and missing values. Pandas couldn’t fall back on native Python data types like lists and dictionaries either, because those would be laughably slow at scale.

So, Pandas has been quietly moving away from NumPy for a few years now. For example, it introduced PyArrow data types for strings back in 2020. It has also been using extensions written in other languages, such as C++ and Rust, for other complex data types like dates with time zones or categoricals.

Now, Pandas 2.0 has a fully fledged backend that supports all data types through Apache Arrow’s PyArrow implementation. Apart from the obvious speed improvements, it provides much better support for missing values, better interoperability, and a wider range of data types.
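For reference, opting into the Arrow-backed dtypes in the 2.0 RC is a single keyword argument on the I/O readers (the API may still shift before the final release):

import pandas as pd

# Parse with the PyArrow engine and store columns as Arrow-backed dtypes
df = pd.read_csv(
    "data/census_data.csv",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

df.dtypes  # e.g., int64[pyarrow], double[pyarrow], string[pyarrow]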

So, though the backend will still be slower than other DataFrame libraries, I’m eagerly awaiting its official release. Thanks for reading!

Here are a few pages to learn more about Pandas 2.0 and the PyArrow backend:
