Measuring the Speed of the New Pandas 2.0 Against Polars and Datatable — Still Not Good Enough


Image by author via Midjourney

People have been complaining about Pandas’ speed ever since they tried reading their first gigabyte-sized dataset with read_csv and realized they had to wait for – gasp – five seconds. And yes, I was one of those complainers.

Five seconds may not sound like a lot, but when loading the dataset alone takes that long, subsequent operations will often take just as long. And since speed is one of the most essential things in quick, dirty data exploration, you can get very frustrated.

For this reason, the folks at PyData recently announced the planned release of Pandas 2.0 with a freshly minted PyArrow backend. For those totally unaware, PyArrow, by itself, is a nifty little library designed for high-performance, memory-efficient manipulation of arrays.

People sincerely hope the new backend will bring considerable speed-ups over vanilla Pandas. This article will test that glimmer of hope by comparing the PyArrow backend against two of the fastest DataFrame libraries: Datatable and Polars.
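
Before diving in, here is a minimal sketch of what opting into the new backend looks like in Pandas 2.0; the file name is just a placeholder:

import pandas as pd

# dtype_backend="pyarrow" stores columns as Arrow-backed dtypes,
# while engine="pyarrow" also uses PyArrow's multithreaded CSV parser
df = pd.read_csv(
    "some_file.csv",  # placeholder path
    engine="pyarrow",
    dtype_backend="pyarrow",
)
print(df.dtypes)  # e.g., int64[pyarrow], string[pyarrow]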

Haven’t people already done this?

What’s the point of doing this benchmark when H2O currently runs the popular benchmark that measures the computation speed of almost 15 libraries on three data manipulation operations over three different dataset sizes? My benchmark couldn’t possibly be as complete.

Well, for one, that benchmark didn’t include Pandas with the PyArrow backend and was last updated in 2021, which was ages ago.

Secondly, the benchmark was run on a monster of a machine with 40 CPU cores hopped up on 128 GB of RAM, plus a 20 GB GPU (cuDF, anyone?). The general populace doesn’t usually have access to such machines, so it is essential to see the differences between the libraries on everyday devices like mine, which features a modest CPU with a dozen cores and 32 gigs of RAM.

Lastly, I advocate for total transparency in the process, so I’ll explain the benchmark code in detail and present it as a GitHub Gist that you can run on your own machine.

Installation and setup

We start by installing the RC (release candidate) of Pandas 2.0 together with the latest versions of PyArrow, Datatable, and Polars.

pip install -U "pandas==2.0.0rc0" pyarrow datatable polars
import datatable as dt
import pandas as pd
import polars as pl
dt.__version__
'1.0.0'
pd.__version__
'2.0.0rc0'
pl.__version__
'0.16.14'

I created a synthetic dataset with the NumPy and Faker libraries to simulate typical features in a census dataset and saved it in CSV and Parquet formats. Here are the paths to the files:

from pathlib import Path

data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

Take a look at this GitHub gist to see the code that generated the data.

There are 50 million rows with seven features, clocking the file size in at about 2.5 GB.
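
For a sense of what the generator might look like, here is a minimal sketch with NumPy and Faker; the column names and distributions are stand-ins of mine, not the Gist’s exact code:

from pathlib import Path

import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()
n = 1_000_000  # scale up to 50_000_000 to approximate the ~2.5 GB files

# Reuse a small pool of fake names; calling Faker once per row would be very slow
names = np.random.choice([fake.name() for _ in range(1_000)], size=n)
df = pd.DataFrame(
    {
        "name": names,
        "age": np.random.randint(0, 100, size=n),
        "income": np.random.normal(50_000, 15_000, size=n).round(2),
    }
)

Path("data").mkdir(exist_ok=True)
df.to_csv("data/census_data.csv", index=False)
df.to_parquet("data/census_data.parquet")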

Benchmark results

Before showing the code, let’s see the good stuff — the benchmark results:

[Figure: benchmark results as a bar chart of runtime per task and library. Image by author]

Right off the bat, we can see that PyArrow Pandas comes in last (or second to last in groupby) across all categories.

Please don’t mistake the nonexistent bars in the Parquet reading and writing categories for 0 runtimes. Those operations simply aren’t supported in Datatable.

In the other categories, Datatable and Polars share the top spot, with Polars having a slight edge.

Writing to CSVs has always been a slow process for Pandas, and I guess a new backend isn’t enough to change that.

Should you switch?

So, time for the million-dollar question — should you switch to the faster Polars or Datatable?

And the answer is the I-so-hate “it depends.” Are you willing to sacrifice Pandas’ almost two-decade maturity and, let’s admit it, stupidly easy and familiar syntax for superior speed?

If so, keep in mind that the time you spend learning the syntax of a new library may balance out its performance gains.

But if all you do is work with massive datasets, learning either of these fast libraries may well be worth the effort in the long run.

If you decide to stick with Pandas, give the Enhancing Performance page of the Pandas user guide a thorough, attentive read. It outlines some tips and tricks to add extra fuel to the Pandas engine without resorting to third-party libraries.
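
As a taste of what that page covers, here is a small sketch of the eval() trick, which fuses arithmetic into a single expression instead of materializing intermediate arrays (the frame below is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

slow = df["a"] + df["b"] * 2  # builds intermediate Series along the way
fast = df.eval("a + b * 2")   # evaluated in one pass (via numexpr, if installed)

assert np.allclose(slow, fast)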

Also, if you are stuck with a large CSV file and still want to use Pandas, you should memorize the following code snippet:

import datatable as dt
import pandas as pd

df = dt.fread("data.csv").to_pandas()

It reads the file at Datatable’s speed, and the conversion to a Pandas DataFrame is nearly instantaneous.

Benchmark code

OK, let’s finally see the code.

The first thing to do after importing the libraries is to define a DataFrame to store the benchmark results. This will make things much easier during plotting:

import time

import datatable as dt
import pandas as pd
import polars as pl

# Define a DataFrame to store the results
results_df = pd.DataFrame(
    columns=["Function", "Library", "Runtime (s)"]
)

It has three columns: one for the task name, another for the library name, and a third for the runtime.

Then, we define a timer decorator that performs the following tasks:

  1. Measures the runtime of the decorated function.
  2. Extracts the function’s name and the value of its library parameter.
  3. Stores the runtime, function name, and library name in the passed results DataFrame.

def timer(results: pd.DataFrame):
    """
    Measure the runtime of the decorated function and store it, the
    function's name, and its "library" parameter as a row in `results`.
    """
    # A straightforward implementation matching the docstring (the Gist may differ)
    def decorator(func):
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)  # assumes a `library` keyword argument
            results.loc[len(results)] = [func.__name__, kwargs["library"], time.time() - start]
            return result
        return wrapper
    return decorator

The idea is to define a single general function, like read_csv, that reads CSV files with any of the three libraries, controlled by a parameter like library:

# Task 1: Reading CSVs
@timer(results_df)
def read_csv(path, library):
    if library == "pandas":
        return pd.read_csv(path, engine="pyarrow")
    elif library == "polars":
        return pl.read_csv(path)
    elif library == "datatable":
        return dt.fread(str(path))

Notice how we’re decorating the function with timer(results_df).

We define functions for the rest of the tasks in a similar way (see the function bodies in the Gist; a plausible sketch of one follows the stubs below):

# Task 2: Writing to CSVs
@timer(results_df)
def write_to_csv(df, path, library):
    ...

# Task 3: Reading Parquet
@timer(results_df)
def read_parquet(path, library):
    ...

# Task 4: Writing to Parquet
@timer(results_df)
def write_to_parquet(df, path, library):
    ...

# Task 5: Sort
@timer(results_df)
def sort(df, column, library):
    ...

# Task 6: Groupby
@timer(results_df)
def groupby(df, library):
    ...
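
For illustration, here is a plausible body for Task 2, following the same pattern as read_csv; treat it as an assumption on my part rather than the Gist’s exact code:

# A plausible sketch of Task 2, not necessarily the Gist's exact code
@timer(results_df)
def write_to_csv(df, path, library):
    if library == "pandas":
        df.to_csv(path, index=False)
    elif library == "polars":
        df.write_csv(str(path))
    elif library == "datatable":
        df.to_csv(str(path))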

Then, we run the functions for each of the libraries:

from pathlib import Path

# Define the file paths
data = Path("data")
data_csv = data / "census_data.csv"
data_parquet = data / "census_data.parquet"

# libraries = ["pandas", "polars", "datatable"]
l = "datatable"

# Task 3/4
df = read_parquet(data_parquet, library=l)
write_to_parquet(df, data_parquet, library=l)

# Task 1/2
df = read_csv(data_csv, library=l)
write_to_csv(df, data_csv, library=l)

# Task 5/6
sort(df, "age", library=l)
groupby(df, library=l)

To avoid memory errors, I stayed away from loops and ran the benchmark in a Jupyter Notebook three times, changing the l variable each time.

Then, we create the benchmark figure with the following simple bar chart in lovely Seaborn:

import seaborn as sns

g = sns.catplot(
    data=results_df,
    kind="bar",
    x="Function",
    y="Runtime (s)",
    hue="Library",
)
[Figure: the benchmark bar chart shown earlier. Image by author]

Things are changing

For years now, Pandas has stood on the shoulders of NumPy as it boomed in popularity. NumPy was kind enough to lend its features for fast computations and array manipulations.

But this approach was limited because of NumPy’s terrible support for text and missing values. Pandas couldn’t use native Python data types like lists and dictionaries either, because those would be laughably slow at scale.

So, Pandas has been quietly moving away from NumPy for a few years now. For instance, it introduced PyArrow datatypes for strings as early as 2020. It has also been using extensions written in other languages, such as C++ and Rust, for other complex data types like dates with time zones or categoricals.

Now, Pandas 2.0 has a fully-fledged backend that supports all data types through Apache Arrow’s PyArrow implementation. Apart from the apparent speed improvements, it provides much better support for missing values, interoperability, and a wider range of data types.
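
To make the missing-value point concrete, here is a minimal sketch comparing NumPy-backed and Arrow-backed dtypes (the toy columns are made up):

import pandas as pd

# A missing string and a missing integer under the default NumPy backend
df = pd.DataFrame({"name": ["Ada", None], "age": [36, None]})
print(df.dtypes)  # name: object, age: float64 (the NA forces ints to floats)

# The same frame with Arrow-backed dtypes (pandas 2.0+)
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")
print(df_arrow.dtypes)  # name: string[pyarrow], age: int64[pyarrow] with a real NA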

So, even though the backend will still be slower than other DataFrame libraries, I’m eagerly awaiting its official release. Thanks for reading!

Here are a few pages to learn more about Pandas 2.0 and the PyArrow backend:

If you enjoyed this article and, let’s face it, its bizarre writing style, consider supporting me by signing up to become a Medium member. Membership costs $4.99 a month and gives you unlimited access to all my stories and hundreds of thousands of articles written by more experienced folks. If you sign up through this link, I’ll earn a small commission at no extra cost to you.
