The 3 Reasons Why I Have Permanently Switched From Pandas To Polars

Photo by Hans-Jurgen Mager on Unsplash

At the time of writing this post, it's been six years since I landed my first job in data science. And for those entire six years, Pandas has been the foundation of all my work: exploratory data analyses, impact analyses, data validations, model experimentation, you name it. My career was built on top of Pandas!

Needless to say, I had some serious Pandas lock-in.

That is, until I discovered Polars, the new "blazingly fast DataFrame library" for Python.

In this article, I'll explain:

  1. What Polars is, and what makes it so fast;
  2. The three reasons why I have permanently switched from Pandas to Polars:
    – The .arr namespace;
    – .scan_parquet() and .sink_parquet();
    – Data-oriented programming.

Maybe you've heard of Polars, maybe you haven't! Either way, it's slowly taking over Python's data-processing landscape, starting right here on Towards Data Science.

So what makes Polars so fast? From the Polars User Guide:

Polars is completely written in Rust (no runtime overhead!) and uses Arrow – the native arrow2 Rust implementation – as its foundation…
Polars is written in Rust which gives it C/C++ performance and allows it to fully control performance critical parts in a query engine…
…Unlike tools such as dask – which tries to parallelize existing single-threaded libraries like NumPy and Pandas – Polars is written from the ground up, designed for parallelization of queries on DataFrames

And there you have it. Polars is not just a framework for alleviating the single-threaded nature of Pandas, like dask or modin; rather, it's a full makeover of the Python dataframe, with the highly optimized Apache Arrow columnar memory format as its foundation and its own query optimization engine to boot. And the results on speed are mind-blowing (as per h2oai's data benchmark):

Image captured from h2oai’s data benchmark tool.

On a groupby operation over a 5GB dataframe, Polars is more than six times faster than Pandas!

This speed alone is enough to get anyone interested. But as you'll see in the rest of this article, the speed is what got me interested, but it's really the syntax that made me fall in love.

1. The .arr Namespace

Imagine the following scenario in Pandas: you have a dataset of families and some information about them, including a list of all the family members:

import pandas as pd

df = pd.DataFrame({
    "last_name": ["Johnson", "Jackson", "Smithson"],
    "members": [["John", "Ron", "Con"], ["Jack", "Rack"], ["Smith", "Pith", "With", "Lith"]],
    "city_of_residence": ["Boston", "New York City", "Dallas"]
})
print(df)

>>>> last_name members city_of_residence
0 Johnson [John, Ron, Con] Boston
1 Jackson [Jack, Rack] New York City
2 Smithson [Smith, Pith, With, Lith] Dallas

For your analysis, you need to create a new column from the first element of the members list. How do you do that? A search of the Pandas API will leave you lost, but a brief StackOverflow search will show you the answer!

The prevailing way to extract an element of a list in a Pandas column is to use the .str namespace (StackOverflow ref1, StackOverflow ref2), like this:

df["family_leader"] = df["members"].str[0]
print(df)

>>>> last_name members city_of_residence family_leader
0 Johnson [John, Ron, Con] Boston John
1 Jackson [Jack, Rack] New York City Jack
2 Smithson [Smith, Pith, With, Lith] Dallas Smith

If you're like me, you're probably wondering, "why do I have to use the .str namespace to handle a list data-type?"

Unfortunately, Pandas's .str namespace cannot do all the list operations one might desire; some things require a costly .apply, for instance. In Polars, however, this is not an issue. By conforming to Apache Arrow's columnar data format, Polars has all the standard data-types, and appropriate namespaces for handling all of them – including lists:

import polars as pl

df = pl.DataFrame({
    "last_name": ["Johnson", "Jackson", "Smithson"],
    "members": [["John", "Ron", "Con"], ["Jack", "Rack"], ["Smith", "Pith", "With", "Lith"]],
    "city_of_residence": ["Boston", "New York City", "Dallas"]
})
df = df.with_columns([
    pl.col("members").arr.get(0).alias("family_leader")
])
print(df)

>>>> ┌───────────┬─────────────────────────────┬───────────────────┬───────────────┐
│ last_name ┆ members ┆ city_of_residence ┆ family_leader │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ str ┆ str │
╞═══════════╪═════════════════════════════╪═══════════════════╪═══════════════╡
│ Johnson ┆ ["John", "Ron", "Con"] ┆ Boston ┆ John │
│ Jackson ┆ ["Jack", "Rack"] ┆ New York City ┆ Jack │
│ Smithson ┆ ["Smith", "Pith", … "Lith"] ┆ Dallas ┆ Smith │
└───────────┴─────────────────────────────┴───────────────────┴───────────────┘

That's right: Polars is so explicit about data-types that it even tells you the data-type of every column in your dataframe every time you print it!

It doesn't stop there, though. Not only does the Pandas API require using one data-type's namespace to handle another data-type, but the API has become so bloated that there are often many ways to do the same thing. This can be confusing, especially for newcomers. Consider the following code snippet:

import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1],
    "b": [4, 5, 6]
})

column_name_indexer = ["a"]
boolean_mask_indexer = df["b"] == 5
slice_indexer = slice(1, 3)

for o in [column_name_indexer, boolean_mask_indexer, slice_indexer]:
    print(df[o])

In this code snippet, the same Pandas syntax df[...] performs three distinct operations: retrieving a column of the dataframe, applying a row-based boolean mask to the dataframe, and retrieving a slice of the dataframe by index.

Another troubling example: to process dict columns with Pandas, you usually have to resort to a costly apply(); Polars, on the other hand, has a struct data-type for handling dict columns directly!
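
For illustration, here's a minimal sketch of that difference (the "stats" column and "age" field are hypothetical names, not from any particular dataset):

import pandas as pd
import polars as pl

# Pandas: a dict column is just a generic object column, so pulling out a
# field means a Python-level apply()
pd_df = pd.DataFrame({"stats": [{"age": 3}, {"age": 44}]})
pd_df["age"] = pd_df["stats"].apply(lambda d: d["age"])

# Polars: the same data becomes a first-class struct column, with its own
# .struct namespace for direct field access
pl_df = pl.DataFrame({"stats": [{"age": 3}, {"age": 44}]})
pl_df = pl_df.with_columns(pl.col("stats").struct.field("age"))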

In Pandas, you can't do everything you want, and for the things you can do, there are sometimes multiple ways to do them. Compare this with Polars, where you can do everything, the data-types are clear, and there is usually just one way to do a given thing.

2. .scan_parquet() and .sink_parquet()

One of the best things about Polars is the fact that it offers two APIs: an eager API and a lazy API.

The eager API runs all commands in-memory, like Pandas.

The lazy API, however, executes only when explicitly asked for a result (e.g. with a .collect() call), a bit like dask. And, upon being asked for a result, Polars will lean on its query optimization engine to get you that result in the fastest time possible.

Consider the following code snippet, comparing the syntax of the Polars eager DataFrame to that of its lazy counterpart, LazyFrame:

import polars as pl

eager_df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})
lazy_df = pl.LazyFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})

The syntax is remarkably similar! In fact, the only major difference between the eager API and the lazy API is in dataframe creation, reading, and writing, making it quite easy to switch between the two:

Table by Author
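
To see the laziness in action, here's a minimal sketch using the lazy_df from above — nothing is computed until .collect() is called:

result = (
    lazy_df
    .with_columns((pl.col("a") + pl.col("b")).alias("a_plus_b"))
    .collect()  # only here does Polars run the (optimized) query
)
print(result)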

And that brings us to .scan_parquet() and .sink_parquet().

By using .scan_parquet() as your data input function, LazyFrame as your dataframe, and .sink_parquet() as your data output function, you can process larger-than-memory datasets! Now that is cool, especially when you compare it with what the creator of Pandas himself, Wes McKinney, said about Pandas's memory footprint in a post titled "Apache Arrow and the '10 Things I Hate About Pandas'" back in 2017:

"my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset".
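
To make that concrete, here's a minimal sketch of a larger-than-memory pipeline (the file paths and column names are hypothetical):

import polars as pl

(
    pl.scan_parquet("transactions.parquet")        # lazy, streaming read
    .filter(pl.col("amount") > 0)                  # planned, not yet executed
    .with_columns((pl.col("amount") * 0.9).alias("discounted"))
    .sink_parquet("transactions_clean.parquet")    # streaming write
)
# the full dataset never needs to fit in RAM at once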

3. Data-Oriented Programming

Pandas treats dataframes like objects, enabling Object-Oriented Programming; but Polars treats dataframes as data tables, enabling Data-Oriented Programming.

Let me explain.

With dataframes, most of what we want to do is run queries or transformations; we want to add columns, pivot along two variables, aggregate, group by, you name it. Even when we want to split a dataset into train and test sets for training and evaluating a machine learning model, those operations are SQL-like query expressions in nature.

And it's true — with Pandas, you can do most of the transformations, manipulations, and queries on your data that you would want. However, frustratingly, some transformations and queries simply can't be done in a single expression — one query, if you will. Unlike query and data-processing languages like SQL or Spark, many queries in Pandas require multiple successive, distinct task expressions, and this can make things messy. Consider the following code snippet, where we create a dataframe of people and their ages, and we want to see how many people there are in each decade:

import pandas as pd

df = pd.DataFrame({
    "name": ["George", "Polly", "Golly", "Dolly"],
    "age": [3, 4, 13, 44]
})
df["decade"] = (df["age"] / 10).astype(int) * 10
decade_counts = (
    df
    .groupby("decade")
    ["name"]
    .agg("count")
)
print(decade_counts)

>>>> decade
0 2
10 1
40 1

There's no way around it — we have to do our query in three task expressions. To get it down to two expressions, we could have used the rarely seen .assign() operator in place of the df["decade"] = ... operation (see the sketch below), but that's it! It might not seem like a big problem here, but when you find yourself needing seven, eight, nine successive task expressions to get the job done, things can start to get unreadable and hard to maintain.
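
For reference, that two-expression .assign() version might look something like this:

df = pd.DataFrame({
    "name": ["George", "Polly", "Golly", "Dolly"],
    "age": [3, 4, 13, 44]
})
decade_counts = (
    df
    .assign(decade=(df["age"] / 10).astype(int) * 10)
    .groupby("decade")
    ["name"]
    .agg("count")
)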

In Polars, though, this query can be cleanly written as one expression:

import polars as pl

decade_counts = (
    pl.DataFrame({
        "name": ["George", "Polly", "Golly", "Dolly"],
        "age": [3, 4, 13, 44]
    })
    .with_columns([
        ((pl.col("age") / 10).cast(pl.Int32) * 10).alias("decade")
    ])
    .groupby("decade")
    .agg(
        pl.col("name").count().alias("count")
    )
)
print(decade_counts)
>>>> ┌────────┬───────┐
│ decade ┆ count │
│ --- ┆ --- │
│ i32 ┆ u32 │
╞════════╪═══════╡
│ 0 ┆ 2 │
│ 10 ┆ 1 │
│ 40 ┆ 1 │
└────────┴───────┘

So smooth.

You may read all this and think to yourself, "why do I need to do everything in a single expression, though?" It's true, maybe you don't. After all, many data pipelines use intermediate queries, save intermediate results to tables, and query those intermediate tables to get to the end result, or even to monitor data quality.

But, like SQL, Spark, and other non-Pandas data-processing languages, Polars gives you 100% flexibility to break up your query where you want in order to maximize readability, while Pandas forces you to break up your query based on its API's limitations. This is a huge boon not just for code readability, but also for ease of development!

Further still, as an added bonus, if you use the lazy API with Polars, then you can break your query wherever you want, into as many parts as you want, and the whole thing will be optimized into one query under the hood anyway.
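
As a minimal sketch (the file and column names are hypothetical), you can name intermediate LazyFrames purely for readability:

import polars as pl

people = pl.scan_parquet("people.parquet")  # hypothetical input file

# break the query into named steps for readability...
adults = people.filter(pl.col("age") >= 18)
adults_by_city = adults.groupby("city").agg(
    pl.col("name").count().alias("count")
)

# ...Polars still fuses the steps into one optimized query here
result = adults_by_city.collect()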

What I've discussed in this article is only a glimpse into the superiority of Polars over Pandas; there remain many more functions in Polars that harken to SQL, Spark, and other data-processing languages (e.g. pipe(), when(), and filter(), to name a few).
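
For a taste, here's a minimal sketch of when() and filter(), which map closely to SQL's CASE WHEN and WHERE:

import polars as pl

df = pl.DataFrame({"age": [3, 13, 44]})
print(
    df
    .filter(pl.col("age") > 5)            # WHERE age > 5
    .with_columns(
        pl.when(pl.col("age") >= 18)      # CASE WHEN age >= 18
        .then(pl.lit("adult"))            # THEN 'adult'
        .otherwise(pl.lit("minor"))       # ELSE 'minor' END
        .alias("life_stage")
    )
)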

And while Polars is now my go-to library for data processing and analysis in Python, I do still use Pandas for narrow use-cases like styling dataframes for display in reports and presentations, or communicating with spreadsheets. That said, I fully expect Polars to subsume Pandas little by little as time goes on.

What Next?

Getting started with a new tool is hard, especially if it's a new dataframe library, something so pivotal to our work as data scientists! I got started by taking Liam Brannigan's Udemy course "Data Analysis with Polars", and I can highly recommend it — it covers all the fundamentals of Polars, and made the transition quite easy for me (I receive no referral bonus for suggesting this course; I simply liked it that much!). And that brings me to my final point…

Acknowledgements

A special thanks to Liam Brannigan for your Polars course, without which I'm not sure I would have made the transition from Pandas to Polars. And, of course, a huge thanks to Ritchie Vink, the creator of Polars! Not only have you created an awesome library, but you promptly responded to my questions and comments about Polars on both LinkedIn and GitHub – you have not only created an amazing tool, but also a welcoming community around it. And to you, the reader – thanks for reading; I wish you happy data-crunching 🙂
