I came for the speed, but I stayed for the syntax
At the time of writing this post, it's been six years since I landed my first job in data science. And for those entire six years, Pandas has been the foundation of all my work: exploratory data analyses, impact analyses, data validations, model experimentation, you name it. My career was built on top of Pandas!

Needless to say, I had some serious Pandas lock-in.

That is, until I discovered Polars, the brand new "blazingly fast DataFrame library" for Python.
In this article, I'll explain:

- What Polars is, and what makes it so fast;
- The three reasons why I have permanently switched from Pandas to Polars:
  – The .arr namespace;
  – .scan_parquet() and .sink_parquet();
  – Data-oriented programming.
Maybe you've heard of Polars, maybe you haven't! Either way, it's slowly taking over Python's data-processing landscape, starting right here on Towards Data Science:
So what makes Polars so fast? From the Polars User Guide:

Polars is completely written in Rust (no runtime overhead!) and uses Arrow – the native arrow2 Rust implementation – as its foundation… Polars is written in Rust which gives it C/C++ performance and allows it to fully control performance-critical parts in a query engine… …Unlike tools such as dask – which tries to parallelize existing single-threaded libraries like NumPy and Pandas – Polars is written from the ground up, designed for parallelization of queries on DataFrames
And there you have it. Polars is not just a framework for alleviating the single-threaded nature of Pandas, like dask or modin; rather, it's a full makeover of the Python dataframe, with the highly optimized Apache Arrow columnar memory format as its foundation and its own query optimization engine on top. And the results on speed are mind-blowing (as per h2oai's db-benchmark):

On a groupby operation over a 5GB dataframe, Polars is more than 6 times faster than Pandas!
This speed alone is enough to get anyone interested. But as you'll see in the rest of this article, the speed is what got me interested; it's really the syntax that made me fall in love.
1. The .arr Namespace
Imagine the following scenario in Pandas: you have a dataset of families and some information about them, including a list of all the family members:
import pandas as pd
df = pd.DataFrame({
    "last_name": ["Johnson", "Jackson", "Smithson"],
    "members": [["John", "Ron", "Con"], ["Jack", "Rack"], ["Smith", "Pith", "With", "Lith"]],
    "city_of_residence": ["Boston", "New York City", "Dallas"]
})
print(df)

>>>> last_name                    members city_of_residence
0   Johnson           [John, Ron, Con]            Boston
1   Jackson               [Jack, Rack]     New York City
2  Smithson  [Smith, Pith, With, Lith]            Dallas
For your analysis, you need to create a new column from the first element of the members list. How do you do that? A search of the Pandas API will leave you lost, but a brief stackoverflow search will show you the answer!
The prevailing way to extract an element of a list in a Pandas column is to use the .str namespace (stackoverflow ref1, stackoverflow ref2), like this:
df["family_leader"] = df["members"].str[0]
print(df)

>>>> last_name                    members city_of_residence family_leader
0   Johnson           [John, Ron, Con]            Boston          John
1   Jackson               [Jack, Rack]     New York City          Jack
2  Smithson  [Smith, Pith, With, Lith]            Dallas         Smith
If you're like me, you're probably wondering: "why do I have to use the .str namespace to handle a list data-type?"
Unfortunately, Pandas's .str namespace cannot do all the list operations one might desire; some things require a costly .apply, for instance. In Polars, however, this is not an issue. By conforming to Apache Arrow's columnar data format, Polars has all the standard data-types, and appropriate namespaces for handling each of them – including lists:
import polars as pl
df = pl.DataFrame({
    "last_name": ["Johnson", "Jackson", "Smithson"],
    "members": [["John", "Ron", "Con"], ["Jack", "Rack"], ["Smith", "Pith", "With", "Lith"]],
    "city_of_residence": ["Boston", "New York City", "Dallas"]
})
df = df.with_columns([
    pl.col("members").arr.get(0).alias("family_leader")
])
print(df)

>>>> ┌───────────┬─────────────────────────────┬───────────────────┬───────────────┐
│ last_name ┆ members                     ┆ city_of_residence ┆ family_leader │
│ ---       ┆ ---                         ┆ ---               ┆ ---           │
│ str       ┆ list[str]                   ┆ str               ┆ str           │
╞═══════════╪═════════════════════════════╪═══════════════════╪═══════════════╡
│ Johnson   ┆ ["John", "Ron", "Con"]      ┆ Boston            ┆ John          │
│ Jackson   ┆ ["Jack", "Rack"]            ┆ New York City     ┆ Jack          │
│ Smithson  ┆ ["Smith", "Pith", … "Lith"] ┆ Dallas            ┆ Smith         │
└───────────┴─────────────────────────────┴───────────────────┴───────────────┘
That's right: Polars is so explicit about data-types that it even tells you the data-type of each column in your dataframe every time you print it!
It doesn't stop there, though. Not only does the Pandas API require the use of one data-type's namespace to handle another data-type, but the API has become so bloated that there are often many ways to do the same thing. This can be confusing, especially for newcomers. Consider the following code snippet:
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 1],
    "b": [4, 5, 6]
})
column_name_indexer = ["a"]
boolean_mask_indexer = df["b"]==5
slice_indexer = slice(1, 3)
for o in [column_name_indexer, boolean_mask_indexer, slice_indexer]:
    print(df[o])
In this code snippet, the same Pandas syntax df[...] can do three distinct operations: retrieve a column of the dataframe, perform a row-based boolean mask on the dataframe, and retrieve a slice of the dataframe by index.
Another troubling example is that, to process dict columns with Pandas, you usually have to resort to a costly apply() function; Polars, on the other hand, has a struct data-type for handling dict columns directly!
In Pandas, you can't do everything you want, and for the things you can do, there are sometimes multiple ways to do them. Compare this with Polars, where you can do everything, the data-types are clear, and there is usually just one way to do a given thing.
2. .scan_parquet() and .sink_parquet()
One of the best things about Polars is the fact that it offers two APIs: an eager API and a lazy API.

The eager API runs all commands in-memory, like Pandas.
The lazy API, however, executes everything only when explicitly asked for a result (e.g. with a .collect() statement), a bit like dask. And, upon being asked for a result, Polars will lean on its query optimization engine to get you your result in the fastest time possible.
Consider the following code snippet, comparing the syntax of the Polars eager DataFrame to that of its lazy counterpart, LazyFrame:
import polars as pl
eager_df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})

lazy_df = pl.LazyFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6]
})
The syntax is remarkably similar! In fact, the only major difference between the eager API and the lazy API is in dataframe creation, reading, and writing, making it quite easy to switch between the two:
And that brings us to .scan_parquet() and .sink_parquet().
By using .scan_parquet() as your data input function, LazyFrame as your dataframe, and .sink_parquet() as your data output function, you can process larger-than-memory datasets! Now that is cool, especially when you compare it with what the creator of Pandas himself, Wes McKinney, said about Pandas's memory footprint in a post titled "Apache Arrow and the '10 Things I Hate About Pandas'" back in 2017:

"my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset".
3. Data-Oriented Programming
Pandas treats dataframes like objects, enabling Object-Oriented Programming; but Polars treats dataframes as data tables, enabling Data-Oriented Programming.
Let me explain.
With dataframes, most of what we want to do is run queries or transformations; we want to add columns, pivot along two variables, aggregate, group by, you name it. Even when we want to subset a dataset into train and test sets for training and evaluating a machine learning model, those are SQL-like query expressions in nature.
And it's true: with Pandas, you can do most of the transformations, manipulations, and queries on your data that you could want. However, frustratingly, some transformations and queries simply can't be done in a single expression, or one query if you will. Unlike other query and data-processing languages like SQL or Spark, many queries in Pandas require multiple successive, distinct task expressions, and this can make things messy. Consider the following code snippet, where we create a dataframe of people and their ages, and we want to see how many people there are in each decade:
import pandas as pd
df = pd.DataFrame({
    "name": ["George", "Polly", "Golly", "Dolly"],
    "age": [3, 4, 13, 44]
})
df["decade"] = (df["age"] / 10).astype(int) * 10
decade_counts = (
    df
    .groupby("decade")["name"]
    .agg("count")
)
print(decade_counts)

>>>> decade
0     2
10    1
40    1
There's no way around it: we have to do our query in three task expressions. To get it down to two expressions, we could have used the rarely seen .assign() operator in place of the df["decade"] = ... operation, but that's it! It might not seem like a big problem here, but when you find yourself needing seven, eight, nine successive task expressions to get the job done, things can start to get unreadable and hard to maintain.
In Polars, though, this query can be cleanly written as one expression:
import polars as pl
decade_counts = (
    pl.DataFrame({
        "name": ["George", "Polly", "Golly", "Dolly"],
        "age": [3, 4, 13, 44]
    })
    .with_columns([
        ((pl.col("age") / 10).cast(pl.Int32) * 10).alias("decade")
    ])
    .groupby("decade")
    .agg(
        pl.col("name").count().alias("count")
    )
)
print(decade_counts)
>>>> ┌────────┬───────┐
│ decade ┆ count │
│ --- ┆ --- │
│ i32 ┆ u32 │
╞════════╪═══════╡
│ 0 ┆ 2 │
│ 10 ┆ 1 │
│ 40 ┆ 1 │
└────────┴───────┘
So smooth.
You might read all this and think to yourself, "why do I need to do everything in one expression, though?" It's true, maybe you don't. After all, many data pipelines use intermediate queries, save intermediate results to tables, and query those intermediate tables to get to the final result, or even to monitor data quality.
But, like SQL, Spark, and other non-Pandas data-processing languages, Polars gives you 100% flexibility to break up your query wherever you want in order to maximize readability, while Pandas forces you to break up your query based on its API's limitations. This is a huge boon not just for code readability, but also for ease of development!
Further still, as an added bonus, if you use the lazy API with Polars, then you can break your query wherever you like, into as many parts as you like, and the whole thing will be optimized into one query under the hood anyway.
What I've discussed in this article is just a glimpse into the superiority of Polars over Pandas; there remain many functions in Polars that harken to SQL, Spark, and other data-processing languages (e.g. pipe(), when(), and filter(), to name a few).
And while Polars is now my go-to library for data processing and analysis in Python, I do still use Pandas for narrow use-cases like styling dataframes for display in reports and presentations, or communicating with spreadsheets. That said, I fully expect Polars to subsume Pandas little by little as time goes on.
What Next?
Getting started with a new tool is hard, especially if it's a new dataframe library, something so pivotal to our work as data scientists! I got started by taking Liam Brannigan's Udemy course "Data Analysis with Polars", and I can highly recommend it: it covers all the fundamentals of Polars, and helped make the transition quite easy for me (I receive no referral bonus for suggesting this course; I simply liked it that much!). And that brings me to my final point…
Acknowledgements
A special thanks to Liam Brannigan for your Polars course, without which I'm not sure I would have made the transition from Pandas to Polars. And, of course, a huge thanks to Ritchie Vink, the creator of Polars! Not only have you created a great library, you also promptly responded to my questions and comments about Polars on both LinkedIn and GitHub – you haven't just created an amazing tool, but also a welcoming community around it. And to you, the reader: thanks for reading; I wish you happy data-crunching 🙂