
Polars: The Super Fast Dataframe Library for Python — Goodbye Pandas?



A dataframe library faster than pandas is here, and its name is Polars!

Polars is a dataframe library written in Rust that uses Apache Arrow as its foundation. It is faster than pandas, especially when working with large datasets.

Although Polars is written in Rust, you don’t need to know Rust to use it: there’s a Python package to get you started. In fact, if you already know pandas, learning Polars should be easy.

But first, let’s see why you should choose Polars over other options.

Why use Polars?

Here are some reasons to choose Polars.

  • It uses all available cores in your computer.
  • It optimizes queries to reduce unneeded work and memory allocations.
  • It handles datasets larger than your available RAM.
  • It has a strict schema (data types must be known before running the query).

But don’t just take my word for it. Let’s see some numbers.

Here’s a performance benchmark shown in the Polars documentation. According to the image below, Polars is much faster than the other options.

Source: Polars doc

How can polars outperform pandas?

Unlike pandas, Polars supports lazy and semi-lazy execution. In lazy mode, Polars can optimize the whole query before running it, which improves performance and reduces memory pressure. That said, you can still do all of your work eagerly with Polars, as you would with pandas.

Now let’s learn how to use Polars!

First Things First: Install The Library

To install Polars, run one of the commands below.

# pip
pip install polars

# conda
conda install polars

Note that, at the time of writing, Polars requires Python 3.7 or above; newer releases may require a more recent Python version.

Read a dataset with Polars

Just like with pandas, we can read CSV files with Polars. Let’s import Polars and read a CSV file (click here to download this CSV file).

import polars as pl
df = pl.read_csv("StudentsPerformance.csv")

Here’s how the dataframe looks.

Did you notice anything strange about this dataframe?

The data type is shown under each column name, and there isn’t an index! If you’re a pandas user, you’re probably used to seeing an index in a dataframe, but Polars doesn’t have one.

Here’s why (according to its docs):

Polars aims to have predictable results and readable queries, as such we think an index does not help us reach that objective. We believe the semantics of a query should not change by the state of an index or a reset_index call.

What does that mean for pandas users?

Well, in Polars we won’t have to use the .loc or .iloc methods anymore, and we won’t get the SettingWithCopyWarning.

But our df dataframe is still similar to a pandas one. We can get the columns attribute, just like we’d do with pandas.

>>> df.columns

['id',
'gender',
'race/ethnicity',
'parental level of education',
'lunch',
'test preparation course',
'math score',
'reading score',
'writing score']

Let’s explore what else we can do with Polars and how it differs from pandas.

How to select columns with Polars

Say we want to select the “gender” column from our dataframe. Here’s how we’d do it with Polars.

# Select 1 column
df.select(pl.col('gender'))

We can also select two or more columns by passing a list inside [].

# Select 2+ columns
df.select(pl.col(['gender', 'math score']))

Or all of the columns!

# Select all columns
df.select(pl.col('*'))

How to create columns with Polars

Let’s sum the “math score” and “reading score” columns and put the result in a new column named “sum.”

To create a column with Polars we have to use .with_columns. Here’s how to use it and how it differs from pandas.

# polars: create "sum" column
df.with_columns(
    (pl.col('math score') + pl.col('reading score')).alias("sum")
)

# pandas: df['sum'] = df['math score'] + df['reading score']

As you can see, we also need to use .alias to name the column.

Now let’s create an “average” column. We’ll calculate the average of “math score,” “reading score,” and “writing score” for each row.

# polars: create "average" column (row-wise mean of the three scores)
df.with_columns(
    ((pl.col('math score') + pl.col('reading score') + pl.col('writing score')) / 3).alias('average')
)

# pandas: df['average'] = df[['math score', 'reading score', 'writing score']].mean(axis=1)

How to filter data with Polars

Say we want to keep only the rows where the gender is female. We can filter data with Polars using .filter.

# polars: simple filtering
df.filter(pl.col('gender')=='female')

# pandas: df[df['gender'] == 'female']

We can also filter on multiple conditions. Let’s keep only the “female” rows from “group B.”

# Multiple filtering
df.filter(
    (pl.col('gender')=='female') &
    (pl.col('race/ethnicity')=='group B')
)

# pandas: df[(df['gender'] == 'female') & (df['race/ethnicity'] == 'group B')]

How to group by with Polars

Grouping with Polars is very similar to pandas. We use .group_by (spelled .groupby in older Polars releases) and then indicate the aggregation function.

Let’s group by “race/ethnicity” and count the elements in each group.

# Group by (use df.groupby(...) on older Polars releases)
df.group_by("race/ethnicity").count()

Just like pandas, isn’t it?

Joining dataframes with Polars

To join dataframes with Polars, we use .join. The syntax of this method is similar to the .merge function we have in pandas.

Before joining dataframes, download the second CSV named “LanguageScore.csv” and read it as df2.

df2 = pl.read_csv("LanguageScore.csv")

Now, let’s join df and df2. They have a common column named id.

# Join dataframes
df.join(df2, on='id')

Now we have the “language score” column in the joined dataframe.

You can also add the how parameter to indicate the type of join you want.

# Inner, left and outer join
df.join(df2, on='id', how='inner')
df.join(df2, on='id', how='left')
df.join(df2, on='id', how='outer')  # called how='full' in newer Polars releases

Concatenate dataframes with Polars

To concatenate dataframes with Polars, we use pl.concat. Unlike pandas, where we pass an axis, here we indicate whether we want a horizontal or vertical concatenation with the how parameter, set to either “horizontal” or “vertical.”

Note that vertical concatenation makes a dataframe longer, while horizontal concatenation makes a dataframe wider.

Let’s add the “language score” column from df2 to df. To do so, we have to concatenate both dataframes horizontally. Here’s how.

# Concatenate dataframes
pl.concat([df, df2], how="horizontal")

But here’s the catch: the dataframes to concatenate can’t have a single column name in common.

Both of our dataframes have the column “id”, so we have to drop it from one of them before concatenating.

# drop column "id" in df2
df2 = df2.drop("id")

# Concatenate dataframes
pl.concat([df, df2], how="horizontal")

Note that, unlike our previous inner join, we now get null values in the “language score” column. This happens because df has more rows than df2, so the missing rows are filled with nulls in the concatenation.

Congratulations! You just learned how to use the Polars library. For more, check the official documentation.
