Pandas 2.0: A Faster Version of Pandas with Apache Arrow Backend

Here’s everything you need to know about the new pandas 2.0

Pandas 2.0 was recently released. This version mainly includes bug fixes, performance improvements, and the addition of the Apache Arrow backend.

If you’re a pandas user, you probably already know that pandas has used NumPy for years to represent arrays and perform operations on them. However, when it comes to working with dataframes, Arrow provides many benefits over NumPy.

In this article, we’ll see what those benefits are, why pandas is choosing Arrow for its backend, and how you can start using Arrow in pandas 2.0 (it’s still not the default option).

Updating to Pandas 2.0

If you already have pandas installed in your virtual environment, you need to update to the latest version to get all the benefits of pandas 2.0.

You can install pandas 2.0 with pip:

python -m pip install --upgrade --pre pandas==2.0.0rc0

Or you can use conda-forge:

conda install -c conda-forge/label/pandas_rc pandas==2.0.0rc0

To confirm you’re running the latest version of pandas, use the code below.

import pandas as pd
print(pd.__version__)

You should see “2.0.0rc0” printed on your screen.

Now let’s see how to enable Arrow in the backend.

Using Arrow in the backend

At the time I’m writing this article, pandas still uses by default the original types we all know well. However, if you have pandas 2.0, you can now tell pandas to use Arrow-backed types as shown below.

Say we want to create a series using Arrow-backed types.

>>> pd.Series([5, 6, 7, 8], dtype='int64[pyarrow]')
0    5
1    6
2    7
3    8
dtype: int64[pyarrow]

As you can see, we can now use the dtype parameter to put Arrow in the backend. That’s why we get the data type int64[pyarrow] instead of the int64 we’d get if NumPy were used in the backend.

In addition, we can make Arrow-backed types the default instead of requesting them on each function call as shown before.

import pandas as pd
pd.options.mode.dtype_backend = 'pyarrow'

Keep in mind that this is currently only partially implemented and doesn’t work when creating data with pd.Series or pd.DataFrame. For instance, if you want to read a CSV with PyArrow, you’ll need to use the code below.

import pandas as pd
pd.options.mode.dtype_backend = 'pyarrow'
pd.read_csv("file_name.csv", engine='pyarrow', use_nullable_dtypes=True)

At this point, we’ve learned how to use Arrow in the backend, but why should we use Arrow over NumPy?

Why should we use Arrow?

pandas was originally built on NumPy; however, NumPy was never designed as a backend for dataframe libraries, and it has some limitations.

Here are some of the benefits that Arrow has over NumPy.

1. Missing values

Representing missing values isn’t easy. The pandas approach has been to convert numbers to floating point and use NaN for missing values.

Here’s an example.

>>> pd.Series([5, 6, 7, None])
0    5.0
1    6.0
2    7.0
3    NaN
dtype: float64

However, this isn’t optimal and has side effects: an integer column silently becomes float64 as soon as one value is missing.

In contrast, by using Arrow, pandas can deal with missing values without having to implement its own version for each data type.

Let’s see the same example, but now using Arrow-backed types.

>>> pd.Series([5, 6, 7, None], dtype='int64[pyarrow]')
0       5
1       6
2       7
3    <NA>
dtype: int64[pyarrow]

2. Speed

According to the documentation, Arrow can perform operations faster than NumPy when working with dataframes.

A test on a dataframe with 2.5 million rows, run with both NumPy and Arrow, showed that Arrow is faster than NumPy for the following operations:

read parquet: 141ms with Numpy vs 87ms with Arrow
mean (int64): 2.03ms with Numpy vs 1.11ms with Arrow
mean (float64): 3.56ms with Numpy vs 1.73ms with Arrow
endswith (string): 471ms with Numpy vs 14.9ms with Arrow

As we can see, Arrow is faster than NumPy, especially when dealing with strings (the results may vary slightly depending on your machine).

3. Interoperability

Just like a CSV file that can be read with pandas or opened in Excel, Arrow data can be accessed by different programs such as R, Spark, and Polars.

The benefit of this is that it’s easy, fast, and memory efficient to share data among these programs.

Let’s look at an example to see how useful this is.

Say you’ve been working with a pandas dataframe, but for some reason you now need to convert it into a polars dataframe. Thanks to the Arrow implementation, you can now share data between polars and pandas efficiently.

import pandas as pd
import polars as pl

df_pandas = pd.read_csv("example.csv", engine="pyarrow")
df_polars = pl.from_pandas(df_pandas)
print(df_polars)

Thanks to Arrow, you can switch back to pandas to use functionality you wouldn’t find in polars, and vice versa.

4. Arrow Data types

Arrow supports more and better data types than NumPy. Here’s a comparison between NumPy and Arrow data types:

Arrow data types are more efficient than NumPy’s. For instance, the boolean type in Arrow uses 1 bit per value, while NumPy uses 8 bits per value.
Arrow has better support for dates and times, with different precisions (s, ms, etc.) and different sizes (32 bits, 64 bits, etc.).
Arrow supports data types like decimals and binary data, as well as complex (nested) types.

If you’re curious to see the equivalent PyArrow-backed, pandas extension, and NumPy types, check this table.

Conclusion

Overall, we’ve seen that the Arrow implementation in pandas 2.0 offers faster and more memory-efficient operations thanks to its data types and the way it deals with missing values.

For more information about the new pandas 2.0, read the official documentation.
