
Pandas 2.0: A Faster Version of Pandas with Apache Arrow Backend


Here’s everything you need to know about the new pandas 2.0

Photo by Lukas W. on Unsplash

Pandas 2.0 was recently released. This version mainly includes bug fixes, performance improvements, and the addition of the Apache Arrow backend.

If you’re a pandas user, you probably already know that pandas has used NumPy for years to represent arrays and perform operations on them. However, when it comes to working with dataframes, Arrow provides many benefits over NumPy.

In this article, we’ll see what those benefits are, why pandas is choosing Arrow for its backend, and how you can start using Arrow in pandas 2.0 (it’s still not the default option).

Updating to Pandas 2.0

If you already have pandas installed in your virtual environment, you need to update to the latest version to get all the benefits of pandas 2.0.

You can install pandas 2.0 with pip:

python -m pip install --upgrade --pre pandas==2.0.0rc0

Or you can use conda-forge:

conda install -c conda-forge/label/pandas_rc pandas==2.0.0rc0

To confirm you’re running the latest version of pandas, use the code below.

import pandas as pd

print(pd.__version__)

You should see “2.0.0rc0” printed on your screen.

Now let’s see how to enable Arrow in the backend.

Using Arrow in the backend

At the time I’m writing this article, pandas still uses by default the original types we all know well. However, if you have pandas 2.0, you can now tell pandas that you want to use Arrow-backed types as shown below.

Say we want to create a series using Arrow-backed types.

>>> pd.Series([5, 6, 7, 8], dtype='int64[pyarrow]')

0 5
1 6
2 7
3 8

dtype: int64[pyarrow]

As you can see, we can now use the dtype parameter to use Arrow in the backend. That’s why we get the data type int64[pyarrow] instead of the int64 we’d get if NumPy were used in the backend.

In addition, we can set Arrow-backed types as the default instead of requesting them in a specific function call as shown before.

import pandas as pd

pd.options.mode.dtype_backend = 'pyarrow'

Just keep in mind that this is currently only partially implemented, and it doesn’t work when creating data with pd.Series or pd.DataFrame. For example, if you want to read a CSV with PyArrow, you’ll need to use the code below.

import pandas as pd

pd.options.mode.dtype_backend = 'pyarrow'
pd.read_csv("file_name.csv", engine='pyarrow', use_nullable_dtypes=True)

At this point, we’ve learned how to use Arrow in the backend, but why should we use Arrow over NumPy?

Why should we use Arrow?

Pandas was originally built on NumPy; however, NumPy was never designed as a backend for dataframe libraries, and it has some limitations.

Here are some of the benefits that Arrow has over NumPy.

1. Missing values

Representing missing values isn’t easy. The pandas approach has been to convert numbers to floating point and use NaN for missing values.

Here’s an example.

>>> pd.Series([5, 6, 7, None])

0 5.0
1 6.0
2 7.0
3 NaN

dtype: float64

However, this isn’t optimal and has side effects: in the example above, a column of integers silently becomes float64.

In contrast, by using Arrow, pandas can deal with missing values without having to implement its own version for each data type.

Let’s see the same example, but now using Arrow-backed types.

>>> pd.Series([5, 6, 7, None], dtype='int64[pyarrow]')

0 5
1 6
2 7
3 <NA>

dtype: int64[pyarrow]

2. Speed

According to the documentation, Arrow can perform operations faster than NumPy when working with dataframes.

A test using a dataframe with 2.5 million rows with both NumPy and Arrow showed that Arrow is faster than NumPy for the following operations:

  • read parquet: 141ms with NumPy vs 87ms with Arrow
  • mean (int64): 2.03ms with NumPy vs 1.11ms with Arrow
  • mean (float64): 3.56ms with NumPy vs 1.73ms with Arrow
  • endswith (string): 471ms with NumPy vs 14.9ms with Arrow

As we can see, Arrow is faster than NumPy, especially when dealing with strings (the results might change slightly depending on your computer).

3. Interoperability

Just like a CSV file can be read with pandas or opened in Excel, Arrow data can be accessed by different programs such as R, Spark, and Polars.

The benefit of this is that it’s easy, fast, and memory-efficient to share data among these programs.

Let’s look at an example to see how useful this is.

Say you’ve been working with a pandas dataframe, but for some reason you now need to convert it into a polars dataframe. Thanks to the Arrow implementation, you can now share data between polars and pandas efficiently.

import pandas as pd
import polars as pl

df_pandas = pd.read_csv("example.csv", engine="pyarrow")

df_polars = pl.from_pandas(df_pandas)
print(df_polars)

You can switch back to pandas to use functionality you wouldn’t find in polars, and vice versa, thanks to Arrow.

4. Arrow Data types

Arrow supports more and better data types than NumPy. Here’s a comparison between NumPy and Arrow data types.

  • Arrow data types are more efficient than NumPy’s. For instance, the boolean type in Arrow uses 1 bit per value, while NumPy uses 8 bits per value
  • Arrow has better support for dates and times, with different precisions (s, ms, etc.) and different sizes (32 bits, 64 bits, etc.)
  • Arrow supports data types like decimals, binary data, and complex types

In case you’re curious to see the equivalent pyarrow-backed, pandas extension, and NumPy types, check this table.

Conclusion

Overall, we’ve seen that the Arrow implementation in pandas 2.0 offers faster and more memory-efficient operations thanks to its data types and the way it deals with missing values.

For more information about the new pandas 2.0, read the official documentation.
