Pandas 2.0: A Faster Version of Pandas with Apache Arrow Backend


Here’s everything you need to know about the new Pandas 2.0


Pandas 2.0 was recently released. This version mainly includes bug fixes, performance improvements, and the addition of the Apache Arrow backend.

If you’re a pandas user, you probably already know that pandas has used NumPy for years to represent arrays and perform operations on them. However, when it comes to working with dataframes, Arrow provides many benefits over NumPy.

In this article, we’ll see what those benefits are, why pandas is choosing Arrow for its backend, and how you can start using Arrow in Pandas 2.0 (it’s still not the default option).

Updating to Pandas 2.0

If you already have pandas installed in your virtual environment, you need to update to the latest version to get all the benefits of pandas 2.0.

You can install pandas 2.0 with pip

python -m pip install --upgrade --pre pandas==2.0.0rc0

Or you can use conda-forge

conda install -c conda-forge/label/pandas_rc pandas==2.0.0rc0

To confirm you’re running the latest version of pandas, use the code below.

import pandas as pd

print(pd.__version__)

You should see “2.0.0rc0” printed on your screen.

Now let’s see how to enable Arrow in the backend.

Using Arrow in the backend
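If you would rather check the version in code than by eye, a minimal sketch follows. The helper name major_version is made up for this illustration; it simply reads the number before the first dot of the version string.

```python
import pandas as pd


def major_version(version: str) -> int:
    """Return the leading major number of a version string such as '2.0.0rc0'.

    Hypothetical helper for this example, not part of pandas.
    """
    return int(version.split(".")[0])


if major_version(pd.__version__) >= 2:
    print(f"pandas {pd.__version__}: Arrow-backed types are available")
else:
    print(f"pandas {pd.__version__} is older than 2.0, consider upgrading")
```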

At the time I’m writing this article, pandas by default is still using the original NumPy types we all know well. However, if you have pandas 2.0, you can now tell pandas to use Arrow-backed types as shown below.

Say we want to create a series using Arrow-backed types.

>>> pd.Series([5, 6, 7, 8], dtype='int64[pyarrow]')

0 5
1 6
2 7
3 8

dtype: int64[pyarrow]

As you can see, we can now use the dtype parameter to use Arrow in the backend. That’s why we get the data type int64[pyarrow] instead of the int64 we’d get if NumPy were used in the backend.

In addition, we can set Arrow-backed types as the default instead of specifying them on each function call as shown before.

import pandas as pd

pd.options.mode.dtype_backend = 'pyarrow'

Just keep in mind that this is currently only partially implemented and doesn’t work when creating data with pd.Series or pd.DataFrame. For instance, if you want to read a CSV with PyArrow, you’ll need to use the code below.

import pandas as pd

pd.options.mode.dtype_backend = 'pyarrow'
pd.read_csv("file_name.csv", engine='pyarrow', use_nullable_dtypes=True)

At this point, we’ve learned how to use Arrow in the backend, but why should we use Arrow over NumPy?

Why should we use Arrow?

Pandas was originally built on NumPy; however, NumPy was never designed as a backend for dataframe libraries, and it has some limitations.

Here are some of the benefits that Arrow has over NumPy.

1. Missing values

Representing missing values isn’t easy. The pandas approach has been to convert integers to floating-point numbers and use NaN to represent missing values.

Here’s an example.

>>> pd.Series([5, 6, 7, None])

0 5.0
1 6.0
2 7.0
3 NaN

dtype: float64

However, this isn’t optimal and has side effects: the integers are silently converted to floats.

In contrast, by using Arrow, pandas can deal with missing values without having to implement its own version for each data type.

Let’s see the same example, but now using Arrow-backed types.
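One concrete side effect is loss of precision. A float64 cannot represent every large integer exactly, so a single missing value can quietly change the numbers next to it. The sketch below uses 2**53 + 1, the smallest positive integer that float64 cannot store exactly, as an illustrative value.

```python
import pandas as pd

big = 2**53 + 1  # smallest positive integer float64 cannot represent exactly

s_ints = pd.Series([big, 5])        # no missing values: stays int64
s_missing = pd.Series([big, None])  # one missing value: becomes float64

print(s_ints.dtype)
print(s_missing.dtype)

# The stored value is no longer the integer we put in
print(int(s_missing.iloc[0]) == big)
```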

>>> pd.Series([5, 6, 7, None], dtype='int64[pyarrow]')

0 5
1 6
2 7
3 <NA>

dtype: int64[pyarrow]

2. Speed

According to the documentation, Arrow can perform operations faster than NumPy when working with dataframes.

A test using a dataframe with 2.5 million rows with both NumPy and Arrow revealed that Arrow is faster than NumPy for the following operations:

  • read parquet: 141 ms with NumPy vs 87 ms with Arrow
  • mean (int64): 2.03 ms with NumPy vs 1.11 ms with Arrow
  • mean (float64): 3.56 ms with NumPy vs 1.73 ms with Arrow
  • endswith (string): 471 ms with NumPy vs 14.9 ms with Arrow

As we can see, Arrow is faster than NumPy, especially when dealing with strings (the results might vary slightly depending on your machine).

3. Interoperability

Just like a CSV file that can be read with pandas or opened in Excel, Arrow data can be accessed by different programs such as R, Spark, and Polars.

The benefit of this is that it’s easy, fast, and memory-efficient to share data among these programs.

Let’s look at an example to see how useful this is.

Say you’ve been working with a pandas dataframe, but for some reason you now need to convert it into a polars dataframe. Thanks to the Arrow implementation, you can now share data between polars and pandas efficiently.

import pandas as pd
import polars as pl

df_pandas = pd.read_csv("example.csv", engine="pyarrow")

df_polars = pl.from_pandas(df_pandas)
print(df_polars)

Thanks to Arrow, you can switch back to pandas to use functionality you wouldn’t find in polars, and vice versa.

4. Arrow Data types

Arrow supports more and better data types than NumPy. Here’s a comparison between NumPy and Arrow data types.

  • Arrow data types are more efficient than NumPy’s. For instance, the boolean type in Arrow uses 1 bit per value, while NumPy uses 8 bits per value
  • Arrow has better support for dates and times, with different precisions (s, ms, etc.) and different sizes (32 bits, 64 bits, etc.)
  • Arrow supports data types such as decimals, binary data, and complex types (for example, lists and structs)

If you’re curious to see the equivalent pyarrow-backed, pandas extension, and NumPy types, check this table.

Conclusion

Overall, we’ve seen that the Arrow implementation in Pandas 2.0 offers faster and more memory-efficient operations thanks to its data types and the way it deals with missing values.

For more information about the new Pandas 2.0, read the official documentation.
