Meet PandasAI: Supercharge Your Data Analysis With AI


A glimpse into conversational data analysis with natural language

Photo by Luke Chesser on Unsplash

The swift progress of large language models, including fascinating applications like ChatGPT, continually showcases the remarkable capabilities of this technology, with new use cases arising every day. In this article, we take a closer look at PandasAI, a conversational library that lets you literally talk with your data, and explore some compelling examples.

Introduction

The landscape of writing computer code has undergone a dramatic transformation. Gone are the days when programmers spent countless hours scouring Google for answers and sifting through Stack Overflow forum entries. We have entered the era of AI-assisted coding, which has significantly accelerated the development process. First, there were tools like GitHub Copilot, which allowed you to type a docstring and get eerily good code suggestions; with ChatGPT, you can just type what you want to do, and it will produce script-length code output. The chat functionality also streamlines troubleshooting, enabling quick identification and resolution of error messages.

The future of programming is poised for a significant shift: Instead of mastering abstract programming language concepts, users will soon communicate with computers and data in conversational language. Recent advancements in libraries suggest that this transformation is already underway. For instance, LangChain, a highly popular utility library for building tools around large language models, features a showcase of the Pandas DataFrame Agent. Users can load their Pandas DataFrame, ask a question, and the agent will run the relevant code and provide a comprehensive response. For those unfamiliar with Pandas: it is a widely used Python library for handling tabular data. The following example from the documentation illustrates this interaction:

from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
import pandas as pd

df = pd.read_csv('titanic.csv')

agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
agent.run("how many rows are there?")

This produces the following output:

> Entering new AgentExecutor chain...
Thought: I need to count the number of rows
Action: python_repl_ast
Action Input: len(df)
Observation: 891
Thought: I now know the final answer
Final Answer: There are 891 rows in the dataframe.

> Finished chain.

A package that is built on LangChain for this purpose is yolopandas.

Introducing PandasAI

PandasAI is another package designed to provide a conversational interface for Pandas DataFrames. In just a few days, it gained considerable popularity on GitHub, amassing 3.6k stars, a noteworthy achievement considering the original Pandas package has around 38k stars. What sets PandasAI apart is its easy installation via pip, allowing users to get started with just a few lines of code. For example, here is how I set it up in my conda environment on an M1 Mac, alongside Jupyter:

conda create --name pandasai python=3.10
conda activate pandasai
pip install pandasai
conda install jupyter

If you get a 500 error when launching Jupyter, it is most likely this bug, and the solution is to pin the dependency: pip install --force-reinstall charset-normalizer==3.1.0.

Now, let's follow the quickstart example from the documentation. Here, we use the OpenAI API, but in principle, you could also use OpenAssistant.

import pandas as pd
from pandasai import PandasAI

# Sample DataFrame
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

# Instantiate an LLM
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")

pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt='Which are the 5 happiest countries?')

This worked like a charm:

‘Based on the data, the top 5 happiest countries are Canada, Australia, United Kingdom, Germany, and United States.’

If we rephrase the prompt:
Which are the countries with the largest gpd?,
the result was this:

‘The countries with the largest GDP are the United States, China, Japan, Germany, and the United Kingdom. The United States has the highest GDP, followed by China and Japan. These countries have a strong economy and contribute significantly to the global market.’
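Under the hood, PandasAI has the LLM write and execute plain Pandas code against your dataframe. For the two prompts above, the generated code presumably boils down to something like `nlargest` calls; here is a hand-written sketch of the equivalent plain-Pandas queries (my reconstruction, not the library's actual output):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# "Which are the 5 happiest countries?"
happiest = df.nlargest(5, "happiness_index")["country"].tolist()

# "Which are the countries with the largest gdp?"
largest_gdp = df.nlargest(5, "gdp")["country"].tolist()
```

Both lists match the answers PandasAI returned above, minus the conversational phrasing.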

This answer even comes with a more verbose explanation. The quickstart demo also shows that you can plot, again with a slightly modified demo prompt:
Plot the histogram of countries showing for each the gpd, using different shades of blue for each bar

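Behind the scenes, a prompt like this is translated into Matplotlib code. A plausible sketch of what such generated code could look like, reconstructed by hand under the assumption of a simple bar chart colored with the Blues colormap:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
})

# One shade of blue per bar, from light to dark
shades = plt.cm.Blues(np.linspace(0.3, 0.9, len(df)))

fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(df["country"], df["gdp"], color=shades)
ax.set_ylabel("GDP")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```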

This works remarkably well. Now, I would assume that it knows the Pandas functionality backwards and forwards, so let's increase the difficulty a notch:
Make a scatterplot with gdp vs happines for each country. Add horizontal and vertical lines for the average happines and gpd.
Note that I purposefully wrote happines, a typo; in the data frame, the column name is happiness_index. Nevertheless, it works flawlessly:

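Again, here is a hand-written sketch of the kind of Matplotlib code this prompt could translate to (an assumed reconstruction, not the actual generated code):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

fig, ax = plt.subplots()
ax.scatter(df["gdp"], df["happiness_index"])
# Dashed lines marking the column averages
ax.axvline(df["gdp"].mean(), color="gray", linestyle="--")
ax.axhline(df["happiness_index"].mean(), color="gray", linestyle="--")
# Label each point with its country name
for _, row in df.iterrows():
    ax.annotate(row["country"], (row["gdp"], row["happiness_index"]))
ax.set_xlabel("GDP")
ax.set_ylabel("Happiness index")
```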

Note that PandasAI supposedly changes the dataframe in place, depending on the prompt; e.g., if your prompt is add a column with gdp divided by happines, it will do so. However, for the groupby example above, it didn't.
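To check whether a prompt really mutated your dataframe, it helps to know the equivalent plain-Pandas operation. For the column-adding prompt above, it would presumably look like this (a sketch on a shortened dataframe; the column name gdp_by_happiness is my choice):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["United States", "China", "Japan"],
    "gdp": [19294482071552, 14631844184064, 4380756541440],
    "happiness_index": [6.94, 5.12, 5.87],
})

# Plain-Pandas equivalent of "add a column with gdp divided by happines":
# assigning to a new column modifies df in place
df["gdp_by_happiness"] = df["gdp"] / df["happiness_index"]
```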

Now, what are the limitations? How about a prompt along the lines of:
Plot the histogram of countries showing for each the gpd. Make countries in Europe red and the rest blue.,
which would need additional information from outside the dataset. For me, this caused a FileNotFoundError and highlights potential limitations.

Moreover, when moving away from the example dataset, things got more brittle. The example dataset is in long format, i.e., one data point per row. When testing with a wide format (multiple data points in each row), things didn't work. Indeed, you may need to do some formatting beforehand to ensure that your data is correctly structured and compatible with the PandasAI package.
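One way to do that formatting is to reshape wide data into long format with pd.melt before handing it to PandasAI. A minimal sketch with a made-up wide table (the column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical wide-format table: one row per country, one GDP column per year
wide = pd.DataFrame({
    "country": ["France", "Germany"],
    "gdp_2020": [2.3e12, 3.2e12],
    "gdp_2021": [2.6e12, 3.6e12],
})

# Reshape to long format: one observation per row
long_df = pd.melt(wide, id_vars="country", var_name="year", value_name="gdp")
# Turn the former column names into a proper numeric year column
long_df["year"] = long_df["year"].str.replace("gdp_", "", regex=False).astype(int)
```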

PandasAI meets Streamlit

If you are curious to try out PandasAI without diving into coding, I have created a user-friendly Streamlit app that interfaces with the package. You can find the source code on GitHub, and an online version is readily available, hosted on Streamlit Share. You will need to enter an OpenAI API key, though. Give it a try and experience the seamless interaction with Pandas DataFrames through a conversational interface.

Example Streamlit interface to pandas-ai

Discussion

PandasAI exemplifies the seamless integration of large language models into established workflows and the ongoing transformation of data analysis. If you are a data analyst proficient in using libraries and your primary responsibility involves generating plots based on user specifications, there is a strong possibility that this process can be efficiently automated. The advancements in AI and conversational interfaces are revolutionizing the way we interact with data, streamlining tasks, and making data analysis more accessible than ever before.

Evidently, this will shift the focus from how to implement a certain analysis to what to analyze. One challenge with using natural language is the potential for ambiguity. For instance, we observed that the term "happiness" led to the use of the happiness index, but is this always the correct assumption? It is conceivable that the role of data analysts could evolve beyond instructing large language models on what to plot. Instead, they could rely on advanced prompts or "super-prompts" that first request the most suitable metrics for making decisions on a specific topic and then ask the AI to generate the corresponding visuals. This shift would enable a more comprehensive and nuanced approach to data analysis, harnessing the power of AI to make more informed decisions.


What are your thoughts on this topic?
Let us know in the comments below.
