
Meet PandasAI: Supercharge Your Data Analysis With AI

A glimpse into conversational data analysis with natural language

Photo by Luke Chesser on Unsplash

The swift progress of large language models, including fascinating applications like ChatGPT, continually showcases the remarkable capabilities of this technology, with revolutionary use cases arising every day. In this article, we take a closer look at PandasAI, a conversational library that lets you literally talk to your data, and explore some compelling examples.

Introduction

The landscape of writing computer code has undergone a dramatic transformation. Gone are the days when programmers spent countless hours scouring Google for answers and sifting through Stack Overflow forum entries. We have now entered the era of AI-assisted coding, which has significantly accelerated the development process. First, there were tools like GitHub Copilot, which let you type a docstring and receive eerily good code suggestions; with ChatGPT, you can simply describe what you want to do, and it will produce script-length code output. The chat functionality also streamlines troubleshooting, enabling quick identification and resolution of error messages.

The future of programming is poised for a major shift: instead of mastering abstract programming language concepts, users will soon communicate with computers and data in conversational language. Current advancements in libraries suggest that this transformation is already underway. For instance, LangChain, a highly popular utility library for building tools around large language models, features a showcase of the Pandas DataFrame Agent. Users can load their Pandas DataFrame, ask a question, and the agent will run the relevant code and provide a comprehensive response. For those unfamiliar with Pandas, it is a widely used Python library for handling tabular data. The following example from the documentation illustrates this interaction:

from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
import pandas as pd

df = pd.read_csv('titanic.csv')

agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
agent.run("how many rows are there?")

This will produce the following output:

> Entering new AgentExecutor chain...
Thought: I need to count the number of rows
Action: python_repl_ast
Action Input: len(df)
Observation: 891
Thought: I now know the final answer
Final Answer: There are 891 rows in the dataframe.

> Finished chain.

A package built on LangChain for this purpose is yolopandas.
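For orientation, here is a rough sketch of how yolopandas is used, based on its README at the time of writing (importing the package adds an llm accessor to DataFrames, and an OPENAI_API_KEY environment variable is expected); check the repository for the current API:

import pandas as pd
import yolopandas  # importing monkey-patches DataFrames with an .llm accessor

df = pd.read_csv('titanic.csv')

# yolo=False asks for confirmation before the generated code is executed
df.llm.query("how many rows are there?", yolo=False)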

Introducing PandasAI

PandasAI is another package designed to provide a conversational interface for Pandas DataFrames. In only a few days, it gained considerable popularity on GitHub, amassing 3.6k stars — a noteworthy achievement, considering the original Pandas package has around 38k stars. What sets PandasAI apart is its ease of installation via pip, allowing users to get started with just a few lines of code. For instance, here is how I set it up in my conda environment on an M1 Mac, alongside Jupyter:

conda create --name pandasai python=3.10
conda activate pandasai
pip install pandasai
conda install jupyter

If you get a 500 error when launching Jupyter, it is probably this bug, and the solution is pinning the dependency: pip install --force-reinstall charset-normalizer==3.1.0.

Now, let's follow the quickstart example from the documentation. Here, we use the OpenAI API, but in principle, you can also use OpenAssistant.

import pandas as pd
from pandasai import PandasAI

# Sample DataFrame
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

# Instantiate a LLM
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_OPENAI_API_KEY")

pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt='Which are the 5 happiest countries?')

This worked like a charm:

‘According to the data, the top 5 happiest countries are Canada, Australia, United Kingdom, Germany, and United States.’

If we rephrase the prompt to:
Which are the countries with the largest gpd?,
the result was this:

‘The countries with the largest GDP are the USA, China, Japan, Germany, and the UK. The US has the highest GDP, followed by China and Japan. These countries have a strong economy and contribute significantly to the global market.’

This answer even comes with a more verbose explanation. The quickstart demo also shows that you can plot, again with a slightly modified demo prompt:
Plot the histogram of countries showing for each the gpd, using different shades of blue for each bar
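The call itself is the same pandas_ai.run pattern as in the quickstart, sketched here with the df and pandas_ai objects from above:

pandas_ai.run(
    df,
    prompt="Plot the histogram of countries showing for each the gpd, "
           "using different shades of blue for each bar"
)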

(Resulting plot: bar chart of GDP per country, with each bar in a different shade of blue)

This works remarkably well. By now I would assume that it knows the Pandas functionality inside out, so let's increase the difficulty a notch:
Make a scatterplot with gdp vs happines for each country. Add horizontal and vertical lines for the average happines and gpd.
Note that I purposefully wrote happines — a typo, while in the data frame the column name is happiness_index. Nevertheless, it works flawlessly:

(Resulting plot: scatterplot of GDP vs. happiness per country, with horizontal and vertical lines marking the averages)

Note that PandasAI supposedly changes the dataframe in place, based on the prompts; e.g., if your prompt is add a column with gdp divided by happines, it will do so. However, for the groupby example above, it didn't.
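A minimal sketch of checking such an in-place change, again reusing the quickstart objects (whether the new column actually appears depends on the code PandasAI generates):

# Ask PandasAI to add a derived column, then inspect the DataFrame
pandas_ai.run(df, prompt="add a column with gdp divided by happines")
print(df.columns)  # check whether a new column showed up
print(df.head())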

Now, what are the limitations? How about a prompt along the lines of:
Plot the histogram of countries showing for each the gpd. Make countries in Europe red and the rest blue.
This would need additional information from outside. For me, it caused a FileNotFoundError and highlights potential limitations.
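One possible workaround (not part of the original example) is to inject the missing outside knowledge into the DataFrame yourself before prompting; the in_europe column below is a hypothetical helper:

# Supply the outside knowledge explicitly via a (hypothetical) helper column
european = {"United Kingdom", "France", "Germany", "Italy", "Spain"}
df["in_europe"] = df["country"].isin(european)

pandas_ai.run(
    df,
    prompt="Plot the histogram of countries showing for each the gpd. "
           "Make countries where in_europe is True red and the rest blue."
)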

Moreover, when moving away from the example dataset, things got more brittle. The example dataset is in long format, i.e., one data point per row. When testing with a wide format (multiple data points in each row), things didn't work. Indeed, you may have to do some reformatting beforehand to make sure that your data is correctly structured and compatible with the PandasAI package.
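For illustration, here is a minimal sketch (with made-up wide-format data) of reshaping into the long format that worked well above, using plain Pandas before handing the frame to PandasAI:

import pandas as pd

# Hypothetical wide-format data: one row per metric, one column per country
wide = pd.DataFrame({
    "metric": ["gdp", "happiness_index"],
    "France": [2411255037952, 6.66],
    "Germany": [3435817336832, 7.07],
})

# Melt to long format, then pivot so that each row is one country
long = wide.melt(id_vars="metric", var_name="country", value_name="value")
tidy = long.pivot(index="country", columns="metric", values="value").reset_index()
print(tidy)  # columns: country, gdp, happiness_index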

PandasAI meets Streamlit

If you're interested in trying out PandasAI without diving into coding, I've created a user-friendly Streamlit app that interfaces with the package. You can find the source code on GitHub, and an online version is readily available, hosted on Streamlit Share. You will have to enter an OpenAI API key, though. Give it a try and experience the seamless interaction with Pandas DataFrames through a conversational interface.
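To give an idea of how little code such a wrapper needs, here is a minimal sketch of a Streamlit app around PandasAI (not the actual app from the repository, just an illustration):

# minimal_pandasai_app.py, run with: streamlit run minimal_pandasai_app.py
import pandas as pd
import streamlit as st
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI

st.title("Chat with your DataFrame")

api_key = st.text_input("OpenAI API key", type="password")
uploaded = st.file_uploader("Upload a CSV file", type="csv")
prompt = st.text_area("What would you like to know about your data?")

if st.button("Run") and api_key and uploaded and prompt:
    df = pd.read_csv(uploaded)
    pandas_ai = PandasAI(OpenAI(api_token=api_key))
    st.write(pandas_ai.run(df, prompt=prompt))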

Example Streamlit interface to pandas-ai

Discussion

PandasAI exemplifies the seamless integration of large language models into established workflows and the ongoing transformation of data analysis. If you're a data analyst proficient in using such libraries and your primary responsibility involves generating plots based on user specifications, there is a strong possibility that this process can be efficiently automated. The advancements in AI and conversational interfaces are revolutionizing the way we interact with data, streamlining tasks, and making data analysis more accessible than ever before.

Evidently, this will shift the focus from how to implement a certain analysis to what to analyze. One challenge with using natural language is the potential for ambiguity. For instance, we observed that the term "happiness" led to the use of the happiness index, but is this always the right assumption? It is conceivable that the role of data analysts could evolve beyond instructing large language models on what to plot. Instead, they could rely on advanced prompts or "super-prompts" that first request the most suitable metrics for making decisions on a particular topic and then ask the AI to generate the corresponding visuals. This shift would enable a more comprehensive and nuanced approach to data analysis, harnessing the power of AI to make more informed decisions.
