LLMs + Pandas: How I Use Generative AI to Generate Pandas DataFrame Summaries

If you work with large datasets and want quick insights without too much manual grind, you’ve come to the right place.

In 2025, datasets often contain tens of millions of rows and hundreds of columns, which makes manual analysis next to impossible. Local Large Language Models can transform your raw DataFrame statistics into polished, readable reports in seconds, or minutes at worst. This approach eliminates the tedious process of analyzing data by hand and writing executive reports, especially if the data structure doesn’t change.

Pandas handles the heavy lifting of data extraction while LLMs convert your technical outputs into presentable reports. You’ll still need to write functions that pull key statistics out of your datasets, but it’s a one-time effort.

This guide assumes you have Ollama installed locally. If you don’t, you can still use third-party LLM vendors, but I won’t explain how to connect to their APIs.

Table of contents:

  • Dataset Introduction and Exploration
  • The Boring Part: Extracting Summary Statistics
  • The Cool Part: Working with LLMs
  • What You Could Improve

Dataset Introduction and Exploration

For this guide, I’m using the MBA admissions dataset from Kaggle. Download it if you want to follow along.

The dataset is licensed under the Apache 2.0 license, which means you can use it freely for both personal and commercial projects.

To start, you’ll need a couple of Python libraries installed on your system.

Image 1 – Required Python libraries and versions (image by author)
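If you’re starting from scratch, a single pip command covers them (a minimal sketch; pin the versions shown in Image 1 if you need exact reproducibility):

pip install pandas langchain-ollama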

Once you have everything installed, import the necessary libraries in a new script or notebook:

import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal

Dataset loading and preprocessing

Start by loading the dataset with Pandas. This snippet loads the CSV file, prints basic information about the dataset shape, and shows how many missing values exist in each column:

df = pd.read_csv("data/MBA.csv")

# Basic dataset info
print(f"Dataset shape: {df.shape}\n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)
Image 2 – Basic dataset statistics (image by author)

Since data cleaning isn’t the main focus of this article, I’ll keep the preprocessing minimal. The dataset only has a couple of missing values that need attention:

df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")
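As a quick sanity check, you can confirm that no missing values remain (this assumes race and admission were the only affected columns, as shown in Image 2):

print(df.isnull().sum().sum())  # should print 0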

That’s it! Let’s see how to go from this to a meaningful report next.

The Boring Part: Extracting Summary Statistics

Even with all the advances in AI capability and availability, you probably don’t want to send your entire dataset to an LLM provider. There are a couple of good reasons why.

It would eat way too many tokens, which translates directly to higher costs. Processing large datasets can take a long time, especially when you’re running models locally on your own hardware. You might also be dealing with sensitive data that shouldn’t leave your organization.

Some manual work is still the way to go.
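To put the token cost in perspective, here’s a rough back-of-the-envelope check; a sketch assuming the common heuristic of roughly 4 characters per token (real tokenizers will differ):

# Serializing the full dataset vs. sending a compact summary
raw_text = df.to_csv(index=False)
print(f"Raw dataset as text: ~{len(raw_text) // 4:,} tokens")

# For comparison, run the same check on the output of
# get_summary_context_message() once it's defined below:
# summary = get_summary_context_message(df)
# print(f"Summary: ~{len(summary) // 4:,} tokens")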

This approach requires you to write a function that extracts key elements and statistics from your Pandas DataFrame. You’ll have to write this function from scratch for different datasets, but the core idea transfers easily between projects.

The get_summary_context_message() function takes in a DataFrame and returns a formatted multi-line string with a detailed summary. Here’s what it includes:

  • Total application count and gender distribution
  • International vs domestic applicant breakdown
  • GPA and GMAT score quartile statistics
  • Admission rates by academic major (sorted by rate)
  • Admission rates by work industry (top 8 industries)
  • Work experience analysis with categorical breakdowns
  • Key insights highlighting top-performing categories

Here’s the complete source code for the function:

def get_summary_context_message(df: pd.DataFrame) -> str:
    """
    Generate a comprehensive summary report of MBA admissions dataset statistics.
    
    This function analyzes MBA application data to provide detailed statistics on
    applicant demographics, academic performance, skilled backgrounds, and
    admission rates across various categories. The summary includes gender and
    international status distributions, GPA and GMAT score statistics, admission
    rates by academic major and work industry, and work experience impact analysis.
    
    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing MBA admissions data with the following expected columns:
        - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'
    
    Returns
    -------
    str
        A formatted multi-line string containing comprehensive MBA admissions
        statistics.
    """
    # Basic application statistics
    total_applications = len(df)

    # Gender distribution
    gender_counts = df["gender"].value_counts()
    male_count = gender_counts.get("Male", 0)
    female_count = gender_counts.get("Female", 0)

    # International status
    international_count = (
        df["international"].sum()
        if df["international"].dtype == bool
        else (df["international"] == True).sum()
    )

    # GPA statistics
    gpa_data = df["gpa"].dropna()
    gpa_avg = gpa_data.mean()
    gpa_25th = gpa_data.quantile(0.25)
    gpa_50th = gpa_data.quantile(0.50)
    gpa_75th = gpa_data.quantile(0.75)

    # GMAT statistics
    gmat_data = df["gmat"].dropna()
    gmat_avg = gmat_data.mean()
    gmat_25th = gmat_data.quantile(0.25)
    gmat_50th = gmat_data.quantile(0.50)
    gmat_75th = gmat_data.quantile(0.75)

    # Major analysis - admission rates by major
    major_stats = []
    for major in df["major"].unique():
        major_data = df[df["major"] == major]
        admitted = len(major_data[major_data["admission"] == "Admit"])
        total = len(major_data)
        rate = (admitted / total) * 100
        major_stats.append((major, admitted, total, rate))

    # Sort by admission rate (descending)
    major_stats.sort(key=lambda x: x[3], reverse=True)

    # Work industry analysis - admission rates by industry
    industry_stats = []
    for industry in df["work_industry"].unique():
        if pd.isna(industry):
            continue
        industry_data = df[df["work_industry"] == industry]
        admitted = len(industry_data[industry_data["admission"] == "Admit"])
        total = len(industry_data)
        rate = (admitted / total) * 100
        industry_stats.append((industry, admitted, total, rate))

    # Sort by admission rate (descending)
    industry_stats.sort(key=lambda x: x[3], reverse=True)

    # Work experience analysis
    work_exp_data = df["work_exp"].dropna()
    avg_work_exp_all = work_exp_data.mean()

    # Work experience for admitted students
    admitted_students = df[df["admission"] == "Admit"]
    admitted_work_exp = admitted_students["work_exp"].dropna()
    avg_work_exp_admitted = admitted_work_exp.mean()

    # Work experience ranges analysis
    def categorize_work_exp(exp):
        if pd.isna(exp):
            return "Unknown"
        elif exp < 2:
            return "0-1 years"
        elif exp < 4:
            return "2-3 years"
        elif exp < 6:
            return "4-5 years"
        elif exp < 8:
            return "6-7 years"
        else:
            return "8+ years"

    df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)
    work_exp_category_stats = []

    for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
        category_data = df[df["work_exp_category"] == category]
        if len(category_data) > 0:
            admitted = len(category_data[category_data["admission"] == "Admit"])
            total = len(category_data)
            rate = (admitted / total) * 100
            work_exp_category_stats.append((category, admitted, total, rate))

    # Construct the summary message
    summary = f"""MBA Admissions Dataset Summary (2025)
    
Total Applications: {total_applications:,} people applied to the MBA program.

Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)

International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)

Academic Performance Statistics:

GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}

GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}

Major Analysis - Admission Rates by Academic Background:"""

    for major, admitted, total, rate in major_stats:
        summary += (
            f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += (
        "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
    )

    # Show top 8 industries by admission rate
    for industry, admitted, total, rate in industry_stats[:8]:
        summary += (
            f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
    summary += (
        f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
    )
    summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

    summary += "\n\nAdmission Rates by Work Experience Range:"
    for category, admitted, total, rate in work_exp_category_stats:
        summary += (
            f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    # Key insights
    best_major = major_stats[0]
    best_industry = industry_stats[0]

    summary += "\n\nKey Insights:"
    summary += (
        f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
    )
    summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

    if avg_work_exp_admitted > avg_work_exp_all:
        summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
    else:
        summary += "\n- Work experience shows minimal difference between admitted and all applicants"

    return summary

Once you’ve defined the function, simply call it and print the results:

print(get_summary_context_message(df))
Image 3 – Extracted findings and statistics from the dataset (image by author)

Now let’s move on to the fun part.

The Cool Part: Working with LLMs

This is where things get interesting and your manual data extraction work pays off.

Python helper function for working with LLMs

If you have decent hardware, I strongly recommend using local LLMs for simple tasks like this. I use Ollama and the latest version of the Mistral model for the actual LLM processing.

Image 4 – Available Ollama models (image by author)

If you want to use something like ChatGPT through the OpenAI API, you can still do that. You’ll just need to modify the function below to set up your API key and return the appropriate instance from LangChain.
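As an illustration, a minimal sketch of an OpenAI-backed equivalent could look like this (it assumes the langchain-openai package is installed and the OPENAI_API_KEY environment variable is set; the model name is only an example):

from langchain_openai import ChatOpenAI

def get_openai_llm(model_name: str = "gpt-4o-mini") -> ChatOpenAI:
    """Create a ChatOpenAI instance with deterministic output."""
    # The client reads the API key from the OPENAI_API_KEY environment variable
    return ChatOpenAI(model=model_name, temperature=0)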

Whichever option you choose, a call to get_llm() with a test message shouldn’t return an error:

def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
    """
    Create and configure a ChatOllama instance for local LLM inference.
    
    This function initializes a ChatOllama client configured to connect to a
    local Ollama server. The client is set up with deterministic output
    (temperature=0) for consistent responses across multiple calls with the
    same input.
    
    Parameters
    ----------
    model_name : str, optional
        The name of the Ollama model to use for chat completions.
        Must be a valid model name available on the local Ollama
        installation. Default is "mistral:latest".
    
    Returns
    -------
    ChatOllama
        A configured ChatOllama instance ready for chat completions.
    """
    return ChatOllama(
        model=model_name, base_url="http://localhost:11434", temperature=0
    )


print(get_llm().invoke("test").content)
Image 5 – LLM test message (image by author)

Summarization prompt

This is where you can get creative and write ultra-specific instructions for your LLM. I’ve decided to keep things light for demonstration purposes, but feel free to experiment here.

There’s no single right or wrong prompt.

Whatever you do, make sure to include the format arguments in curly brackets; these values will be filled in dynamically later:

SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.

You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}

This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:


{context}


Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.

Report:
"""

Summarization Python function

With the prompt and the get_llm() function declared, the only thing left is to connect the dots. The get_report_summary() function takes in arguments that fill the format placeholders in the prompt, then invokes the LLM with that prompt to generate a report.

You can choose between Markdown and HTML formats:

def get_report_summary(
    dataset: pd.DataFrame,
    dataset_name: str,
    dataset_source: str,
    report_format: Literal["markdown", "html"] = "markdown",
) -> str:
    """
    Generate an AI-powered summary report from a pandas DataFrame.
    
    This function analyzes a dataset and generates a comprehensive summary report
    using a large language model (LLM). It first extracts statistical context
    from the dataset, then uses an LLM to create a human-readable report in the
    specified format.
    
    Parameters
    ----------
    dataset : pd.DataFrame
        The pandas DataFrame to research and summarize.
    dataset_name : str
        A descriptive name for the dataset that will be included in the
        generated report for context and identification.
    dataset_source : str
        Information about the source or origin of the dataset.
    report_format : {"markdown", "html"}, optional
        The desired output format for the generated report. Options are:
        - "markdown" : Generate report in Markdown format (default)
        - "html" : Generate report in HTML format
    
    Returns
    -------
    str
        A formatted summary report.
    
    """
    context_message = get_summary_context_message(df=dataset)
    prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
        dataset_name=dataset_name,
        dataset_source=dataset_source,
        context=context_message,
        report_format=report_format,
    )
    return get_llm().invoke(input=prompt).content

Using the function is simple: just pass in the dataset, its name, and source. The report format defaults to Markdown:

md_report = get_report_summary(
    dataset=df, 
    dataset_name="MBA Admissions (2025)",
    dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset"
)
print(md_report)
Image 6 – Final report in Markdown format (image by author)
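Generating the HTML version only requires changing the report_format argument:

html_report = get_report_summary(
    dataset=df,
    dataset_name="MBA Admissions (2025)",
    dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset",
    report_format="html",
)
print(html_report)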

The HTML report is just as detailed, but could use some styling. Perhaps you could ask the LLM to handle that as well!

Image 7 – Final report in HTML format (image by author)

What You Could Improve

I could have easily turned this into a 30-minute read by optimizing every detail of the pipeline, but I kept it simple for demonstration purposes. You don’t have to (and shouldn’t) stop here, though.

Here are the things you can improve to make this pipeline even more powerful:

  • Write a function that saves the report (Markdown or HTML) directly to disk (see the sketch after this list). This way you can automate the entire process and generate reports on a schedule without manual intervention.
  • In the prompt, ask the LLM to add CSS styling to the HTML report to make it look more presentable. You could even provide your organization’s brand colors and fonts to maintain consistency across all your data reports.
  • Expand the prompt to follow more specific instructions. You might want reports that focus on specific business metrics, follow a particular template, or include recommendations based on the findings.
  • Expand the get_llm() function so it can connect to both Ollama and other vendors like OpenAI, Anthropic, or Google. This gives you the flexibility to switch between local and cloud-based models depending on your needs.
  • Do literally anything in the get_summary_context_message() function, since it serves as the foundation for all context data provided to the LLM. This is where you can get creative with feature engineering, statistical analysis, and data insights that matter to your specific use case.
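For the first item on the list, here’s a minimal sketch of what such a save helper could look like (save_report() is a hypothetical name, not something defined earlier in this guide):

from pathlib import Path

def save_report(report: str, file_name: str, report_format: str = "markdown") -> Path:
    """Hypothetical helper: write a generated report to disk and return its path."""
    extension = "md" if report_format == "markdown" else "html"
    path = Path(f"{file_name}.{extension}")
    path.write_text(report, encoding="utf-8")
    return path

save_report(md_report, "mba_admissions_report")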

I hope this minimal example has set you on the right track to automate your own data reporting workflows.
