datasets and are looking for quick insights without too much manual grind, you've come to the right place.
In 2025, datasets often contain millions of rows and hundreds of columns, which makes manual analysis next to impossible. Local Large Language Models can transform your raw DataFrame statistics into polished, readable reports in seconds — minutes at worst. This approach eliminates the tedious process of analyzing data by hand and writing executive reports, especially if the data structure doesn't change.
Pandas handles the heavy lifting of data extraction while LLMs convert your technical outputs into presentable reports. You'll still need to write functions that pull key statistics out of your datasets, but it's a one-time effort.
This guide assumes you have Ollama installed locally. If you don't, you can still use third-party LLM vendors, but I won't explain how to connect to their APIs.
Table of contents:
- Dataset Introduction and Exploration
- The Boring Part: Extracting Summary Statistics
- The Cool Part: Working with LLMs
- What You Could Improve
Dataset Introduction and Exploration
For this guide, I'm using the MBA admissions dataset from Kaggle. Download it if you want to follow along.
The dataset is licensed under the Apache 2.0 license, which means you can use it freely for both personal and commercial projects.
To start, you'll need a couple of Python libraries installed on your system.
Once you have everything installed, import the necessary libraries in a new script or a notebook:
import pandas as pd
from langchain_ollama import ChatOllama
from typing import Literal
Dataset loading and preprocessing
Start by loading the dataset with Pandas. This snippet loads the CSV file, prints basic information about the dataset's shape, and shows how many missing values exist in each column:
df = pd.read_csv("data/MBA.csv")
# Basic dataset info
print(f"Dataset shape: {df.shape}\n")
print("Missing value stats:")
print(df.isnull().sum())
print("-" * 25)
df.sample(5)

Since data cleaning isn't the main focus of this article, I'll keep the preprocessing minimal. The dataset only has a few missing values that need attention:
df["race"] = df["race"].fillna("Unknown")
df["admission"] = df["admission"].fillna("Deny")
That's it! Let's see how to go from this to a meaningful report next.
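If you want to see what those two fillna calls do in isolation, here's a toy frame with made-up values (not rows from the real dataset):

```python
import pandas as pd

# Toy frame mirroring the two columns we fill; the values are hypothetical
toy = pd.DataFrame({"race": ["White", None], "admission": [None, "Admit"]})

# Missing race becomes "Unknown", missing admission becomes "Deny"
toy["race"] = toy["race"].fillna("Unknown")
toy["admission"] = toy["admission"].fillna("Deny")

print(toy["race"].tolist())       # ['White', 'Unknown']
print(toy["admission"].tolist())  # ['Deny', 'Admit']
```

The same pattern applies to any categorical column where a missing value has a sensible default.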
The Boring Part: Extracting Summary Statistics
Even with all the advances in AI capability and availability, you probably don't want to send your entire dataset to an LLM provider. There are a couple of good reasons why.
It could consume way too many tokens, which translates directly to higher costs. Processing large datasets can take a long time, especially when you're running models locally on your own hardware. You may also be dealing with sensitive data that shouldn't leave your organization.
Some manual work is still the way to go.
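To put a rough number on the token argument, here's a back-of-the-envelope sketch. The four-characters-per-token ratio is a common rule of thumb, not an exact tokenizer, and the frame is synthetic:

```python
import pandas as pd

# Synthetic frame standing in for a real dataset
df_demo = pd.DataFrame({"gpa": [3.1, 3.5, 3.9] * 1000, "gmat": [650, 700, 740] * 1000})

# Option A: dump the raw CSV text into the prompt
raw_chars = len(df_demo.to_csv(index=False))

# Option B: send a short statistical summary instead
summary = f"Rows: {len(df_demo)}, mean GPA: {df_demo['gpa'].mean():.2f}"

est_raw_tokens = raw_chars // 4          # ~4 chars per token (rough assumption)
est_summary_tokens = len(summary) // 4

print(est_raw_tokens, est_summary_tokens)
```

Even on this tiny example the raw dump is orders of magnitude larger than the summary, and the gap only grows with real datasets.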
This approach requires you to write a function that extracts key elements and statistics from your Pandas DataFrame. You'll have to write this function from scratch for different datasets, but the core idea transfers easily between projects.
The get_summary_context_message() function takes in a DataFrame and returns a formatted multi-line string with a detailed summary. Here's what it includes:
- Total application count and gender distribution
- International vs domestic applicant breakdown
- GPA and GMAT score quartile statistics
- Admission rates by academic major (sorted by rate)
- Admission rates by work industry (top 8 industries)
- Work experience analysis with categorical breakdowns
- Key insights highlighting top-performing categories
Here’s the whole source code for the function:
def get_summary_context_message(df: pd.DataFrame) -> str:
    """
    Generate a comprehensive summary report of MBA admissions dataset statistics.

    This function analyzes MBA application data to provide detailed statistics on
    applicant demographics, academic performance, professional backgrounds, and
    admission rates across various categories. The summary includes gender and
    international status distributions, GPA and GMAT score statistics, admission
    rates by academic major and work industry, and work experience impact analysis.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing MBA admissions data with the following expected columns:
        - 'gender', 'international', 'gpa', 'gmat', 'major', 'work_industry', 'work_exp', 'admission'

    Returns
    -------
    str
        A formatted multi-line string containing comprehensive MBA admissions
        statistics.
    """
    # Basic application statistics
    total_applications = len(df)

    # Gender distribution
    gender_counts = df["gender"].value_counts()
    male_count = gender_counts.get("Male", 0)
    female_count = gender_counts.get("Female", 0)

    # International status
    international_count = (
        df["international"].sum()
        if df["international"].dtype == bool
        else (df["international"] == True).sum()
    )

    # GPA statistics
    gpa_data = df["gpa"].dropna()
    gpa_avg = gpa_data.mean()
    gpa_25th = gpa_data.quantile(0.25)
    gpa_50th = gpa_data.quantile(0.50)
    gpa_75th = gpa_data.quantile(0.75)

    # GMAT statistics
    gmat_data = df["gmat"].dropna()
    gmat_avg = gmat_data.mean()
    gmat_25th = gmat_data.quantile(0.25)
    gmat_50th = gmat_data.quantile(0.50)
    gmat_75th = gmat_data.quantile(0.75)

    # Major analysis - admission rates by major
    major_stats = []
    for major in df["major"].unique():
        major_data = df[df["major"] == major]
        admitted = len(major_data[major_data["admission"] == "Admit"])
        total = len(major_data)
        rate = (admitted / total) * 100
        major_stats.append((major, admitted, total, rate))

    # Sort by admission rate (descending)
    major_stats.sort(key=lambda x: x[3], reverse=True)

    # Work industry analysis - admission rates by industry
    industry_stats = []
    for industry in df["work_industry"].unique():
        if pd.isna(industry):
            continue
        industry_data = df[df["work_industry"] == industry]
        admitted = len(industry_data[industry_data["admission"] == "Admit"])
        total = len(industry_data)
        rate = (admitted / total) * 100
        industry_stats.append((industry, admitted, total, rate))

    # Sort by admission rate (descending)
    industry_stats.sort(key=lambda x: x[3], reverse=True)

    # Work experience analysis
    work_exp_data = df["work_exp"].dropna()
    avg_work_exp_all = work_exp_data.mean()

    # Work experience for admitted students
    admitted_students = df[df["admission"] == "Admit"]
    admitted_work_exp = admitted_students["work_exp"].dropna()
    avg_work_exp_admitted = admitted_work_exp.mean()

    # Work experience ranges analysis
    def categorize_work_exp(exp):
        if pd.isna(exp):
            return "Unknown"
        elif exp < 2:
            return "0-1 years"
        elif exp < 4:
            return "2-3 years"
        elif exp < 6:
            return "4-5 years"
        elif exp < 8:
            return "6-7 years"
        else:
            return "8+ years"

    df["work_exp_category"] = df["work_exp"].apply(categorize_work_exp)

    work_exp_category_stats = []
    for category in ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]:
        category_data = df[df["work_exp_category"] == category]
        if len(category_data) > 0:
            admitted = len(category_data[category_data["admission"] == "Admit"])
            total = len(category_data)
            rate = (admitted / total) * 100
            work_exp_category_stats.append((category, admitted, total, rate))

    # Build the summary message
    summary = f"""MBA Admissions Dataset Summary (2025)

Total Applications: {total_applications:,} people applied to the MBA program.

Gender Distribution:
- Male applicants: {male_count:,} ({male_count/total_applications*100:.1f}%)
- Female applicants: {female_count:,} ({female_count/total_applications*100:.1f}%)

International Status:
- International applicants: {international_count:,} ({international_count/total_applications*100:.1f}%)
- Domestic applicants: {total_applications-international_count:,} ({(total_applications-international_count)/total_applications*100:.1f}%)

Academic Performance Statistics:

GPA Statistics:
- Average GPA: {gpa_avg:.2f}
- 25th percentile: {gpa_25th:.2f}
- 50th percentile (median): {gpa_50th:.2f}
- 75th percentile: {gpa_75th:.2f}

GMAT Statistics:
- Average GMAT: {gmat_avg:.0f}
- 25th percentile: {gmat_25th:.0f}
- 50th percentile (median): {gmat_50th:.0f}
- 75th percentile: {gmat_75th:.0f}

Major Analysis - Admission Rates by Academic Background:"""

    for major, admitted, total, rate in major_stats:
        summary += (
            f"\n- {major}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += (
        "\n\nWork Industry Analysis - Admission Rates by Professional Background:"
    )

    # Show top 8 industries by admission rate
    for industry, admitted, total, rate in industry_stats[:8]:
        summary += (
            f"\n- {industry}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    summary += "\n\nWork Experience Impact on Admissions:\n\nOverall Work Experience Comparison:"
    summary += (
        f"\n- Average work experience (all applicants): {avg_work_exp_all:.1f} years"
    )
    summary += f"\n- Average work experience (admitted students): {avg_work_exp_admitted:.1f} years"

    summary += "\n\nAdmission Rates by Work Experience Range:"
    for category, admitted, total, rate in work_exp_category_stats:
        summary += (
            f"\n- {category}: {admitted}/{total} admitted ({rate:.1f}% admission rate)"
        )

    # Key insights
    best_major = major_stats[0]
    best_industry = industry_stats[0]

    summary += "\n\nKey Insights:"
    summary += (
        f"\n- Highest admission rate by major: {best_major[0]} at {best_major[3]:.1f}%"
    )
    summary += f"\n- Highest admission rate by industry: {best_industry[0]} at {best_industry[3]:.1f}%"

    if avg_work_exp_admitted > avg_work_exp_all:
        summary += f"\n- Admitted students have slightly more work experience on average ({avg_work_exp_admitted:.1f} vs {avg_work_exp_all:.1f} years)"
    else:
        summary += "\n- Work experience shows minimal difference between admitted and all applicants"

    return summary
Once you've defined the function, simply call it and print the results:
print(get_summary_context_message(df))
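As an aside, the categorize_work_exp helper inside the function could also be written with pandas' built-in pd.cut — a sketch of the same binning:

```python
import pandas as pd

labels = ["0-1 years", "2-3 years", "4-5 years", "6-7 years", "8+ years"]
exp = pd.Series([0.5, 2.5, 4.5, 6.5, 9.0])

# right=False makes the bins half-open: [0, 2), [2, 4), [4, 6), [6, 8), [8, inf)
cats = pd.cut(exp, bins=[0, 2, 4, 6, 8, float("inf")], right=False, labels=labels)
print(list(cats))  # ['0-1 years', '2-3 years', '4-5 years', '6-7 years', '8+ years']
```

One difference: pd.cut leaves missing values as NaN rather than mapping them to "Unknown", so you'd still handle those separately.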

Now let’s move on to the fun part.
The Cool Part: Working with LLMs
This is where things get interesting and your manual data extraction work pays off.
Python helper function for working with LLMs
If you have decent hardware, I strongly recommend using local LLMs for simple tasks like this. I use Ollama and the latest version of the Mistral model for the actual LLM processing.

If you want to use something like ChatGPT through the OpenAI API, you can still do that. You'll just need to modify the function below to set up your API key and return the appropriate instance from LangChain.
Regardless of which option you choose, a call to get_llm() with a test message shouldn't return an error:
def get_llm(model_name: str = "mistral:latest") -> ChatOllama:
    """
    Create and configure a ChatOllama instance for local LLM inference.

    This function initializes a ChatOllama client configured to connect to a
    local Ollama server. The client is set up with deterministic output
    (temperature=0) for consistent responses across multiple calls with the
    same input.

    Parameters
    ----------
    model_name : str, optional
        The name of the Ollama model to use for chat completions.
        Must be a valid model name that is available on the local Ollama
        installation. Default is "mistral:latest".

    Returns
    -------
    ChatOllama
        A configured ChatOllama instance ready for chat completions.
    """
    return ChatOllama(
        model=model_name, base_url="http://localhost:11434", temperature=0
    )
print(get_llm().invoke("test").content)

Summarization prompt
This is where you can get creative and write ultra-specific instructions for your LLM. I've decided to keep things light for demonstration purposes, but feel free to experiment here.
There's no single right or wrong prompt.
Whatever you do, make sure to include the format arguments using curly brackets – these values will be filled in dynamically later:
SUMMARIZE_DATAFRAME_PROMPT = """
You are an expert data analyst and data summarizer. Your task is to take in complex datasets
and return user-friendly descriptions and findings.

You were given this dataset:
- Name: {dataset_name}
- Source: {dataset_source}

This dataset was analyzed in a pipeline before it was given to you.
These are the findings returned by the analysis pipeline:

{context}

Based on these findings, write a detailed report in {report_format} format.
Give the report a meaningful title and separate findings into sections with headings and subheadings.
Output only the report in {report_format} and nothing else.

Report:
"""
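Because str.format raises a KeyError on any placeholder you forget to pass, it's worth a quick dry run with dummy values before wiring in the LLM. The shortened template below stands in for the full prompt:

```python
# Abbreviated stand-in for SUMMARIZE_DATAFRAME_PROMPT, same placeholders
TEMPLATE = (
    "Dataset: {dataset_name} (source: {dataset_source})\n"
    "Findings:\n{context}\n"
    "Write the report in {report_format} format."
)

filled = TEMPLATE.format(
    dataset_name="MBA Admissions (2025)",
    dataset_source="Kaggle",
    context="(pipeline findings go here)",
    report_format="markdown",
)

# No unfilled placeholders should remain
assert "{" not in filled and "}" not in filled
```

Note that any literal curly brackets you want in the prompt itself must be doubled ({{ and }}) so str.format doesn't treat them as placeholders.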
Summarization Python function
With the prompt and the get_llm() function declared, the only thing left is to connect the dots. The get_report_summary() function takes in arguments that will fill the format placeholders in the prompt, then invokes the LLM with that prompt to generate a report.
You can choose between Markdown and HTML formats:
def get_report_summary(
    dataset: pd.DataFrame,
    dataset_name: str,
    dataset_source: str,
    report_format: Literal["markdown", "html"] = "markdown",
) -> str:
    """
    Generate an AI-powered summary report from a pandas DataFrame.

    This function analyzes a dataset and generates a comprehensive summary report
    using a large language model (LLM). It first extracts statistical context
    from the dataset, then uses an LLM to create a human-readable report in the
    specified format.

    Parameters
    ----------
    dataset : pd.DataFrame
        The pandas DataFrame to analyze and summarize.
    dataset_name : str
        A descriptive name for the dataset that will be included in the
        generated report for context and identification.
    dataset_source : str
        Information about the source or origin of the dataset.
    report_format : {"markdown", "html"}, optional
        The desired output format for the generated report. Options are:
        - "markdown" : Generate the report in Markdown format (default)
        - "html" : Generate the report in HTML format

    Returns
    -------
    str
        A formatted summary report.
    """
    context_message = get_summary_context_message(df=dataset)
    prompt = SUMMARIZE_DATAFRAME_PROMPT.format(
        dataset_name=dataset_name,
        dataset_source=dataset_source,
        context=context_message,
        report_format=report_format,
    )
    return get_llm().invoke(input=prompt).content
Using the function is easy – just pass in the dataset, its name, and source. The report format defaults to Markdown:
md_report = get_report_summary(
dataset=df,
dataset_name="MBA Admissions (2025)",
dataset_source="https://www.kaggle.com/datasets/taweilo/mba-admission-dataset"
)
print(md_report)

The HTML report is just as detailed, but could use some styling. Perhaps you could ask the LLM to handle that as well!

What You Could Improve
I could have easily turned this into a 30-minute read by optimizing every detail of the pipeline, but I kept it simple for demonstration purposes. You don't have to (and shouldn't) stop here though.
Here are the things you can improve to make this pipeline even more powerful:
- Write a function that saves the report (Markdown or HTML) directly to disk. This way you can automate the entire process and generate reports on a schedule without manual intervention.
- In the prompt, ask the LLM to add CSS styling to the HTML report to make it look more presentable. You could even provide your organization's brand colors and fonts to maintain consistency across all your data reports.
- Expand the prompt to follow more specific instructions. You might want reports that focus on specific business metrics, follow a specific template, or include recommendations based on the findings.
- Expand the get_llm() function so it can connect both to Ollama and to other vendors like OpenAI, Anthropic, or Google. This gives you the flexibility to switch between local and cloud-based models depending on your needs.
- Do literally anything in the get_summary_context_message() function, since it serves as the foundation for all context data provided to the LLM. This is where you can get creative with feature engineering, statistical analysis, and data insights that matter to your specific use case.
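The first improvement — saving reports to disk — can be sketched like this; the filename scheme and the reports directory are my own choices, not from the pipeline above:

```python
from datetime import datetime
from pathlib import Path


def save_report(
    report: str,
    name: str,
    report_format: str = "markdown",
    out_dir: str = "reports",
) -> Path:
    """Write a generated report to disk and return its path (sketch)."""
    ext = "md" if report_format == "markdown" else "html"
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the folder on first run
    stamp = datetime.now().strftime("%Y-%m-%d_%H%M%S")
    target = out / f"{name}_{stamp}.{ext}"
    target.write_text(report, encoding="utf-8")
    return target
```

Calling save_report(md_report, "mba_admissions") would drop a timestamped .md file into reports/, ready to be run on a schedule with cron or a task runner.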
I hope this minimal example has set you on the right track to automate your own data reporting workflows.