Generating Data Dictionary for Excel Files Using OpenPyxl and AI Agents


At every company I have worked for to this day, there it was: the resilient MS Excel.

Excel was first released in 1985 and has remained strong ever since. It has survived the rise of relational databases, the evolution of many programming languages, the Web with its infinite variety of online applications, and now it is surviving the era of AI as well.

Phew!

Do you have any doubts about how resilient Excel is? I don't.

I believe the reason for that is its versatility. Consider this example: we're at work, in a meeting, and suddenly the leadership shares a CSV file and asks for a quick calculation or a few calculated numbers. Now, the options are:

1. Open an IDE (or a notebook) and start coding like crazy to generate a simple matplotlib graphic;

2. Open Power BI, import the data, and start building a report with dynamic graphics;

3. Open the CSV in Excel, write a couple of formulas, and create a graphic.

I can't speak for you, but time and again I go for option 3, especially because Excel files are compatible with everything, easily shareable, and beginner-friendly.

I'm saying all of this as an introduction to make my point that I don't think Excel files are going away anytime soon, even with the fast development of AI. Many will love that; many will hate it.

So, my goal here was to leverage AI to make Excel files better documented. One of the main complaints of data teams about Excel is the lack of best practices and reproducibility, given that columns can have any names and data types, but zero documentation.

So, I created an AI Agent that reads the Excel file and generates this small piece of documentation. Here is how it works:

  1. The Excel file is converted to CSV and fed into the Large Language Model (LLM).
  2. The AI Agent generates the data dictionary with column information (variable name, data type, description).
  3. The data dictionary gets added as comments to the Excel file's header.
  4. The output file is saved with the comments.

Okay. Hands-on now. Let's get that done in this tutorial.

Code

Let’s code! | Image generated by AI. Meta Llama, 2025. https://meta.ai

We'll begin by setting up a virtual environment. Create a venv with the tool of your choice, such as Poetry, Python venv, Anaconda, or UV. I really like UV, as it's the fastest and simplest, in my opinion. If you have UV installed [5], open a terminal and create your venv.

uv init data-docs
cd data-docs
uv venv
uv add streamlit openpyxl pandas agno mcp google-genai

Now, let us import the necessary modules. This project was created with Python 3.12.1, but I believe Python 3.9 or higher should do the trick. We'll use:

  • Agno: for AI Agent management;
  • OpenPyxl: for the manipulation of Excel files;
  • Streamlit: for the front-end interface;
  • Pandas, OS, JSON, Dedent, and Google Genai as support modules.
# Imports
import os
import json
import streamlit as st
from textwrap import dedent

from agno.agent import Agent
from agno.models.google import Gemini
from agno.tools.file import FileTools

from openpyxl import load_workbook
from openpyxl.comments import Comment
import pandas as pd

Great. The next step is creating the functions we'll need to handle the Excel files and to create the AI Agent.

Notice that all the functions have detailed docstrings. That is intentional, because LLMs use docstrings to understand what a given function does and to decide whether or not to use it as a tool.

So, if you're using Python functions as tools for an AI Agent, make sure to write detailed docstrings. Nowadays, with free copilots such as Windsurf [6], it's even easier to create them.
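As an illustration of why this matters, Python exposes a function's docstring at runtime, and that text is essentially what a tool-calling framework can pass to the model. A minimal sketch (not Agno's actual internals):

```python
import inspect

def convert_to_csv(file_path: str):
    """
    Use this tool to convert the Excel file to CSV.

    * file_path: Path to the Excel file to be converted
    """

# Roughly the description a tool-calling framework can surface to the LLM:
tool_description = inspect.getdoc(convert_to_csv)
print(tool_description)
```

A vague docstring here means the model has less to go on when deciding whether to call the tool.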

Converting the file to CSV

This function will:

  • Take the Excel file and read only the first 10 rows. That is enough for us to send to the LLM. It also prevents us from sending too many tokens as input and making this agent too expensive.
  • Save the file as CSV to use as input for the AI Agent. The CSV format is easier for the model to ingest, since it is just text separated by commas. And we know LLMs shine when working with text.

Here is the function.

def convert_to_csv(file_path: str):
    """
    Use this tool to convert the Excel file to CSV.

    * file_path: Path to the Excel file to be converted
    """
    # Load the file and keep only the first 10 rows
    df = pd.read_excel(file_path).head(10)

    # Convert to CSV
    st.write("Converting to CSV... :leftwards_arrow_with_hook:")
    return df.to_csv('temp.csv', index=False)
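The row-capping idea can also be sketched with the standard library alone, assuming the input is already a plain CSV file (the function above uses pandas because it has to read Excel):

```python
import csv
from itertools import islice

def head_csv(src_path: str, dst_path: str, n_rows: int = 10) -> int:
    """Copy the header plus at most n_rows data rows from src_path to dst_path."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        kept = 0
        for row in islice(csv.reader(src), n_rows + 1):  # header + n_rows
            writer.writerow(row)
            kept += 1
    return kept
```

Either way, the point is the same: cap the rows before they become input tokens.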

Let’s move on.

Creating the Agent

The next function creates the AI agent. I'm using Agno [1], as it is very versatile and easy to use. I also selected the model Gemini 2.0 Flash: during the test phase, it was the best-performing model at generating the data docs. To use it, you need an API Key from Google. Don't forget to get one here [7].

The function:

  • Receives the CSV output from the previous function.
  • Passes it through the AI Agent, which generates the data dictionary with column name, description, and data type.
  • Notice that the description argument is the prompt for the agent. Make it detailed and precise.
  • The data dictionary will be saved as a JSON file using a tool called FileTools, which can read and write files.
  • I have set retries=2 so we can work around any error on a first try.
def create_agent(api_key):
    agent = Agent(
        model=Gemini(id="gemini-2.0-flash", api_key=api_key),
        description=dedent("""
                            You are an agent that reads the temp.csv dataset presented to you and,
                            based on the name and data type of each column header, determines the following information:
                            - The data type of each column
                            - The description of each column
                            - The first column number is 0

                            Using the FileTools provided, create a data dictionary in JSON format that includes the information below:
                            {<column number>: {ColName: <column name>, DataType: <data type>, Description: <description>}}

                            If you are unable to determine the data type or description of a column, return 'N/A' for the missing values.
                            """),
        tools=[FileTools(read_files=True, save_files=True)],
        retries=2,
        show_tool_calls=True
        )

    return agent
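For reference, the data dictionary the agent writes out would look something like this. The column names and descriptions below are a hypothetical example, not guaranteed output:

```python
import json

# Hypothetical agent output for a sheet with two columns
example_json = """
{
  "0": {"ColName": "sale_date", "DataType": "datetime", "Description": "Date when the sale occurred"},
  "1": {"ColName": "amount", "DataType": "float", "Description": "Sale amount in USD"}
}
"""

data_dict = json.loads(example_json)
print(data_dict["0"]["ColName"])  # sale_date
```

Note that the column-number keys are strings after JSON parsing, which is why the next function indexes the dictionary with `data_dict[str(n)]`.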

Okay. Now we need another function to save the data dictionary to the file.

Adding Data Dictionary to the File’s Header

This is the last function to be created. It will:

  • Get the data dictionary JSON from the previous step and the original Excel file.
  • Add the data dictionary to the file's header as comments.
  • Save the output file.
  • Once the file is saved, display a download button for the user to get the modified file.
def add_comments_to_header(file_path: str, data_dict: dict = "data_dict.json"):
    """
    Use this tool to add the data dictionary {data_dict.json} as comments to the header of an Excel file and save the output file.

    The function takes the Excel file path as an argument and adds the {data_dict.json} as comments to each cell
    in the first row of the Excel file (start counting from column 0), using the following format:
        * Column Number: <column number>
        * Column Name: <column name>
        * Data Type: <data type>
        * Description: <description>

    Parameters
    ----------
    * file_path : str
        The path to the Excel file to be processed
    * data_dict : dict
        The data dictionary containing the column number, column name, data type, and description
    """
    
    # Load the data dictionary
    data_dict = json.load(open(data_dict))

    # Load the workbook
    wb = load_workbook(file_path)

    # Get the active worksheet
    ws = wb.active

    # Iterate over each column in the first row (header)
    for n, col in enumerate(ws.iter_cols(min_row=1, max_row=1)):
        for header_cell in col:
            header_cell.comment = Comment(dedent(f"""
                              ColName: {data_dict[str(n)]['ColName']}, 
                              DataType: {data_dict[str(n)]['DataType']},
                              Description: {data_dict[str(n)]['Description']}
                              """), 'AI Agent')

    # Save the workbook
    st.write("Saving File... :floppy_disk:")
    wb.save('output.xlsx')

    # Create a download button
    with open('output.xlsx', 'rb') as f:
        st.download_button(
            label="Download output.xlsx",
            data=f,
            file_name='output.xlsx',
            mime='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
        )
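To sanity-check that comments survive the save, here is a quick round trip with openpyxl (assuming openpyxl is installed, as it is in this project's venv):

```python
from openpyxl import Workbook, load_workbook
from openpyxl.comments import Comment

# Build a tiny workbook and comment its header cell
wb = Workbook()
ws = wb.active
ws["A1"] = "sale_date"
ws["A1"].comment = Comment("ColName: sale_date, DataType: datetime", "AI Agent")
wb.save("check.xlsx")

# Read it back and inspect the comment text
ws2 = load_workbook("check.xlsx").active
print(ws2["A1"].comment.text)
```

In Excel, the comment shows up when you hover over the header cell, with "AI Agent" as its author.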

Okay. The next step is to connect all of this together in a Streamlit front-end script.

Streamlit Front-End

In this step, I could have created a separate file for the front-end and imported the functions there. But I decided to use the same file, so let's start with the famous:

if __name__ == "__main__":

First, a few lines to configure the page and the messages displayed in the Web Application. The content is centered on the page, and there is some information about how the App works.

    # Configure the Streamlit page
    st.set_page_config(layout="centered", 
                       page_title="Data Docs", 
                       page_icon=":paperclip:",
                       initial_sidebar_state="expanded")
    
    # Title
    st.title("Data Docs :paperclip:")
    st.subheader("Generate a data dictionary for your Excel file.")
    st.caption("1. Enter your Gemini API key and the path of the Excel file on the sidebar.")
    st.caption("2. Run the agent.")
    st.caption("3. The agent will generate a data dictionary and add it as comments to the header of the Excel file.")
    st.caption("ColName: <name> | DataType: <type> | Description: <description>")
    
    st.divider()

Next, we'll set up the sidebar, where the user can input their API Key from Google and select a .xlsx file to be modified.

There is a button to run the application, another to reset the app state, and a progress bar. Nothing too fancy.

    with st.sidebar:
        # Enter your API key
        st.caption("Enter your API key and the path of the Excel file.")
        api_key = st.text_input("API key: ", placeholder="Google Gemini API key", type="password")
        
        # Upload file
        input_file = st.file_uploader("File upload", 
                                       type='xlsx')
        

        # Run the agent
        agent_run = st.button("Run")

        # progress bar
        progress_bar = st.empty()
        progress_bar.progress(0, text="Initializing...")

        st.divider()

        # Reset session state
        if st.button("Reset Session"):
            st.session_state.clear()
            st.rerun()

Once the Run button is clicked, it triggers the rest of the code that runs the Agent. Here is the sequence of steps performed:

  1. The first function is called to transform the file to CSV.
  2. The progress is registered on the progress bar.
  3. The Agent is created.
  4. The progress bar is updated.
  5. A prompt is fed into the agent to read the temp.csv file, create the data dictionary, and save the output to data_dict.json.
  6. The data dictionary is printed on the screen, so the user can see what was generated while it is being saved to the Excel file.
  7. The Excel file is modified and saved.
    # Run the pipeline once the button is clicked
    if agent_run:
        # Convert Excel file to CSV
        convert_to_csv(input_file)

        # Register progress
        progress_bar.progress(15, text="Processing CSV...")

        # Create the agent
        agent = create_agent(api_key)

        # Start the script
        st.write("Running Agent... :runner:")

        # Register progress
        progress_bar.progress(50, text="AI Agent is running...")

        # Run the agent    
        agent.print_response(dedent(f"""
                                1. Use FileTools to read temp.csv as input to create the data dictionary for the columns in the dataset. 
                                2. Using the FileTools tool, save the data dictionary to a file named 'data_dict.json'.
                                """),
                        markdown=True)

        # Print the information dictionary
        st.write("Generating Data Dictionary... :page_facing_up:")
        with open('data_dict.json', 'r') as f:
            data_dict = json.load(f)
            st.json(data_dict, expanded=False)

        # Add comments to header
        add_comments_to_header(input_file, 'data_dict.json')

        # Remove temporary files
        st.write("Removing temporary files... :wastebasket:")
        os.remove('temp.csv')
        os.remove('data_dict.json')    
    
    # If file exists, show success message
    if os.path.exists('output.xlsx'):
        st.success("Done! :white_check_mark:")
        os.remove('output.xlsx')

    # Progress bar end
    progress_bar.progress(100, text="Done!")

That's it. Here is a demonstration of the agent in action.

Data Docs added to your Excel file. Image by the author.

Beautiful result!

Try It

You can try the deployed app here: https://excel-datadocs.streamlit.app/

Before You Go

In my humble opinion, Excel files aren't going away anytime soon. Loving them or hating them, we'll have to stick with them for a while.

Excel files are versatile and easy to handle and share, and thus still very useful for routine tasks at work.

However, now we can leverage AI to help us handle those files and make them better. Artificial Intelligence is touching so many aspects of our lives; our routines and tools at work are just another one of them.

Let's take advantage of AI and work smarter every day!

If you liked this content, find more of my work on my website and GitHub, shared below.

GitHub Repository

Here is the GitHub Repository for this project.

https://github.com/gurezende/Data-Dictionary-GenAI

Find Me

You can find more about my work on my website.

https://gustavorsantos.me

References

[1. Agno Docs] https://docs.agno.com/introduction/agents

[2. Openpyxl Docs] https://openpyxl.readthedocs.io/en/stable/index.html

[3. Streamlit Docs] https://docs.streamlit.io/

[4. Data-Docs Web App] https://excel-datadocs.streamlit.app/

[5. Installing UV] https://docs.astral.sh/uv/getting-started/installation/

[6. Windsurf Coding Copilot] https://windsurf.com/vscode_tutorial

[7. Google Gemini API Key] https://ai.google.dev/gemini-api/docs/api-key
