This article is part of a series on automating data cleaning for any tabular dataset.
You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.
Start with the why

Let’s consider this Excel spreadsheet, which contains information on awards given to movies. It is sourced from the book and is accessible here.
It is a typical and common spreadsheet that anyone may own and deal with in their daily tasks. But what is wrong with it?
To answer that question, let us first recall the end goal of using data: to derive insights that help guide our decisions in our personal or business lives. This process requires at least two crucial things:
- Reliable data: clean data without issues, inconsistencies, duplicates, missing values, etc.
- Tidy data: a well-normalised data frame that facilitates processing and manipulation.
The second point is the primary foundation of any analysis, including dealing with data quality.
Returning to our example, imagine we want to perform the following actions:
1. For each film involved in multiple awards, list the award and year it is associated with.
2. For each actor/actress winning multiple awards, list the film and award they are associated with.
3. Check that all actor/actress names are correct and well-standardised.
Naturally, this example dataset is small enough to derive those insights by eye or by hand if we structure it (or quickly by coding). But imagine now that the dataset contains the full awards history; this would be time-consuming, painful, and error-prone without any automation.
Reading this spreadsheet and directly understanding its structure is difficult for a machine, because it does not follow good practices of data arrangement. That is why tidying data is so important. By ensuring that data is structured in a machine-friendly way, we can simplify parsing, automate quality checks, and enhance business analysis, all without altering the actual content of the dataset.
Example of a reshaping of this data:

Now, anyone can use low/no-code tools or code-based queries (SQL, Python, etc.) to interact easily with this dataset and derive insights.
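For instance, once the awards data is tidy, the first question above becomes a short pandas query. The data frame and column names below are illustrative assumptions, not the actual schema produced by the service:

```python
import pandas as pd

# Illustrative tidy data frame: one row per (film, award, year) observation.
awards = pd.DataFrame({
    "film":  ["Film A", "Film A", "Film B", "Film C"],
    "award": ["Best Picture", "Best Director", "Best Picture", "Best Actress"],
    "year":  [2019, 2019, 2020, 2021],
})

# Question 1: for each film involved in multiple awards, list the awards and years.
multi_award_films = (
    awards.groupby("film")
          .filter(lambda g: len(g) > 1)   # keep films that appear more than once
          .sort_values(["film", "year"])
)
print(multi_award_films)
```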
The main challenge is how to turn a shiny, human-eye-pleasing spreadsheet into a machine-readable tidy version.
What is tidy data? A well-shaped data frame?
The term was described in a well-known article named Tidy Data by Hadley Wickham, published in the Journal of Statistical Software in 2014. Below are the key quotes needed to better understand the underlying concepts.
Data tidying
“Structuring datasets to facilitate manipulation, visualisation and modelling.”
“Tidy datasets provide a standardised way of linking the structure of a dataset (its physical layout) with its semantics (its meaning).”
Data structure
“Most statistical datasets are rectangular tables composed of rows and columns. The columns are almost always labelled, and the rows are sometimes labelled.”
Data semantics
“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to both a variable and an observation. A variable contains all values that measure the same underlying attribute (such as height, temperature or duration) across units. An observation contains all values measured on the same unit (for example, a person, a day or a race) across attributes.”
“In a given analysis, there may be multiple levels of observation. For example, in a trial of a new allergy medication, we might have three types of observations:
- collected from each person (age, sex, race),
- collected from each person on each day (number of sneezes, redness of eyes), and
- collected on each day (temperature, pollen count).”
Tidy data
“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is considered messy or tidy depending on how its rows, columns and tables correspond to observations, variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.”
Common problems with messy datasets
Column headers might be values rather than variable names.
- Messy example: A table where column headers are years (2019, 2020, 2021) instead of a “Year” column.
- Tidy version: A table with a “Year” column and each row representing an observation for a given year (see the sketch below).
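A minimal pandas sketch of this reshaping, using hypothetical column names, could rely on `melt`:

```python
import pandas as pd

# Messy layout: years appear as column headers.
messy = pd.DataFrame({
    "film": ["Film A", "Film B"],
    "2019": [1, 0],
    "2020": [2, 1],
    "2021": [0, 3],
})

# Tidy layout: a single "year" column, one observation per row.
tidy = messy.melt(id_vars="film", var_name="year", value_name="awards_won")
print(tidy)
```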
Multiple variables might be stored in one column.
- Messy example: A column named “Age_Gender” containing values like 28_Female.
- Tidy version: Separate columns for “Age” and “Gender” (see the sketch below).
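A possible fix in pandas, assuming the separator is always an underscore:

```python
import pandas as pd

messy = pd.DataFrame({"Age_Gender": ["28_Female", "35_Male"]})

# Split the combined column into two proper variables.
tidy = messy["Age_Gender"].str.split("_", expand=True)
tidy.columns = ["Age", "Gender"]
tidy["Age"] = tidy["Age"].astype(int)
print(tidy)
```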
Variables might be stored in both rows and columns.
- Messy example: A dataset tracking student test scores where subjects (Math, Science, English) are stored both as column headers and repeated in rows, instead of using a single “Subject” column.
- Tidy version: A table with columns for “Student ID,” “Subject,” and “Score,” where each row represents one student’s score for one subject.
Multiple types of observational units might be stored in the same table.
- Messy example: A sales dataset that contains both customer information and store inventory in the same table.
- Tidy version: Separate tables for “Customers” and “Inventory.”
A single observational unit might be stored in multiple tables.
- Messy example: A patient’s medical records are split across multiple tables (Diagnosis Table, Medication Table) without a common patient ID linking them.
- Tidy version: A single table, or properly linked tables using a unique “Patient ID” (see the sketch below).
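With a shared identifier in place, the tables can be linked explicitly, for example with a pandas merge (the table and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical tables describing the same observational unit (patients).
diagnoses   = pd.DataFrame({"patient_id": [1, 2], "diagnosis": ["Allergy", "Flu"]})
medications = pd.DataFrame({"patient_id": [1, 2], "medication": ["Antihistamine", "Rest"]})

# Link them through the unique patient identifier.
patients = diagnoses.merge(medications, on="patient_id", how="outer")
print(patients)
```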
Now that we have a better understanding of what tidy data is, let’s see how to transform a messy dataset into a tidy one.
Thinking about the how
Although these guidelines sound clear in theory, they remain difficult to generalise easily in practice for any kind of dataset. In other words, starting from messy data, no simple or deterministic process or algorithm exists to reshape it. This is mainly explained by the singularity of each dataset. Indeed, it is surprisingly hard to precisely define variables and observations in general, and then transform the data automatically without losing content. That is why, despite massive improvements in data processing over the last decade, data cleaning and formatting are still done “manually” most of the time.
Thus, when complex and hardly maintainable rules-based systems are not suitable (i.e. to precisely deal with all contexts by describing decisions in advance), machine learning models may offer some advantages. This grants the system more freedom to adapt to any data by generalising what it has learned during training. Many large language models (LLMs) have been exposed to numerous data processing examples, making them capable of analysing input data and performing tasks such as spreadsheet structure analysis, table schema estimation, and code generation.
Let’s then describe a workflow made of code and LLM-based modules, alongside business logic, to reshape any spreadsheet.

Spreadsheet encoder
This module is designed to serialise into text the main information needed from the spreadsheet data. Only the necessary subset of cells contributing to the table layout is retained, removing non-essential or overly repetitive formatting information. By keeping only the necessary information, this step minimises token usage, reduces costs, and enhances model performance. The current version is a deterministic algorithm inspired by the paper SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, which relies on heuristics. More details about it will be the subject of a future article.
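To give an idea of what such a serialisation can look like, here is a deliberately simplified sketch using openpyxl; it only keeps non-empty cells and ignores the SpreadsheetLLM-style compression heuristics used by the actual module:

```python
from openpyxl import load_workbook

def encode_sheet(path: str) -> str:
    """Serialise the non-empty cells of the active worksheet into compact text.

    Simplified illustration only; the real encoder applies further heuristics
    (e.g. compressing repeated regions) to reduce token usage.
    """
    ws = load_workbook(path, data_only=True).active
    lines = []
    for row in ws.iter_rows():
        for cell in row:
            if cell.value is not None:  # keep only cells that carry content
                lines.append(f"{cell.coordinate}: {cell.value}")
    return "\n".join(lines)

# encoded = encode_sheet("movie_awards.xlsx")  # hypothetical file name
```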
Table structure analysis
Before moving forward, asking an LLM to extract the spreadsheet structure is a crucial step in building the next actions. Here are examples of the questions addressed (a prompt sketch follows the list):
- How many tables are present, and what are their locations (regions) in the spreadsheet?
- What defines the boundaries of each table (e.g., empty rows/columns, specific markers)?
- Which rows/columns serve as headers, and do any tables have multi-level headers?
- Are there metadata sections, aggregated statistics, or notes that need to be filtered out or processed separately?
- Are there any merged cells, and if so, how should they be handled?
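One possible way to ask these questions is a single structured prompt over the encoded sheet. The prompt wording, the JSON keys, and the `call_llm` helper below are all assumptions used for illustration, not the workflow’s actual implementation:

```python
import json

STRUCTURE_PROMPT = """You are given a spreadsheet serialised as 'cell: value' lines.
Answer in JSON with the following keys:
- tables: list of regions (e.g. "A1:F40") and their header rows
- multi_level_headers: true/false per table
- merged_cells: list of ranges and how their values should be propagated
- metadata_or_notes: regions to exclude from the tabular data

Spreadsheet:
{encoded_sheet}
"""

def analyse_structure(encoded_sheet: str, call_llm) -> dict:
    """`call_llm` is a hypothetical function wrapping the LLM provider of your choice."""
    response = call_llm(STRUCTURE_PROMPT.format(encoded_sheet=encoded_sheet))
    return json.loads(response)  # assumes the model answers with valid JSON
```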
Table schema estimation
Once the analysis of the spreadsheet structure has been completed, it is time to start thinking about the ideal target table schema. This involves letting the LLM proceed iteratively by:
- Identifying all potential columns (multi-row headers, metadata, etc.)
- Comparing columns for domain similarities based on column names and data semantics
- Grouping related columns
The module outputs a final schema with a name and a short description for each retained column.
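As an illustration, the retained schema could be represented as a small list of column specifications; the class and the example columns below are assumptions, not the exact output format of the module:

```python
from dataclasses import dataclass

@dataclass
class ColumnSpec:
    name: str         # normalised column name, e.g. "year"
    description: str  # short human-readable description
    dtype: str        # expected type, e.g. "integer", "string", "date"

# Hypothetical schema the LLM might propose for the awards spreadsheet.
target_schema = [
    ColumnSpec("film", "Title of the awarded film", "string"),
    ColumnSpec("award", "Name of the award category", "string"),
    ColumnSpec("year", "Year the award was given", "integer"),
    ColumnSpec("actor_or_actress", "Person associated with the award, if any", "string"),
]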
Code generation to format the spreadsheet
Considering the previous structure analysis and the table schema, this last LLM-based module should draft code that transforms the spreadsheet into a proper data frame compliant with the table schema. Furthermore, no useful content must be omitted (e.g. aggregated or computed values may still be derived from other variables).
As generating code that works well from scratch on the first iteration is difficult, two internal iterative processes are added to revise the code if needed (a sketch of this loop follows the list):
- Code checking: Whenever the code cannot be compiled or executed, the error trace is provided to the model so it can update its code.
- Data frame validation: The metadata of the created data frame, such as column names, first and last rows, and statistics about each column, is checked to validate whether the table conforms to expectations. Otherwise, the code is revised accordingly.
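The two revision loops can be sketched as a single retry function. Everything below (the prompt wording, the `reshape()` entry point, and the `call_llm` and `validate_frame` callables) is hypothetical and only illustrates the control flow:

```python
def generate_reshaping_code(schema, structure, call_llm, validate_frame, max_retries=3):
    """Draft reshaping code with an LLM and revise it when it fails.

    `call_llm` and `validate_frame` are hypothetical callables standing in
    for the workflow modules described above.
    """
    feedback = ""
    for _ in range(max_retries):
        prompt = (
            f"Target schema: {schema}\n"
            f"Spreadsheet structure: {structure}\n"
            f"Previous feedback: {feedback}\n"
            "Write a Python function `reshape()` that returns a pandas DataFrame."
        )
        code = call_llm(prompt)
        try:
            namespace = {}
            exec(code, namespace)          # code checking: does the draft run at all?
            df = namespace["reshape"]()    # entry point requested in the prompt
        except Exception as err:
            feedback = f"Execution failed: {err}"   # feed the error trace back
            continue
        ok, issues = validate_frame(df, schema)     # data frame validation
        if ok:
            return code, df
        feedback = f"Validation issues: {issues}"
    raise RuntimeError("No valid data frame produced within the retry budget")
```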
Convert the data frame into an Excel file
Finally, if all the data fits properly into a single table, a worksheet is created from this data frame to respect the tabular format. The final asset returned is an Excel file whose active sheet contains the tidy spreadsheet data.
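Writing the result back to an Excel file is then a one-liner with pandas (the data frame, file name, and sheet name below are made up for the example):

```python
import pandas as pd

# `tidy_df` stands in for the data frame produced by the generated code.
tidy_df = pd.DataFrame({
    "film":  ["Film A", "Film A"],
    "award": ["Best Picture", "Best Director"],
    "year":  [2019, 2019],
})

# Write the tidy table to the active sheet of a new Excel file.
tidy_df.to_excel("tidy_movie_awards.xlsx", sheet_name="awards", index=False)
```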
Et voilà! The sky is the limit for making the most of your newly tidy dataset.
Feel free to test it with your own dataset using the CleanMyExcel.io service, which is free and requires no registration.
Final note on the workflow
Why is a workflow proposed instead of an agent for this purpose?
At the time of writing, we consider that a workflow based on LLMs for precise sub-tasks is more robust, stable, iterable, and maintainable than a more autonomous agent. An agent may offer advantages: more freedom and liberty in the actions taken to perform tasks. Nevertheless, agents may still be hard to deal with in practice; for example, they may diverge quickly if the objective is not clear enough. I believe this is our case, but that does not mean this model would not be applicable in the future, in the same way that SWE-agent coding is performing, for example.
Next articles in the series
In upcoming articles, we plan to explore related topics, including:
- A detailed description of the spreadsheet encoder mentioned earlier.
- Data validity: ensuring each column meets the expectations.
- Data uniqueness: preventing duplicate entities within the dataset.
- Data completeness: handling missing values effectively.
- Evaluating data reshaping, validity, and other key aspects of data quality.
Stay tuned!
Thanks to Marc Hobballah for reviewing this article and providing feedback.