An LLM-Based Workflow for Automated Tabular Data Validation 


This article is part of a series of articles on automating data cleansing for any tabular dataset:

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

What’s Data Validity?

Data validity refers to data conformity to expected formats, types, and value ranges. This standardisation within a single column ensures the uniformity of information according to implicit or explicit requirements.

Common issues related to data validity include:

  • Inappropriate variable types: Column data types that are not suited to analytical needs, e.g., temperature values in text format.
  • Columns with mixed data types: A single column containing both numerical and textual data.
  • Non-conformity to expected formats: For example, invalid email addresses or URLs.
  • Out-of-range values: Column values that fall outside what is allowed or considered normal, e.g., negative age values or ages greater than 30 for high school students.
  • Time zone and DateTime format issues: Inconsistent or heterogeneous date formats within the dataset.
  • Lack of measurement standardisation or uniform scale: Variability in the units of measurement used for the same variable, e.g., mixing Celsius and Fahrenheit values for temperature.
  • Special characters or whitespace in numeric fields: Numeric data contaminated by non-numeric elements.

And the list goes on.

Error types such as duplicated records or entities and missing values don’t fall into this category.

But what is the standard approach to identifying such data validity issues?

When data meets expectations

Data cleansing, while it can be very complex, can generally be broken down into two key phases:

1. Detecting data errors  

2. Correcting these errors.

At its core, data cleansing revolves around identifying and resolving discrepancies in datasets: specifically, values that violate predefined constraints and deviate from expectations about the data.

It’s important to acknowledge a fundamental fact: it’s almost impossible, in real-world scenarios, to be exhaustive in identifying all potential data errors. The sources of data issues are virtually infinite, ranging from human input mistakes to system failures, and thus impossible to predict entirely. Instead, what we do is define what we consider reasonably normal patterns in our data, known as data expectations: reasonable assumptions about what “correct” data should look like. For instance:

  • If working with a dataset of high school students, we’d expect ages to fall between 14 and 18 years old.
  • A customer database might require email addresses to follow a standard format (e.g., [email protected]).

By establishing these expectations, we create a structured framework for detecting anomalies, making the data cleansing process both manageable and scalable.

These expectations are derived from both semantic and statistical analysis. We understand that the column name “age” refers to the well-known concept of time spent living. Other column names may be drawn from the lexical field of high school, and column statistics (e.g. minimum, maximum, mean, etc.) offer insights into the distribution and range of values. Taken together, this information helps determine our expectations for that column:

  • Age values should be integers
  • Values should fall between 14 and 18

Expectations tend to be only as accurate as the time spent analysing the dataset allows. Naturally, if a dataset is used frequently by a team every day, the likelihood of discovering subtle data issues, and therefore refining expectations, increases significantly. That said, even simple expectations are rarely checked systematically in most environments, often due to time constraints or simply because it’s not the most enjoyable or high-priority task on the to-do list.

Once we’ve defined our expectations, the next step is to check whether the data actually meets them. This means applying data constraints and looking for violations. For each expectation, one or more constraints can be defined. These Data Quality rules can be translated into programmatic functions that return a binary decision, a Boolean value indicating whether a given value violates the tested constraint.
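Here is a minimal sketch of this idea in Python, using the high-school age example above (the function names and bounds are illustrative, not taken from any particular library):

import pandas as pd

# Each constraint is an element-wise check returning a Boolean.
def is_integer_like(value) -> bool:
    # True if the value can be interpreted as a whole number
    try:
        return float(value) == int(float(value))
    except (TypeError, ValueError):
        return False

def is_in_expected_range(value, low=14, high=18) -> bool:
    # True if the value falls within the expected age interval
    try:
        return low <= float(value) <= high
    except (TypeError, ValueError):
        return False

ages = pd.Series([15, 17, "16", 30, "abc"])
mask = ages.map(lambda v: is_integer_like(v) and is_in_expected_range(v))
print(ages[~mask])  # rows 3 and 4 violate the expectations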

This strategy is commonly implemented in many data quality management tools, which offer ways to detect all data errors in a dataset based on the defined constraints. An iterative process then begins to address each issue until all expectations are satisfied, i.e. no violations remain.

This strategy may seem straightforward and easy to implement in theory. However, that’s often not what we see in practice: data quality remains a major challenge and a time-consuming task in many organisations.

An LLM-based workflow to generate data expectations, detect violations, and resolve them

This validation workflow is split into two major components: the validation of column data types and compliance with expectations.

One might handle both concurrently, but in our experiments, properly converting each column’s values in a data frame beforehand is an important preliminary step. It facilitates data cleansing by breaking down the whole process into a series of sequential actions, which improves performance, comprehension, and maintainability. This strategy is, of course, somewhat subjective, but it tends to avoid dealing with all data quality issues at once wherever possible.

To illustrate and understand each step of the whole process, we’ll consider this generated example:

Examples of data validity issues are spread across the table. Each row intentionally embeds one or more issues:

  • Row 1: Uses a non‑standard date format and an invalid URL scheme (non‑conformity to expected formats).
  • Row 2: Contains a price value as text (“twenty”) instead of a numeric value (inappropriate variable type).
  • Row 3: Has a rating given as “4 stars” mixed with numeric ratings elsewhere (mixed data types).
  • Row 4: Provides a rating value of “10”, which is out‑of‑range if ratings are expected to be between 1 and 5 (out‑of‑range value). Moreover, there is a typo in the word “Food”.
  • Row 5: Uses a price with a currency symbol (“20€”) and a rating with extra whitespace (“5 ”), showing a lack of measurement standardisation and special characters/whitespace issues.

Validate Column Data Types

Estimate column data types

The task here is to determine the most appropriate data type for each column in a data frame, based on the column’s semantic meaning and statistical properties. The classification is restricted to the following options: string, int, float, datetime, and boolean. These categories are generic enough to cover most data types commonly encountered.

There are multiple ways to perform this classification, including deterministic approaches. The method chosen here leverages a large language model (LLM), prompted with information about each column and the overall data frame context to guide its decision:

  • The list of column names
  • Representative rows from the dataset, randomly sampled
  • Column statistics describing each column (e.g. number of unique values, proportion of top values, etc.)
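A minimal sketch of how such context might be assembled with Pandas before being passed to the model (the file name and prompt wording are hypothetical; the service’s actual prompt is not shown here):

import pandas as pd

df = pd.read_excel("products.xlsx")  # hypothetical input file

# Context given to the LLM: column names, a random sample of rows,
# and per-column descriptive statistics.
context = {
    "columns": list(df.columns),
    "sample_rows": df.sample(n=min(5, len(df)), random_state=0).to_dict(orient="records"),
    "column_stats": df.describe(include="all").to_dict(),
}

prompt = (
    "Suggest the most appropriate data type (string, int, float, datetime, boolean) "
    f"for each column, given this context:\n{context}"
)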

Example of the LLM’s output:

1. Column Name: date 
  Description: Represents the date and time information related to each record. 
  Suggested Data Type: datetime

2. Column Name: category 
  Description: Contains the categorical label defining the type or classification of the item. 
  Suggested Data Type: string

3. Column Name: price 
  Description: Holds the numeric price value of an item expressed in monetary terms. 
  Suggested Data Type: float

4. Column Name: image_url 
  Description: Stores the web address (URL) pointing to the image of the item. 
  Suggested Data Type: string

5. Column Name: rating 
  Description: Represents the evaluation or rating of an item using a numeric score. 
  Suggested Data Type: int

Convert Column Values into the Estimated Data Type

Once the data type of each column has been predicted, the conversion of values can begin. Depending on the table framework used, this step might differ slightly, but the underlying logic remains the same. For instance, in the CleanMyExcel.io service, Pandas is used as the core data frame engine. However, other libraries such as Polars or PySpark are equally capable within the Python ecosystem.
All non-convertible values are set aside for further investigation.
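A minimal Pandas sketch of this conversion step, assuming the estimated types from the example above (the helper function and the toy data frame are illustrative, not the service’s actual code):

import pandas as pd

def convert_column(series: pd.Series, target_type: str):
    # Convert a column to the estimated type and set aside values that cannot be converted.
    if target_type == "datetime":
        converted = pd.to_datetime(series, errors="coerce", utc=True)
    elif target_type in ("int", "float"):
        converted = pd.to_numeric(series, errors="coerce")
    else:
        converted = series.astype("string")
    # Values that became missing during conversion (but were not missing before)
    non_convertible = series[converted.isna() & series.notna()]
    return converted, non_convertible

df = pd.DataFrame({"price": [9.99, "twenty", 15.0, 12.5, "20€"]})
converted, bad_values = convert_column(df["price"], "float")
print(bad_values)  # "twenty" and "20€" are set aside for further investigation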

Analyse Non-convertible Values and Propose Substitutes

This step can be viewed as an imputation task. The previously flagged non-convertible values violate the column’s expected data type. Because the potential causes are so diverse, this step can be quite difficult. Once again, an LLM offers a helpful trade-off to interpret the conversion errors and suggest possible replacements.
Sometimes, the correction is simple: for instance, converting an age value of twenty into the integer 20. In many other cases, a substitute is not so obvious, and tagging the value with a sentinel (placeholder) value is a better option. In Pandas, for instance, the special object pd.NA is suitable for such cases.

Example of the LLM’s output:

{
  "violations": [
    {
      "index": 2,
      "column_name": "rating",
      "value": "4 stars",
      "violation": "Contains non-numeric text in a numeric rating field.",
      "substitute": "4"
    },
    {
      "index": 1,
      "column_name": "price",
      "value": "twenty",
      "violation": "Textual representation that cannot be directly converted to a number.",
      "substitute": "20"
    },
    {
      "index": 4,
      "column_name": "price",
      "value": "20€",
      "violation": "Price value contains an extraneous currency symbol.",
      "substitute": "20"
    }
  ]
}

Replace Non-convertible Values

At this point, a programmatic function is applied to replace the problematic values with the proposed substitutes. The column is then tested again to ensure all values can now be converted into the estimated data type. If successful, the workflow proceeds to the expectations module. Otherwise, the previous steps are repeated until the column is validated.
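A minimal sketch of this replacement step, assuming the substitutes are provided in the JSON structure shown above (the helper function is illustrative):

import pandas as pd

def apply_substitutes(df: pd.DataFrame, violations: list) -> pd.DataFrame:
    # Replace each flagged value with the proposed substitute, or pd.NA if none is given.
    for v in violations:
        substitute = v.get("substitute")
        df.at[v["index"], v["column_name"]] = substitute if substitute is not None else pd.NA
    return df

df = pd.DataFrame({"price": [9.99, "twenty", 15.0, 12.5, "20€"]})
violations = [
    {"index": 1, "column_name": "price", "value": "twenty", "substitute": "20"},
    {"index": 4, "column_name": "price", "value": "20€", "substitute": "20"},
]
df = apply_substitutes(df, violations)
df["price"] = pd.to_numeric(df["price"])  # the column now converts cleanly to float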

Validate Column Data Expectations

Generate Expectations for All Columns

The following elements are provided:

  • Data dictionary: column name, a brief description, and the expected data type
  • Representative rows from the dataset, randomly sampled
  • Column statistics, such as the number of unique values and the proportion of top values

Based on each column’s semantic meaning and statistical properties, the goal is to define validation rules and expectations that ensure data quality and integrity. These expectations should fall into one of the following categories related to standardisation (a sketch of how they might be represented follows the list):

  • Valid ranges or intervals
  • Expected formats (e.g. for emails or phone numbers)
  • Allowed values (e.g. for categorical fields)
  • Column data standardisation (e.g. ‘Mr’, ‘Mister’, ‘Mrs’, ‘Mrs.’ becomes [‘Mr’, ‘Mrs’])
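As an illustration, such expectations could be captured in a simple machine-readable structure like the one below (a hypothetical format for the running example; the service’s internal representation may differ):

# Hypothetical machine-readable representation of the generated expectations
expectations = {
    "rating": {"type": "int", "valid_range": {"min": 1, "max": 5}},
    "category": {
        "type": "string",
        "allowed_values": ["Books", "Electronics", "Food", "Clothing", "Furniture"],
    },
    "image_url": {"type": "string", "expected_format": r"^https://.+"},
    "price": {"type": "float", "valid_range": {"min": 0, "max": None}},
}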

Example of generated expectations:

Column name: date

• Expectation: Value must be a valid datetime.
 - Reasoning: The column represents date and time information, so each entry should follow a standard datetime format (for example, ISO 8601). 
• Expectation: Datetime values should include timezone information (preferably UTC).
 - Reasoning: The provided sample timestamps include explicit UTC timezone information. This ensures consistency in time-based analyses.

──────────────────────────────
Column name: category

• Expectation: Allowed values should be standardized to a predefined set.
 - Reasoning: Based on the semantic meaning, valid categories might include “Books”, “Electronics”, “Food”, “Clothing”, and “Furniture”. (Note: The sample includes “Fod”, which likely needs correcting to “Food”.) 
• Expectation: Entries should follow a standardized textual format (e.g., Title Case).
 - Reasoning: Consistent capitalization and spelling will improve downstream analyses and reduce data cleansing issues.

──────────────────────────────
Column name: price

• Expectation: Value must be a numeric float.
 - Reasoning: Since the column holds monetary amounts, entries should be stored as numeric values (floats) for accurate calculations.
• Expectation: Price values should fall within a valid non-negative numeric interval (e.g., price ≥ 0).
 - Reasoning: Negative prices generally don’t make sense in a pricing context. Even though the minimum observed value in the sample is 9.99, allowing zero or positive values is more realistic for pricing data.

──────────────────────────────
Column name: image_url

• Expectation: Value must be a valid URL with the expected format.
 - Reasoning: Since the column stores image web addresses, each URL should adhere to standard URL formatting patterns (e.g., including a proper protocol scheme).
• Expectation: The URL should start with “https://”.
 - Reasoning: The sample shows that one URL uses “htp://”, which is likely a typo. Enforcing a secure (https) URL standard improves data reliability and user security.

──────────────────────────────
Column name: rating

• Expectation: Value must be an integer.
 - Reasoning: The rating is numeric, and as seen in the sample it is stored as an integer.
• Expectation: Rating values should fall within a valid interval, such as between 1 and 5.
 - Reasoning: In many contexts, ratings are typically on a scale of 1 to 5. Although the sample includes a value of 10, it is likely a data quality issue. Enforcing this range standardizes the evaluation scale.

Generate Validation Code

Once expectations have been defined, the goal is to create structured code that checks the data against these constraints. The code format may vary depending on the chosen validation library, such as Pandera (used in CleanMyExcel.io), Pydantic, Great Expectations, Soda, etc.

To make debugging easier, the validation code should apply checks elementwise so that when a failure occurs, the row index and column name are clearly identified. This helps to pinpoint and resolve issues effectively.
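For example, a minimal Pandera schema for the running example might look like this (a sketch under the assumptions above, not the exact code generated by the service):

import pandas as pd
import pandera as pa

# Toy data frame reproducing some of the example's violations
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-06"]),
    "category": ["Books", "Fod"],
    "price": [9.99, 20.0],
    "image_url": ["https://imageexample.com/a.jpg", "htp://imageexample.com/pic.jpg"],
    "rating": [4, 10],
})

schema = pa.DataFrameSchema({
    "date": pa.Column("datetime64[ns]"),
    "category": pa.Column(str, pa.Check.isin(["Books", "Electronics", "Food", "Clothing", "Furniture"])),
    "price": pa.Column(float, pa.Check.ge(0)),
    "image_url": pa.Column(str, pa.Check.str_matches(r"^https://")),
    "rating": pa.Column(int, pa.Check.in_range(1, 5)),
})

try:
    schema.validate(df, lazy=True)  # lazy=True collects every violation instead of stopping at the first
except pa.errors.SchemaErrors as err:
    # Each failure case reports the column, the failed check, the offending value, and its row index
    print(err.failure_cases)

With lazy validation, the resulting failure cases list each violating value together with its column name and row index, which is exactly the information needed for the resolution step below.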

Analyse Violations and Propose Substitutes

When a violation is detected, it must be resolved. Each issue is flagged with a short explanation and a precise location (row index + column name). An LLM is used to estimate the best possible substitute value based on the violation’s description. Again, this proves useful due to the range and unpredictability of data issues. If the appropriate substitute is unclear, a sentinel value is applied, depending on the data frame package in use.

Example of the LLM’s output:

{
  "violations": [
    {
      "index": 3,
      "column_name": "category",
      "value": "Fod",
      "violation": "category should be one of ['Books', 'Electronics', 'Food', 'Clothing', 'Furniture']",
      "substitute": "Food"
    },
    {
      "index": 0,
      "column_name": "image_url",
      "value": "htp://imageexample.com/pic.jpg",
      "violation": "image_url should start with 'https://'",
      "substitute": "https://imageexample.com/pic.jpg"
    },
    {
      "index": 3,
      "column_name": "rating",
      "value": "10",
      "violation": "rating should be between 1 and 5",
      "substitute": "5"
    }
  ]
}

The remaining steps are similar to the iteration process used during the validation of column data types. Once all violations are resolved and no further issues are detected, the data frame is fully validated.
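Putting the pieces together, the overall loop might be sketched as follows (propose_substitutes stands in for the LLM call and is hypothetical):

import pandera as pa

def validate_until_clean(df, schema, propose_substitutes, max_iterations=5):
    # Iteratively validate, ask the LLM for substitutes, apply them, and re-validate.
    # `propose_substitutes` is a hypothetical wrapper around the LLM call: it receives
    # Pandera failure cases and returns dictionaries with "index", "column_name", "substitute".
    for _ in range(max_iterations):
        try:
            return schema.validate(df, lazy=True)  # no violations left: fully validated
        except pa.errors.SchemaErrors as err:
            for v in propose_substitutes(err.failure_cases):
                df.at[v["index"], v["column_name"]] = v["substitute"]
    raise RuntimeError("Could not resolve all violations within the iteration budget")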

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Conclusion

Expectations may sometimes lack domain expertise; integrating human input can help surface more diverse, specific, and reliable expectations.

A key challenge lies in automation during the resolution process. A human-in-the-loop approach could introduce more transparency, particularly in the selection of substitute or imputed values.

This article is part of a series of articles on automating data cleansing for any tabular dataset:

In upcoming articles, we’ll explore related topics already on the roadmap, including:

  • An in-depth description of the spreadsheet encoder used in the article above.
  • Data uniqueness: preventing duplicate entities within the dataset.
  • Data completeness: handling missing values effectively.
  • Evaluating data reshaping, validity, and other key aspects of data quality.

Stay tuned!

Thanks to Marc Hobballah for reviewing this article and providing feedback.

All images, unless otherwise noted, are by the author.
