How Did Open Food Facts Fix OCR-Extracted Ingredients Using Open-Source LLMs?


Open Food Facts had tried to solve this issue for years using Regular Expressions and existing solutions such as Elasticsearch's corrector, without success. Until recently.

Thanks to the latest advances in artificial intelligence, we now have access to powerful Large Language Models, also called LLMs.

By training our own model, we created the Ingredients Spellcheck and managed not only to outperform proprietary LLMs such as GPT-4o or Claude 3.5 Sonnet on this task, but also to reduce the number of unrecognized ingredients in the database by 11%.

This article walks you through the different stages of the project and shows you how we improved the quality of the database using Machine Learning.

Enjoy the read!

When a product is added by a contributor, its pictures undergo a series of processes to extract all relevant information. One crucial step is the extraction of the list of ingredients.

When a word is identified as an ingredient, it's cross-referenced with a taxonomy that contains a predefined list of recognized ingredients. If the word matches an entry in the taxonomy, it's tagged as an ingredient and added to the product's information.

This tagging process ensures that ingredients are standardized and easily searchable, providing accurate data for consumers and analysis tools.
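To make the idea concrete, here is a toy sketch of the lookup. The real parser matches against the full multilingual Open Food Facts taxonomy, with normalization and synonyms; the set below is purely illustrative:

```python
# Toy sketch of the taxonomy lookup; entries are illustrative, not the real taxonomy.
TAXONOMY = {"pork ham", "salt", "flour", "sugar"}

def is_recognized(ingredient: str) -> bool:
    """Return True if the extracted ingredient matches a taxonomy entry."""
    return ingredient.strip().lower() in TAXONOMY

print(is_recognized("Salt"))            # True: tagged and added to the product
print(is_recognized("jambon do porc"))  # False: the parser fails here
```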

But when an ingredient isn't recognized, the process fails.

The ingredient "Jambon do porc" (pork ham) was not recognized by the parser (from the product edit page)

For this reason, we introduced an additional layer to the process: the Ingredients Spellcheck, designed to correct ingredient lists before they're processed by the ingredient parser.

A simpler approach would be Peter Norvig's algorithm, which processes each word by applying a series of character deletions, additions, and replacements to identify potential corrections.
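As a rough illustration, here is a minimal sketch of the candidate-generation step behind that algorithm (only deletions, additions, and replacements are shown; real implementations also rank candidates against a word-frequency dictionary):

```python
import string

def edits1(word: str) -> set[str]:
    """All candidate strings one edit away from `word`
    (deletions, additions, and replacements only)."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + replaces + inserts)

# A dictionary of known words would then filter and rank the candidates:
print("flour" in edits1("floor"))  # True: one replacement away
```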

However, this method proved insufficient for our use case, for several reasons:

  • Special Characters and Formatting: Elements like commas, brackets, and percentage signs hold critical importance in ingredient lists, influencing product composition and allergen labeling (e.g., “salt (1.2%)”).
  • Multilingual Challenges: the database contains products from all around the world, with a wide variety of languages. This further complicates a basic character-based approach like Norvig’s, which is language-agnostic.

Instead, we turned to the latest advances in Machine Learning, particularly Large Language Models (LLMs), which excel at a wide variety of Natural Language Processing (NLP) tasks, including spelling correction.

This is the path we decided to take.

You can't improve what you don't measure.

What’s a very good correction? And easy methods to measure the performance of the corrector, LLM or non-LLM?

Our first step is to understand and catalog the diversity of errors the Ingredient Parser encounters.

Moreover, it's essential to assess whether an error should even be corrected in the first place. Sometimes, attempting to correct mistakes can do more harm than good:

flour, salt (1!2%)
# Is it 1.2% or 12%?...

For these reasons, we created the Spellcheck Guidelines, a set of rules that limits the corrections. These guidelines served us in many ways throughout the project, from dataset generation to model evaluation.

The guidelines were notably used to create the Spellcheck Benchmark, a curated dataset containing roughly 300 manually corrected lists of ingredients.

This benchmark is the cornerstone of the project. It enables us to evaluate any solution, Machine Learning or simple heuristic, on our use case.

It goes hand in hand with the Evaluation algorithm, a custom solution we developed that transforms a set of corrections into measurable metrics.

The Evaluation Algorithm

Most existing metrics and evaluation algorithms for text-related tasks compute the similarity between a reference and a prediction, such as the BLEU or ROUGE scores used for language translation or summarization.

However, in our case, these metrics fall short.

We want to evaluate how well the Spellcheck algorithm recognizes and fixes the right words in a list of ingredients. Therefore, we adapted the Precision and Recall metrics to our task:

Precision = Correct corrections by the model / Total corrections made by the model

Recall = Correct corrections by the model / Total number of errors

However, we don't have a fine-grained view of which words were supposed to be corrected… We only have access to:

  • The original: the list of ingredients as present in the database;
  • The reference: how we expect this list to be corrected;
  • The prediction: the correction from the model.

Is there any way to calculate the number of errors that were correctly fixed, those that were missed by the Spellcheck, and finally the errors that were wrongly corrected?

The answer is yes!

Original:   "Th cat si on the fride,"
Reference:  "The cat is on the fridge."
Prediction: "Th big cat is in the fridge."

With the example above, we can easily spot which words were supposed to be corrected: The, is, and fridge; and which word was wrongly corrected: on into in. Finally, we see that an additional word was added: big.

If we align these three sequences in pairs, original-reference and original-prediction, we can detect which words were supposed to be corrected, and those that weren't. This alignment problem is well known in bioinformatics, where it's called Sequence Alignment, and its purpose is to identify regions of similarity.

It is a perfect analogy for our spellcheck evaluation task.

Original:    "Th   -    cat   si   on   the   fride,"
Reference:   "The  -    cat   is   on   the   fridge."
              1    0    0     1    0    0     1

Original:    "Th   -    cat   si   on   the   fride,"
Prediction:  "Th   big  cat   is   in   the   fridge."
              0    1    0     1    1    0     1
              FN   FP         TP   FP         TP

By labeling each pair with a 0 or a 1 depending on whether the word was modified, we can calculate how often the model correctly fixes mistakes (True Positives, TP), incorrectly changes correct words (False Positives, FP), and misses errors that should have been corrected (False Negatives, FN).

In other words, we can calculate the Precision and the Recall of the Spellcheck!
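Here is a simplified sketch of this idea using Python's difflib for the token alignment. The project's real algorithm is more careful about tokenization and inserted words; this version ignores insertions (like "big") entirely:

```python
from difflib import SequenceMatcher

def changed_labels(original: list[str], other: list[str]) -> list[int]:
    """Label each token of `original` with 1 if it was modified in `other`, else 0.
    Inserted words are ignored in this simplified version."""
    labels = [0] * len(original)
    for tag, i1, i2, _, _ in SequenceMatcher(a=original, b=other).get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                labels[i] = 1
    return labels

original   = "Th cat si on the fride,".split()
reference  = "The cat is on the fridge.".split()
prediction = "Th big cat is in the fridge.".split()

should_change = changed_labels(original, reference)   # what *should* be fixed
did_change    = changed_labels(original, prediction)  # what the model changed

tp = sum(s and d for s, d in zip(should_change, did_change))      # fixed real errors
fp = sum(not s and d for s, d in zip(should_change, did_change))  # changed correct words
fn = sum(s and not d for s, d in zip(should_change, did_change))  # missed errors

precision = tp / (tp + fp)  # 2 / 3 in this example
recall    = tp / (tp + fn)  # 2 / 3 in this example
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```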

We now have a robust algorithm capable of evaluating any Spellcheck solution!

You can find the algorithm in the project repository.

Large Language Models (LLMs) have proven to be a great help in tackling Natural Language tasks across various industries.

They are a path we had to explore for our use case.

Many LLM providers brag about the performance of their models on leaderboards, but how do they perform at correcting errors in lists of ingredients? We evaluated them!

We evaluated GPT-3.5 and GPT-4o from OpenAI, Claude-Sonnet-3.5 from Anthropic, and Gemini-1.5-Flash from Google using our custom benchmark and evaluation algorithm.

We wrote detailed prompt instructions to steer the corrections toward our custom guidelines.
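As an illustration, an evaluation call could look like the sketch below. The system prompt here is a stand-in; the real instructions encode the full Spellcheck Guidelines:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# Stand-in system prompt; the real one encodes the full Spellcheck Guidelines.
SYSTEM_PROMPT = (
    "You are a spellchecker for lists of food ingredients. "
    "Fix only obvious spelling mistakes. Keep punctuation, percentages, "
    "and the original language unchanged. If unsure, leave the text as is. "
    "Return only the corrected list."
)

def correct(ingredient_list: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model for a corrected version of an ingredient list."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for reproducible evaluation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ingredient_list},
        ],
    )
    return response.choices[0].message.content

print(correct("boneless chickn breast, jambon do porc, sel"))
```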

LLM evaluation on our benchmark (image from author)

GPT-3.5-Turbo delivered the best performance compared to the other models, both in terms of metrics and in manual review. Special mention goes to Claude-Sonnet-3.5, which showed impressive error correction (high Recall) but often added irrelevant explanations, lowering its Precision.

Great! We have an LLM that works! Time to build the feature into the app!

Well, not so fast…

Using proprietary LLMs comes with several challenges:

  1. Lack of Ownership: We become dependent on the providers and their models. New model versions are released frequently, altering the model's behavior. This instability, mainly because the model is designed for general purposes rather than our specific task, complicates long-term maintenance.
  2. Model Deletion Risk: We have no safeguards against providers removing older models. For instance, GPT-3.5 is slowly being replaced by more performant models, despite being the best model for this task!
  3. Performance Limitations: The performance of a proprietary LLM is constrained by its prompts. In other words, our only way of improving outputs is through better prompts, since we cannot modify the model's weights by training it on our own data.

For these reasons, we chose to focus our efforts on open-source solutions that would give us complete control and could outperform general-purpose LLMs.

The model training workflow: from dataset extraction to model training (image from author)

Any Machine Learning solution starts with data. In our case, the data consists of corrected lists of ingredients.

However, not all lists of ingredients are equal. Some are free of unrecognized ingredients; others are so unreadable that there would be no point in correcting them.

Therefore, we struck a balance by selecting lists of ingredients with between 10 and 40 percent unrecognized ingredients. We also ensured there were no duplicates within the dataset, and none shared with the benchmark, to prevent any data leakage during the evaluation stage.

We extracted 6,000 uncorrected lists from the Open Food Facts database using DuckDB, a fast in-process SQL engine capable of processing millions of rows in under a second.
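The extraction query could look roughly like this. The file name and column names below are illustrative and may differ from the actual Open Food Facts dump schema; the filter mirrors the 10-40% selection criterion described above:

```python
import duckdb

# Illustrative extraction; file and column names are assumptions about the dump schema.
query = """
    SELECT code, ingredients_text
    FROM read_parquet('openfoodfacts-products.parquet')
    WHERE ingredients_n > 0
      AND unknown_ingredients_n * 1.0 / ingredients_n BETWEEN 0.10 AND 0.40
    ORDER BY random()
    LIMIT 6000
"""
uncorrected_lists = duckdb.sql(query).df()  # ~6,000 rows as a pandas DataFrame
```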

However, those extracted lists are not corrected yet, and manually annotating them would take too much time and resources…

That said, we have access to LLMs we already evaluated on this exact task. Therefore, we prompted GPT-3.5-Turbo, the best model on our benchmark, to correct every list in accordance with our guidelines.

The process took less than an hour and cost nearly $2.

We then manually reviewed the dataset using Argilla, an open-source annotation tool specialized in Natural Language Processing tasks. This process ensures the dataset is of sufficient quality to train a reliable model.
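Logging the corrections for review could look like the sketch below, assuming Argilla 1.x and its Text2Text task; the URL, API key, dataset name, and records are all illustrative:

```python
import argilla as rg

# Assumes a running Argilla 1.x instance; URL and key are placeholders.
rg.init(api_url="http://localhost:6900", api_key="argilla.apikey")

# (original, GPT-3.5-Turbo correction) pairs from the extraction step; illustrative.
pairs = [
    ("boneless chickn breast, sel", "boneless chicken breast, sel"),
]

records = [
    rg.Text2TextRecord(text=original, prediction=[correction])
    for original, correction in pairs
]
rg.log(records=records, name="spellcheck-training-dataset")
```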

We now have at our disposal a training dataset and an evaluation benchmark to train our own model on the Spellcheck task.

Training

For this stage, we decided to go with sequence-to-sequence language models. In other words, these models take a text as input and return a text as output, which suits the spellcheck process.

Several models fit this role, such as the T5 family developed by Google in 2020, or current open-source LLMs such as Llama or Mistral, which are designed for text generation and instruction following.
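For intuition, here is what seq2seq inference looks like with a T5-style model from the Transformers library. The checkpoint and prompt are illustrative, not the model we shipped:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative checkpoint; any seq2seq model with the same interface would do.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = "Correct the spelling of this ingredient list: boneless chickn breast, sel"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```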

Model training consists of a succession of steps, each requiring different resource allocations, such as cloud GPUs, data validation, and logging. For this reason, we decided to orchestrate the training using Metaflow, a pipeline orchestrator designed for Data Science and Machine Learning projects.

The training pipeline works as follows (a simplified skeleton of such a flow is sketched after the list):

  • Configurations and hyperparameters are imported into the pipeline from config YAML files;
  • The training job is launched in the cloud using AWS SageMaker, along with the set of model hyperparameters and the custom modules, such as the evaluation algorithm. Once the job is done, the model artifact is stored in an AWS S3 bucket. All training details are tracked with Comet ML;
  • The fine-tuned model is then evaluated on the benchmark using the evaluation algorithm. Depending on the model size, this process can be extremely long. Therefore, we used vLLM, a Python library designed to accelerate LLM inference;
  • The predictions against the benchmark, also stored in AWS S3, are sent to Argilla for human evaluation.
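Here is a minimal, illustrative Metaflow skeleton of such a flow. Step names, config values, and S3 paths are placeholders, not the project's actual code; the real flow wires in the SageMaker job, Comet ML tracking, vLLM inference, and the Argilla push:

```python
from metaflow import FlowSpec, step

class SpellcheckTrainingFlow(FlowSpec):
    """Illustrative skeleton of the training pipeline."""

    @step
    def start(self):
        # Load configurations and hyperparameters from the YAML config files.
        self.config = {"base_model": "mistral-7b", "epochs": 3}  # placeholder values
        self.next(self.train)

    @step
    def train(self):
        # Launch the fine-tuning job on AWS SageMaker; the artifact lands in S3.
        self.model_artifact = "s3://example-bucket/spellcheck/model"  # placeholder
        self.next(self.evaluate)

    @step
    def evaluate(self):
        # Run the benchmark with vLLM and score it with the evaluation algorithm.
        self.metrics = {"precision": 0.0, "recall": 0.0}  # filled in by the real run
        self.next(self.end)

    @step
    def end(self):
        # Send the benchmark predictions to Argilla for human evaluation.
        print("Done:", self.metrics)

if __name__ == "__main__":
    SpellcheckTrainingFlow()
```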

After iterating many times between refining the data and training the model, we achieved performance comparable to proprietary LLMs on the Spellcheck task, scoring an F1 score of 0.65.
