An Easy Recipe to Boost the Performance of MLLMs on Your Custom Use Case


An MLLM fine-tuning tutorial using the latest pocket-sized Mini-InternVL model

Photo by Maarten van den Heuvel on Unsplash

The world of large language models (LLMs) is constantly evolving, with new advancements emerging rapidly. One exciting area is the development of multi-modal LLMs (MLLMs), capable of understanding and interacting with both text and images. This opens up a world of possibilities for tasks like document understanding, visual question answering, and more.

I recently wrote a general post about one such model, which you can take a look at here:

In this one, we’ll explore a powerful combination: the InternVL model and the QLoRA fine-tuning technique. We’ll focus on how we can easily customize such models for a specific use case. We’ll use these tools to build a receipt understanding pipeline that extracts key information like company name, address, and total amount of purchase with high accuracy.

This project aims to develop a system that can accurately extract specific information from scanned receipts using InternVL’s capabilities. The task presents a unique challenge, requiring not only robust natural language processing (NLP) but also the ability to interpret the visual layout of the input image. This lets us build a single, OCR-free, end-to-end pipeline that demonstrates strong generalization across complex documents.

To train and evaluate our model, we’ll use the SROIE dataset. SROIE provides 1,000 scanned receipt images, each annotated with key entities like:

  • Company: The name of the shop or business.
  • Date: The purchase date.
  • Address: The store’s address.
  • Total: The total amount paid.
Source: https://arxiv.org/pdf/2103.10213.pdf.

We’ll evaluate the performance of our model using a fuzzy similarity score, a metric that measures the string similarity between predicted and ground-truth entities. It ranges from 0 (irrelevant results) to 100 (perfect predictions).
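The exact implementation of this score isn’t reproduced in this post; a minimal sketch of such a field-wise fuzzy metric, assuming the `rapidfuzz` library and a simple per-field average, could look like this:

```python
from rapidfuzz import fuzz

def fuzzy_similarity(prediction: dict, ground_truth: dict) -> float:
    """Average fuzzy string similarity (0-100) over the four target fields."""
    fields = ["company", "date", "address", "total"]
    scores = [
        fuzz.ratio(str(prediction.get(field, "")), str(ground_truth.get(field, "")))
        for field in fields
    ]
    return sum(scores) / len(scores)

# A perfect prediction scores 100; completely unrelated strings score near 0.
```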

InternVL is a family of multi-modal LLMs from OpenGVLab, designed to excel at tasks involving both images and text. Its architecture combines a vision encoder (like InternViT) with a language model (like InternLM2 or Phi-3). We’ll focus on the Mini-InternVL-Chat-2B-V1-5 variant, a smaller version that’s well suited for running on consumer-grade GPUs.

InternVL’s key strengths:

  • Efficiency: Its compact size allows for efficient training and inference.
  • Accuracy: Despite being smaller, it achieves competitive performance in various benchmarks.
  • Multi-modal Capabilities: It seamlessly combines image and text understanding.

Demo: You can explore a live demo of InternVL here.

To further boost our model’s performance, we’ll use QLoRA, a fine-tuning technique that significantly reduces memory consumption while preserving performance. Here’s how it works (a short code sketch follows the list below):

  1. Quantization: The pre-trained LLM is quantized to 4-bit precision, reducing its memory footprint.
  2. Low-Rank Adapters (LoRA): Instead of modifying all the parameters of the pre-trained model, LoRA adds small, trainable adapters to the network. These adapters capture task-specific information without requiring changes to the base model.
  3. Efficient Training: The combination of quantization and LoRA enables efficient fine-tuning even on GPUs with limited memory.
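The fine-tuning script below relies on a `wrap_lora` helper that isn’t reproduced in this post. As a rough illustration of step 2, here is a minimal sketch of what such a wrapper could look like with Hugging Face’s `peft` library; the target module names are an assumption and must match the projection layers of the underlying InternLM2 language model:

```python
from peft import LoraConfig, get_peft_model

def wrap_lora(model, r=128, lora_alpha=256):
    """Attach small trainable low-rank adapters to the (frozen) base model."""
    lora_config = LoraConfig(
        r=r,                    # rank of the adapter matrices
        lora_alpha=lora_alpha,  # scaling applied to the adapter output
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        # Assumed attention/MLP projection names for InternLM2; adjust as needed.
        target_modules=["wqkv", "wo", "w1", "w2", "w3"],
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # only the adapters are trainable
    return model
```

With r=128 and lora_alpha=256 (the values used later), only a small fraction of the parameters end up trainable, which is what makes 4-bit fine-tuning on a single consumer GPU practical.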

Let’s dive into the code. First, we’ll assess the baseline performance of Mini-InternVL-Chat-2B-V1-5 without any fine-tuning:

import torch
from transformers import BitsAndBytesConfig

# InternVLChatModel, InternLM2Tokenizer and load_image come from the InternVL
# code in the linked project repo; `args` holds the script's command-line
# arguments (model path and an optional quantization flag).

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = InternVLChatModel.from_pretrained(
    args.path,
    device_map={"": 0},
    quantization_config=quant_config if args.quant else None,
    torch_dtype=torch.bfloat16,
)

tokenizer = InternLM2Tokenizer.from_pretrained(args.path)
# set the max number of tiles in `max_num`

model.eval()

pixel_values = (
    load_image(image_base_path / "X51005255805.jpg", max_num=6)
    .to(torch.bfloat16)
    .cuda()
)

generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

# single-round single-image conversation
query = (
    "Extract the company, date, address and total in json format. "
    "Respond with a valid JSON only."
)
response = model.chat(tokenizer, pixel_values, query, generation_config)

print(response)

The result:

```json
{
"company": "SAM SAM TRADING CO",
"date": "Fri, 29-12-2017",
"address": "67, JLN MENHAW 25/63 TNN SRI HUDA, 40400 SHAH ALAM",
"total": "RM 14.10"
}
```

This code:

  1. Loads the model from the Hugging Face hub.
  2. Loads a sample receipt image and converts it to a tensor.
  3. Formulates a prompt asking the model to extract the relevant information from the image.
  4. Runs the model and outputs the extracted information in JSON format (see the parsing sketch below).
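Since the model often wraps its answer in a Markdown code fence (as in the output above), it helps to parse the response defensively before comparing it with the ground truth. The helper below is a minimal sketch of my own, not part of the project code:

```python
import json
import re

def parse_json_response(response: str) -> dict:
    """Extract a JSON object from the model's reply, stripping any Markdown fences."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", response.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the first {...} block found in the text, if any.
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        return json.loads(match.group(0)) if match else {}

prediction = parse_json_response(response)
print(prediction.get("company"), prediction.get("total"))
```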

This zero-shot evaluation shows impressive results, achieving an average fuzzy similarity score of 74.24%. This demonstrates InternVL’s ability to understand receipts and extract information without any fine-tuning.

To further boost accuracy, we’ll fine-tune the model using QLoRA. Here’s how we implement it:

import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

# load_data, wrap_lora, SFTDataset, CustomDataCollator, IMG_CONTEXT_TOKEN and
# the path/epoch constants come from the project code in the linked repo.

_data = load_data(args.data_path, fold="train")

# Quantization config: load the frozen base model in 4-bit NF4 precision
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = InternVLChatModel.from_pretrained(
    path,
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

tokenizer = InternLM2Tokenizer.from_pretrained(path)

# tell the model which token id marks the image patches in the prompt
img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
print("img_context_token_id", img_context_token_id)
model.img_context_token_id = img_context_token_id

# disable the KV cache during training
model.config.llm_config.use_cache = False

# add small trainable LoRA adapters on top of the quantized base model
model = wrap_lora(model, r=128, lora_alpha=256)

training_data = SFTDataset(
    data=_data, template=model.config.template, tokenizer=tokenizer
)

collator = CustomDataCollator(pad_token=tokenizer.pad_token_id, ignore_index=-100)

train_params = TrainingArguments(
    output_dir=str(BASE_PATH / "results_modified"),
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    optim="paged_adamw_32bit",
    save_steps=len(training_data) // 10,
    logging_steps=len(training_data) // 50,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.001,
    max_steps=-1,
    group_by_length=False,
    max_grad_norm=1.0,
)

# Trainer
fine_tuning = SFTTrainer(
    model=model,
    train_dataset=training_data,
    dataset_text_field="###",
    tokenizer=tokenizer,
    args=train_params,
    data_collator=collator,
    max_seq_length=tokenizer.model_max_length,
)

fine_tuning.model.print_trainable_parameters()
# Training
fine_tuning.train()
# Save Model (only the LoRA adapter weights are written to disk)
fine_tuning.model.save_pretrained(refined_model)

This code:

  1. Loads the model with quantization enabled.
  2. Wraps the model with LoRA, adding trainable adapters.
  3. Creates a dataset from the SROIE dataset.
  4. Defines training arguments such as the learning rate, batch size, and number of epochs.
  5. Initializes a trainer to handle the training process.
  6. Trains the model on the SROIE dataset.
  7. Saves the fine-tuned model (only the LoRA adapter weights); see the reloading sketch below.
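To evaluate the fine-tuned model, the saved adapters have to be loaded back on top of the quantized base model. The snippet below is a minimal sketch using `peft`; the variable names (`path`, `quant_config`, `refined_model`, `pixel_values`, `query`, `generation_config`) refer to the earlier snippets, and the project’s actual evaluation script may differ:

```python
from peft import PeftModel

# Reload the 4-bit base model exactly as it was loaded for training ...
base_model = InternVLChatModel.from_pretrained(
    path,
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

# ... then attach the trained LoRA adapters saved by `save_pretrained`.
model = PeftModel.from_pretrained(base_model, refined_model)
model.eval()

# Inference then works as in the zero-shot example (PeftModel forwards
# unknown attributes such as `chat` to the wrapped model).
response = model.chat(tokenizer, pixel_values, query, generation_config)
print(response)
```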

Here’s a sample comparison between the base model and the QLoRA fine-tuned model:

Ground Truth: 

{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2018",
"address": "NO 4,JALAN PERJIRANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR.",
"total": "72.00"
}

Prediction from the base model: KO (incorrect)

```json
{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2016",
"address": "JM092487-D",
"total": "67.92"
}
```

Prediction from the QLoRA fine-tuned model: OK (correct)

{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2018",
"address": "NO 4, JALAN PERUBANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR",
"total": "72.00"
}

After fine-tuning with QLoRA, our model achieves a remarkable 95.4% fuzzy similarity score, a major improvement over the baseline performance (74.24%). This demonstrates the power of QLoRA to boost model accuracy without requiring massive computing resources (15 minutes of training on 600 samples on an RTX 3080 GPU).

We’ve successfully built a robust receipt understanding pipeline using InternVL and QLoRA. This approach showcases the potential of multi-modal LLMs for real-world tasks like document analysis and information extraction. In this example use case, we gained more than 20 points in prediction quality using just a few hundred examples and a few minutes of compute time on a consumer GPU.

You can find the full code implementation for this project here.

The development of multi-modal LLMs is only just beginning, and the future holds exciting possibilities. Automated document processing has immense potential in the era of MLLMs. These models can revolutionize how we extract information from contracts, invoices, and other documents while requiring minimal training data. By integrating text and vision, they can analyze the layout of complex documents with unprecedented accuracy, paving the way for more efficient and intelligent information management.

The future of AI is multi-modal, and InternVL and QLoRA are powerful tools to help us unlock its potential on a small compute budget.

Links:

Code: https://github.com/CVxTz/doc-llm

Dataset Source: https://rrc.cvc.uab.es/?ch=13&com=introduction
Dataset License: Creative Commons Attribution 4.0 International License.
