or vision-language models is a powerful technique that unlocks their potential on specialized tasks. However, despite their effectiveness, these approaches are often out of reach for many users due to their high computational cost and the need for GPUs with large VRAM, resources that only a small percentage of end users can access.
In this project, I fine-tuned IBM's Granite-Vision 2B, a relatively small yet powerful vision-language model, to tackle the challenge of converting images of tables into clean, structured HTML code.
What makes this project particularly exciting is that the fine-tuning was performed on a consumer-grade GPU, the NVIDIA RTX 4070 Ti Super, and yet the resulting 2-billion-parameter model was able to outperform much larger models, including meta-llama/Llama-3.2-90B-Vision, on this image-to-text generation task. This success not only demonstrates the power of parameter-efficient fine-tuning methods like LoRA but also highlights the practical value of building specialized small models tailored to specific problems.
In this post, I'll walk you through the motivation behind this work, the model and dataset choices, the custom HTML similarity metric I adapted, the experiments and results, and finally, the key insights and lessons learned along the way. Whether you're interested in vision-language models, fine-tuning techniques, or practical AI applications, I hope this journey offers useful takeaways. The fine-tuning code used for this project was adapted from HuggingFace's Granite Vision fine-tuning cookbook, authored by Eli Schwartz, who in turn adapted the original code from Sergio Paniego.
Motivation
While working on Retrieval-Augmented Generation (RAG) projects, I encountered a major challenge: accurately extracting large and complex tables from PDFs, especially when these tables appeared as images. Despite trying different approaches, including tools like Unstructured and large vision-language models such as Meta's Llama 90B, the results often fell short of the accuracy needed.
This led me to consider a different approach: a small, specialized vision-language model focused exclusively on table understanding and extraction. Such a model could serve as a dedicated preprocessing step to significantly improve RAG pipelines that depend on accurate table extraction.
Around the same time, IBM released Granite-Vision 2B, a vision-language model with just the right balance of size and power. It's capable enough to handle complex tables, yet small enough to be fine-tuned on consumer-grade GPUs with 16 GB of VRAM. This made it an ideal candidate for my project.
The Task: Image to HTML (Table Extraction)
One important design choice was the target format: HTML. By converting tables into clean HTML code, we obtain a structured and widely supported representation that can easily be converted into other formats. For instance, HTML tables can be readily imported into data analysis tools like Pandas as dataframes, making downstream processing and analysis far more efficient.
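As a quick illustration of that workflow, the snippet below parses an HTML table string into a Pandas dataframe. The table content here is a made-up example, not actual model output:

from io import StringIO
import pandas as pd

# A stand-in for the HTML a model might generate
html = "<table><tr><th>Question</th><th>Kappa</th></tr><tr><td>Q1</td><td>0.74</td></tr></table>"

# pd.read_html parses every <table> it finds and returns a list of DataFrames
df = pd.read_html(StringIO(html))[0]
print(df)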
The original plan was to build a custom dataset by extracting HTML table tags, rendering them as images, and pairing each image with its corresponding HTML code. Fortunately, I found an existing solution: the PubTabNet-HTML dataset, which contains over 568,000 image–HTML pairs, far more than needed for this project.
PubTabNet was developed by IBM and is based on scientific articles from the PubMed Central Open Access Subset (commercial use collection). The tables were extracted by aligning the PDF and XML versions of the articles. The annotations (i.e., the HTML labels) are licensed under the Community Data License Agreement – Permissive – Version 1.0, and while IBM doesn't own the images, they are used in accordance with the PMC Open Access Subset Terms of Use. This makes the dataset suitable for both research and commercial applications, provided the license terms are followed.
Custom Metric: HTML Similarity
Standard text similarity metrics like BLEU or ROUGE are insufficient for evaluating HTML table generation because they primarily focus on surface-level text matching and ignore important structural and stylistic aspects of HTML code.
To better capture the quality of generated HTML tables, I adapted a custom HTML Similarity metric that combines multiple complementary components, where the most important ones (style and structure) are imported from niteru:
Style similarity (S): Extracts the CSS classes of each HTML document and calculates the Jaccard similarity of the two sets of classes.
Structural similarity (T): Uses sequence comparison of the HTML tags to compute the similarity.
Content similarity (C): Based on normalized edit distance between the extracted plain text content of the tables.
Token overlap similarity (J): The Jaccard similarity between the sets of content tokens.
The final similarity score M is a weighted sum of these components.
I manually tested the metric on various example outputs, iteratively adjusting the weighting coefficients to better capture meaningful similarities. This process resulted in a balanced evaluation that fairly rewards accurate table structure and style, alongside precise textual content. The Python implementation of the helper functions is as follows:
import re

from bs4 import BeautifulSoup
from torchmetrics.text import EditDistance
from niteru import style_similarity, structural_similarity

ed_distance = EditDistance()
def extract_table_text(html):
    """Extracts only the text from an HTML table in row-wise, space-separated format."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # Find the first table
    if not table:
        return ""
    # Extract rows and join cells with spaces
    return "\n".join(
        " ".join(cell.get_text(strip=True) for cell in row.find_all(["th", "td"]))
        for row in table.find_all("tr")
    )
def extract_html_table(html):
    """Extracts the <table>...</table> block from a text string"""
    match = re.search(r'<table.*?</table>', html, flags=re.DOTALL)
    return match.group(0) if match else ""
The metric also includes a regex-based function to extract only the HTML content inside the <table> tags. This was essential because one of the reference models generated incomplete or extra HTML outside of the table structure. By focusing the comparison strictly on the table content, the metric provides a fairer and more meaningful evaluation across models.
Developing a custom evaluation metric like this is crucial for reliably tracking model improvements and benchmarking performance against reference models.
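To make the weighted sum concrete, here is a minimal sketch of how the four components can be combined, reusing the helper functions and imports above. The weights below are illustrative placeholders, not the exact coefficients I settled on after manual tuning:

def html_similarity(pred_html, true_html,
                    w_style=0.3, w_struct=0.3, w_content=0.3, w_token=0.1):
    # NOTE: illustrative weights; the tuned values used in the experiments differ
    pred_table = extract_html_table(pred_html)
    true_table = extract_html_table(true_html)

    # Style (S) and structural (T) similarity from niteru
    s = style_similarity(pred_table, true_table)
    t = structural_similarity(pred_table, true_table)

    # Content similarity (C): 1 - normalized edit distance over the plain text
    pred_text = extract_table_text(pred_table)
    true_text = extract_table_text(true_table)
    dist = ed_distance([pred_text], [true_text]).item()
    c = 1.0 - min(1.0, dist / max(len(true_text), 1))

    # Token overlap similarity (J): Jaccard similarity between content token sets
    pred_tokens, true_tokens = set(pred_text.split()), set(true_text.split())
    j = len(pred_tokens & true_tokens) / max(len(pred_tokens | true_tokens), 1)

    return w_style * s + w_struct * t + w_content * c + w_token * j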
Training Setup
To fine-tune the model efficiently on my NVIDIA RTX 4070 Ti Super, which has 16 GB of VRAM, I used LoRA (Low-Rank Adaptation). This allowed me to update only a small number of parameters, significantly reducing GPU memory usage. In fact, during training, the model used only about half of the available VRAM, with enough headroom to play around with longer sequences, but not enough to handle more than one sample per batch. Moreover, LoRA is generally faster to train than approaches like QLoRA.
LoRA Setup
I used the following LoRA configuration:
from peft import LoraConfig

# Setup LoRA
target_modules = []
for layer_type in layers_to_tune:
    target_modules.extend(
        name for name, _ in model.named_modules()
        if (layer_type in name)
        and '_proj' in name
    )

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
    use_dora=True,
    init_lora_weights="gaussian"
)
Key points:
r=16: This low-rank dimension provides a good balance between model capacity and GPU memory usage.
use_dora=True: DoRA (Weight-Decomposed Low-Rank Adaptation) improves the learning capacity and stability of LoRA by decomposing the pretrained weights into magnitude and direction components, helping the model better approach the capacity of full fine-tuning, all without adding inference overhead. It performed slightly better than the default setting.
init_lora_weights="gaussian": No particular reason; I didn't want to experiment with this parameter.
target_modules: This flexible setup allows selectively targeting the vision layers, the language layers, or both, depending on the experiment. In practice, the vision layers remained unaffected, even with use_dora=False, since DoRA currently supports only embedding, linear, and Conv2d layers. As a result, I fine-tuned only the language layers.
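For completeness, attaching the adapter to the already-loaded Granite-Vision model follows the standard PEFT pattern; a minimal sketch (alternatively, the config can be passed straight to the trainer, as shown later):

from peft import get_peft_model

# Wrap the base model with the LoRA/DoRA adapter defined above
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # sanity check: only a small fraction should be trainable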
Dataset Setup
During my initial experiments, I kept running into out-of-memory (OOM) errors, even though there was still plenty of GPU VRAM available after loading the model, LoRA layers, and optimizer parameters (around 4 GB still free). There were no memory spikes during training, but the crashes consistently happened at the same training step.
After some investigation, I realized that the problem was caused by large tables, which resulted in extremely long token sequences. To handle this, I adjusted the max_seq_length parameter and filtered out samples that exceeded this limit. After experimentation, I found that max_seq_length = 1024 allowed me to fine-tune the model reliably without triggering OOM errors.
To filter out oversized tables, I wrote a simple data processing function that:
Filters out samples whose HTML token length exceeds max_seq_length
Automatically balances the number of training and test samples
Uses streaming to avoid loading the entire dataset into memory (PubTabNet-HTML is quite large, around 10 GB on disk)
from datasets import load_dataset, Dataset
from tqdm import tqdm

def load_process_filter_dataset(dataset, max_seq_length, num_train_images, num_test_images, system_message):
    global processor
    ds = load_dataset(dataset, split='train', streaming=True)
    max_html_tokens = max_seq_length - len(processor.tokenizer.tokenize(system_message))
    num_total_needed = num_train_images + num_test_images
    filtered_samples = []
    p_bar = tqdm(total=num_total_needed, desc="Filtering dataset samples")
    for sample in ds:
        processed = process_and_filter_example(sample, max_html_tokens)
        if processed:
            filtered_samples.append(processed)
            p_bar.update(1)
        if len(filtered_samples) >= num_total_needed:
            break
    p_bar.close()
    # Convert to in-memory dataset
    ds_filtered = Dataset.from_list(filtered_samples)
    # Split into train/test
    ds_train = ds_filtered.select(range(num_train_images))
    ds_test = ds_filtered.select(range(num_train_images, num_total_needed))
    return ds_train, ds_test
def process_and_filter_example(example, max_html_tokens):
    global processor
    extracted_table = extract_html_table(example['html_table'])
    token_count = len(processor.tokenizer.tokenize(extracted_table))
    if token_count < max_html_tokens:
        example['html_table'] = extracted_table
        return example
    return None
The final configuration used num_train_images=10000 and num_test_images=250 to compute the evaluation loss.
Fine-Tuning Configuration
For training, I used the Hugging Face SFTTrainer (from the TRL library) to fine-tune the model; a minimal configuration sketch follows the parameter list below:
num_train_epochs=1: The dataset is very large, and to run multiple experiments efficiently, I chose to train for only one full epoch while maximizing learning per sample and the number of training samples.
per_device_train_batch_size=1: Larger batch sizes would not fit in GPU memory without significantly reducing max_seq_length, which would hurt performance on large tables. Keeping longer sequences was more important for this task.
gradient_accumulation_steps=8: Used to effectively simulate a larger batch size and help stabilize the learning process, compensating for the small physical batch. This is the final value, but I also experimented with gradient_accumulation_steps=4.
optim="adamw_torch_fused" and bf16=True: These settings leverage modern NVIDIA architectures (Ada Lovelace) to speed up training and reduce memory usage, as recommended for this hardware.
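Putting these settings together, a minimal configuration sketch could look like the following. Note that output_dir, collate_fn, and the dataset variables are assumptions from my setup rather than values taken verbatim from the cookbook:

from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="granite-vision-table2html",         # assumed run name
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_torch_fused",
    bf16=True,
    logging_steps=25,
    eval_strategy="steps",
    save_strategy="steps",
    max_seq_length=1024,                            # renamed to max_length in newer TRL versions
    dataset_kwargs={"skip_prepare_dataset": True},  # images + text are collated manually
    remove_unused_columns=False,
)

trainer = SFTTrainer(
    model=model,                 # already wrapped with the LoRA/DoRA adapter
    args=training_args,
    train_dataset=ds_train,
    eval_dataset=ds_test,
    data_collator=collate_fn,    # custom vision-language collator (not shown here)
)
trainer.train()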
Evaluation Loss Workaround
At the time of developing this project, there was a known issue in the Transformers + LoRA integration that caused an error when running evaluation with a validation dataset during training. Fortunately, a community-tested workaround was available (although not yet merged into the main branch), and I successfully used this fix in my experiments.
Evaluation (Inference) Setup
The evaluation dataset used for final scoring was completely independent from the eval_dataset used during training. It consists of 500 randomly chosen images, none of which were included in either the train_dataset or the training eval_dataset.
Once fine-tuning was complete, I used the best model checkpoint, selected based on the lowest evaluation loss, to run inference on these 500 samples.
Initially, I attempted to perform inference by simply loading the LoRA/DoRA adapter on top of the base model. However, I found that inference with DoRA adapters is extremely slow when they are not merged into the model weights (as explained in the official PEFT docs). In fact, generating one random test sample took about 90 seconds in this configuration.
To resolve this, I merged the adapter weights into the base model, which is the recommended practice, and inference speed improved dramatically: down to ~20 seconds for the same sample, making full evaluation runs far more practical.
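In PEFT, this merge is a one-liner; a minimal sketch, where base_model is the loaded Granite-Vision checkpoint and the adapter path is a placeholder:

from peft import PeftModel

# Load the DoRA adapter on top of the base model, then fold it into the weights
model = PeftModel.from_pretrained(base_model, "checkpoints/lang_table_only_4/best")  # placeholder path
model = model.merge_and_unload()  # merged weights: no per-step adapter overhead at inference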
The reference models used for comparison with my fine-tuned models are:
meta-llama/Llama-3.2-90B-Vision: Meta's massive 90-billion-parameter model, the main baseline I aimed to surpass through specialization and parameter-efficient fine-tuning of a much smaller VLM.
KennethTM/pix2struct-base-table2html: A much smaller model fine-tuned from Google's pix2struct-base, highly specialized for the exact same dataset I used in this project. Thanks to its smaller size, the developer(s) were able to train it on many more samples and over longer training runs, demonstrating the key advantage of using smaller, targeted models for specific tasks.
These two baselines allowed me to benchmark both scaling-based performance (vs. the 90B model) and specialization efficiency (vs. the smaller, dedicated Pix2Struct model).
Experiments & Results
A total of 9 experiments were conducted, iteratively modifying one or two components at a time. The goal was to understand the effect of each change on model performance, progressively refining the setup to achieve the best possible HTML Similarity score compared to the reference models.
The experimental process was incremental: whenever a change improved the results, it was incorporated into the next round of experiments, and I continued exploring new variations.
The experiments focused on adjusting the following components:
1. Vision vs. Language Layers
1.1 lang_only
1.2 vision_only
1.3 lang_vision
2. Ground Truth Output Format
3. Training Framework
3.1 lang_table_unsloth
3.2 vision_table_unsloth
4. Gradient Accumulation
5. Prompt Format
6. Gradient Accumulation & Dataset Size
Both the evaluation loss and the HTML Similarity metric were used to assess model performance, and I found them to be well correlated, confirming that HTML Similarity is a good proxy for how well the model is learning the task.
Before diving into the results of each experiment, let's first look at GPU memory utilization during training, which is often the most critical factor in determining whether a model can be fine-tuned on consumer hardware.
GPU Memory Utilization During Training | Image by author from wandb.ai
As shown in the graph, GPU utilization remained stable throughout training, averaging around 75% VRAM usage, or roughly 12 GB on my GPU. Most of the VRAM (~5.5 GB) goes to the frozen model weights. LoRA gradients + optimizer states take very little (<< 1 GB). Activations + overhead fill the rest (~5–6 GB), which depends on batch_size and max_seq_length.
First Run: lang_only
This experiment uses the following initial components/parameters:
These were the starting values for the first experiment. In subsequent runs, I modified many of them as I refined the approach. This first experiment focused only on tuning the language layers, while training the model to predict the full raw HTML output, including everything inside and around the <table> tags.
Since this was the first run, I'll include the training loss curve here to illustrate how it behaves. For later experiments, I'll omit this graph, since the behavior was similar across runs, with minor variations. In practice, the evaluation loss is more useful for comparing performance across experiments.
Training Loss | Image by author from wandb.ai
One important note about the logging configuration: logging_steps=25 means that the training loss is only logged every 25 steps, where each logged value is the average over gradient_accumulation_steps=4. As a result, the biggest drop in loss appears at the second log point, where most of the initial learning happens. After that, the model continues learning more gradually, with a slowly decreasing trend that depends on the difficulty of the training samples.
Now, let's take a look at the evaluation loss:
Validation Loss 1 | Image by author from wandb.ai
Since we are evaluating on the same set of 250 validation samples, the evaluation loss curve gives us a more stable and meaningful view of model learning, and it will serve as a baseline for comparisons across future runs.
Here, we observe a clear and consistent downward trend throughout training. The initial loss starts near 0.03, improves steadily as training progresses, and eventually stabilizes slightly below 0.015.
The smooth nature of this curve, compared to the more variable training loss, reflects the regular structure of the validation set and confirms that the model is generalizing well to unseen samples, even with a small batch size and a single epoch of training.
Now, let's compare the performance of this fine-tuned model against the reference models on the HTML Similarity metric:
As we can see, this first experiment already delivers strong performance gains, improving the base Granite-Vision 2B model by a large margin (+0.18) and clearly outperforming LLaMA 90B Vision on this specialized task. Only Pix2Struct retains a slight lead at this stage.
Second Run: vision_only
There isn't much to analyze in this run. I tested several variations that might potentially unblock learning in the vision layers, including drastically increasing the learning rate, but without success.
While the base code suggests that fine-tuning the vision layers should be possible, in practice I found it did not work in this setup. The following evaluation loss curve confirms that no learning occurred: the loss remained constant throughout training. To save compute resources, I stopped the run early:
Validation Loss 2 | Image by author from wandb.ai
Moreover, training was noticeably faster in this run compared to the previous lang_only experiment, suggesting that the language layers (which contain the bulk of the model's parameters) remained frozen and only the small vision layers were being processed:
Validation Samples per Second 1 | Image by author from wandb.ai
Third Run: lang_vision
At this point, it was clear that only the language layers were being effectively trained. In this lang_vision run, where both language and vision layers were selected, I expected results similar to lang_only.
Indeed, the evaluation loss curve confirmed this expectation, showing nearly identical behavior to lang_only:
Validation Loss 3 | Image by author from wandb.ai
Once this was clear, I again stopped training early to conserve resources and proceeded to test new approaches.
Fourth Run: lang_table_only
This experiment modified the following component:
The goal of this run was to train the model to predict only the table content, without any surrounding HTML wrapper code. This approach could help improve learning, by removing unnecessary tokens, and also align the training behavior more closely with that of Pix2Struct.
Moreover, by stripping out the wrapper HTML, the target sequences became shorter, which allowed longer and more complex tables to fit within the model's context window. This change could also improve the model's ability to generalize to larger or more detailed tables.
Let's look at the evaluation loss compared to the first run:
Validation Loss 4 | Image by author from wandb.ai
At first glance, the higher evaluation loss may appear counterintuitive. However, there's a clear explanation: the wrapper HTML code is trivial for the model to learn, since it tends to be nearly identical across many training samples. These repetitive tokens reduce cross-entropy loss, artificially lowering the average loss in earlier runs. By removing them, the model now focuses entirely on the more challenging and variable table content, resulting in a higher but more meaningful loss value.
Now, let's see how this change impacted the HTML Similarity metric:
In this first test, we observe no significant gain or degradation from the new output format. It is possible that the model would need more epochs or more training samples to fully adapt to it. Another idea is to update the prompt, so that from the very first step the model understands it should focus solely on table content, rather than having to infer this behavior through training alone. This will be explored in the next experiments.
Fifth & Sixth Runs: lang_table_unsloth & vision_table_unsloth
In these experiments, I explored the following components:
At this point, I discovered the promising Unsloth framework, which claims to offer 2x faster training with up to 70% lower memory usage. Naturally, I wanted to test whether it could speed up my workflow.
My first idea was to leverage the improved memory handling to run longer sequences (max_seq_length=2048), but in my case this quickly led to out-of-memory (OOM) errors, so I reverted to my previous configuration.
The training speed improvements, however, were undeniable: almost 4x faster than my earlier runs:
Validation Samples per Second 2 | Image by author from wandb.ai
Unfortunately, this came at a clear cost in loss performance:
Validation Loss 5 | Image by author from wandb.ai
Given this noticeable drop in quality, I paused the experiment to investigate further, particularly to see whether Unsloth would allow me to train the vision layers, which is one of its advertised benefits. However, I encountered the exact same behavior as with HuggingFace Transformers: no actual learning in the vision layers.
With these results in mind, I decided to set aside Unsloth for this project and continue using HuggingFace Transformers, which had shown more reliable learning in earlier runs.
Seventh Run: lang_table_only_2
Here are the new parameters for this run:
Going back to the previous configuration, I wanted to investigate the impact of a larger virtual batch size (via a higher gradient_accumulation_steps).
The results were promising: the evaluation loss became smoother and trended closer to the original lang_only run, even though the model was now predicting only the table content:
Validation Loss 6 | Image by author from wandb.ai
Based on this positive result, I decided to keep the gradient_accumulation_steps=8 setting for the final experiment.
Evaluating this model on HTML Similarity showed a small but meaningful improvement, finally reaching parity with Pix2Struct:
Naturally, the goal is not simply to match Pix2Struct, but to surpass it. Two important levers remained to explore: dataset size and the prompt.
Eighth Run: lang_table_only_3
The updated parameters for this run were:
I accidentally reverted gradient_accumulation_steps back to 4 in this run, only realizing it once training was nearly complete, but this actually gave me an extra chance to observe its effect on learning.
The main goal here was to double the training size (to 10K images) and to test the updated, clearer prompt format. Unfortunately, a random CUDA error caused training to halt at around 80% completion; even so, the improvement was clear:
Validation Loss 7 | Image by author from wandb.ai
As expected, some smoothness was lost due to the smaller virtual batch size, but the new prompt proved very effective, noticeably boosting model learning.
This set the stage perfectly for the final experiment, using the improved prompt, 10K training samples, and restoring gradient_accumulation_steps to 8.
Final Run: lang_table_only_4
The final set of parameters was:
The evaluation loss for this final run:
Validation Loss 7 | Image by author from wandb.ai
As expected, restoring gradient_accumulation_steps to 8 smoothed the loss curve, reducing spikes and achieving slightly lower overall loss values. With a full epoch of training on 10K images, this became the best-performing model across all experiments.
Now, let's look at the final results on the HTML Similarity metric:
Final HTML Similarity Results | Image by author from matplotlib
The goal of this project was achieved: the fine-tuned model now surpasses both reference models on this task. Looking back at the original Granite-Vision 2B, LoRA fine-tuning improved its performance to 0.77, a +21 percentage point gain, all achieved in under 8 hours on a consumer-grade GPU.
Qualitative Results
To better illustrate how much the model improved through fine-tuning, let's look at a specific example: Image ID 618932.
PubTabNet Evaluation Sample with ID 618932 | Image from PMC
This table is particularly tricky: under the Kappa column there are sub-headers (Present study and King et al. 2001). These complex layouts typically challenge generic VLMs, especially when they haven't been exposed to enough similar examples during training. Models can often understand these sub-headers and answer questions about them, but generating the full table structure in HTML usually requires further prompt tuning and specialized fine-tuning.
Let's first see how a raw, non-fine-tuned Granite-Vision 2B model performs on this task.
Baseline: Raw Granite-Vision 2B
The model can answer questions about the table accurately:
prompt = 'What is the Kappa value for the question "Do you communicate with this power?" in the present study?'
res = predict(sample['image'], prompt=prompt)
print(res)
Out[1]:
74
However, when asked to generate the full HTML table, the model struggles:
prompt = "Convert table to HTML (<table>...</table>)"
html = predict(sample['image'], prompt=prompt)
html = '' if '<table>' not in html else html
display(HTML(html))
Out[2]:
And the HTML Similarity metrics for this attempt:
Style similarity: 1.0000
Structural similarity: 0.4091
Lev-Edit Distance: 0.1434
Final HTML Similarity Score: 0.3619
Fine-Tuned Model: lang_table_only_4
Now, let's run the exact same test using the fine-tuned model:
from src.models.granite_vision.transformers_library import LLM as granite_vision
model = granite_vision(
    model_path,
    adapter='lang_table_only_4'
)
Out[4]:
Model loaded
Adapter 'lang_table_only_4' loaded
Adapter 'lang_table_only_4' merged
Using cuda: NVIDIA GeForce RTX 4070 Ti SUPER
And the same prediction prompt:
prompt = "Convert table to HTML (<table>...</table>)"
html = model.predict(sample['image'], max_new_tokens=1024, query=prompt)
display(HTML(html))
Out[5]:
The fine-tuned model now produces an output that closely matches the ground truth, accurately capturing the table structure and sub-headers, something the base model struggled with.
Final HTML Similarity metrics:
Style similarity: 1.0000
Structural similarity: 0.9231
Lev-Edit Distance: 1.0000
Final HTML Similarity Score: 0.9615
This example shows a clear quantitative improvement as well: from a score of 0.36 to 0.96 on a complex table structure, confirming that fine-tuning on this specialized task dramatically boosts the model's capability.
Inference Speed
One major advantage of using a smaller model, apart from the ability to fine-tune it on consumer-grade hardware, is inference speed. Even when larger models offer competitive performance, latency and throughput remain key factors, especially in production settings.
Let’s compare the inference speed of different models:
Inference Speed | Image by author from matplotlib
As shown in the plot, Pix2Struct is by far the fastest model. For some use cases, such as batch-processing thousands of documents for table extraction, this speed advantage could translate into significant time savings and lower compute costs.
However, the fine-tuned Granite-Vision 2B strikes a good balance when the volume of documents to process is not massive, offering superior accuracy on this specialized task and reasonably fast inference without the need for extremely large compute infrastructure.
Conclusions
This project demonstrated that with LoRA-based fine-tuning and a targeted task (table extraction → HTML), a small vision-language model (Granite-Vision 2B) can outperform much larger models, even Meta's 90B LLaMA Vision, while requiring only a consumer GPU and less than a day of training.
A few key takeaways:
Small, specialized models matter: you don't always need 70B+ models to solve specific problems with high accuracy.
Parameter-efficient fine-tuning (LoRA) is a game-changer: adapting large foundation models becomes accessible to most practitioners.
Prompt design and training targets have a huge influence: small changes (like switching to lang_table_only or refining the prompt) directly impacted performance.
Having a custom metric (HTML Similarity) was critical for tracking meaningful progress beyond generic text-based metrics.
Smaller models not only train faster, but also infer faster, which is ideal for high-volume production pipelines.
Finally, and perhaps most importantly, this kind of experimentation shows that you can move fast and iterate even with limited hardware. Fine-tuning powerful open models and adapting them to real-world tasks is no longer reserved for big labs.
I hope this encourages other AI engineers to experiment with small VLMs and fine-tuning techniques for their own projects and solutions, and to see that powerful results are possible even without massive compute budgets!
What’s Next?
There are definitely some interesting follow-up ideas that could be explored next:
Prompt engineering refinements: Final tests (while writing this blog) showed that separating prompts into a system prompt (defining behavior) and a user prompt (providing task instructions) significantly improved the base model's performance. Applying this strategy during fine-tuning could further enhance the model's ability to consistently generate accurate HTML, and it will be tested in upcoming experiments (see the sketch after this list).
Training vision layers: Currently, only the language layers are fine-tuned, as training the vision layers through a text-only loss proved ineffective. A more advanced approach could involve adding an auxiliary vision loss, for instance contrastive learning between vision outputs and HTML structure, to better adapt the vision backbone for table extraction tasks.
Improved generalization: The current model is fine-tuned on a single dataset. Expanding training to include more diverse document layouts, table styles, and noisy OCR scenarios could improve robustness and transferability to real-world data.
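As a sketch of what the prompt separation could look like with the model's processor and its chat template (the wording of both messages is illustrative, not the exact prompt used in my experiments):

# Hypothetical split of the prompt into a system message (behavior) and a user message (task)
messages = [
    {"role": "system", "content": [
        {"type": "text", "text": "You are an assistant that converts images of tables into HTML."},
    ]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert the table in the image to HTML (<table>...</table>)."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)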
Links
If you liked this post, feel free to reach out or share your own experiments!