While developing Docmatix, we noticed that fine-tuning Florence-2 on it yielded a model that answered DocVQA questions well, but scored low on the benchmark. To boost the scores, we had to fine-tune the model further on DocVQA so it could learn the answer syntax the benchmark expects. Interestingly, this additionally fine-tuned model seemed to perform worse according to human evaluators, which is why we used it mainly for ablation studies and released the model trained only on Docmatix for broader use.
Although the generated answers semantically align with the reference answers, as illustrated in Figure 1, they still receive low scores. This raises the question: should we fine-tune the models to improve these metrics, or should we develop new metrics that better align with human judgment?
Figure 1: t-SNE visualization of zero-shot generated and reference answers from the Docmatix dataset
Introduction
Our community has recently focused on out-of-distribution (OOD) evaluation, using methods like zero-shot transfer to unseen VQA tasks or fine-tuning on one VQA dataset and evaluating on another. This shift is increasingly relevant with the rise of synthetic datasets such as Docmatix, SciGraphQA, and SimVQA used to fine-tune Vision-Language Models (VLMs).
Traditionally, VQA Accuracy has been the predominant metric for evaluating model performance. It relies on exact string matching between a model’s predicted answer and a set of reference answers annotated by humans. This metric worked well because VQA evaluation followed an independent and identically distributed (IID) paradigm, where training and testing data distributions were similar, allowing models to adapt effectively (see details here).
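To make the rigidity concrete, here is a minimal sketch of the commonly used VQA Accuracy formulation (the official metric also normalizes answers and averages over annotator subsets, which we omit here):

```python
def vqa_accuracy(predicted: str, reference_answers: list[str]) -> float:
    """An answer counts as fully correct if at least 3 human annotators gave it."""
    pred = predicted.strip().lower()  # light normalization; the official script does more
    matches = sum(ref.strip().lower() == pred for ref in reference_answers)
    return min(matches / 3.0, 1.0)

# A semantically fine but differently phrased answer gets zero credit:
vqa_accuracy("the weather is sunny", ["sunny", "clear", "bright", "sunny", "sunny"])  # -> 0.0
```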
In OOD settings, generated answers may not match reference answers despite being correct, due to differences in format, specificity, or interpretation. This is well illustrated in Figure 1, where we compare the zero-shot generated answers against the reference answers from the synthetic dataset. It is especially true for instruction-generated datasets and their human-curated counterparts. Some methods have attempted to align answer formats with the references, but this only addresses the symptom, not the root cause of flawed evaluation metrics. While human evaluation is reliable, it is costly and does not scale, highlighting the need for metrics that align better with human judgment.
Method
Docmatix is the largest synthetic DocVQA dataset, generated from the curated document dataset PDFA. It is 100x larger than previously available datasets. Its human-curated counterpart is DocVQA, which serves as an evaluation benchmark for VQA models for document understanding. In this post, we use a subset of Docmatix consisting of around 200 test samples, which can be downloaded here: Docmatix-zero-shot-exp.
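If you want to load the subset yourself, the `datasets` library is enough. The repository, config, and split names below follow the link above but are assumptions on our side; adjust them if the Hub layout differs:

```python
from datasets import load_dataset

# Assumed Hub location of the ~200-sample zero-shot subset linked above;
# change the repository ID, config, or split name if they differ.
docmatix_subset = load_dataset("HuggingFaceM4/Docmatix", "zero-shot-exp", split="test")

print(len(docmatix_subset))       # roughly 200 samples
print(docmatix_subset[0].keys())  # the image and its question/answer pairs
```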

Figure 2: Examples of Q&A pairs from the Docmatix and DocVQA test sets. Note: the corresponding images are not shown here.
Although the content of the question and answer pairs in Docmatix and DocVQA is similar, their styles differ significantly. Traditional metrics like CIDEr, ANLS, and BLEU can be overly restrictive for zero-shot evaluation in this context. Motivated by the similarity of the embeddings observed in the t-SNE plot (Figure 1), we decided to use a different evaluation metric. In this post, we adopt the LAVE (LLM-Assisted VQA Evaluation) metric to better assess generalization on this unseen but semantically similar dataset.
Figure 3: t-SNE visualization of Query, Answer and Image features from Docmatix and DocVQA datasets
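For readers who want to reproduce this kind of plot, the sketch below embeds two sets of answer strings and projects them with t-SNE. The encoder choice and labels are ours, not necessarily what was used for the figures above:

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

def plot_answer_tsne(docmatix_answers, docvqa_answers, out_path="tsne_answers.png"):
    """Embed two sets of answer strings and project them to 2D with t-SNE."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any general-purpose sentence encoder works
    embeddings = encoder.encode(docmatix_answers + docvqa_answers)

    # perplexity must stay below the number of samples (~200 per dataset here)
    points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

    n = len(docmatix_answers)
    plt.scatter(points[:n, 0], points[:n, 1], s=8, label="Docmatix")
    plt.scatter(points[n:, 0], points[n:, 1], s=8, label="DocVQA")
    plt.legend()
    plt.savefig(out_path)
```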
For our evaluation, we selected mPLUG-DocOwl 1.5 as a baseline model. This model achieves an 84% ANLS score on the test subset of the original DocVQA dataset. We then ran zero-shot generation on a subset of Docmatix consisting of 200 images, and used Llama-2-Chat-7b to rate the answers.
About LAVE
We followed the procedure outlined in the LAVE paper. The VQA evaluation is framed as an answer-rating task suitable for in-context learning with LLMs. We used a rating scale from 1 to 3 to account for ambiguous questions or incomplete answers. The prompt included a task description, several demonstrations of input/output, and the input for a test example.
We structured our task description and included the instruction “Give the rationale before rating” to elicit a justification for the assigned rating. Each demonstration comprised a question, a set of reference answers, the candidate answer, the answer rating, and an explanation for the rating. We also included the instruction “Provide only one rating” to avoid sentence-by-sentence evaluation, which sometimes produced several ratings.
```python
task_description = """You are given a question, a set of gold-standard reference answers written by
experts, and a candidate answer. Please rate the accuracy of the candidate answer for the question
considering the reference answers. Use a scale of 1-3, with 1 indicating an incorrect or irrelevant
answer, 2 indicating an ambiguous or incomplete answer, and 3 indicating a correct answer.
Give the rationale before rating. Provide only one rating.
THIS IS VERY IMPORTANT:
A binary question should only be answered with 'yes' or 'no',
otherwise the candidate answer is incorrect."""

demonstrations = [
    {
        "question": "What's the weather like?",
        "reference_answer": ["sunny", "clear", "bright", "sunny", "sunny"],
        "generated_answer": "cloudy",
    }
]
```
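These pieces can then be assembled into a single prompt and sent to the judge model. The template below is our own reconstruction (the exact formatting in the LAVE paper may differ), and in practice each demonstration would also carry its rating and rationale:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # gated checkpoint; requires accepting the license
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def build_prompt(question, reference_answers, candidate_answer):
    """Concatenate the task description, the demonstrations, and the test example."""
    parts = [task_description, ""]
    for demo in demonstrations:  # reuses the variables defined above
        parts += [
            f"Question: {demo['question']}",
            f"Reference answers: {', '.join(demo['reference_answer'])}",
            f"Candidate answer: {demo['generated_answer']}",
            "",
        ]
    parts += [
        f"Question: {question}",
        f"Reference answers: {', '.join(reference_answers)}",
        f"Candidate answer: {candidate_answer}",
        "Rationale and rating:",
    ]
    return "\n".join(parts)

def rate_answer(question, reference_answers, candidate_answer):
    prompt = build_prompt(question, reference_answers, candidate_answer)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Keep only the newly generated tokens (the rationale followed by the rating).
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```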
Scoring Function
Given the LLM’s generated text for the test example, we extract the rating from the last character (either 1, 2, or 3) and map it to a score in the range [0, 1]:

\[ s = \frac{r - 1}{2} \]
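In code, the extraction and mapping are straightforward (a minimal sketch that assumes the LLM output ends with the rating digit):

```python
def lave_score(llm_output: str) -> float:
    """Map the 1-3 rating at the end of the LLM output to a score in [0, 1]."""
    rating = int(llm_output.strip()[-1])  # assumes the last character is 1, 2 or 3
    return (rating - 1) / 2
```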
Table of Results
The results of our evaluation are summarized in the table below:
| Metric | CIDEr | BLEU | ANLS | LAVE |
|---|---|---|---|---|
| Score | 0.1411 | 0.0032 | 0.002 | 0.58 |
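For comparison, the ANLS figure in the table follows the standard Average Normalized Levenshtein Similarity definition; a minimal implementation (our own, not the official evaluation script) looks like this:

```python
def _levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def anls(prediction: str, references: list[str], threshold: float = 0.5) -> float:
    """Per-sample ANLS: best normalized similarity over the references, zeroed below the threshold."""
    scores = []
    for ref in references:
        pred, gold = prediction.strip().lower(), ref.strip().lower()
        nl = _levenshtein(pred, gold) / max(len(pred), len(gold), 1)
        scores.append(1 - nl if nl < threshold else 0.0)
    return max(scores)

# The dataset-level ANLS reported above is the mean of these per-sample scores.
```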
Qualitative Examples
Figure 4: Llama rating and rationale for the generated and reference answers from the Docmatix test subset.
Figure 5: Llama rating and rationale for the generated and reference answers from the Docmatix test subset.
Are we too strict in evaluating VQA systems, and do we need fine-tuning?
We obtain roughly a 50% accuracy gain when using LLMs to evaluate responses, indicating that answers can be correct even when they do not adhere to a strict format. This suggests that our current evaluation metrics may be too rigid. It is important to note that this is not a comprehensive research paper, and more ablation studies are needed to fully understand the effectiveness of different metrics for evaluating zero-shot performance on synthetic datasets. We hope this work serves as a starting point to broaden the current research focus on improving the evaluation of zero-shot vision-language models within the context of synthetic datasets, and to explore more efficient approaches beyond prompt learning.
References
@inproceedings{cascante2022simvqa,
  title={SimVQA: Exploring simulated environments for visual question answering},
  author={Cascante-Bonilla, Paola and Wu, Hui and Wang, Letao and Feris, Rogerio S and Ordonez, Vicente},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={5056--5066},
  year={2022}
}
@article{hu2024mplug,
  title={mPLUG-DocOwl 1.5: Unified structure learning for OCR-free document understanding},
  author={Hu, Anwen and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Zhang, Liang and Zhang, Bo and Li, Chen and Zhang, Ji and Jin, Qin and Huang, Fei and others},
  journal={arXiv preprint arXiv:2403.12895},
  year={2024}
}
@article{agrawal2022reassessing,
  title={Reassessing evaluation practices in visual question answering: A case study on out-of-distribution generalization},
  author={Agrawal, Aishwarya and Kaji{\'c}, Ivana and Bugliarello, Emanuele and Davoodi, Elnaz and Gergely, Anita and Blunsom, Phil and Nematzadeh, Aida},
  journal={arXiv preprint arXiv:2205.12191},
  year={2022}
}
@inproceedings{li2023blip,
  title={BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models},
  author={Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
  booktitle={International Conference on Machine Learning},
  pages={19730--19742},
  year={2023},
  organization={PMLR}
}
@inproceedings{manas2024improving,
  title={Improving automatic VQA evaluation using large language models},
  author={Ma{\~n}as, Oscar and Krojer, Benno and Agrawal, Aishwarya},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={5},
  pages={4171--4179},
  year={2024}
}
@article{li2023scigraphqa,
  title={SciGraphQA: A large-scale synthetic multi-turn question-answering dataset for scientific graphs},
  author={Li, Shengzhi and Tajbakhsh, Nima},
  journal={arXiv preprint arXiv:2308.03349},
  year={2023}
}
