Visual Document Retrieval Goes Multilingual

By Marco Cimolai and Logan Markewich

TL;DR: We present vdr-2b-multi-v1, one of the best multilingual embedding models for visual document retrieval. We also release its English-only twin vdr-2b-v1 and open-source the new vdr-multilingual-train dataset. With 500k high-quality samples, it is the largest open-source multilingual synthetic dataset for visual document retrieval.


Introducing vdr-2b-multi-v1 (🤗), a multilingual embedding model designed for visual document retrieval across multiple languages and domains. It encodes document page screenshots into dense single-vector representations, effectively allowing you to search and query visually rich multilingual documents without the need for OCR, data extraction pipelines, chunking…

The vdr-2b-multi-v1 model is based on MrLight/dse-qwen2-2b-mrl-v1 and is trained on an extensive self-built dataset of multilingual query-image pairs. It was built in collaboration with LlamaIndex and is the next iteration of mcdse-2b-v1. vdr-2b-multi-v1 extends and improves the training data and methods used for its predecessor, resulting in a far more powerful and better model.

  • Trained on 🇮🇹 Italian, 🇪🇸 Spanish, 🇬🇧 English, 🇫🇷 French and 🇩🇪 German: Together, these form a new, large, open-source multilingual training dataset of 500k high-quality samples.

  • Low VRAM and Faster Inference: On synthetic Visual Document Retrieval (ViDoRe) benchmarks, our English-only model with 768 image patches performs better than the base model with 2560 image patches. This results in 3x faster inference and much lower VRAM usage.

  • Cross-lingual Retrieval: Substantially better in real-world scenarios. For example, you can search German documents with Italian queries.

  • Matryoshka Representation Learning: You can reduce the vector size 3x and still keep 98% of the embedding quality. This enables notably faster retrieval while reducing storage costs.



Usage

🎲 Check out vdr-2b-multi-v1 now, available on this Hugging Face Space!

Generating embeddings with vdr-2b-multi-v1 is easier than ever with direct SentenceTransformers and LlamaIndex integrations. Get started with just a few lines of code:

via LlamaIndex
pip install -U llama-index-embeddings-huggingface
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cpu",  
    trust_remote_code=True,
)

image_embedding = model.get_image_embedding("image.png")
query_embedding = model.get_query_embedding("Chi ha inventato Bitcoin?")

via SentenceTransformers
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    model_name_or_path="llamaindex/vdr-2b-multi-v1",
    device="cuda",
    trust_remote_code=True,
    # Recommended model kwargs; adjust for your hardware
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "device_map": "cuda:0",
        "attn_implementation": "flash_attention_2",
    },
)

embeddings = model.encode("image.png")
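
Once you have both embeddings, retrieval reduces to a similarity comparison. Below is a minimal sketch using cosine similarity over the vectors produced by the LlamaIndex snippet above (plain NumPy, not a specific API of either library):

import numpy as np

# query_embedding and image_embedding come from the LlamaIndex snippet above
q = np.asarray(query_embedding)
d = np.asarray(image_embedding)

# Cosine similarity: a higher score means the page is more relevant to the query
score = float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))
print(f"query-page similarity: {score:.4f}")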



Training Dataset

Training good single-vector models for visual document retrieval requires high-quality data, but existing off-the-shelf multimodal datasets are scarce and not multilingual.

So we spent a lot of time building one from scratch. The raw dataset consists of 500k multilingual query-image samples, collected and generated from scratch using public web PDFs. The queries associated with each image are synthetic and generated using VLMs. For comparison, our dataset has 10x more samples than the previous largest open-source synthetic dataset for multimodal visual document retrieval, i.e. the scraped documents generated for the ColPali training dataset.




Data Gathering

For each language, we generate a long list of search queries covering many different topics, which are then used to search for PDFs. We use the search engine's language filtering capabilities to scrape only documents in the specified language. This “search by topic” technique ensures that the model sees many diverse topics and domains, and that it performs well in real-life scenarios.

The scraping process produced ~50k multilingual documents. Unlike the method used for the previous mcdse-2b-v1 model, pages were not extracted randomly. Instead, each page of every PDF was run through a document layout analysis model to determine whether it contained more textual or visual elements. The result is a score that classifies the page as text-only, visual-only or mixed. This labelling step was then used to sample ~100k pages, ensuring they were evenly distributed by page type.
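
As an illustration of this sampling step, here is a minimal sketch, assuming a hypothetical list of (page, label) pairs produced by the layout analysis model; the helper and its names are ours, not the actual pipeline:

import random
from collections import defaultdict

def sample_balanced(labelled_pages, per_type):
    """Sample evenly across the three page types assigned by the layout model.

    labelled_pages: list of (page, label) pairs, where label is one of
    "text-only", "visual-only" or "mixed".
    """
    by_type = defaultdict(list)
    for page, label in labelled_pages:
        by_type[label].append(page)
    sampled = []
    for group in by_type.values():
        sampled.extend(random.sample(group, min(per_type, len(group))))
    return sampled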



Synthetic Generation

The queries were then generated using gemini-1.5-pro and Qwen2-VL-72B. The models were tasked with coming up with one specific and one general question. Only the specific question is used to train the model, but forcing the LLM to distinguish between the two often resulted in stronger specific questions for information retrieval training.

After generation, a further cleaning step ensures that the questions are good enough for training (a simplified sketch follows the list). This includes:

  • Ensuring the language is correct
  • Fixing formatting problems
  • Removing markdown
  • Ensuring that only one question is posed
  • Removing grounding phrases (e.g. “according to Figure 1”, “this document”, …)
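
Concretely, such a pass might look like the sketch below; the regexes and the clean_query helper are illustrative assumptions, not the actual pipeline:

import re

# Illustrative grounding phrases; the real list is longer and language-specific
GROUNDING = re.compile(
    r"\b(according to (figure|table) \d+|in this (document|page|figure))\b",
    re.IGNORECASE,
)

def clean_query(query: str) -> str | None:
    """Return a cleaned query, or None if it should be discarded."""
    query = re.sub(r"[*_`#]+", "", query).strip()  # strip markdown markers
    query = GROUNDING.sub("", query).strip()       # remove grounding phrases
    if query.count("?") > 1:                       # more than one question posed
        return None
    return query or None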



Filtering and Hard-Negative Mining

This cleaning step ensures that the queries are syntactically correct and follow some strict guidelines, but it still doesn't guarantee that they are good enough for information retrieval.

To filter out bad questions, we embedded and indexed each broad question with the voyage-3 embedding model. For each specific question, we search the index; the question is marked as ‘good’ if its associated broad question appears in the top 100 results. This method removes low-entropy, duplicate or overly similar questions. On average, 40% of queries were removed from each language dataset.
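
In code, the filtering rule is straightforward. Below is a minimal sketch, assuming L2-normalized voyage-3 embeddings already computed into NumPy arrays (the function and variable names are ours):

import numpy as np

def is_good_query(specific_emb, broad_index, paired_broad_idx, k=100):
    """Keep a specific question only if its paired broad question ranks
    in the top-k results when searching the broad-question index.

    broad_index: (N, d) matrix of L2-normalized broad-question embeddings.
    """
    scores = broad_index @ specific_emb    # cosine similarity via dot product
    top_k = np.argsort(-scores)[:k]
    return paired_broad_idx in top_k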

Hard negatives were then mined using voyage-3, only on specific questions, with a fixed threshold of 0.75. We also experimented with positive-aware negative mining as described in nvidia/NV-Retriever-v1, but on this dataset it seemed to produce negatives that were too easy/distant.
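
The exact mining procedure isn't published beyond the threshold, but one plausible reading is sketched below, reusing the normalized-embedding setup from the previous snippet:

import numpy as np

def mine_hard_negatives(query_emb, page_index, positive_idx, threshold=0.75, n=8):
    """Pick the highest-scoring pages below the similarity threshold.

    Pages scoring above the threshold are skipped, since they are likely
    near-duplicates of the positive and would act as false negatives.
    """
    scores = page_index @ query_emb
    ranked = np.argsort(-scores)
    negatives = [int(i) for i in ranked
                 if i != positive_idx and scores[i] < threshold]
    return negatives[:n]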



Download

The vdr-multilingual-train (🤗) training dataset is now open source and directly available on Hugging Face. It consists of 496,167 PDF pages, of which only 280,679 are associated with the filtered queries (using the method described above). The images left without a query are still used as hard negatives.

Language   # filtered queries   # unfiltered queries
English    53,512               94,225
Spanish    58,738               102,685
Italian    54,942               98,747
German     58,217               100,713
French     55,270               99,797
TOTAL      280,679              496,167

The dataset is made of 5 different subsets, one for each language, which you can explore directly on Hugging Face.

Alternatively, you can download each language individually by specifying the language subset in load_dataset:

from datasets import load_dataset

italian_dataset = load_dataset("llamaindex/vdr-multilingual-train", "it", split="train")
english_dataset = load_dataset("llamaindex/vdr-multilingual-train", "en", split="train")
french_dataset = load_dataset("llamaindex/vdr-multilingual-train", "fr", split="train")
german_dataset = load_dataset("llamaindex/vdr-multilingual-train", "de", split="train")
spanish_dataset = load_dataset("llamaindex/vdr-multilingual-train", "es", split="train")



Evaluations


The model has been evaluated on the ViDoRe benchmark and on custom-built evaluation sets that test its multilingual capabilities on text-only, visual-only and mixed page screenshots. The evaluation dataset is also publicly available on Hugging Face (vdr-multilingual-test 🤗).

We made sure that no page in these datasets was also present in the training set, to avoid any evaluation contamination. The datasets were collected and generated using the same methods as the training dataset, but with a smaller sample size. The filtering step was done entirely manually: each query was evaluated, curated and improved (if necessary) to ensure high data quality.

All evaluations are performed by calculating NDCG@5 scores using 1536-dimension vectors and an image resolution that can be represented with a maximum of 768 tokens.
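
As a reminder, NDCG@5 discounts the graded relevance of each of the top 5 results by its rank and normalizes by the ideal ranking. A minimal sketch:

import math

def dcg_at_5(relevances):
    """Discounted cumulative gain over the top 5 ranks."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(relevances[:5]))

def ndcg_at_5(retrieved_rels, all_rels):
    """retrieved_rels: relevance of the top-5 retrieved pages, in rank order.
    all_rels: relevance of every candidate page, used for the ideal ranking."""
    ideal = dcg_at_5(sorted(all_rels, reverse=True))
    return dcg_at_5(retrieved_rels) / ideal if ideal else 0.0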

Model                 Avg    French (text)   French (visual)   French (mix)
dse-qwen2-2b-mrl-v1   93.5   94.7            90.8              95.1
vdr-2b-multi-v1       95.6   95.6            93.3              97.9   (+2.2% avg)

Model                 Avg    German (text)   German (visual)   German (mix)
dse-qwen2-2b-mrl-v1   93.0   93.4            90.0              95.5
vdr-2b-multi-v1       96.2   94.8            95.7              98.1   (+3.4% avg)

Model                 Avg    Italian (text)   Italian (visual)   Italian (mix)
dse-qwen2-2b-mrl-v1   95.1   95.1             94.0               96.2
vdr-2b-multi-v1       97.0   96.4             96.3               98.4   (+2% avg)

Model                 Avg    Spanish (text)   Spanish (visual)   Spanish (mix)
dse-qwen2-2b-mrl-v1   96.7   97.2             94.7               98.2
vdr-2b-multi-v1       98.1   98.3             96.9               99.1   (+1.4% avg)

Model                 Avg    English (text)   English (visual)   English (mix)
dse-qwen2-2b-mrl-v1   98.0   98.3             98.5               97.1
vdr-2b-multi-v1       98.1   97.9             99.1               97.3   (+0.1% avg)

The multilingual model outperforms the base model in every language and every page type, by +2.3% on average. On the ViDoRe benchmark, it also performs slightly better (+0.5%).
Our fine-tuned vdr-2b-multi-v1 makes big leaps in performance, especially on non-English visual-only or mixed pages. See, for example, the +6.33% NDCG@5 improvement for German visual-only retrieval over the base model.

We also trained a version on the English subset only (vdr-2b-v1 🤗). On the full ViDoRe benchmark (evaluated with 768 image tokens), both the multilingual and English-only versions outperform the base model.

Model                 Avg    shiftproject   government   healthcare   energy   ai     docvqa   arxivqa   tatdqa   infovqa   tabfquad
dse-qwen2-2b-mrl-v1   83.6   79.8           95.7         96.9         92.0     98.2   56.3     85.2      53.9     87.5      90.3
vdr-2b-multi-v1       84.0   82.4           95.5         96.5         91.2     98.5   58.5     84.7      53.6     87.1      92.2
vdr-2b-v1             84.3   83.4           96.9         97.2         92.6     96.8   57.4     85.1      54.1     87.9      91.3



Faster Inference

image/png

The English-only vdr-2b-v1 model also matches the performance of the base model on the ViDoRe benchmark synthetic datasets, while using only 30% of the image tokens (768 vs. 2560). This effectively results in 3x faster inference and much lower VRAM usage.

Model                                     Avg    shiftproject   government   healthcare   energy   ai
dse-qwen2-2b-mrl-v1 (2560 image tokens)   93.0   82             96           96.4         92.9     97.5
vdr-2b-v1 (768 image tokens)              93.4   83.4           96.9         97.2         92.6     96.8



Cross-Lingual Retrieval

Although the model was trained on each language individually, it also improves at cross-lingual retrieval. To test this ability, the queries of the German evaluation set were translated into Italian using DeepL. The document page screenshots remain in the original German.

Model                 Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)
dse-qwen2-2b-mrl-v1   93.1   92.6                       93.5                         93.3
vdr-2b-multi-v1       95.3   95.0                       95.8                         95.1   (+2.3% avg)

The model is significantly better across all document types, with an average improvement of +2.3%. These retrieval capabilities are essential for real-world use cases, especially in linguistically fragmented regions such as Europe. For example, they enable language-independent search over complex multilingual sources such as binding EU decisions, instruction manuals, financial asset KIDs, pharmaceutical package leaflets and many more…



MRL and Binary Embeddings

This model is trained using Matryoshka Representation Learning (MRL). The loss function used during training is calibrated to track performance across all of these dimensions, leading the model to front-load the most important identifying information. This effectively allows you to shrink the embedding dimensions according to your scale and budget.
To learn more about MRL, this blog post by Hugging Face explains it thoroughly.
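
In practice, shrinking an MRL embedding is just a matter of slicing the vector and re-normalizing. A minimal sketch in plain NumPy, reusing the query_embedding and image_embedding produced in the usage section above:

import numpy as np

def truncate_mrl(embedding, dim=512):
    """Keep the first `dim` MRL dimensions and re-normalize for cosine search."""
    v = np.asarray(embedding)[:dim]
    return v / np.linalg.norm(v)

q512 = truncate_mrl(query_embedding, dim=512)
d512 = truncate_mrl(image_embedding, dim=512)
score = float(q512 @ d512)  # cosine similarity on 3x smaller vectors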

To test the model's retrieval capabilities at different vector dimensions, evaluations were performed on the Italian -> German cross-lingual benchmark.



NDCG@5 (float)

Model                 Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)

1536 dimensions
dse-qwen2-2b-mrl-v1   93.1   92.6                       93.5                         93.3
vdr-2b-multi-v1       95.3   95.0                       95.9                         95.1   (+2.3% avg)

1024 dimensions
dse-qwen2-2b-mrl-v1   92.2   90.9                       92.3                         93.5
vdr-2b-multi-v1       94.6   93.1                       95.7                         95.1   (+2.5% avg)

512 dimensions
dse-qwen2-2b-mrl-v1   89.8   87.9                       89.4                         92.2
vdr-2b-multi-v1       93.0   91.1                       93.4                         94.5   (+3.4% avg)



NDCG@5 (binary)

Model                 Avg    Italian -> German (text)   Italian -> German (visual)   Italian -> German (mix)

1536 dimensions
dse-qwen2-2b-mrl-v1   89.8   88.2                       90.3                         90.8
vdr-2b-multi-v1       92.3   89.6                       94.1                         93.3   (+2.8% avg)

1024 dimensions
dse-qwen2-2b-mrl-v1   86.7   84.9                       88.2                         86.9
vdr-2b-multi-v1       90.8   87.0                       92.6                         92.8   (+4.6% avg)

512 dimensions
dse-qwen2-2b-mrl-v1   79.2   80.6                       81.7                         75.4
vdr-2b-multi-v1       82.6   77.7                       86.7                         83.3   (+4.0% avg)

1024-dimension float vectors offer a very good balance between quality and size: they are ~30% smaller but still retain 99% of the retrieval performance. The same holds for 1536-dimension binary vectors, which take 10x fewer bytes per vector but still retain 97% of their retrieval quality. It is also interesting to see that 1536-dimension binary vectors almost match the performance of the base model's 1536-dimension float vectors.
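
Binary quantization is similarly easy to apply after the fact: threshold each dimension at zero, pack the bits, and search with Hamming distance. A minimal NumPy sketch, where doc_embeddings and query_embedding are assumed to be float vectors from the model as in the usage section:

import numpy as np

def binarize(embeddings):
    """Binary-quantize float embeddings: one bit per dimension, packed into uint8."""
    return np.packbits(np.asarray(embeddings) > 0, axis=-1)

binary_docs = binarize(doc_embeddings)      # shape (N, 192) for 1536 dimensions
binary_query = binarize(query_embedding)    # shape (192,)

# Hamming distance: number of differing bits (lower means more similar)
hamming = np.unpackbits(binary_docs ^ binary_query, axis=-1).sum(axis=-1)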



Conclusions and Next Steps

We believe that vdr-2b-multi-v1 and vdr-2b-v1 will prove useful to many users.

Our multilingual model is the first of its kind. It significantly improves performance in multilingual and cross-lingual scenarios, and thanks to MRL and binary quantization, retrieval is more efficient and faster than ever. We believe this will unlock new use cases and opportunities, especially in linguistically fragmented regions such as Europe.

Its English-only twin represents a substantial improvement over the base model, now able to embed documents 3x faster, with much less VRAM and with the same (or better) retrieval quality.

All of this is possible thanks to the new vdr-multilingual-train dataset. With 500k high-quality samples, it is the largest multilingual open-source synthetic dataset for visual document retrieval.

Future work will explore how our models perform when adapted to new, specific domains. This is still in the early stages of development and more work needs to be done before results are published, but early tests already seem to suggest impressive retrieval gains with very minimal data and computational resources.

Stay tuned for future updates!


