
Beyond English: Implementing a multilingual RAG solution


Splitting text, the easy way (Image generated by the author with DALL·E 3)

When preparing data for embedding and retrieval in a RAG system, splitting the text into appropriately sized chunks is crucial. This process is guided by two fundamental considerations: model constraints and retrieval effectiveness.

Model Constraints

Embedding models have a maximum token length for input; anything beyond this limit gets truncated. Pay attention to your chosen model's limitations and make sure each data chunk doesn't exceed its maximum token length.
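To guard against silent truncation, you can check every chunk against the model's limit before embedding. The sketch below uses a naive regex tokenizer purely as a stand-in; in practice you would plug in your embedding model's own tokenizer. `validate_chunks` and `naive_token_length` are illustrative helpers, not library APIs:

```python
import re

def validate_chunks(chunks, token_length_fn, max_tokens):
    """Return the chunks that would be truncated by the embedding model."""
    return [c for c in chunks if token_length_fn(c) > max_tokens]

# Stand-in tokenizer: counts words and punctuation marks. Replace with
# the tokenizer of your actual embedding model for real token counts.
def naive_token_length(text):
    return len(re.findall(r"\w+|[^\w\s]", text))

chunks = ["A short chunk.", "word " * 200]
too_long = validate_chunks(chunks, naive_token_length, max_tokens=128)
print(len(too_long))  # → 1 (only the 200-word chunk exceeds the limit)
```

Running this check over your corpus before indexing catches oversized chunks early, rather than discovering degraded retrieval later.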

Multilingual models in particular often have shorter sequence limits than their English counterparts. For example, the widely used paraphrase-multilingual-MiniLM-L12-v2 model has a maximum context window of just 128 tokens.

Also, consider the text length the model was trained on: some models technically accept longer inputs but were trained on shorter chunks, which can hurt performance on longer texts. One such example is the Multi QA base model from SBERT, as seen below.

Retrieval Effectiveness

While chunking data to the model's maximum length seems logical, it won't always produce the best retrieval results. Larger chunks offer more context for the LLM but can obscure key details, making precise matches harder to retrieve. Conversely, smaller chunks improve match accuracy but may lack the context needed for complete answers. Hybrid approaches strike a balance: they use smaller chunks for search but include surrounding context at query time.
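To make the hybrid idea concrete, the following sketch searches over small chunks but returns the best match together with its neighbouring chunks. The word-overlap scorer is a toy stand-in for real vector similarity, and `retrieve_with_context` is a hypothetical helper of my own, not a library API:

```python
def retrieve_with_context(query, chunks, window=1):
    """Score each small chunk by word overlap with the query, then return
    the best match expanded with its neighbouring chunks for extra context."""
    q_words = set(query.lower().split())
    scores = [len(q_words & set(c.lower().split())) for c in chunks]
    best = max(range(len(chunks)), key=lambda i: scores[i])
    lo, hi = max(0, best - window), min(len(chunks), best + window + 1)
    return " ".join(chunks[lo:hi])

chunks = [
    "Embedding models turn text into vectors.",
    "A reranker reorders retrieved candidates.",
    "Chunk size affects retrieval quality.",
]
result = retrieve_with_context("which reranker should I use", chunks)
print(result)
```

The match is found on a small, precise chunk, but the LLM receives the neighbouring chunks as well, recovering context that the small chunk alone would lack.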

While there is no definitive answer on chunk size, the considerations remain the same whether you're working on a multilingual or an English-only project. I would recommend reading further on the subject in resources such as Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex or Building RAG-based LLM Applications for Production.

Text splitting: methods for splitting text

Text can be split using various methods, which mainly fall into two categories: rule-based (focusing on character analysis) and machine-learning-based models. ML approaches, from the simple NLTK and spaCy tokenizers to advanced transformer models, often depend on language-specific training, primarily in English. Although simple models like NLTK and spaCy support multiple languages, they mainly handle sentence splitting, not semantic sectioning.

Since ML-based sentence splitters currently work poorly for most non-English languages, and are compute intensive, I recommend starting with a simple rule-based splitter. If you've preserved the relevant syntactic structure from the original data, and formatted the data correctly, the result will be of good quality.
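As a minimal illustration of the rule-based approach, a sentence splitter can be as simple as a regex over sentence-final punctuation, which works across the many languages that share this convention. The `rule_based_sentences` helper is my own sketch, not part of any library:

```python
import re

def rule_based_sentences(text):
    """Split on sentence-final punctuation followed by whitespace.
    Language-agnostic as long as the text uses ., !, ? (or their
    full-width CJK equivalents) to mark sentence boundaries."""
    return [s for s in re.split(r"(?<=[.!?。！？])\s+", text.strip()) if s]

sentences = rule_based_sentences("Das ist ein Satz. Und noch einer! Stimmt das?")
print(sentences)  # → ['Das ist ein Satz.', 'Und noch einer!', 'Stimmt das?']
```

A real splitter would also need to handle abbreviations and decimal points, but for well-formatted source data this kind of rule often gets you surprisingly far.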

A common and effective method is a recursive character text splitter, like those used in LangChain or LlamaIndex, which shortens sections by finding the closest split character in a prioritized sequence (e.g., `\n\n`, `\n`, `.`, `?`, `!`).

Taking the formatted text from the previous section, an example using LangChain's recursive character splitter would look like:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Use the tokenizer of the embedding model you intend to use,
# so chunk lengths are measured in that model's tokens.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

def token_length_function(text_input):
    return len(tokenizer.encode(text_input, add_special_tokens=False))

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=128,
    chunk_overlap=0,
    length_function=token_length_function,
    separators=["\n\n", "\n", ". ", "? ", "! "]
)

split_texts = text_splitter.split_text(
    formatted_document['Boosting RAG: Picking the Best Embedding & Reranker models']
)

Here it's important to note that the tokenizer should be defined as the one belonging to the embedding model you intend to use, since different models 'count' words differently. The function will now, in prioritized order, split any text longer than 128 tokens first by the `\n\n` we introduced at the end of sections and, if that is not possible, by end of paragraphs delimited by `\n`, and so forth. The first 3 chunks will be:

Token of text: 111 

UPDATE: The pooling method for the Jina AI embeddings has been adjusted to use mean pooling, and the results have been updated accordingly. Notably, the JinaAI-v2-base-en with bge-reranker-large now exhibits a Hit Rate of 0.938202 and an MRR (Mean Reciprocal Rank) of 0.868539 and with CohereRerank exhibits a Hit Rate of 0.932584, and an MRR of 0.873689.

-----------

Token of text: 112

When building a Retrieval Augmented Generation (RAG) pipeline, one key component is the Retriever. We have a variety of embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.
But with all these options, how do we determine the best mix for top-notch retrieval performance? How do we know which embedding model fits our data best? Or which reranker boosts our results the most?

-----------

Token of text: 54

In this blog post, we'll use the Retrieval Evaluation module from LlamaIndex to swiftly determine the best combination of embedding and reranker models. Let's dive in!
Let's first start with understanding the metrics available in Retrieval Evaluation

Now that we have successfully split the text in a semantically meaningful way, we can move on to the final part: embedding these chunks for storage.
