
OpenAI vs Open-Source Multilingual Embedding Models


Selecting the model that works best for your data

We’ll use the EU AI Act as the data corpus for our embedding model comparison. Image by Dall-E 3.

OpenAI recently released their new generation of embedding models, called embedding v3, which they describe as their most performant embedding models, with better multilingual performance. The models come in two classes: a smaller one called text-embedding-3-small, and a larger and more powerful one called text-embedding-3-large.

Little information was disclosed regarding the way these models were designed and trained. As with their previous embedding model release (December 2022, with the ada-002 model class), OpenAI again chose a closed-source approach where the models may only be accessed through a paid API.

But are the performances good enough to make them worth paying for?

The motivation for this post is to empirically compare the performances of these new models with their open-source counterparts. We’ll rely on an information retrieval workflow, where the most relevant documents in a corpus need to be found given a user query.

Our corpus will be the European AI Act, which is currently in its final stages of validation. An interesting characteristic of this corpus, besides being the first-ever legal framework on AI worldwide, is its availability in 24 languages. This makes it possible to compare the accuracy of information retrieval across different families of languages.

The post goes through the following two main steps:

  • Generate a custom synthetic query/answer dataset from a multilingual text corpus
  • Compare the accuracy of OpenAI and state-of-the-art open-source embedding models on this custom dataset.

The code and data to reproduce the results presented in this post are made available in this Github repository. Note that the EU AI Act is used only as an example, and the methodology followed in this post can be adapted to other data corpora.

Let us first start by generating a dataset of questions and answers (Q/A) on custom data, which will be used to assess the performance of the different embedding models. The benefits of generating a custom Q/A dataset are twofold. First, it avoids biases by ensuring that the dataset has not been part of the training of an embedding model, which can happen with reference benchmarks such as MTEB. Second, it allows the assessment to be tailored to a specific corpus of data, which can be relevant in the case of retrieval-augmented generation (RAG) applications, for instance.

We’ll follow the simple process suggested by Llama Index in their documentation. The corpus is first split into a set of chunks. Then, for each chunk, a set of synthetic questions is generated by means of a large language model (LLM), such that the answer lies within the corresponding chunk. The process is illustrated below:

Generating a question/answer dataset for your data, methodology from Llama Index

Implementing this strategy is straightforward with a data framework for LLMs such as Llama Index. Loading the corpus and splitting the text can be conveniently carried out using high-level functions, as illustrated with the following code.

from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.node_parser import SentenceSplitter

language = "EN"
url_doc = "https://eur-lex.europa.eu/legal-content/"+language+"/TXT/HTML/?uri=CELEX:52021PC0206"

documents = SimpleWebPageReader(html_to_text=True).load_data([url_doc])

parser = SentenceSplitter(chunk_size=1000)
nodes = parser.get_nodes_from_documents(documents, show_progress=True)

In this example, the corpus is the EU AI Act in English, taken directly from the web using this official URL. We use the draft version from April 2021, as the final version is not yet available for all European languages. In this URL, the English language code can be replaced by any of the 23 other EU official languages to retrieve the text in a different language (BG for Bulgarian, ES for Spanish, CS for Czech, and so on).

Download links to the EU AI Act for the 24 official EU languages (from EU official website)
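Since the language code appears directly in the URL, the other language versions can be fetched with the same reader by simply substituting the code. Below is a minimal sketch (the language selection and the dictionary holding the documents are illustrative, not part of the original code):

# Fetch the corpus in several languages by substituting the language code in the URL
# (the selection of languages here is illustrative; any of the 24 official codes works)
languages = ["EN", "FR", "CS", "HU"]

documents_by_language = {}
for language in languages:
    url_doc = ("https://eur-lex.europa.eu/legal-content/" + language +
               "/TXT/HTML/?uri=CELEX:52021PC0206")
    documents_by_language[language] = SimpleWebPageReader(html_to_text=True).load_data([url_doc])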

We use the SentenceSplitter object to split the document into chunks of 1000 tokens. For English, this results in about 100 chunks.

Each chunk is then provided as context to the following prompt (the default prompt suggested in the Llama Index library):

prompts={}
prompts["EN"] = """
Context information is below.

---------------------
{context_str}
---------------------

Given the context information and not prior knowledge, generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup {num_questions_per_chunk} questions for an upcoming quiz/examination.
The questions should be diverse in nature across the document. Restrict the questions to the context information provided.
"""

The prompt aims at generating questions about the document chunk, as if a teacher were preparing an upcoming quiz. The number of questions to generate for each chunk is passed as the parameter ‘num_questions_per_chunk’, which we set to two. Questions can then be generated by calling generate_qa_embedding_pairs from the Llama Index library:

from llama_index.llms import OpenAI
from llama_index.legacy.finetuning import generate_qa_embedding_pairs

qa_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo-0125", additional_kwargs={'seed': 42}),
    nodes=nodes,
    qa_generate_prompt_tmpl=prompts[language],
    num_questions_per_chunk=2,
)

We rely for this task on the GPT-3.5-turbo-0125 model from OpenAI, which is according to OpenAI the flagship of this model family, supporting a 16K context window and optimized for dialog (https://platform.openai.com/docs/models/gpt-3-5-turbo).

The resulting object ‘qa_dataset’ contains the question/answer (chunk) pairs. As an example of the generated questions, here is the result for the first two questions (for which the ‘answer’ is the first chunk of text):

1) What are the main objectives of the proposal for a Regulation laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) according to the explanatory memorandum?
2) How does the proposal for a Regulation on artificial intelligence aim to address the risks associated with the use of AI while promoting the uptake of AI in the European Union, as outlined in the context information?

The number of chunks and questions depends on the language, ranging from around 100 chunks and 200 questions for English to 200 chunks and 400 questions for Hungarian.
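Since the evaluation code further below reloads each dataset from a JSON file, the generated dataset can be persisted with the save_json method of the returned EmbeddingQAFinetuneDataset object. A minimal sketch, assuming one file per language (the file naming simply mirrors the one used later):

# Persist the generated Q/A dataset so it can be reloaded later with
# EmbeddingQAFinetuneDataset.from_json (one JSON file per language; naming is an assumption)
qa_dataset.save_json(language + "_dataset.json")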

Our evaluation function follows the Llama Index documentation and consists of two main steps. First, the embeddings for all answers (document chunks) are stored in a VectorStoreIndex for efficient retrieval. Then, the evaluation function loops over all queries, retrieves the top-k most similar documents, and assesses the accuracy of the retrieval in terms of Mean Reciprocal Rank (MRR). For example, if the expected chunk is retrieved in second position, the reciprocal rank for that query is 1/2; if it is not among the top-k retrieved documents, it counts as 0.

import numpy as np
from tqdm import tqdm
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
# (import paths follow llama-index v0.10 and may vary slightly across versions)


def evaluate(dataset, embed_model, insert_batch_size=1000, top_k=5):
    # Get corpus, queries, and relevant documents from the qa_dataset object
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Create TextNode objects for each document in the corpus and create a
    # VectorStoreIndex to efficiently store and retrieve embeddings
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, insert_batch_size=insert_batch_size
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    # Prepare to collect evaluation results
    eval_results = []

    # Iterate over each query in the dataset to evaluate retrieval performance
    for query_id, query in tqdm(queries.items()):
        # Retrieve the top_k most similar documents for the current query
        # and extract the IDs of the retrieved documents
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]

        # Check if the expected document was among the retrieved documents
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc per query

        # Calculate the Mean Reciprocal Rank (MRR) and append to results
        if is_hit:
            rank = retrieved_ids.index(expected_id) + 1
            mrr = 1 / rank
        else:
            mrr = 0
        eval_results.append(mrr)

    # Return the average MRR across all queries as the final evaluation metric
    return np.average(eval_results)

The embedding model is passed to the evaluation function via the `embed_model` argument, which for OpenAI models is an OpenAIEmbedding object initialised with the model name and the model dimension.

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model=model_spec['model_name'],
                              dimensions=model_spec['dimensions'])

The dimensions API parameter can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties. OpenAI for instance suggests in their announcement that, on the MTEB benchmark, an embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding with a size of 1536.
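For illustration, shortening an embedding manually amounts to truncating the vector and re-normalizing it to unit length, which is essentially what the dimensions parameter does on the API side. A minimal sketch (the helper below is illustrative and not part of the article’s code):

import numpy as np

def shorten_embedding(embedding, dim=256):
    # Keep only the first `dim` values and re-normalize to unit length,
    # mirroring what the `dimensions` API parameter provides server-side
    shortened = np.asarray(embedding[:dim], dtype=np.float32)
    return shortened / np.linalg.norm(shortened)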

We ran the evaluation function on 4 different OpenAI embedding models:

  • two versions of text-embedding-3-large: one with the lowest possible dimension (256), and the other with the highest possible dimension (3072). These are called ‘OAI-large-256’ and ‘OAI-large-3072’.
  • OAI-small: The text-embedding-3-small embedding model, with a dimension of 1536.
  • OAI-ada-002: The legacy text-embedding-ada-002 model, with a dimension of 1536.

Each model was evaluated on four different languages: English (EN), French (FR), Czech (CS) and Hungarian (HU), covering examples of Germanic, Romance, Slavic and Uralic language families, respectively.

embeddings_model_spec = {
}

embeddings_model_spec['OAI-Large-256']={'model_name':'text-embedding-3-large','dimensions':256}
embeddings_model_spec['OAI-Large-3072']={'model_name':'text-embedding-3-large','dimensions':3072}
embeddings_model_spec['OAI-Small']={'model_name':'text-embedding-3-small','dimensions':1536}
embeddings_model_spec['OAI-ada-002']={'model_name':'text-embedding-ada-002','dimensions':None}

import pandas as pd
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
# (import path for EmbeddingQAFinetuneDataset may vary across llama-index versions)

results = []

languages = ["EN", "FR", "CS", "HU"]

# Loop through all languages
for language in languages:

    # Load dataset
    file_name = language + "_dataset.json"
    qa_dataset = EmbeddingQAFinetuneDataset.from_json(file_name)

    # Loop through all models
    for model_name, model_spec in embeddings_model_spec.items():

        # Get model
        embed_model = OpenAIEmbedding(model=model_spec['model_name'],
                                      dimensions=model_spec['dimensions'])

        # Assess embedding score (in terms of MRR)
        score = evaluate(qa_dataset, embed_model)

        results.append([language, model_name, score])

df_results = pd.DataFrame(results, columns=["Language", "Embedding model", "MRR"])

The resulting accuracy in terms of MRR is reported below:

Summary of performances for the OpenAI models

As expected for the large model, better performances are observed with the larger embedding size of 3072. Compared with the small and legacy Ada models, the performance gain of the large model is however smaller than one might have expected. For comparison, we also report below the performances obtained by the OpenAI models on the MTEB benchmark.

Performances of OpenAI embedding models, as reported in their official announcement

It is interesting to note that the differences in performance between the large, small and Ada models are much less pronounced in our assessment than in the MTEB benchmark, reflecting the fact that the average performances observed on large benchmarks do not necessarily reflect those obtained on custom datasets.

The open-source research around embeddings is quite active, and new models are regularly published. A good place to keep up to date on the latest published models is the Hugging Face 😊 MTEB leaderboard.

For the comparison in this article, we selected a set of four embedding models recently published (2024). The criteria for selection were their average score on the MTEB leaderboard and their ability to deal with multilingual data. A summary of the main characteristics of the selected models is reported below.

Selected open-source embedding models
  • E5-Mistral-7B-instruct (E5-mistral-7b): This E5 embedding model by Microsoft is initialized from Mistral-7B-v0.1 and fine-tuned on a mixture of multilingual datasets. The model performs best on the MTEB leaderboard, but is also by far the biggest one (14GB).
  • multilingual-e5-large-instruct (ML-E5-large): Another E5 model from Microsoft, meant to better handle multilingual data. It is initialized from xlm-roberta-large and trained on a mixture of multilingual datasets. It is much smaller (10 times) than E5-Mistral, but also has a much lower context size (514).
  • BGE-M3: The model was designed by the Beijing Academy of Artificial Intelligence, and is their state-of-the-art embedding model for multilingual data, supporting more than 100 working languages. It had not yet been benchmarked on the MTEB leaderboard as of 22/02/2024.
  • nomic-embed-text-v1 (Nomic-Embed): The model was designed by Nomic, and claims better performance than OpenAI Ada-002 and text-embedding-3-small while being only 0.55GB in size. Interestingly, the model is the first to be fully reproducible and auditable (open data and open-source training code).

The code for evaluating these open-source models is similar to the code used for the OpenAI models. The main change lies in the model specifications, where additional details such as maximum context length and pooling type need to be specified. We then evaluate each model for each of the four languages:

import torch

embeddings_model_spec = {
}

embeddings_model_spec['E5-mistral-7b'] = {
    'model_name': 'intfloat/e5-mistral-7b-instruct', 'max_length': 32768, 'pooling_type': 'last_token',
    'normalize': True, 'batch_size': 1, 'kwargs': {'load_in_4bit': True, 'bnb_4bit_compute_dtype': torch.float16}}
embeddings_model_spec['ML-E5-large'] = {
    'model_name': 'intfloat/multilingual-e5-large', 'max_length': 512, 'pooling_type': 'mean',
    'normalize': True, 'batch_size': 1, 'kwargs': {'device_map': 'cuda', 'torch_dtype': torch.float16}}
embeddings_model_spec['BGE-M3'] = {
    'model_name': 'BAAI/bge-m3', 'max_length': 8192, 'pooling_type': 'cls',
    'normalize': True, 'batch_size': 1, 'kwargs': {'device_map': 'cuda', 'torch_dtype': torch.float16}}
embeddings_model_spec['Nomic-Embed'] = {
    'model_name': 'nomic-ai/nomic-embed-text-v1', 'max_length': 8192, 'pooling_type': 'mean',
    'normalize': True, 'batch_size': 1, 'kwargs': {'device_map': 'cuda', 'trust_remote_code': True}}

import time

from transformers import AutoModel, AutoTokenizer

results = []

languages = ["EN", "FR", "CS", "HU"]

# Loop through all models
for model_name, model_spec in embeddings_model_spec.items():

    print("Processing model : " + str(model_spec))

    # Get model
    tokenizer = AutoTokenizer.from_pretrained(model_spec['model_name'])
    embed_model = AutoModel.from_pretrained(model_spec['model_name'], **model_spec['kwargs'])

    if model_name == "Nomic-Embed":
        embed_model.to('cuda')

    # Loop through all languages
    for language in languages:

        # Load dataset
        file_name = language + "_dataset.json"
        qa_dataset = EmbeddingQAFinetuneDataset.from_json(file_name)

        start_time_assessment = time.time()

        # Assess embedding score (in terms of MRR)
        score = evaluate(qa_dataset, tokenizer, embed_model, model_spec['normalize'],
                         model_spec['max_length'], model_spec['pooling_type'])

        # Get duration of score assessment
        duration_assessment = time.time() - start_time_assessment

        results.append([language, model_name, score, duration_assessment])

df_results = pd.DataFrame(results, columns=["Language", "Embedding model", "MRR", "Duration"])
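The evaluate function used for these open-source models (available in the repository) differs from the one used for the OpenAI models, since the embeddings must be computed directly from the Hugging Face model with the specified pooling type. As a rough sketch of how such an embedding step could look (this is illustrative, not the repository’s exact implementation, and it glosses over model-specific details such as query prefixes and padding side for last-token pooling):

import torch
import torch.nn.functional as F

def embed_texts(texts, tokenizer, embed_model, pooling_type, max_length, normalize=True):
    # Tokenize and run the model to obtain token-level hidden states
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=max_length, return_tensors="pt").to(embed_model.device)
    with torch.no_grad():
        hidden = embed_model(**inputs).last_hidden_state  # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)

    if pooling_type == "cls":
        emb = hidden[:, 0]  # embedding of the first token
    elif pooling_type == "mean":
        emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mask-aware average
    elif pooling_type == "last_token":
        last = inputs["attention_mask"].sum(dim=1) - 1  # index of the last non-padding token
        emb = hidden[torch.arange(hidden.size(0), device=hidden.device), last]
    else:
        raise ValueError(f"Unknown pooling type: {pooling_type}")

    return F.normalize(emb, p=2, dim=1) if normalize else emb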

The resulting accuracies in terms of MRR are reported below.

Summary of performances for the open-source models

BGE-M3 provides the best performances, followed on average by ML-E5-large, E5-mistral-7b and Nomic-Embed. The BGE-M3 model has not yet been benchmarked on the MTEB leaderboard, and our results indicate that it could rank higher than other models. It is also interesting to note that while BGE-M3 is optimized for multilingual data, it also performs better for English than the other models.

We additionally report the processing times for each embedding model below.

Processing times in seconds for going through the English Q/A dataset

The E5-mistral-7b model, which is more than 10 times larger than the other models, is unsurprisingly by far the slowest.

Let us now put the performance of the eight tested models side by side in a single figure.

Summary of performances for the eight tested models

The key observations from these results are:

  • Best performances were obtained by open-source models. The BGE-M3 model, developed by the Beijing Academy of Artificial Intelligence, emerged as the top performer. The model has the same context length as the OpenAI models (8K), for a size of 2.2GB.
  • Consistency Across OpenAI’s Range. The performances of the large (3072), small and legacy OpenAI models were very similar. Reducing the embedding size of the large model to 256 however led to a degradation of performance.
  • Language Sensitivity. Almost all models (except ML-E5-large) performed best on English. Significant variations in performance were observed in languages like Czech and Hungarian.

Should you therefore go for a paid OpenAI subscription, or host an open-source embedding model?

OpenAI’s recent price revision has made access to their API significantly more affordable, with the cost now standing at $0.13 per million tokens. Dealing with one million queries per month (and assuming that each query involves around 1K tokens) would therefore cost on the order of $130. Depending on your use case, it may therefore not be cost-effective to rent and maintain your own embedding server.
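As a quick back-of-the-envelope check of that figure (using only the numbers quoted above):

# Rough monthly cost estimate with the figures quoted above (illustrative only)
price_per_million_tokens = 0.13    # USD per million tokens
queries_per_month = 1_000_000
tokens_per_query = 1_000

monthly_cost = queries_per_month * tokens_per_query / 1_000_000 * price_per_million_tokens
print(f"Estimated monthly embedding cost: ${monthly_cost:.0f}")  # -> $130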

Cost-effectiveness is however not the only consideration. Other factors such as latency, privacy, and control over data processing workflows may also need to be considered. Open-source models offer the advantage of complete data control, enhancing privacy and customization. On the other hand, latency issues have been observed with OpenAI’s API, sometimes resulting in extended response times.

In conclusion, the choice between open-source models and proprietary solutions like OpenAI’s does not lend itself to a straightforward answer. Open-source embeddings present a compelling option, combining performance with greater control over data. Conversely, OpenAI’s offerings may still appeal to those prioritizing convenience, especially if privacy concerns are secondary.

