
Training Improved Text Embeddings with Large Language Models


Text embeddings are vector representations of words, sentences, paragraphs or documents that capture their semantic meaning. They serve as a core building block in many natural language processing (NLP) applications today, including information retrieval, question answering, semantic search and more.
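A minimal sketch (not from the paper) of what this looks like in practice, using the sentence-transformers library with a small off-the-shelf model; the model name is just an example and any embedding model would work:

```python
# Embed a few sentences and compare them with cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap in any embedder

sentences = [
    "How do I reset my password?",
    "Steps for recovering account access",
    "Best hiking trails near Seattle",
]
embeddings = model.encode(sentences)  # shape: (3, embedding_dim)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings[0], embeddings[1]))  # high: semantically related
print(cosine(embeddings[0], embeddings[2]))  # low: unrelated topics
```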


Recent advances in large language models (LLMs) like GPT-3 have shown impressive capabilities in few-shot learning and natural language generation. Can we leverage LLMs to also advance the state of text embeddings? In their paper “Improving Text Embeddings with Large Language Models”, researchers from Microsoft propose a novel method that achieves superior results by generating synthetic training data with LLMs and fine-tuning on it.

Challenges with Existing Methods

Traditional text embedding techniques like weighted averages of word vectors or TF-IDF fail to adequately capture the rich contextual information in text. Newer methods based on pre-trained language models like BERT produce much better context-aware embeddings.
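A toy illustration (assumed, not from the paper) of why averaged static word vectors fall short: they discard word order and context, so sentences with opposite meanings can collapse to the same embedding.

```python
# Averaging static word vectors ignores word order, so these two very
# different sentences get exactly the same embedding.
import numpy as np

rng = np.random.default_rng(0)
static_vectors = {w: rng.normal(size=50) for w in ["the", "dog", "bit", "man"]}

def mean_pool(sentence):
    return np.mean([static_vectors[w] for w in sentence.split()], axis=0)

a = mean_pool("the dog bit the man")
b = mean_pool("the man bit the dog")
print(np.allclose(a, b))  # True: identical vectors despite opposite meanings
```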

However, they require complex multi-stage training pipelines:

  • Pre-train on billions of weakly labeled or artificial text pairs
  • Fine-tune on limited hand-curated datasets

This demands massive compute resources and human effort for data collection. The training data is also constrained in diversity and language coverage. For example, the BEIR benchmark comprises datasets for only 15 retrieval tasks in English.

Existing methods also predominantly use smaller BERT-style architectures as the backbone model, so they are unable to take advantage of more advanced LLMs and related techniques.

Methodology: Synthetic Data Generation with LLMs

To overcome these limitations, the researchers propose a novel single-stage training approach that leverages LLMs like GPT-3 and GPT-4 to generate diverse synthetic training data.

The key steps are:

  1. Task Taxonomy: Define a taxonomy that categorizes text embedding tasks into:
    • Asymmetric tasks (query and document are not paraphrases, e.g. search)
    • Symmetric tasks (query and document are paraphrases, e.g. semantic similarity)
  2. Prompt Design: Create prompt templates tailored to each task type that guide the LLM to generate relevant training examples.
  3. Synthetic Data Generation: Prompt the LLM with the designed prompts to generate hundreds of thousands of (query, document) pairs covering a wide range of semantic tasks across 93 languages.
  4. Model Training: Fine-tune a powerful open-source LLM such as Mistral on the synthetic data using a contrastive loss (a minimal sketch of this objective follows the list).
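For context, here is a minimal sketch of the kind of contrastive objective referenced in step 4, using in-batch negatives (InfoNCE). The paper's exact loss, temperature and hard-negative handling may differ.

```python
# InfoNCE-style contrastive loss with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.02):
    """query_emb, doc_emb: (batch, dim) embeddings of paired queries/documents."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)   # i-th query matches i-th document
    return F.cross_entropy(logits, labels)              # other in-batch docs act as negatives

# Example with random tensors standing in for model outputs
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```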

This approach allows creating ample training data for diverse tasks in multiple languages without any human labeling effort. By leveraging the knowledge already embedded in LLMs through pre-training on web-scale corpora, we can synthesize high-quality data precisely tailored for text embeddings.

The researchers demonstrate this with a two-step prompting strategy (sketched in code after the list below):

  • Prompt GPT-4 to suggest potential retrieval tasks

    Figure: Prompt for generating high-level retrieval tasks

  • Prompt it again to generate (query, document) samples based on the suggested tasks

    Figure: Prompt for generating (query, positive, hard negative) triplets
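A rough sketch of how this two-step prompting might look with the OpenAI Python client; the prompt wording is a paraphrase for illustration, not the paper's exact templates.

```python
# Two-step prompting: (1) brainstorm retrieval tasks, (2) generate training triplets.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Step 1: ask the LLM to brainstorm candidate retrieval tasks
task_resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Brainstorm a list of 20 potentially useful text retrieval tasks. "
                   "Return them as a JSON list of short task descriptions.",
    }],
)
tasks_json = task_resp.choices[0].message.content
chosen_task = tasks_json  # in practice: parse the JSON and loop over individual tasks

# Step 2: for a chosen task, ask for a (query, positive, hard negative) example
example_resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"You are given a retrieval task: {chosen_task}\n"
                   "Generate a JSON object with a 'user_query', a relevant "
                   "'positive_document', and a misleading 'hard_negative_document'.",
    }],
)
print(example_resp.choices[0].message.content)
```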

Some key elements of the prompt design (see the sketch after this list):

  • Natural language prompts for intuitive, human-like instructions
  • Placeholders to encourage diversity (e.g. query length, clarity, document length)
  • Combining data from multiple templates for the same task type
  • Weighting languages based on resource availability
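A small, hypothetical sketch of how placeholders and language weights might be sampled to diversify the generated prompts; the specific placeholder values and weights below are illustrative, not the paper's.

```python
# Randomly fill a prompt template with placeholder values, weighting languages
# by (assumed) resource availability.
import random

QUERY_LENGTH = ["less than 5 words", "5 to 15 words", "at least 10 words"]
CLARITY = ["clear", "understandable with some effort", "ambiguous"]
LANGUAGES = ["English", "Polish", "Japanese", "Italian", "Swahili"]
LANGUAGE_WEIGHTS = [10, 3, 3, 2, 1]  # higher weight for higher-resource languages

TEMPLATE = (
    "Generate a {length} search query and a matching document for the task "
    "'{task}'. The query should be {clarity} and written in {language}."
)

def build_prompt(task):
    return TEMPLATE.format(
        task=task,
        length=random.choice(QUERY_LENGTH),
        clarity=random.choice(CLARITY),
        language=random.choices(LANGUAGES, weights=LANGUAGE_WEIGHTS, k=1)[0],
    )

print(build_prompt("retrieve documentation for a programming error message"))
```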

In total, they were able to generate 500k text embedding examples at a compute cost of 180M tokens. The dominant language was English (43%), followed by Polish, Japanese, Italian and others.

For model training, they opted to fine-tune the open-source 7B-parameter Mistral model instead of smaller BERT-style architectures. Since Mistral was already pre-trained on massive text corpora, no additional contrastive pre-training was needed; adding it provided negligible improvements.

The full fine-tuning took fewer than 1k steps, using a mixture of synthetic and human-labeled data. This demonstrates the sample efficiency of the proposed approach.
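For illustration, here is a sketch of how a decoder-only LLM can serve as an embedder via an instruction prefix and last-token pooling. The model name and instruction format follow the publicly released E5-Mistral checkpoint, but treat the details as assumptions rather than a faithful reproduction of the paper's setup.

```python
# Embed texts with a decoder-only LLM: instruction prefix + last-token pooling.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "intfloat/e5-mistral-7b-instruct"  # released checkpoint (assumed name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # keep the last real token at index attention_mask.sum - 1
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)

def embed(texts, instruction="Given a web search query, retrieve relevant passages"):
    prompts = [f"Instruct: {instruction}\nQuery: {t}" for t in texts]
    batch = tokenizer(prompts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, seq_len, dim)
    # Last-token pooling (the released recipe pools an appended EOS token;
    # here we pool the final prompt token for simplicity).
    lengths = batch["attention_mask"].sum(dim=1) - 1
    emb = hidden[torch.arange(hidden.size(0)), lengths]
    return torch.nn.functional.normalize(emb, dim=-1)

print(embed(["how to install python on windows"]).shape)
```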

Results

The researchers evaluated their model on the MTEB benchmark, which covers diverse tasks across classification, clustering, semantic similarity, summarization and information retrieval.

Their model outperformed the previous state-of-the-art by 2.4 points in average score, setting new records for nearly every category:

Task Category            Previous SOTA    Proposed Model
Classification           76.0             78.5
Clustering               46.1             50.3
Pairwise Classification  87.1             88.3
Reranking                60.0             60.2
Retrieval                54.3             56.9
STS                      83.1             84.6
Summarization            31.6             31.4
Average                  64.2             66.6

Remarkably, even without using any labeled data and training solely on synthetic data, the model achieved competitive accuracy, only 3.5 points behind the fully supervised variant. This demonstrates the viability of producing strong text embeddings using LLMs alone, without human annotation effort.

The researchers also evaluated on the multilingual MIRACL benchmark covering 18 languages. Their model outperformed the previous best on high-resource languages but was weaker on low-resource ones. They hypothesize this could be mitigated by pre-training LLMs more extensively on low-resource languages.
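For readers who want to run this kind of benchmarking themselves, here is a minimal sketch assuming the open-source mteb package and any encoder exposing a sentence-transformers-style encode() method; the task names are arbitrary examples, not the paper's full evaluation suite.

```python
# Score an embedding model on a couple of MTEB tasks (assumed mteb interface).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in your own embedder here
evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```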

In summary, text embeddings trained on LLM-generated synthetic data establish new state-of-the-art results, while using simpler and more efficient training compared to prior multi-stage approaches. With further research into prompt engineering and synthetic data quality, this approach could greatly advance multilingual text embeddings.

Evaluation

This work offers several valuable takeaways:

  • LLMs like GPT-3 and GPT-4 have an impressive ability to generate high-quality synthetic training data for diverse NLP tasks when prompted appropriately. This can reduce reliance on human-labeled data.
  • For text embeddings, contrastive pre-training provides negligible gains over simply fine-tuning models like Mistral that already have trillion-scale pre-training. That is an important insight into training efficiency.
  • Retrieval-augmented generation methods are enabling LLMs to dynamically access external knowledge, so improving text embeddings is valuable for enhancing these LLMs.
  • There is significant room for improvement in low-resource languages. Multilingual LLMs pre-trained on more representative data could help close this gap.
  • Conceptually, language modeling and text embeddings are two sides of the same coin: understanding language semantics. With synthetic data prompting, LLMs can be organically fine-tuned into embedders without complex pipelines.

Some promising directions for future work include:

  • Leveraging open-source LLMs like GPT-NeoX to generate synthetic data
  • Exploring lightweight post-training to adapt embedders to longer contexts
  • Developing prompt engineering techniques to control quality and task coverage
  • Methods to improve inference latency and storage costs for industrial usage

Beyond beating benchmarks, employing large language models to enhance text embeddings opens up intriguing possibilities for the future. As LLMs continue to advance in their mastery of natural language, their aptitude for generating high-fidelity synthetic data is likely to improve as well.

However, critical research directions remain to translate this potential into real-world impact.

Customization and Control

A key advantage of synthetic data is the ability to programmatically generate examples tailored to specific needs. As the paper demonstrated, prompt engineering allows creating training data for hundreds of thousands of embedding tasks.

Yet current prompt design practices remain more art than science. Developing systematic, reproducible methods to precisely control the properties of generated data would expand the applicability of this technique.

For instance, techniques to modulate aspects like the complexity, ambiguity and novelty of examples could help address robustness issues in downstream tasks. Dynamic prompt generation to match evolving real-world distributions is another open challenge.

Training at Scale

While pre-trained LLMs already encode substantial linguistic knowledge, their data generation skills are likely to improve further with additional scale. Models like GPT-4 trained on trillions of tokens of web text exhibit strong few-shot learning, but have not been optimized specifically for synthesizing training data.

Architectures and objectives tailored to bootstrapping self-supervised data generation at web scale could substantially advance the quality and efficiency of this technique. Efficient integration of retrieved knowledge to complement learned knowledge is another promising direction.

Multitask and Multilingual

As the paper noted, improving performance on low-resource languages remains a challenge. Rather than pre-training a single massive LLM, an alternative is training a fleet of smaller expert models that specialize in particular data modalities or language domains.

Such an ensemble approach could help improve coverage of rare tasks and languages by sharing representations learned across experts. Continual learning to expand language and task expertise over time is also an exciting prospect.

In conclusion, this paper introduces an innovative approach of synthesizing training data from LLMs to create performant text embeddings. The results demonstrate the effectiveness of this technique, outperforming previous benchmarks. As LLMs and synthetic data techniques progress, tapping into their knowledge to train embedders could become a highly promising direction.
