Train a Sentence Embedding Model with 1B Training Pairs

Antoine Simoulin


Sentence embedding is a technique that maps sentences to vectors of real numbers. Ideally, these vectors would capture the semantics of a sentence and be highly generic. Such representations could then be used for many downstream applications such as clustering, text mining, or question answering.

We developed state-of-the-art sentence embedding models as part of the project “Train the Best Sentence Embedding Model Ever with 1B Training Pairs”. This project took place during the Community week using JAX/Flax for NLP & CV, organized by Hugging Face. We benefited from an efficient hardware infrastructure to run the project: 7 TPU v3-8s, in addition to guidance from Google’s Flax, JAX, and Cloud team members about efficient deep learning frameworks!



Training methodology



Model

Unlike words, we cannot define a finite set of sentences. Sentence embedding methods therefore compose a sentence's words to compute the final representation. For instance, the SentenceBERT model (Reimers and Gurevych, 2019) uses a Transformer, the cornerstone of many NLP applications, followed by a pooling operation over the contextualized word vectors (c.f. figure below).

[Figure: a Transformer encoder followed by a pooling operation over the contextualized word vectors]
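As an illustration of the pooling step, here is a minimal NumPy sketch of mean pooling, one common choice for this operation. The shapes and names are illustrative, not the project's exact code:

```python
import numpy as np

# Mean pooling: average the contextualized token vectors, ignoring padding.
# token_embeddings: (seq_len, hidden_dim) output of the Transformer encoder
# attention_mask:   (seq_len,) with 1 for real tokens and 0 for padding
def mean_pooling(token_embeddings, attention_mask):
    mask = attention_mask[:, None]                   # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)   # sum of real token vectors
    counts = mask.sum()                              # number of real tokens
    return summed / counts                           # (hidden_dim,) sentence vector

# Toy example: 4 tokens (the last one is padding), hidden size 3
tokens = np.random.randn(4, 3)
mask = np.array([1.0, 1.0, 1.0, 0.0])
sentence_embedding = mean_pooling(tokens, mask)
```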



Multiple Negatives Ranking Loss

The parameters of the composition module are usually learned with a self-supervised objective. For the project, we used a contrastive training method, illustrated in the figure below. We build a dataset of sentence pairs $(a_i, p_i)$ such that the sentences of each pair have a close meaning.

[Figure: contrastive training setup over a batch of semantically related sentence pairs]

Formally, given a batch of training samples, the model optimizes the following loss function:

$$-\frac{1}{n}\sum_{i=1}^{n}\log\frac{\exp(\mathrm{sim}(a_i, p_i))}{\sum_{j}\exp(\mathrm{sim}(a_i, p_j))}$$

An illustrative example can be seen below. The model first embeds each sentence from every pair in the batch. Then, we compute a similarity matrix between every possible pair $(a_i, p_j)$; the loss compares this matrix to the ground truth, in which only the original pairs $(a_i, p_i)$ match.

Intuitively, the model should assign a high similarity to the sentences « How many people live in Berlin? » and « Around 3.5 million people live in Berlin », and a low similarity to negative answers such as « The capital of France is Paris », as detailed in the figure below.

[Figure: similarity matrix for a batch, with high values expected on the diagonal (correct pairs) and low values elsewhere]
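To make the objective concrete, here is a minimal NumPy sketch of this loss over a batch, using cosine similarity for $\mathrm{sim}$ (one of the two options discussed next) and the scaling factor $C$ introduced further below. Names and values are illustrative:

```python
import numpy as np

# Multiple Negatives Ranking Loss over a batch of n pairs.
# A, P: (n, d) anchor and positive embeddings; row i of the similarity
# matrix should peak at column i (the true pair).
def multiple_negatives_ranking_loss(A, P, scale=20.0):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)   # L2-normalize anchors
    P = P / np.linalg.norm(P, axis=1, keepdims=True)   # L2-normalize positives
    sim = scale * (A @ P.T)                            # (n, n) scaled cosine similarities
    # numerically stable log-softmax over each row
    sim_max = sim.max(axis=1, keepdims=True)
    log_softmax = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))              # average of the diagonal terms

# Toy batch: 3 pairs of 5-dimensional embeddings
rng = np.random.default_rng(0)
loss = multiple_negatives_ranking_loss(rng.normal(size=(3, 5)), rng.normal(size=(3, 5)))
```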

In the loss equation, $\mathrm{sim}$ indicates a similarity function between $(a, p)$. It can be either the cosine similarity or the dot product; both have their pros and cons, summarized below:

| Cosine similarity | Dot product |
| --- | --- |
| A vector has the highest similarity to itself, since $\cos(a, a) = 1$. | Other vectors can have a higher dot product: $\mathrm{dot}(a, a) < \mathrm{dot}(a, b)$ is possible. |
| With normalized vectors, it is equal to the dot product. The maximum vector length is 1. | It can be slower with certain approximate nearest-neighbour methods, since the maximum vector length is not known. |
| With normalized vectors, it is proportional to the Euclidean distance, so it works with k-means clustering. | It does not work with k-means clustering. |
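A quick numeric check of the table's claims, with illustrative values:

```python
import numpy as np

# With unnormalized vectors, another vector can have a larger dot product
# than the vector with itself; cosine similarity never exceeds 1.
a = np.array([1.0, 0.0])
b = np.array([3.0, 0.5])

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(a @ a, a @ b)          # 1.0 < 3.0: dot(a, a) < dot(a, b) is possible
print(cos(a, a), cos(a, b))  # 1.0 vs ~0.99: cos(a, a) = 1 is the maximum
```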

In practice, we used a scaled similarity, because score differences otherwise tend to be too small: we apply a scaling factor $C$ such that $\mathrm{sim}_{\text{scaled}}(a, b) = C \cdot \mathrm{sim}(a, b)$.
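For reference, this objective is available in the sentence-transformers library as MultipleNegativesRankingLoss, which uses in-batch negatives and scaled cosine similarity (scale=20.0 by default). A minimal training sketch, with an illustrative base checkpoint and example pairs rather than the project's exact setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Illustrative base checkpoint; sentence-transformers adds mean pooling on top.
model = SentenceTransformer("distilbert-base-uncased")
train_examples = [
    InputExample(texts=["How many people live in Berlin?",
                        "Around 3.5 million people live in Berlin"]),
    InputExample(texts=["What is the capital of France?",
                        "Paris is the capital of France"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
# In-batch negatives with scaled cosine similarity (scale=20.0 is the default)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```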



Improving Quality with Better Batches

In our method, we construct batches of sample pairs $(a_i, p_i)$, and all other in-batch combinations $(a_i, p_j)$, $i \neq j$, serve as negative pairs. The batch composition is therefore a key training aspect; we mainly focused on three aspects of the batch.



1. Size matters

In contrastive learning, a larger batch size is synonymous with better performance. As shown in the figure below, extracted from Qu et al. (2021), a larger batch size improves the results.

[Figure from Qu et al. (2021): performance increases with batch size and with the inclusion of hard negatives]



2. Hard Negatives

In the same figure, we observe that including hard negatives also improves performance. Hard negatives are samples $p_j$ that are hard to distinguish from the positive $p_i$ for a given anchor $a_i$: they are semantically close to the anchor without being a correct match, which forces the model to capture fine-grained distinctions.



3. Cross dataset batches

We concatenated multiple datasets to train our models. We built large batches and gathered the samples within a batch from the same dataset, to limit the topic distribution and favor hard negatives. However, we also mix at least two datasets within each batch, to learn a global structure between topics and not only a local structure within a topic. A sketch of this sampling idea follows.
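A hypothetical sketch of the cross-dataset batching idea, not the project's exact sampler: most of each batch comes from one dataset, so in-batch negatives stay on-topic and harder, while a slice from a second dataset preserves the global structure between topics. The names and ratios are illustrative:

```python
import random

# `datasets` is a list of datasets, each a list of sentence pairs.
# Each dataset is assumed to be larger than the batch size.
def mixed_batches(datasets, batch_size=256, mix_fraction=0.25):
    while True:
        main, other = random.sample(datasets, 2)         # two distinct datasets
        n_mix = int(batch_size * mix_fraction)           # share from the 2nd dataset
        batch = random.sample(main, batch_size - n_mix)  # same-topic pairs
        batch += random.sample(other, n_mix)             # cross-topic pairs
        random.shuffle(batch)
        yield batch
```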



Training infrastructure and data

As mentioned earlier, the amount of data and the batch size directly impact model performance. As part of the project, we benefited from an efficient hardware infrastructure: we trained our models on TPUs, compute units developed by Google that are highly efficient for matrix multiplications. TPUs have some hardware specificities that can require specific code implementations.
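As one example of such TPU-specific code, a minimal JAX sketch: jax.pmap replicates a function across all local devices (the 8 cores of a TPU v3-8) and runs it on sharded batches. The embed_fn here is an illustrative stand-in for the real encoder:

```python
import jax
import jax.numpy as jnp

# Illustrative stand-in for the encoder: a single dense projection.
def embed_fn(batch):
    return jnp.tanh(batch @ jnp.ones((128, 64)))

p_embed = jax.pmap(embed_fn)          # compiled once, run on every local device

n_devices = jax.local_device_count()  # 8 on a TPU v3-8
# pmap expects a leading device axis: (n_devices, per_device_batch, dim)
batch = jnp.zeros((n_devices, 32, 128))
embeddings = p_embed(batch)           # shape: (n_devices, 32, 64)
```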

Moreover, we trained our models on a large corpus, concatenating multiple datasets up to a total of 1 billion sentence pairs! All the datasets used are detailed for each model in its model card.



Conclusion

You can find all the models and datasets we created during the challenge in our HuggingFace repository. We trained 20 general-purpose Sentence Transformers models such as Mini-LM (Wang et al., 2020), RoBERTa (Liu et al., 2019), DistilBERT (Sanh et al., 2020), and MPNet (Song et al., 2020). Our models achieve SOTA on multiple general-purpose Sentence Similarity evaluation tasks. We also shared 8 datasets specialized for Question Answering, Sentence Similarity, and Gender Evaluation.

General sentence embeddings can be used for many applications. We built a Spaces demo to showcase several of them:

  • The sentence similarity module compares the similarity of the main text with other texts of your choice. In the background, the demo extracts the embedding for each text and computes the similarity between the source sentence and the others using cosine similarity (a minimal sketch of this computation follows the list).
  • Asymmetric QA compares the answer likeliness of a given query with answer candidates of your choice.
  • Search / Cluster returns nearby answers for a query. For example, if you input « python », it will retrieve the closest sentences using the dot-product distance.
  • Gender Bias Evaluation reports the inherent gender bias of a training set via random sampling of its sentences. Given an anchor text that does not mention the gender of the target occupation and a pair of propositions with gendered pronouns, we check whether models assign a higher similarity to one proposition and thereby evaluate their propensity to favor a specific gender.
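Here is a minimal sketch of the similarity computation behind the demo, using one of the general-purpose checkpoints released after the project; any of the released models can be substituted:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

source = "How many people live in Berlin?"
candidates = [
    "Around 3.5 million people live in Berlin",
    "The capital of France is Paris",
]
# Embed the source and the candidates, then compare with cosine similarity.
scores = util.cos_sim(model.encode(source, convert_to_tensor=True),
                      model.encode(candidates, convert_to_tensor=True))
for text, score in zip(candidates, scores[0]):
    print(f"{score:.3f}  {text}")
```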

The Community week using JAX/Flax for NLP & CV has been an intense and highly rewarding experience! The guidance and presence of Google’s Flax, JAX, and Cloud team members and of the Hugging Face team helped us all learn a lot. We hope all projects had as much fun as we did with ours. Whenever you have questions or suggestions, don’t hesitate to contact us!


