Text and code embeddings by contrastive pre-training


Text embeddings are useful features in many applications, such as semantic search and computing text similarity. Previous work typically trains models customized for different use cases, varying in dataset choice, training objective and model architecture. In this work, we show that contrastive pre-training on unsupervised data at scale leads to high-quality vector representations of text and code. The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities and sometimes even perform competitively with fine-tuned models. On linear-probe classification accuracy averaged over 7 tasks, our best unsupervised model achieves a relative improvement of 4% and 1.8% over the previous best unsupervised and supervised text embedding models, respectively. The same text embeddings, when evaluated on large-scale semantic search, attain a relative improvement of 23.4%, 14.7%, and 10.6% over the previous best unsupervised methods on the MSMARCO, Natural Questions and TriviaQA benchmarks, respectively. As with text embeddings, we train code embedding models on (text, code) pairs, obtaining a 20.8% relative improvement over the prior best work on code search.
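To make the idea of contrastive training on paired data concrete, here is a minimal NumPy sketch of one common formulation: a symmetric cross-entropy loss over cosine similarities, where the other items in a batch serve as negatives. The abstract does not specify the loss, so this formulation, the function names and the temperature value of 0.07 are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over each row."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def in_batch_contrastive_loss(x_emb: np.ndarray, y_emb: np.ndarray,
                              temperature: float = 0.07) -> float:
    """Symmetric contrastive loss with in-batch negatives (illustrative sketch).

    Row i of x_emb and row i of y_emb are assumed to embed a positive pair,
    e.g. a piece of text and its matching code. Every other row in the batch
    is treated as a negative. The temperature is a hypothetical default.
    """
    x = l2_normalize(x_emb)
    y = l2_normalize(y_emb)
    logits = x @ y.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(x))                 # positives sit on the diagonal
    loss_xy = -log_softmax(logits)[idx, idx].mean()    # match each x to its y
    loss_yx = -log_softmax(logits.T)[idx, idx].mean()  # match each y to its x
    return float(0.5 * (loss_xy + loss_yx))


# Toy check: correctly paired embeddings should score a much lower loss
# than the same embeddings paired in the wrong order.
pairs = np.eye(4)
aligned_loss = in_batch_contrastive_loss(pairs, pairs)
mismatched_loss = in_batch_contrastive_loss(pairs, pairs[::-1])
```

In this toy check the loss is close to zero when each row is paired with itself and large when the pairing is reversed, which is the signal a contrastive objective uses to pull matching pairs together and push the rest apart.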



