Everything You Should Know About Evaluating Large Language Models

Open Language Models

From perplexity to measuring general intelligence

Image generated by the author using Stable Diffusion.

As open-source language models become more available, it's easy to get lost in all the options.

How can we evaluate their performance and compare them? And how can we confidently say that one model is better than another?

This article provides some answers by presenting training and evaluation metrics, along with general and task-specific benchmarks, to give you a clear picture of your model's performance.

If you missed it, take a look at the first article in the Open Language Models series:

Language models define a probability distribution over a vocabulary of words to select the most likely next word in a sequence. Given a text, a language model assigns a probability to each word in the vocabulary, and the most likely one is chosen.
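To make this concrete, here is a minimal sketch of that selection step, assuming PyTorch and the Hugging Face transformers library; GPT-2 is used only as a stand-in for any open causal language model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as an example; any causal language model would work the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the whole vocabulary for the next word
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# The most likely next token is the one with the highest probability
top_id = torch.argmax(next_token_probs).item()
print(tokenizer.decode(top_id), next_token_probs[top_id].item())
```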

Perplexity measures how well a language model can predict the next word in a given sequence. As a training metric, it shows how well the model has learned its training set.

We won't go into the mathematical details, but intuitively, minimizing perplexity means maximizing the probability the model assigns to the observed text.

In other words, the best model is the one that isn't surprised when it sees new text, because it was expecting it: it already predicted well which words come next in the sequence.
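Below is a minimal sketch of that idea, again assuming PyTorch and transformers with GPT-2 as a placeholder model: perplexity is simply the exponential of the average negative log-likelihood the model assigns to a text.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal language model from the Hub works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Language models assign probabilities to sequences of words."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss,
    # i.e. the average negative log-likelihood of each next token.
    loss = model(input_ids, labels=input_ids).loss

# Perplexity = exp(average negative log-likelihood):
# a lower value means the model was less "surprised" by the text.
perplexity = torch.exp(loss).item()
print(f"Perplexity: {perplexity:.2f}")
```

A lower value means the model assigned higher probability to the tokens it actually saw, which is exactly the "not surprised" behavior described above.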

While perplexity is useful, it doesn't consider the meaning behind the words or the context in which they're used. It's also influenced by how we tokenize our data: different language models with different vocabularies and tokenization techniques can produce very different perplexity scores, making direct comparisons less meaningful.

Perplexity is a useful but limited metric. We use it primarily to track progress during a model's training or to compare…
