Introduction
In a YouTube video, Andrej Karpathy, former Senior Director of AI at Tesla, discusses the psychology of Large Language Models (LLMs) as emergent cognitive effects of the training pipeline. This article is inspired by his explanation of LLM hallucinations and the material presented in the video.
You have probably seen model hallucinations: instances where LLMs generate incorrect, misleading, or entirely fabricated information that appears plausible. These hallucinations occur because LLMs don’t “know” facts the way humans do; instead, they predict words based on patterns in their training data. Early models released a few years ago struggled significantly with hallucinations. Over time, mitigation strategies have improved the situation, though hallucinations have not been fully eliminated.
Zyler Vance is a completely fictitious name I came up with. When I input the prompt “Who is Zyler Vance?” into the falcon-7b-instruct model, it generates fabricated information: it claims Zyler Vance is a character in a (2018) movie, which he is not. This model, being an older one, is prone to hallucinations.
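For reference, here is a minimal sketch of how you could reproduce this behavior with the Hugging Face transformers library (assuming transformers and accelerate are installed and there is enough memory for the 7B model):

```python
# Minimal sketch: query falcon-7b-instruct and observe a fabricated answer.
# The exact output varies between runs because of sampling.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    device_map="auto",
)

output = generator("Who is Zyler Vance?", max_new_tokens=100, do_sample=True)
print(output[0]["generated_text"])
# An older instruct model typically invents a confident-sounding biography
# instead of admitting it has never seen this name.
```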
LLM Training Pipeline
To understand where these hallucinations originate, you have to be familiar with the training pipeline. Training an LLM typically involves three major stages.
- Pretraining
- Post-training: Supervised Fine-Tuning (SFT)
- Post-training: Reinforcement Learning with Human Feedback (RLHF)
Pretraining
This is the initial training stage for LLMs. During pretraining, the model is exposed to an enormous quantity of high-quality and diverse text crawled from the internet. Pretraining helps the model learn general language patterns, grammar, and facts. The output of this training phase is called the base model. It is a token simulator that predicts the next word in a sequence.
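To make the “token simulator” idea concrete, here is a small sketch that samples a continuation from GPT-2, a small publicly available base model (used here only because it runs anywhere, not because it is part of the pipeline discussed above):

```python
# A base model just continues text by repeatedly predicting the next token.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
# The continuation reads like plausible web text, because imitating web text
# is all the base model has learned to do. It is not yet an assistant.
```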
To get a sense of what a pretraining dataset might look like, you can look at the FineWeb dataset. FineWeb is fairly representative of what you might see in an enterprise-grade language model. All the major LLM providers like OpenAI, Google, or Meta will have some equivalent dataset internally, similar to FineWeb.
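If you want to peek at the kind of documents such a corpus contains, FineWeb is hosted on the Hugging Face Hub and can be streamed with the datasets library (streaming avoids downloading the multi-terabyte corpus):

```python
# Stream a few documents from the FineWeb pretraining corpus.
from datasets import load_dataset

fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, example in enumerate(fineweb):
    print(example["text"][:200])  # first 200 characters of a crawled web document
    if i == 2:
        break
```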
Post-Training: Supervised Fine-Tuning
As I mentioned before, the base model is a token simulator. It simply samples web text documents. We need to turn this base model into an assistant that can answer questions. Therefore, the pretrained model is further refined using a dataset of conversations. These conversation datasets contain hundreds of thousands of conversations that are multi-turn and very long, covering a diverse breadth of topics.

These conversations come from human labelers. Given a conversational context, human labelers write out the ideal response an assistant should give in any situation. Later, we take the base model that was trained on web documents and swap in the dataset of conversations, then continue training the model on this new dataset. This way, the model adjusts rapidly and learns the statistics of how an assistant responds to queries. At the end of training, the model is able to imitate human-like responses.
OpenAssistant/oasst1 is one of the open-source conversation datasets available on Hugging Face. It is a human-generated and human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages.
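A quick sketch of loading it with the datasets library; note that each row is a single message in a conversation tree, so full conversations are reassembled by following the parent links:

```python
# Load the OpenAssistant conversations and inspect one message.
from datasets import load_dataset

oasst1 = load_dataset("OpenAssistant/oasst1", split="train")

message = oasst1[0]
print(message["role"])        # "prompter" or "assistant"
print(message["lang"])        # one of the 35 languages
print(message["text"][:200])  # the message text written by a human labeler
```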
Post-training: Reinforcement Learning with Human Feedback
Supervised Fine-Tuning makes the model capable. However, even a well-trained model can generate misleading, biased, or unhelpful responses. Therefore, Reinforcement Learning with Human Feedback is required to align it with human expectations.
We start with the assistant model trained by SFT. For a given prompt, we generate multiple model outputs. Human labelers rank or score these outputs based on quality, safety, and alignment with human preferences. We use these data to train a completely separate neural network that we call a reward model.
The reward model imitates human scores. It is a simulator of human preferences. It is an entirely separate neural network, probably with a transformer architecture, but it is not a language model in the sense of generating diverse language. It is just a scoring model.
Now the LLM is fine-tuned using reinforcement learning, where the reward model provides feedback on the quality of the generated outputs. So instead of asking a real human, we ask a simulated human for their score of an output. The goal is to maximize the reward signal, which reflects human preferences.
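A common way to train such a reward model is with a pairwise ranking loss over a preferred and a rejected response to the same prompt. The sketch below is illustrative only: it uses a small generic sequence-classification model as the scorer and assumes you already have (prompt, chosen, rejected) triples from human labelers.

```python
# Illustrative reward-model training step with a pairwise (Bradley-Terry) loss.
# The scorer outputs a single scalar per (prompt, response) pair; the loss pushes
# the score of the human-preferred response above the rejected one.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1  # a single scalar reward
)

prompt = "Who is Zyler Vance?"
chosen = "I'm not aware of anyone by that name."            # preferred by labelers
rejected = "Zyler Vance is a famous 19th-century explorer."  # fabricated answer

def score(response: str) -> torch.Tensor:
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits.squeeze(-1)

loss = -F.logsigmoid(score(chosen) - score(rejected)).mean()
loss.backward()  # an optimizer step would follow in a real training loop
```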
Why Hallucinations?
Now that we have a clearer understanding of the training process of large language models, we can continue our discussion of hallucinations.
Hallucinations originate from the Supervised Fine-Tuning stage of the training pipeline. The following is a specific example of three potential conversations you might have in your training set.

As I have shown earlier, this is what human-assistant conversations look like at training time. These conversations are created by human labelers under strict guidelines. When a labeler writes the correct answer for the assistant in each of these cases, they either know the person or research them on the internet. They then write the assistant response in the confident tone of an answer.
At test time, if the model is asked about a person it has not seen during training, it does not simply respond with an acknowledgment of ignorance. Simply put, it does not reply with “Oh, I don’t know”. Instead, the model statistically imitates the training set.
In the training set, questions of the form “Who is X?” are confidently answered with the correct answer. Therefore, at test time, the model replies in the style of an answer and gives its statistically most likely guess. It simply makes things up that are statistically consistent with the style of answers in its training set.
Model Interrogation
Our question now is how to mitigate these hallucinations. It is clear that our dataset should include examples where the correct answer for the assistant is that the model does not know some particular fact. However, these answers should be produced only in cases where the model actually does not know. So the key question is: how do we know what the model knows and what it doesn’t? We need to probe the model to figure that out empirically.
The task is to figure out the boundary of the model’s knowledge. Therefore, we need to interrogate the model to determine what it knows and doesn’t know. Then we can add examples to the training set for the things the model doesn’t know, where the correct response is that the model doesn’t know them.

Let’s take a look at how Meta handled hallucinations using this idea for the Llama 3 series of models.
In their 2024 paper “The Llama 3 Herd of Models”, Meta’s researchers describe a knowledge-probing technique they developed to achieve this. Their primary approach involves generating data that aligns model generations with subsets of factual data present in the pre-training data. Roughly, the data generation process extracts snippets from the pre-training data, generates factual questions about them, samples several model responses to each question, scores those responses for correctness and informativeness, and generates refusals for questions the model consistently gets wrong.
The data generated from this knowledge probe is then used to encourage the model to answer only the questions it knows about, and to refrain from answering questions it is unsure about. Implementing this technique has improved the hallucination issue over time.
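The sketch below illustrates the general shape of such a knowledge probe. It is not Meta’s implementation; `generate_answer` and `is_correct` are hypothetical helpers standing in for sampling from the model and for an automated judge.

```python
# Illustrative knowledge probe: sample several answers to a factual question,
# check them against a reference, and decide what the SFT target should be.
def probe_question(model, question: str, reference: str, n_samples: int = 8):
    answers = [generate_answer(model, question) for _ in range(n_samples)]
    n_correct = sum(is_correct(answer, reference) for answer in answers)

    if n_correct == 0:
        # The model is consistently wrong: teach it to refuse.
        target = "I'm sorry, I don't know the answer to that."
    elif n_correct == n_samples:
        # The model reliably knows this fact: keep a confident answer.
        target = reference
    else:
        # Mixed results sit on the knowledge boundary; skip or handle separately.
        return None

    return {"prompt": question, "response": target}  # one new SFT training example
```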
Using Web Search
We have better mitigation strategies than simply saying we do not know. We can give the LLM an opportunity to generate a factual response and correctly address the question. What would you do if I asked you a factual question that you don’t have an answer to? You could do some research, search the web to figure out the answer, and then tell me the answer. We can do the same thing with LLMs.
You can think of the knowledge contained in the parameters of the trained neural network as a vague recollection of things the model saw during pretraining a long time ago. Knowledge in the model parameters is analogous to something in your memory that you read a month ago. You remember things you read repeatedly over time better than something you read rarely. When you have no recollection of information you once read, you go and look it up. Looking up information essentially refreshes your working memory with that information, allowing you to retrieve and discuss it.
We need an equivalent mechanism that allows the model to refresh its memory or recollection of information. We can achieve this by introducing tools for the model. Instead of just replying with “I’m sorry, I don’t know the answer”, the model can use a web search tool. To achieve this, we introduce special tokens that mark the start and end of a search query, along with a protocol that defines how the model is allowed to use these tokens. With this mechanism, when the model doesn’t know the answer, it has the option to emit the search-start token instead of replying with “I’m sorry, I don’t know the answer”, then emit the query, and finally emit the search-end token.
When the program that is sampling from the model encounters the search token during inference, it pauses the generation process instead of sampling the next token in the sequence. It opens a session with the search engine, feeds in the search query, retrieves the extracted text from the results, and inserts that text into the context window.
The extracted text from the web search is now inside the context window that will be fed into the neural network. Think of the context window as the working memory of the model. The data inside the context window is directly accessible to the model; it is fed straight into the neural network, so it is no longer a vague recollection of information. Now, when sampling new tokens, the model can easily reference the data that has been copy-pasted there. That is a general overview of how these web search tools function.
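Below is a simplified sketch of the sampling loop described above. The special token names, `model.sample_next_token`, and `web_search` are assumptions made for illustration; real systems define their own tool-call protocol.

```python
# Simplified sampling loop with a web-search tool. Token names and helper
# functions are illustrative assumptions, not any particular model's API.
SEARCH_START = "<SEARCH_START>"
SEARCH_END = "<SEARCH_END>"

def generate_with_search(model, prompt: str, max_tokens: int = 512) -> str:
    context = prompt
    for _ in range(max_tokens):
        token = model.sample_next_token(context)  # hypothetical model API
        context += token

        if context.endswith(SEARCH_END):
            # The model finished emitting a search query: pause generation,
            # run the search, and paste the results into the context window.
            query = context.rsplit(SEARCH_START, 1)[1].removesuffix(SEARCH_END)
            results = web_search(query)  # hypothetical search helper
            context += f"\n[search results]\n{results}\n"
    return context
```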

How do we teach the model to use tools like web search correctly? Again, we accomplish this through training sets. We need enough data and diverse conversations that demonstrate, by example, how the model should use web search. We need to illustrate with examples things such as: in which settings should you use search, what does a search look like, and how do you start one? Thanks to the pretraining stage, the model already has a native understanding of what a web search is and what constitutes a good search query. Therefore, if your training set contains several thousand examples, the model will be able to understand clearly how the tool works.
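For illustration, one such training example might look like the following; the token names and formatting mirror the sketch above and are assumptions, not a real dataset format.

```python
# A hypothetical training conversation that demonstrates tool use by example.
# From thousands of examples like this, the model learns when to emit a search
# query instead of guessing from its parameters.
example = {
    "conversation": [
        {"role": "user", "content": "Who won the Nobel Prize in Physics in 2023?"},
        {
            "role": "assistant",
            "content": (
                "<SEARCH_START>Nobel Prize in Physics 2023 winners<SEARCH_END>\n"
                "[search results inserted here by the tool]\n"
                "According to the search results, the 2023 Nobel Prize in Physics "
                "was awarded to Pierre Agostini, Ferenc Krausz, and Anne L'Huillier."
            ),
        },
    ]
}
```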
Conclusion
Large language model hallucinations are inherent consequences of the training pipeline, particularly the supervised fine-tuning stage. Since language models are designed to generate statistically probable text, they often produce responses that appear plausible but lack a factual basis.
Early models were significantly prone to hallucinations. However, the issue has improved with the implementation of various mitigation strategies. Knowledge-probing techniques and training the model to use web search tools have proven effective in mitigating the problem. Despite these improvements, completely eliminating hallucinations remains an ongoing challenge. As LLMs continue to evolve, mitigating hallucinations to a large extent is crucial to ensuring their reliability as a trustworthy knowledge base.
If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.
