We’ve all heard about it or experienced it.
Natural Language Generation models can sometimes hallucinate, i.e., they generate text that just isn’t quite right for the prompt provided. In layman’s terms, they start producing text that is either not strictly related to the given context or plainly inaccurate. Some hallucinations are comprehensible, for instance, mentioning something related but not precisely the topic in question; other times the output looks like legitimate information, but it’s simply not correct, it’s made up.
This is clearly a problem once we start using generative models to complete tasks and intend to consume the information they generate to make decisions.
The issue isn’t necessarily tied to how the model generates the text, but to the data it uses to generate a response. When you train an LLM, the knowledge encoded in the training data is crystallized; it becomes a static representation of everything the model knows up until that point in time. In order to update the model’s knowledge base, it must be retrained. However, training Large Language Models requires time and money.
One of the main motivations for developing RAG is the increasing demand for factually accurate, contextually relevant, and up-to-date generated content. [1]
When thinking about a way to make generative models aware of the wealth of new information created every day, researchers began exploring efficient ways to keep these models up to date that didn’t require constantly retraining them.
They came up with the idea of hybrid models, that is, generative models that have a way of fetching external information to complement the data the LLM already knows and was trained on. These models have an information retrieval component that lets them access up-to-date data, on top of the generative capabilities they’re already well known for. The goal is to ensure both fluency and factual correctness when producing text.
This hybrid model architecture is called Retrieval Augmented Generation, or RAG for short.
The RAG era
Given the critical need to keep models updated in a time- and cost-effective way, RAG has become an increasingly popular architecture.
Its retrieval mechanism pulls in information from external sources that is not encoded in the LLM. For instance, you can see RAG in action, in the real world, when you ask Gemini something about the Brooklyn Bridge. At the bottom you’ll see the external sources it pulled information from.
By grounding the final output in information obtained from the retrieval module, these Generative AI applications are less likely to propagate any biases originating from the outdated, point-in-time view of the training data they used.
The second piece of the RAG architecture is the one most visible to us as consumers: the generation model. This is typically an LLM that processes the retrieved information and generates human-like text.
RAG combines retrieval mechanisms with generative language models to enhance the accuracy of outputs. [1]
As for its internal architecture, the retrieval module relies on dense vector representations to identify the relevant documents to use, while the generative module uses the standard transformer-based LLM architecture.
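To make that a bit more concrete, here is a minimal sketch of the two modules working together. It assumes the sentence-transformers package for the dense retrieval side; the model name is just an illustrative choice, and `call_llm` is a hypothetical placeholder for whatever generative model the application actually uses.

```python
# Minimal RAG sketch: dense retrieval over a tiny document store,
# followed by prompting a generative model with the retrieved context.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

documents = [
    "The Brooklyn Bridge opened to traffic in 1883.",
    "The Golden Gate Bridge is located in San Francisco.",
    "RAG pairs a retriever with a generative language model.",
]
doc_embeddings = encoder.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (dense retrieval)."""
    query_embedding = encoder.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, doc_embeddings)[0]
    top_k = scores.topk(k).indices.tolist()
    return [documents[i] for i in top_k]

def answer(query: str) -> str:
    """Ground the generation step in whatever the retriever found."""
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return call_llm(prompt)  # hypothetical call to the generative model
```

The important part is the shape of the flow: embed the documents once, retrieve per query, then ground the prompt in whatever was retrieved.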

This architecture addresses very important pain points of generative models, but it’s not a silver bullet. It also comes with some challenges and limitations.
The retrieval module may struggle to fetch the latest documents.
This part of the architecture relies heavily on Dense Passage Retrieval (DPR) [2, 3]. Compared to other techniques such as BM25, which is based on TF-IDF, DPR does a much better job at finding the semantic similarity between query and documents. Leveraging semantic meaning, instead of simple keyword matching, is especially useful in open-domain applications, i.e., think of tools like Gemini or ChatGPT, which are not necessarily experts in a particular domain, but know a little bit about everything.
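A toy example makes that difference visible. The sketch below scores the same paraphrased query with a BM25 scorer (via the rank_bm25 package) and with a dense encoder (via sentence-transformers; the model name is again just an illustrative choice): the keyword-based scores collapse to zero because no terms overlap, while the embedding similarity still surfaces the right document.

```python
# Keyword matching vs. semantic matching on a paraphrased query.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

docs = [
    "How much does it cost to fix a car?",
    "A recipe for homemade apple pie.",
]
query = "automobile repair expenses"

# BM25 relies on exact term overlap (TF-IDF-style weighting),
# so a query sharing no terms with the documents scores zero for both.
bm25 = BM25Okapi([d.lower().split() for d in docs])
print(bm25.get_scores(query.lower().split()))

# Dense retrieval compares embeddings, so "fix a car" and
# "automobile repair" still end up close in vector space.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = encoder.encode(docs, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
print(util.cos_sim(query_emb, doc_emb))
```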
However, DPR has its shortcomings too. The dense vector representation can lead to irrelevant or off-topic documents being retrieved. DPR models appear to retrieve information based on knowledge that already exists within their parameters, i.e., facts need to be already encoded in order to be accessible by retrieval [2].
[…] if we extend our definition of retrieval to also encompass the ability to navigate and elucidate concepts previously unknown or unencountered by the model, a capability akin to how humans research and retrieve information, our findings imply that DPR models fall short of this mark. [2]
To mitigate these challenges, researchers have looked into adding more sophisticated query expansion and contextual disambiguation. Query expansion is a set of techniques that modify the original user query by adding relevant terms, with the goal of building a connection between the intent of the user’s query and the relevant documents [4].
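As a rough illustration of the idea, here is a minimal query-expansion sketch. The hand-written synonym map is purely for demonstration; in practice the added terms typically come from a thesaurus, pseudo-relevance feedback, or a language model, as explored in [4].

```python
# Toy query expansion: augment the user's query with related terms
# before handing it to the retriever.
EXPANSIONS = {
    "car": ["automobile", "vehicle"],
    "fix": ["repair", "mend"],
}

def expand_query(query: str) -> str:
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(EXPANSIONS.get(term, []))
    return " ".join(expanded)

print(expand_query("fix my car"))
# -> "fix my car repair mend automobile vehicle"
```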
There are also cases where the generative module fails to fully incorporate, into its responses, the information gathered during the retrieval phase. To address this, there have been recent improvements in attention and hierarchical fusion techniques [5].
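To give a feel for how that fusion can work, here is a simplified sketch in the spirit of the Fusion-in-Decoder approach from [5], assuming a T5 checkpoint from Hugging Face transformers: each retrieved passage is encoded together with the question independently, the encoder states are concatenated, and the decoder attends over all of them at once. Treat it as a rough approximation of the idea, not a faithful reproduction of the paper.

```python
# Fusion-in-Decoder-style sketch: encode each (question, passage) pair
# separately, then let the decoder attend over the concatenated states.
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "When did the Brooklyn Bridge open?"
passages = [
    "The Brooklyn Bridge opened to traffic on May 24, 1883.",
    "The bridge spans the East River between Manhattan and Brooklyn.",
]

# Encode each question+passage pair on its own.
inputs = tokenizer(
    [f"question: {question} context: {p}" for p in passages],
    return_tensors="pt",
    padding=True,
)
encoder_states = model.encoder(**inputs).last_hidden_state  # (n_passages, seq, dim)

# Fuse: flatten the passage dimension so the decoder sees one long sequence.
fused = encoder_states.reshape(1, -1, encoder_states.size(-1))
fused_mask = inputs.attention_mask.reshape(1, -1)

output_ids = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    attention_mask=fused_mask,
    max_new_tokens=32,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```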
Model performance is an important metric, especially when the goal of these applications is to become a seamless part of our day-to-day lives and make the most mundane tasks almost effortless. However, running RAG end-to-end can be computationally expensive. For each query the user makes, there must be one step for information retrieval and another for text generation. This is where recent techniques such as Model Pruning [6] and Knowledge Distillation [7] come into play, to ensure that even with the extra step of looking for up-to-date information outside of the trained model data, the overall system is still performant.
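Knowledge distillation in particular is straightforward to sketch: a smaller student model is trained to match the softened output distribution of a larger teacher, so a cheaper model can take over part of the pipeline. Below is a minimal PyTorch version of the standard distillation loss; the temperature and weighting values are illustrative defaults, not numbers taken from [6] or [7].

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target (teacher) and hard-target (label) losses."""
    # Soft targets: match the teacher's softened output distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

A student trained this way (DistilBERT [7] being the canonical example) can then stand in for a full-size encoder in the pipeline, cutting per-query latency without giving up the retrieval step.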
Lastly, while the information retrieval module in the RAG architecture is meant to mitigate bias by accessing external sources that are more up to date than the data the model was trained on, it may not fully eliminate bias. If the external sources are not meticulously chosen, they can continue to add bias, or even amplify existing biases from the training data.
Conclusion
Using RAG in generative applications provides a significant improvement in the model’s ability to stay up to date, and gives its users more accurate results.
When used in domain-specific applications, its potential is even clearer. With a narrower scope and an external library of documents pertaining only to a particular domain, these models are able to retrieve new information much more effectively.
However, ensuring generative models are continually up to date is far from a solved problem.
Technical challenges, such as handling unstructured data or ensuring model performance, continue to be active research topics.
Hope you enjoyed learning a bit more about RAG, and the role this type of architecture plays in keeping generative applications up to date without needing to retrain the model.
[1] Gupta, S., Ranjan, R., & Singh, S. N. (2024). A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions. (ArXiv)
[2] Reichman, B., & Heck, L. (2024). Retrieval-Augmented Generation: Is Dense Passage Retrieval Retrieving? (link)
[3] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. T. (2020). Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 6769-6781). (ArXiv)
[4] Koo, H., Kim, M., & Hwang, S. J. (2024). Optimizing Query Generation for Enhanced Document Retrieval in RAG. (ArXiv)
[5] Izacard, G., & Grave, E. (2021). Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 874-880). (ArXiv)
[6] Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both Weights and Connections for Efficient Neural Networks. In Advances in Neural Information Processing Systems (pp. 1135-1143). (ArXiv)
[7] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv abs/1910.01108. (ArXiv)