In this article, I’ll show you how to create your own RAG dataset consisting of contexts, questions, and answers from documents in any language.
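To make the target concrete, a single entry in such a dataset might look like the following sketch. The field names are illustrative, not a fixed schema:

```python
# One hypothetical entry in a RAG evaluation dataset.
# The keys ("context", "question", "answer") are illustrative assumptions,
# not a required format.
example_entry = {
    "context": "Retrieval-Augmented Generation (RAG) gives LLMs access to an external knowledge base.",
    "question": "What does RAG give an LLM access to?",
    "answer": "An external knowledge base.",
}
```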
Retrieval-Augmented Generation (RAG) [1] is a technique that enables LLMs to access an external knowledge base.
By uploading PDF files and storing them in a vector database, we can retrieve this data via a vector similarity search and then insert the retrieved text into the LLM prompt as additional context.
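Here is a minimal sketch of that retrieve-then-generate step. The embedding model name and the toy chunks are assumptions, and an in-memory NumPy search stands in for a real vector database:

```python
# Minimal sketch of RAG retrieval: embed chunks, find the most similar
# ones to a query, and insert them into the LLM prompt as context.
# The model name is an assumption; any embedding model can be substituted.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical text chunks extracted from the uploaded PDFs
chunks = [
    "RAG combines retrieval with generation to ground LLM answers.",
    "Vector databases store embeddings for fast similarity search.",
    "Chunking strategy strongly affects retrieval quality.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# Normalized embeddings let us use a dot product as cosine similarity
chunk_embeddings = model.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_embedding = model.encode(query, normalize_embeddings=True)
    scores = chunk_embeddings @ query_embedding
    top_indices = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in top_indices]

query = "Why use a vector database in RAG?"
context = "\n".join(retrieve(query))

# The retrieved text becomes additional context in the LLM prompt
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
print(prompt)
```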
This provides the LLM with up-to-date knowledge and reduces the likelihood of the LLM making up facts (hallucinations).
However, there are many parameters we need to set in a RAG pipeline, and researchers are constantly suggesting new improvements. How do we know which parameters to choose and which methods will really improve performance for our particular use case?
This is why we need a validation/dev/test dataset to evaluate our RAG pipeline. The dataset should be from the domain we are interested in…

