I explain how to build an app that generates multiple-choice questions (MCQs) on any user-defined subject. The app retrieves Wikipedia articles related to the user's request and uses RAG to prompt a chat model to generate the questions.
I'll show how the app works, explain how Wikipedia articles are retrieved, and show how these are used to invoke a chat model. Next, I explain the key components of this app in more detail. The code of the app is available here.
App Demo
The gif above shows the user entering the learning context, the generated MCQ, and the feedback after the user submitted an answer.

On the first screen the user describes the context of the MCQs that should be generated. After pressing "Submit Context" the app searches for Wikipedia articles whose content matches the user query.

The app splits each Wikipedia page into sections and scores them based on how closely they match the user query. These scores are used to sample the context of the next question, which is displayed on the next screen with four answer choices. The user can select a choice and submit it via "Submit Answer". It is also possible to skip the question via "Next Question". In this case it is assumed that the question did not meet the user's expectation, and the context of this question will be avoided when generating the following questions. To end the session the user can select "End MCQ".

The screen shown after the user submitted an answer indicates whether the answer was correct and provides an additional explanation. Afterwards, the user can either get a new question via "Next Question" or end the session with "End MCQ".

The end-of-session screen shows how many questions were answered correctly and incorrectly. Additionally, it contains the number of questions the user rejected via "Next Question". If the user selects "Start New Session", the start screen is displayed again, where a new context for the next session can be provided.
Concept
The aim of this app is to generate high-quality and up-to-date questions on any user-defined topic. User feedback is taken into account to ensure that the generated questions meet the user's expectations.
To retrieve high-quality and up-to-date context, Wikipedia articles are selected with respect to the user's query. Each article is split into sections, and every section is scored based on its similarity with the user query. If the user rejects a question, the score of the respective section is downgraded to reduce the likelihood of sampling this section again.
This process can be separated into two workflows:
- Context Retrieval
- Question Generation
which are described below.
Context Retrieval
The workflow that derives the context of the MCQs from Wikipedia based on the user query is shown below.

The user enters the query that describes the context of the MCQs on the start screen. An example of a user query could be: "Ask me anything about stars and planets".
To efficiently search for Wikipedia articles, this query is converted into keywords. The keywords of the query above are: "Stars", "Planets", "Astronomy", "Solar System", and "Galaxy".
For each keyword a Wikipedia search is executed, of which the top three pages are selected. Not all of these 15 pages are a good fit for the query provided by the user. To remove irrelevant pages at the earliest possible stage, the vector similarity between the embedded user query and the page excerpt is calculated. Pages whose similarity is below a threshold are filtered out. In our example 3 of the 15 pages were removed.
The remaining pages are read and divided into sections. As not the whole page content may be related to the user query, splitting the pages into sections allows selecting the parts of a page that fit the user query particularly well. Hence, for each section the vector similarity against the user query is calculated, and sections with low similarity are filtered out. The remaining 12 pages contained 305 sections, of which 244 were kept after filtering.
The last step of the retrieval workflow is to assign a score to each section based on its vector similarity. This score will later be used to sample sections for question generation.
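As an illustration of this filtering step, the sketch below computes the cosine similarity between the embedded user query and a list of texts (page excerpts or sections) and keeps only those above a threshold. The embedding model and the threshold value are assumptions for illustration, not necessarily the ones used in the app:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_by_similarity(user_query, texts, threshold=0.4):
    # Keep only texts whose embedding is sufficiently similar to the embedded user query
    query_embedding = embedder.encode(user_query)
    text_embeddings = embedder.encode(texts)
    similarities = [cosine_similarity(query_embedding, emb) for emb in text_embeddings]
    return [(text, sim) for text, sim in zip(texts, similarities) if sim >= threshold]

The returned similarities can also be reused directly as the basis for the section scores described below.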
Question Generation
The workflow to generate a new MCQ is shown below:

The first step is to sample one section with respect to the section scores. The text of this section is inserted, together with the user query, into a prompt to invoke a chat model. The chat model returns a JSON-formatted response that contains the question, the answer choices, and an explanation of the correct answer. In case the provided context is not suitable to generate an MCQ that addresses the user query, the chat model is instructed to return a keyword signaling that the question generation was not successful.
If the question generation was successful, the question and the answer choices are displayed to the user. Once the user submits an answer, it is evaluated whether the answer was correct, and the explanation of the correct answer is shown. To generate a new question the same workflow is repeated.
In case the question generation was not successful, or the user rejected the question by clicking on "Next Question", the score of the section that was selected to generate the prompt is downgraded. Hence, it is less likely that this section will be selected again.
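To make this loop concrete, here is a high-level sketch under assumed helper names (sample_section, generate_mcq, downgrade_score, and FAIL_KEYWORD are placeholders for the app's actual implementation, which may be structured differently):

def next_question(sections, user_query):
    # Sample a section with probability proportional to its score
    section = sample_section(sections)
    # Invoke the chat model with the MCQ prompt for this section
    mcq = generate_mcq(user_query, section["text"])
    if mcq["question"] == FAIL_KEYWORD:
        downgrade_score(section)  # reduce the chance of sampling this section again
        return None
    return section, mcq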
Key Components
Next, I'll explain some key components of the workflows in more detail.
Extracting Wiki Articles
Wikipedia articles are extracted in two steps: first a search is run to find suitable pages; after filtering the search results, the pages are read section by section.
Search requests are sent to this URL. Additionally, a header containing the requester's contact information and a parameter dictionary with the search query and the number of pages to be returned are provided. The output is in JSON format and can be converted to a dictionary. The code below shows how to run the request:
import os
import requests

# WIKI_SEARCH_URL is the Wikipedia search endpoint; search_query and number_of_results are set elsewhere
headers = {'User-Agent': os.getenv('WIKI_USER_AGENT')}
parameters = {'q': search_query, 'limit': number_of_results}
response = requests.get(WIKI_SEARCH_URL, headers=headers, params=parameters)
page_info = response.json()['pages']
After filtering the search results based on the pages' excerpts, the text of the remaining pages is imported using wikipediaapi:
import os
import wikipediaapi

def get_wiki_page_sections_as_dict(page_title, sections_exclude=SECTIONS_EXCLUDE):
    wiki_wiki = wikipediaapi.Wikipedia(user_agent=os.getenv('WIKI_USER_AGENT'), language='en')
    page = wiki_wiki.page(page_title)
    if not page.exists():
        return None

    def sections_to_dict(sections, parent_titles=[]):
        # Map concatenated section titles to section texts, starting with the page summary
        result = {'Summary': page.summary}
        for section in sections:
            if section.title in sections_exclude:
                continue
            section_title = ": ".join(parent_titles + [section.title])
            if section.text:
                result[section_title] = section.text
            # Recurse into the subsections of this section
            result.update(sections_to_dict(section.sections, parent_titles + [section.title]))
        return result

    return sections_to_dict(page.sections)
To access Wikipedia articles, the app uses wikipediaapi.Wikipedia, which requires a user-agent string for identification. It returns a WikipediaPage object which contains a summary of the page and the page sections with the title and text of each section. Sections are hierarchically organized: each section object contains another list of sections, which are the subsections of the respective section. The function above reads all sections of a page and returns a dictionary that maps a concatenation of all section and subsection titles to the respective text.
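For illustration, a short usage sketch (the page title is a placeholder, and SECTIONS_EXCLUDE is assumed to be defined in the app, e.g. as a list of titles such as "References" or "External links"):

sections = get_wiki_page_sections_as_dict('Planet')
if sections:
    for title in sections:
        print(title)  # prints the concatenated section titles, starting with 'Summary'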
Context Scoring
Sections that fit the user query better should get a higher probability of being selected. This is achieved by assigning a score to each section which is used as the weight for sampling the sections. This score is calculated as follows:
\[ s_{section} = w_{rejection} \, s_{rejection} + (1 - w_{rejection}) \, s_{sim} \]
Each section receives a score based on two factors: how often it has been rejected, and how closely its content matches the user query. These scores are combined into a weighted sum. The section rejection score consists of two components: the number of times the section's page has been rejected divided by the highest number of page rejections, and the number of this section's rejections divided by the highest number of section rejections:
\[ s_{rejection} = 1 - \frac{1}{2}\left( \frac{n_{page(s)}}{\max_{page} n_{page}} + \frac{n_s}{\max_{s} n_s} \right) \]
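A minimal sketch of how these formulas could be implemented and how a section can then be sampled with the scores as weights is shown below. The function names, the default weight w_rejection=0.5, and the section dictionaries are assumptions for illustration, not the app's actual code:

import random

def rejection_score(n_page_rejections, max_page_rejections, n_section_rejections, max_section_rejections):
    # s_rejection = 1 - 0.5 * (page rejection share + section rejection share); guard against division by zero
    page_term = n_page_rejections / max_page_rejections if max_page_rejections else 0.0
    section_term = n_section_rejections / max_section_rejections if max_section_rejections else 0.0
    return 1.0 - 0.5 * (page_term + section_term)

def section_score(s_rejection, s_sim, w_rejection=0.5):
    # Weighted sum of the rejection score and the similarity score
    return w_rejection * s_rejection + (1.0 - w_rejection) * s_sim

def sample_section(sections):
    # sections: list of dicts, each with at least a 'score' and a 'text' entry
    weights = [section['score'] for section in sections]
    return random.choices(sections, weights=weights, k=1)[0]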
Prompt Engineering
Prompt engineering is an important aspect of the Learning App's functionality. The app uses two prompts to:
- get keywords for the Wikipedia page search
- generate MCQs for the sampled context
The template of the keyword generation prompt is shown below:
KEYWORDS_TEMPLATE = """
You are an assistant to generate keywords to search for Wikipedia articles that contain content the user wants to learn.
For a given user query return at most {n_keywords} keywords. Make sure every keyword is a good match to the user query.
Rather provide fewer keywords than keywords that are less relevant.
Instructions:
- Return the keywords separated by commas
- Don't return the rest
"""
This system message is concatenated with a human message containing the user query to invoke the LLM.
The parameter n_keywords sets the maximum number of keywords to be generated. The instructions ensure that the response can easily be converted to a list of keywords. Despite these instructions, the LLM often returns the maximum number of keywords, including some less relevant ones.
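For illustration, a minimal sketch of how this prompt could be invoked and its response parsed, assuming a LangChain chat model (the model choice and the user_query variable are placeholders; the app's actual wiring may differ):

from langchain_core.messages import SystemMessage, HumanMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # assumed model choice
system_message = KEYWORDS_TEMPLATE.format(n_keywords=5)
response = llm.invoke([SystemMessage(content=system_message), HumanMessage(content=user_query)])
# The prompt instructs the model to return comma-separated keywords only
keywords = [keyword.strip() for keyword in response.content.split(",") if keyword.strip()]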
The MCQ prompt contains the sampled section and invokes the chat model to reply with a question, answer choices, and an explanation of the correct answer in a machine-readable format.
MCQ_TEMPLATE = """
You are a learning app that generates multiple-choice questions based on educational content. The user provided the
following request to define the learning content:
"{user_query}"
Based on the user request, the following context was retrieved:
"{context}"
Generate a multiple-choice question directly based on the provided context. The correct answer must be explicitly stated
in the context and must always be the first option in the choices list. Additionally, provide an explanation for why
the correct answer is correct.
Number of answer choices: {n_choices}
{previous_questions}{rejected_questions}
The JSON output should follow this structure (for number of choices = 4):
{{"question": "Your generated question based on the context", "choices": ["Correct answer (this must be the first choice)","Distractor 1","Distractor 2","Distractor 3"], "explanation": "A brief explanation of why the correct answer is correct."}}
Instructions:
- Generate one multiple-choice question strictly based on the context.
- Provide exactly {n_choices} answer choices, ensuring the first one is the correct answer.
- Include a concise explanation of why the correct answer is correct.
- Do not return anything other than the JSON output.
- The provided explanation should not assume the user is aware of the context. Avoid formulations like "As stated in the text...".
- The response must be machine readable and not contain line breaks.
- Check if it is possible to generate a question based on the provided context that is aligned with the user request. If it is not possible, set the generated question to "{fail_keyword}".
"""
The inserted parameters are:
- user_query: the text of the user query
- context: the text of the sampled section
- n_choices: the number of answer choices
- previous_questions: an instruction not to repeat previous questions, together with a list of all previous questions
- rejected_questions: an instruction to avoid questions of similar nature or context, together with a list of rejected questions
- fail_keyword: the keyword that indicates that a question could not be generated
Including previous questions reduces the chance that the chat model repeats questions. Moreover, by providing rejected questions, the user's feedback is taken into account when generating new questions. The example in the template should ensure that the generated output is in the correct format so that it can easily be converted to a dictionary. Setting the correct answer as the first choice avoids requiring an additional output that indicates the correct answer. When showing the choices to the user, their order is shuffled. The last instruction defines what output should be provided in case it is not possible to generate a question matching the user query. Using a standardized keyword makes it easy to identify when the question generation has failed.
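As a hedged sketch of this step, reusing the llm object from the keyword sketch above (the FAIL_KEYWORD value and the variables holding the context and the previous/rejected question instructions are placeholders, not the app's actual code), the prompt could be filled, invoked, and parsed like this:

import json
import random

FAIL_KEYWORD = "GENERATION_FAILED"  # placeholder for the keyword defined in the app

prompt = MCQ_TEMPLATE.format(
    user_query=user_query,
    context=sampled_section_text,
    n_choices=4,
    previous_questions=previous_questions_instruction,
    rejected_questions=rejected_questions_instruction,
    fail_keyword=FAIL_KEYWORD,
)
response = llm.invoke(prompt)
mcq = json.loads(response.content)

if mcq["question"] != FAIL_KEYWORD:
    correct_answer = mcq["choices"][0]  # the first choice is the correct answer by construction
    shuffled_choices = random.sample(mcq["choices"], k=len(mcq["choices"]))  # shuffle before display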
Streamlit App
The app is built using Streamlit, an open-source app framework in Python. Streamlit provides many functions that add page elements with a single line of code. For example, the element in which the user can write the query is created via:
context_text = st.text_area("Enter the context for MCQ questions:")
where context_text contains the string the user has written. Buttons are created with st.button or st.radio, where the returned variable contains the information whether the button has been pressed or which value has been selected.
The page is generated top-down by a script that defines each element sequentially. Each time the user interacts with the page, e.g. by clicking a button, the script is re-run with st.rerun(). When re-running the script, it is important to carry over information from the previous run. This is done with st.session_state, which can hold any objects. For example, the MCQ generator instance is assigned to the session state as:
st.session_state.mcq_generator = MCQGenerator()
so that when the context retrieval workflow has been executed, the retrieved context is available to generate an MCQ on the next page.
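A minimal sketch of this pattern is shown below; the MCQGenerator interface (in particular the retrieve_context method) is an assumption for illustration, not the app's actual class:

import streamlit as st

# Keep the generator across reruns; create it only once per session
if "mcq_generator" not in st.session_state:
    st.session_state.mcq_generator = MCQGenerator()

context_text = st.text_area("Enter the context for MCQ questions:")
if st.button("Submit Context") and context_text:
    st.session_state.mcq_generator.retrieve_context(context_text)  # assumed method name
    st.rerun()  # re-run the script so the next screen can show a generated question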
Enhancements
There are many options to enhance this app. Beyond Wikipedia, users could also upload their own PDFs to generate questions from custom materials, such as lecture slides or textbooks. This would enable the user to generate questions on any context; for example, it could be used to prepare for exams by uploading course materials.
Another aspect that could be improved is to optimize the context selection to minimize the number of questions rejected by the user. Instead of updating scores, an ML model could be trained to predict how likely it is that a question will be rejected, based on features like its similarity to accepted and rejected questions. Each time another question is rejected this model could be retrained.
Also, the generated questions could be saved so that when a user wants to repeat the learning exercise these questions can be used again. An algorithm could be applied to select previously wrongly answered questions more frequently to focus on improving the learner's weaknesses.
Summary
This article showcases how retrieval-augmented generation (RAG) can be used to build an interactive learning app that generates high-quality, context-specific multiple-choice questions from Wikipedia articles. By combining keyword-based search, semantic filtering, prompt engineering, and a feedback-driven scoring system, the app dynamically adapts to user preferences and learning goals. Leveraging tools like Streamlit enables rapid prototyping and deployment, making this an accessible framework for educators, students, and developers alike. With further enhancements such as custom document uploads, adaptive question sequencing, and machine-learning-based rejection prediction, the app holds strong potential as a flexible platform for personalized learning and self-assessment.
Further Reading
To learn more about RAG I can recommend these articles from Shaw Talebi and Avishek Biswas. Harrison Hoffman wrote two excellent tutorials on embeddings and vector databases and on building an LLM RAG chatbot. How to manage states in Streamlit can be found in Baertschi's article.