How to Do Evals on a Bloated RAG Pipeline

This article is a follow-up to Building an Overengineered Retrieval System. That one was about building the whole system. This one is about doing the evals for it.

In the previous article, I went through the different parts of a RAG pipeline: chunking the data properly, query optimization, retrieval (semantic, BM25, or hybrid search), re-ranking, expanding chunks to their neighbors, building the context, and then generation with an LLM.

One of the questions I got was: does expanding chunks to neighbors actually improve answers, or does it just add noise and make it harder for the model to stay grounded?

So that's what we'll test here. We'll run some basic evaluations, look at metrics like faithfulness, answer relevancy, context relevance, and hallucination rate, and compare results across different models and datasets.

I've collected most of the results here and here already, but we'll go through them too.

As a note, I'm planning to compare this kind of "advanced" pipeline to a more naive baseline later. But this article is mainly about evaluating the pipeline as it is.

I always go through some intro sections before I dig in, but if you're new-new, I'd first read up on how to build a basic RAG system, how embeddings work, and a proper intro to evals/metrics. Then you can also read how to build the over-engineered pipeline I introduced above, or at least skim it.

If none of this is new to you, then skip to the results part.

Why we perform evals

Evals are about pressure-testing the system on a bigger (and more targeted) set of questions than your favorite 10, and making sure that whatever changes you push don't degrade the quality of the system.

Changes in data, prompts, or models can very much affect performance without you noticing it.

You may also need to show your team the overall performance of the system you've built before being allowed to test it on real users.

But before you do that, you need to decide what to test.

What does a successful system look like to you? If you care about multi-hop, you need questions that actually require multi-hop. If you care about Q&A and correct citations, you test for that. Otherwise, you end up evaluating the wrong thing.

This is a bit like doing investigative work: you test something, you try to understand the results, and then you build better tests.

To do this well, you should try to build a golden set (often from user logs) to test with.

This isn't always possible, so in situations like this we build synthetic datasets. This may not be the best way to do it, as it will clearly be biased and won't reflect what your users will actually ask. However, you may need to start somewhere.

For this article, I've created three different datasets so we have something to discuss: one generated from the ingested corpus, one that turns the corpus into messy user questions, and one with random questions about RAG that haven't been generated from the corpus at all.

You'll be able to see how these datasets give us different results on the metrics, and how they all mean different things.

What to think about

I'm not going to go through everything there is to think about here, because doing evals well is pretty difficult (although also fun if you like statistics and data).

But there are a few things you need to remember: LLM judges are biased, cherry-picking questions is a problem, gold answers are best if you have them, and using a bigger dataset with tags helps you break down where and how the system is failing.

If you've read the eval metrics article, you've already seen the idea of LLM-as-a-judge. It can be useful, but it's not inherently reliable, since it has baked-in preferences and blind spots.

There are things that will make you go mad, like a judge punishing an answer that is based on the corpus but not explicitly stated in the retrieved chunks (summaries / small inferences), or judging the same answer differently depending on how the question is phrased.

You'll realize this later when you dig into the failing questions to understand why.

Another thing to remember is to make sure not to "cherry-pick" questions, even if you feel the urge to.

You obviously have to start somewhere, but the goal is to get close to what your users are actually asking, find the problems, and keep updating the dataset based on where the system seems to fail. It's easy to get nice numbers if you mostly test "easy" questions, but then the eval becomes less useful.

The ideal is to have not only real user questions but also gold answers.

So even if you can "bypass" having references by using an LLM judge, having the correct answers for those questions is better. Then you can use the LLM to judge whether the output matches the gold answer, instead of asking it to judge the answer on its own.
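As a rough illustration, here is a minimal sketch of such a reference-based judge: it asks the model to compare the pipeline's answer against a gold answer and return a 0–1 score. The prompt wording, the `judge_against_gold` helper name, and the use of gpt-4o-mini as the judge are my own assumptions, not a fixed recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_against_gold(question: str, answer: str, gold_answer: str) -> float:
    """Hypothetical reference-based judge: 1.0 = matches the gold answer, 0.0 = misses or contradicts it."""
    prompt = (
        "You are grading a RAG system.\n"
        f"Question: {question}\n"
        f"Gold answer: {gold_answer}\n"
        f"System answer: {answer}\n"
        "Does the system answer convey the same information as the gold answer? "
        "Reply with a single number between 0 and 1, nothing else."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # same judge model used later in this article
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return float(resp.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat unparseable judge output as a failure to inspect later
```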

Sample size matters too. Too small and it may not be reliable. Too big and it's easy to miss the smaller problems.

If you have enough data, you can tag questions into topics, different wordings (pessimistic / typical phrasing), and different types (short / long / messy) so you can see what breaks where.
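If your per-question results are stored with those tags, the breakdown itself is trivial. A small sketch, assuming a JSONL results file and made-up column names (`topic`, `question_type`, plus the metric scores):

```python
import pandas as pd

# Hypothetical results file: one row per question with metric scores and tags.
df = pd.read_json("eval_results.jsonl", lines=True)

# Break the metrics down per tag to see what breaks where.
breakdown = (
    df.groupby(["topic", "question_type"])[["faithfulness", "answer_relevancy"]]
      .agg(["mean", "count"])
      .round(2)
)
print(breakdown.sort_values(("faithfulness", "mean")))
```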

I've heard recommendations to start with something like 200–1,000 real queries with gold answers if you want this to be a real evaluation setup.

Since this whole exercise is hypothetical, and the system has ingested documents to demo the idea of expanding to neighbors, the evals will use synthetically generated datasets. That makes them less reliable, but there are still learnings we can get from them.

Deciding on metrics & datasets

This section is about two things: which metrics I'm using to evaluate the pipeline, and how I'm using them across datasets to see if neighbor expansion seems to help.

First, if you haven't read about evals for LLM systems at all, go read this article. It gives you a taxonomy of the various metrics out there (RAG included).

Since I'm being lazy here, I wanted reference-free metrics, but this also limits what we can actually test. We can have the judge look at the context, the question, and the generated answer.

A few metrics that can help here are faithfulness (is the answer grounded in the provided context), answer relevancy (does it actually answer the question), context relevance (how much of the context is just noise), and hallucination (how many claims are actually backed up by the provided context).

Since we want to figure out whether seed expansion is helpful, without building two different pipelines, we can do one simple comparison: ask the judge to look at the seed chunks vs. the final expanded context and score how grounded the answer is in each, using the faithfulness metric.

If grounding improves when the judge sees the expanded context, that's at least evidence that the model is using the expanded chunks and that they're not just noise. We would need more testing, though, to say clearly which is the winner.
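A minimal sketch of that comparison, assuming each stored record keeps both the seed chunks and the full expanded context (the `faithfulness_score` helper and its prompt are illustrative stand-ins, not the exact judge used for the numbers below):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def faithfulness_score(question: str, answer: str, contexts: list[str]) -> float:
    """Hypothetical reference-free faithfulness judge: fraction of the answer supported by the contexts."""
    prompt = (
        "Given the context and the answer, estimate what fraction of the answer's claims "
        "are supported by the context. Reply with a single number between 0 and 1.\n\n"
        f"Question: {question}\n\nContext:\n" + "\n---\n".join(contexts) + f"\n\nAnswer:\n{answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(resp.choices[0].message.content.strip())

def compare_grounding(record: dict) -> tuple[float, float]:
    """record is one stored pipeline result with 'query', 'answer', 'seed_texts', 'full_contexts'."""
    seed = faithfulness_score(record["query"], record["answer"], record["seed_texts"])
    full = faithfulness_score(record["query"], record["answer"], record["full_contexts"])
    return seed, full  # compare seed-only vs. expanded-context grounding
```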

Finally, the datasets matter as much as the metrics.

If you've read the first article, you already know that all the ingested docs are scientific articles that mention RAG. So all the questions that we create here need to be about RAG.

I have generated three different datasets, each with a different RAG flavor.

The first is based on the ingested corpus, going through each scientific article and writing two questions that it can answer.

The second does the same, but provides messy user questions instead.

This messy user questions dataset could be good for testing the query optimizer from the first article (but I don't have those results for you here). Here it will tell us whether stating things differently skews the results.

The third dataset is based on 66 random RAG questions found online. This means these questions may not have answers in the corpus (the ingested RAG articles are only from September to October, so we don't know exactly what they contain).

So the first two will evaluate how well the pipeline behaves and whether it can answer questions about the documents it has, while the third one tells us what it's missing and how it behaves on questions it may not be able to answer.

Though this is a bit simplified, as the first questions may be structured around sections and the random ones may be better answered by seed chunks alone.
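To make the first, corpus-based dataset concrete, here is a minimal sketch of how it could be generated: loop over the ingested articles and ask an LLM for two questions each article can answer. The prompt, the truncation, and the helper names are illustrative assumptions, not the exact script behind my datasets.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def questions_for_article(article_text: str, n: int = 2) -> list[str]:
    """Ask the model for n questions this article can answer (hypothetical helper)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Here is a scientific article about RAG:\n\n{article_text[:8000]}\n\n"
                f"Write {n} questions that this article can answer. "
                "Return them as a JSON list of strings, nothing else."
            ),
        }],
        temperature=0.3,
    )
    # Assumes the model returns bare JSON; add parsing guards for production use.
    return json.loads(resp.choices[0].message.content)

def build_dataset(articles: list[str]) -> list[dict]:
    """articles: the ingested article texts loaded from your own corpus."""
    return [
        {"question": q, "source_article": i}
        for i, text in enumerate(articles)
        for q in questions_for_article(text)
    ]
```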

Running the evals

To run the evals, you first need to run the pipeline on every question, for every model, and store the results.

If you don't store everything you need, you can't debug later. You should be able to go from a low score back to the exact answer, the exact retrieved context, and the exact model settings.

I also wanted to compare models, because people assume "larger model = better answers," and that's not always true, especially for simpler tasks. So I'm running the same pipeline across GPT-5-mini, GPT-5.1, and GPT-5.2, for several datasets.
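A minimal sketch of that run-and-store loop, writing one JSON line per question and model so every score can later be traced back to the exact answer, contexts, and settings. The `run_pipeline` callable and the field names are assumptions standing in for the actual pipeline:

```python
import json
import time

MODELS = ["gpt-5-mini", "gpt-5.1", "gpt-5.2"]  # model names as used in this article

def run_and_store(questions: list[str], run_pipeline, out_path: str = "runs.jsonl") -> None:
    """run_pipeline(question, model) is assumed to return (answer, seed_texts, full_contexts)."""
    with open(out_path, "a") as f:
        for model in MODELS:
            for q in questions:
                answer, seed_texts, full_contexts = run_pipeline(q, model)
                record = {
                    "query": q,
                    "model": model,
                    "answer": answer,
                    "seed_texts": seed_texts,        # chunks before neighbor expansion
                    "full_contexts": full_contexts,  # context after neighbor expansion
                    "timestamp": time.time(),
                }
                f.write(json.dumps(record) + "\n")
```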

Once that's done, I build the eval layer on top of those stored outputs.

I used RAGAS for the standard metrics and DeepEval for the custom ones. You can obviously build it all manually, but it's much easier this way. I love how seamless DeepEval is, though it's harder to debug if you find issues with the judge later.

A few specifics: the pipeline runs with no context cap, the judge model is gpt-4o-mini, and we use n=3 for RAGAS and n=1 for the custom judges.
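Roughly, the eval layer can be wired up like the sketch below: RAGAS for the standard metrics and a DeepEval GEval judge for a custom one. The RAGAS calls follow the older 0.1-style API and may differ in newer versions, the example record is made up, and the GEval criteria string is only an illustrative stand-in for my actual hallucination judge.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# One stored record, heavily abbreviated for the example.
record = {
    "query": "What is neighbor expansion in RAG?",
    "answer": "It adds the chunks surrounding each retrieved seed chunk to the context.",
    "full_contexts": ["Neighbor expansion appends adjacent chunks to each retrieved seed chunk..."],
}

# Standard metrics via RAGAS (0.1-style API; requires OPENAI_API_KEY).
ds = Dataset.from_dict({
    "question": [record["query"]],
    "answer": [record["answer"]],
    "contexts": [record["full_contexts"]],
})
ragas_scores = evaluate(ds, metrics=[faithfulness, answer_relevancy]).to_pandas()

# A custom judge via DeepEval's GEval, using gpt-4o-mini as the judge model.
hallucination_judge = GEval(
    name="Hallucination",
    criteria="How many of the claims in the actual output are backed by the retrieval context?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.RETRIEVAL_CONTEXT],
    model="gpt-4o-mini",
)
case = LLMTestCase(
    input=record["query"],
    actual_output=record["answer"],
    retrieval_context=record["full_contexts"],
)
hallucination_judge.measure(case)
print(ragas_scores)
print(hallucination_judge.score, hallucination_judge.reason)
```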

Since neighbor expansion is the whole point of this pipeline, remember that we also run the check described earlier: for faithfulness, we score grounding against the seed chunks and against the full expanded context, to see if there is a difference.

Eval results across datasets & models

Let's run the evals for the different datasets, metrics, and models to see how the pipeline is doing and how we can interpret the results. Remember that you can find the full results here and here (especially if you dislike my childish sketches).

We can start with the results from the dataset generated from the corpus.

Remember that the actual table can be found here.

The table above shows the main RAGAS metrics. Faithfulness (does it stay grounded in the provided context) and answer relevancy (does it answer the question) are very high.

This is to be expected, as we're basically giving it questions that it should be able to answer with the documents. If these showed low numbers, there would be something severely off in the pipeline.

It also gives us back seed faithfulness, where the judge estimates how grounded the answer is in the seed chunks alone. This one is overall a lot lower than the full-context faithfulness, by 12–18 points across the different models.

In fewer words: we can say that the LLM is using some of the full context, not only the seed chunks, when generating its answer.

What we can't judge, though, is whether a seed-only answer would have been just as good. That would require us to run two pipelines and compare the same metrics and datasets for each.

Now let's look at the next metrics (for the same dataset).

Remember that the actual table can be found here.

I would have expected context relevance to decrease here, since it's the full context that pulls in up to 10 different chunk neighbors for a section.

A reason for this is likely that the generated questions are based on sections, which means the added context helps to answer them.

Structure citations (i.e., does it cite its claims correctly) looks alright, and the hallucination score is high, which is good (1 means no made-up claims in the answer).

You'll also see that the different models show very little difference in terms of performance.

Yes, this is quite a simple Q&A task. But it does demonstrate that the extra model size may not be needed for everything, and the added context expansion may be able to act as a buffer for the smaller models.

Now let's look at the results if we change the dataset to the messy user questions instead.

Remember that the actual table can be found here.

We see a few drops in points, but the scores still stay high, though without isolating the outliers we can't say why. Faithfulness does look lower when judging against only the seed chunks for the messy user questions, which is interesting.

Let's now turn to the third dataset, which will be able to tell us a lot more.

Remember that the actual table can be found here.

We see worse numbers across the board, which is of course expected: the ingested corpus probably can't answer all of these questions that well. This helps us point to where we have missing information.

Faithfulness still stays high for the full-context runs, though. Here the difference from the seed-only runs is a lot bigger, which means the added expansion is being used more in the answer.

Something that was strange here was how GPT-5.2 consistently did worse on answer relevancy across two different runs. This can be a metric thing, or it can be a model thing where it answers more cautiously than before and thus gets a lower score.

This also tells you why it's so important to test these new models on your own pipelines before adding them in.

Let's continue with the other metrics for the random dataset.

Remember that the actual table can be found here.

Context relevance is very low, so the judge thinks there is a lot of noise in the context that does not directly answer the question for this dataset.

The context relevance metric rewards retrieving a high proportion of directly relevant chunks, but seed+expand intentionally adds neighbor chunks (definitions, adjacent sections, captions) that should improve grounding. Unfortunately, some noise may come with that.

Both structure citations and hallucination score worse here, probably because it's harder to stay grounded if you're not provided any relevant information to use.

Now, this was a first build of this pipeline. Based on these results, you can improve it until you drive those numbers up, and maybe change the metrics to be more custom to your pipeline.

Both datasets show different things: the corpus-based one shows the pipeline works, while the random one shows there is missing information and a lot more noise to navigate around.

What to check

So what do you do now that you have all these scores? You can look at regressions, check the worst offenders, and do some random sampling.

Like I said before, this whole exercise is a bit like investigative work, which you'll love if you love working with data. If you do have enough data and you've tagged it, then by all means break it down, look at the individual numbers, and see what you find. But there are three checks you should always do.

First is regressions: check the stuff that used to pass and now fails after changes, as in the sketch below.
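A minimal sketch of that regression check, assuming both runs were stored as JSONL keyed by the question and using a made-up pass/fail cutoff and file names:

```python
import pandas as pd

THRESHOLD = 0.7  # made-up pass/fail cutoff for faithfulness

old = pd.read_json("runs_old.jsonl", lines=True).set_index("query")
new = pd.read_json("runs_new.jsonl", lines=True).set_index("query")

# Questions that passed before the change but fail after it.
merged = old.join(new, lsuffix="_old", rsuffix="_new")
regressions = merged[
    (merged["full_faithfulness_old"] >= THRESHOLD)
    & (merged["full_faithfulness_new"] < THRESHOLD)
]
print(regressions[["full_faithfulness_old", "full_faithfulness_new"]])
```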

Second is checking the worst offenders (the bottom ~5–10 data points) to see what is going on. It could be the metric, the judge, the retrieval, or the model.
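Pulling the worst offenders out of the stored results is a quick operation; a sketch, with the field names matching the records shown below and an assumed file name:

```python
import pandas as pd

df = pd.read_json("eval_results.jsonl", lines=True)

# Bottom 10 by full-context faithfulness, with the fields you need to debug them.
worst = df.nsmallest(10, "full_faithfulness")[
    ["query", "full_faithfulness", "seed_faithfulness", "answer_relevancy"]
]
print(worst.to_string(index=False))
```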

Below is one of the worst ones for full faithfulness and answer relevancy.

"query": "how often does entity linking mess up acronyms and lookalikes, and the way do i test that",
"answer": "The documents don't report a direct number for "how often" entity linking fails specifically on acronyms or look‑alike names.nnWhat we do see is that:n- Even when structure is correct, LLMs often link to the improper property or entity (semantic inaccuracies) in SPARQL generation over a scholarly KG [2]. This is similar failure mode you'd see when an acronym (e.g., NMF, ACE) is mapped to the improper node.n- In medical KGs, entity linking relies on exact keywords after which semantic similarity of embeddings when no exact match exists [3]. This last step explicitly allows linking to a merely "semantically related" concept quite than the precise one, which is where lookalikes and acronym collisions can occur [3].n- In conversational entity‑centric systems, the biggest single error category is coreference/entity confusion (27% of failures) [4], showing that confusing similar mentions is a serious practical problem even outside strict EL.nnHow to check this in your setting, using patterns from the papers:n1) Construct a targeted test setn   - Create a small benchmark that stresses acronyms and visually/lexically similar entities, analogous to how MATE is used to emphasize cross‑modal entity linking [1].n   - Include positive pairs (correct acronym→entity) and hard negatives (different entities sharing acronym or near‑an identical names).nn2) Run your full EL pipelinen   - Use the identical identifier / keyword / embedding‑based linking strategy described in MedKGent [3].nn3) Manually and/or LLM‑assisted evaluationn   - Have humans or an LLM rating whether each link is correct vs. incorrect, just like the triple‑validity rubric utilized in MedKGent [3] and the SPARQL error audit in FIRESPARQL [2].nn4) Analyze error typesn   - Break down errors just like the semantic vs. structural evaluation in [2] and the error table in [4], but with categories "acronym collision," "lookalike surface form," etc.nnThis offers you an empirical error rate for your individual stack, focused specifically on acronyms and lookalikes.",
"full_faithfulness": 0.24,
"seed_faithfulness": 0.8260869565000001,
"answer_relevancy": 0.0,
"context_relevance": 0.208549739206933,
"context_relevance_reason": "The context provided does indirectly address the user's query about how often entity linking messes up acronyms and lookalikes, nor does it offer methods for testing that. While it discusses entity linking and its evolution, it lacks specific information on the problems related to acronyms and lookalikes, which is the core of the user's inquiry.",
"hallucination_score": 0.6572611409640697,
"hallucination_reason": "The response accurately identifies that the documents don't provide a selected frequency for the way often entity linking fails with acronyms or lookalikes, which aligns with the input query. It also discusses relevant issues akin to semantic inaccuracies and coreference confusion, that are pertinent to the subject. Nevertheless, it lacks direct references to specific claims made within the context, akin to the restrictions of traditional EL methods or the role of tangible keywords in medical KGs, which could have strengthened the response further.",
"full_contexts": ["Entity LinkingnnEntity Linking (EL) has evolved from text-only methods to Multimodal Entity Linking (MEL), and more recently to Cross-Modal Entity Linking (CMEL), which supports crossmodal reasoning. Traditional EL methods associate textual entities with their corresponding entries in a knowledge base, but overlook non-textual information (Shen, Wang, and Han 2015; Shen et al. 2023). MEL extends EL by incorporating visual information as auxiliary attributes to enhance alignment between entities and knowledge base entries (Gan et al. 2021; Liu et al. 2024b; Song et al. 2024).", "However, MEL does not establish cross-modal relations beyond these auxiliary associations, thereby limiting genuine cross-modal interaction.", "CMEL goes further by treating visual content as entities-aligning visual entities with their textual counterparts-to construct MMKGs and facilitate explicit crossmodal inference (Yao et al. 2023). Research on CMEL remains in its early stages, lacking a unified theoretical framework and robust evaluation protocols. The MATE benchmark is introduced to assess CMEL performance, but its synthetic 3D scenes fall short in capturing the complexity and diversity of real-world images (Alonso et al. 2025). To bridge this gap, we construct a CMEL dataset featuring greater real-world complexity and propose a spectral clustering-based method for candidate entity generation to drive further advances in CMEL research.", "3 Error type analysis on generated SPARQL queriesnnDespite the improvements of LLMs on QA over SKGs, LLMs face limitations when handling KG-specific parsing. The experimental results conducted by Sören Auer et al.[2] showed that only 63 out of 100 handcrafted questions may very well be answered by ChatGPT, of which only 14 answers were correct. To higher understand why LLMs fail to generate the proper SPARQL query to a NLQ, we conduct a pilot experiment on using ChatGPT(GPT-4) with a random one-shot example to generate SPARQL queries for 30 handcrafted questions within the SciQA benchmark datasets.", "Insights from this pilot experiment revealed two major categories of errors LLMs are inclined to make on this task: semantic inaccuracies and structural inconsistencies. Semantic inaccuracies occur when LLMs fail to link the proper properties and entities in ORKG, despite generating SPARQL queries with correct structure. Our observations reveal that LLMs are inclined to depend on the instance provided within the one-shot learning process to generate the proper structure for a certain type", "of questions, but often struggle with linking the proper properties and entities because LLMs don't learn the content of the underlying KG. Structural inconsistencies arise as a consequence of LLMs' lack of ontological schema of the underlying KG, resulting in errors in query structure, akin to missing or abundant links (triples), despite appropriately linking to the mentioned entities or properties.", "Figure 1 shows the instance of semantic inaccuracies and structural inconsistencies problem with the generated SPARQL queries in our pilot study. In the instance of the semantic inaccuracies problem, ChatGPT did not link the proper property orkgp:P15687; as a substitute, it linked to a improper property orkgp:P7101. In the instance of the structural inconsistencies problem, the SPARQL query generated by ChatGPT directly links Contribution to Metrics, fails to detect the proper schema of the ORKG where Contribution and Metric are connected via Evaluation.", "Fig. 
1: Examples of semantic inaccuracies and structural inconsistencies problem with the generted SPARQL queriesnnSemantic inaccuracies ProblemnnFail to link the proper properties and entities in ORKGnnWhat is the utmost sample size?nnContribution Evaluation Metric P34 P2006 P7046nnStructural inconsistencies ProblemnnMake errors in query structure, akin to missing or abundant links (triples)nnWhat are the metrics utilized by paper "Using NMF-based text summarizationnnto improve supervised and unsupervised classification?nnorkgp:P15687 rdfs:label Sample size (n)nnorkgp:P7101 rdfs:label has elements", "2 Resultsn2.1 Method overviewnnas its confidence rating. For example, if the triple (NPPA, Negative Correlate, Water) appears in 90% of the outputs, its confidence rating is 0.9. Low-confidence triples (rating < 0.6) are filtered out, and only high-confidence triples are retained for downstream graph construction. Each triple can be annotated with the PubMed ID of the source abstract and a timestamp, ensuring traceability and source attribution. For instance, (NPPA, Negative Correlate, Water) would have a PubMed ID of 10494624 and a timestamp of 2000-01-01.", "As shown in Figure 1 c , for every retained triple, akin to (NPPA, Negative Correlate, Water), the Constructor Agent checks its presence in the present KG. If absent ( i.e. , either the top or tail entities are missing), it's inserted; if present, its confidence rating is updated in keeping with Equation (1). The associated PubMed ID is appended, and the timestamp is updated to reflect the most recent publication. For instance, if an existing triple (NPPA, Negative Correlate, Water) has a confidence rating of 0.7, PubMed ID 10691132, and timestamp 1999-12-31, and a brand new occurrence with a confidence rating of 0.9, PubMed ID 10494624, and timestamp 2000-01-01 is encountered, the updated triple can have a confidence rating of 0.97, PubMed IDs [10691132, 10494624], and a timestamp of 2000-01-01. If the top and tail entities are present however the relation differs, akin to existing (NPPA, Associate, Water) vs. incoming (NPPA, Negative Correlate, Water), only essentially the most appropriate relation is maintained. The Constructor Agent invokes the LLM to resolve the conflict by choosing the more suitable relation, considering each the present and incoming triple's confidence scores and timestamps. If the LLM selects the brand new triple, the present one is replaced; otherwise, no changes are made. The prompt design for relation conflict resolution is shown in Prolonged Data Figure 2 c . Together, the 2 agents extract structured medical facts and integrate them right into a dynamic, time-aware KG. See more details within the Section 4.", "2.2 Structural Characterization of the Knowledge GraphnnIn this section, we detail the structural characteristics of the medical KG we constructed, with an emphasis on the distribution of node types, relationship types, and the arrogance scores of relationship triples. We also present a visualization of a subgraph centered on COVID-19 for instance the graph's structure.", "Using the MedKGent framework, we extracted knowledge triples from the abstracts of 10,014,314 medical papers, with 3,472,524 abstracts (34.68%) yielding extractable triples. 
The relatively low extraction rate might be attributed to several aspects: first, some abstracts lacked sufficient structured information for triple extraction; second, only triples with a confidence rating exceeding 0.6 were retained, excluding those with lower confidence; and third, some triples extracted by LLMs contained formatting issues, akin to extraneous or irrelevant characters, which were discarded. In total, our Extractor Agent identified 8,922,152 valid triples from the abstracts. Nevertheless, the extracted triples contained a big variety of duplicates and conflicts. To resolve this, our Constructor Agent integrates the triples in chronological order. During this process, duplicates are merged, with the arrogance rating for every triple increasing in proportion to its frequency, reflecting greater certainty. For conflicting triples, where the identical entity pair is related to multiple relations, the Constructor Agent retains essentially the most appropriate relationship. Following this consolidation, the ultimate KG comprises 2,971,384 distinct triples.", "We conducted a comprehensive statistical evaluation of the ultimate constructed KG, which comprises 156,275 nodes. As shown in Figure 2 a , the node distribution is predominantly dominated by Gene and Chemical nodes, with smaller proportions of other entities akin to Disease, Variant, Species, and CellLine. The KG includes 2,971,384 relationship triples (edges), representing a spread of interactions between entities, as illustrated in Figure 2 b . Probably the most common relationship type is 'Associate', followed by 'Negative Correlate' and 'Positive Correlate', indicating strong associations between medical entities. Less frequent relationships, akin to 'Interact', 'Prevent', and 'Drug Interact', provide additional insights into the complexities of medical interactions. The distribution of confidence scores for these relationship triples, shown in Figure 2 c , with confidence values discretized to the closest smaller 0.05 increment (rounding all the way down to the closest multiple of 0.05), reveals a transparent dominance of high-confidence triples. A major proportion of triples exhibit confidence scores of 0.95, reflecting the cumulative increase in confidence resulting from the repetition of triples in the course of the graph construction process. This high-confidence distribution reinforces the reliability and robustness of the KG.", "We visualized a neighborhood subgraph of the constructed KG with COVID-19 because the central node, highlighting five surrounding relationship triples, as shown in Figure 2 d . Each node is characterised by six key attributes: the Identifier, which uniquely references the node and normalizes multiple synonymous mentions to a standardized terminology entry; the Entity Type, which classifies the entity; the Terminology, which maps the entity type to its corresponding standard terminology; the Page Link, providing a reference to the entity within the Terminology; the Exact Keywords, which lists common names and aliases of the entity in lowercase; and the Semantic Embedding, a vector representation of the entity. In practice, these attributes facilitate entity linking inside a question by matching entities to their corresponding nodes within the KG. When the Identifier of an entity within the query is offered, entity linking might be efficiently performed using this unique reference. 
Within the absence of an Identifier, precise matching", "Figure 2: A comprehensive statistical evaluation and visualization of the constructed KG, consisting of 156,275 nodes and a pair of,971,384 relationship edges. a . Node distribution inside the KG, with Gene and Chemical nodes predominating, and smaller proportions of Disease, Variant, Species, and CellLine. b . Relationship type distribution inside the KG, highlighting the prevalence of 'Associate' relationships, followed by 'Negative Correlate' and 'Positive Correlate', with less common interactions akin to 'Interact', 'Prevent', and 'Drug Interact'. c . The distribution of confidence scores for relationship triples, discretized to the closest smaller 0.05 increment, ensures values are rounded all the way down to the closest multiple of 0.05. This distribution reveals a transparent dominance of high-confidence triples, particularly those with scores of 0.95, underscoring the robustness of the KG. d . Local subgraph visualization centered on COVID-19, displaying five surrounding relationship triples. Each node is characterised by key attributes, including Identifier, Entity Type, Terminology, Page Link, Exact Keywords, and Semantic Embedding, facilitating efficient entity linking through exact or similarity matching. The relationships within the KG are further enriched by attributes akin to Confidence, PubMed IDs, and Timestamp, enhancing traceability, accuracy, and temporal relevance.nnCOVID -19 ACE2 Pneu- monia Lung Disea -ses MAD00 04J08 tociliz- umab Treat Identifier : MESH:C000718219 Entity Type : Chemical Terminology : NCBI MeSH Page Link", ": meshb.nlm.nih.gov/record/ui?ui=C000718219nnExact Keywords : [mad0004j08] Semantic Embedding : [- 0.12, …, 0.10 ] : MESH:D000086382nnEntity Type:nnDiseasenn: meshb.nlm.nih.gov/record/ui?ui=D000086382nn: [ncp, covid-19]n0.25, …, 0.09nnIdentifier:nnMESH:C502936nChemicalnnTerminology:nnNCBI MeSHnn: meshb.nlm.nih.gov/record/ui?ui=C502936nn: [mra, tocilizumab] 0.12, …, 0.13 Associate 59272 Genenn:nnNCBI Genenn: www.ncbi.nlm.nih.gov/gene/59272nn: [ace2, ace2p]n0.22, …, 0.09]nMESH:D011014nn: meshb.nlm.nih.gov/record/ui?ui=D011014nn: [pneumonia]n0.18, …, 0.01nMESH:D008171nn: meshb.nlm.nih.gov/record/ui?ui=D008171nn: [lung diseases,lung damage]nn: [ 0.06, …, 0.11 d a b Drug_Interact (0.1%) 0.70 0.65 'Prevent (0.79 0.75 7.89) (7.5%) 0.60 (8.1%) (5.4% (47.7%) 0.80 CellLine Positive (8.9%) (0.5%) Correlate 0.85 (19.9%) (10.3%) Variant (1.49) (5.9%) Cause (1.4% 0.90 (33.6%) Inhibit (1.2% Negative_Correlate Stimulate (0.5%) (13.7%) Species Compare (26.1%) Cotreat (1.0%)", "Figure 3: Comprehensive evaluation of extraction quality for relationship triples generated by the Extractor Agent. Systematic assessment of extraction accuracy using both automated evaluations by LLMs and independent manual expert review. a . Proportion of valid relationship triples (score ≥ 2.0) across relation types, as assessed by GPT4.1 on a randomly selected subset of 34,725 abstracts (83,438 triples). b . Proportion of valid relationship triples across relation types, as assessed by DeepSeek-v3 on the same subset. c . Validity rates from independent manual evaluation by three domain experts on a subset of 400 abstracts (1,060 triples), demonstrating high inter-expert consistency. d-f . Performance of GPT-4.1 and DeepSeek-v3 compared to three expert evaluations on the shared evaluation subset, reporting precision, recall, and F1 score. g . 
Pairwise inter-rater agreement between experts and LLMs quantified by Cohen's kappa coefficients, demonstrating substantial consistency across all evaluators.nnGPT-4.nnAutomated EvaluationnnDeepSeek-v3 Automated EvaluationnnManual Evaluation 0936| 0.0307 0,8875 0,8880 0 8700 0.7160 0.4nnExpert1's Evaluation as ReferencennExpert2's Evaluation as ReferencennExpert3's Evaluation as ReferencennPairvise Cohen's 0 9761 09761 0 0602 00760 0.9502 00537 0,9503 0 9440 0.5663 08143 0,8818 0 5446 0.6762 0,8853 0.5446 0.6906 06818 0.6008 0 6560 GPT-4,1 DeepSeek-v3 GPT-4.1 Correlale Corelate Cause Inhon Irhon Cotcat Inlatact Colrcat Kappa ison", "is achieved by checking whether the entity appears in the Exact Keywords list of a specific node. Alternatively, semantic vectors of the query entities can be compared with those in the KG to identify the most similar entities, enabling semantic similarity matching. This approach is particularly beneficial for entities with multiple names, ensuring accurate linking even when not all aliases are captured in the Exact Keywords list.", "The relationships between entities are characterized by three key attributes. Confidence reflects the reliability of the relationship, with higher values indicating greater certainty based on its frequency across multiple sources. The PubMed IDs attribute lists the PubMed identifiers of the papers from which the relationship is derived, enabling easy access to the original publications via the PubMed website 2 . If the relationship appears in multiple papers, all relevant PubMed IDs are included, further increasing the confidence score. Finally, Timestamp denotes the most recent occurrence of the relationship, specifically the publication date of the latest paper. Notably, while Timestamp captures only the latest appearance, the full temporal span of the relationship-including its earliest mention-can be readily retrieved through the associated PubMed IDs via the PubMed website. These attributes collectively enhance the traceability, accuracy, and temporal relevance of the relationships within the KG.", "4 Methodsn4.2.2 Constructor AgentnnA chemical/drug treats a disease. The Treat relationship typically occurs between Chemical and Disease.nnMeSH (Medical Subject Headings)nndbSNP, otherwise HGNV formatnnNCBI TaxonomynCell LinenCellosaurusnnYour task is to select the most appropriate relationnnbetween two medical entities to form morennreasonable knowledge triple.nnThere is an and Now, a new between e1 andnne2 is proposed.nnPlease decide which relation should be retainednnbetween e1 and e2.nnIf r1 should be kept, respond with "Y".nnIf r2 should replace it, respond with "N".nnYou may consider the following two factors to assistnnyour decision:nn(1) Then, andnthat ofnn;nn(2) ThenfornnIn general, relations withnnhigher confidence scores or more recent timestamps are likelynnretained.nnYour output should contain only "Y" or "N". Do notnnprovide any explanations.nnOutput:nnc", "Extended Data Figure 2: a . Prompt template for relation extraction. Given a biomedical abstract and its extracted entities, the Extractor Agent prompts the LLM to infer semantic relations between entity pairs using a predefined relation set and textual descriptions. b . Reference terminologies for entity normalization. Each biomedical entity type is mapped to a standard terminology: Gene (NCBI Gene), Disease and Chemical (MeSH), Variant (dbSNP or HGNV), Species (NCBI Taxonomy), and Cell Line (Cellosaurus). c . Prompt design for relation conflict resolution. 
When conflicting relations exist between the same entity pair, the Constructor Agent prompts the LLM to select the most appropriate one based on confidence scores and timestamps. d . Schema for predefined relation types. The 12 core relation types-seven bidirectional and five unidirectional-are listed alongside their directionality, descriptions, and allowed entity-type combinations.", "4.3 Quality AssessmentnnWe assessed the quality of relational triples extracted by the Extractor Agent through both automated and manual evaluations, leveraging two state-of-the-art LLMs-GPT-4.1 [74] and DeepSeek-v3 [75]-as well as three PhD students with interdisciplinary expertise in medicine and computer science. For every medical abstract and its corresponding set of extracted triples, individual triples were evaluated using a standardized four-level scoring rubric: 3.0 (Correct), 2.0 (Likely Correct), 1.0 (Likely Incorrect), and 0.0 (Incorrect). The evaluation prompt provided to each LLMs and human annotators is illustrated in Prolonged Data Figure 3 a .", "A relational triple was defined as valid if it received a rating of ≥ 2 . 0 . The validity rate was calculated as:nnTo assess the reliability of automatic evaluation, we compared LLM-based assessments with human annotations on a shared evaluation subset, treating human judgments as ground truth. The precision, recall, and F 1 -score of the automated evaluations were computed as:nnwhere TP, FP, and FN represent true positives, false positives, and false negatives, respectively. To further quantify inter-rater agreement, we calculated Cohen's Kappa coefficient [82] for every pair of evaluators, including each LLMs and human annotators, leading to 10 pairwise comparisons across the five raters. The Kappa coefficient was computed as:nnwhere p 0 represents the observed agreement and p e denotes the expected agreement by likelihood. This evaluation provides a quantitative measure of rating consistency across evaluators.", "4.4 Retrieval-Augmented GenerationnnThe constructed KG serves as a reliable external source for information retrieval and might be integrated into LLMs via a RAG framework. By providing structured biomedical context, the KG enhances LLM performance across a spread of medical QA benchmarks.", "Given a user query q , we first extract the set of medical entities present within the query, denoted as E q = { e q 1 , e q 2 , · · · } . When using PubTator3 [80]-the same entity recognition tool employed during KG constructioneach extracted entity is assigned a novel identifier. This permits for efficient entity linking by matching these identifiers to the corresponding nodes N q = { n q 1 , n q 2 , · · · } inside the graph. Alternatively, if medical entities are extracted using other methods-such as prompting a LLM-they may lack standardized identifiers. In such cases, the extracted entity mentions are first converted to lowercase and matched against the Exact Keywords attribute of every node within the KG. A successful match enables linkage of the entity to the corresponding graph node. In each approaches, if an entity can't be linked via its identifier or if its surface form doesn't appear in any node's Exact Keywords list, we apply a semantic similarity strategy to finish the entity linking process. Specifically, the embedding of the query entity is computed using the identical model employed for generating node-level semantic representations ( i.e. 
, BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext [81]) and is compared against the Semantic Embedding of all nodes within the KG. The entity is then linked to the node with the very best semantic similarity rating, which can correspond to either the precise concept or a semantically related medical entity. This entity linking framework-combining identifier-based matching, lexical normalization, and semantic embedding-ensures robust and versatile integration of KG-derived knowledge into downstream QA tasks.", "Following entity linking, we construct evidence subgraphs using a neighbor-based exploration strategy [86] to boost the reasoning capabilities of LLMs. For every entity-linked node within the query-specific set N q , we retrieve its one-hop neighbors inside the KG. Specifically, for every node n q i ∈ N q , all adjoining nodes n q ′ i are identified, and the corresponding triples ( n q i , r, n q ′ i ) are appended to form a localized subgraph G q i . This expansion captures the immediate relational context surrounding the query entities, which is important for enabling fine-grained medical reasoning. The entire evidence set for a given query is then defined because the union of those localized subgraphs: G q = { G q 1 , G q 2 , · · · } . The resulting subgraph G q may contain numerous relational triples, including redundant or irrelevant information, which might adversely impact LLM reasoning [87]. To handle this, we leverage the LLM's inherent rating capability to selectively filter high-value knowledge [88]. Given the query q and", "You might be tasked with evaluating the validity of the knowledge triples extracted from the abstract of a medical paper.nnGiven the abstract (nn) of a medical paper and the extracted triplesnn) from this abstract.nnEach triple is represented within the format:nn"Head Entity Name (Alias1, Alias2) | Relationship Name | Tail Entity Name (Alias1, Alias2)"nn,nnwith triples separated by ' $ '.", "Some entities may haven't any aliases or multiple aliases, that are separated by ', ' inside the '()'.nnYour task is to guage the validity of every triple, with a selected give attention to thennrelationshipnnit describes, based on the knowledge provided within the abstract. Consider whether the stated relationship accuratelynnreflects the connection between the top and tail entities as presented or implied within the text.", "For every triple, evaluate its validity using the next scoring scale and assign a confidence rating:nn•nnCorrect (3.0):nnThe relationship logically and accurately describes the relation between the top and tail entities asnnexplicitly mentioned or directly and strongly supportednnby the abstract. Thennrelationship type isnprecisennand the connection isnnundeniablennbased on the text, requiring minimal inference.nnLikely Correct (2.0):nnThe relationship isnngenerally acceptable and directionally correctnn. 
The core connection between the entities isnnvalid and supported by the text (explicitly, implicitly, or viannreasonable inference)nn, even when the connection type hasnnminor inaccuracies or lacks ideal precisionnn.nnLikely Incorrect (1.0):nnsubstantially inaccurate or misleadingnnsignificantly misrepresentingnnthe connection described within the abstract, even when the entities are mentioned together.nnIncorrect (0.0):nnnot supported by the abstract whatsoevernn, isnnclearly and undeniably contradictednnby the text, or involves annfundamental misunderstandingnnof the entities or theirnnconnection as presented.nnOutput the evaluation in a set format:nnFirst line: 'Evaluation: ' followed by the evaluation of all triples, separated by '; '. Each triple's evaluation should explainnnwhynnthe specific confidence rating (3.0, 2.0, 1.0, or 0.0) was assigned based on the criteriannabove and the abstract's content.", "Second line: Only the numerical confidence scores for all triples, separated by ' $ ', in the identical order because the input triples (e.g., 3.0 $ 2.0 $ 1.0 $ 0.0). This line must contain only numbers (formatted to onenndecimal places like 3.0, 2.0, 1.0, 0.0), decimal points, and ' $ ' as separator, with no additional text or English letters.", "5 Resultsn5.1 Fundamental Resultsnn| | Model | FR (%) | DC (%) | UCS (/5) |n|---:|:-------------------|:-----------|:-----------|:-----------|n| 0 | Stateless LLM | 54.1 (0.4) | 48.3 (0.5) | 2.1 (0.1) |n| 1 | Vector RAG | 71.6 (0.6) | 66.4 (0.7) | 3.4 (0.1) |n| 2 | Entity-RAG | 75.9 (0.5) | 72.2 (0.6) | 3.7 (0.1) |n| 3 | Semantic Anchoring | 83.5 (0.3) | 80.8 (0.4) | 4.3 (0.1) |nnTable 1: Overall performance on MultiWOZ-Long. Semantic Anchoring outperforms all baselines across metrics. Improvements in FR and DC are statistically significant at p < 0 . 01 ; UCS gains are significant at p < 0 . 05 . Values are mean ± stdev over three runs.", "Figure 2 analyzes how performance varies with session depth. While all models degrade as dialogue span increases, Semantic Anchoring sustains over 75% recall at 10 sessions, indicating stronger long-range tracking.", "5.2 Per-Dataset BreakdownnnTo test generality, we evaluate on DialogRE-L , which emphasizes relation extraction across sessions. Results in Table 2 show consistent improvements, though broader domains are needed to claim robustness.", "Figure 2: Factual Recall by session depth on MultiWOZ-Long. Semantic Anchoring exhibits the slowest degradation, maintaining > 75% recall at 10-session distance. Error bars denote standard deviation across three runs.nnFactual Recall vs. Session Depth (MultiWOZ-Long)nnStateless LLM Vector RAG Entity-RAG Semantic Anchoring Session Depthnn|---:|:-------------------|---------:|---------:|-----------:|n| 0 | Stateless LLM | 49.8 | 44.1 | 2 |n| 1 | Vector RAG | 68.7 | 62.5 | 3.2 |n| 2 | Entity-RAG | 72.1 | 68.3 | 3.6 |n| 3 | Semantic Anchoring | 81.4 | 77.9 | 4.2 |nnTable 2: Performance on DialogRE-L. Semantic Anchoring achieves consistent gains across metrics, suggesting effectiveness in relation extraction tasks that require long-range entity tracking.", "5.3 Ablation StudiesnnTable 3 examines the role of linguistic components. Removing discourse tagging reduces FR by 4.7 points, while excluding coreference resolution reduces DC by 6.2 points. Eliminating all symbolic features collapses performance to Vector RAG levels. 
These results align with observed error patterns (§5.6), underscoring the worth of symbolic features.", "5.4 Qualitative ExamplesnnIn MultiWOZ-Long, when the user later asks 'Did he confirm the time for the taxi?' , Semantic Anchoring retrieves:nn[Entity: John Smith][CorefID: E17] confirmed the taxi is booked for 9 AM.", "In contrast, Vector RAG surfaces unrelated mentions of 'taxi.' Additional examples, including cases where Semantic Anchoring fails, are shown in Appendix C.", "| | Variant | FR (%) | DC (%) | UCS (/5) |n|---:|:-------------------------|---------:|---------:|-----------:|n| 0 | Full Model | 83.5 | 80.8 | 4.3 |n| 1 | - Discourse Tagging | 78.8 | 75.6 | 4 |n| 2 | - Coreference Resolution | 80.1 | 74.6 | 4.1 |n| 3 | - Dependency Parsing | 81.2 | 78.5 | 4.1 |n| 4 | Dense-only (Vector RAG) | 71.6 | 66.4 | 3.4 |nnTable 3: Ablation results on MultiWOZ-Long. Removing discourse or coreference modules significantly reduces FR and DC, respectively. Without all symbolic features, performance falls to the dense-only baseline.", "5.5 Human EvaluationnnFive trained annotators rated 50 randomly sampled conversations for User Continuity Satisfaction (UCS). Agreement was high ( α = 0 . 81 ). As Table 1 shows, Semantic Anchoring achieves the very best UCS (4.3), with annotators noting higher consistency in entity references. Full protocol details are in Appendix B.", "5.6 Error AnalysisnnTable 4 categorizes common failures. Coreference mistakes (27%) and parsing errors (19%) are essentially the most frequent, consistent with ablation findings. Discourse mislabeling (15%) often arises in sarcasm or overlapping speech. While overall error frequency is lower than dense retrieval, these remain open challenges.", "| | Error Type | Proportion of Failures |n|---:|:----------------------|:-------------------------|n| 0 | Parsing errors | 19% |n| 1 | Coreference mistakes | 27% |n| 2 | Discourse mislabeling | 15% |n| 3 | Other / miscellaneous | 39% |nnTable 4: Error evaluation on MultiWOZ-Long. Coreference mistakes are essentially the most frequent error type, followed by parsing and discourse issues. These patterns align with ablation results."],
"seed_texts": ["Entity LinkingnnEntity Linking (EL) has evolved from text-only methods to Multimodal Entity Linking (MEL), and more recently to Cross-Modal Entity Linking (CMEL), which supports crossmodal reasoning. Traditional EL methods associate textual entities with their corresponding entries in a knowledge base, but overlook non-textual information (Shen, Wang, and Han 2015; Shen et al. 2023). MEL extends EL by incorporating visual information as auxiliary attributes to enhance alignment between entities and knowledge base entries (Gan et al. 2021; Liu et al. 2024b; Song et al. 2024).", "Insights from this pilot experiment revealed two major categories of errors LLMs tend to make in this task: semantic inaccuracies and structural inconsistencies. Semantic inaccuracies occur when LLMs fail to link the correct properties and entities in ORKG, despite generating SPARQL queries with correct structure. Our observations reveal that LLMs tend to rely on the example provided in the one-shot learning process to generate the correct structure for a certain type", "We visualized a local subgraph of the constructed KG with COVID-19 as the central node, highlighting five surrounding relationship triples, as shown in Figure 2 d . Each node is characterized by six key attributes: the Identifier, which uniquely references the node and normalizes multiple synonymous mentions to a standardized terminology entry; the Entity Type, which classifies the entity; the Terminology, which maps the entity type to its corresponding standard terminology; the Page Link, providing a reference to the entity in the Terminology; the Exact Keywords, which lists common names and aliases of the entity in lowercase; and the Semantic Embedding, a vector representation of the entity. In practice, these attributes facilitate entity linking within a query by matching entities to their corresponding nodes in the KG. When the Identifier of an entity in the query is available, entity linking can be efficiently performed using this unique reference. In the absence of an Identifier, precise matching", "Given a user query q , we first extract the set of medical entities present in the question, denoted as E q = { e q 1 , e q 2 , · · · } . When using PubTator3 [80]-the same entity recognition tool employed during KG constructioneach extracted entity is assigned a novel identifier. This permits for efficient entity linking by matching these identifiers to the corresponding nodes N q = { n q 1 , n q 2 , · · · } inside the graph. Alternatively, if medical entities are extracted using other methods-such as prompting a LLM-they may lack standardized identifiers. In such cases, the extracted entity mentions are first converted to lowercase and matched against the Exact Keywords attribute of every node within the KG. A successful match enables linkage of the entity to the corresponding graph node. In each approaches, if an entity can't be linked via its identifier or if its surface form doesn't appear in any node's Exact Keywords list, we apply a semantic similarity strategy to finish the entity linking process. Specifically, the embedding of the query entity is computed using the identical model employed for generating node-level semantic representations ( i.e. , BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext [81]) and is compared against the Semantic Embedding of all nodes within the KG. 
The entity is then linked to the node with the very best semantic similarity rating, which can correspond to either the precise concept or a semantically related medical entity. This entity linking framework-combining identifier-based matching, lexical normalization, and semantic embedding-ensures robust and versatile integration of KG-derived knowledge into downstream QA tasks.", "5.6 Error AnalysisnnTable 4 categorizes common failures. Coreference mistakes (27%) and parsing errors (19%) are essentially the most frequent, consistent with ablation findings. Discourse mislabeling (15%) often arises in sarcasm or overlapping speech. While overall error frequency is lower than dense retrieval, these remain open challenges."]

Here we can see that the corpus didn't actually contain direct evidence for what was asked, but the seed chunks were more contained, so faithfulness got a higher score when only the seed chunks were provided.

Answer relevancy was punished unusually hard for not being able to answer the question directly.

This one was an outlier, though; usually full faithfulness is higher than seed faithfulness.

You can also check a good one to understand how it scores the ones that pass.

"query": "What retrieval settings help reduce needle-in-a-haystack issues for medical QA?",
"answer": "To cut back "needle‑in‑a‑haystack" problems in medical QA, the retrieval setup should give attention to bringing back a small, very focused set of highly relevant items as a substitute of enormous, noisy context.nn1. Control context size and granularityn- Use abstracts or snippets quite than full articles; long, unfocused context diluted model performance in biomedical QA, so reranking all the way down to a smaller, high‑quality subset was needed for good answers [1].n- Limit the variety of documents passed to the model (e.g., rerank from 1000's all the way down to a number of hundred or less) to avoid disorienting the model with an excessive amount of text [1].nn2. Use multi‑stage retrievaln- Start with high‑recall lexical retrieval (e.g., BM25 over PubMed) to get a broad set of candidates (as much as 1000's) [1].n- Apply a robust semantic reranker to choose only the highest slice for RAG (e.g., top 300) [1].n- If evidence is missing, fall back to broader web‑or multi‑source search over trusted medical sites to fill gaps, quite than over‑stuffing the primary pass with every little thing [1].nn3. Improve query qualityn- Use an LLM to rewrite or expand queries (synonyms, related terms, less rare jargon) when initial recall is poor, boosting the prospect that the important thing paper is retrieved without over‑expanding every query by default [1].nn4. Use structured, corpus‑aware retrievaln- Retrieve from structured medical knowledge sources akin to knowledge graphs and prepend only essentially the most relevant subgraph, which consistently improved accuracy across several medical QA benchmarks [3].n- In multimodal/heterogeneous setups, use corpus‑specific retrieval for various knowledge sources so each corpus is searched with methods tuned to its content, as a substitute of 1 generic retriever over every little thing [2].",
"full_faithfulness": 1.0,
"seed_faithfulness": 0.8636363636000001,
"answer_relevancy": 0.9135841092,
"context_relevance": 0.8976322813389481,
"context_relevance_reason": "The context passages provide comprehensive insights into retrieval settings that may mitigate needle-in-a-haystack issues in medical QA. Specifically, the discussion on the combination of LLMs for information retrieval, using semantic reranking, and the multi-stage retrieval approach directly addresses the user's query. The emphasis on maintaining relevance while expanding query coverage and the mention of ensemble methods highlight effective strategies for improving retrieval accuracy in complex biomedical queries. Nevertheless, while the knowledge is extremely relevant, a more explicit connection to specific 'needle-in-a-haystack' challenges could enhance clarity.",
"hallucination_score": 0.8893376167284271,
"full_contexts": ["AbstractnnBiomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.", "3. Our methodologynn3.1. Information Retrieval PipelinennTo support high-quality RAG for Phase A+, we developed an IR pipeline that integrates traditional lexical search with LLM-based query generation and semantic reranking (Fig. 1).", "If the initial query returns fewer than five documents, we invoke Gemini 2.5 Pro Preview (05-06) to automatically revise the query. The model is prompted to enhance retrieval recall by enabling approximate matching and omitting overly rare or domain-specific terms. This refinement step is done to improve the query coverage while maintaining relevance. Our experiments have shown that this process is required in less than 5% of the queries in the BioASQ 13 test set.", "We index all PubMed article titles and abstracts in an Elasticsearch instance, using BM25 retrieval as the ranking function. For each input question, we use Gemini 2.0 Flash to generate a structured Elasticsearch query that captures the semantic intent of the question using synonyms, related terms, and full boolean query string syntax rules supported by Elasticsearch. This query is validated using regular expressions and then is used to retrieve up to 10,000 documents.", "Following document retrieval, we apply a semantic reranking model (Google semantic-ranker-default004) to reduce the number of candidate documents [11]. This model re-scores the initially retrieved documents based on semantic similarity to the unique query, allowing us to pick out the highest 300 most relevant documents. This reranked subset is used for downstream RAG-based QA, since despite really long context supported by modern Transformer architectures [12, 13], we couldn't get adequate QA results on full article abstracts without this step.", "Finally, we have now added additional IR searches to handle the cases where a QA step doesn't return a response based on the evidence retrieved from Elasticsearch. 
We have now observed that Elasticsearch context may not provide sufficient evidence for QA in 3-7% of test cases for Phase A+, depending on the batch. An automatic process is used to expand IR sources to handle these cases. First, we're using a Google search restricted to PubMed sources to aim to search out latest matches. If that fails, we extend our sources to incorporate Home of the Office of Health Promotion and Disease Prevention, WebMD,nnThis multi-stage retrieval approach, combining LLM-generated queries, a standard BM25 search, and semantic reranking, enables flexible, high-recall, and high-precision document selection tailored to complex biomedical queries.", "Figure 1: IR processnnPubMed corpus in Elasticsearch Query Generation (Gemini 2.0 Flash) Query Valida- tion and IR (BM25, ≤ 10,000 docs) Results < Refinement 2.5 Pro) Reranking (semantic- reranker-4) Top 300 Articles for RAG No Yes RefinennHealthline, and Wikipedia. This ensures that we have now a solution candidate for all questions in Phase A+ test sets.", "3.2. Query Answering PipelinennWe adopt a unified, zero-shot QA framework for each Phase A+ and Phase B of the challenge. While the core QA procedure stays consistent across phases, Phase A+ incorporates a further IR step to confirm the presence of candidate answers inside relevant documents (described at the top of Section 3.1). This ensures that chosen documents contain sufficient information to support answer generation.", "To generate candidate answers, we leverage several large language models (LLMs): Gemini 2.0 Flash, Gemini 2.5 Flash Preview (2025-04-17), and Claude 3.7 Sonnet (2025-02-19). Prompts are adjusted using examples derived from the BioASQ 11 test set, improving the response structure and quality.", "The system uses zero-shot prompting, tailored to the query type: Yes/No, Factoid, or List. We experiment with multiple forms of input context: (1) IR-derived results from Phase A+, (2) curated snippets provided in Phase B, and (3) full abstracts of articles chosen during Phase B. This permits us to look at the influence of context granularity on answer accuracy and completeness.", "To consolidate candidate answers, we perform a secondary synthesis step using Gemini 2.0 Flash. This model is prompted to resolve any contradictions, select essentially the most precise and specific answer components, and integrate complementary information right into a single, unified response. As a part of this step, the model also returns a confidence rating estimating the reliability of the synthesized answer. If the rating is below a predefined threshold (0.5, determined empirically), the synthesis is re-run with reduced sampling temperature (from 0.1 to 0.0) to enhance determinism. 
This synthesis process is evaluated using the BioASQ 12 dataset to make sure consistency with benchmark standards.", "Table 1nnResults of our runs on BioASQ 13 Phase A+, Yes/No questions.", "| | Batch | System | Accuracy | Rating |n|---:|:--------|:------------------|-----------:|----------:|n| 0 | 3 | Extractive | 0.73 | 41 |n| 1 | | (last) | 0.23 | 58 |n| 2 | 4 | Extractive | 0.92 | 1 |n| 3 | | Easy truncation | 0.88 | 11 |n| 4 | | Kmeans | 0.65 | 67 |n| 5 | | (last) | 0.65 | 67 |nnTable 2nnResults of our runs on BioASQ 13 Phase A+, Factoid questions.", "| | Batch | System | MRR | Rating |n|---:|:--------|:------------------|------:|----------:|n| 0 | 3 | Extractive | 0.14 | 41 |n| 1 | | (last) | 0.05 | 47 |n| 2 | 4 | Extractive | 0.43 | 17 |n| 3 | | Easy truncation | 0.29 | 51 |n| 4 | | Kmeans | 0.05 | 62 |n| 5 | | (last) | 0.05 | 62 |", "2 Related WorknnMedical Report Retrieval for Generation. Existing Medical MMRAG approaches primarily utilize the medical images to retrieve relevant reports (He et al. 2024; Sun et al. 2025; Xia et al. 2024, 2025). For example, FactMM-RAG (Sun et al. 2025) enhances report generation by incorporating high-quality reference reports. Similarly, RULE (Xia et al. 2024) and MMed-RAG (Xia et al. 2025) integrate reference reports and employ preference fine-tuning to enhance model utilization of retrieved reports. Although these approaches improve the factual accuracy of responses, they neglect the retrieval of medical documents, that are crucial for Med-LVLM's reliable inference.", "Medical Document Retrieval for Generation. Acknowledging the restrictions of report-only retrieval, recent studies have increasingly emphasized medical documents as knowledge sources (Choi et al. 2025; Shaaban et al. 2025; Wu et al. 2025; Hamza et al. 2025). Amongst them, MKGF (Wu et al. 2025) and K-LLaVA (Hamza et al. 2025) each employ multimodal retrievers to fetch documents from the database, aiming to mitigate hallucination issues in language models. ChatCAD+ (Zhao et al. 2024b) and MIRA (Wang et al. 2025) utilize a zero-shot query rewriting module for retrieval. Nevertheless, these retrieval methods overlook the substantial content differences amongst various corpora, lacking corpus-specific retrieval mechanisms.", "6 ConclusionnnThis work addresses the critical challenges of effective retrieval and multi-aspect alignment for heterogeneous knowledge within the Medical MMRAG field. MedAtlas provides a wealthy, multi-source knowledge base for medical multimodal tasks. The HeteroRAG framework enables precise report retrieval and multi-corpus retrieval, followed by aligning heterogeneous retrieval results through Heterogeneous Knowledge Preference Tuning. Extensive experiments exhibit that our framework achieves state-of-the-art performance across multiple medical VQA and report generation benchmarks. Our work paves the way in which for effectively integrating multi-source medical knowledge, advancing the reliability and applicability of Med-LVLMs in clinical scenarios.", "2 Resultsnn2.3 Quality Assessment of Extracted Relationship TriplesnnFor automated evaluation, two state-of-the-art LLMs, GPT-4.1 [74] and DeepSeek-v3 [75], were employed. A random subset comprising 1% of the abstracts (n = 34,725), leading to 83,438 extracted triples, was chosen for evaluation. 
Each abstract and its corresponding triples were formatted into structured prompts and independently assessed by each models in keeping with a standardized four-tier rubric: Correct (3.0), Likely Correct (2.0), Likely Incorrect (1.0), and Incorrect (0.0) (the particular evaluation prompt is illustrated in Prolonged Data Figure 3 a ). Triples receiving scores of ≥ 2 . 0 were deemed valid. The evaluation outcomes are presented in Figure 3 a and b , illustrating the proportion of valid triples across relation types for GPT-4.1 and DeepSeek-v3, respectively. Each models demonstrated high overall accuracy, with 85.44% and 88.10% of triples rated as valid bynn2 https://pubmed.ncbi.nlm.nih.gov/", "GPT-4.1 and DeepSeek-v3, respectively. For many relation types, validity was roughly 90%, aside from Negative Correlate, which exhibited barely lower agreement. These findings underscore the high precision of the Extractor Agent across diverse biomedical relation types and support its utility for downstream analyses.", "In parallel, a manual evaluation was conducted to further validate extraction accuracy. Three domain experts with doctoral-level training in artificial intelligence and medicine independently reviewed a randomly chosen subset of 400 abstracts, comprising 1,060 extracted triples. Each abstract and its associated triples were evaluated using the identical standardized scoring rubric. Triples receiving scores of ≥ 2.0 were considered valid. As shown in Figure 3 c , all three reviewers demonstrated high consistency, with overall validity rates exceeding 86% across assessors. The close concordance between manual and automatic evaluations further substantiates the robustness of the Extractor Agent in accurately capturing biomedical relationships, providing strong support for the appliance of the extracted knowledge in large-scale medical analyses.", "To further validate the reliability of the LLM-based assessments, we used three expert annotations as reference standards to guage GPT-4.1 and DeepSeek-v3 on the identical subset of 400 abstracts, respectively. As shown in Figure 3 d -f , each models exhibited strong concordance with expert evaluations, achieving precision, recall, and F1 scores of roughly 95% across metrics. These results further corroborate the accuracy of the automated scoring framework and its alignment with expert judgment.", "Finally, inter-rater agreement was assessed across all evaluators-including three human experts and two LLMs-by computing pairwise Cohen's kappa coefficients on a shared evaluation subset (Figure 3 g ) [82]. Most pairwise comparisons (80%) yielded kappa values exceeding 0.6, indicating substantial agreement-an accepted threshold for reliable concordance in domains involving subjective judgment, including medicine, psychology, and natural language processing [83]. The coefficients between expert 1 and expert 2 (0.5663), and between expert 2 and expert 3 (0.5446), fell barely below this threshold but still reflected moderate agreement, closely approaching the substantial range. These findings exhibit strong inter-rater reliability across each human and automatic evaluators, underscoring the robustness and reproducibility of the evaluation framework.", "2.4 Evaluating Downstream Utility in Medical Query AnsweringnnWe evaluated the downstream utility of our constructed KG as a RAG information source across seven multiplechoice medical QA datasets. 
These included 4 widely used benchmarks [76]-MMLU-Med, MedQA-US, PubMedQA*, and BioASQ-Y/N-spanning a broad spectrum of clinical and biomedical reasoning tasks. To further assess diagnostic reasoning under various complexity, we introduce MedDDx, a newly developed benchmark suite focused on differential diagnosis [77]. Questions are stratified into three levels-MedDDx-Basic, MedDDxIntermediate, and MedDDx-Expert-based on the variance in semantic similarity amongst answer decisions. All MedDDx subsets were designed to cut back training data leakage and more closely reflect authentic clinical reasoning. Detailed dataset statistics are shown in Figure 4 a . We systematically evaluated five state-of-the-art LLMs to measure the impact of KG-based retrieval. Each model was tested in a zero-shot setting under two conditions: (1) direct answering using internal knowledge alone, and (2) RAG, with relevant KG subgraphs prepended as external context. The models-GPT-4-turbo, GPT-3.5-turbo (OpenAI) [78], DeepSeek-v3 (DeepSeek) [75], Qwen-Max, and Qwen-Plus (Qwen) [79]-span diverse architectures and training regimes, representing each proprietary and open-source systems. All models were accessed via publicly available APIs without additional fine-tuning. Version details and access endpoints are summarized in Figure 4 b .", "Figures 4 c -i present model performance across the seven medical QA datasets using radar plots, each depicting the five LLMs under each direct answering (w/o RAG) and RAG conditions (w/ RAG). Notably, the background shading within the radar plots is lighter for the MedDDx suite (Figure 4 g -i ) than for the 4 widely used benchmarks (Figure 4 c -f ), reflecting the general lower accuracy of all models on these recently introduced and semantically more difficult datasets. This contrast highlights the greater complexity and reduced risk of coaching data leakage inherent to the MedDDx design. Across all datasets, RAG with our KG consistently outperformed direct answering. Probably the most substantial improvements were observed in tasks requiring deeper clinical reasoning, akin to MedQA-US and the MedDDx suite. For instance, on MedQA-US, GPT-3.5-turbo improved from 0.5986 to 0.6834 (+8.5 percentage points), and Qwen-Max from 0.7306 to 0.7636. On MedDDx-Expert, RAG yielded absolute gains of as much as +8.6 points for GPT-3.5-turbo and +5.7 points for Qwen-Max. Even in knowledge-intensive but semantically simpler tasks akin to MMLU-Med and BioASQ-Y/N, RAG offered modest yet consistent advantages. On MMLU-Med, GPT-4-turbo improved from 0.8724 to 0.9054, while DeepSeek-v3 achieved the very best rating overall at 0.9183 with KG support. In BioASQ-Y/N, RAG further enhanced already strong performance, with 4 models exceeding 0.85 accuracy following augmentation. Notably, several models performed higher on MedDDx-Expert than on MedDDx-Basic, despite the previous being constructed with higher semantic complexity. This counterintuitive trend could also be related to differences in distractor framing, where Expert-level distractors-", "Figure 4: Overview of evaluation datasets, model configurations, and performance across medical QA tasks. a . Dataset statistics for the seven medical QA benchmarks utilized in this study. The benchmark suite includes 4 widely adopted datasets [76] (MMLU-Med, MedQA-US, PubMedQA*, and BioASQ-Y/N) and three newly developed differential diagnosis datasets [77] (MedDDx-Basic, MedDDx-Intermediate, and MedDDx-Expert). 
For every dataset, we report the variety of multiple-choice questions and the corresponding answer option formats. b . Configuration of the five LLMs evaluated: GPT-4-turbo, GPT-3.5-turbo (OpenAI) [78], DeepSeek-v3 (DeepSeek) [75], Qwen-Max, and Qwen-Plus (Qwen) [79]. All models were accessed through public APIs of their zero-shot settings without fine-tuning. The particular version identifiers and access platforms are indicated. c -i . Model performance across the seven QA datasets, shown as radar plots. Each chart compares zero-shot accuracy for five LLMs under two conditions: direct answering without retrieval (w/o RAG) and RAG with our KG (w/ RAG). Across all datasets, RAG with our KG consistently outperformed direct answering.nnDatasets Size Options MMLU-Med 1,089 A/B/C/D MedQA-US 1,273 PubMedQA* Yes/No/Possibly BioASQ-Y/N Yes/No MedDDx-Basic MedDDx-Intermediate 1,041 MedDDx-Expert Provider Model Version Accessed URL OpenAI GPT-4-turbonnhttps://platform.openai.com/docs/models/gpt-4-turbonnGPT-3.5-turbonnhttps://platform.openai.com/docs/models/gpt-3.5-turbonnDeepSeeknDeepSeek-v3", "https://huggingface.co/deepseek-ai/DeepSeek-V3nnQwennQwen-Maxnnhttps://www.alibabacloud.com/help/en/model-nnstudio/what-is-qwen-llm Qwen-Plus b BioASQ-YIN w/o RAG RAG 0.9054 0.8130 0.5780 0.8625 0.5660 0,5720 0.5520 0.7401 0.7880 0.4940 0.831 0.5300 0.8953 0.8834 0.9183 0.8036 h wlo RAG 0.5197 0.5437 0,5714 0.5207 0.5347 0.4890 0,4265 506- 0.3685 0.4204 0,.4688 0.5020 0,4720 0.5259 0.4990 0.5043 0.5592 0,5878 0.8935 0.8576 7855| 0.8398 DeepSe -Max Seek-v3 0,5135 ) 5673 0.5469 0.4700", "Figure 5: Case study of tocilizumab for literature-based discovery and drug repurposing inside the KG. a . Known association between tocilizumab and rheumatoid arthritis, supported by multiple publications, with the earliest reported date defined by the primary extracted supporting paper. b . Two multi-hop reasoning paths linking tocilizumab to COVID-19 via intermediate genes FGB and TNF. The inferred Treat relation (red arrow) was derived solely from earlier literature, while later studies validated this prediction (green arrow). The temporal order of evidence highlights the KG's capability to anticipate therapeutic connections prior to their recognition within the literature.nntociliz-numabnnIdentifier:nnMESH:C502936nnEntity Type:nnChemicalnnTerminology:nnNCBI MeSHnPage Linknn: meshb.nlm.nih.gov/record/ui?ui=C502936nnTreat Arthritis Rheum atoid MESH:D001172 Diseasenn: meshb.nlm.nih.gov/record/ui?ui=D001172nnConfidencen: 0.999999925nPubMed IDsnn:nn26374404,27958380,29146040,30859494,308nn88472,32844216,35713462,36688476nnEarliest Reported Daten: 2016-07-01nnmeshb.nlm.nih.gov/record/ui?ui=C502936nnFGB Gene Terminology NCBI Genenn: www.ncbi.nlm.nih.gov/gene/2244nnCOVID -19 Identifier : MESH:D000086382 : NCBI MeSHnnmeshb.nlm.nih.gov/record/ui?ui=D000086382nnTNF"],
"seed_texts": ["AbstractnnBiomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and an ensemble of zero-shot models can accomplish state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems - all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.", "Finally, we have added additional IR searches to handle the cases where a QA step does not return a response based on the evidence retrieved from Elasticsearch. We have observed that Elasticsearch context might not provide sufficient evidence for QA in 3-7% of test cases for Phase A+, depending on the batch. An automated process is used to expand IR sources to address these cases. First, we are using a Google search restricted to PubMed sources to attempt to find new matches. If that fails, we extend our sources to include Home of the Office of Health Promotion and Disease Prevention, WebMD,nnThis multi-stage retrieval approach, combining LLM-generated queries, a traditional BM25 search, and semantic reranking, enables flexible, high-recall, and high-precision document selection tailored to complex biomedical queries.", "Medical Document Retrieval for Generation. Acknowledging the limitations of report-only retrieval, recent studies have increasingly emphasized medical documents as knowledge sources (Choi et al. 2025; Shaaban et al. 2025; Wu et al. 2025; Hamza et al. 2025). Among them, MKGF (Wu et al. 2025) and K-LLaVA (Hamza et al. 2025) both employ multimodal retrievers to fetch documents from the database, aiming to mitigate hallucination issues in language models. ChatCAD+ (Zhao et al. 2024b) and MIRA (Wang et al. 2025) utilize a zero-shot query rewriting module for retrieval. Nevertheless, these retrieval methods overlook the substantial content differences among various corpora, lacking corpus-specific retrieval mechanisms.", "6 ConclusionnnThis work addresses the critical challenges of effective retrieval and multi-aspect alignment for heterogeneous knowledge in the Medical MMRAG field. MedAtlas provides a rich, multi-source knowledge base for medical multimodal tasks. 
The HeteroRAG framework enables precise report retrieval and multi-corpus retrieval, followed by aligning heterogeneous retrieval results through Heterogeneous Knowledge Preference Tuning. Extensive experiments demonstrate that our framework achieves state-of-the-art performance across multiple medical VQA and report generation benchmarks. Our work paves the way for effectively integrating multi-source medical knowledge, advancing the reliability and applicability of Med-LVLMs in clinical scenarios.", "2.4 Evaluating Downstream Utility in Medical Question AnsweringnnWe evaluated the downstream utility of our constructed KG as a RAG information source across seven multiplechoice medical QA datasets. These included four widely used benchmarks [76]-MMLU-Med, MedQA-US, PubMedQA*, and BioASQ-Y/N-spanning a broad spectrum of clinical and biomedical reasoning tasks. To further assess diagnostic reasoning under various complexity, we introduce MedDDx, a newly developed benchmark suite focused on differential diagnosis [77]. Questions are stratified into three levels-MedDDx-Basic, MedDDxIntermediate, and MedDDx-Expert-based on the variance in semantic similarity amongst answer decisions. All MedDDx subsets were designed to cut back training data leakage and more closely reflect authentic clinical reasoning. Detailed dataset statistics are shown in Figure 4 a . We systematically evaluated five state-of-the-art LLMs to measure the impact of KG-based retrieval. Each model was tested in a zero-shot setting under two conditions: (1) direct answering using internal knowledge alone, and (2) RAG, with relevant KG subgraphs prepended as external context. The models-GPT-4-turbo, GPT-3.5-turbo (OpenAI) [78], DeepSeek-v3 (DeepSeek) [75], Qwen-Max, and Qwen-Plus (Qwen) [79]-span diverse architectures and training regimes, representing each proprietary and open-source systems. All models were accessed via publicly available APIs without additional fine-tuning. Version details and access endpoints are summarized in Figure 4 b ."]

Here, the additional (full) context actually contains the missing evidence, which helped and therefore got a higher score than the seeds.

Frankly, though, I still find the answer a bit lackluster; there is nothing all that interesting in it. But that points to issues on the retrieval side.

Lastly, let's check a few random ones and see how the judge evaluates those.

"query": "how do i measure if iu2019m getting diverse sources as a substitute of 10 chunks from the identical pdf", 
"answer": "You may treat u201cdiverse sourcesu201d as something you measure directly, not only eyeball.nn1. Easy source-diversity rating (like within the paper)nIn Document 1, they define a retrieval diversity metric because the fraction of unique review IDs among the many topu201110 chunks: Diversity = (variety of distinct IDs) / 10 [1].nYou can copy this concept for documents:n- Give every PDF (or document) a novel ID.n- Once you retrieve K chunks (say K=10), take a look at which document each chunk got here from.n- Compute: Diversity = (variety of distinct document IDs) / K.n  u2022 If all 10 chunks come from the identical PDF, Diversity = 1 / 10 = 0.1 (very low).n  u2022 If all 10 chunks come from different PDFs, Diversity = 10 / 10 = 1.0 (perfect diversity).nnYou can log this number for every query to observe whether youu2019re repeatedly pulling from the identical file or actually spreading across your corpus.nn2. Mix with relevancenIn Document 1, additionally they track how close the chunks are to the query using a mean cosine similarity between the query embedding and every chunk [1]. In practice you wish each:n- High similarity (onu2011topic).n- High diversity (many alternative source IDs).nn3. Optional: richer diversity laternDocument 2 shows that diversity will also be measured at different levels (lexical, semantic, ingredient) when you may have multiple outputs for one input [2]. By analogy, once youu2019re completely satisfied with source diversity, you may later add:n- Semantic diversity between chunks (are they saying various things?).nBut the simplest place to begin is the per-query u2018distinct document IDs / Ku2019 metric.",
"full_faithfulness": 0.8823529412000001, 
"seed_faithfulness": 0.5294117647000001,
"answer_relevancy": 0.8588673985, 
"context_relevance": 0.5303662744007874,
"context_relevance_reason": "The context passages provide relevant details about measuring diversity in retrieval systems, particularly within the context of recipe adaptation and LLMs. Nevertheless, while some passages discuss diversity metrics and retrieval methods, they do indirectly address the user's specific query about measuring diverse sources versus multiple chunks from the identical PDF. The relevance of the context is somewhat indirect, resulting in a moderate rating.",
"hallucination_score": 0.7209711030557213,
"hallucination_reason": "The response effectively outlines a way for measuring source diversity by introducing a straightforward source-diversity rating and providing a transparent formula. It aligns well with the context, which discusses retrieval diversity metrics. Nevertheless, while it mentions combining relevance with diversity, it doesn't explicitly connect this to the context's give attention to average cosine similarity, which could enhance the completeness of the reply. Overall, the claims are mostly supported, with minor gaps in direct references to the context."
"full_context": ["D. Question and Answering (QA)nnFor retrieval of reviews, we sampled five Spotify-centric queries and retrieved the top K = 10 review chunks for each. We measured two unsupervised metrics:nnAverage Cosine Similarity : the mean cosine similarity between each query embedding and its top-10 chunk embeddings.", "Retrieval Diversity : the fraction of unique review IDs among all retrieved chunks (distinct IDs / 10).nnOur retriever achieved perfect diversity and cosine scores from 0.618 to 0.754, demonstrating reliable, on-topic retrieval. Table IX summarizes these proxy metrics.", "For generation of answers, we randomly sampled 20 generated answers (each paired with its cited snippets) and annotated them ourselves, confirming that each answer (1) reflected the cited excerpts, (2) covered the main points of those excerpts, and (3) was written in clear, reader-friendly prose. We found the responses to be accurate and comprehensive.", "| | Query | Avg. Cosine Sim. | Diversity |n|---:|:-------------------------------------------------------------------------------|-------------------:|------------:|n| 0 | What complaints do users have about | 0.713 | 1 |n| 1 | What do listeners say about Spotify crashing or freezing on startup? | 0.754 | 1 |n| 2 | How do listeners describe the app's offline playback experience? | 0.696 | 1 |n| 3 | How do users report errors or failures when downloading songs for offline use? | 0.618 | 1 |n| 4 | What do users say about Spotify's crossfade and track-transition experience? | 0.65 | 1 |nnTABLE IX RETRIEVAL PROXY METRICS (K=10) FOR SELECTED SPOTIFY QUERIES (HIGHER DIVERSITY IS BETTER)", "2 Related WorknnRecipe Cross-Cultural Adaptation Recipe cross-cultural adaptation (Cao et al., 2024) involves modifying recipes to suit the dietary preferences and writing styles of the target culture. This includes not just translation, but also adjusting formats, ingredients, and cooking methods to align with cultural norms. Previous studies (Cao et al., 2024; Pandey et al., 2025; Zhang et al., 2024) often treat recipe adaptation as a cross-cultural translation task, exploring how prompt-based LLMs can be used for Chinese-English recipe adaptation.", "However, LLM-based recipe adaptation still faces challenges. Magomere et al.'s (2024) show that such methods can be misleading and may reinforce regional stereotypes. Hu et al.'s (2024) further identify two main challenges: First, LLMs lack culinary cultural knowledge, leading to insufficient cultural appropriateness. Second, the adapted recipes have quality issues, such as changing ingredients without adjusting the cooking steps accordingly. They propose another way to address these issues, namely through cross-cultural recipe retrieval, which sources recipes from real cooking practices within the target culture, generally offering better quality and cultural alignment. However, compared to directly using LLMs, the retrieved recipes often have low similarity to the original.", "All the above-mentioned studies primarily focus on the quality of generated results, including cultural appropriateness and their preservation of the original . However, they overlook the diversity of the results and do not explore the use of RAG for cross-cultural recipe adaptation. 
Our study emphasizes the trade-off between diversity and quality, with a particular focus on RAG-based approaches.", "Diversity in text generation, IR, and RAG Previous studies (Lanchantin et al., 2025) have shown that post-training LLMs tend to sharpen their output probability distribution, leading to reduced response diversity. This has raised a common concern about the insufficient diversity of LLMs, particularly in creative tasks. Several stochastic sampling-based decoding methods are widely used to control the level of diversity, most notably by adjusting hyperparameters such as temperature (Shi et al., 2024). However, these methods often still fall short in achieving sufficient diversity and may lead to a rapid decline in output quality, which is another important factor to consider when measuring diversity (Lanchantin et al., 2025).", "Figure 2: Overview of CARRIAGE . Diversity components are highlighted. We first enhance the diversity of retrieved results, then we enable more diverse use of contextual information via dynamic context selection, and inject contrastive context to prevent the LLM from generating outputs similar to previously generated recipes.nnMulti-Query Retrieval Source Culture Recipe Target Culture Diversity-aware Reranking Query Rewriting Dynamic Context Organization Pool of Previously Generated Recipes LLM Generation Contrastive Context Injection Previously : Diversity component Reference Recipes Selection Relevance DiversitynnMay generate multiple timesnnIn IR, retrieving text with high diversity can cover a wider range of subtopics, thereby accommodating the potentially diverse preferences of different users. Methods such as diverse query rewriting (Mohankumar et al., 2021) and diversity-aware re-ranking (Carbonell and Goldstein, 1998; Krestel and Fankhauser, 2012) can effectively enhance the diversity of retrieval results. Some recent works (Carraro and Bridge, 2024) have explored using LLMs to enhance diversity in re-ranking.", "In RAG, prior works have mainly focused on retrieving diverse results to obtain more comprehensive information, such as mitigating context window limitations (Wang et al., 2025) and addressing multi-hop question answering tasks (Rezaei and Dieng, 2025). These works are primarily framed as question answering, aiming to acquire comprehensive knowledge to produce a single correct answer. Consequently, the evaluation metrics emphasize answer accuracy rather than diversity. In contrast, our task naturally permits multiple valid answers. Therefore, we adopt different strategies to encourage answer diversity and use metrics that explicitly evaluate the diversity of final outputs. While prior works have largely focused on retrieving diverse contexts, our approach goes a step further by investigating how to utilize such diverse contexts to produce diverse outputs.", "5 MetricsnnOur evaluation metrics focus on two key aspects: diversity and quality . To assess diversity, we consider factors such as lexical , semantic , and ingredient diversity from a per-input perspective. As a trade-off, we evaluate quality from two dimensions: the preservation of the source recipe, and cultural appropriateness for users in the target culture.", "5.1 DiversitynnKirk et al.'s (2023) have proposed two paradigms for measuring diversity: across-input (over pairs of one input and one output) and per-input diversity (one input, several outputs). 
Per-input diversity helps us investigate whether a single recipe can be adapted into multiple variants to meet different dietary preferences, while across-input diversity assesses whether the generated recipes collectively exhibit a diverse range of linguistic patterns. Because our investigation primarily focuses on whether a single recipe can be adapted into diverse variations to meet a broader range of needs, we adopt the per-input diversity setting as our main experimental focus. The across-input diversity setting is discussed further in Section 7.", "For a diversity metric D , under model configuration c , A denotes a set of adapted recipes,", "containing N source recipes, we define A i c = { a i c, 1 , a i c, 2 , . . . , a i c,K } as the set of K adaptations for the i -th source recipe under configuration c . The per-input diversity is defined as follows:nnLexical Diversity Lexical diversity is a measure of the variety of vocabulary used within a set of text. High lexical diversity indicates using a broad range of unique words, which may correspond to a wider variety of ingredients, cooking methods, and flavors. We employ Unique-n (Johnson, 1944) to evaluate lexical diversity, calculated as the ratio of unique n -grams to the total number of n -grams, reflecting the proportion of distinct n -grams and indicates vocabulary richness. Following prior work (Guo et al., 2024), we report the average Unique-n across unigrams, bigrams, and trigrams.", "Semantic Diversity Semantic diversity refers to the variety of meanings within a set of texts. High semantic diversity suggests a wide range of culinary ideas. We measure per-input semantic diversity using the average pairwise cosine distance between Sentence-BERT embeddings because embedding-based semantic diversity enables a more fine-grained evaluation of variation beyond surface-level vocabulary (Stasaski and Hearst, 2023). Specifically, for a set of K adapted recipes, we define the sum of their average semantic similarity and semantic diversity to be 1. In this formulation, higher semantic similarity implies lower semantic diversity. We define semantic diversity, scaled to the range [0 , 1] , as follows:nnwhere e represents embeddings of the recipe.", "Ingredient Diversity Ingredient diversity measures the variation in sets of ingredients across different recipes. Ingredient alternative plays a vital role in recipe diversity (Borghini, 2015). In comparison with general lexical variation, ingredient changes offer a more precise signal for capturing the important thing aspects driving diversity in recipes.", "Recipes often describe the identical ingredient in various ways, akin to differences in quantity or units of measurement. To mitigate this, we introduce Standard Ingredients , which retain only the ingredient name by stripping away non-essential details. Since ingredient descriptions typically follow the format < quantity > < unit > < ingredient name >, we extract only the < ingredient name > to compute ingredient diversity. The detailed procedure is provided in Appendix B.", "To avoid the influence of differing ingredient counts across recipes, we define ingredient diversity because the ratio of unique standardized ingredients to the overall variety of ingredients. For a set of K adapted recipes, let the set of standardized ingredients for every recipe be I 1 , I 2 , . . . , I K . We define ingredient diversity as follows:", "5.2 QualitynnWe define automatic quality metrics to function a trade-off when evaluating recipe diversity. 
Further details on the training and evaluation of the CultureScore model are provided in Appendix B.", "Source Recipe Preservation Following prior work (Cao et al., 2024; Hu et al., 2024), we employ BERTScore (Zhang* et al., 2020), a standard cosine embedding-based method for measuring the similarity between source and output recipes. Previous studies have shown that BERTScore aligns well with human evaluations when it comes to source recipe preservation (Hu et al., 2024).", "Cultural Appropriateness We propose a novel metric, the Recipe Cultural Appropriateness Rating (CultureScore), to evaluate how well the output recipes align with the goal culture. Specifically, we employ a BERT-based classifier (Devlin et al., 2019; Cau00f1ete et al., 2020) to predict the country of origin of a recipe using its title and list of ingredients as input. The CultureScore is defined as the common predicted probability assigned by the model to the goal culture across all adapted recipes, with higher scores indicating higher cultural alignment. Since Latin American and Spanish recipes share the identical language, the model cannot depend on linguistic cues; as a substitute, it must learn to differentiate them based on culturally relevant features akin to ingredients, flavors, and writing styles. On condition that the classification model achieves an F1-score of over 90% in distinguishing between Latin American and Spanish recipes, we consider CultureScore a reliable proxy for assessing cultural appropriateness.", "| | | Method. | Diversity ( u2191 ).Lexical | Diversity ( u2191 ).Ingredient | Diversity ( u2191 ).Semantic | Quality ( u2191 ).CultureScore | Quality ( u2191 ).BERTScore |n|---:|:------------------|:----------------------------------------------------------------------------|:--------------------------|:-----------------------------|:---------------------------|:-----------------------------|:--------------------------|n| 0 | Closed- Book LLMs | Llama3.1-8B Qwen2.5-7B Gemma2-9B | 0.557 0.551 0.538 | 0.667 0.531 0.639 | 0.232 0.247 0.196 | 0.451 0.404 0.468 | 0.404 0.439 0.370 |n| 1 | IR | JINA-ES CARROT CARROT-MMR | 0.742 0.735 0.741 | 0.937 0.925 0.941 | 0.459 0.462 0.527 | 0.511 0.512 0.503 | 0.295 0.301 0.298 |n| 2 | RAG | Vanilla-LLaMA RAG CARROT-LLaMA RAG CARROT-MMR-LLaMA RAG CARROT-MMR-Qwen RAG | 0.518 0.525 0.520 0.532 | 0.748 0.765 0.748 0.536 | 0.155 0.152 0.164 0.212 | 0.383 0.385 0.393 0.402 | 0.551 0.545 0.545 0.448 |n| 3 | Ours | CARRIAGE -LLaMA CARRIAGE -Qwen | 0.577 0.628 | 0.739 0.676 | 0.269 0.303 | 0.463 0.590 | 0.442 0.342 |", "Table 1: Evaluation of diversity and quality on the RecetasDeLaAbuel@ dataset shows that our proposed CARRIAGE -LLaMA outperforms all closed-book LLMs when it comes to Pareto efficiency across each diversity and quality metrics. In contrast, IR-based methods struggle with preserving the source recipe, while other RAG-based approaches are inclined to underperform when it comes to diversity and cultural appropriateness."

This one is interesting: you can see the evaluator taking a reasonable generalization and treating it as “kinda supported” or “meh.”
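
If you want to track this kind of thing in your own pipeline, here's a minimal sketch of the per-query source-diversity score the answer above describes (distinct document IDs over K retrieved chunks). The `doc_id` field name is just an assumption for illustration; use whatever your chunk metadata actually calls it.

```python
def source_diversity(retrieved_chunks: list[dict], k: int = 10) -> float:
    """Fraction of distinct source documents among the top-k retrieved chunks.

    Assumes each chunk is a dict with a 'doc_id' field (hypothetical name).
    1.0 means every chunk came from a different document; 1/k means they
    all came from the same one.
    """
    top_k = retrieved_chunks[:k]
    if not top_k:
        return 0.0
    distinct_docs = {chunk["doc_id"] for chunk in top_k}
    return len(distinct_docs) / len(top_k)


# Example: 10 chunks where two extra chunks come from an already-seen PDF.
chunks = [{"doc_id": f"doc_{i}"} for i in range(8)] + [{"doc_id": "doc_0"}] * 2
print(source_diversity(chunks))  # 0.8
```

Logged per query next to the similarity scores, this makes it easy to spot queries where retrieval (or neighbor expansion) keeps pulling everything from a single file.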

When I evaluated this item with another LLM, it said it thought the comment was a bit whiny.

But as you see, low scores don't have to mean that the system is bad. You have to examine why they're low, and also why they're high, to understand how the judge works or why the pipeline is failing.

A good example is context relevance here. Context relevance measures how much of the retrieved context was actually useful. If you're doing neighbor expansion, you'll almost always pull in some irrelevant text, so this precision-style metric will look worse, especially if the corpus can't answer the question in the first place.

The question is whether the extra context actually helps grounding (faithfulness / hallucination rate) enough to be worth the noise.
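
To answer that with data rather than vibes, you can aggregate the per-item scores over a whole run. A minimal sketch, assuming the results are saved as a list of JSON records shaped like the ones above; the file name and the 0.05 threshold are made up for illustration:

```python
import json
from statistics import mean

# Assumes an eval run saved as a list of records shaped like the ones above,
# with seed_faithfulness / full_faithfulness and context_relevance fields.
with open("eval_results.json") as f:  # hypothetical file name
    records = json.load(f)

faithfulness_gain = mean(r["full_faithfulness"] - r["seed_faithfulness"] for r in records)
avg_context_relevance = mean(r["context_relevance"] for r in records)

print(f"Avg faithfulness gain from expanding to neighbors: {faithfulness_gain:+.3f}")
print(f"Avg context relevance of the expanded contexts:    {avg_context_relevance:.3f}")

# A crude decision rule (threshold is arbitrary): expansion is "worth the noise"
# if it buys a real grounding improvement without tanking context relevance.
if faithfulness_gain > 0.05 and avg_context_relevance > 0.5:
    print("Neighbor expansion looks worth keeping on this dataset.")
```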

Some cautionary notes

Okay, some notes before I round this off. 

Testing the seeds this way is clearly biased, and it doesn't tell us whether they were actually useful on their own. We'd need to build two different pipelines and compare them side by side to say that properly.

I'll try to do that in the future, with this exact use case.

I should also note that the system has very few docs in the pipeline: only about 150 PDF files along with some Excel files, which amounts to a few thousand pages. But I have to demo this in public, and this was the only way.

Remember that we only used generation-side metrics here, judged against the context that was retrieved. If the retrieved context is wrong or contains conflicting information, these metrics may not show it; you have to measure retrieval quality separately, before this step.
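
For the retrieval side, the usual starting point is something like recall@k against a small golden set of question-to-document labels. A minimal sketch (building the golden labels is the actual work):

```python
def recall_at_k(retrieved_doc_ids: list[str], relevant_doc_ids: set[str], k: int = 10) -> float:
    """Fraction of the known-relevant documents that appear in the top-k results.

    Needs a golden set of relevant doc IDs per question, which is the part
    you actually have to build (from logs, experts, or synthetic labels).
    """
    if not relevant_doc_ids:
        return 0.0
    hits = relevant_doc_ids & set(retrieved_doc_ids[:k])
    return len(hits) / len(relevant_doc_ids)


print(recall_at_k(["a", "b", "c"], {"a", "d"}, k=3))  # 0.5
```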

Additionally, many teams build their own custom metrics, unique to their pipeline and to what they need to test. Even if you start like this, with general metrics, you'll spot what you need along the way and can build better-targeted ones.
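
As a sketch of what a custom metric can look like, here's a deterministic check tailored to this pipeline's answer format: do the bracketed citations in an answer (like the [1] and [2] in the example above) actually resolve to documents that were in the context? The function name and the "no citations counts as fine" choice are my assumptions:

```python
import re


def citations_resolve(answer: str, num_context_docs: int) -> float:
    """Custom metric sketch: fraction of bracketed citations in the answer,
    e.g. [1] or [2], that point at a document index which actually exists
    in the provided context. Returns 1.0 when there are no citations at all,
    on the assumption that 'missing citations' is handled by another metric."""
    cited = [int(m) for m in re.findall(r"\[(\d+)\]", answer)]
    if not cited:
        return 1.0
    valid = sum(1 for c in cited if 1 <= c <= num_context_docs)
    return valid / len(cited)


# Example: an answer citing [1], [2], and a non-existent [7] against two documents.
print(citations_resolve("see [1] and [2], but not [7]", num_context_docs=2))  # ≈ 0.67
```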

The last thing to note is LLM judge bias. I'm using OpenAI models both for the RAG pipeline and for the evaluator. This is generally not recommended, but as long as the generator and the judge are different models, it is usually accepted.


Hopefully it was a fun read (if you're a dork about data like me).

Stay tuned for the last article, where I'll try to test a more naive pipeline against this one (hopefully I have time to finish it).

If you want to stay updated or just connect, you can find me on LinkedIn, my website, or Medium (and here too).
