A recent study from the US has found that the real-world performance of popular Retrieval Augmented Generation (RAG) research systems such as Perplexity and Bing Copilot falls far short of both the marketing hype and the popular adoption that has garnered headlines over the last twelve months.
The project, which involved extensive survey participation featuring 21 expert voices, found at least 16 areas in which the studied RAG systems (You Chat, Bing Copilot and Perplexity) produced cause for concern:
1: A lack of objective detail in the generated answers, with generic summaries and scant contextual depth or nuance.
2: Reinforcement of perceived user bias, where a RAG engine frequently fails to present a range of viewpoints, but instead infers and reinforces the user's own bias, based on the way the user phrases a question.
3: Overly confident language, particularly in subjective responses that cannot be empirically established, which can lead users to trust the answer more than it deserves.
4: Simplistic language and a lack of critical thinking and creativity, where responses effectively patronize the user with ‘dumbed-down’ and ‘agreeable’ information, instead of well-reasoned cogitation and analysis.
5: Misattributing and mis-citing sources, where the answer engine uses cited sources that do not support its response/s, fostering the illusion of credibility.
6: Cherry-picking information from inferred context, where the RAG agent appears to be seeking out answers that support its generated contention and its estimation of what the user wants to hear, instead of basing its answers on objective analysis of reliable sources (possibly indicating a conflict between the system's ‘baked’ LLM data and the data that it obtains on the fly from the internet in response to a query).
7: Omitting citations that support statements, where source material for responses is absent.
8: Providing no logical schema for its responses, where users cannot question why the system prioritized certain sources over others.
9: Limited number of sources, where most RAG systems typically provide around three supporting sources for a statement, even where a greater diversity of sources would be applicable.
10: Orphaned sources, where data from some or all of the system's supporting citations is not actually included in the answer.
11: Use of unreliable sources, where the system appears to have preferred a source that is popular (i.e., in SEO terms) rather than factually correct.
12: Redundant sources, where the system presents multiple citations in which the source papers are essentially identical in content.
13: Unfiltered sources, where the system offers the user no way to evaluate or filter the offered citations, forcing users to take the selection criteria on trust.
14: Lack of interactivity or explorability, in which several of the user-study participants were frustrated that RAG systems did not ask clarifying questions, but assumed user intent from the first query.
15: The need for external verification, where users feel compelled to perform independent verification of the supplied response/s, largely removing the supposed convenience of RAG as a ‘substitute for search’.
16: Use of academic citation methods, such as inline numbered citations; this is standard practice in scholarly circles, but can be unintuitive for many users.
For the work, the researchers assembled 21 experts in artificial intelligence, healthcare and medicine, applied sciences, education and social sciences, all either post-doctoral researchers or PhD candidates. The participants interacted with the tested RAG systems while speaking their thought processes out loud, to clarify (for the researchers) their own rational schema.
The paper extensively quotes the participants' misgivings and concerns about the performance of the three systems studied.
The methodology of the user study was then systematized into an automated study of the RAG systems, using browser control suites:
The authors argue at length (and assiduously, in the comprehensive 27-page paper) that both new and experienced users should exercise caution when using the class of RAG systems studied. They further propose a new system of metrics, based on the shortcomings found in the study, that could form the foundation of greater technical oversight in the future.
Nonetheless, the growing public usage of RAG systems prompts the authors to also advocate for apposite legislation and a greater level of enforceable governmental policy in regard to agent-aided AI search interfaces.
The study comes from five researchers across Pennsylvania State University and Salesforce, and is titled Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses. The work covers RAG systems up to the state of the art in August of 2024.
The RAG Trade-Off
The authors preface their work by reiterating four known shortcomings of Large Language Models (LLMs) where they are used within Answer Engines.
Firstly, they’re vulnerable to hallucinate information, and lack the aptitude to detect factual inconsistencies. Secondly, they’ve difficulty assessing the accuracy of a citation within the context of a generated answer. Thirdly, they have an inclination to favor data from their very own pre-trained weights, and will resist data from externally retrieved documentation, although such data could also be more moderen or more accurate.
Finally, RAG systems tend towards people-pleasing, sycophantic behavior, often on the expense of accuracy of data of their responses.
All these tendencies were confirmed in both aspects of the study, among many novel observations about the pitfalls of RAG.
The paper views OpenAI's SearchGPT RAG product (released to subscribers last week, after the new paper was submitted) as likely to further encourage the user adoption of RAG-based search systems, in spite of the foundational shortcomings that the survey results hint at*:
The Study
The authors first tested their study procedure on three out of 24 chosen participants, all invited by means such as LinkedIn or email.
The first stage, for the remaining 21, involved factual queries, where participants averaged around six search enquiries over a 40-minute session. This section focused on the gleaning and verification of questions and answers with potential empirical solutions.
The second phase concerned debate-style queries, which dealt instead with subjective matters, including ecology, vegetarianism and politics.
Source: https://arxiv.org/pdf/2410.22349
Since all of the systems allowed at least some level of interactivity with the citations provided as support for the generated answers, the study subjects were encouraged to interact with the interface as much as possible.
In both cases, the participants were asked to formulate their enquiries both through a RAG system and through a conventional search engine (in this case, Google).
The three Answer Engines – You Chat, Bing Copilot, and Perplexity – were chosen because they are publicly accessible.
Nearly all of the participants were already users of RAG systems, at various frequencies.
Due to space constraints we cannot break down each of the exhaustively documented sixteen key shortcomings featured in the study, but here present a selection of some of the most interesting and enlightening examples.
Lack of Objective Detail
The paper notes that users found that the systems' responses frequently lacked objective detail, across both the factual and subjective responses. One commented:
Another observed:
Lack of Holistic Viewpoint
The authors express concern about this lack of nuance and specificity, and state that the Answer Engines frequently failed to present multiple perspectives on any argument, tending to side with a perceived bias inferred from the user's own phrasing of the question.
One participant said:
Another commented:
Confident Language
The authors observe that all three tested systems exhibited the use of over-confident language, even for responses that cover subjective matters. They contend that this tone will tend to inspire unjustified confidence in the response.
A participant noted:
Another commented:
Incorrect Citations
Another frequent problem was misattribution of sources cited as authority for the RAG systems' responses, with one of the study subjects asserting:
The new paper's authors comment†:
Cherrypicking Information to Suit the Query
Returning to the notion of people-pleasing, sycophantic behavior in RAG responses, the study found that many answers highlighted a particular point of view instead of comprehensively summarizing the topic, as one participant observed:
Another opined:
For further in-depth examples (and multiple critical quotes from the survey participants), we refer the reader to the source paper.
Automated RAG
In the second phase of the broader study, the researchers used browser-based scripting to systematically solicit enquiries from the three studied RAG engines. They then used an LLM system (GPT-4o) to analyze the systems' responses.
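As a rough illustration of the browser-scripting step, the sketch below submits a query to an answer engine through a scripted browser. It is a minimal sketch assuming Playwright; the target URL and CSS selectors are hypothetical placeholders rather than the authors' actual harness, and each engine would need its own selectors.

```python
# Minimal sketch of scripted querying of an answer engine, assuming Playwright.
# The URL and selectors passed in are illustrative placeholders, not the study's code.
from playwright.sync_api import sync_playwright

def ask_answer_engine(question: str, url: str, input_sel: str, answer_sel: str) -> str:
    """Submit a query through a scripted browser and return the rendered answer text."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.fill(input_sel, question)                 # hypothetical input selector
        page.keyboard.press("Enter")
        page.wait_for_selector(answer_sel, timeout=60_000)  # wait up to 60s for the answer
        answer = page.inner_text(answer_sel)           # hypothetical answer selector
        browser.close()
        return answer

# Example call (selectors are hypothetical):
# text = ask_answer_engine("Is a vegetarian diet healthier?",
#                          "https://www.perplexity.ai/", "textarea", "div.answer")
```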
The statements were analyzed for their stance (i.e., whether the response is for, against, or neutral in regard to the implicit bias of the query).
An additional metric was also evaluated in this automated phase, based on the Likert-scale psychometric testing method; here the LLM judge was augmented by two human annotators.
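The sketch below gives one plausible shape for this LLM-judge step, asking GPT-4o (the model named in the study) for a stance label and a Likert-style rating of how confident the response's tone is. The prompt wording, the label set, the 1-5 scale and the choice of confidence as the rated quality are assumptions rather than the paper's exact rubric, and the human-annotation step is omitted.

```python
# Hedged sketch of an LLM-judge step: GPT-4o labels stance and rates tone confidence
# on a 1-5 Likert-style scale. Prompt, labels and scale are assumptions, not the
# paper's rubric; the study additionally used two human annotators.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, answer: str) -> dict:
    """Return {'stance': 'for'|'against'|'neutral', 'confidence': 1-5} for a generated answer."""
    prompt = (
        "You are evaluating an AI answer engine's response.\n"
        "1. stance: is the response 'for', 'against', or 'neutral' with respect to the "
        "position implied by the question?\n"
        "2. confidence: rate how confident the response's tone is on a 1-5 Likert scale "
        "(1 = very hedged, 5 = very confident).\n"
        "Reply with a JSON object with keys 'stance' and 'confidence'.\n\n"
        f"Question: {question}\n\nResponse: {answer}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(reply.choices[0].message.content)
```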
A third operation involved using web-scraping to obtain the full-text content of cited web pages, through the Jina.ai Reader tool. However, as noted elsewhere in the paper, most web-scraping tools are no more able to access paywalled sites than most people are (though the authors observe that Perplexity.ai has been known to bypass this barrier).
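Retrieval through the Jina.ai Reader amounts to prefixing the cited URL with the public r.jina.ai endpoint, which returns an extracted, text-friendly version of the page. The sketch below shows this pattern; the timeout and error handling are assumptions, and paywalled pages will generally still fail, as the authors note.

```python
# Minimal sketch of full-text retrieval via the Jina.ai Reader: the cited URL is
# prefixed with the public r.jina.ai endpoint. Timeout and error handling are assumptions.
import requests

def fetch_full_text(cited_url: str, timeout: int = 30) -> str | None:
    """Return the extracted text of a cited web page, or None if retrieval fails."""
    try:
        resp = requests.get(f"https://r.jina.ai/{cited_url}", timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None  # e.g. a paywalled or unreachable source
```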
Additional considerations were whether or not the answers cited a source (computed as a ‘citation matrix’), as well as a ‘factual support matrix’ – a metric verified with the help of four human annotators.
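Both matrices can be pictured as statement-by-source tables: one records whether a statement cites a source, the other whether the source actually supports it. The sketch below shows only that data structure; the relation functions are hypothetical stand-ins for the LLM judge and human annotators used in the study.

```python
# Hedged illustration of the citation and factual-support matrices: rows are
# statements from an answer, columns are cited sources. The relation callables
# (e.g. cites, is_supported_by) are hypothetical helpers, not the paper's tooling.
from typing import Callable

def build_matrix(
    statements: list[str],
    sources: list[str],
    relation: Callable[[str, str], bool],
) -> list[list[int]]:
    """Return a binary statements-by-sources matrix for the given relation."""
    return [[1 if relation(stmt, src) else 0 for src in sources] for stmt in statements]

# citation_matrix = build_matrix(statements, sources, cites)                    # statement cites source?
# factual_support_matrix = build_matrix(statements, sources, is_supported_by)   # source supports statement?
```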
Thus eight overarching metrics were obtained.
The material against which these metrics were tested consisted of 303 curated questions from the user-study phase, resulting in 909 answers across the three tested systems.

Regarding the results, the paper states:
The authors also note that Perplexity is the most likely to use confident language (90% of answers), and that, by contrast, the other two systems tend to use more cautious and less confident language where subjective content is at play.
You Chat was the only RAG framework to achieve zero uncited sources for an answer, with Perplexity at 8% and Bing Chat at 36%.
All models evidenced a ‘significant proportion’ of unsupported statements, and the paper declares†:
Moreover, all of the tested systems had difficulty in supporting their statements with citations:
The authors conclude:
†