Popularity of RAG
Over the past two years while working with financial firms, I’ve observed firsthand how they discover and prioritize Generative AI use cases, balancing complexity with potential value.
Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation and real-world impact. By combining a retriever that surfaces relevant documents with an LLM that synthesizes responses, RAG streamlines knowledge access, making it invaluable for applications like customer support, research, and internal knowledge management.
Defining clear evaluation criteria is essential to making sure LLM solutions meet performance standards, just as Test-Driven Development (TDD) ensures reliability in traditional software. Drawing from TDD principles, an evaluation-driven approach sets measurable benchmarks to validate and improve AI workflows. This becomes especially necessary for LLMs, where the complexity of open-ended responses demands consistent and thoughtful evaluation to deliver reliable results.
For RAG applications, a typical evaluation set includes representative input-output pairs that align with the intended use case. For instance, in chatbot applications, this might involve Q&A pairs reflecting user inquiries. In other contexts, such as retrieving and summarizing relevant text, the evaluation set could include source documents alongside expected summaries or extracted key points. These pairs are often generated from a subset of documents, such as those that are most viewed or frequently accessed, ensuring the evaluation focuses on the most relevant content.
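As a purely illustrative example, a single evaluation record for a RAG chatbot over research reports might look like the sketch below; the field names are hypothetical rather than a standard schema.

```python
# A minimal, illustrative evaluation record for a RAG chatbot.
# Field names are hypothetical; adapt them to your own evaluation harness.
eval_record = {
    "question": "What does Exhibit 4.03 show about advisor headcount?",
    "reference_answer": "A short, SME-approved answer grounded in that exhibit.",
    "source_document": "2023_research_report.pdf",  # hypothetical file name
    "source_page": 57,
}
```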
Key Challenges
Creating evaluation datasets for RAG systems has traditionally faced two major challenges.
- The process often relied on subject matter experts (SMEs) to manually review documents and generate Q&A pairs, making it time-intensive, inconsistent, and costly.
- LLMs were limited to processing text, preventing them from handling visual elements within documents, such as tables or diagrams. Standard OCR tools struggle to bridge this gap, often failing to extract meaningful information from non-textual content.
Multi-Modal Capabilities
The challenges of handling complex documents have evolved with the introduction of multimodal capabilities in foundation models. Commercial and open-source models can now process both text and visual content. This vision capability eliminates the need for separate text-extraction workflows, offering an integrated approach to handling mixed-media PDFs.
By leveraging these vision features, models can ingest entire pages at once, recognizing layout structures, chart labels, and table content. This not only reduces manual effort but also improves scalability and data quality, making it a powerful enabler for RAG workflows that depend on accurate information from a variety of sources.
Dataset Curation for Wealth Management Research Report
To demonstrate a solution to the problem of manual evaluation set generation, I tested my approach using a sample document, the 2023 Cerulli report. This type of document is typical in wealth management, where analyst-style reports often mix text with complex visuals. For a RAG-powered search assistant, a knowledge corpus like this would likely contain many such documents.
My goal was to demonstrate how a single document can be leveraged to generate Q&A pairs that incorporate both text and visual elements. While I didn't define specific dimensions for the Q&A pairs in this test, a real-world implementation would involve providing details on the types of questions (comparative, analysis, multiple choice), topics (investment strategies, account types), and many other elements. The primary focus of this experiment was to ensure the LLM generated questions that incorporated visual elements and produced reliable answers.
My workflow, illustrated in the diagram, leverages Anthropic's Claude Sonnet 3.5 model, which simplifies the process of working with PDFs by handling the conversion of documents into images before passing them to the model. This built-in functionality eliminates the need for additional third-party dependencies, streamlining the workflow and reducing code complexity.
I excluded preliminary pages of the report, such as the table of contents and glossary, focusing on pages with relevant content and charts for generating Q&A pairs. Below is the prompt I used to generate the initial question-answer sets.
You are an expert at analyzing financial reports and generating question-answer pairs. For the provided PDF, the 2023 Cerulli report:
1. Analyze pages {start_idx} to {end_idx} and for **each** of these 10 pages:
- Identify the **exact page title** as it appears on that page (e.g., "Exhibit 4.03 Core Market Databank, 2023").
- If the page includes a chart, graph, or diagram, create a question that references that visual element. Otherwise, create a question about the textual content.
- Generate two distinct answers to that question ("answer_1" and "answer_2"), each supported by the page's content.
- Identify the correct page number as indicated in the bottom left corner of the page.
2. Return exactly 10 results as a valid JSON array (a list of dictionaries). Each dictionary must have the keys: "page" (int), "page_title" (str), "question" (str), "answer_1" (str), and "answer_2" (str). The page title typically includes the word "Exhibit" followed by a number.
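To make the workflow concrete, here is a minimal sketch of how one 10-page batch might be sent to the model with Anthropic's Python SDK. The file name, model identifier, and max_tokens value are assumptions on my part, and the document block follows the SDK's PDF support; check the current API reference before reusing the exact parameters.

```python
import base64
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Base64-encode one 50-page section of the report for the PDF document block.
with open("cerulli_2023_section_1.pdf", "rb") as f:  # hypothetical file name
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

# The prompt shown above, abbreviated here; {start_idx}/{end_idx} are placeholders.
QA_PROMPT = (
    "You are an expert at analyzing financial reports and generating "
    "question-answer pairs. For the provided PDF, the 2023 Cerulli report:\n"
    "1. Analyze pages {start_idx} to {end_idx} ..."  # full prompt text from above
)

start_idx, end_idx = 1, 10  # one 10-page batch

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model identifier
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data,
                },
            },
            {"type": "text",
             "text": QA_PROMPT.format(start_idx=start_idx, end_idx=end_idx)},
        ],
    }],
)

# The prompt requests a bare JSON array of 10 dictionaries, one per page.
qa_pairs = json.loads(response.content[0].text)
```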
Q&A Pair Generation
To refine the Q&A generation process, I implemented a comparative learning approach that generates two distinct answers for each question. During the evaluation phase, these answers are assessed across key dimensions such as accuracy and clarity, with the stronger response selected as the final answer.
This approach mirrors how humans often find it easier to make decisions when comparing alternatives rather than evaluating something in isolation. It's like an eye examination: the optometrist doesn't ask whether your vision has improved or declined but instead presents two lenses and asks, "Which is clearer, option 1 or option 2?" This comparative process eliminates the ambiguity of assessing absolute improvement and focuses on relative differences, making the choice simpler and more actionable. Similarly, by presenting two concrete answer options, the system can more effectively evaluate which response is stronger.
This technique is also cited as a best practice in the article "What We Learned from a Year of Building with LLMs" by leaders in the AI space, who highlight the value of pairwise comparisons. I highly recommend reading their three-part series, as it provides invaluable insights into building effective systems with LLMs!
LLM Evaluation
For evaluating the generated Q&A pairs, I used Claude Opus for its advanced reasoning capabilities. Acting as a "judge," the LLM compared the two answers generated for each question and selected the better option based on criteria such as directness and clarity. This approach is supported by extensive research (Zheng et al., 2023) showing that LLMs can perform evaluations on par with human reviewers.
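A minimal sketch of that judging step is shown below, assuming each record carries the question, answer_1, and answer_2 fields produced by the generation prompt; the judge prompt wording and the Opus model identifier are illustrative choices of mine, not a prescribed setup.

```python
import anthropic

client = anthropic.Anthropic()

# Illustrative pairwise-judging prompt; tune the criteria to your use case.
JUDGE_PROMPT = """You are judging two candidate answers to the same question
about a financial research report.

Question: {question}
Answer 1: {answer_1}
Answer 2: {answer_2}

Pick the answer that is more direct, accurate, and clear.
Respond with exactly "1" or "2"."""


def judge_pair(record: dict) -> str:
    """Return the stronger of the two generated answers for one Q&A record."""
    response = client.messages.create(
        model="claude-3-opus-20240229",  # assumed Opus model identifier
        max_tokens=5,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**record)}],
    )
    choice = response.content[0].text.strip()
    return record["answer_1"] if choice.startswith("1") else record["answer_2"]
```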
This approach significantly reduces the amount of manual review required from SMEs, enabling a more scalable and efficient refinement process. While SMEs remain essential during the initial stages to spot-check questions and validate system outputs, this dependency diminishes over time. Once a sufficient level of confidence is established in the system's performance, the need for frequent spot-checking is reduced, allowing SMEs to focus on higher-value tasks.
Lessons Learned
Claude's PDF capability has a limit of 100 pages, so I broke the original document into four 50-page sections. When I tried processing each 50-page section in a single request, explicitly instructing the model to generate one Q&A pair per page, it still missed some pages. The token limit wasn't the actual problem; the model tended to focus on whichever content it considered most relevant, leaving certain pages underrepresented.
To address this, I experimented with processing the document in smaller batches, testing 5, 10, and 20 pages at a time. Through these tests, I found that batches of 10 pages (e.g., pages 1–10, 11–20, etc.) provided the best balance between precision and efficiency. Processing 10 pages per batch ensured consistent results across all pages while optimizing performance.
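The batching logic itself is simple; here is a sketch assuming 1-based page numbering.

```python
def page_batches(total_pages: int, batch_size: int = 10):
    """Yield inclusive (start, end) page ranges: (1, 10), (11, 20), ..."""
    for start in range(1, total_pages + 1, batch_size):
        yield start, min(start + batch_size - 1, total_pages)


# For one 50-page section: (1, 10), (11, 20), (21, 30), (31, 40), (41, 50).
for start_idx, end_idx in page_batches(50):
    pass  # send the generation prompt above for this page range
```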
Another challenge was linking Q&A pairs back to their source. Relying on the tiny page numbers in the PDF's footer alone didn't consistently work. In contrast, page titles or clear headings at the top of each page served as reliable anchors. They were easier for the model to pick up and helped me accurately map each Q&A pair to the correct section.
Example Output
Below is an example page from the report, featuring two tables with numerical data. The following question was generated for this page:
How has the distribution of AUM changed across different-sized Hybrid RIA firms?

Answer: Mid-sized firms ($25m to <$100m) experienced a decline in AUM share from 2.3% to 1.0%.
In the first table, the 2017 column shows a 2.3% share of AUM for mid-sized firms, which decreases to 1.0% in 2022, demonstrating the LLM's ability to synthesize visual and tabular content accurately.
Advantages
Combining caching, batching, and a refined Q&A workflow led to three key benefits:
Caching
- In my experiment, processing a single report without caching would have cost $9, but by leveraging caching, I reduced this cost to $3, a 3x cost savings. Per Anthropic's pricing model, creating a cache costs $3.75 per million tokens; however, reads from the cache are only $0.30 per million tokens. In contrast, input tokens cost $3 per million tokens when caching is not used. A sketch of marking the PDF block for caching follows this list.
- In a real-world scenario with multiple documents, the savings become even more significant. For instance, processing 10,000 research reports of similar length without caching would cost $90,000 in input costs alone. With caching, this cost drops to $30,000, achieving the same precision and quality while saving $60,000.
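The savings come from caching the large, repeated part of the request, the PDF itself, while only the small page-range prompt changes between batches. A minimal sketch using Anthropic's prompt caching is below; the cache_control placement follows the documented ephemeral cache type, but verify the details against the current API reference.

```python
import anthropic

client = anthropic.Anthropic()

# Mark the large, repeated PDF block as cacheable so later batches read the
# cached tokens at the lower rate instead of paying the full input price again.
pdf_block = {
    "type": "document",
    "source": {
        "type": "base64",
        "media_type": "application/pdf",
        "data": pdf_data,  # the base64-encoded section from the generation sketch
    },
    "cache_control": {"type": "ephemeral"},
}

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed model identifier
    max_tokens=4096,
    messages=[{
        "role": "user",
        # Only the trailing text prompt varies per batch; the PDF prefix is reused.
        "content": [pdf_block, {"type": "text", "text": prompt}],  # page-range prompt from the generation sketch
    }],
)
```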
Discounted Batch Processing
- Using Anthropic's Batches API cuts output costs in half, making it a more cost-effective option for certain tasks. Once I had validated the prompts, I ran a single batch job to evaluate all of the Q&A answer sets at once. This method proved far more cost-effective than processing each Q&A pair individually.
- For instance, Claude 3 Opus typically costs $15 per million output tokens. By using batching, this drops to $7.50 per million tokens, a 50% reduction. In my experiment, each Q&A pair generated an average of 100 tokens, resulting in roughly 20,000 output tokens for the document. At the standard rate, this would have cost $0.30. With batch processing, the cost was reduced to $0.15, highlighting how this approach optimizes costs for non-sequential tasks like evaluation runs. A sketch of submitting such a batch job follows this list.
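Here is a sketch of submitting the judging requests as a single job via the Message Batches API, reusing JUDGE_PROMPT and the qa_pairs list from the earlier sketches; the custom_id scheme is my own convention, not a required format.

```python
import anthropic

client = anthropic.Anthropic()

# One request per Q&A record; custom_id lets us map batch results back later.
requests = [
    {
        "custom_id": f"qa-page-{record['page']}",  # hypothetical ID scheme
        "params": {
            "model": "claude-3-opus-20240229",  # assumed Opus model identifier
            "max_tokens": 5,
            "messages": [
                {"role": "user", "content": JUDGE_PROMPT.format(**record)}
            ],
        },
    }
    for record in qa_pairs  # generated earlier, one record per page
]

batch = client.messages.batches.create(requests=requests)
print(batch.id, batch.processing_status)  # poll the batch later to collect results
```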
Time Saved for SMEs
- With more accurate, context-rich Q&A pairs, subject matter experts spent less time sifting through PDFs and clarifying details, and more time focusing on strategic insights. This approach also eliminates the need to hire additional staff or allocate internal resources for manually curating datasets, a process that can be time-consuming and expensive. By automating these tasks, companies save significantly on labor costs while streamlining SME workflows, making this a scalable and cost-effective solution.