This blog post outlines a few of the core abstractions we've created in LlamaIndex around LLM-powered retrieval and reranking, which help improve document retrieval beyond naive top-k embedding-based lookup.
LLM-powered retrieval can return more relevant documents than embedding-based retrieval, with the tradeoff being much higher latency and cost. We show how using embedding-based retrieval as a first-stage pass, with LLM-based retrieval as a second-stage reranking step, can provide a happy medium. We present results over the Great Gatsby and the Lyft SEC 10-K.
There has been a wave of "Build a chatbot over your data" applications in the past few months, made possible with frameworks like LlamaIndex and LangChain. Many of these applications use a standard stack for retrieval-augmented generation (RAG):
- Use a vector store to store unstructured documents (knowledge corpus)
- Given a query, use a retrieval model to retrieve relevant documents from the corpus, and a synthesis model to generate a response.
- The retrieval model fetches the top-k documents by embedding similarity to the query.
In this stack, the retrieval model is not a novel idea; the concept of top-k embedding-based semantic search has been around for at least a decade, and doesn't involve the LLM at all.
There are many advantages to embedding-based retrieval:
- It's very fast to compute dot products, and it doesn't require any model calls during query time.
- Even when not perfect, embeddings can encode the semantics of the document and query reasonably well. There is a class of queries where embedding-based retrieval returns very relevant results.
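To make the first point concrete: with normalized embeddings, top-k retrieval reduces to one dot product per document plus a sort. This is a minimal sketch in generic Python/NumPy, not LlamaIndex code:

```python
import numpy as np

def top_k_by_embedding(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> list[int]:
    """Return indices of the k documents most similar to the query.

    Assumes all embeddings are L2-normalized, so the dot product
    equals cosine similarity.
    """
    scores = doc_embs @ query_emb            # one dot product per document
    return np.argsort(-scores)[:k].tolist()  # highest scores first

# Toy corpus of four normalized 2-d embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]])
query = np.array([1.0, 0.0])
print(top_k_by_embedding(query, docs, 2))  # → [0, 3]
```

No model is invoked at query time; the only cost is the matrix-vector product over the corpus embeddings.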
Yet for a variety of reasons, embedding-based retrieval can be imprecise and return irrelevant context for the query, which in turn degrades the quality of the overall RAG system, regardless of the quality of the LLM.
This is also not a new problem: one approach to resolving this in existing IR and recommendation systems is a two-stage process. The first stage uses embedding-based retrieval with a high top-k value to maximize recall while accepting lower precision. The second stage then uses a somewhat more computationally expensive process that is higher precision and lower recall (for example with BM25) to "rerank" the retrieved candidates.
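The two-stage idea can be sketched in a few lines of generic Python. The scoring functions below are deliberately toy stand-ins (word overlap and word counts), not real embedding or BM25 scorers:

```python
def first_stage(query: str, corpus: list[str], top_k: int) -> list[str]:
    """Cheap, high-recall pass (stand-in for embedding similarity)."""
    ranked = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)
    return ranked[:top_k]

def second_stage(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Expensive, high-precision rerank over the small candidate set."""
    ranked = sorted(candidates, key=lambda doc: expensive_score(query, doc), reverse=True)
    return ranked[:top_n]

# Toy scorers; a real system would use embeddings and BM25 or an LLM.
def cheap_score(query: str, doc: str) -> int:
    return len(set(query.split()) & set(doc.split()))

def expensive_score(query: str, doc: str) -> int:
    return sum(doc.split().count(w) for w in query.split())

corpus = ["the cat sat", "dogs bark loudly", "a cat and a dog", "cats chase mice"]
candidates = first_stage("cat dog", corpus, top_k=3)  # wide net: high recall
print(second_stage("cat dog", candidates, top_n=1))   # → ['a cat and a dog']
```

The key point is the shape of the pipeline: the expensive scorer only ever sees `top_k` candidates, not the full corpus.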
Covering the downsides of embedding-based retrieval is worth a whole series of blog posts. This blog post is an initial exploration of an alternative retrieval method and how it can (potentially) augment embedding-based retrieval methods.
Over the past week, we've developed a number of initial abstractions around the concept of "LLM-based" retrieval and reranking. At a high level, this approach uses the LLM to decide which document(s) / text chunk(s) are relevant to the given query. The input prompt consists of a set of candidate documents, and the LLM is tasked with selecting the relevant set of documents as well as scoring their relevance with an internal metric.
An example prompt would look like the following:
A list of documents is shown below. Each document has a number next to it along with a summary of the document. A question is also provided.
Respond with the numbers of the documents you should consult to answer the question, in order of relevance, as well
as the relevance score. The relevance score is a number from 1-10 based on how relevant you think the document is to the question.
Do not include any documents that are not relevant to the question.
Example format:
Document 1:
<summary of document 1>
Document 2:
<summary of document 2>
…
Document 10:
<summary of document 10>
Query: <question>
Answer:
Doc: 9, Relevance: 7
Doc: 3, Relevance: 4
Doc: 7, Relevance: 3
Let's try this now:
{context_str}
Query: {query_str}
Answer:
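The LLM's structured answer then has to be parsed back into (document number, relevance score) pairs. A minimal parser for the format above might look like the following. This is an illustrative sketch, not the library's actual implementation; the function name is our own:

```python
import re

def parse_choice_select_answer(answer: str) -> list[tuple[int, float]]:
    """Parse lines like 'Doc: 9, Relevance: 7' into (doc_number, score) pairs."""
    pairs = []
    for line in answer.splitlines():
        match = re.match(r"Doc:\s*(\d+),\s*Relevance:\s*(\d+)", line.strip())
        if match:
            pairs.append((int(match.group(1)), float(match.group(2))))
    return pairs

answer = "Doc: 9, Relevance: 7\nDoc: 3, Relevance: 4\nDoc: 7, Relevance: 3"
print(parse_choice_select_answer(answer))
# → [(9, 7.0), (3, 4.0), (7, 3.0)]
```

Lines that don't match the expected format are simply skipped, which gives some robustness against an LLM that adds extra commentary around its answer.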
The prompt format implies that the text for each document should be relatively concise. There are two ways of feeding the text corresponding to each document into the prompt:
- You can directly feed in the raw text corresponding to the document. This works well if the document corresponds to a bite-sized text chunk.
- You can feed in a condensed summary for each document. This is preferred if the document itself corresponds to a long piece of text. We do this under the hood with our new document summary index, but you can also choose to do it yourself.
Given a collection of documents, we can then create document "batches" and send each batch into the LLM input prompt. The output of each batch is the set of relevant documents + relevance scores within that batch. The final retrieval response aggregates relevant documents from all batches.
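Conceptually, the batching-and-aggregation loop can be sketched as follows. This is not the actual LlamaIndex implementation; `llm_choice_select` is a hypothetical callable standing in for one LLM call with the prompt above:

```python
def batch(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def llm_retrieve(query, documents, choice_batch_size, llm_choice_select):
    """Score each batch with the LLM, then aggregate across batches.

    `llm_choice_select(query, doc_batch)` is assumed to return
    [(index_within_batch, relevance_score), ...] for relevant docs only.
    """
    results = []
    for doc_batch in batch(documents, choice_batch_size):
        for idx, score in llm_choice_select(query, doc_batch):
            results.append((doc_batch[idx], score))
    # final response: relevant documents from all batches, sorted by score
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# A deterministic stand-in for the LLM call, for demonstration only
def fake_select(query, doc_batch):
    return [(i, float(len(d))) for i, d in enumerate(doc_batch) if query in d]

docs = ["cat one", "dog", "big cat house", "bird"]
print(llm_retrieve("cat", docs, choice_batch_size=2, llm_choice_select=fake_select))
# → [('big cat house', 13.0), ('cat one', 7.0)]
```

Note the caveat discussed later: scores are produced per batch, so sorting them globally implicitly assumes batches are scored on a comparable scale.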
You can use our abstractions in two forms: as a standalone retriever module (ListIndexLLMRetriever) or a reranker module (LLMRerank). The remainder of this blog primarily focuses on the reranker module, given the speed/cost tradeoffs.
LLM Retriever (ListIndexLLMRetriever)
This module is defined over a list index, which simply stores a set of nodes as a flat list. You can build the list index over a set of documents and then use the LLM retriever to retrieve the relevant documents from the index.
from llama_index import GPTListIndex
from llama_index.indices.list.retrievers import ListIndexLLMRetriever
# the lower-level API below also needs ResponseSynthesizer and
# RetrieverQueryEngine; exact import paths vary by llama_index version

index = GPTListIndex.from_documents(documents, service_context=service_context)

# high-level API
query_str = "What did the author do during his time in college?"
retriever = index.as_retriever(retriever_mode="llm")
nodes = retriever.retrieve(query_str)

# lower-level API
retriever = ListIndexLLMRetriever(index=index)
response_synthesizer = ResponseSynthesizer.from_args()
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=response_synthesizer)
response = query_engine.query(query_str)
This can potentially be used in place of our vector store index: you use the LLM instead of embedding-based lookup to select the nodes.
This module is defined as part of our NodePostprocessor abstraction, which handles second-stage processing after an initial retrieval pass.
The postprocessor can be used on its own or as part of a RetrieverQueryEngine call. In the example below we show how to use the postprocessor as an independent module after an initial retrieval call from a vector index.
from llama_index.indices.query.schema import QueryBundle

query_bundle = QueryBundle(query_str)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=vector_top_k,
)
retrieved_nodes = retriever.retrieve(query_bundle)

# configure reranker
reranker = LLMRerank(choice_batch_size=5, top_n=reranker_top_n, service_context=service_context)
retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)
There are certain limitations and caveats to LLM-based retrieval, especially with this initial version.
- LLM-based retrieval is orders of magnitude slower than embedding-based retrieval. Embedding search over thousands or even millions of embeddings can take less than a second. Each LLM prompt of 4000 tokens to OpenAI can take minutes to complete.
- Using third-party LLM APIs costs money.
- The current approach to batching documents may not be optimal, since it relies on the assumption that document batches can be scored independently of each other. This lacks a global view of the ranking across all documents.
Using the LLM to retrieve and rank every node in the document corpus can be prohibitively expensive. This is why using the LLM as a second-stage reranking step, after a first-stage embedding pass, can be helpful.
Let's take a look at how well LLM reranking works!
We show some comparisons between naive top-k embedding-based retrieval and the two-stage retrieval pipeline with a first-stage embedding-retrieval filter and second-stage LLM reranking. We also showcase some results of pure LLM-based retrieval (though fewer of them, given that it tends to run much slower than either of the first two approaches).
We analyze results over two very different sources of data: the Great Gatsby and the 2021 Lyft SEC 10-K. We only analyze results over the "retrieval" portion and not synthesis, to better isolate the performance of different retrieval methods.
The results are presented in a qualitative fashion. A clear next step would be more comprehensive evaluation over an entire dataset!
In our first example, we load in the Great Gatsby as a Document
object, and build a vector index over it (with chunk size set to 512).
# LLM Predictor (gpt-3.5-turbo) + service context
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)
# load documents
documents = SimpleDirectoryReader('../../../examples/gatsby/data').load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
We then define a get_retrieved_nodes
function, which can either perform just embedding-based retrieval over the index, or embedding-based retrieval + reranking.
def get_retrieved_nodes(
    query_str, vector_top_k=10, reranker_top_n=3, with_reranker=False
):
    query_bundle = QueryBundle(query_str)
    # configure retriever
    retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=vector_top_k,
    )
    retrieved_nodes = retriever.retrieve(query_bundle)

    if with_reranker:
        # configure reranker
        reranker = LLMRerank(
            choice_batch_size=5, top_n=reranker_top_n, service_context=service_context
        )
        retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)

    return retrieved_nodes
We then ask some questions. With embedding-based retrieval we set k=3. With two-stage retrieval we set k=10 for embedding retrieval and n=3 for LLM-based reranking.
(For those of you not familiar with the Great Gatsby: the narrator finds out later on from Gatsby that Daisy was actually the one driving the car, but Gatsby takes the blame for her.)
The top retrieved contexts are shown in the images below. We see that in embedding-based retrieval, the top two texts contain the semantics of the car crash but give no details as to who was actually responsible. Only the third text contains the correct answer.
In contrast, the two-stage approach returns just one relevant context, and it contains the correct answer.
Next, we ask some questions over the 2021 Lyft SEC 10-K, specifically about the COVID-19 impacts and responses. The Lyft SEC 10-K is 238 pages long, and a ctrl-f for "COVID-19" returns 127 matches.
We use a similar setup to the Gatsby example above. The main differences are that we set the chunk size to 128 instead of 512, we set k=5 for the embedding retrieval baseline, and we use an embedding k=40 and reranker n=5 for the two-stage approach.
We then ask the following questions and analyze the results.
Results for the baseline are shown in the image above. We see that the results corresponding to indices 0, 1, 3, and 4 are about measures taken directly in response to COVID-19, even though the query was specifically about company initiatives that were independent of the COVID-19 pandemic.
We get more relevant results with approach 2, by widening the top-k to 40 and then using an LLM to filter for the top-5 contexts. The independent company initiatives include "expansion of Light Vehicles" (1), "incremental investments in brand/marketing" (2), international expansion (3), and accounting for misc. risks such as natural disasters and operational risks in terms of financial performance (4).
That's it for now! We've added some initial functionality to help support LLM-augmented retrieval pipelines, but of course there are a ton of future steps that we couldn't quite get to. Some questions we'd love to explore:
- How our LLM reranking implementation compares to other reranking methods (e.g. BM25, Cohere Rerank, etc.)
- What the optimal values of embedding top-k and reranking top-n are for the two-stage pipeline, accounting for latency, cost, and performance.
- Exploring different prompts and text summarization methods to help determine document relevance.
- Exploring whether there's a class of applications where LLM-based retrieval on its own would suffice, without embedding-based filtering (possibly over smaller document collections?).
Resources
You can play around with the notebooks yourself!