Using LLM’s for Retrieval and Reranking
Summary
Introduction and Background
LLM Retrieval and Reranking
Initial Experimental Results
Conclusion

LlamaIndex Blog

Summary

This blog post outlines some of the core abstractions we have created in LlamaIndex around LLM-powered retrieval and reranking, which help improve document retrieval beyond naive top-k embedding-based lookup.

LLM-powered retrieval can return more relevant documents than embedding-based retrieval, with the tradeoff being much higher latency and cost. We show how using embedding-based retrieval as a first-stage pass, with LLM-based retrieval as a second-stage reranking step, can provide a happy medium. We present results over the Great Gatsby and the 2021 Lyft SEC 10-K.

Two-stage retrieval pipeline: 1) Top-k embedding retrieval, then 2) LLM-based reranking

Introduction and Background

There has been a wave of "Build a chatbot over your data" applications in the past few months, made possible by frameworks like LlamaIndex and LangChain. Many of these applications use a standard stack for retrieval-augmented generation (RAG), sketched in code after the list below:

  • Use a vector store to store unstructured documents (the knowledge corpus).
  • Given a query, use a retriever to fetch relevant documents from the corpus, and a synthesis module (the LLM) to generate a response.
  • The retriever fetches the top-k documents by embedding similarity to the query.
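Here is a minimal sketch of that standard stack, assuming the same llama_index APIs used later in this post (the data directory and the query string are placeholders):

from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# load unstructured documents into the knowledge corpus
documents = SimpleDirectoryReader("./data").load_data()

# embed the documents and store them in a vector index
index = GPTVectorStoreIndex.from_documents(documents)

# fetch the top-k chunks by embedding similarity and synthesize a response
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does the author say about X?")
print(response)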

In this stack, the retrieval model is not a novel idea; the concept of top-k embedding-based semantic search has been around for at least a decade, and it doesn't involve the LLM at all.

There are many advantages to embedding-based retrieval:

  • It's very fast to compute dot products, and it doesn't require any model calls at query time (see the toy sketch after this list).
  • Even if not perfect, embeddings can encode the semantics of the document and query reasonably well. There is a class of queries where embedding-based retrieval returns very relevant results.
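As a toy illustration of the first point, top-k embedding retrieval reduces to a matrix-vector product plus a sort; random vectors stand in for real embeddings in this sketch:

import numpy as np

# toy corpus: 10,000 chunk embeddings (random vectors stand in for real embeddings)
doc_embeddings = np.random.rand(10_000, 1536)
query_embedding = np.random.rand(1536)

# score every chunk with a single matrix-vector (dot) product; no LLM call involved
scores = doc_embeddings @ query_embedding

# take the top-k highest-scoring chunks
top_k = 3
top_k_ids = np.argsort(scores)[-top_k:][::-1]
print(top_k_ids, scores[top_k_ids])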

Yet for a variety of reasons, embedding-based retrieval can be imprecise and return context that is irrelevant to the query, which in turn degrades the quality of the overall RAG system, regardless of the quality of the LLM.

This is also not a new problem: one way to solve this in existing IR and recommendation systems is to use a two-stage process. The first stage uses embedding-based retrieval with a high top-k value to maximize recall while accepting a lower precision. Then the second stage uses a somewhat more computationally expensive process that is higher precision and lower recall (for instance, with BM25) to "rerank" the retrieved candidates.
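To make the two-stage idea concrete, here is a minimal, framework-agnostic sketch; the function and its arguments are hypothetical stand-ins rather than a LlamaIndex API:

def two_stage_retrieve(query, first_stage_retrieve, rerank, first_stage_k=50, final_n=5):
    # stage 1: cheap, high-recall retrieval; deliberately over-fetch candidates
    candidates = first_stage_retrieve(query, top_k=first_stage_k)
    # stage 2: more expensive, higher-precision scoring over the small candidate set
    reranked = rerank(query, candidates)  # assumed to return candidates sorted by relevance
    return reranked[:final_n]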

Covering the downsides of embedding-based retrieval is worth an entire series of blog posts. This blog post is an initial exploration of an alternative retrieval method and how it can (potentially) augment embedding-based retrieval methods.

LLM Retrieval and Reranking

Over the past week, we've developed a number of initial abstractions around the concept of "LLM-based" retrieval and reranking. At a high level, this approach uses the LLM to decide which document(s)/text chunk(s) are relevant to the given query. The input prompt consists of a set of candidate documents, and the LLM is tasked with selecting the relevant set of documents as well as scoring their relevance with an internal metric.

Simple diagram of how LLM-based retrieval works

An example prompt looks like the following:

A list of documents is shown below. Each document has a number next to it along with a summary of the document. A question is also provided.
Respond with the numbers of the documents you should consult to answer the question, in order of relevance, as well
as the relevance score. The relevance score is a number from 1-10 based on how relevant you think the document is to the question.
Do not include any documents that are not relevant to the question.
Example format:
Document 1:
<summary of document 1>
Document 2:
<summary of document 2>
...
Document 10:
<summary of document 10>
Question: <question>
Answer:
Doc: 9, Relevance: 7
Doc: 3, Relevance: 4
Doc: 7, Relevance: 3
Let's try this now:
{context_str}
Question: {query_str}
Answer:

The prompt format implies that the text for each document should be relatively concise. There are two ways of feeding the text corresponding to each document into the prompt:

  • You can directly feed in the raw text of the document. This works well if the document corresponds to a bite-sized text chunk.
  • You can feed in a condensed summary for each document. This is preferred if the document itself corresponds to a long piece of text. We do this under the hood with our new document summary index, but you can also choose to do it yourself.

Given a collection of documents, we can then create document "batches" and send each batch into the LLM input prompt. The output for each batch is the set of relevant documents + relevance scores within that batch. The final retrieval response aggregates relevant documents from all batches.
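To illustrate the batching and aggregation logic described above, here is a simplified sketch; the helper names and the LLM call are hypothetical stand-ins, and the real implementation lives inside LlamaIndex:

import re

def llm_rerank_sketch(query, documents, prompt_template, llm_complete, batch_size=5):
    results = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # build the numbered list of documents for this batch
        context_str = "\n".join(f"Document {i + 1}:\n{doc}" for i, doc in enumerate(batch))
        # fill in the prompt shown above and call the LLM (llm_complete is a stand-in)
        output = llm_complete(prompt_template.format(context_str=context_str, query_str=query))
        # parse lines like "Doc: 9, Relevance: 7" from the LLM output
        for doc_num, score in re.findall(r"Doc:\s*(\d+),\s*Relevance:\s*(\d+)", output):
            results.append((batch[int(doc_num) - 1], int(score)))
    # aggregate relevant documents from all batches, highest relevance first
    return sorted(results, key=lambda item: item[1], reverse=True)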

You can use our abstractions in two forms: as a standalone retriever module (ListIndexLLMRetriever) or as a reranker module (LLMRerank). The rest of this blog primarily focuses on the reranker module, given its speed/cost advantages.

LLM Retriever (ListIndexLLMRetriever)

This module is defined over a list index, which simply stores a set of nodes as a flat list. You can build the list index over a set of documents and then use the LLM retriever to retrieve the relevant documents from the index.

from llama_index import GPTListIndex
from llama_index.indices.list.retrievers import ListIndexLLMRetriever

# build a list index (a flat list of nodes) over the documents
index = GPTListIndex.from_documents(documents, service_context=service_context)

# high-level API
query_str = "What did the author do during his time in college?"
retriever = index.as_retriever(retriever_mode="llm")
nodes = retriever.retrieve(query_str)

# lower-level API
retriever = ListIndexLLMRetriever(index)
response_synthesizer = ResponseSynthesizer.from_args()
query_engine = RetrieverQueryEngine(retriever=retriever, response_synthesizer=response_synthesizer)
response = query_engine.query(query_str)

This could potentially be used in place of our vector store index: you use the LLM instead of embedding-based lookup to select the nodes.

LLM Reranker (LLMRerank)

This module is defined as part of our NodePostprocessor abstraction, which handles second-stage processing after an initial retrieval pass.

The postprocessor can be used on its own or as part of a RetrieverQueryEngine call. In the example below, we show how to use the postprocessor as an independent module after an initial retriever call from a vector index.

from llama_index.indices.query.schema import QueryBundle
from llama_index.retrievers import VectorIndexRetriever
from llama_index.indices.postprocessor import LLMRerank

query_bundle = QueryBundle(query_str)

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=vector_top_k,
)
retrieved_nodes = retriever.retrieve(query_bundle)

# configure reranker
reranker = LLMRerank(choice_batch_size=5, top_n=reranker_top_n, service_context=service_context)
retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)
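To sanity-check the output, you can print the reranked nodes and their scores; this quick sketch assumes the returned objects expose a score and the underlying node text, as the NodeWithScore objects did at the time of writing:

# quick inspection of the reranked results
for node_with_score in retrieved_nodes:
    print(f"score: {node_with_score.score}")
    print(node_with_score.node.get_text()[:200])
    print("---")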

There are certain limitations and caveats to LLM-based retrieval, especially with this initial version.

  • LLM-based retrieval is orders of magnitude slower than embedding-based retrieval. Embedding search over thousands or even millions of embeddings can take less than a second, while a single LLM prompt of 4,000 tokens to OpenAI can take minutes to complete.
  • Using third-party LLM APIs costs money.
  • The current approach to batching documents may not be optimal, because it relies on the assumption that document batches can be scored independently of each other. This lacks a global view of the ranking across all documents.

Using the LLM to retrieve and rank every node in the document corpus can be prohibitively expensive. This is why using the LLM as a second-stage reranking step, after a first-stage embedding pass, can be helpful.
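As a rough back-of-the-envelope illustration (the corpus size here is hypothetical): with choice_batch_size=5, reranking the top 40 embedding-retrieved candidates costs 40 / 5 = 8 LLM calls per query, whereas running LLM-based retrieval directly over a 10,000-chunk corpus would cost 10,000 / 5 = 2,000 calls. The first-stage embedding filter is what keeps the LLM cost bounded.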

Initial Experimental Results

Let's take a look at how well LLM reranking works!

We show some comparisons between naive top-k embedding-based retrieval and the two-stage retrieval pipeline with a first-stage embedding-retrieval filter and second-stage LLM reranking. We also showcase some results of pure LLM-based retrieval (though not as many, given that it tends to run a lot slower than either of the first two approaches).

We analyze results over two very different sources of data: the Great Gatsby and the 2021 Lyft SEC 10-K. We only analyze results over the "retrieval" portion and not synthesis, to better isolate the performance of the different retrieval methods.

The results are presented qualitatively. A definite next step would be more comprehensive evaluation over an entire dataset!

In our first example, we load in the Great Gatsby as a Document object and build a vector index over it (with the chunk size set to 512).

from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext, SimpleDirectoryReader, GPTVectorStoreIndex

# LLM Predictor (gpt-3.5-turbo) + service context
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=512)

# load documents and build the vector index
documents = SimpleDirectoryReader('../../../examples/gatsby/data').load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)

We then define a get_retrieved_nodes function. This function can either do just embedding-based retrieval over the index, or embedding-based retrieval + reranking.

def get_retrieved_nodes(
    query_str, vector_top_k=10, reranker_top_n=3, with_reranker=False
):
    query_bundle = QueryBundle(query_str)
    # configure retriever
    retriever = VectorIndexRetriever(
        index=index,
        similarity_top_k=vector_top_k,
    )
    retrieved_nodes = retriever.retrieve(query_bundle)

    if with_reranker:
        # configure reranker
        reranker = LLMRerank(
            choice_batch_size=5, top_n=reranker_top_n, service_context=service_context
        )
        retrieved_nodes = reranker.postprocess_nodes(retrieved_nodes, query_bundle)

    return retrieved_nodes

We then ask some questions, for instance about who was actually driving the car in the crash. With embedding-based retrieval we set k=3. With two-stage retrieval we set k=10 for embedding retrieval and n=3 for LLM-based reranking.
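Concretely, the two configurations correspond to calls like the following (the query string is a paraphrase for illustration; see the notebook linked at the end for the exact query):

# baseline: pure embedding-based retrieval with k=3
baseline_nodes = get_retrieved_nodes(
    "Who was actually driving the car during the crash?",
    vector_top_k=3,
    with_reranker=False,
)

# two-stage: embedding retrieval with k=10, then LLM reranking down to n=3
reranked_nodes = get_retrieved_nodes(
    "Who was actually driving the car during the crash?",
    vector_top_k=10,
    reranker_top_n=3,
    with_reranker=True,
)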

(For those of you who are not familiar with the Great Gatsby: the narrator finds out later on from Gatsby that Daisy was actually the one driving the car, but Gatsby takes the blame for her.)

The top retrieved contexts are shown in the images below. We see that with embedding-based retrieval, the top two texts contain the semantics of the car crash but give no details as to who was actually responsible. Only the third text contains the correct answer.

Retrieved context using top-k embedding lookup (baseline)

In contrast, the two-stage approach returns just one relevant context, and it contains the correct answer.

Retrieved context using two-stage pipeline (embedding lookup then rerank)

Next, we ask some questions over the 2021 Lyft SEC 10-K, specifically about COVID-19 impacts and responses. The Lyft SEC 10-K is 238 pages long, and a ctrl-f for "COVID-19" returns 127 matches.

We use a similar setup to the Gatsby example above. The main differences are that we set the chunk size to 128 instead of 512, we set k=5 for the embedding retrieval baseline, and we use an embedding k=40 and reranker n=5 for the two-stage approach.

We then ask about the initiatives the company is focusing on independently of COVID-19, and analyze the results.

Results for the baseline are shown in the image below. We see that the results at indices 0, 1, 3, and 4 are about measures taken directly in response to COVID-19, even though the question was specifically about company initiatives that were independent of the COVID-19 pandemic.

Retrieved context using top-k embedding lookup (baseline)

We get more relevant results with approach 2 by widening the top-k to 40 and then using the LLM to filter for the top-5 contexts. The independent company initiatives include "expansion of Light Vehicles" (1), "incremental investments in brand/marketing" (2), international expansion (3), and accounting for miscellaneous risks such as natural disasters and operational risks in terms of financial performance (4).

Retrieved context using two-stage pipeline (embedding lookup then rerank)

Conclusion

That's it for now! We've added some initial functionality to help support LLM-augmented retrieval pipelines, but of course there are a ton of future steps that we couldn't quite get to. Some questions we'd like to explore:

  • How our LLM reranking implementation compares to other reranking methods (e.g. BM25, Cohere Rerank, etc.)
  • What the optimal values of embedding top-k and reranking top-n are for the two-stage pipeline, accounting for latency, cost, and performance.
  • Exploring different prompts and text summarization methods to help determine document relevance.
  • Exploring whether there's a class of applications where LLM-based retrieval on its own would suffice, without embedding-based filtering (possibly over smaller document collections?).

Resources

You can play around with the notebooks yourself!

Great Gatsby Notebook

2021 Lyft 10-K Notebook
