RAG-ing Success: Guide to choose the right components for your RAG solution on AWS

Contents: Embedding component, Vector Store, Large Language Model, Conclusion


With the rise of Generative AI, Retrieval Augmented Generation (RAG) has become a very popular approach for harnessing the power of Large Language Models (LLMs). It simplifies the overall Generative AI approach while reducing the need to fine-tune or train an LLM from scratch. Some of the reasons why RAG has become so popular are:

  • You can reduce hallucinations, where the model tries to be “creative” and provides false information by making things up.
  • You can always get the latest information or answer on a topic or query without worrying about the training cut-off date of the foundation model.
  • You can avoid spending time, effort and money on the complex process of fine-tuning or training on your own data.
  • Your architecture becomes loosely coupled.

The diagram below depicts a simplified component architecture of RAG:

Credits: Vikesh Pandey

Looking at the diagram above, it has the following components:

  1. An embedding component converts all of the raw text into embeddings (vector representations of the text).
  2. A vector store stores all of the vector information and provides quick retrieval.
  3. The user submits a question via a chat interface. The query gets converted into embeddings by the same embedding component.
  4. The query embeddings are searched in the vector store. The vector store retrieves the relevant context (chunks which potentially contain the answer), and the query and context are sent to the LLM.
  5. The LLM reads the context along with the query and provides a precise answer.
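The five steps above can be sketched end to end in a few lines. This is only a toy illustration under loud assumptions: the `embed` function is a bag-of-words counter standing in for a real embedding model, `VectorStore` is an in-memory list standing in for a real vector engine, and the final LLM call is left out, so the sketch stops at the prompt that would be sent to the LLM. All names here are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """In-memory stand-in for a vector store (steps 1, 2 and 4)."""

    def __init__(self):
        self.items = []  # list of (embedding, chunk) pairs

    def add(self, chunk):
        self.items.append((embed(chunk), chunk))

    def search(self, query, k=1):
        query_vec = embed(query)  # step 3: same embedder for the query
        ranked = sorted(self.items, key=lambda it: cosine(query_vec, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(query, context_chunks):
    """Step 5 input: the query plus retrieved context, ready for an LLM."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


store = VectorStore()
store.add("Amazon S3 stores objects in buckets")
store.add("AWS Lambda runs code without servers")

question = "where does S3 store objects"
top = store.search(question, k=1)
prompt = build_prompt(question, top)
```

Because each stage sits behind a small function or class, swapping the toy embedder or store for a managed service should not change the overall flow, which is the loose coupling mentioned earlier.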

There are various ways in which each of the embedding, vector store and LLM components can be designed. Currently, there’s a lack of clear guidance on which tool, model or library to use for each of the components.

This blog is an attempt to provide guidance on how to select the right tool for each of the components of RAG while designing the solution on AWS.

Note: Since it’s hard to cover every tool out there for each of the components, the tools are chosen based on their popularity in RAG implementations on AWS.

The prices mentioned here are reference values based on default reference instance recommendations. They may change if a different instance is chosen. Also, single-AZ deployments have been assumed here. For production workloads, the recommendation would be a multi-AZ deployment. The calculations are based on standard instance pricing. The cost comes down when using reserved instances or any other discount plan.
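To make the pricing caveats concrete, here is a minimal sketch of the arithmetic behind such estimates. The hourly price of 0.25 USD and the 30% discount are hypothetical placeholders, not real AWS prices, and the multi-AZ case is simplified to "one copy of the instance per AZ".

```python
HOURS_PER_MONTH = 730  # common approximation: 8760 hours / 12 months


def monthly_cost(hourly_price, instance_count=1, az_count=1, discount=0.0):
    """Estimated monthly cost in USD for always-on instances."""
    on_demand = hourly_price * HOURS_PER_MONTH * instance_count * az_count
    return round(on_demand * (1.0 - discount), 2)


# Hypothetical numbers, for illustration only.
single_az = monthly_cost(0.25)
multi_az = monthly_cost(0.25, az_count=2)
reserved = monthly_cost(0.25, az_count=2, discount=0.30)
```

The same function can be reused for each component by plugging in the hourly price of whichever instance is actually chosen.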

With that said, let’s begin the comparative analysis, starting with the embedding component.

This component is responsible for converting raw natural language text into vectors, also known as embeddings. So let’s jump into the options.

  • BERT: The first embedding model chosen for comparison is BERT (base uncased). This model can be deployed directly from SageMaker JumpStart. It’s known to perform well at generating contextualized embeddings. These models are around 500 MB in size, so running one on Amazon SageMaker can be done easily. There are also various fine-tuned variants available, both for multilingual use cases and in different parameter sizes. It works well where multilingual support is needed and where teams are looking for more flexibility in their choice of models.
  • The second option is the most popular one, referenced in many RAG-based blogs on AWS. Again, it can be hosted on SageMaker with a single click from SageMaker JumpStart. It’s a fairly large model and hence requires more powerful compute, but it provides really high quality embeddings. In fact, it’s one of the top performing embedding models as per their evaluation results.
  • The third option is another very popular sentence embedder, provided through the Hugging Face library and deployable via SageMaker JumpStart. It is one of the smallest, yet very powerful, options on offer. Due to its small size, it can be quite cheap to run on a SageMaker endpoint, and it produces good quality embeddings. Apart from this, there are a number of fine-tuned versions available, even on domain-specific data.

Note: This model can very well run on AWS Lambda too, which can reduce the cost significantly. The reason for showing it as a SageMaker endpoint is to ensure parity of architecture, where any of the components can be replaced without any change in the design.

  • Amazon Titan Text Embedding: The next option is to use the Amazon Titan Text Embedding model via Amazon Bedrock, which is still in preview at the time of writing. Since Amazon Bedrock is serverless, it will be quite easy to use this model in a RAG solution with no operational overhead. The pricing information is not yet available but will be published once the service becomes Generally Available.
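For the SageMaker-hosted options above, calling the embedding endpoint from application code could look roughly like the sketch below. `invoke_endpoint` on the boto3 `sagemaker-runtime` client is a real call, but the request schema (`text_inputs`) and the response key (`embedding`) are assumptions that depend on the deployed model, and the endpoint name is hypothetical. A stub client is included so the shape of the exchange can be seen without an AWS account.

```python
import io
import json


def embed_with_sagemaker(texts, endpoint_name, client=None):
    """Return one embedding per input text from a SageMaker real-time endpoint."""
    if client is None:
        import boto3  # imported lazily so the function also works with a stub client

        client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"text_inputs": texts}).encode("utf-8"),  # assumed request schema
    )
    payload = json.loads(response["Body"].read())
    return payload["embedding"]  # assumed response key


class StubRuntime:
    """Offline stand-in that mimics the shape of an invoke_endpoint response."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        count = len(json.loads(Body)["text_inputs"])
        fake = {"embedding": [[0.0, 1.0]] * count}
        return {"Body": io.BytesIO(json.dumps(fake).encode("utf-8"))}


# "my-embedding-endpoint" is a hypothetical endpoint name.
vectors = embed_with_sagemaker(["hello", "world"], "my-embedding-endpoint", client=StubRuntime())
```

Keeping the embedder behind one function like this is one way to preserve the parity of architecture discussed above: moving to a different model, to AWS Lambda or to Amazon Bedrock would only change the body of `embed_with_sagemaker`.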

To summarize, this is how all the options stack up against each other.


The next component in the RAG architecture is the vector store, so let’s explore what options we have there.

This component stores all of the embeddings in a way that makes it easy to retrieve the relevant embeddings for a query. Let’s dive deep into the options.

Note: Since some of the options discussed here are not AWS-native offerings, we discuss Total Cost of Ownership (TCO), which includes not only the usual compute and storage costs but also the cost of maintaining the solution in terms of expert hours and other operational overheads around testing, environment setup, etc.

  • Amazon OpenSearch: Amazon OpenSearch Service is the managed offering for OpenSearch, a distributed, community-driven, Apache 2.0-licensed, 100% open-source search and analytics suite that can be used for storing and retrieving embeddings. Powered by the Apache Lucene search library, it supports a number of search and analytics capabilities such as k-nearest neighbors (k-NN) search, which is ideal for vector retrieval. You can quickly set it up via boto3 APIs or the AWS console. It scales very well in storage and compute via sharding and dedicated master nodes. General Python knowledge is required to set it up.
  • Amazon RDS with pgvector: Another promising option is using Amazon RDS (PostgreSQL) with the open-source pgvector extension. Just set up an RDS instance and install pgvector. As installing pgvector is a manual effort, there’s some operational overhead involved in keeping the extension up to date and, in some cases, tuning it. Also, an understanding of SQL along with Python will be needed to run this solution. So, the TCO will be a bit higher than for the previous option. It offers L2 distance, cosine distance and inner product as some of the techniques to find relevant embeddings.
  • FAISS: The most popular non-AWS choice for storing vectors. It’s open source and provides lightning-fast retrieval. It uses a myriad of approaches, including product quantization, HNSW and IVF. It scales very well, with GPU support available for super fast performance. But since it’s a manual installation, the total cost of ownership is quite high. As a developer, you need to set up the whole cluster for FAISS, update it, patch it, secure it and tune it as per your requirements. You can run it on EC2, Amazon ECS, Amazon EKS or any other persistent compute offering on AWS.
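To make the retrieval step more concrete for the first two options, the sketch below shows what a k-NN index and query body might look like for OpenSearch, and how the three pgvector distance operators map to SQL. The field, table and column names (`embedding`, `text`, `chunks`, `body`) are hypothetical, the `%s` placeholder assumes a psycopg-style driver, and the tiny dimension of 3 is only for readability. The three Python functions simply spell out what the pgvector operators compute.

```python
import math


# --- Amazon OpenSearch: k-NN index definition and query body as plain dicts ---
def knn_index_body(dimension, field="embedding"):
    """Index body with a knn_vector field of the given dimension."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dimension},
                "text": {"type": "text"},
            }
        },
    }


def knn_query_body(query_vector, k=3, field="embedding"):
    """Search body asking for the k nearest neighbours of query_vector."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(query_vector), "k": k}}},
    }


# --- pgvector: setup and one query per distance operator ---
PGVECTOR_SETUP = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (id bigserial PRIMARY KEY, body text, embedding vector(3));
"""

PGVECTOR_QUERIES = {
    "l2": "SELECT body FROM chunks ORDER BY embedding <-> %s::vector LIMIT 3;",
    "cosine": "SELECT body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 3;",
    "inner_product": "SELECT body FROM chunks ORDER BY embedding <#> %s::vector LIMIT 3;",
}


# --- What the three pgvector operators compute ---
def l2_distance(a, b):  # <->
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):  # <=>
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_inner_product(a, b):  # <#> (negated so that smaller means closer)
    return -sum(x * y for x, y in zip(a, b))
```

With the `opensearch-py` client, these bodies would typically be passed to `client.indices.create(index=..., body=...)` and `client.search(index=..., body=...)`. Smaller values rank first for all three pgvector operators, which is why the inner product is negated.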

Special mention: Amazon Kendra - an embedder and vector store in a single service

There is one more option which deserves a mention due to its ability to replace both the embedding and the vector DB components.

Amazon Kendra is a fully managed service that provides out-of-the-box semantic search capabilities for documents and passages. There is no need to deal with word embeddings, document chunking, a vector store, etc. Amazon Kendra provides the Retrieve API, designed for the RAG use case. It also comes with pre-built connectors to popular data sources such as Amazon Simple Storage Service (Amazon S3), SharePoint, Confluence and websites, and supports common document formats such as HTML, Word, PowerPoint, PDF, Excel and plain text files. Since it is a serverless experience replacing two components at once, the TCO can be quite low.
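As a rough illustration of how the Retrieve API replaces both the embedder and the vector store, the sketch below wraps the boto3 Kendra client’s `retrieve` call and returns just the passage text. The index ID is a hypothetical placeholder, and the stub client only mimics the `ResultItems` / `Content` shape of the response so the example can run offline.

```python
def retrieve_passages(query, index_id, client=None, page_size=3):
    """Return the text of the top passages Kendra finds for a query."""
    if client is None:
        import boto3  # imported lazily so the function also works with a stub client

        client = boto3.client("kendra")
    response = client.retrieve(IndexId=index_id, QueryText=query, PageSize=page_size)
    return [item["Content"] for item in response.get("ResultItems", [])]


class StubKendra:
    """Offline stand-in that mimics the shape of a Retrieve response."""

    def retrieve(self, IndexId, QueryText, PageSize):
        return {"ResultItems": [{"Content": "Passage about " + QueryText, "DocumentTitle": "stub"}]}


# "hypothetical-index-id" is a placeholder, not a real Kendra index.
passages = retrieve_passages("vector stores", "hypothetical-index-id", client=StubKendra())
```

The returned passages can then be placed directly into the LLM prompt as context, with no separate embedding or indexing code to maintain.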


And to finish this section, here is the summary of all the options we discussed:


Apart from the above, some honorable mentions which weren’t covered as part of the vector database comparison are Weaviate, Pinecone and Chroma, which also have good adoption in the developer community.

Next up is the last component of RAG, which is the LLM. Let’s explore the choice of LLMs on AWS.

This space is moving extremely fast, so I’ll cherry-pick the ones which are quite popular in RAG-based reference implementations on AWS.

Note: Some of the third-party models mentioned below can very well be accessed via their own APIs, but using them via AWS has the advantage that the models are hosted within AWS itself, which improves the security and networking posture.

Jurassic-2: The first option here is Jurassic-2 from AI21 Labs. It’s available via SageMaker JumpStart. You can deploy the model on a SageMaker endpoint within minutes from SageMaker Studio or via SageMaker APIs. Keep in mind that the developer experience here is not serverless. The cost is made up of the vendor’s model price and SageMaker instance pricing. Its biggest advantage is that it’s multilingual and has been rated as one of the top performing LLMs on Stanford’s HELM benchmark, so you can expect high quality output.


Another very popular choice is the Falcon model. It’s the top performing open source LLM on the Hugging Face leaderboard as of writing this blog. Because it’s open source, the software price becomes $0, which makes it a really compelling choice on AWS. You just have to pay for the instance. It’s ideal for Q&A and advanced information extraction. Keep in mind that the context length is around 2k tokens, which might be small for some use cases.
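A context length of around 2k tokens means the retrieved chunks have to be trimmed before they reach the model. One simple way to do that is sketched below, under clear assumptions: tokens are approximated by whitespace-separated words (a real tokenizer will usually count more), the chunks arrive sorted by relevance, and the function name and the 256-token reserve for the answer are hypothetical choices.

```python
def pack_context(chunks, question, max_tokens=2048, reserve_for_answer=256):
    """Keep the highest-ranked chunks that fit in the model's context window."""

    def count_tokens(text):
        return len(text.split())  # crude proxy for a real tokenizer

    budget = max_tokens - reserve_for_answer - count_tokens(question)
    kept = []
    for chunk in chunks:  # assumed to be sorted by relevance, best first
        cost = count_tokens(chunk)
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        kept.append(chunk)
        budget -= cost
    return kept


# Tiny numbers so the behaviour is easy to follow: 10 - 2 - 2 leaves 6 tokens.
kept = pack_context(["a b c", "d e f g", "h i"], "q1 q2", max_tokens=10, reserve_for_answer=2)
```

Stopping at the first chunk that does not fit keeps the context in strict relevance order; skipping it and trying smaller chunks further down is an equally valid design if filling the window matters more.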


Special mention: Amazon Bedrock - pick the LLM of your choice

Amazon Bedrock is the most unique offering on this list. It’s a fully serverless experience where a single Bedrock API can be used to invoke Amazon Titan models, Jurassic-2 from AI21 Labs, Claude from Anthropic and Stable Diffusion models from Stability AI. Bedrock lets you pick the model that’s best suited to your use case. It handles all of the infrastructure and scaling behind the scenes. Because it’s serverless, there is a strong case for the TCO being quite low.
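The “single API, many models” idea can be sketched as follows. `invoke_model` on the boto3 `bedrock-runtime` client is a real call, but everything model-specific here is an assumption that may differ by model version: the model IDs, the request body fields and the response fields are the ones commonly documented for Titan Text and Claude text completions, and should be checked against the current Bedrock documentation. A stub client is included so the sketch runs offline.

```python
import io
import json

# Per-model request builders and response parsers (assumed schemas, see lead-in).
MODELS = {
    "amazon.titan-text-express-v1": (
        lambda prompt: {"inputText": prompt, "textGenerationConfig": {"maxTokenCount": 256}},
        lambda payload: payload["results"][0]["outputText"],
    ),
    "anthropic.claude-v2": (
        lambda prompt: {"prompt": f"\n\nHuman: {prompt}\n\nAssistant:", "max_tokens_to_sample": 256},
        lambda payload: payload["completion"],
    ),
}


def generate(prompt, model_id, client=None):
    """Invoke any registered model through the same Bedrock call."""
    if client is None:
        import boto3  # imported lazily so the function also works with a stub client

        client = boto3.client("bedrock-runtime")
    build_body, parse = MODELS[model_id]
    response = client.invoke_model(
        modelId=model_id,
        contentType="application/json",
        accept="application/json",
        body=json.dumps(build_body(prompt)),
    )
    return parse(json.loads(response["body"].read()))


class StubBedrock:
    """Offline stand-in that mimics the shape of invoke_model responses."""

    def invoke_model(self, modelId, contentType, accept, body):
        if modelId.startswith("amazon."):
            fake = {"results": [{"outputText": "stub answer"}]}
        else:
            fake = {"completion": "stub answer"}
        return {"body": io.BytesIO(json.dumps(fake).encode("utf-8"))}


titan_answer = generate("What is RAG?", "amazon.titan-text-express-v1", client=StubBedrock())
claude_answer = generate("What is RAG?", "anthropic.claude-v2", client=StubBedrock())
```

Only the entry in `MODELS` changes when switching models; the calling code and the rest of the RAG pipeline stay the same.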


And to finish this section, here is the summary of all the options discussed in it:


To conclude, there is no universally correct answer to finding the best components for a RAG-based solution. And from a technical standpoint, there shouldn’t be one: every customer problem, environment and domain dataset is unique, so it would be sub-optimal to apply the same solution everywhere. This article attempts to compare different options but cannot (and does not) claim to be an exhaustive study of all aspects of all the options mentioned here. As a reader, you are advised to use your discretion and plenty of data while making architectural decisions for RAG.

If you liked what you read, please give a clap and share it within your network. As a technologist who loves to write, I will be sharing a lot more interesting articles here, so feel free to follow me here and on LinkedIn.

