Document-Oriented Agents: A Journey with Vector Databases, LLMs, Langchain, FastAPI, and Docker
Introduction
Vector Databases: The Essential Core of Semantic Search Applications
Constructing a Document-Oriented Agent
Experiment: Understanding the Effectiveness of Document-Oriented Agents
Conclusions
Large Language Models Chronicles: Navigating the NLP Frontier
References

Leveraging ChromaDB, Langchain, and ChatGPT: Enhanced Responses and Cited Sources from Large Document Databases

Document-oriented agents are starting to gain traction in the business landscape. Corporations increasingly leverage these tools to capitalize on internal documentation, enhancing their business processes. A recent McKinsey report [1] underscores this trend, suggesting that generative AI could boost the worldwide economy by $2.6–4.4 trillion annually and automate as much as 70% of current work activities. The study identifies customer support, sales and marketing, and software development as the principal sectors that will be most affected by the transformation. Much of this change stems from the fact that the data powering these areas within an organization can become far more accessible to both employees and customers through solutions such as document-oriented agents.

With current technology, we are still facing some challenges. Even if we consider the new Large Language Models (LLMs) with 100k token limits, the models still have limited context windows. While 100k tokens seems like a large number, it is tiny when we look at the size of the databases powering, for instance, a customer support department. Another problem that often arises is inaccuracy in model outputs. In this article, we provide a step-by-step guide to building a document-oriented agent that can handle documents of any size and deliver verifiable answers.

We use a vector database — ChromaDB — to augment our model's context length capabilities and Langchain to facilitate integrations between the different components of our architecture. As our LLM, we use OpenAI's ChatGPT. Since we want to serve our application, we use FastAPI to create endpoints for users to interact with our agent. Finally, our application is containerized using Docker, which allows us to easily deploy it in any kind of environment.

Figure 1: AI agents are getting smarter all the time (image source)

As always, the code is available on my GitHub.

Vector databases are essential to unlocking the power of generative AI. These types of databases are optimized to handle vector embeddings — data representations containing rich semantic information from the original data. Unlike traditional scalar-based databases, which struggle with the complexity of vector embeddings, vector databases index these embeddings, associating them with their source content and allowing for advanced features like semantic information retrieval and long-term memory in AI applications.
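To make the idea of an embedding concrete, here is a minimal sketch (not part of the application code) that encodes two sentences with the same all-MiniLM-L6-v2 model we use later in the article; the example sentences are purely illustrative.

from sentence_transformers import SentenceTransformer

# Load a compact sentence embedding model (produces 384-dimensional vectors)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Falcon-40B is an open-source large language model.",
    "ChromaDB stores and indexes vector embeddings.",
]

# Each sentence becomes a dense vector that captures its semantics
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)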

Vector databases are not the same as vector indices, such as Facebook's AI Similarity Search (FAISS) — which we already covered in this series in a previous article [2]. They allow data insertion, deletion, and updating, store associated metadata, and support real-time data updates without needing a full re-indexing — a time-consuming and computationally expensive process.
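As an illustration of this insert/update/delete capability, the following is a small sketch using the chromadb client directly, independent of the Langchain integration we build later; the collection name, ids, and documents are made up for the example.

import chromadb

client = chromadb.Client()
collection = client.create_collection(name="articles")

# Insert documents; Chroma embeds and indexes them along with their metadata
collection.add(
    ids=["doc1", "doc2"],
    documents=["Falcon-40B is an open-source LLM.", "FAISS is a vector index."],
    metadatas=[{"source": "falcon.txt"}, {"source": "faiss.txt"}],
)

# Update an existing record in place — no full re-indexing required
collection.update(ids=["doc2"], documents=["FAISS is a vector index, not a database."])

# Delete a record by id
collection.delete(ids=["doc1"])

# Query by semantic similarity
results = collection.query(query_texts=["open-source language models"], n_results=1)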

Rather than exact matches, vector databases employ similarity metrics to find the vectors closest to a query. They use Approximate Nearest Neighbor (ANN) search algorithms for optimized search. Some examples of such algorithms are Random Projection, Product Quantization, and Hierarchical Navigable Small World. These algorithms compress the original vectors, speeding up the query process. Additionally, similarity measures like cosine similarity, Euclidean distance, and dot product are used to compare and identify the most relevant results for a query.
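For reference, here is a small sketch of the three similarity measures mentioned above, computed with NumPy on two toy vectors (the vectors themselves are arbitrary):

import numpy as np

a = np.array([0.2, 0.5, 0.1])
b = np.array([0.1, 0.4, 0.3])

dot_product = np.dot(a, b)
euclidean_distance = np.linalg.norm(a - b)
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot_product, euclidean_distance, cosine_similarity)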

Figure 2 succinctly illustrates the similarity search process in vector databases. Starting with the ingestion of raw documents (i), the data is broken into manageable chunks (ii) and converted into vector embeddings (iii). These embeddings are indexed for quick retrieval (iv), and similarity metrics between the chunk vectors and the user query are computed (v). The process ends with the most relevant data chunks being output (vi), offering users insights aligned with their original query.

Figure 2: The similarity search process: i) ingestion of raw documents, ii) processing into chunks, iii) creation of embeddings, iv) indexing, v) computation of similarity metrics and, finally, vi) producing the output chunks (image by author)

We start by loading all the necessary models and data at server startup.

We load our data from a predefined directory and process it into manageable chunks. These chunks are sized so that we can pass them to the LLM once we get the results from the similarity search procedure. This process uses the DirectoryLoader to load documents into memory and the RecursiveCharacterTextSplitter to break them down into manageable chunks. It splits documents at the character level, with a default chunk size of 1,000 characters and a chunk overlap of 20 characters. The chunk overlap ensures there is contextual continuity between chunks, minimizing the risk of losing meaningful context at the chunk borders.

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


def load_docs(directory: str):
    """
    Load documents from the given directory.
    """
    loader = DirectoryLoader(directory)
    documents = loader.load()

    return documents


def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    """
    Split the documents into chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    docs = text_splitter.split_documents(documents)

    return docs

Then, we generate vector embeddings from these chunks using the SentenceTransformerEmbeddings method and index them in ChromaDB, our vector database. These embeddings are stored in the database and serve as our searchable data. The database does not live only in memory; notice that we persist it to disk, which reduces our memory overhead. Next, we load the chat model, specifically OpenAI's gpt-3.5-turbo, which serves as our LLM.

@app.on_event("startup")
async def startup_event():
"""
Load all of the mandatory models and data once the server starts.
"""
app.directory = '/app/content/'
app.documents = load_docs(app.directory)
app.docs = split_docs(app.documents)

app.embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
app.persist_directory = "chroma_db"

app.vectordb = Chroma.from_documents(
documents=app.docs,
embedding=app.embeddings,
persist_directory=app.persist_directory
)
app.vectordb.persist()

app.model_name = "gpt-3.5-turbo"
app.llm = ChatOpenAI(model_name=app.model_name)

app.db = Chroma.from_documents(app.docs, app.embeddings)
app.chain = load_qa_chain(app.llm, chain_type="stuff", verbose=True)

Finally, the “/query/{query}” endpoint receives user queries. It runs a similarity search on the database, using the query as input. If matching documents exist, they are fed into the LLM, and the answer is generated. The answer and the sources (the original documents and their metadata) are returned, ensuring that the provided information is easily verifiable.

@app.get("/query/{query}")
async def query_chain(query: str):
"""
Queries the model with a given query and returns the reply.
"""
matching_docs_score = app.db.similarity_search_with_score(query)
if len(matching_docs_score) == 0:
raise HTTPException(status_code=404, detail="No matching documents found")

matching_docs = [doc for doc, score in matching_docs_score]
answer = app.chain.run(input_documents=matching_docs, query=query)

# Prepare the sources
sources = [{
"content": doc.page_content,
"metadata": doc.metadata,
"score": score
} for doc, score in matching_docs_score]

return {"answer": answer, "sources": sources}

We containerize the application using Docker, which ensures isolation and environment consistency, regardless of the deployment platform. The Dockerfile below details our setup:

FROM python:3.9-buster
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 1010
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "1010"]

The application runs in a Python 3.9 environment, and we need to install all the necessary dependencies from a requirements.txt file:

langchain==0.0.221
uvicorn==0.22.0
fastapi==0.99.1
unstructured==0.7.12
sentence-transformers==2.2.2
chromadb==0.3.26
openai==0.27.8
python-dotenv==1.0.0

The application is then served through Uvicorn on port 1010.

Note that we need to configure the environment variables. Our application requires the OPENAI_API_KEY for the ChatOpenAI model. The best practice for sensitive information like API keys is to store them as environment variables rather than hardcoding them into the application.
We use the python-dotenv package to load environment variables from a .env file at the project root. In a production environment, we would want to use a more secure method, such as Docker secrets or a secure vault service.
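A minimal sketch of how that loading might look at the top of main.py, assuming a .env file at the project root containing a line such as OPENAI_API_KEY=...:

import os

from dotenv import load_dotenv

# Read the .env file at the project root and populate os.environ
load_dotenv()

if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("OPENAI_API_KEY is not set")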

The experiment's primary goal was to assess our document-oriented agent's effectiveness in providing comprehensive and accurate responses to user queries.

We use a series of our Medium articles as our knowledge base. These articles, covering a variety of AI and machine learning topics, are ingested and indexed in our Chroma vector database. The chosen articles were:

  1. “Whisper JAX vs PyTorch: Uncovering the Truth about ASR Performance on GPUs”
  2. “Testing the Massively Multilingual Speech (MMS) Model that Supports 1162 Languages”
  3. “Harnessing the Falcon 40B Model, the Most Powerful Open-Source LLM”
  4. “The Power of OpenAI’s Function Calling in Language Learning Models: A Comprehensive Guide”

The articles were broken into manageable chunks, converted into vector embeddings, and indexed in our database, thus forming the backbone of the agent’s knowledge.

The user query was executed by calling the API endpoint of our application, which is implemented using FastAPI and deployed via Docker. The query we used for the experiment was: “What is Falcon-40b and can I use it for commercial use?”.

curl --location 'http://0.0.0.0:1010/query/What is Falcon-40b and can I use it for commercial use'

In response to our query, the LLM explained what Falcon-40b is and confirmed that it can be used commercially. The information was backed up by four different source chunks, all coming from the article “Harnessing the Falcon 40B Model, the Most Powerful Open-Source LLM”. Each source chunk was also added to the response, as we saw above, so that the user can verify the original text supporting the LLM's answer. The chunks were also scored on their relevance to the query, which gives us an additional perspective on the importance of each section to the overall answer of the agent.

{
    "answer": "Falcon-40B is a state-of-the-art language model developed by the Technology Innovation Institute (TII). It is a transformer-based model that performs well on various language understanding tasks. The significance of Falcon-40B is that it is now available free of charge for commercial and research use, as announced by TII. This means that developers and researchers can access and modify the model according to their specific needs without any royalties. However, it is important to note that while Falcon-40B is available for commercial use, it is still trained on web data and may carry potential biases and stereotypes prevalent online. Therefore, appropriate mitigation strategies should be implemented when using Falcon-40B in a production environment.",
    "sources": [
        {
            "content": "This is where the significance of Falcon-40B lies. In the end of last week, the Technology Innovation Institute (TII) announced that Falcon-40B is now free of royalties for commercial and research use. Thus, it breaks down the barriers of proprietary models, giving developers and researchers free access to a state-of-the-art language model that they can use and modify according to their specific needs.\n\nTo add to the above, the Falcon-40B model is now the top performing model on the OpenLLM Leaderboard, outperforming models like LLaMA, StableLM, RedPajama, and MPT. This leaderboard aims to track, rank, and evaluate the performance of various LLMs and chatbots, providing a clear, unbiased metric of their capabilities. Figure 1: Falcon-40B is dominating the OpenLLM Leaderboard (image source)\n\nAs always, the code is available on my Github. How was Falcon LLM developed?",
            "metadata": {
                "source": "/app/content/Harnessing the Falcon 40B Model, the Most Powerful Open-Source LLM.txt"
            },
            "score": 1.045290231704712
        },
        {
            "content": "The decoder-block in Falcon-40B features a parallel attention/MLP (Multi-Layer Perceptron) design with two-layer normalization. This structure offers benefits in terms of model scaling and computational speed. Parallelization of the attention and MLP layers improves the model’s ability to process large amounts of data simultaneously, thereby reducing the training time. Additionally, the implementation of two-layer normalization helps in stabilizing the learning process and mitigating issues related to the internal covariate shift, resulting in a more robust and reliable model. Implementing Chat Capabilities with Falcon-40B-Instruct\n\nWe are using the Falcon-40B-Instruct, which is the new variant of Falcon-40B. It is basically the same model but fine tuned on a mixture of Baize. Baize is an open-source chat model trained with LoRA, a low-rank adaptation of large language models. Baize uses 100k dialogs of ChatGPT chatting with itself and also Alpaca’s data to improve its performance.",
            "metadata": {
                "source": "/app/content/Harnessing the Falcon 40B Model, the Most Powerful Open-Source LLM.txt"
            },
            "score": 1.319214940071106
        },
        {
            "content": "One of the core differences on the development of Falcon was the quality of the training data. The size of the pre-training data for Falcon was nearly five trillion tokens gathered from public web crawls, research papers, and social media conversations. Since LLMs are particularly sensitive to the data they are trained on, the team built a custom data pipeline to extract high-quality data from the pre-training data using extensive filtering and deduplication.\n\nThe model itself was trained over the course of two months using 384 GPUs on AWS. The result is an LLM that surpasses GPT-3, requiring only 75% of the training compute budget and one-fifth of the compute at inference time.",
            "metadata": {
                "source": "/app/content/Harnessing the Falcon 40B Model, the Most Powerful Open-Source LLM.txt"
            },
            "score": 1.3254718780517578
        },
        {
            "content": "Falcon-40B is English-centric, but also includes German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish language capabilities. Be mindful that as with any model trained on web data, it carries the potential risk of reflecting the biases and stereotypes prevalent online. Therefore, please assess these risks adequately and implement appropriate mitigation strategies when using Falcon-40B in a production environment. Model Architecture and Objective\n\nFalcon-40B, as a member of the transformer-based models family, follows the causal language modeling task, where the goal is to predict the next token in a sequence of tokens. Its architecture fundamentally builds upon the design principles of GPT-3 [1], with a couple of necessary tweaks.",
            "metadata": {
                "source": "/app/content/Harnessing the Falcon 40B Model, the Most Powerful Open-Source LLM.txt"
            },
            "score": 1.3283030986785889
        }
    ]
}

In this article, we built a solution to overcome the challenges of handling large-scale documents in AI systems, leveraging vector databases and a set of open-source tools. Our approach employs ChromaDB and Langchain with OpenAI's ChatGPT to build a capable document-oriented agent.

Our approach enables the agent to answer complex queries by searching and processing chunks of text from large-scale databases — in our case, a series of Medium articles on various AI topics. In addition to the agent's answers, we also returned the chunks of the original documents used to support the LLM's claims, together with their similarity scores against the user's query. This is an important feature, since these agents can sometimes provide inaccurate information.

This article belongs to “Large Language Models Chronicles: Navigating the NLP Frontier”, a new weekly series of articles that will explore how to leverage the power of large models for various NLP tasks. By diving into these cutting-edge technologies, we aim to empower developers, researchers, and enthusiasts to harness the potential of NLP and unlock new possibilities.

Articles published so far:

  1. Summarizing the latest Spotify releases with ChatGPT
  2. Master Semantic Search at Scale: Index Millions of Documents with Lightning-Fast Inference Times using FAISS and Sentence Transformers
  3. Unlock the Power of Audio Data: Advanced Transcription and Diarization with Whisper, WhisperX, and PyAnnotate
  4. Whisper JAX vs PyTorch: Uncovering the Truth about ASR Performance on GPUs
  5. Vosk for Efficient Enterprise-Grade Speech Recognition: An Evaluation and Implementation Guide
  6. Testing the Massively Multilingual Speech (MMS) Model that Supports 1162 Languages
  7. Harnessing the Falcon 40B Model, the Most Powerful Open-Source LLM
  8. The Power of OpenAI’s Function Calling in Language Learning Models: A Comprehensive Guide

[1] https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#introduction

[2] Master Semantic Search at Scale: Index Millions of Documents with Lightning-Fast Inference Times using FAISS and Sentence Transformers

Keep in touch: LinkedIn
