Build a ChatGPT with your Private Data using LlamaIndex and MongoDB
Summary
Background
Introduction to LlamaIndex
MongoDB as the Datastore
MongoDB Atlas
Use of LLMs
The workflow
Getting questions answered over your private data
Relevant Resources

LlamaIndex Blog

  • Prakul Agarwal — Senior Product Manager, Machine Learning at MongoDB
  • Jerry Liu — co-founder at LlamaIndex

Large Language Models (LLMs) like ChatGPT have revolutionized the way users can get answers to their questions. However, the “knowledge” of LLMs is limited by what they were trained on, which for ChatGPT means publicly available information on the web until September 2021. How can LLMs answer questions using private knowledge sources like your organization’s data and unlock their true transformative power?

This blog will discuss how LlamaIndex and MongoDB can help you achieve this outcome quickly. The attached notebook provides a code walkthrough on how to query any PDF document using English queries.

Traditionally, AI has been used to analyze data, identify patterns, and make predictions based on existing data. Recent advancements have made AI much better at generating new things, rather than just analyzing existing things. This is known as Generative AI. Generative AI is powered mainly by machine learning models called Large Language Models (LLMs). LLMs are pre-trained on large quantities of publicly available text. There are many proprietary LLMs from companies like OpenAI, Cohere, and AI21, as well as a number of emerging open-source LLMs like Llama, Dolly, etc.

There are two main scenarios where the knowledge of LLMs falls short:

  • Private data, such as your organization’s internal knowledge base spread across PDFs, Google Docs, Wiki pages, and applications like Salesforce and Slack
  • Data newer than when the LLMs were last trained. Example query: Who is the most recent UK prime minister?

There are two main paradigms currently for extending the reasoning and generation capabilities of LLMs: model fine-tuning and in-context learning.

Model fine-tuning can be more complex and expensive to operationalize. There are also open questions, such as how to delete information from a fine-tuned model to ensure compliance with local laws (e.g., GDPR in Europe), and the need to fine-tune again whenever the underlying data changes.

In-context learning requires inserting the new data as part of the input prompts to the LLM. Performing this data augmentation in a secure, high-performance, and cost-effective manner is where tools like LlamaIndex and the MongoDB developer data platform can help.
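The core idea of in-context learning can be sketched in a few lines of plain Python: retrieved text is placed into the prompt ahead of the question, so the LLM answers from the supplied context rather than from its training data alone. The template below is illustrative only, not LlamaIndex’s actual internal prompt:

```python
def build_augmented_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble an in-context-learning prompt: retrieved private data
    is prepended to the user's question as grounding context."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# The retrieved chunk supplies knowledge newer than the LLM's training cutoff.
prompt = build_augmented_prompt(
    "Who is the most recent UK prime minister?",
    ["News article: Rishi Sunak became UK prime minister in October 2022."],
)
print(prompt)
```

The final prompt (context plus question) is what actually gets sent to the LLM; the retrieval tooling’s job is to pick the few chunks most relevant to the question so the prompt stays within the model’s context window.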

LlamaIndex provides a simple, flexible interface to connect LLMs with external data.

  • Offers data connectors to various data sources and data formats (APIs, PDFs, docs, etc).
  • Provides indices over unstructured and structured data for use with LLMs.
  • Structures external information so that it can be used within the prompt window limitations of any LLM.
  • Exposes a query interface that takes in an input prompt and returns a knowledge-augmented output.

It’s easy to store the ingested documents (i.e., Node objects), index metadata, etc. in MongoDB using the built-in abstractions in LlamaIndex. There’s an option to store the “documents” as a collection in MongoDB using MongoDocumentStore, and an option to persist the “indexes” using MongoIndexStore.

Storing LlamaIndex’s documents and indexes in a database becomes necessary in a few scenarios:

  1. Use cases with large datasets may require more than in-memory storage.
  2. Ingesting and processing data from various sources (for instance, PDFs, Google Docs, Slack).
  3. The requirement to continuously incorporate updates from the underlying data sources.

Persisting this data enables processing it once and then querying it for various downstream applications.

LlamaIndex Storage Architecture

MongoDB offers a free-forever Atlas cluster in the public cloud provider of your choice. This can be set up very quickly by following this tutorial. Or you can start directly here.

LlamaIndex uses LangChain’s (another popular framework for building Generative AI applications) LLM modules and allows you to customize the underlying LLM (the default being OpenAI’s text-davinci-003 model). The chosen LLM is always used by LlamaIndex to construct the final answer and is sometimes used during index creation as well.

  1. Connect private knowledge sources using LlamaIndex connectors (offered through LlamaHub).
  2. Load in the Documents. A Document is a lightweight container around the data source.
  3. Parse the Document objects into Node objects. Nodes represent “chunks” of the source Documents (e.g., a text chunk). These Node objects can be persisted to a MongoDB collection or kept in memory.
  4. Build an Index from the Nodes. There are many types of indexes in LlamaIndex, such as the “List Index” (which stores Nodes as a sequential chain) and the “Vector Store Index” (which stores each Node and a corresponding embedding in a vector store). Depending on the type, an index can be persisted into a MongoDB collection or a vector database.
  5. Finally, query the index. This is where the query is parsed, relevant Nodes are retrieved through the use of the indexes, and provided as input to the LLM. Different types of queries can use different indexes.
LlamaIndex + MongoDB Workflow Diagram
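To make the workflow concrete, here is a self-contained toy version in plain Python. A hashed bag-of-words “embedding” and a brute-force vector index stand in for a real embedding model and vector store, so only the shape of steps 3–5 matches LlamaIndex, not its internals:

```python
import hashlib
import math

def chunk(text: str, size: int = 10) -> list[str]:
    """Step 3: split a Document's text into Node-sized chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: bag-of-words hashed into a fixed-size vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorIndex:
    """Step 4: store (node, embedding) pairs; step 5: retrieve by similarity."""
    def __init__(self, nodes: list[str]):
        self.entries = [(node, embed(node)) for node in nodes]

    def retrieve(self, query: str, top_k: int = 1) -> list[str]:
        qv = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(qv, e[1]), reverse=True)
        return [node for node, _ in ranked[:top_k]]

# Steps 1-2 (connect + load) collapsed to a single in-memory "document":
doc = ("GPT-4 scores in the top 10 percent on the bar exam. "
       "The model was trained with RLHF.")
index = ToyVectorIndex(chunk(doc))
best = index.retrieve("How does GPT-4 do on the bar exam?")[0]
print(best)
```

In the real workflow the retrieved Nodes would then be stuffed into the LLM prompt for answer synthesis, and both the Nodes and the index would be persisted to MongoDB rather than kept in a Python list.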

We want to query the “GPT-4 Technical Report” published by OpenAI in March 2023. This is a PDF document with 100 pages. It is a recent publication, so it was not included as part of the original ChatGPT training data.

Here is a summary of the various queries we can ask the PDF.

“How does GPT4 do on the bar exam?”

“GPT-4 performs well on the Uniform Bar Exam, with a score in the top 10% of test takers (Table 1, Figure 4).”

Total embedding token usage: 18 tokens
Total LLM token usage: 1889 tokens

“How much better is GPT-4 at reducing hallucinations than GPT-3.5?”

“GPT-4 improves on the newest GPT-3.5 model by 19 percentage points, with significant gains across all topics.”

“What issues were observed after fine-tuning GPT-4 with RLHF?”

“After fine-tuning GPT-4 with RLHF, issues observed included the model becoming overly cautious in certain ways, such as refusing innocuous requests and excessively hedging or “overrefusing”. Additionally, the model was still quite brittle and sometimes exhibited undesired behaviors based on prompts where instructions to labelers were underspecified.”

“What’s RBRM?”

“RBRM stands for Rule-Based Reward Model. It is a technique used to provide an additional reward signal to the GPT-4 policy model during PPO fine-tuning on a subset of training prompts. The RBRM takes three things as input: the prompt (optional), the output from the policy model, and a human-written rubric (e.g., a set of rules in multiple-choice style) for how this output should be evaluated. The RBRM then classifies the output based on the rubric.”

The screenshots below show how the PDF document is converted into “LlamaIndex nodes” and “LlamaIndex indices” and persisted into MongoDB.

Further details can be found here. Also check out the reference notebook below!

Reading data from MongoDB: link

Various Indexes in LlamaIndex: link

https://colab.research.google.com/drive/1SNIeLW38Nvx6MtL3-_LPS2XTIzqD4gS6?usp=sharing
