Leverage LLMs Like GPT to Analyze Your Documents or Transcripts

Artificial Intelligence


admin

April 2, 2023

tl;dr
Prerequisites
Loading the information
Processing the information
Asking questions
Conclusion
Sources

Use prompt engineering to analyze your documents with langchain and openai in a ChatGPT-like way

(Original) photo by Laura Rivera on Unsplash.

ChatGPT is unquestionably one of the most popular Large Language Models (LLMs). Since the release of its beta version at the end of 2022, everyone can use the convenient chat function to ask questions or interact with the language model.

The goal of this article is to show you how to leverage LLMs like GPT to analyze your documents or transcripts, and then ask questions and receive answers about their content in a ChatGPT-like way.

Before writing any code, we have to make sure that all the necessary packages are installed, API keys are created, and configurations are set.

API key

To use ChatGPT, you must first create an OpenAI API key. The key can be created under this link and then by clicking on the + Create new secret key button.

Generally, OpenAI charges you per 1,000 tokens. Tokens are the result of processed text and can be words or chunks of characters. The prices per 1,000 tokens vary per model (e.g., $0.002 / 1K tokens for gpt-3.5-turbo). More details about the pricing options can be found here.
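To get a feeling for the costs, the pricing rule above can be turned into a small calculation. This is only a rough sketch: the function name and the token count of the example transcript are made up for illustration, and the price is the one quoted above, which may have changed in the meantime.

```python
def estimate_cost_usd(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Return the approximate cost in USD for processing num_tokens tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


# Price quoted in the text for gpt-3.5-turbo (assumption: still valid).
GPT_35_TURBO_PRICE_PER_1K = 0.002

# Hypothetical example: a transcript of roughly 12,000 tokens.
cost = estimate_cost_usd(12_000, GPT_35_TURBO_PRICE_PER_1K)
print(f"Estimated cost: ${cost:.4f}")
```

Keep in mind that both the prompt and the generated answer count towards the number of tokens.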

The good thing is that, at the time of writing, OpenAI grants you a free trial usage of $18 without requiring any payment information. An overview of your current usage can be seen in your account.

Installing the OpenAI package

We also have to install the official OpenAI package by running the following command:

pip install openai

Since OpenAI needs a (valid) API key, we also have to set the key as an environment variable:

import os
os.environ["OPENAI_API_KEY"] = ""
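If you prefer not to hardcode the key in the script, a small helper can read it from the environment and fail early when it is missing. This is just a sketch; the helper name get_openai_api_key is made up for this example and is not part of the openai package.

```python
import os


def get_openai_api_key() -> str:
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set or empty")
    return key
```

Calling this helper once at the start of the script makes a missing key visible before any request is sent.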

Installing the langchain package

With the tremendous rise of interest in Large Language Models (LLMs) in late 2022 (release of ChatGPT), a package named LangChain appeared around the same time.

LangChain is a framework built around LLMs like ChatGPT. The aim of this package is to assist in the development of applications that combine LLMs with other sources of computation or knowledge. It covers application areas like Question Answering over specific documents, Chatbots, and Agents. More information can be found in the documentation.

The package can be installed with the following command:

pip install langchain

Prompt Engineering

You might be wondering what Prompt Engineering is. It is possible to fine-tune GPT-3 by creating a custom model trained on the documents you would like to analyze. However, besides the costs for training, we would also need a lot of high-quality examples, ideally vetted by human experts (according to the documentation).

This would be overkill for just analyzing our documents or transcripts. So instead of training or fine-tuning a model, we pass the text that we would like to analyze (commonly referred to as the prompt) to it. Producing or creating such high-quality prompts is called Prompt Engineering.
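To make the idea more tangible, here is a minimal sketch of how a prompt could be assembled from a piece of text and a question. The template wording and the function name build_prompt are assumptions for illustration only; this is not the prompt langchain uses internally.

```python
# Hypothetical prompt template; the exact wording is an assumption.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n"
    "If the answer is not in the context, say that you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the text to analyze and the user's question."""
    return PROMPT_TEMPLATE.format(
        context=context.strip(), question=question.strip()
    )


prompt = build_prompt(
    "LangChain is a framework built around LLMs.",
    "What is LangChain?",
)
print(prompt)
```

The model never gets trained on the document; it only sees the text as part of the prompt.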

A great article for further reading about Prompt Engineering can be found here.

Depending on your use case, langchain offers you many “loaders” like Facebook Chat, PDF, or DirectoryLoader to load or read your (unstructured) text (files). The package also comes with a YoutubeLoader to transcribe YouTube videos.

The following examples focus on the DirectoryLoader and YoutubeLoader.

Read text files with DirectoryLoader

from langchain.document_loaders import DirectoryLoader

# Load all .txt files from the given directory and split them into chunks
loader = DirectoryLoader("", glob="*.txt")
docs = loader.load_and_split()

The DirectoryLoader takes the path as its first argument and a glob pattern as its second to find the documents or document types we are looking for. In our case we would load all text files (.txt) in the same directory as the script. The load_and_split function then initiates the loading.

Even though we might only load one text document, it makes sense to split it in case we have a large file and to avoid a NotEnoughElementsException (a minimum of four documents is needed). More information can be found here.
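To illustrate what splitting means, here is a simplified character-based sketch. It is not how load_and_split works internally (langchain uses its own text splitters); the function name and the default sizes are assumptions for this example.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that sentences cut at
    a chunk boundary still appear in full in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

A long file therefore turns into several smaller documents, which is also what helps to reach the minimum number of documents mentioned above.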

Transcribe youtube videos with YoutubeLoader

LangChain comes with a YoutubeLoader module, which makes use of the youtube_transcript_api package. This module gathers the (generated) subtitles for a given video.

Not every video comes with its own subtitles. In these cases, auto-generated subtitles are available. However, in some cases they are of poor quality. Using Whisper to transcribe the audio files could then be an alternative.

The code below takes the video id and a language (default: en) as parameters.

from langchain.document_loaders import YoutubeLoader

# Fetch the (generated) subtitles of the video and split them into chunks
loader = YoutubeLoader(video_id="XYZ", language="en")
docs = loader.load_and_split()

Before we proceed…

In case you choose to go with transcribed YouTube videos, consider cleaning up, e.g., Latin1 characters (\xa0) first. In the question-answering part, I experienced differences in the answers depending on which format of the same source I used.
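Such a cleaning step could look like the following sketch. The function name clean_text is made up for this example, and collapsing all whitespace (including line breaks) into single spaces is a design choice that may not fit every source.

```python
def clean_text(text: str) -> str:
    """Replace non-breaking spaces and collapse runs of whitespace."""
    # \xa0 is the Latin1 non-breaking space often found in transcripts.
    text = text.replace("\xa0", " ")
    # split() without arguments splits on any whitespace, including newlines.
    return " ".join(text.split())


cleaned = clean_text("Hello\xa0world,\xa0 this  is\n a test")
print(cleaned)
```

The cleaned text can then be used in place of the raw transcript before the documents are processed further.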

LLMs like GPT can only handle a certain number of tokens. These limitations are important when working with large(r) documents. In general, there are three ways of dealing with these limitations. One is to make use of embeddings or a vector space engine. A second way is to try out different chaining methods like map-reduce or refine. And a third one is a combination of both.

A great article that provides more details about the different chaining methods and the use of a vector space engine can be found here. Also keep in mind: the more tokens you use, the more you get charged.

In the following we combine embeddings with the chaining method stuff, which “stuffs” all documents into one single prompt.
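Conceptually, “stuffing” can be sketched as follows. This is a simplified illustration and not langchain's implementation; the function names are made up, and the rule of thumb of roughly four characters per token is only a coarse assumption for English text.

```python
from typing import List


def stuff_documents(documents: List[str], question: str) -> str:
    """Put all documents into one single prompt (the idea behind 'stuff')."""
    context = "\n\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def rough_token_count(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


stuffed_prompt = stuff_documents(
    ["First chunk of the transcript.", "Second chunk of the transcript."],
    "What is the transcript about?",
)
print(rough_token_count(stuffed_prompt))
```

The sketch also shows the weakness of this method: the more documents are stuffed into the prompt, the closer it gets to the token limit of the model.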

First we ingest our transcript (docs) into a vector space by using OpenAIEmbeddings. The embeddings are then stored in an in-memory embeddings database called Chroma.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Embed the documents and store the vectors in an in-memory Chroma database
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(docs, embeddings)
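To give an intuition of what happens in such a vector space, here is a toy sketch of a similarity search with hand-made two-dimensional vectors. Real embeddings have far more dimensions, and the vector store may use a different distance measure than the cosine similarity assumed here; all names are made up for this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, similarity) of the k documents closest to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" of three documents and one query.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k([1.0, 0.1], doc_vectors, k=2)
print([index for index, _ in best])
```

Only the most similar documents are later passed to the model, which keeps the prompt short.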

After that, we define the model we would like to use to analyze our data. In this case we choose gpt-3.5-turbo. A full list of available models can be found here. The temperature parameter defines the sampling temperature. Higher values lead to more random outputs, while lower values make the answers more focused and deterministic.

Last but not least we use the RetrievalQA (Question/Answer) Retriever and set the respective parameters (llm, chain_type, retriever).
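The effect of the temperature can be illustrated with a small sketch. It assumes the common approach of dividing the model's scores (logits) by the temperature before applying a softmax; how the API applies it in detail is not covered here, and the scores below are made up.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Turn scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [logit / temperature for logit in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
scores = [2.0, 1.0, 0.5]
probs_low = softmax_with_temperature(scores, 0.2)
probs_high = softmax_with_temperature(scores, 1.0)
print(probs_low)
print(probs_high)
```

With the lower temperature almost all probability goes to the highest-scoring token, which is why the answers become more focused.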

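from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Chat model with a low temperature for focused answers
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)

# Question-answering chain that "stuffs" the retrieved documents into one prompt
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=docsearch.as_retriever())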

Now we are ready to ask the model questions about our documents. The code below shows how to define the query.

query = "What are the three most vital points within the text?"
qa.run(query)

What to do with incomplete answers?

In some cases you might experience incomplete answers. The answer text just stops after a few words.

The reason for an incomplete answer is most likely the token limitation. If the provided prompt is quite long, the model does not have that many tokens left to give a (full) answer. One way of handling this could be to switch to a different chain_type like refine.

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)

# Same setup as before, but the answer is refined document by document
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="refine",
                                 retriever=docsearch.as_retriever())

However, I experienced that when using a chain_type other than stuff, I get less concrete results. Another way of handling these issues is to rephrase the question and make it more concrete.
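For a better understanding of the refine method mentioned above, here is a conceptual sketch. It is not langchain's implementation: the prompts are made up, and the llm argument is a stand-in for any function that takes a prompt and returns a text.

```python
from typing import Callable, List


def refine_answer(chunks: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Answer with the first chunk, then refine the answer chunk by chunk."""
    if not chunks:
        return ""
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}\nAnswer:")
    for chunk in chunks[1:]:
        # Each call only sees one chunk plus the answer so far,
        # so every single prompt stays short.
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"Additional context:\n{chunk}\n"
            "Refine the existing answer if the additional context helps; "
            "otherwise repeat it."
        )
    return answer


# Stand-in for a real model: it records the prompts and returns a numbered answer.
calls: List[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer-{len(calls)}"


final_answer = refine_answer(["chunk a", "chunk b", "chunk c"], "What is it about?", fake_llm)
print(final_answer)
```

The sketch makes the trade-off visible: each prompt is smaller than with stuff, but the model is called once per chunk, which takes longer and also uses tokens for every call.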