Leverage LLMs Like GPT to Analyze Your Documents or Transcripts

admin

March 31, 2023

Use prompt engineering to analyze your documents with langchain and openai in a ChatGPT-like way

(Original) photo by Laura Rivera on Unsplash.

ChatGPT is definitely one of the most popular Large Language Models (LLMs). Since the release of its beta version at the end of 2022, everyone can use the convenient chat function to ask questions or interact with the language model.

But what if we would like to ask ChatGPT questions about our own documents or about a podcast we just listened to?

The goal of this article is to show you how to leverage LLMs like GPT to analyze our documents or transcripts and then ask questions and receive answers in a ChatGPT-like way about the content of the documents.

Before writing any code, we have to make sure that all the necessary packages are installed, API keys are created, and configurations are set.

API key

To use ChatGPT one must create an OpenAI API key first. The key can be created under this link and then by clicking on the + Create new secret key button.

Nothing is free: Generally OpenAI charges you per 1,000 tokens. Tokens are the result of processed text and can be words or chunks of characters. The prices per 1,000 tokens vary per model (e.g., $0.002 / 1K tokens for gpt-3.5-turbo). More details about the pricing options can be found here.
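
To get a feeling for the numbers, the pricing rule above can be turned into a small calculation. This is only a rough sketch: the function name estimate_cost and the 12,000-token example are made up for illustration, and the default price is the gpt-3.5-turbo figure quoted above, which may have changed since.

```python
def estimate_cost(num_tokens: int, price_per_1k_tokens: float = 0.002) -> float:
    """Return the estimated cost in USD for processing `num_tokens` tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


# Hypothetical example: a transcript of roughly 12,000 tokens.
cost = estimate_cost(12_000)
print(f"Estimated cost: ${cost:.4f}")
```

Keep in mind that both the tokens you send (the prompt) and the tokens you receive (the answer) count toward the bill.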

The nice thing is that OpenAI grants you a free trial usage of $18 without requiring any payment information. An overview of your current usage can be seen in your account.

Installing the OpenAI package

We also have to install the official OpenAI package by running the following command:

pip install openai

Since OpenAI needs a (valid) API key, we also have to set the key as an environment variable:

import os
os.environ["OPENAI_API_KEY"] = ""

Installing the langchain package

With the tremendous rise of interest in Large Language Models (LLMs) in late 2022 (release of ChatGPT), a package named LangChain appeared around the same time.

LangChain is a framework built around LLMs like ChatGPT. The aim of this package is to assist in the development of applications that combine LLMs with other sources of computation or knowledge. It covers application areas like Question Answering over specific documents (the goal of this article), Chatbots, and Agents. More information can be found in the documentation.

The package can be installed with the following command:

pip install langchain

Prompt Engineering

You might be wondering what Prompt Engineering is. It is possible to fine-tune GPT-3 by creating a custom model trained on the documents you would like to analyze. However, besides the costs for training, we would also need a lot of high-quality examples, ideally vetted by human experts (according to the documentation).

This would be overkill for just analyzing our documents or transcripts. So instead of training or fine-tuning a model, we pass the text that we would like to analyze to it (commonly referred to as the prompt). Producing or creating such high-quality prompts is called Prompt Engineering.
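
To make the idea more tangible, here is a minimal sketch of what such a prompt could look like when built by hand. The template wording and the names PROMPT_TEMPLATE and build_prompt are my own assumptions for illustration. They are not the prompts LangChain uses internally.

```python
# Hypothetical template: the context (our document text) and the question
# are inserted into a fixed instruction that is sent to the model as one prompt.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question at the end. "
    "If the answer is not in the context, say that you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> str:
    """Fill the template with the document text and the user's question."""
    return PROMPT_TEMPLATE.format(context=context.strip(), question=question.strip())


print(build_prompt("LangChain is a framework built around LLMs.", "What is LangChain?"))
```

The model never gets retrained here. Everything it knows about our documents comes from the context part of the prompt.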

Note: A good article for further reading about Prompt Engineering can be found here.

Depending on your use case, langchain offers you multiple “loaders” like Facebook Chat, PDF, or DirectoryLoader to load or read your (unstructured) text (files). The package also comes with a YoutubeLoader to transcribe youtube videos.

The following examples focus on the DirectoryLoader and YoutubeLoader.

Read text files with DirectoryLoader

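from langchain.document_loaders import DirectoryLoader

# First argument: directory path; second: glob pattern matching the files to load
loader = DirectoryLoader("", glob="*.txt")
# Load the matching files and split them into smaller document chunks
docs = loader.load_and_split()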

The DirectoryLoader takes the path as its first argument and, as its second, a pattern to find the documents or document types we are looking for. In our case we would load all text files (.txt) in the same directory as the script. The load_and_split function then initiates the loading.

Even though we might only load one text document, it makes sense to split it in case we have a large file and to avoid a NotEnoughElementsException (a minimum of 4 documents is needed). More information can be found here.
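
To illustrate what splitting means conceptually, here is a simplified sketch that cuts a text into fixed-size, overlapping chunks. It is not LangChain's actual splitter (which works with separators like paragraphs and sentences). The function name split_text and the default sizes are assumptions chosen for this example.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Cut `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears in full in one of the chunks.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += chunk_size - overlap
    return chunks


sample = "word " * 500  # 2,500 characters of dummy text
print([len(chunk) for chunk in split_text(sample)])
```

Each chunk later becomes its own document, which is why a single large file can still yield enough documents for the vector store.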

Transcribe youtube videos with YoutubeLoader

LangChain comes with a YoutubeLoader module, which makes use of the youtube_transcript_api package. This module gathers the (generated) subtitles for a given video.

Not every video comes with its own subtitles. In these cases auto-generated subtitles are available. However, in some cases they are of poor quality. Using Whisper to transcribe the audio files could then be an alternative.

The code below takes the video id and a language (default: en) as parameters.

from langchain.document_loaders import YoutubeLoader

# "XYZ" is a placeholder for the id of the video to transcribe
loader = YoutubeLoader(video_id="XYZ", language="en")
docs = loader.load_and_split()

Before we proceed…

In case you decide to go with transcribed youtube videos, consider properly cleaning the text first, e.g., removing Latin1 characters such as the non-breaking space (\xa0). In the Question-Answering part, I experienced differences in the answers depending on which format of the same source I used.
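
One possible way to do such a cleaning is sketched below. The function name clean_transcript is made up, and the approach (Unicode NFKC normalization plus collapsing whitespace) is just one option. Note that NFKC also rewrites other compatibility characters (e.g., ligatures), so check whether that is acceptable for your text.

```python
import re
import unicodedata


def clean_transcript(text: str) -> str:
    """Replace non-breaking spaces and collapse whitespace in a transcript."""
    # NFKC normalization maps compatibility characters such as the
    # non-breaking space (\xa0) to their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


raw = "Hello\xa0world,\n  this is\xa0\xa0a test. "
print(clean_transcript(raw))
```

With LangChain documents, the same function could be applied to the page_content of each entry in docs before creating the embeddings.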

LLMs like GPT can only handle a certain number of tokens. These limitations are important when working with large(r) documents. In general, there are three ways of dealing with these limitations. One is to make use of embeddings or a vector space engine. A second way is to try out different chaining methods like map-reduce or refine. And a third one is a combination of both.

A great article that provides more details about the different chaining methods and the use of a vector space engine can be found here. Also keep in mind: The more tokens you use, the more you get charged.

In the following we combine embeddings with the chaining method stuff, which “stuffs” all documents into one single prompt.

First we ingest our transcript (docs) into a vector space by using OpenAIEmbeddings. The embeddings are then stored in an in-memory embeddings database called Chroma.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# Embed the document chunks and store them in an in-memory Chroma index
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(docs, embeddings)
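
If you wonder what searching in a vector space looks like, the toy example below ranks a few chunks by their similarity to a query. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-made (real embeddings have far more dimensions), and cosine similarity is used as a common choice, which is not necessarily the metric Chroma applies.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hand-made toy "embeddings" (hypothetical values, for illustration only).
toy_vectors = {
    "chunk about pricing": [0.9, 0.1, 0.0],
    "chunk about loaders": [0.1, 0.9, 0.1],
    "chunk about tokens": [0.7, 0.2, 0.3],
}
query_vector = [1.0, 0.0, 0.1]
print(top_k(query_vector, toy_vectors))
```

Only the best-matching chunks are later passed to the model, which is how a vector store helps to stay within the token limit.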

After that, we define the model_name we would like to use to analyze our data. In this case we choose gpt-3.5-turbo. A full list of available models can be found here. The temperature parameter defines the sampling temperature. Higher values result in more random outputs, while lower values make the answers more focused and deterministic.
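
The effect of the temperature can be illustrated with the textbook softmax-with-temperature formula. This is a sketch under assumptions: the scores are invented, and how OpenAI implements sampling internally is not covered here. Unlike the API, this toy function does not accept a temperature of 0.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With a low temperature nearly all probability goes to the top-scoring token, while a high temperature spreads it across the candidates.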

Last but not least we use the RetrievalQA (Question/Answer) retriever and set the respective parameters (llm, chain_type, retriever).

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=docsearch.as_retriever())

Now we are ready to ask the model questions about our documents. The code below shows how to define the query.

query = "What are the three most significant points within the text?"
qa.run(query)

What to do with incomplete answers?

In some cases you might experience incomplete answers. The answer text just stops after a few words.

The reason for an incomplete answer is most likely the token limitation. If the provided prompt is quite long, the model doesn’t have that many tokens left to give a (full) answer. One way of handling this could be to switch to a different chain type like refine.
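
The reasoning behind this can be written down as simple arithmetic: prompt and answer share one token budget. The sketch below assumes a context window of 4,096 tokens for gpt-3.5-turbo. The function name and the example numbers are made up.

```python
def remaining_completion_tokens(prompt_tokens: int, context_window: int = 4096) -> int:
    """Tokens left for the answer once the prompt has used its share.

    The default of 4096 is an assumed context window for gpt-3.5-turbo.
    """
    return max(context_window - prompt_tokens, 0)


# Hypothetical example: a "stuffed" prompt that uses 3,900 tokens
print(remaining_completion_tokens(3900))
```

If only a couple of hundred tokens remain, the answer gets cut off as soon as the budget is used up.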

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)
qa = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="refine",
                                 retriever=docsearch.as_retriever())
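
Conceptually, refine does not put all chunks into one prompt but walks through them one by one and updates the answer each time. The sketch below only illustrates that idea. It is not LangChain's implementation, the prompt texts are invented, and fake_llm is a stand-in so the example runs without an API key.

```python
from typing import Callable, List


def refine_answer(chunks: List[str], question: str, llm: Callable[[str], str]) -> str:
    """Answer from the first chunk, then refine the answer with each further chunk."""
    answer = ""
    for i, chunk in enumerate(chunks):
        if i == 0:
            prompt = f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:"
        else:
            prompt = (
                f"Question: {question}\n"
                f"Existing answer: {answer}\n"
                f"Refine the existing answer using this additional context:\n{chunk}\n"
                "Refined answer:"
            )
        answer = llm(prompt)  # one model call per chunk
    return answer


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical, for offline testing)."""
    return f"answer derived from a prompt of {len(prompt)} characters"


print(refine_answer(["first chunk", "second chunk"], "What is the text about?", fake_llm))
```

Each prompt stays small because it contains only one chunk plus the current answer, at the price of one model call per chunk.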

However, I experienced that when using a chain_type other than stuff, I get less concrete results. Another way of handling these issues is to rephrase the question and make it more concrete.