
AI-Driven Insights: Leveraging LangChain and Pinecone with GPT-4


Empowering Next-Gen Product Managers — Vol. 1

Working effectively with qualitative data is one of the most essential skills a product manager can have: collecting data, analyzing it, and communicating it efficiently by turning it into actionable, valuable insights.

You can get qualitative data from many places: user interviews, competitor feedback, or comments from people using your product. Depending on what you're trying to achieve, you may analyze this data immediately or set it aside for later. Sometimes you may only need a few user interviews to verify a hypothesis. Other times you may need feedback from a thousand users to spot trends or test ideas. So your approach to analyzing this data can change depending on the situation.

With Large Language Models like GPT-4, and AI tools such as LangChain and Pinecone, we can handle a wide range of situations and large amounts of data more effectively. In this guide, I'll share my experience with these tools. My goal is to show product managers, and anyone else who works with qualitative data, how to use these AI tools to get more useful insights from their data.

What will you find in this guide-style article?

  1. I'll start by introducing these AI tools and some current limitations of Large Language Models (LLMs).
  2. I'll discuss different ways you can benefit from these tools in real-life use cases.
  3. Using user feedback analysis as an example, I'll provide code snippets and examples to show you how these tools work in practice.

Please note: To use tools like GPT-4, LangChain, and Pinecone, you should be comfortable with data and have some basic coding skills. It's also essential to understand your customers and be able to turn data insights into real actions. Knowledge of AI and machine learning is a plus, but not a must.

Assuming you're already familiar with GPT-4, it's essential to know a few concepts before we discuss tools that work with LLMs. One major challenge with current LLMs like GPT-4 is their 'context window': how much information they can process and remember at one time.

Currently, there are two versions of GPT-4. The standard one has an 8k token context window, while the extended version has a 32k context window. To give you an idea, 32k tokens is about 24,000 words, roughly equivalent to 48 pages of text. But keep in mind, the 32k version isn't available to everyone, even if you have access to GPT-4.

OpenAI also recently announced a new ChatGPT model, gpt-3.5-turbo-16k, which offers 4 times the context length of gpt-3.5-turbo. For insight analysis, I would suggest working with GPT-4, as it has better reasoning capabilities than GPT-3.5. But you can experiment and see what works for your use case.

Why am I mentioning this?

When dealing with insight analysis, a big challenge comes up when you have a lot of data, or when you're interested in more than just one prompt. Let's say you have one user interview and you want to dig deeper and get more insights from it using GPT-4. In this case, you can just take the interview transcript and give it to ChatGPT, selecting GPT-4. You might have to split the text once, but that's it; you don't need any other fancy tools for this. You'll really need these fancy new tools when working with a lot of qualitative data. Let's understand what those tools are, then we'll move on to some specific use cases and examples.

So what’s LangChain?

LangChain is a framework built around LLMs that offers functionality like chatbots, Generative Question-Answering (GQA), and summarization. Its versatility lies in the ability to chain different components together, including prompt templates, LLMs, agents, and memory systems.

Prompt templates are pre-made prompts for various situations, while LLMs process and generate responses. Agents help make decisions based on the LLM’s output, and memory systems store information for later use.
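To make this concrete, here is a minimal sketch (not from the original project) of chaining two of these components, a prompt template and an LLM, using the classic LangChain API. The feedback string is hypothetical:

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.llms import OpenAI

# A pre-made prompt for one recurring situation: summarizing a piece of feedback
template = PromptTemplate(
    input_variables=["feedback"],
    template="Summarize the following user feedback in one sentence:\n\n{feedback}",
)

# Chain the prompt template together with an LLM
chain = LLMChain(llm=OpenAI(temperature=0), prompt=template)

# Hypothetical feedback string, just for illustration
print(chain.run(feedback="The meeting notes missed the action items and the summary was too long."))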

In this article, I'll showcase some of its capabilities in my examples.

High-level Overview of LangChain Modules

What’s Pinecone?

Pinecone.ai is a powerful tool designed to simplify the management of high-dimensional data representations known as vectors.

Vectors are particularly useful when dealing with a lot of text data, such as when you're trying to extract information from it. Consider a situation where you're analyzing feedback and want to identify various details about a product. This kind of deep insight gathering wouldn't be possible with just keyword searches like "great", "improve", or "i suggest", as you would miss out on a lot of context.

I won't delve into the technical details of text vectorization (which can be word-based, sentence-based, etc.). The key thing to understand is that words get converted into numbers by machine learning models, and these numbers are stored in arrays.

Let’s take an example:

The word "seafood" might be translated into a series of numbers like this: [1.2, -0.2, 7.0, 19.9, 3.1, …, 10.2].

When I look up another word, that word also gets transformed into a number series (or vector). If our machine learning model is doing its job properly, words that have a similar context to "seafood" should have a number series that is close to the series for "seafood". Here's an example:

"shrimp" might be translated as: [1.1, -0.3, 7.1, 19.8, 3.0, …, 10.5], where the numbers are close to those of "seafood".
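To make "close to" concrete, similarity between two vectors is usually measured with a metric such as cosine similarity. Here is a minimal sketch using only the five components shown above (the full vectors are much longer, of course):

import numpy as np

# The first five components of the illustrative vectors above
seafood = np.array([1.2, -0.2, 7.0, 19.9, 3.1])
shrimp = np.array([1.1, -0.3, 7.1, 19.8, 3.0])

# Cosine similarity: values close to 1.0 mean the vectors point in almost the same direction
similarity = np.dot(seafood, shrimp) / (np.linalg.norm(seafood) * np.linalg.norm(shrimp))
print(f"Cosine similarity: {similarity:.4f}")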

With Pinecone.ai, you can efficiently store and search these vectors, enabling quick and accurate similarity comparisons.

By using its capabilities, you can organize and index vectors derived from LLM models, opening the door to deeper insights and the discovery of meaningful patterns within extensive datasets.

In simpler words, Pinecone.ai lets you store the vector representations of your qualitative data in a convenient way. You can easily search through these vectors and apply LLM models to extract valuable insights from them. It simplifies the process of managing your data and deriving meaningful information from it.

Representation of Vector Databases

When would you really need tools like LangChain and Pinecone?

Short answer: when you're working with a lot of qualitative data.

Let me share some use cases from my experience to give you an idea:

  • Imagine you have hundreds of written feedback entries from your product channels. You want to identify patterns in the data and track how the feedback has evolved over time.
  • Suppose you have reviews in different languages and you want to translate them into your chosen language, and then extract insights.
  • You aim to conduct competitive analysis by analyzing customer reviews, feedback, and sentiment regarding your competitors' products.
  • Your company conducts surveys or user studies, generating a significant volume of qualitative responses. Your goals are to extract meaningful insights, uncover trends, and inform product or service improvements.

These are just a few examples of situations where tools like LangChain and Pinecone can be invaluable for product managers working with qualitative data.

As a product manager, my job involves improving our meeting notes and transcription features. To do that, we listen to what our users say about them.

For our meeting notes feature, users give us a rating between 1 and 5 for quality, tell us which template they used, and also send us their comments. Here is the flow:

In this project, I looked closely at two things: what users said about our feature and which templates they used. I ended up dealing with a huge amount of data: over 20,000 words, which was more than 38,000 "tokens" (or pieces of information) when I used a special tool to break it down. That's so much data that it's more than what some advanced models can handle all at once!
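If you want to check how big your own dataset is in tokens, a quick way (one option, not necessarily the exact tool used here) is OpenAI's tiktoken library. The file name below is a hypothetical placeholder:

import tiktoken

# Load the tokenizer that GPT-4 uses
encoding = tiktoken.encoding_for_model("gpt-4")

# Hypothetical file containing all the collected feedback
with open("all_feedback.txt") as f:
    text = f.read()

tokens = encoding.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")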

To help me analyze this extensive data, I turned to two advanced tools: LangChain and Pinecone, supplemented with GPT-4. With these in our arsenal, let's delve deeper into the project and see what these high-tech tools enabled us to do.

This project’s primary objective was extracting insights from the gathered data, which required:

  1. The ability to create specific queries related to our dataset.
  2. The use of LLMs to handle vast volumes of information.

First, I'll give you an overview of how I carried out the project. After that, I'll share some examples of the code I used.

We start with a collection of text files. Each file contains user feedback paired with the name of the template they used. You can process this data to suit your needs; I had to do some post-processing for my project. Remember, your files and data might be different, so feel free to adjust things for your own project.

Let's say you want to understand users' feedback on meeting notes structure:

query = ("Please list all feedback regarding sentence structures in a table "
         "in markdown and give a single insight for each one, and give a general summary for all.")

Here's a high-level diagram showcasing the process flow when using an LLM and Pinecone. You ask GPT-4 a question, or what we call a 'query'. Meanwhile Pinecone, our library filled with all the feedback, provides the context for your query when you send the query itself to it ("embed query"). Together, they help us make sense of our data efficiently:

Below is a more simplified version of the diagram:

Let's do it! In this script, we set up a pipeline to analyze user feedback data using OpenAI's GPT-4, Pinecone, and LangChain. Essentially, it imports the necessary libraries, sets the path to the feedback data, and sets the OpenAI API key for processing this data.

import os
import openai
import pinecone
import certifi
import nltk
from tqdm.autonotebook import tqdm
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# Path to the directory with the text files containing feedback
directory = 'path to your directory with text files, containing feedback'

# Make the key available to the OpenAI and LangChain clients via the environment
OPENAI_API_KEY = "your key"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Then we define and call a function load_docs() that loads the user feedback documents from the specified directory using LangChain's DirectoryLoader, and count the total number of loaded documents.

def load_docs(directory):
    # Load every text file in the directory as a separate document
    loader = DirectoryLoader(directory)
    documents = loader.load()
    return documents

documents = load_docs(directory)
len(documents)

Next, we define and execute the split_docs() function, which divides the loaded documents into smaller chunks of a chosen size and overlap using LangChain's RecursiveCharacterTextSplitter, and then print the total number of resulting chunks.

def split_docs(documents, chunk_size=500, chunk_overlap=20):
    # Split each document into overlapping chunks that fit comfortably in the model's context
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs

docs = split_docs(documents)
print(len(docs))

To work with Pinecone, which is essentially a vector database, we need to get embeddings for our docs, so we should introduce a function for that. There are many ways to do it, but let's go with OpenAI's embedding function:

# Assuming OpenAIEmbeddings class is imported above
embeddings = OpenAIEmbeddings()

# Let's define a function to generate an embedding for a given query
def generate_embedding(query):
    query_result = embeddings.embed_query(query)
    print(f"Embedding length for the query is: {len(query_result)}")
    return query_result

To store those vectors in Pinecone, you need to create an account there and create an index as well. That's quite straightforward to do. Then you will get an API key, an environment name, and the index name from there.
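If you'd rather create the index from code instead of the Pinecone console, a sketch with the classic pinecone client looks like this (the key, environment, and index name are placeholders; 1536 matches the dimension of OpenAI's text-embedding-ada-002 vectors):

pinecone.init(api_key="the_key", environment="the_environment")

# 1536-dimensional index with cosine similarity, matching the OpenAI embeddings
pinecone.create_index("your_index_name", dimension=1536, metric="cosine")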

MY_API_KEY_p = "the_key"
MY_ENV_p = "the_environment"

# Initialize the Pinecone client
pinecone.init(
    api_key=MY_API_KEY_p,
    environment=MY_ENV_p
)

index_name = "your_index_name"

# Embed the chunks and upload them into the Pinecone index
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)

The next step is to be able to find answers. It's like finding the points closest to your query in a field of possible answers, giving us the most relevant results.

def get_similar_docs(query, k=40, score=False):
    # Retrieve the k chunks most similar to the query, optionally with similarity scores
    if score:
        similar_docs = index.similarity_search_with_score(query, k=k)
    else:
        similar_docs = index.similarity_search(query, k=k)
    return similar_docs

In this code, we set up a question-answering system using OpenAI's GPT-4 model and LangChain. The get_answer() function takes a question as input, finds similar documents, and uses the question-answering chain to generate an answer.

from langchain.chat_models import ChatOpenAI

model_name = "gpt-4"

# gpt-4 is a chat model, so we use ChatOpenAI rather than the completion-style OpenAI class
llm = ChatOpenAI(model_name=model_name, temperature=0)

# The "stuff" chain type simply stuffs all retrieved documents into a single prompt
chain = load_qa_chain(llm, chain_type="stuff")

def get_answer(query):
    similar_docs = get_similar_docs(query)
    answer = chain.run(input_documents=similar_docs, question=query)
    return answer

We got to the question! Or questions. You can ask as many questions as you want.

query = ("Please list all feedback regarding sentence structures in a table "
         "in markdown and give a single insight for each one, and give a general summary for all.")

answer = get_answer(query)
print(answer)

Implementing Retrieval Q&A Chain:

To implement the retrieval question-answering system, we use the RetrievalQA class from LangChain. It uses an OpenAI LLM to answer questions and relies on a "stuff" chain type. The retriever is connected to the previously created index, and the chain is stored in the 'qa_stuff' variable. For a better understanding, you can learn more about retrieval techniques.

from langchain.chains import RetrievalQA

# Expose the Pinecone vector store as a retriever
retriever = index.as_retriever()

qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

response = qa_stuff.run(query)

So we got the response; let's present the content stored in the response variable in a visually appealing format using Markdown. It makes the displayed text more organized and easier to read.

from IPython.display import display, Markdown

display(Markdown(response))

Example output

Go ahead and experiment with both input files and queries to get the best out of this approach and these tools.

In short, GPT-4, LangChain, and Pinecone make it easy to handle big chunks of qualitative data. They help us dig into this data and find valuable insights, guiding better decisions. This article gave a sneak peek into their use, but there's a lot more they can do.

As these tools continue to advance and become more common, learning to use them now will give you a significant advantage in the future. So keep exploring and learning about these tools, because they're shaping the present and the future of data analysis.

Stay tuned for more ways to explore these handy tools in the future!

All images, unless otherwise noted, are by the author.

References

LangChain Documentation

Pinecone Documentation

LangChain for LLM Application Development short course
