Topic Modelling in production

In the previous article, we discussed how to do Topic Modelling using ChatGPT and got excellent results. The task was to look at customer reviews for hotel chains and define the main topics mentioned in the reviews.

In that previous iteration, we used the standard ChatGPT completions API and sent raw prompts ourselves. Such an approach works well when we are doing ad-hoc analytical research.

However, if your team is actively using and monitoring customer reviews, it's worth considering some automation. Good automation will not only help you build an autonomous pipeline, but it will also be more convenient (even team members unfamiliar with LLMs and coding will be able to access this data) and cheaper (you'll send all texts to the LLM and pay only once).

Suppose we're building a sustainable, production-ready service. In that case, it's worth leveraging existing frameworks to reduce the amount of glue code and have a more modular solution (so that we could easily switch, for example, from one LLM to another).

In this article, I would like to tell you about one of the most popular frameworks for LLM applications, LangChain. Also, we'll look in detail at how to evaluate your model's performance, since it's an essential step for business applications.

Revising the initial approach

First, let’s revise our previous approach for ad-hoc Topic Modelling with ChatGPT.

Step 1: Get a representative sample.

We want to determine the list of topics we'll use for our markup. The most straightforward way is to send all the reviews and ask the LLM to define a list of 20–30 topics mentioned in them. Unfortunately, we won't be able to do that because the reviews won't fit the context size. We could use a map-reduce approach, but it could be costly. That's why we would like to define a representative sample.

For this, we built a BERTopic topic model and got the most representative reviews for each topic.
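Roughly, that step looks like the sketch below (the model parameters are illustrative, and reviews is assumed to be the list of raw review texts):

from bertopic import BERTopic

# fit a topic model on the raw review texts
topic_model = BERTopic(nr_topics=30, verbose=True)
topics, probs = topic_model.fit_transform(reviews)

# the most representative reviews for each discovered topic
representative_docs = topic_model.get_representative_docs()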

Step 2: Determine the list of topics we’ll use for markup.

The next step is to pass all the selected texts to ChatGPT and ask it to define a list of topics mentioned in these reviews. Then, we can use these topics for later markup.

Step 3: Do the topic markup in batches.

The last step is the most straightforward: we can send customer reviews in batches that fit the context size and ask the LLM to return topics for each customer review.

Finally, with these three steps, we could determine the list of relevant topics for our texts and classify all of them.

It works perfectly for one-time research. However, we're missing some bits for an excellent production-ready solution.

From ad-hoc to production

Let’s discuss what improvements we could make to our initial ad-hoc approach.

  • In the previous approach, we have a static list of topics. But in real-life scenarios, new topics might arise over time, for example, if you launch a new feature. So, we need a feedback loop to update the list of topics we're using. The simplest way to do it is to capture the reviews without any assigned topics and periodically run topic modelling on them (see the sketch after this list).
  • If we're doing one-time research, we can validate the results of the topic assignments manually. But for a process that's running in production, we need to think about continuous evaluation.
  • If we're building a pipeline for customer review analysis, we should consider more potential use cases and store other related information we might need. For example, it's helpful to store translated versions of customer reviews so that our colleagues don't have to use Google Translate all the time. Also, sentiment and other features (for example, products mentioned in the customer review) can be helpful for analysis and filtering.
  • The LLM industry is progressing quite quickly right now, and everything is changing all the time. It's worth considering a modular approach where we can quickly iterate and try new approaches over time without rewriting the whole service from scratch.
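A minimal sketch of that feedback loop, assuming the markup results are stored as a list of dicts with 'review' and 'topics' fields (the names here are illustrative):

from bertopic import BERTopic

# reviews that the LLM couldn't map to any of the current topics
unassigned_reviews = [
    rec['review'] for rec in markup_results
    if len(rec['topics']) == 0
]

# periodically re-run topic modelling on them to discover candidate topics
new_topic_model = BERTopic(verbose=True)
new_topic_model.fit_transform(unassigned_reviews)
candidate_topics = new_topic_model.get_topic_info()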
Scheme of the service by the author

We’ve got a number of ideas on what to do with our topic modelling service. But let’s give attention to the important parts: modular approach as a substitute of API calls and evaluation. The LangChain framework will help us with each topics, so let’s learn more about it.

LangChain is a framework for constructing applications powered by Language Models. Listed here are the important components of LangChain:

  • Schema covers the most basic classes like Documents, Chat Messages and Texts.
  • Models. LangChain provides access to LLMs, Chat Models and Text Embedding models that you could easily use in your applications and switch between if needed. It goes without saying that it supports popular models like ChatGPT, Anthropic and Llama.
  • Prompts is functionality that helps you work with prompts, including prompt templates, output parsers and example selectors for few-shot prompting.
  • Chains are the core of LangChain (as you might guess from the name). Chains help you build a sequence of blocks that will be executed. You can truly appreciate this functionality if you're building a complex application.
  • Indexes: document loaders, text splitters, vector stores and retrievers. This module provides tools that help LLMs interact with your documents. This functionality would be helpful if you're building a Q&A use case. We won't be using it much in our example today.
  • Memory. LangChain provides a whole set of methods to manage and limit memory. This functionality is primarily needed for ChatBot scenarios.
  • One of the most recent and most powerful features is agents. If you are a heavy ChatGPT user, you may have heard about plugins. It's the same idea: you can empower an LLM with a set of custom or predefined tools (like Google Search or Wikipedia), and then the agent can use them while answering your questions. In this setup, the LLM acts as a reasoning agent and decides what it needs to do to achieve the result and when it has the final answer to share. It's exciting functionality, so it's definitely worth a separate discussion.

So, LangChain can help us build modular applications and switch between different components (for example, from ChatGPT to Anthropic, or from CSV as data input to Snowflake DB). LangChain has more than 190 integrations, so it can save you quite a lot of time.

Also, we could reuse ready-made chains for some use cases instead of starting from scratch.

When calling the ChatGPT API manually, we have to manage quite a lot of Python glue code to make it work. It's not a problem when you're working on a small, straightforward task, but it might become unmanageable when you need to build something more complex and convoluted. In such cases, LangChain may help you eliminate this glue code and create more maintainable, modular code.

However, LangChain has its own limitations:

  • It’s primarily focused on OpenAI models, so it may not work so easily with on-premise open-source models.
  • The flip side of convenience is that it’s hard to grasp what’s happening under the hood and when and the way the ChatGPT API you’re paying for is executed. You should utilize debug mode, but it is advisable specify it and undergo the whole logs for a clearer view.
  • Despite pretty good documentation, I struggle sometimes to seek out answers to my questions. There aren’t so many other tutorials and resources on the web aside from the official documentation, quite ceaselessly you may see only official pages in Google.
  • The Langchain library is progressing loads, and the team always ship latest features. So, the library just isn’t mature, and you may have to change from the functionality you’re using. For instance, the SequentialChain class is taken into account legacy now and may be deprecated in the longer term since they’ve introduced LCEL — we’ll discuss it in additional detail afterward.

We’ve gotten a birds-eye overview of LangChain functionality, but practice makes perfect. Let’s start using LangChain.

Let’s refactor the subject project since it’ll be essentially the most common operation in our regular process, and it’ll help us understand find out how to use LangChain in practice.

To start with, we want to put in the package.

!pip install --upgrade langchain

Loading documents

To work with the customers' reviews, we first need to load them. For that, we could use Document Loaders. In our case, customer reviews are stored as a set of .txt files in a directory, but you can effortlessly load docs from third-party tools. For example, there's an integration with Snowflake.

We'll use DirectoryLoader to load all the files in the directory since we have separate files per hotel. For each file, we'll specify TextLoader as the loader (by default, a loader for unstructured documents is used). Our files are encoded in ISO-8859-1, so the default call returns an error. However, LangChain can automatically detect the encoding used. With such a setup, it works okay.

from langchain.document_loaders import TextLoader, DirectoryLoader

text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader('./hotels/london', show_progress=True,
                         loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)

docs = loader.load()
len(docs)
82

Splitting documents

Now, we would like to split our documents. We know that each file consists of a set of customer comments delimited by \n. Since our case is very straightforward, we'll use the most basic CharacterTextSplitter that splits documents by character. When working with real documents (whole long texts instead of independent short comments), it's better to use the recursive split by character since it allows you to split documents into chunks more intelligently.
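For reference, a minimal sketch of that recursive splitter (the chunk sizes here are arbitrary):

from langchain.text_splitter import RecursiveCharacterTextSplitter

# tries "\n\n", then "\n", then " ", then "" until the chunks fit the size limit
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)
# recursive_split_docs = recursive_splitter.split_documents(docs)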

However, LangChain is better suited to fuzzy text splitting. So, I had to hack it a bit to make it work the way I wanted.

How it works:

  • You specify chunk_size and chunk_overlap, and it tries to make the minimal number of splits so that each chunk is smaller than chunk_size. If it fails to create a small enough chunk, it prints a message to the Jupyter Notebook output.
  • If you specify too big a chunk_size, not all comments will be separated.
  • If you specify too small a chunk_size, you will have print statements for every comment in your output, leading to the Notebook reloading. Unfortunately, I couldn't find any parameter to switch it off.

To overcome this problem, I specified length_function as a constant equal to chunk_size. Then I got just a standard split by character. LangChain provides enough flexibility to do what you want, but only in a somewhat hacky way.

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1,
    chunk_overlap = 0,
    length_function = lambda x: 1, # usually len is used
    is_separator_regex = False
)
split_docs = text_splitter.split_documents(docs)
len(split_docs)
12890

Also, let’s add the document ID to the metadata — we’ll use it later.

for i in range(len(split_docs)):
    split_docs[i].metadata['id'] = i

The advantage of using Documents is that we now have automatic data sources and can filter data by them. For example, we can filter only the comments related to Travelodge Hotel.

list(filter(
    lambda x: 'travelodge' in x.metadata['source'],
    split_docs
))

Next, we need a model. As we discussed earlier, in LangChain there are LLMs and Chat Models. The main difference is that LLMs take texts and return texts, while Chat Models are more suitable for conversational use cases and can take a set of messages as input. In our case, we'll use the Chat Model for OpenAI since we would like to pass system messages as well.

from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(temperature=0.0, model="gpt-3.5-turbo",
                  openai_api_key="your_key")

Prompts

Let’s move on to crucial part — our prompt. In LangChain, there’s an idea of Prompt Templates. They assist to reuse prompts parametrised by variables. It’s helpful since, in real-life applications, prompts may be very detailed and complicated. So, prompt templates is usually a useful high-level abstraction that may provide help to to administer your code effectively.

Since we’re going to use the Chat Model, we’ll need ChatPromptTemplate.

But before jumping into prompts, let's briefly discuss a helpful feature: output parsers. Surprisingly, they can help us create an effective prompt. We can define the desired output, generate an output parser and then use the parser to create instructions for the prompt.

Let's define what we would like to see in the output. First, we would like to be able to pass a list of customer reviews to the prompt and process them in batches, so in the result, we would like to get a list with the following parameters:

  • id to identify documents,
  • list of topics from the predefined list (we will be using the list from our previous iteration),
  • sentiment (negative, neutral or positive).

Let's specify our output parser. Since we need a fairly complex JSON structure, we'll use the Pydantic Output Parser instead of the most commonly used Structured Output Parser.

For that, we need to create a class inherited from BaseModel and specify all the fields we need, with names and descriptions (so that the LLM can understand what we expect in the response).

from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field
from typing import List

class CustomerCommentData(BaseModel):
    doc_id: int = Field(description="doc_id from the input parameters")
    topics: List[str] = Field(description="List of the relevant topics for the customer review. Please, include only topics from the provided list.")
    sentiment: str = Field(description="sentiment of the comment (positive, neutral or negative)")

output_parser = PydanticOutputParser(pydantic_object=CustomerCommentData)

Now, we can use this parser to generate formatting instructions for our prompt. That's a fantastic case where you can rely on prompting best practices and spend less time on prompt engineering.

format_instructions = output_parser.get_format_instructions()
print(format_instructions)

Then, it’s time to maneuver on to our prompt. We took a batch of comments and formatted them into the expected format. Then, we created a prompt message with a bunch of variables: topics_descr_list, format_instructions and input_data. After that, we created chat prompt messages consisting of a relentless system message and a prompt message. The last step is to format chat prompt messages with actual values.

from langchain.prompts import ChatPromptTemplate

docs_batch_data = []
for rec in docs_batch:
    docs_batch_data.append(
        {
            'id': rec.metadata['id'],
            'review': rec.page_content
        }
    )

topic_assignment_msg = '''
Below is a list of customer reviews in JSON format with the following keys:
1. doc_id - identifier for the review
2. review - text of customer review
Please, analyse the provided reviews and identify the main topics and sentiment. Include only topics from the list provided below.

List of topics with descriptions (delimited with ":"):
{topics_descr_list}

Output format:
{format_instructions}

Customer reviews:
```
{input_data}
```
'''

topic_assignment_template = ChatPromptTemplate.from_messages([
    ("system", "You're a helpful assistant. Your task is to analyse hotel reviews."),
    ("human", topic_assignment_msg)
])

topics_list = '\n'.join(
    map(lambda x: '%s: %s' % (x['topic_name'], x['topic_description']),
        topics))

messages = topic_assignment_template.format_messages(
    topics_descr_list = topics_list,
    format_instructions = format_instructions,
    input_data = json.dumps(docs_batch_data)
)

Now, we can pass these formatted messages to the LLM and see the response.

response = chat(messages)
type(response.content)
str

print(response.content)

We got the response as a string object, but we could leverage our parser and get a list of CustomerCommentData class objects as a result.

response_dict = list(map(lambda x: output_parser.parse(x),
    response.content.split('\n')))
response_dict
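In production, the model will occasionally return JSON that doesn't quite match the schema. One option (a sketch, not part of the original pipeline) is to wrap our parser in LangChain's OutputFixingParser, which asks the LLM to repair malformed output:

from langchain.output_parsers import OutputFixingParser

# falls back to the LLM to fix outputs that output_parser can't parse
fixing_parser = OutputFixingParser.from_llm(parser=output_parser, llm=chat)
# fixed_object = fixing_parser.parse(malformed_response_text)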

So, we’ve leveraged LangChain and a few of its features and have already built a bit smarter solution that would assign topics to the comments in batches (it could save us some costs) and began to define not only topics but additionally sentiment.

Up to now, we’ve built only single LLM calls with none relations and sequencing. Nonetheless, in real life, we frequently need to split our tasks into multiple steps. For that, we are able to use Chains. Chain is the basic constructing block for LangChain.

LLMChain

The most basic type of chain is an LLMChain. It's a combination of an LLM and a prompt.

So we can rewrite our logic as a chain. This code will give us exactly the same result as before, but it's pretty convenient to have one method that defines it all.

from langchain.chains import LLMChain

topic_assignment_chain = LLMChain(llm=chat, prompt=topic_assignment_template)
response = topic_assignment_chain.run(
    topics_descr_list = topics_list,
    format_instructions = format_instructions,
    input_data = json.dumps(docs_batch_data)
)

Sequential Chains

The LLM chain is very basic. The power of chains is in building more complex logic. Let's try to create something more advanced.

The idea of sequential chains is to use the output of one chain as the input for another.

For defining chains, we will be using LCEL (LangChain Expression Language). This new language was introduced just a few months ago, and now all the old approaches with SimpleSequentialChain or SequentialChain are considered legacy. So, it's worth spending some time understanding the LCEL concept.

Let’s rewrite the previous chain in LCEL.

chain = topic_assignment_template | chat
response = chain.invoke(
    {
        'topics_descr_list': topics_list,
        'format_instructions': format_instructions,
        'input_data': json.dumps(docs_batch_data)
    }
)

If you want to learn about it first-hand, I suggest you watch this video about LCEL from the LangChain team.

Using sequential chains

In some cases, it might be helpful to have several sequential calls so that the output of one chain is used in the other ones.

In our case, we can first translate reviews into English and then do topic modelling and sentiment analysis.

from langchain.schema import StrOutputParser
from operator import itemgetter

# translation

translate_msg = '''
Below is a list of customer reviews in JSON format with the following keys:
1. doc_id - identifier for the review
2. review - text of customer review

Please, translate the reviews into English and return the same JSON back. Please, return in the output ONLY valid JSON without any other information.

Customer reviews:
```
{input_data}
```
'''

translate_template = ChatPromptTemplate.from_messages([
    ("system", "You're an API, so you return only valid JSON without any comments."),
    ("human", translate_msg)
])

# topic project & sentiment evaluation

topic_assignment_msg = '''
Below is a listing of customer reviews in JSON format with the next keys:
1. doc_id - identifier for the review
2. review - text of customer review
Please, analyse provided reviews and discover the important topics and sentiment. Include only topics from the provided below list.

List of topics with descriptions (delimited with ":"):
{topics_descr_list}

Output format:
{format_instructions}

Customer reviews:
```
{translated_data}
```
'''

topic_assignment_template = ChatPromptTemplate.from_messages([
("system", "You're a helpful assistant. Your task is to analyse hotel reviews."),
("human", topic_assignment_msg)
])

# defining chains

translate_chain = translate_template | chat | StrOutputParser()
topic_assignment_chain = {'topics_descr_list': itemgetter('topics_descr_list'),
'translated_data': translate_chain,
'format_instructions': itemgetter('format_instructions')}
| topic_assignment_template | chat

# execution

response = topic_assignment_chain.invoke(
    {
        'topics_descr_list': topics_list,
        'format_instructions': format_instructions,
        'input_data': json.dumps(docs_batch_data)
    }
)

We similarly defined prompt templates for translation and topic assignment. Then, we defined the translation chain. The only new thing here is the use of StrOutputParser(), which converts response objects into strings (no rocket science).

Then, we defined the full chain, specifying the input parameters, prompt template and LLM. For the input parameters, we take translated_data from the output of translate_chain, while the other parameters come from the invoke input via the itemgetter function.

However, in our case, such an approach with a combined chain may not be so convenient since we would also like to save the output of the first chain to keep the translated values.
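One way around this (a sketch rather than the approach used above, assuming a recent LangChain version) is to carry the intermediate translation through with RunnablePassthrough.assign, so that the final output contains both the translation and the topic assignment:

from langchain.schema.runnable import RunnablePassthrough

# keep the original inputs, add the translation, then add the topic assignment on top
chain_with_translation = RunnablePassthrough.assign(
    translated_data=translate_chain
) | RunnablePassthrough.assign(
    topics=topic_assignment_template | chat | StrOutputParser()
)
# the result of .invoke() then contains the inputs plus 'translated_data' and 'topics'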

With chains, everything becomes a bit more convoluted, so we might need some debugging capabilities. There are two options for debugging.
The first one is that you can switch on debug mode locally.

import langchain
langchain.debug = True

The other option is to use the LangChain platform, LangSmith. However, it's still in beta-tester mode, so you might have to wait to get access.
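Once you do have access, tracing is typically switched on through environment variables. This is only a sketch: the variable names reflect the current LangSmith setup, and the project name is arbitrary.

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your_langsmith_key"
os.environ["LANGCHAIN_PROJECT"] = "topic-modelling"  # arbitrary project name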

Routing

One of the more complex uses of chains is routing, when you use different prompts for different use cases. For example, we could save different customer review parameters depending on the sentiment:

  • If the comment is negative, we'll store the list of problems mentioned by the customer.
  • Otherwise, we'll get the list of good points from the review.

To use a routing chain, we'll have to pass comments one by one instead of batching them as we did before.

So, at a high level, our chain will look like this: the sentiment chain runs first, and its output routes the review to either the negative or the positive topic assignment chain.

First, we need to define the main chain that determines the sentiment. This chain consists of a prompt, the LLM and the already familiar StrOutputParser().

sentiment_msg = '''
Given the customer comment below, please classify whether it is negative. If it is negative, return "negative", otherwise return "positive".
Don't respond with more than one word.

Customer comment:
```
{input_data}
```
'''

sentiment_template = ChatPromptTemplate.from_messages([
    ("system", "You're an assistant. Your task is to markup sentiment for hotel reviews."),
    ("human", sentiment_msg)
])

sentiment_chain = sentiment_template | chat | StrOutputParser()

For positive reviews, we’ll ask the model to extract good points, while for negative ones — problems. So, we’ll need two different chains.
We’ll use the identical Pydantic output parsers as before to specify the intended output format and generate instructions.

We used partial_variables on top of the final topic project prompt message to specify different format instructions for positive and negative cases.

from langchain.prompts import PromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate, SystemMessagePromptTemplate

# defining structure for positive and negative cases
class PositiveCustomerCommentData(BaseModel):
    topics: List[str] = Field(description="List of the relevant topics for the customer review. Please, include only topics from the provided list.")
    benefits: List[str] = Field(description="List the good points that the customer mentioned.")
    sentiment: str = Field(description="sentiment of the comment (positive, neutral or negative)")

class NegativeCustomerCommentData(BaseModel):
    topics: List[str] = Field(description="List of the relevant topics for the customer review. Please, include only topics from the provided list.")
    problems: List[str] = Field(description="List the problems that the customer mentioned.")
    sentiment: str = Field(description="sentiment of the comment (positive, neutral or negative)")

# defining output parsers and generating instructions
positive_output_parser = PydanticOutputParser(pydantic_object=PositiveCustomerCommentData)
positive_format_instructions = positive_output_parser.get_format_instructions()

negative_output_parser = PydanticOutputParser(pydantic_object=NegativeCustomerCommentData)
negative_format_instructions = negative_output_parser.get_format_instructions()

general_topic_assignment_msg = '''
Below is a customer review delimited by ```.
Please, analyse the provided review and identify the main topics and sentiment. Include only topics from the list provided below.

List of topics with descriptions (delimited with ":"):
{topics_descr_list}

Output format:
{format_instructions}

Customer review:
```
{input_data}
```
'''

# defining prompt templates

positive_topic_assignment_template = ChatPromptTemplate(
    messages=[
        SystemMessagePromptTemplate.from_template("You're a helpful assistant. Your task is to analyse hotel reviews."),
        HumanMessagePromptTemplate.from_template(general_topic_assignment_msg)
    ],
    input_variables=["topics_descr_list", "input_data"],
    partial_variables={"format_instructions": positive_format_instructions}
)

negative_topic_assignment_template = ChatPromptTemplate(
    messages=[
        SystemMessagePromptTemplate.from_template("You're a helpful assistant. Your task is to analyse hotel reviews."),
        HumanMessagePromptTemplate.from_template(general_topic_assignment_msg)
    ],
    input_variables=["topics_descr_list", "input_data"],
    partial_variables={"format_instructions": negative_format_instructions}
)
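The branch below refers to positive_chain and negative_chain, which are simply these templates combined with the chat model:

positive_chain = positive_topic_assignment_template | chat
negative_chain = negative_topic_assignment_template | chat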

So, now we just need to build the full chain. The main logic is defined using RunnableBranch and a condition based on the sentiment, which is an output of sentiment_chain.

from langchain.schema.runnable import RunnableBranch

branch = RunnableBranch(
    (lambda x: "negative" in x["sentiment"].lower(), negative_chain),
    positive_chain
)

full_route_chain = {
    "sentiment": sentiment_chain,
    "input_data": lambda x: x["input_data"],
    "topics_descr_list": lambda x: x["topics_descr_list"]
} | branch

full_route_chain.invoke({'input_data': review,
                         'topics_descr_list': topics_list})

Here are a couple of examples. It works pretty well and returns different objects depending on the sentiment.

We've looked in detail at a modular approach to doing Topic Modelling using LangChain and at introducing more complex logic. Now, it's time to move on to the second part and discuss how we can assess the model's performance.

The crucial part of any system running in production is evaluation. When we have an LLM model running in production, we want to ensure its quality and monitor it over time.

In many cases, you could use not only human-in-the-loop (where people check the model results for a small sample over time to monitor performance) but also leverage an LLM for this task. It could be a good idea to use a more capable model for runtime checks. For example, we used ChatGPT 3.5 for our topic assignments, but we could use GPT-4 for evaluation (similar to supervision in real life, when you ask more senior colleagues for a code review).

LangChain can help us with this task as well since it provides some tools to evaluate results:

  • String Evaluators help to evaluate results from your model. There is quite a broad set of tools, from validating the format to assessing correctness based on a provided context or reference. We'll discuss these methods in detail below.
  • The other class of evaluators is Comparison evaluators. They would be handy if you want to assess the performance of two different LLM models (an A/B testing use case). We won't go into their details today.

Exact match

The most straightforward approach is to compare the model's output to the correct answer (i.e. from experts or a training set) using an exact match. For that, we could use ExactMatchStringEvaluator, for example, to assess the performance of our sentiment analysis. In this case, we don't need LLMs.

from langchain.evaluation import ExactMatchStringEvaluator

evaluator = ExactMatchStringEvaluator(
    ignore_case=True,
    ignore_numbers=True,
    ignore_punctuation=True,
)

evaluator.evaluate_strings(
    prediction="positive.",
    reference="Positive"
)

# {'score': 1}

evaluator.evaluate_strings(
    prediction="negative",
    reference="Positive"
)

# {'score': 0}

You’ll be able to construct your personal custom String Evaluator or match output to an everyday expression.

Also, there are helpful tools to validate structured output: whether the output is valid JSON, has the expected structure and is close to the reference by distance. You can find more details about them in the documentation.
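For instance, a minimal sketch of the JSON validity check (assuming the json_validity evaluator available in recent LangChain versions):

from langchain.evaluation import load_evaluator

json_evaluator = load_evaluator("json_validity")
json_evaluator.evaluate_strings(
    prediction='{"doc_id": 1, "topics": ["room quality"], "sentiment": "positive"}'
)
# expected to return {'score': 1} for valid JSON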

Embeddings distance evaluation

The other handy approach is to look at the distance between embeddings. You will get a score as a result: the lower the score, the better, since the answers are closer to each other. For example, we can compare the extracted good points by Euclidean distance.

from langchain.evaluation import load_evaluator
from langchain.evaluation import EmbeddingDistance

evaluator = load_evaluator(
    "embedding_distance", distance_metric=EmbeddingDistance.EUCLIDEAN
)

evaluator.evaluate_strings(
    prediction="well designed rooms, clean, great location",
    reference="well designed rooms, clean, great location, good atmosphere"
)

{'score': 0.20732719121627757}

We got a distance of 0.2. However, the results of such an evaluation can be harder to interpret since you will need to look at your data distributions and define thresholds. Let's move on to approaches based on LLMs since we'll be able to interpret their results effortlessly.

Criteria evaluation

You can use LangChain to validate the LLM's answer against some rubric or criteria. There's a list of predefined criteria, or you can create a custom one.

from langchain.evaluation import Criteria
list(Criteria)

[<Criteria.CONCISENESS: 'conciseness'>,
 <Criteria.RELEVANCE: 'relevance'>,
 <Criteria.CORRECTNESS: 'correctness'>,
 <Criteria.COHERENCE: 'coherence'>,
 <Criteria.HARMFULNESS: 'harmfulness'>,
 <Criteria.MALICIOUSNESS: 'maliciousness'>,
 <Criteria.HELPFULNESS: 'helpfulness'>,
 <Criteria.CONTROVERSIALITY: 'controversiality'>,
 <Criteria.MISOGYNY: 'misogyny'>,
 <Criteria.CRIMINALITY: 'criminality'>,
 <Criteria.INSENSITIVITY: 'insensitivity'>,
 <Criteria.DEPTH: 'depth'>,
 <Criteria.CREATIVITY: 'creativity'>,
 <Criteria.DETAIL: 'detail'>]

A few of them don’t require reference (for instance, harmfulness or conciseness). But for correctness, it is advisable know the reply.
Let’s try to make use of it for our data.

evaluator = load_evaluator("criteria", criteria="conciseness")
eval_result = evaluator.evaluate_strings(
    prediction="well designed rooms, clean, great location",
    input="List the good points that the customer mentioned",
)

As a result, we got the answer (whether the result fits the specified criterion) along with chain-of-thought reasoning, so that we can understand the logic behind the result and potentially tweak the prompt.

If you're curious about how it works, you can switch on langchain.debug = True and see the prompt sent to the LLM.

Let's take a look at the correctness criterion. To assess it, we need to provide a reference (the correct answer).

evaluator = load_evaluator("labeled_criteria", criteria="correctness")

eval_result = evaluator.evaluate_strings(
    prediction="well designed rooms, clean, great location",
    input="List the good points that the customer mentioned",
    reference="well designed rooms, clean, great location, good atmosphere",
)

You’ll be able to even create your personal custom criteria, for instance, whether multiple points are mentioned in the reply.

custom_criterion = {"multiple": "Does the output contain multiple points?"}

evaluator = load_evaluator("criteria", criteria=custom_criterion)
eval_result = evaluator.evaluate_strings(
    prediction="well designed rooms, clean, great location",
    input="List the good points that the customer mentioned",
)

Scoring evaluation

With criteria evaluation, we got only a Yes or No answer, but in many cases, that isn't enough. For example, in our case, the prediction has 3 out of 4 mentioned points, which is a good result, but we got an N when evaluating it for correctness. So, using this approach, the answers "well-designed rooms, clean, great location" and "fast internet" would be equal in terms of our metrics, which won't give us enough information to understand the model's performance.

There's another closely related technique, scoring, where you ask the LLM to provide a score in the output, which can help to get more granular results. Let's try it.

from langchain.chat_models import ChatOpenAI

accuracy_criteria = {
    "accuracy": """
Score 1: The answer doesn't mention any relevant points.
Score 3: The answer mentions only a few of the relevant points but has major inaccuracies or includes several irrelevant options.
Score 5: The answer covers a moderate number of the relevant options but might have inaccuracies or wrong points.
Score 7: The answer aligns with the reference and covers most of the relevant points, and doesn't have completely wrong options mentioned.
Score 10: The answer is completely accurate and aligns perfectly with the reference."""
}

evaluator = load_evaluator(
    "labeled_score_string",
    criteria=accuracy_criteria,
    llm=ChatOpenAI(model="gpt-4"),
)

eval_result = evaluator.evaluate_strings(
    prediction="well designed rooms, clean, great location",
    input="""Below is a customer review delimited by ```. Provide a list of the good points that the customer mentioned in the review.
Customer review:
```
Small but well designed rooms, clean, great location, good atmosphere. I would stay there again. Continental breakfast is weak but okay.
```
""",
    reference="well designed rooms, clean, great location, good atmosphere"
)

We got a score of 7, which looks pretty reasonable. Let's take a look at the actual prompt that was used.

However, I would treat scores from LLMs with a pinch of salt. Remember, it's not a regression function, and scores can be pretty subjective.

We've been using the scoring evaluator with a reference. But in many cases, we might not have the correct answers, or it could be expensive to get them. You can use the scoring evaluator even without reference scores, asking the model to assess the answer. It's worth using GPT-4 to be more confident in the results.

accuracy_criteria = {
    "recall": "The assistant's answer should include everything mentioned in the question. If something is missing, score the answer lower.",
    "precision": "The assistant's answer should not include any points that are not present in the question."
}

evaluator = load_evaluator("score_string", criteria=accuracy_criteria,
                           llm=ChatOpenAI(model="gpt-4"))

eval_result = evaluator.evaluate_strings(
    prediction="well designed rooms, clean, great location",
    input="""Below is a customer review delimited by ```. Provide a list of the good points that the customer mentioned in the review.
Customer review:
```
Small but well designed rooms, clean, great location, good atmosphere. I would stay there again. Continental breakfast is weak but okay.
```
"""
)

We got a score fairly close to the previous one.

We've looked at quite a few possible ways to validate your output, so I hope you are now ready to test your models' results.

In this article, we've discussed some nuances we need to take into account if we want to use LLMs in production processes.

  • We’ve checked out using the LangChain framework to make our solution more modular in order that we could easily iterate and use latest approaches (for instance, switching from one LLM to a different). Also, frameworks often help to make our code easier to take care of.
  • The opposite big topic we’ve discussed is different tools now we have to evaluate the model’s performance. If we’re using LLMs in production, we want to have some constant monitoring in place to make sure the standard of our service, and it’s value spending a while to create an evaluation pipeline based on LLMs or human-in-the-loop.

Thanks a lot for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.

Ganesan, Kavita and Zhai, ChengXiang. (2011). OpinRank Review Dataset.
UCI Machine Learning Repository.
https://doi.org/10.24432/C5QW4W

This article is based on information from the course "LangChain for LLM Application Development" by DeepLearning.AI and LangChain.
