A couple of months ago, I released the Film Search app, a Retrieval-Augmented Generation (RAG) application designed to recommend movies based on user queries. For instance, a user might ask: “Find me drama movies in English that are less than 2 hours long and have dogs.” and receive a suggestion like:
Title of Film: Hachi: A Dog’s Tale
Runtime: 93 minutes
Release Year: 2009
Streaming: Not available for streaming
This film tells the poignant true story of Hachiko, an Akita dog known for his remarkable loyalty to his owner. The emotional depth and the themes of friendship and loyalty resonate strongly, making it a touching drama that showcases the profound bond between humans and dogs. It’s perfect for anyone searching for a heartfelt story that highlights the importance of companionship.…
This was not just a straightforward RAG app, however. It included what is known as self-querying retrieval. This means the bot takes the user’s query and transforms it by adding metadata filters. This ensures that any documents pulled into the chat model’s context respect the constraints set by the user’s query. For more information on this app, I recommend checking out my earlier article linked above.
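As a rough sketch of the idea (not Rosebud’s exact code), self-querying retrieval can be set up with LangChain’s SelfQueryRetriever. The vectorstore variable and the particular metadata fields below are assumptions for illustration only:

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_openai import ChatOpenAI

# Metadata fields the LLM is allowed to filter on (illustrative subset)
metadata_field_info = [
    AttributeInfo(name="Genre", description="The genre of the movie", type="string"),
    AttributeInfo(name="Language", description="The language of the movie", type="string"),
    AttributeInfo(name="Runtime (minutes)", description="The runtime of the movie in minutes", type="integer"),
]

# `vectorstore` is assumed to be an existing vector store of film documents
retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini"),
    vectorstore=vectorstore,
    document_contents="Brief overview of a movie",
    metadata_field_info=metadata_field_info,
)

# The query gets rewritten and constrained, e.g. Genre == 'Drama' and Runtime (minutes) < 120
docs = retriever.invoke("Find me drama movies in English that are less than 2 hours long and have dogs.")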
Unfortunately, there were some issues with the app:
- There was no offline evaluation done, besides passing the ‘eye test’. This test is necessary, but not sufficient.
- Observability was non-existent. If a query went poorly, you had to manually pull up the project and run some ad-hoc scripts in an attempt to see what went wrong.
- The Pinecone vector database had to be updated manually. This meant the documents would quickly become outdated if, say, a movie got pulled from a streaming service.
In this article, I’ll briefly cover some of the improvements made to the Film Search app. It will cover:
- Offline Evaluation using RAGAS and Weave
- Online Evaluation and Observability
- Automated Data Pulling using Prefect
One thing before we jump in: I found the name Film Search to be a bit generic, so I rebranded the app as Rosebud 🌹, hence the image shown above. Real film geeks will understand the reference.
It’s important to be able to evaluate whether a change made to your LLM application improves or degrades its performance. Unfortunately, evaluation of LLM apps is a difficult and novel space. There is simply not much agreement on what constitutes a good evaluation.
For Rosebud 🌹, I decided to tackle what is known as the RAG triad. This approach is promoted by TruLens, a platform to evaluate and track LLM applications.
The triad covers three aspects of a RAG app:
- Context Relevancy: When a query is made by the user, documents fill the context of the chat model. Is the retrieved context actually useful? If not, you may need to tweak things like document embedding, chunking, or metadata filtering.
- Faithfulness: Is the model’s response actually grounded in the retrieved documents? You don’t want the model making up facts; the whole point of RAG is to help reduce hallucinations by using retrieved documents.
- Answer Relevancy: Does the model’s response actually answer the user’s query? If the user asks for “Comedy movies made in the 1990s?”, the model’s answer had better contain only comedy movies made in the 1990s.
There are a few ways to try to assess these three aspects of a RAG app. One way would be to use human expert evaluators. Unfortunately, this would be expensive and wouldn’t scale. For Rosebud 🌹 I decided to use LLMs-as-judges. This means using a chat model to look at each of the three criteria above and assign a score from 0 to 1 for each. This method has the advantage of being cheap and scaling well. To accomplish this, I used RAGAS, a popular framework that helps you evaluate your RAG applications. The RAGAS framework includes the three metrics mentioned above and makes it fairly easy to use them to evaluate your apps. Below is a code snippet demonstrating how I conducted this offline evaluation:
import asyncio

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import AnswerRelevancy, ContextRelevancy, Faithfulness
import weave

# config and rosebud_chat_model are defined elsewhere in the project

@weave.op()
def evaluate_with_ragas(query, model_output):
    # Put data into a Dataset object
    data = {
        "question": [query],
        "contexts": [[model_output['context']]],
        "answer": [model_output['answer']]
    }
    dataset = Dataset.from_dict(data)

    # Define metrics to evaluate
    metrics = [
        AnswerRelevancy(),
        ContextRelevancy(),
        Faithfulness(),
    ]

    judge_model = ChatOpenAI(model=config['JUDGE_MODEL_NAME'])
    embeddings_model = OpenAIEmbeddings(model=config['EMBEDDING_MODEL_NAME'])

    evaluation = evaluate(dataset=dataset, metrics=metrics, llm=judge_model, embeddings=embeddings_model)

    return {
        "answer_relevancy": float(evaluation['answer_relevancy']),
        "context_relevancy": float(evaluation['context_relevancy']),
        "faithfulness": float(evaluation['faithfulness']),
    }

def run_evaluation():
    # Initialize chat model
    model = rosebud_chat_model()

    # Define evaluation questions
    questions = [
        {"query": "Suggest a good movie based on a book."},  # Adaptations
        {"query": "Suggest a film for a cozy night in."},  # Mood-Based
        {"query": "What are some must-watch horror movies?"},  # Genre-Specific
        ...
        # Total of 20 questions
    ]

    # Create Weave Evaluation object
    evaluation = weave.Evaluation(dataset=questions, scorers=[evaluate_with_ragas])

    # Run the evaluation
    asyncio.run(evaluation.evaluate(model))

if __name__ == "__main__":
    weave.init('film-search')
    run_evaluation()
A few notes:
- With twenty questions and three criteria to evaluate across, you’re looking at sixty LLM calls for a single evaluation! It gets even worse, though; with the rosebud_chat_model, there are two calls for each query: one to construct the metadata filter and another to provide the answer. So really that is 120 calls for a single eval! All models used in my evaluation are the new gpt-4o-mini, which I strongly recommend. In my experience the calls cost $0.05 per evaluation.
- Note that we are using asyncio.run to run the evals. It is ideal to use asynchronous calls because you don’t want to evaluate each question sequentially, one after the other. Instead, with asyncio we can begin evaluating other questions while we wait for previous I/O operations to complete.
- There are a total of twenty questions for a single evaluation. These span a variety of typical film queries a user might ask. I mostly came up with these myself, but in practice it would be better to use queries actually asked by users in production.
- Notice the weave.init and the @weave.op decorator that are being used. These are part of the new Weave library from Weights & Biases (W&B). Weave is a complement to the traditional W&B library, with a focus on LLM applications. It allows you to capture the inputs and outputs of LLMs by using the simple @weave.op decorator. It also allows you to capture the results of evaluations using weave.Evaluation(…). By integrating RAGAS to perform evaluations and Weave to capture and log them, we get a powerful duo that helps GenAI developers iteratively improve their applications. You also get to log the model latency, cost, and more.
In theory, one can now tweak a hyperparameter (e.g. temperature), re-run the evaluation, and see if the adjustment has a positive or negative impact. Unfortunately, in practice I found the LLM judging to be finicky, and I’m not the only one. LLM judges seem to be fairly bad at using a floating-point value to assess these metrics. Instead, they appear to do better at classification, e.g. a thumbs up/thumbs down. RAGAS does not yet support LLM judges performing classification. Writing it by hand doesn’t seem too difficult, and perhaps in a future update I’ll attempt this myself.
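As a rough illustration, a hand-rolled binary judge might look something like the sketch below. This is not a RAGAS feature and not something Rosebud currently does; the prompt, helper name, and YES/NO protocol are my own assumptions.

from langchain_openai import ChatOpenAI

def binary_faithfulness_judge(question: str, context: str, answer: str) -> int:
    # Hypothetical judge: returns 1 (faithful) or 0 (not faithful) instead of a float score
    judge = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    prompt = (
        "You are grading a RAG response.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Answer: {answer}\n"
        "Is every claim in the answer supported by the retrieved context? "
        "Reply with exactly one word: YES or NO."
    )
    verdict = judge.invoke(prompt).content.strip().upper()
    return 1 if verdict.startswith("YES") else 0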
Offline evaluation is good for seeing how tweaking hyperparameters affects performance, but in my opinion online evaluation is far more useful. In Rosebud 🌹 I have now incorporated the use of 👍/👎 buttons at the bottom of each response to provide feedback.
When a user clicks on either button, they are told that their feedback was logged. Below is a snippet of how this was achieved in the Streamlit interface:
import datetime
import threading

import streamlit as st
import wandb

def start_log_feedback(feedback):
    print("Logging feedback.")
    st.session_state.feedback_given = True
    st.session_state.sentiment = feedback
    thread = threading.Thread(target=log_feedback, args=(st.session_state.sentiment,
                                                         st.session_state.query,
                                                         st.session_state.query_constructor,
                                                         st.session_state.context,
                                                         st.session_state.response))
    thread.start()

def log_feedback(sentiment, query, query_constructor, context, response):
    ct = datetime.datetime.now()
    wandb.init(project="film-search",
               name=f"query: {ct}")
    table = wandb.Table(columns=["sentiment", "query", "query_constructor", "context", "response"])
    table.add_data(sentiment,
                   query,
                   query_constructor,
                   context,
                   response
                   )
    wandb.log({"Query Log": table})
    wandb.finish()
Note that the process of sending the feedback to W&B runs on a separate thread rather than on the main thread. This is to prevent the user from getting stuck for a few seconds waiting for the logging to complete.
A W&B table is used to store the feedback. Five quantities are logged in the table:
- Sentiment: Whether the user clicked thumbs up or thumbs down
- Query: The user’s query, e.g. Find me drama movies in English that are less than 2 hours long and have dogs.
- Query_Constructor: The results of the query constructor, which rewrites the user’s query and includes metadata filtering if necessary, e.g.
{
    "query": "drama English dogs",
    "filter": {
        "operator": "and",
        "arguments": [
            {
                "comparator": "eq", "attribute": "Genre", "value": "Drama"
            },
            {
                "comparator": "eq", "attribute": "Language", "value": "English"
            },
            {
                "comparator": "lt", "attribute": "Runtime (minutes)", "value": 120
            }
        ]
    }
}
- Context: The retrieved context based on the reconstructed query, e.g. Title: Hachi: A Dog’s Tale. Overview: A drama based on the true story of a college professor’s…
- Response: The model’s response
All of this is logged conveniently in the same project as the Weave evaluations shown earlier. Now, when a query goes south, it is as simple as hitting the thumbs-down button to see exactly what happened. This allows much faster iteration and improvement of the Rosebud 🌹 recommendation application.
To ensure recommendations from Rosebud 🌹 continue to stay accurate, it was necessary to automate the process of pulling data and uploading it to Pinecone. For this task, I chose Prefect. Prefect is a popular workflow orchestration tool. I was looking for something lightweight, easy to learn, and Pythonic. I found all of this in Prefect.
Prefect offers a variety of ways to schedule your workflows. I decided to use push work pools with automatic infrastructure provisioning. I found that this setup balances simplicity with configurability. It allows you to task Prefect with automatically provisioning all of the infrastructure needed to run your flow on your cloud provider of choice. I chose to deploy on Azure, but deploying on GCP or AWS only requires changing a few lines of code. Refer to the pinecone_flow.py file for more details. A simplified flow is provided below:
@task
def start():
    """
    Start-up: check everything works or fail fast!
    """
    # Print out some debug info
    print("Starting flow!")

    # Ensure user has set the appropriate env variables
    assert os.environ['LANGCHAIN_API_KEY']
    assert os.environ['OPENAI_API_KEY']
    ...

@task(retries=3, retry_delay_seconds=[1, 10, 100])
def pull_data_to_csv(config):
    TMBD_API_KEY = os.getenv('TMBD_API_KEY')
    YEARS = range(config["years"][0], config["years"][-1] + 1)
    CSV_HEADER = ['Title', 'Runtime (minutes)', 'Language', 'Overview', ...]

    for year in YEARS:
        # Grab list of ids for all movies made in {YEAR}
        movie_list = list(set(get_id_list(TMBD_API_KEY, year)))

        FILE_NAME = f'./data/{year}_movie_collection_data.csv'

        # Creating file
        with open(FILE_NAME, 'w') as f:
            writer = csv.writer(f)
            writer.writerow(CSV_HEADER)

        ...

    print("Successfully pulled data from TMDB and created csv files in data/")
@task
def convert_csv_to_docs():
    # Loading in data from all csv files
    loader = DirectoryLoader(
        ...
        show_progress=True)

    docs = loader.load()

    metadata_field_info = [
        AttributeInfo(name="Title",
                      description="The title of the movie", type="string"),
        AttributeInfo(name="Runtime (minutes)",
                      description="The runtime of the movie in minutes", type="integer"),
        ...
    ]

    def convert_to_list(doc, field):
        if field in doc.metadata and doc.metadata[field] is not None:
            doc.metadata[field] = [item.strip()
                                   for item in doc.metadata[field].split(',')]

    ...

    fields_to_convert_list = ['Genre', 'Actors', 'Directors',
                              'Production Companies', 'Stream', 'Buy', 'Rent']
    ...

    # Set 'overview' and 'keywords' as 'page_content' and other fields as 'metadata'
    for doc in docs:
        # Parse the page_content string into a dictionary
        page_content_dict = dict(line.split(": ", 1)
                                 for line in doc.page_content.split("\n") if ": " in line)

        doc.page_content = (
            'Title: ' + page_content_dict.get('Title') +
            '. Overview: ' + page_content_dict.get('Overview') +
            ...
        )

    ...

    print("Successfully took csv files and created docs")
    return docs
@task
def upload_docs_to_pinecone(docs, config):
    # Create empty index
    PINECONE_KEY, PINECONE_INDEX_NAME = os.getenv(
        'PINECONE_API_KEY'), os.getenv('PINECONE_INDEX_NAME')

    pc = Pinecone(api_key=PINECONE_KEY)

    # Target index and check status
    pc_index = pc.Index(PINECONE_INDEX_NAME)
    print(pc_index.describe_index_stats())

    embeddings = OpenAIEmbeddings(model=config['EMBEDDING_MODEL_NAME'])
    namespace = "film_search_prod"

    PineconeVectorStore.from_documents(
        docs,
        ...
    )

    print("Successfully uploaded docs to Pinecone vector store")

@task
def publish_dataset_to_weave(docs):
    # Initialize Weave
    weave.init('film-search')

    rows = []
    for doc in docs:
        row = {
            'Title': doc.metadata.get('Title'),
            'Runtime (minutes)': doc.metadata.get('Runtime (minutes)'),
            ...
        }
        rows.append(row)

    dataset = Dataset(name='Movie Collection', rows=rows)
    weave.publish(dataset)
    print("Successfully published dataset to Weave")
@flow(log_prints=True)
def pinecone_flow():
    with open('./config.json') as f:
        config = json.load(f)

    start()
    pull_data_to_csv(config)
    docs = convert_csv_to_docs()
    upload_docs_to_pinecone(docs, config)
    publish_dataset_to_weave(docs)

if __name__ == "__main__":
    pinecone_flow.deploy(
        name="pinecone-flow-deployment",
        work_pool_name="my-aci-pool",
        cron="0 0 * * 0",
        image=DeploymentImage(
            name="prefect-flows:latest",
            platform="linux/amd64",
        )
    )
Notice how easy it is to turn Python functions into a Prefect flow. All you need are some sub-functions decorated with @task and a @flow decorator on the main function. Also note that after uploading the documents to Pinecone, the last step of our flow publishes the dataset to Weave. This is important for reproducibility purposes.
At the bottom of the script we see how deployment is done in Prefect.
- We need to provide a name for the deployment. This is arbitrary.
- We also need to specify a work_pool_name. Push work pools in Prefect automatically send tasks to serverless computers without the need for a middleman. This name needs to match the name used to create the pool, which we'll see below.
- You also need to specify a cron, which is short for chronograph. This lets you specify how often to repeat a workflow. The value “0 0 * * 0” means repeat this workflow every week (see the field-by-field breakdown after this list). Check out this website for details on how the cron syntax works.
- Finally, you need to specify a DeploymentImage. Here you specify both a name and a platform. The name is arbitrary, but the platform is not. Since I want to deploy to Azure compute instances, and these instances run Linux, it’s important that I specify that in the DeploymentImage.
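For reference, here is how the five cron fields of the schedule used above break down:

# cron fields: minute hour day-of-month month day-of-week
# "0 0 * * 0" means:
#   minute       = 0   (on the hour)
#   hour         = 0   (midnight)
#   day of month = *   (any)
#   month        = *   (any)
#   day of week  = 0   (Sunday)
# i.e. run at 00:00 every Sunday, once per week.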
To deploy this flow on Azure using the CLI, run the following commands:
prefect work-pool create --type azure-container-instance:push --provision-infra my-aci-pool
prefect deployment run 'pinecone-flow/pinecone-flow-deployment'
These commands will automatically provision all of the necessary infrastructure on Azure. This includes an Azure Container Registry (ACR) that will hold a Docker image containing all files in your directory, as well as any necessary libraries listed in a requirements.txt. It will also include an Azure Container Instance (ACI) identity that will have the permissions necessary to deploy a container with the aforementioned Docker image. Finally, the deployment run command will schedule the code to be run every week. You can check the Prefect dashboard to see your flow get run:
By updating my Pinecone vector store weekly, I can ensure that the recommendations from Rosebud 🌹 remain accurate.
In this article, I discussed my experience improving the Rosebud 🌹 app. This included the process of incorporating offline and online evaluation, as well as automating the updating of my Pinecone vector store.
Other improvements not mentioned in this article:
- Including ratings from The Movie Database in the film data. You can now ask for “highly rated movies” and the chat model will filter for movies rated above a 7/10.
- Upgraded chat models. Now the query and summary models are using gpt-4o-mini. Recall that the LLM judge model is also using gpt-4o-mini.
- Embedding model upgraded to text-embedding-3-small from text-embedding-ada-002.
- Years now span 1950–2023, instead of starting at 1920. Film data from 1920–1950 was not high quality, and only messed up recommendations.
- UI is cleaner, with all details regarding the project relegated to a sidebar.
- Vastly improved documentation on GitHub.
- Bug fixes.
As mentioned at the top of the article, the app is now 100% free to use! I’ll foot the bill for queries for the foreseeable future (hence the choice of gpt-4o-mini instead of the more expensive gpt-4o). I really want to get the experience of running an app in production, and having my readers test out Rosebud 🌹 is a perfect way to do that. In the unlikely event that the app really blows up, I will have to come up with some other model of funding. But that would be a great problem to have.
Enjoy discovering awesome movies! 🎥