There’s a lot of justifiable excitement around using LLMs as the basis for building a new generation of smart apps. While tools like LangChain or GPT-Index make it easy to create prototypes, it’s useful to look behind the curtain and see what the wizard looks like before taking these apps into production. In this post, we’ll show you how to build a Q&A chatbot using just basic OSS python tooling, OpenAI, and Lance.
— if you just want to play with the worked example, you can find a notebook on the Lance GitHub.
We’ll build a question-answering bot whose answers are drawn from YouTube transcripts. The result will be a Jupyter Notebook where you can get the answer to a question you ask AND see the top matching YouTube video, starting at the relevant point.
To make things easy, we’ll just use a ready-made dataset of YouTube transcripts on HuggingFace:
from datasets import load_dataset
data = load_dataset('jamescalam/youtube-transcriptions', split='train')
This dataset has 208,619 transcript sentences (across 700 videos):
Dataset({
features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
num_rows: 208619
})
We’ll use pandas to create context windows of size 20 with stride 4. That means every 4 sentences, we create a “context” by concatenating the next 20 sentences together. Eventually, the answers will come from finding the right context and summarizing it.
import numpy as np
import pandas as pd

window = 20
stride = 4

def contextualize(raw_df, window, stride):
    def process_video(vid):
        # For each video, create the text rolling window
        text = vid.text.values
        time_end = vid["end"].values
        contexts = vid.iloc[:-window:stride, :].copy()
        contexts["text"] = [' '.join(text[start_i:start_i+window])
                            for start_i in range(0, len(vid)-window, stride)]
        contexts["end"] = [time_end[start_i+window-1]
                           for start_i in range(0, len(vid)-window, stride)]
        return contexts
    # concat result from all videos
    return pd.concat([process_video(vid) for _, vid in raw_df.groupby("title")])

df = contextualize(data.to_pandas(), 20, 4)
A brief aside: it’s annoying that pandas doesn’t provide rolling window functions on non-numeric columns; otherwise process_video could just be a simple call to pd.DataFrame.rolling.
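A toy example (purely illustrative, not part of the pipeline) shows the limitation:

import pandas as pd

s = pd.Series(["a", "b", "c", "d"])
# Rolling windows only support numeric aggregations, so this raises
# an error rather than concatenating the strings:
try:
    s.rolling(2).sum()
except Exception as e:
    print(type(e).__name__)  # DataError (or TypeError, depending on version)

# A numeric column works fine:
print(pd.Series([1, 2, 3, 4]).rolling(2).sum())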
With this, we go from 200K+ sentences to <50K contexts:
>>> len(df)
48935
Here we’ll use the OpenAI embeddings API to turn text into embeddings. Other services like Cohere provide similar functionality, or you can run your own model.
The OpenAI python API requires an API key for authentication. For details, you can refer to their documentation to see how to set it up.
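One common setup, assuming your key lives in the OPENAI_API_KEY environment variable, looks like this:

import os
import openai

# The client also picks up OPENAI_API_KEY automatically, but setting it
# explicitly makes the dependency obvious:
openai.api_key = os.environ["OPENAI_API_KEY"]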
The documentation *looks* super easy:
import openai
openai.Embedding.create(input="text", engine="text-embedding-ada-002")
But in reality, the OpenAI API is almost always at capacity, or you’re being throttled, or there’s some cryptic JSON decode error. To make our lives easier, we can use the ratelimiter and retry packages in python:
import functools
import openai
import ratelimiter
from retry import retry

embed_model = "text-embedding-ada-002"

# API limit at 60/min == 1/sec
limiter = ratelimiter.RateLimiter(max_calls=0.9, period=1.0)

# Get the embedding with retry
@retry(tries=10, delay=1, max_delay=30, backoff=3, jitter=1)
def embed_func(c):
    rs = openai.Embedding.create(input=c, engine=embed_model)
    return [record["embedding"] for record in rs["data"]]

rate_limited = limiter(embed_func)
And we can request batches of embeddings instead of one at a time:
from tqdm.auto import tqdm
import math

# We request in batches rather than 1 embedding at a time
def to_batches(arr, batch_size):
    length = len(arr)
    def _chunker(arr):
        for start_i in range(0, length, batch_size):
            yield arr[start_i:start_i+batch_size]
    # add progress meter
    yield from tqdm(_chunker(arr), total=math.ceil(length / batch_size))

batch_size = 1000
batches = to_batches(df.text.values.tolist(), batch_size)
embeds = [emb for c in batches for emb in rate_limited(c)]
Once the embeddings are created, we have to make them searchable. At this point, most existing toolchains require you to spin up a separate service, or store your vectors separately from the data and separately from the vector index. This is where Lance comes in. We can merge the embeddings (vectors) with the original data and write it all to disk, so the vectors you need for similarity search live with the index that makes them fast to query and with the data you need to filter / return to the user.
import lance
import pyarrow as pa
from lance.vector import vec_to_table

table = vec_to_table(np.array(embeds))
combined = pa.Table.from_pandas(df).append_column("vector", table["vector"])
ds = lance.write_dataset(combined, "chatbot.lance")
This is enough to do vector search in Lance:
ds.to_table(nearest={"column": "vector",
                     "q": query_vector,  # a query embedding (e.g. one of embeds)
                     "k": 10             # number of rows to return (example value)
                     }).to_pandas()
The above query retrieves the k rows whose vector is most similar to the query vector q.
But a brute force approach is a bit slow (~150ms). Instead, if you’re going to be making a bunch of requests, you can create an ANN index on the vector column:
ds = ds.create_index("vector", index_type="IVF_PQ",
                     num_partitions=64, num_sub_vectors=96)
If you run the same search query again, you should get a much faster answer now.
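As a quick sanity check, you can time the query before and after indexing. Here’s a sketch that reuses one of the stored embeddings as a stand-in query vector:

import time

sample_query = embeds[0]  # any stored embedding works as a stand-in query
start = time.time()
ds.to_table(nearest={"column": "vector", "q": sample_query, "k": 10})
print(f"ANN search took {time.time() - start:.4f}s")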
Because Lance is columnar, you can easily retrieve additional columns. By default, to_table returns all columns plus an extra “score” column. You can add a columns argument to to_table to select a subset of the available columns. You can also filter the ANN results by passing a SQL where clause string to the filter parameter of to_table.
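For example, a sketch combining both (the column names come from the dataset above; the filter string is just a hypothetical illustration):

hits = ds.to_table(
    columns=["title", "url", "text", "start"],    # return only these columns
    nearest={"column": "vector", "q": embeds[0], "k": 5},
    filter="published > '2021-01-01'"             # SQL where clause (hypothetical)
).to_pandas()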
Unless you’re already familiar with how LLM apps work, you might be asking “what do embeddings have to do with Q&A bots?” This is where prompt engineering comes in. There’s a lot of great material dedicated to prompt engineering, so I’ll spare the details in this post. Here we’ll just go through a practical example.
The magic happens in 3 steps:
1. Embed the query text
Let’s say you want to ask “Which training method should I use for sentence transformers when I only have pairs of related sentences?” Since we’ve converted all of the transcripts into embeddings, we’ll do the same with the question text itself:
query = ("Which training method should I use for sentence transformers "
         "when I only have pairs of related sentences?")
rs = openai.Embedding.create(input=query, engine="text-embedding-ada-002")
query_vector = rs["data"][0]["embedding"]  # pull out the vector for the search below
2. Search for the most similar context
We then use the query vector to find the most similar contexts:
context = ds.to_table(
    nearest={
        "column": "vector",
        "k": 3,
        "q": query_vector
    }).to_pandas()
3. Create a prompt for the OpenAI completion API
LangChain and similar tools have a lot of prompt templates for various use cases. For the task at hand, we create our own:
"Answer the query based on the context below.nn"+
"Context:n"
And we plug the text from step 2 above into the context, as sketched below.
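A minimal sketch of that assembly (the exact prompt layout here is our own choice, not canonical):

# Join the top matching transcript windows into a single context blob
context_text = "\n\n".join(context["text"].tolist())
prompt = ("Answer the question based on the context below.\n\n" +
          "Context:\n" + context_text +
          "\n\nQuestion: " + query + "\nAnswer:")

We then use OpenAI again to get the answer: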
def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()
An example usage would look like:
>>> query = "who was the twelfth person on the moon and when did they land?"
>>> complete(query)
'The twelfth person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'
Putting it all together
Remember, our question was “Which training method should I use for sentence transformers when I only have pairs of related sentences?” Using the steps outlined above, I get back:
“NLI with multiple negatives ranking loss”
And since Lance lets you retrieve additional columns easily, I can not only show the most relevant YouTube video, but also start it at the right place in the video:
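Here’s a sketch of how that can look in a notebook (video_id and start are columns from the dataset; YouTubeVideo comes with IPython):

from IPython.display import YouTubeVideo

top = context.iloc[0]  # the best matching context from the ANN search
# 'start' is the timestamp (in seconds) where the matching window begins
YouTubeVideo(top["video_id"], start=int(top["start"]))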
In this post, we saw how to use Lance as a critical component to power an LLM-based app, and we went through an end-to-end workflow, peeling back the covers. For these kinds of search workflows, Lance is a great fit because 1) it’s super easy to use for ANN, 2) it’s columnar so you can add a ton of additional features, and 3) it has lightning-fast random access speed, so the index can be disk-based and extremely scalable.
If you’d like to give it a shot, you can find the Lance repo on GitHub. If you like us, we’d really appreciate a ⭐️ and your feedback!
Special thanks to Rob Meng for inspiring this post.