How we leveraged distilabel to create an Argilla 2.0 Chatbot

Discover easy methods to construct a Chatbot for a tool of your selection (Argilla 2.0 on this case) that may understand technical documentation and chat with users about it.

In this text, we’ll show you easy methods to leverage distilabel and fine-tune a domain-specific embedding model to create a conversational model that is each accurate and interesting.

This text outlines the strategy of making a Chatbot for Argilla 2.0. We’ll:

create an artificial dataset from the technical documentation to fine-tune a domain-specific embedding model,
create a vector database to store and retrieve the documentation and
deploy the ultimate Chatbot to a Hugging Face Space allowing users to interact with it, storing the interactions in Argilla for continuous evaluation and improvement.

Click here to go to the app.

Generating Synthetic Data for Advantageous-Tuning Custom Embedding Models

Need a fast recap on RAG? Brush up on the fundamentals with this handy intro notebook. We’ll wait so that you can get in control!

Downloading and chunking data

Chunking data means dividing your text data into manageable chunks of roughly 256 tokens each (chunk size utilized in RAG later).

Let’s dive into step one: processing the documentation of your goal repository. To simplify this task, you’ll be able to leverage libraries like llama-index to read the repository contents and parse the markdown files. Specifically, langchain offers useful tools like MarkdownTextSplitter and llama-index provides MarkdownNodeParser to allow you to extract the obligatory information. In the event you prefer a more streamlined approach, think about using the corpus-creator app from davanstrien.

To make things easier and more efficient, we have developed a custom Python script that does the heavy lifting for you. You could find it in our repository here.

This script automates the strategy of retrieving documentation from a GitHub repository and storing it as a dataset on the Hugging Face Hub. And one of the best part? It’s incredibly easy to make use of! Let’s examine how we will run it:

python docs_dataset.py 
    "argilla-io/argilla-python" 
    --dataset-name "plaguss/argilla_sdk_docs_raw_unstructured"

While the script is simple to make use of, you’ll be able to further tailor it to your needs by utilizing additional arguments. Nevertheless, there are two essential inputs you will need to supply:

The GitHub path to the repository where your documentation is stored
The dataset ID for the Hugging Face Hub, where your dataset might be stored

Once you have provided these required arguments, the script will deal with the remaining. Here’s what happens behind the scenes:

The script downloads the documentation from the desired GitHub repository to your local directory. By default, it looks for docs within the /docs directory by default, but you’ll be able to change this by specifying a distinct path.
It extracts all of the markdown files from the downloaded documentation.
Chunks the extracted markdown files into manageable pieces.
Finally, it pushes the prepared dataset to the Hugging Face Hub, making it ready to be used.

To provide you a greater understanding of the script’s inner workings, here’s a code snippet that summarizes the core logic:


from github import Github

gh = Github()
repo = gh.get_repo("repo_name")


download_folder(repo, "/folder/with/docs", "dir/to/download/docs") 


md_files = list(docs_path.glob("**/*.md"))


data = create_chunks(md_files)


create_dataset(data, repo_name="name/of/the/dataset")

The script includes short functions to download the documentation, create chunks from the markdown files, and create the dataset. Including more functionalities or implementing a more complex chunking strategy ought to be straightforward.

You’ll be able to take a have a look at the available arguments:

Click to see docs_dataset.py help message

$ python docs_dataset.py -h
usage: docs_dataset.py [-h] [--dataset-name DATASET_NAME] [--docs_folder DOCS_FOLDER] [--output_dir OUTPUT_DIR] [--private | --no-private] repo [repo ...]

Download the docs from a github repository and generate a dataset from the markdown files. The dataset might be pushed to the hub.

positional arguments:
  repo                  Name of the repository in the hub. For instance 'argilla-io/argilla-python'.

options:
  -h, --help            show this help message and exit
  --dataset-name DATASET_NAME
                        Name to present to the brand new dataset. For instance 'my-name/argilla_sdk_docs_raw'.
  --docs_folder DOCS_FOLDER
                        Name of the docs folder in the repo, defaults to 'docs'.
  --output_dir OUTPUT_DIR
                        Path to save lots of the downloaded files from the repo (optional)
  --private, --no-private
                        Whether to maintain the repository private or not. Defaults to False.

Generating synthetic data for our embedding model using distilabel

We’ll generate synthetic questions from our documentation that might be answered by every chunk of documentation. We may also generate hard negative examples by generating unrelated questions that might be easily distinguishable. We are able to use the questions, hard negatives, and docs to construct the triples for the fine-tuning dataset.

The complete pipeline script might be seen at pipeline_docs_queries.py within the reference repository, but let’s go over the various steps:

load_data:

Step one in our journey is to amass the dataset that houses the useful documentation chunks. Upon closer inspection, we notice that the column containing these chunks is aptly named chunks. Nevertheless, for our model to operate seamlessly, we want to assign a brand new identity to this column. Specifically, we wish to rename it to anchor, as that is the input our subsequent steps might be expecting. We’ll make use of output_mappings to do that column transformation for us:

load_data = LoadDataFromHub(
    name="load_data",
    repo_id="plaguss/argilla_sdk_docs_raw_unstructured",
    output_mappings={"chunks": "anchor"},
    batch_size=10,
)

generate_sentence_pair

Now, we arrive at essentially the most fascinating a part of our process, transforming the documentation pieces into synthetic queries. That is where the GenerateSentencePair task takes center stage. This powerful task offers a wide selection of possibilities for generating high-quality sentence pairs. We encourage you to explore its documentation to unlock its full potential.

In our specific use case, we’ll harness the capabilities of GenerateSentencePair to craft synthetic queries that may ultimately enhance our model’s performance. Let’s dive deeper into how we’ll configure this task to attain our goals.

llm = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",
    tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
)

generate_sentence_pair = GenerateSentencePair(
    name="generate_sentence_pair",
    triplet=True,  
    motion="query",
    context="The generated sentence must be related with Argilla, a knowledge annotation tool for AI engineers and domain experts.",
    llm=llm,
    input_batch_size=10,
    output_mappings={"model_name": "model_name_query"},
)

Let’s break down the code snippet above.

By setting triplet=True, we’re instructing the duty to supply a series of triplets, comprising an anchor, a positive sentence, and a negative sentence. This format is perfectly fitted to fine-tuning, as explained within the Sentence Transformers library’s training overview.

The motion="query" parameter is a vital aspect of this task, because it directs the LLM to generate queries for the positive sentences. That is where the magic happens, and our documentation chunks are transformed into meaningful queries.

To further assist the model, we have included the context argument. This provides additional information to the LLM when the anchor sentence lacks sufficient context, which is usually the case with temporary documentation chunks.

Finally, we have chosen to harness the ability of the meta-llama/Meta-Llama-3-70B-Instruct model, via the InferenceEndpointsLLM component. This selection enables us to tap into the model’s capabilities, generating high-quality synthetic queries that may ultimately enhance our model’s performance.

multiply_queries

Using the GenerateSentencePair step, we obtained as many examples for training as chunks we had, 251 on this case. Nevertheless, we recognize that this may not be sufficient to fine-tune a custom model that may accurately capture the nuances of our specific use case.

To beat this limitation, we’ll employ one other LLM to generate additional queries. This can allow us to extend the scale of our training dataset, providing our model with a richer foundation for learning.

This brings us to the following step in our pipeline: MultipleQueries, a custom Task that we have crafted to further augment our dataset.

multiply_queries = MultipleQueries(
    name="multiply_queries",
    num_queries=3,
    system_prompt=(
        "You might be an AI assistant helping to generate diverse examples. Make sure the "
        "generated queries are all in separated lines and preceded by a splash. "
        "Don't generate the rest or introduce the duty."
    ),
    llm=llm,
    input_batch_size=10,
    input_mappings={"query": "positive"},
    output_mappings={"model_name": "model_name_query_multiplied"},
)

Now, let’s delve into the configuration of our custom Task, designed to amplify our training dataset. The linchpin of this task is the num_queries parameter, set to three on this instance. This implies we’ll generate three additional “positive” queries for every example, effectively quadrupling our dataset size, assuming some examples may not succeed.

To make sure the Large Language Model (LLM) stays on the right track, we have crafted a system_prompt that gives clear guidance on our instructions. Given the strength of the chosen model and the simplicity of our examples, we didn’t have to employ structured generation techniques. Nevertheless, this could possibly be a useful approach in additional complex scenarios.

Interested by the inner workings of our custom Task? Click the dropdown below to explore the total definition:

MultipleQueries definition

multiply_queries_template = (
    "Given the next query:n{original}nGenerate {num_queries} similar queries by various "
    "the tone and the phrases barely. "
    "Make sure the generated queries are coherent with the unique reference and relevant to the context of information annotation "
    "and AI dataset development."
)

class MultipleQueries(Task):
    system_prompt: Optional[str] = None
    num_queries: int = 1

    @property
    def inputs(self) -> List[str]:
        return ["query"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        prompt = [
            {
                "role": "user",
                "content": multiply_queries_template.format(
                    original=input["query"],
                    num_queries=self.num_queries
                ),
            },
        ]
        if self.system_prompt:
            prompt.insert(0, {"role": "system", "content": self.system_prompt})
        return prompt

    @property
    def outputs(self) -> List[str]:
        return ["queries", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        queries = output.split("- ")
        if len(queries) > self.num_queries:
            queries = queries[1:]
        queries = [q.strip() for q in queries]
        return {"queries": queries}

merge_columns

As we approach the ultimate stages of our pipeline, our focus shifts to data processing. Our ultimate goal is to create a refined dataset, comprising rows of triplets fitted to fine-tuning. Nevertheless, after generating multiple queries, our dataset now incorporates two distinct columns: positive and queries. The positive column holds the unique query as a single string, while the queries column stores a listing of strings, representing the extra queries generated for a similar entity.

To merge these two columns right into a single, cohesive list, we’ll employ the MergeColumns step. This can enable us to mix the unique query with the generated queries, making a unified:

merge_columns = MergeColumns(
    name="merge_columns",
    columns=["positive", "queries"],
    output_column="positive"
)

expand_columns

Lastly, we use ExpandColumns to maneuver the previous column of positive to different lines. In consequence, each positive query will occupy a separate line, while the anchor and negative columns might be replicated to match the expanded positive queries. This data manipulation will yield a dataset with the perfect structure for fine-tuning:

expand_columns = ExpandColumns(columns=["positive"])

Click the dropdown to see the total pipeline definition:

Distilabel Pipeline

from pathlib import Path
from typing import Any, Dict, List, Union, Optional

from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub
from distilabel.llms import InferenceEndpointsLLM
from distilabel.steps.tasks import GenerateSentencePair
from distilabel.steps.tasks.base import Task
from distilabel.steps.tasks.typing import ChatType
from distilabel.steps import ExpandColumns, CombineKeys


multiply_queries_template = (
    "Given the next query:n{original}nGenerate {num_queries} similar queries by various "
    "the tone and the phrases barely. "
    "Make sure the generated queries are coherent with the unique reference and relevant to the context of information annotation "
    "and AI dataset development."
)

class MultipleQueries(Task):
    system_prompt: Optional[str] = None
    num_queries: int = 1

    @property
    def inputs(self) -> List[str]:
        return ["query"]

    def format_input(self, input: Dict[str, Any]) -> ChatType:
        prompt = [
            {
                "role": "user",
                "content": multiply_queries_template.format(
                    original=input["query"],
                    num_queries=self.num_queries
                ),
            },
        ]
        if self.system_prompt:
            prompt.insert(0, {"role": "system", "content": self.system_prompt})
        return prompt

    @property
    def outputs(self) -> List[str]:
        return ["queries", "model_name"]

    def format_output(
        self, output: Union[str, None], input: Dict[str, Any]
    ) -> Dict[str, Any]:
        queries = output.split("- ")
        if len(queries) > self.num_queries:
            queries = queries[1:]
        queries = [q.strip() for q in queries]
        return {"queries": queries}


with Pipeline(
    name="embedding-queries",
    description="Generate queries to coach a sentence embedding model."
) as pipeline:
    load_data = LoadDataFromHub(
        name="load_data",
        repo_id="plaguss/argilla_sdk_docs_raw_unstructured",
        output_mappings={"chunks": "anchor"},
        batch_size=10,
    )

    llm = InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
    )

    generate_sentence_pair = GenerateSentencePair(
        name="generate_sentence_pair",
        triplet=True,  
        motion="query",
        context="The generated sentence must be related with Argilla, a knowledge annotation tool for AI engineers and domain experts.",
        llm=llm,
        input_batch_size=10,
        output_mappings={"model_name": "model_name_query"},
    )

    multiply_queries = MultipleQueries(
        name="multiply_queries",
        num_queries=3,
        system_prompt=(
            "You might be an AI assistant helping to generate diverse examples. Make sure the "
            "generated queries are all in separated lines and preceded by a splash. "
            "Don't generate the rest or introduce the duty."
        ),
        llm=llm,
        input_batch_size=10,
        input_mappings={"query": "positive"},
        output_mappings={"model_name": "model_name_query_multiplied"},
    )

    merge_columns = MergeColumns(
        name="merge_columns",
        columns=["positive", "queries"],
        output_column="positive"
    )

    expand_columns = ExpandColumns(
        columns=["positive"],
    )

    (
        load_data
        >> generate_sentence_pair
        >> multiply_queries
        >> merge_columns
        >> expand_columns
    )


if __name__ == "__main__":

    pipeline_parameters = {
        "generate_sentence_pair": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 512,
                }
            }
        },
        "multiply_queries": {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 512,
                }
            }
        }
    }

    distiset = pipeline.run(
        parameters=pipeline_parameters
    )
    distiset.push_to_hub("plaguss/argilla_sdk_docs_queries")

Explore the datasets in Argilla

Now that we have generated our datasets, it is time to dive deeper and refine them as needed using Argilla. To start, take a have a look at our argilla_datasets.ipynb notebook, which provides a step-by-step guide on easy methods to upload your datasets to Argilla.

In the event you have not arrange an Argilla instance yet, don’t fret! Follow our easy-to-follow guide within the docs to create a Hugging Face Space with Argilla. Once you have got your Space up and running, simply connect with it by updating the api_url to point to your Space:

import argilla as rg

client = rg.Argilla(
    api_url="https://plaguss-argilla-sdk-chatbot.hf.space",
    api_key="YOUR_API_KEY"
)

An Argilla dataset with chunks of technical documentation

Along with your Argilla instance up and running, let’s move on to the following step: configuring the Settings on your dataset. The excellent news is that the default Settings we’ll create should work seamlessly on your specific use case, without having for further adjustments:

settings = rg.Settings(
    guidelines="Review the chunks of docs.",
    fields=[
        rg.TextField(
            name="filename",
            title="Filename where this chunk was extracted from",
            use_markdown=False,
        ),
        rg.TextField(
            name="chunk",
            title="Chunk from the documentation",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="good_chunk",
            title="Does this chunk contain relevant information?",
            labels=["yes", "no"],
        )
    ],
)

Let’s take a better have a look at the dataset structure we have created. We’ll examine the filename and chunk fields, which contain the parsed filename and the generated chunks, respectively. To further enhance our dataset, we will define an easy label query, good_chunk, which allows us to manually label each chunk as useful or not. This human-in-the-loop approach enables us to refine our automated generation process. With these essential elements in place, we’re now able to create our dataset:

dataset = rg.Dataset(
    name="argilla_sdk_docs_raw_unstructured",
    settings=settings,
    client=client,
)
dataset.create()

Now, let’s retrieve the dataset we created earlier from the Hugging Face Hub. Recall the dataset we generated within the chunking data section? We’ll download that dataset and extract the essential columns we want to maneuver forward:

from datasets import load_dataset

data = (
    load_dataset("plaguss/argilla_sdk_docs_raw_unstructured", split="train")
    .select_columns(["filename", "chunks"])
    .to_list()
)

We have reached the ultimate milestone! To bring all the things together, let’s log the records to Argilla. This can allow us to visualise our dataset within the Argilla interface, providing a transparent and intuitive method to explore and interact with our data:

dataset.records.log(records=data, mapping={"filename": "filename", "chunks": "chunk"})

These are the type of examples you could possibly expect to see:

An Argilla dataset with triplets to fine-tune an embedding model

Now, we will repeat the method with the dataset ready for fine-tuning we generated within the previous section.
Fortunately, the method is easy: simply download the relevant dataset and upload it to Argilla with its designated name. For an in depth walkthrough, consult with the Jupyter notebook, which incorporates all of the obligatory instructions:

settings = rg.Settings(
    guidelines="Review the chunks of docs.",
    fields=[
        rg.TextField(
            name="anchor",
            title="Anchor (Chunk from the documentation).",
            use_markdown=False,
        ),
        rg.TextField(
            name="positive",
            title="Positive sentence that queries the anchor.",
            use_markdown=False,
        ),
        rg.TextField(
            name="negative",
            title="Negative sentence that may use similar words but has content unrelated to the anchor.",
            use_markdown=False,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="is_positive_relevant",
            title="Is the positive query relevant?",
            labels=["yes", "no"],
        ),
        rg.LabelQuestion(
            name="is_negative_irrelevant",
            title="Is the negative query irrelevant?",
            labels=["yes", "no"],
        )
    ],
)

Let’s take a better have a look at the structure of our dataset, which consists of three essential TextFields: anchor, positive, and negative. The anchor field represents the chunk of text itself, while the positive field incorporates a question that might be answered using the anchor text as a reference. In contrast, the negative field holds an unrelated query that serves as a negative example within the triplet. The positive and negative questions play a vital role in helping our model distinguish between these examples and learn effective embeddings.

An example might be seen in the next image:

The dataset settings we have established thus far have been focused on exploring our dataset, but we will take it a step further. By customizing these settings, we will discover and proper incorrect examples, refine the standard of generated questions, and iteratively improve our dataset to higher suit our needs.

An Argilla dataset to trace the chatbot conversations

Now, let’s create our final dataset, which might be dedicated to tracking user interactions with our chatbot. Note: You might wish to revisit this section after completing the Gradio app, as it is going to provide a more comprehensive understanding of the context. For now, let’s take a have a look at the Settings for this dataset:

settings_chatbot_interactions = rg.Settings(
    guidelines="Review the user interactions with the chatbot.",
    fields=[
        rg.TextField(
            name="instruction",
            title="User instruction",
            use_markdown=True,
        ),
        rg.TextField(
            name="response",
            title="Bot response",
            use_markdown=True,
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="is_response_correct",
            title="Is the response correct?",
            labels=["yes", "no"],
        ),
        rg.LabelQuestion(
            name="out_of_guardrails",
            title="Did the model answered something out of the peculiar?",
            description="If the model answered something unrelated to Argilla SDK",
            labels=["yes", "no"],
        ),
        rg.TextQuestion(
            name="feedback",
            title="Let any feedback here",
            description="This field ought to be used to report any feedback that might be useful",
            required=False
        ),
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="conv_id",
            title="Conversation ID",
        ),
        rg.IntegerMetadataProperty(
            name="turn",
            min=0,
            max=100,
            title="Conversation Turn",
        )
    ]
)

On this dataset, we’ll define two essential fields: instruction and response. The instruction field will store the initial query, and if the conversation is prolonged, it is going to contain all the conversation history as much as that time. The response field, however, will hold the chatbot’s most up-to-date response. To facilitate evaluation and feedback, we’ll include three questions: one to evaluate the correctness of the response, one other to find out if the model strayed off-topic, and an optional field for users to supply feedback on the response. Moreover, we’ll include two metadata properties to enable filtering and evaluation of the conversations: a novel conversation ID and the turn number throughout the conversation.

An example might be seen in the next image:

Once our chatbot has garnered significant user engagement, this dataset can function a useful resource to refine and enhance our model, allowing us to iterate and improve its performance based on real-world interactions.

Advantageous-Tune the embedding model

Now that our custom embedding model dataset is ready, it is time to dive into the training process.

To guide us through this step, we’ll be referencing the train_embedding.ipynb notebook, which pulls inspiration from Philipp Schmid’s blog post on fine-tuning embedding models for RAG. While the blog post provides a comprehensive overview of the method, we’ll deal with the important thing differences and nuances specific to our use case.

For a deeper understanding of the underlying decisions and an in depth walkthrough, be sure you take a look at the unique blog post and review the notebook for a step-by-step explanation.

Prepare the embedding dataset

We’ll begin by downloading the dataset and choosing the essential columns, which conveniently already align with the naming conventions expected by Sentence Transformers. Next, we’ll add a novel id column to every sample and split the dataset into training and testing sets, allocating 90% for training and 10% for testing. Finally, we’ll convert the formatted dataset right into a JSON file, able to be fed into the trainer for model fine-tuning:

from datasets import load_dataset


dataset = (
    load_dataset("plaguss/argilla_sdk_docs_queries", split="train")
    .select_columns(["anchor", "positive", "negative"])  
    .add_column("id", range(len(dataset)))               
    .train_test_split(test_size=0.1)                     
)
 

dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

Load the baseline model

With our dataset in place, we will now load the baseline model that may function the muse for our fine-tuning process. We’ll be using the identical model employed within the reference blog post, ensuring a consistent place to begin for our custom embedding model development:

from sentence_transformers import SentenceTransformerModelCardData, SentenceTransformer
 
model = SentenceTransformer(
    "BAAI/bge-base-en-v1.5",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="BGE base ArgillaSDK Matryoshka",
    ),
)

Define the loss function

Given the structure of our dataset, we’ll leverage the TripletLoss function, which is best suited to handle our (anchor-positive-negative) triplets. Moreover, we’ll mix it with the MatryoshkaLoss, a strong loss function that has shown promising results (for a deeper dive into MatryoshkaLoss, take a look at this text):

from sentence_transformers.losses import MatryoshkaLoss, TripletLoss
 
inner_train_loss = TripletLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=[768, 512, 256, 128, 64]
)

Define the training strategy

Now that we’ve got our baseline model and loss function in place, it is time to define the training arguments that may guide the fine-tuning process. Since this work was done on an Apple M2 Pro, we want to make some adjustments to make sure a smooth training experience.

To accommodate the limited resources of our machine, we’ll reduce the per_device_train_batch_size and per_device_eval_batch_size in comparison with the unique blog post. Moreover, we’ll have to remove the tf32 and bf16 precision options, as they are not supported on this device. Moreover, we’ll swap out the adamw_torch_fused optimizer, which might be utilized in a Google Colab notebook for faster training. By making these modifications, we’ll have the option to fine-tune our model:

from sentence_transformers import SentenceTransformerTrainingArguments
  

args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-argilla-sdk-matryoshka", 
    num_train_epochs=3,                           
    per_device_train_batch_size=8,                
    gradient_accumulation_steps=4,                
    per_device_eval_batch_size=4,                 
    warmup_ratio=0.1,                             
    learning_rate=2e-5,                           
    lr_scheduler_type="cosine",                   
    eval_strategy="epoch",                        
    save_strategy="epoch",                        
    logging_steps=5,                              
    save_total_limit=1,                           
    load_best_model_at_end=True,                  
    metric_for_best_model="eval_dim_512_cosine_ndcg@10",  
)

Train and save the ultimate model

from sentence_transformers import SentenceTransformerTrainer
 
trainer = SentenceTransformerTrainer(
    model=model,    
    args=args,      
    train_dataset=train_dataset.select_columns(
        ["anchor", "positive", "negative"]
    ),  
    loss=train_loss,
    evaluator=evaluator,
)


trainer.train()
 

trainer.save_model()
 

trainer.model.push_to_hub("bge-base-argilla-sdk-matryoshka")

And that is it! We are able to take a have a look at the brand new model: plaguss/bge-base-argilla-sdk-matryoshka. Take a better have a look at the dataset card, which is filled with useful insights and knowledge about our model.

But that is not all! In the following section, we’ll put our model to the test and see it in motion.

The vector database

We have made significant progress thus far, making a dataset and fine-tuning a model for our RAG chatbot. Now, it is time to construct the vector database that may empower our chatbot to store and retrieve relevant information efficiently.

In the case of selecting a vector database, there are many alternatives available. To maintain things easy and easy, we’ll be using lancedb, a light-weight, embedded database that does not require a server, much like SQLite. As we’ll see, lancedb allows us to create an easy file to store our embeddings, making it easy to maneuver around and retrieve data quickly, which is ideal for our use case.

To follow along, please consult with the accompanying notebook: vector_db.ipynb. On this notebook, we’ll delve into the small print of constructing and utilizing our vector database.

Connect with the database

After installing the dependencies, let’s instantiate the database:

import lancedb


db = lancedb.connect("./lancedb")

As we execute the code, a brand new folder should materialize in our current working directory, signaling the successful creation of our vector database.

Instantiate the fine-tuned model

Now that our vector database is about up, it is time to load our fine-tuned model. We’ll utilize the sentence-transformers registry to load the model, unlocking its capabilities and preparing it for motion:

import torch
from lancedb.embeddings import get_registry

model_name = "plaguss/bge-base-argilla-sdk-matryoshka"
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"

model = get_registry().get("sentence-transformers").create(name=model_name, device=device)

Create the table with the documentation chunks

With our fine-tuned model loaded, we’re able to create the table that may store our embeddings. To define the schema for this table, we’ll employ a LanceModel, much like pydantic.BaseModel, to create a strong representation of our Docs entity.

from lancedb.pydantic import LanceModel, Vector

class Docs(LanceModel):
    query: str = model.SourceField()
    text: str = model.SourceField()
    vector: Vector(model.ndims()) = model.VectorField()

table_name = "docs"
table = db.create_table(table_name, schema=Docs)

The previous code snippet sets the stage for making a table with three essential columns:

query: dedicated to storing the synthetic query
text: housing the chunked documentation text
vector: related to the dimension from our fine-tuned model, able to store the embeddings

With this table structure in place, we will now interact with the table.

Populate the table

With our table structure established, we’re now able to populate it with data. Let’s load the ultimate dataset, which incorporates the queries, and ingest them into our database, accompanied by their corresponding embeddings. This important step will bring our vector database to life, enabling our chatbot to store and retrieve relevant information efficiently:

ds = load_dataset("plaguss/argilla_sdk_docs_queries", split="train")

batch_size = 50
for batch in tqdm.tqdm(ds.iter(batch_size), total=len(ds) // batch_size):
    embeddings = model.generate_embeddings(batch["positive"])
    df = pd.DataFrame.from_dict({"query": batch["positive"], "text": batch["anchor"], "vector": embeddings})
    table.add(df)

Within the previous code snippet, we iterated over the dataset in batches, generating embeddings for the synthetic queries within the positive column using our fine-tuned model. We then created a Pandas dataframe, to incorporate the query, text, and vector columns. This dataframe combines the positive and anchor columns with the freshly generated embeddings, respectively.

Now, let’s put our vector database to the test! For a sample query, “How can I get the present user?” (using the Argilla SDK), we’ll generate the embedding using our custom embedding model. We’ll then seek for the highest 3 most similar occurrences in our table, leveraging the cosine metric to measure similarity. Finally, we’ll extract the relevant text column, which corresponds to the chunk of documentation that best matches our query:

query = "How can I get the present user?"
embedded_query = model.generate_embeddings([query])

retrieved = (
    table
        .search(embedded_query[0])
        .metric("cosine")
        .limit(3)
        .select(["text"])  
        .to_list()
)

Click to see the result

This might be the result:

>>> retrieved
[{'text': 'pythonnuser = client.users("my_username")nnThe current user of the rg.Argilla client can be accessed using the me attribute:nnpythonnclient.mennClass Referencennrg.Usernn::: argilla_sdk.users.Usern    options:n        heading_level: 3',
  '_distance': 0.1881886124610901},
 {'text': 'pythonnuser = client.users("my_username")nnThe current user of the rg.Argilla client can be accessed using the me attribute:nnpythonnclient.mennClass Referencennrg.Usernn::: argilla_sdk.users.Usern    options:n        heading_level: 3',
  '_distance': 0.20238929986953735},
 {'text': 'Retrieve a usernnYou can retrieve an existing user from Argilla by accessing the users attribute on the Argilla class and passing the username as an argument.nn```pythonnimport argilla_sdk as rgnnclient = rg.Argilla(api_url="", api_key="")nnretrieved_user = client.users("my_username")n```',
  '_distance': 0.20401990413665771}]

>>> print(retrieved[0]["text"])
python
user = client.users("my_username")

The present user of the rg.Argilla client might be accessed using the me attribute:

python
client.me

Class Reference

rg.User

::: argilla_sdk.users.User
    options:
        heading_level: 3

Let’s dive into the primary row of our dataset and see what insights we will uncover. At first glance, it appears to contain information related to the query, which is precisely what we would expect. To get the present user, we will utilize the client.me method. Nevertheless, we also notice some extraneous content, which is probably going a results of the chunking strategy employed. This strategy, while effective, may benefit from some refinement. By reviewing the dataset in Argilla, we will gain a deeper understanding of easy methods to optimize our chunking approach, ultimately resulting in a more streamlined dataset. For now, though, it looks like a solid place to begin to construct upon.

Store the database within the Hugging Face Hub

Now that we’ve got a database, we’ll store it as one other artifact in our dataset repository. You’ll be able to visit the repo to seek out the functions that will help us, however it’s so simple as running the next function:

import Path
import os

local_dir = Path.home() / ".cache/argilla_sdk_docs_db"

upload_database(
    local_dir / "lancedb",
    repo_id="plaguss/argilla_sdk_docs_queries",
    token=os.getenv("HF_API_TOKEN")
)

The ultimate step in our database storage journey is only a command away! By running the function, we’ll create a brand latest file called lancedb.tar.gz, which can neatly package our vector database. You’ll be able to take a sneak peek on the resulting file within the plaguss/argilla_sdk_docs_queries repository on the Hugging Face Hub, where it’s stored alongside other essential files.

db_path = download_database(repo_id)

The moment of truth has arrived! With our database successfully downloaded, we will now confirm that all the things is so as. By default, the file might be stored at Path.home() / ".cache/argilla_sdk_docs_db", but might be easily customized. We are able to connect again to it and check all the things works as expected:

db = lancedb.connect(str(db_path))
table = db.open_table(table_name)

query = "how can I delete users?"

retrieved = (
    table
        .search(query)
        .metric("cosine")
        .limit(1)
        .to_pydantic(Docs)
)

for d in retrieved:
    print("======nQUERYn======")
    print(d.query)
    print("======nDOCn======")
    print(d.text)

The database for the retrieval of documents is completed, so let’s go for the app!

Creating our ChatBot

All of the pieces are ready for our chatbot; we want to attach them and make them available in an interface.

The Gradio App

Let’s bring the RAG app to life! Using gradio, we will effortlessly create chatbot apps. On this case, we’ll design an easy yet effective interface to showcase our chatbot’s capabilities. To see the app in motion, take a have a look at the app.py script within the Argilla SDK Chatbot repository on GitHub.

Before we dive into the small print of constructing our chatbot app, let’s take a step back and admire the end result. With just a number of lines of code, we have managed to create a user-friendly interface that brings our RAG chatbot to life.

import gradio as gr

gr.ChatInterface(
    chatty,
    chatbot=gr.Chatbot(height=600),
    textbox=gr.Textbox(placeholder="Ask me in regards to the latest argilla SDK", container=False, scale=7),
    title="Argilla SDK Chatbot",
    description="Ask an issue about Argilla SDK",
    theme="soft",
    examples=[
        "How can I connect to an argilla server?",
        "How can I access a dataset?",
        "How can I get the current user?"
    ],
    cache_examples=True,
    retry_btn=None,
).launch()

And there you could have it! In the event you’re wanting to learn more about creating your personal chatbot, be sure you take a look at Gradio’s excellent guide on Chatbot with Gradio. It is a treasure trove of data that may have you constructing your personal chatbot very quickly.

Now, let’s delve deeper into the inner workings of our app.py script. We’ll break down the important thing components, specializing in the essential elements that bring our chatbot to life. To maintain things concise, we’ll gloss over a number of the finer details.

First up, let’s examine the Database class, the backbone of our chatbot’s knowledge and functionality. This component plays a significant role in storing and retrieving the info that fuels our chatbot’s conversations:

Click to see Database class

class Database:

    def __init__(self, settings: Settings) -> None:

        self.settings = settings
        self._table: lancedb.table.LanceTable = self.get_table_from_db()

    def get_table_from_db(self) -> lancedb.table.LanceTable:

        lancedb_db_path = self.settings.LOCAL_DIR / self.settings.LANCEDB

        if not lancedb_db_path.exists():
            lancedb_db_path = download_database(
                self.settings.REPO_ID,
                lancedb_file=self.settings.LANCEDB_FILE_TAR,
                local_dir=self.settings.LOCAL_DIR,
                token=self.settings.TOKEN,
            )

        db = lancedb.connect(str(lancedb_db_path))
        table = db.open_table(self.settings.TABLE_NAME)
        return table

    def retrieve_doc_chunks(
        self, query: str, limit: int = 12, hard_limit: int = 4
    ) -> str:

        
        embedded_query = model.generate_embeddings([query])
        field_to_retrieve = "text"
        retrieved = (
            self._table.search(embedded_query[0])
            .metric("cosine")
            .limit(limit)
            .select([field_to_retrieve])  
            .to_list()
        )
        return self._prepare_context(retrieved, hard_limit)

    @staticmethod
    def _prepare_context(retrieved: list[dict[str, str]], hard_limit: int) -> str:

        
        
        responses = []
        unique_responses = set()

        for item in retrieved:
            chunk = item["text"]
            if chunk not in unique_responses:
                unique_responses.add(chunk)
                responses.append(chunk)

        context = ""
        for i, item in enumerate(responses[:hard_limit]):
            if i > 0:
                context += "nn"
            context += f"---n{item}"
        return context

With our Database class in place, we have successfully bridged the gap between our chatbot’s conversational flow and the knowledge stored in our database. Now, let’s bring all the things together! Once we have downloaded our embedding model (the script will do it robotically), we will instantiate the Database class, effectively deploying our database to the specified location – on this case, our Hugging Face Space.

This marks a serious milestone in our chatbot development journey. With our database integrated and prepared for motion, we’re only a step away from unleashing our chatbot’s full potential.

database = Database(settings=settings)  

context = database.retrieve_doc_chunks("How can I delete a user?", limit=2, hard_limit=1)

>>> print(context)

Click to see Settings class

@dataclass
class Settings:
    LANCEDB: str = "lancedb"
    LANCEDB_FILE_TAR: str = "lancedb.tar.gz"
    TOKEN: str = os.getenv("HF_API_TOKEN")
    LOCAL_DIR: Path = Path.home() / ".cache/argilla_sdk_docs_db"
    REPO_ID: str = "plaguss/argilla_sdk_docs_queries"
    TABLE_NAME: str = "docs"
    MODEL_NAME: str = "plaguss/bge-base-argilla-sdk-matryoshka"
    DEVICE: str = (
        "mps"
        if torch.backends.mps.is_available()
        else "cuda"
        if torch.cuda.is_available()
        else "cpu"
    )
    MODEL_ID: str = "meta-llama/Meta-Llama-3-70B-Instruct"

The ultimate piece of the puzzle is now in place – our database is able to fuel our chatbot’s conversations. Next, we want to arrange our model to handle the influx of user queries. That is where the ability of inference endpoints comes into play. These dedicated endpoints provide a seamless method to deploy and manage our model, ensuring it is usually ready to reply to user input.

Fortunately, working with inference endpoints is a breeze, because of the inference client from the huggingface_hub library:

def get_client_and_tokenizer(
    model_id: str = settings.MODEL_ID, tokenizer_id: Optional[str] = None
) -> tuple[InferenceClient, AutoTokenizer]:
    if tokenizer_id is None:
        tokenizer_id = model_id

    client = InferenceClient()
    base_url = client._resolve_url(model=model_id, task="text-generation")
    
    client = InferenceClient(model=base_url, token=os.getenv("HF_API_TOKEN"))

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    return client, tokenizer


client, tokenizer = get_client_and_tokenizer()

With our components in place, we have reached the stage of preparing the prompt that might be fed into our client. This prompt will serve because the input that sparks the magic of our machine learning model, guiding it to generate a response that is each accurate and informative, while avoiding answering unrelated questions. On this section, we’ll delve into the small print of crafting a well-structured prompt that sets our model up for achievement. The prepare_input function will prepare the conversation, applying the prompt and the chat template to be passed to the model:

def prepare_input(message: str, history: list[tuple[str, str]]) -> str:

    
    context = database.retrieve_doc_chunks(message)

    
    conversation = []
    for human, bot in history:
        conversation.append({"role": "user", "content": human})
        conversation.append({"role": "assistant", "content": bot})

    conversation.insert(0, {"role": "system", "content": SYSTEM_PROMPT})
    conversation.append(
        {
            "role": "user",
            "content": ARGILLA_BOT_TEMPLATE.format(message=message, context=context),
        }
    )

    return tokenizer.apply_chat_template(
        [conversation],
        tokenize=False,
        add_generation_prompt=True,
    )[0]

This function will take two arguments: message and history courtesy of the gradio ChatInterface, obtain the documentation pieces from the database to assist the LLM with the response, and prepare the prompt to be passed to our LLM model.

Click to see the system prompt and the bot template

These are the system_prompt and the prompt template used. They’re heavily inspired by wandbot from Weights and Biases.

SYSTEM_PROMPT = """
You might be a support expert in Argilla SDK, whose goal is help users with their questions.
As a trustworthy expert, you will need to provide truthful answers to questions using only the provided documentation snippets, not prior knowledge.
Listed here are guidelines you will need to follow when responding to user questions:

##Purpose and Functionality**
- Answer questions related to the Argilla SDK.
- Provide clear and concise explanations, relevant code snippets, and guidance depending on the user's query and intent.
- Ensure users achieve effectively understanding and using Argilla's features.
- Provide accurate responses to the user's questions.

**Specificity**
- Be specific and supply details only when required.
- Where obligatory, ask clarifying questions to higher understand the user's query.
- Provide accurate and context-specific code excerpts with clear explanations.
- Make sure the code snippets are syntactically correct, functional, and run without errors.
- For code troubleshooting-related questions, deal with the code snippet and clearly explain the difficulty and easy methods to resolve it. 
- Avoid boilerplate code equivalent to imports, installs, etc.

**Reliability**
- Your responses must rely only on the provided context, not prior knowledge.
- If the provided context doesn't help answer the query, just say you do not know.
- When providing code snippets, make sure the functions, classes, or methods are derived only from the context and never prior knowledge.
- Where the provided context is insufficient to reply faithfully, admit uncertainty.
- Remind the user of your specialization in Argilla SDK support when an issue is outside your domain of experience.
- Redirect the user to the suitable support channels - Argilla [community](https://join.slack.com/t/rubrixworkspace/shared_invite/zt-whigkyjn-a3IUJLD7gDbTZ0rKlvcJ5g) when the query is outside your capabilities otherwise you don't have enough context to reply the query.

**Response Style**
- Use clear, concise, skilled language suitable for technical support
- Don't consult with the context within the response (e.g., "As mentioned within the context...") as an alternative, provide the knowledge directly within the response.

**Example**:

The proper answer to the user's query

 Steps to unravel the issue:
 - **Step 1**: ...
 - **Step 2**: ...
 ...

 Here's a code snippet

 ```python
 # Code example
 ...
 ```
 
 **Explanation**:

 - Point 1
 - Point 2
 ...
"""

ARGILLA_BOT_TEMPLATE = """
Please provide a solution to the next query related to Argilla's latest SDK.

You'll be able to make use of the chunks of documents within the context to allow you to generating the response.

## Query:
{message}

## Context:
{context}
"""

We have reached the culmination of our conversational AI system: the chatty function. This function serves because the orchestrator, bringing together the assorted components we have built thus far. Its primary responsibility is to invoke the prepare_input function, which crafts the prompt that might be passed to the client. Then, we yield the stream of text because it’s being generated, and once the response is finished, the conversation history might be saved, providing us with a useful resource to review and refine our model, ensuring it continues to enhance with each iteration.

def chatty(message: str, history: list[tuple[str, str]]) -> Generator[str, None, None]:
    prompt = prepare_input(message, history)

    partial_response = ""

    for token_stream in client.text_generation(prompt=prompt, **client_kwargs):
        partial_response += token_stream
        yield partial_response

    global conv_id
    new_conversation = len(history) == 0
    if new_conversation:
        conv_id = str(uuid.uuid4())
    else:
        history.append((message, None))

    
    argilla_dataset.records.log(
        [
            {
                "instruction": create_chat_html(history) if history else message,
                "response": partial_response,
                "conv_id": conv_id,
                "turn": len(history)
            },
        ]
    )

The moment of truth has arrived! Our app is now able to be put to the test. To see it in motion, simply run python app.py in your local environment. But before you do, ensure you could have access to a deployed model at an inference endpoint. In this instance, we’re using the powerful Llama 3 70B model, but be at liberty to experiment with other models that fit your needs. By tweaking the model and fine-tuning the app, you’ll be able to unlock its full potential and explore latest possibilities in AI development.

Deploy the ChatBot app on Hugging Face Spaces

Now that our app is up and running, it is time to share it with the world! To deploy our app and make it accessible to others, we’ll follow the steps outlined in Gradio’s guide to sharing your app. Our chosen platform for hosting is Hugging Face Spaces, a improbable tool for showcasing AI-powered projects.

To start, we’ll have to add a requirements.txt file to our repository, which lists the dependencies required to run our app. This is a vital step in ensuring that our app might be easily reproduced and deployed. You’ll be able to learn more about managing dependencies in Hugging Face Spaces spaces dependencies.

Next, we’ll have to add our Hugging Face API token as a secret, following the instructions in this guide. This can allow our app to authenticate with the Hugging Face ecosystem.

Once we have uploaded our app.py file, our Space might be built, and we’ll have the option to access our app at the next link:

https://huggingface.co/spaces/plaguss/argilla-sdk-chatbot-space

Take a have a look at our example Space files here to see the way it all comes together. By following these steps, you may have the option to share your personal AI-powered app with the world and collaborate with others within the Hugging Face community.

Fooling around with our ChatBot

We are able to now put the Chatbot to the test. We have provided some default queries to get you began, but be at liberty to experiment together with your own questions. For example, you could possibly ask: What are the Settings in the brand new SDK?

As you’ll be able to see from the screenshot below, our chatbot is prepared to supply helpful responses to your queries:

But that is not all! You may also challenge our chatbot to generate settings for a particular dataset, just like the one we created earlier on this tutorial. For instance, you could possibly ask it to suggest settings for a dataset designed to fine-tune an embedding model, much like the one we explored within the An Argilla dataset with triplets to fine-tune an embedding model section.

Take a have a look at the screenshot below to see how our chatbot responds to the sort of query.

Go ahead, ask your questions, and see what insights our chatbot can provide!

Next steps

On this tutorial, we have successfully built a chatbot that may provide helpful responses to questions on the Argilla SDK and its applications. By leveraging the ability of Llama 3 70B and Gradio, we have created a user-friendly interface that may assist developers in understanding easy methods to work with datasets and fine-tune embedding models.

Nevertheless, our chatbot is just the start line, and there are lots of ways we will improve and expand its capabilities. Listed here are some possible next steps to tackle:

Improve the chunking strategy: Experiment with different chunking strategies, parameters, and sizes to optimize the chatbot’s performance and response quality.
Implement deduplication and filtering: Add deduplication and filtering mechanisms to the training dataset to remove duplicates and irrelevant information, ensuring that the chatbot provides accurate and concise responses.
Include sources for responses: Enhance the chatbot’s responses by including links to relevant documentation and sources, allowing users to dive deeper into the topics and explore further.

By addressing these areas, we will take our chatbot to the following level, making it an excellent more useful resource for developers working with the Argilla SDK. The chances are countless, and we’re excited to see where this project will go from here. Stay tuned for future updates and enhancements!

Source link

How we leveraged distilabel to create an Argilla 2.0 Chatbot

Table of Contents

Generating Synthetic Data for Advantageous-Tuning Custom Embedding Models

Downloading and chunking data

Generating synthetic data for our embedding model using distilabel

Explore the datasets in Argilla

An Argilla dataset with chunks of technical documentation

An Argilla dataset with triplets to fine-tune an embedding model

An Argilla dataset to trace the chatbot conversations

Advantageous-Tune the embedding model

Prepare the embedding dataset

Load the baseline model

Define the loss function

Define the training strategy

Train and save the ultimate model

The vector database

Connect with the database

Instantiate the fine-tuned model

Create the table with the documentation chunks

Populate the table

Store the database within the Hugging Face Hub

Creating our ChatBot

The Gradio App

Deploy the ChatBot app on Hugging Face Spaces

Fooling around with our ChatBot

Next steps

What are your thoughts on this topic?
Let us know in the comments below.

Share this article

Recent posts

Exciting Changes Are Coming to the TDS Creator Payment Program

I checked out considered one of the largest anti-AI protests ever

OpenAI steps into Anthropic’s Pentagon void

Zero-Waste Agentic RAG: Designing Caching Architectures to Minimize Latency and LLM Costs at Scale

Context Engineering as Your Competitive Edge

How we leveraged distilabel to create an Argilla 2.0 Chatbot

Table of Contents

Generating Synthetic Data for Advantageous-Tuning Custom Embedding Models

Downloading and chunking data

Generating synthetic data for our embedding model using distilabel

Explore the datasets in Argilla

An Argilla dataset with chunks of technical documentation

An Argilla dataset with triplets to fine-tune an embedding model

An Argilla dataset to trace the chatbot conversations

Advantageous-Tune the embedding model

Prepare the embedding dataset

Load the baseline model

Define the loss function

Define the training strategy

Train and save the ultimate model

The vector database

Connect with the database

Instantiate the fine-tuned model

Create the table with the documentation chunks

Populate the table

Store the database within the Hugging Face Hub

Creating our ChatBot

The Gradio App

Deploy the ChatBot app on Hugging Face Spaces

Fooling around with our ChatBot

Next steps

What are your thoughts on this topic? Let us know in the comments below.

Share this article

Recent posts

What are your thoughts on this topic?
Let us know in the comments below.