Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process that makes dataset creation a non-technical breeze, allowing anyone to create datasets and models in minutes without writing any code.
A brief demo video
What’s synthetic data and why is it useful?
Synthetic data is artificially generated information that mimics real-world data. It allows overcoming data limitations by expanding or enhancing datasets.
From Prompt to Dataset to Model
The Synthetic Data Generator takes a description of the data you want (your custom prompt) and returns a dataset for your use case, using a synthetic data pipeline. In the background, this is powered by distilabel and the free Hugging Face text-generation API, but we don’t have to worry about these complexities and can focus on using the UI.
Supported Tasks
The tool currently supports text classification and chat datasets. The task determines the kind of dataset you’ll generate: classification requires categories, while chat data requires a conversation. Based on demand, we will add tasks like evaluation and RAG over time.
Text Classification
Text classification is common for categorizing text like customer reviews, social media posts, or news articles. Generating a classification dataset relies on two different steps that we address with LLMs: we first generate diverse texts, and then we add labels to them. An example of a synthetic text classification dataset is argilla/synthetic-text-classification-news, which classifies synthetic news articles into 8 different classes.
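To make the two-step structure concrete, each record in such a dataset boils down to a text paired with a label. The snippet below is an illustrative sketch; the field names and label are assumptions for the example, not the exact schema of the dataset above.

```python
# Illustrative shape of a synthetic text classification record
# (field names and label are assumptions, not the exact Hub schema).
record = {
    "text": "Local council approves new cycling lanes to reduce downtown traffic.",
    "label": "politics",
}

# Step 1 produces diverse values for "text"; step 2 asks an LLM to pick a
# "label" from the fixed set of categories you defined for the task.
```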
Chat Datasets
This type of dataset can be used for supervised fine-tuning (SFT), the technique that enables LLMs to work with conversational data, allowing the user to interact with them via a chat interface. An example of a synthetic chat dataset is argilla/synthetic-sft-customer-support-single-turn, which shows an LLM designed to handle customer support. In this case, the customer support topic is the Synthetic Data Generator itself.
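For reference, chat records typically follow the familiar messages format used for SFT. The example below is an illustrative sketch, not the exact schema or contents of the dataset above.

```python
# Illustrative shape of a single-turn synthetic chat record in the common
# "messages" format used for SFT (contents are made up for the example).
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful customer support assistant for the Synthetic Data Generator."},
        {"role": "user", "content": "How do I push my generated dataset to the Hugging Face Hub?"},
        {"role": "assistant", "content": "In step 3, fill in the dataset name and organisation, then hit 'Generate'."},
    ]
}
```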
Generally, we can generate 50 and 20 samples per minute for text classification and chat, respectively. All of this is powered by the free Hugging Face API, but you can scale it up by using your own account and selecting custom models, API providers, or generation configurations. We will get back to this later, but let’s dive into the basics first.
Let’s generate our first dataset
We will create a basic chat dataset. When you visit the generator, you have to log in to grant the tool access to the organisations for which you want to generate datasets. This allows the tool to upload the generated datasets. In case of a failed authentication, you can always reset the connection.
After logging in, the UI guides you through a simple three-step process:
1. Describe Your Dataset
Start by providing a description of the dataset you want to create, including example use cases to help the generator understand your needs. Make sure to describe the goal and the kind of assistant in as much detail as possible. When you hit the “Create” button, a sample dataset will be created, and you can proceed with step 2.

2. Configure and Refine
Refine your generated sample dataset by adjusting the system prompt, which has been generated based on your description, and by adjusting the task-specific settings. This will help you get the specific results you are after. You can iterate on these configurations by hitting the “Save” button and regenerating your sample dataset. When you are satisfied with the configuration, proceed to step 3.

3. Generate and Push
Fill out general information about the dataset name and organisation. Additionally, you can define the number of samples to generate and the temperature to use for the generation. The temperature controls the creativity of the generations. Let’s hit the “Generate” button to start a full generation. The output will be saved directly to Argilla and the Hugging Face Hub.

We can now click the “Open in Argilla” button and dive straight into our generated dataset.
Reviewing the Dataset
Even when dealing with synthetic data, it is important to understand and look at your data, which is why we created a direct integration with Argilla, a collaboration tool for AI engineers and domain experts to build high-quality datasets. It lets you effectively explore and evaluate the synthetic dataset through powerful features like semantic search and composable filters. You can learn more about them in this guide. Afterwards, we can export the curated dataset to the Hugging Face Hub and proceed to fine-tune a model with it.

Training a Model
Don’t worry; even powerful AI models can be created without code nowadays using AutoTrain. To get familiar with AutoTrain, you can take a look at its documentation. Here, we will create our own AutoTrain deployment and log in as we did before for the Synthetic Data Generator.
Remember the argilla/synthetic-text-classification-news dataset from the start? Let’s train a model that can accurately classify these examples. We need to select the task “Text Classification” and provide the correct “Dataset source”. Then, choose a nice project name and press play! The pop-up that warns about costs can be ignored because we are still working on the free Hugging Face CPU hardware, which is good enough for this text classification example.

Et voilà, after a few minutes, we have our very own model! All that remains is to deploy it as a live service or to use it as a text-classification pipeline with some minimal Python code, as sketched below.
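A minimal sketch of the second option with the transformers library could look like this; the model id is a placeholder for the repo your AutoTrain run produced.

```python
# Minimal sketch: run the fine-tuned classifier locally with the transformers pipeline.
# The model id below is a placeholder; replace it with your own AutoTrain output repo.
from transformers import pipeline

classifier = pipeline("text-classification", model="my-user/my-synthetic-news-classifier")

print(classifier("The stock market rallied after the central bank's announcement."))
# -> e.g. [{'label': 'business', 'score': 0.97}]
```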
Advanced Features
Although you can go from prompts to dedicated models without knowing anything about coding, some people might like the option to customize and scale their deployment with some more advanced technical features.
Improving Speed and Accuracy
You can improve speed and accuracy by creating your own deployment of the tool and configuring it to use different parameters or models. First, you need to duplicate the Synthetic Data Generator. Make sure to create it as a private Space so that no one else can access it. Next, you can change the default values of some environment variables. Let’s go over some scenarios (a configuration sketch follows the list):
- Use a different free Hugging Face model. You can do so by changing `MODEL` from the default value of `meta-llama/Llama-3.1-8B-Instruct` to a different model, like `meta-llama/Llama-3.1-70B-Instruct`.
- Use an OpenAI model. You can do so by setting `BASE_URL` to `https://api.openai.com/v1/` and `MODEL` to `gpt-4o`.
- Increase the batch size, which will generate more samples per minute. You can do so by changing `BATCH_SIZE` from the default value of `5` to a higher value, like `10`. Keep in mind that your API providers might have limits on the number of requests per minute.
- Use a private Argilla instance. You can do so by setting `ARGILLA_URL` and `ARGILLA_API_KEY` to the URL and API key of your free Argilla instance.
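If you prefer to script this instead of clicking through the web UI, a sketch with huggingface_hub might look as follows; the target Space id and the values are assumptions, and the source Space id simply follows the project name.

```python
# Sketch: duplicate the Space and override its configuration with huggingface_hub.
# Target Space id and values are illustrative; adjust them to your own setup.
from huggingface_hub import HfApi, duplicate_space

repo = duplicate_space(
    from_id="argilla/synthetic-data-generator",
    to_id="my-user/my-synthetic-data-generator",
    private=True,  # keep the duplicated Space private
)

api = HfApi()
# Public configuration goes into Space variables...
api.add_space_variable(repo_id=repo.repo_id, key="MODEL", value="meta-llama/Llama-3.1-70B-Instruct")
api.add_space_variable(repo_id=repo.repo_id, key="BATCH_SIZE", value="10")
# ...while credentials belong in Space secrets.
api.add_space_secret(repo_id=repo.repo_id, key="ARGILLA_API_KEY", value="<your-argilla-api-key>")
```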
Local Deployment
Besides hosting the tool on Hugging Face Spaces, we also offer it as an open-source tool under an Apache 2 license, which means you can go to GitHub and use, modify, and adapt it however you like. You can install it as a Python package with a simple `pip install synthetic-dataset-generator`. Make sure to configure the appropriate environment variables when launching the app.
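As a minimal sketch, assuming the package exposes a `launch()` entry point as described in its README, a local run with custom environment variables could look like this (the variable values are illustrative):

```python
# Minimal local-deployment sketch (assumes the synthetic-dataset-generator
# package exposes a launch() entry point; values below are illustrative).
import os

# Configure the app before launching it.
os.environ["HF_TOKEN"] = "hf_..."                          # your Hugging Face token
os.environ["MODEL"] = "meta-llama/Llama-3.1-70B-Instruct"  # generation model
os.environ["BATCH_SIZE"] = "10"                            # more samples per minute

from synthetic_dataset_generator import launch

launch()  # starts the UI locally
```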
Customising Pipelines
Each synthetic data pipeline is based on distilabel, the framework for synthetic data and AI feedback. distilabel is open source, and the cool thing about the pipeline code is that it is shareable and reproducible. You can, for instance, find the pipeline for the argilla/synthetic-text-classification-news dataset within its repository on the Hub. Alternatively, you can find many other distilabel datasets along with their pipelines.
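To give a flavour of what such a pipeline looks like in code, here is a minimal, illustrative sketch using a distilabel 1.x-style API; the seed instruction, model, and repo id are assumptions, not the actual pipeline behind the dataset above.

```python
# Minimal distilabel pipeline sketch (distilabel 1.x-style API; the seed data,
# model, and repo id are illustrative, not the actual published pipeline).
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="synthetic-news-sketch") as pipeline:
    # Seed instructions describing the texts we want the LLM to produce.
    load = LoadDataFromDicts(
        data=[{"instruction": "Write a short news article about a sports event."}]
    )
    # Generate synthetic texts with a Hugging Face Inference model.
    generate = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Llama-3.1-8B-Instruct")
    )
    load >> generate

if __name__ == "__main__":
    distiset = pipeline.run(use_cache=False)
    # Push the generated dataset to the Hub (repo id is illustrative).
    distiset.push_to_hub("my-org/synthetic-news-sketch")
```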
What’s Next?
The Synthetic Data Generator already offers many cool features that make it useful for any data or model lover. Still, we have some interesting directions for improvement on our GitHub, and we invite you to contribute, leave a star, and open issues too! Some things we’re working on are:
- Retrieval Augmented Generation (RAG)
- Custom evals with LLMs as a Judge
