What’s synthetic data?
Data created by a computer, intended to replicate or augment existing data.
Why is it useful?
We’ve all experienced the success of ChatGPT, Llama, and more recently, DeepSeek. These language models are being used ubiquitously across society and have triggered many claims that we’re rapidly approaching Artificial General Intelligence: AI capable of replicating any human function.
Before getting too excited, or scared, depending on your perspective, we’re also rapidly approaching a hurdle to the advancement of these language models. According to a paper published by a group of researchers from the research institute Epoch [1], by 2028 we may have reached the upper limit of possible data upon which to train language models.
What happens if we run out of data?
Well, if we run out of data then we won’t have anything new with which to train our language models, and these models will stop improving. If we want to pursue Artificial General Intelligence then we will have to come up with new ways of improving AI without simply increasing the amount of real-world training data.
One potential saviour is synthetic data, which can be generated to mimic existing data and has already been used to improve the performance of models like Gemini and DBRX.
Synthetic data beyond LLMs
Beyond overcoming data scarcity for large language models, synthetic data can be used in the following situations:
- Sensitive data — if we don’t want to share or use sensitive attributes, synthetic data can be generated which mimics the properties of these features while maintaining anonymity.
- Expensive data — if collecting data is expensive, we can generate a large volume of synthetic data from a small amount of real-world data.
- Lack of data — datasets are biased when there is a disproportionately low number of individual data points from a particular group. Synthetic data can be used to balance a dataset.
Imbalanced datasets
Imbalanced datasets can (*but not always*) be problematic as they may not contain enough information to effectively train a predictive model. For example, if a dataset contains many more men than women, our model may be biased towards recognising men and misclassify future female samples as men.
In this article we show the imbalance in the popular UCI Adult dataset [2], and how we can use a variational autoencoder to generate synthetic data to improve classification in this case.
We first download the Adult dataset. This dataset contains features such as age, education and occupation which can be used to predict the target outcome ‘income’.
# Imports for data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Download dataset into a dataframe
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain",
    "capital-loss", "hours-per-week", "native-country", "income"
]
data = pd.read_csv(url, header=None, names=columns, na_values=" ?", skipinitialspace=True)
# Drop rows with missing values
data = data.dropna()
# Split into features and target
X = data.drop(columns=["income"])
y = data['income'].map({'>50K': 1, '<=50K': 0}).values
# Plot distribution of income
plt.figure(figsize=(8, 6))
plt.hist(data['income'], bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()
In the Adult dataset, income is a binary variable representing individuals who earn above, and below, $50,000. We plot the distribution of income over the entire dataset below. We can see that the dataset is heavily imbalanced, with a far larger number of individuals who earn less than $50,000.

Despite this imbalance, we can still train a machine learning classifier on the Adult dataset which we can use to determine whether unseen, or test, individuals should be classified as earning above, or below, 50k.
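Before doing so, it is worth quantifying the imbalance exactly. A minimal check, reusing the data dataframe loaded above:
# Print the raw class counts and the imbalance ratio
counts = data['income'].value_counts()
print(counts)
print(f"Ratio of <=50K to >50K: {counts['<=50K'] / counts['>50K']:.1f} to 1")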
# Imports for preprocessing, modelling and evaluation
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
# Preprocessing: One-hot encode categorical features, scale numerical features
numerical_features = ["age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"]
categorical_features = [
    "workclass", "education", "marital-status", "occupation", "relationship",
    "race", "sex", "native-country"
]
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numerical_features),
        ("cat", OneHotEncoder(), categorical_features)
    ]
)
X_processed = preprocessor.fit_transform(X)
# Convert to numpy array for PyTorch compatibility
X_processed = X_processed.toarray().astype(np.float32)
y_processed = y.astype(np.float32)
# Split dataset into train and test sets
X_model_train, X_model_test, y_model_train, y_model_test = train_test_split(X_processed, y_processed, test_size=0.2, random_state=42)
# Train a random forest classifier on the imbalanced training data
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_model_train, y_model_train)
# Make predictions
y_pred = rf_classifier.predict(X_model_test)
# Compute and display the confusion matrix
cm = confusion_matrix(y_model_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Printing out the confusion matrix of our classifier shows that our model performs fairly well despite the imbalance. Our model has an overall error rate of 16%, but the error rate for the positive class (income > 50k) is 36%, whereas the error rate for the negative class (income < 50k) is 8%.
This discrepancy shows that the model is indeed biased towards the negative class. The model frequently misclassifies individuals who earn more than 50k as earning less than 50k.
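For reference, these error rates can be read straight off the confusion matrix computed above; a small sketch:
# Per-class error rates from the confusion matrix (rows = actual, columns = predicted)
tn, fp, fn, tp = cm.ravel()
overall_error = (fp + fn) / cm.sum()
positive_error = fn / (tp + fn)  # >50k individuals misclassified as <=50k
negative_error = fp / (tn + fp)  # <=50k individuals misclassified as >50k
print(f"Overall error: {overall_error:.2%}, positive class: {positive_error:.2%}, negative class: {negative_error:.2%}")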
Below we show how we can use a Variational Autoencoder to generate synthetic data for the positive class to balance this dataset. We then train the same model using the synthetically balanced dataset and reduce model errors on the test set.

How can we generate synthetic data?
There are a number of different methods for generating synthetic data. These include more traditional methods such as SMOTE and Gaussian noise, which generate new data by modifying existing data. Alternatively, generative models such as Variational Autoencoders or Generative Adversarial Networks are well suited to generating new data, as their architectures learn the distribution of real data and use it to generate synthetic samples.
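For context, traditional oversampling is often only a few lines of code. The sketch below assumes the imbalanced-learn package is installed; it is not used elsewhere in this tutorial:
# Illustrative only: oversample the minority class with SMOTE from imbalanced-learn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_model_train, y_model_train)
print("Class counts after SMOTE:", np.bincount(y_resampled.astype(int)))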
In this tutorial we use a variational autoencoder to generate synthetic data.
Variational Autoencoders
Variational Autoencoders (VAEs) are great for synthetic data generation because they use real data to learn a continuous latent space. We can view this latent space as a magic bucket from which we can sample synthetic data that closely resembles existing data. The continuity of this space is one of their big selling points, as it means the model generalises well and doesn’t just memorise the latent representations of specific inputs.
A VAE consists of an encoder, which maps input data into a probability distribution (mean and variance), and a decoder, which reconstructs the data from the latent space.
To obtain that continuous latent space, VAEs use the reparameterization trick, where a random noise vector is scaled and shifted using the learned mean and variance (z = μ + σ·ε, with σ derived from the variance and ε drawn from a standard normal), ensuring smooth and continuous representations in the latent space.
Below we construct a BasicVAE class which implements this process with a simple architecture.
- The encoder compresses the input into a smaller, hidden representation, producing both a mean and log variance that define a Gaussian distribution, i.e. creating our magic sampling bucket. Instead of sampling directly, the model applies the reparameterization trick to generate latent variables, which are then passed to the decoder.
- The decoder reconstructs the original data from these latent variables, ensuring the generated data maintains characteristics of the original dataset.
# PyTorch imports for building and training the VAE
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class BasicVAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super(BasicVAE, self).__init__()
        # Encoder: single small layer
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(8, latent_dim)
        self.fc_logvar = nn.Linear(8, latent_dim)
        # Decoder: single small layer
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 8),
            nn.ReLU(),
            nn.Linear(8, input_dim),
            nn.Sigmoid()  # Outputs values in range [0, 1]
        )

    def encode(self, x):
        # Map the input to the parameters of a Gaussian in latent space
        h = self.encoder(x)
        mu = self.fc_mu(h)
        logvar = self.fc_logvar(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * epsilon
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
Given our BasicVAE architecture, we construct our loss function and training loop below.
def vae_loss(recon_x, x, mu, logvar, tau=0.5, c=1.0):
    # Reconstruction loss: how closely the output matches the input
    recon_loss = nn.MSELoss()(recon_x, x)
    # KL Divergence loss: keeps the latent distribution close to a standard normal
    kld_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kld_loss / x.size(0)

def train_vae(model, data_loader, epochs, learning_rate):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    model.train()
    losses = []
    reconstruction_mse = []
    for epoch in range(epochs):
        total_loss = 0
        total_mse = 0
        for batch in data_loader:
            batch_data = batch[0]
            optimizer.zero_grad()
            reconstructed, mu, logvar = model(batch_data)
            loss = vae_loss(reconstructed, batch_data, mu, logvar)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            # Compute batch-wise MSE for comparison
            mse = nn.MSELoss()(reconstructed, batch_data).item()
            total_mse += mse
        losses.append(total_loss / len(data_loader))
        reconstruction_mse.append(total_mse / len(data_loader))
        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss:.4f}, MSE: {total_mse:.4f}")
    return losses, reconstruction_mse

# Combine features and income label so the VAE learns their joint distribution
combined_data = np.concatenate([X_model_train.copy(), y_model_train.copy().reshape(-1, 1)], axis=1)
# Train-test split
X_train, X_test = train_test_split(combined_data, test_size=0.2, random_state=42)
batch_size = 128
# Create DataLoaders
train_loader = DataLoader(TensorDataset(torch.tensor(X_train)), batch_size=batch_size, shuffle=True)
test_loader = DataLoader(TensorDataset(torch.tensor(X_test)), batch_size=batch_size, shuffle=False)
basic_vae = BasicVAE(input_dim=X_train.shape[1], latent_dim=8)
basic_losses, basic_mse = train_vae(
    basic_vae, train_loader, epochs=50, learning_rate=0.001,
)
# Visualize results
plt.figure(figsize=(12, 6))
plt.plot(basic_mse, label="Basic VAE")
plt.ylabel("Reconstruction MSE")
plt.title("Training Reconstruction MSE")
plt.legend()
plt.show()
vae_loss consists of two components: a reconstruction loss, which measures how well the generated data matches the original input using Mean Squared Error (MSE), and a KL divergence loss, which ensures that the learned latent space follows a normal distribution (computed in closed form as -0.5 * Σ(1 + log σ² - μ² - σ²), matching the code above).
train_vae optimises the VAE using the Adam optimizer over multiple epochs. During training, the model takes mini-batches of data, reconstructs them, and computes the loss using vae_loss. These errors are then corrected via backpropagation, where the model weights are updated. We train the model for 50 epochs and plot how the reconstruction mean squared error decreases over training.
We can see that our model quickly learns how to reconstruct our data, evidencing efficient learning.
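Note that the test_loader created earlier is not used during training; as a quick sanity check we can also measure reconstruction error on that held-out split. A small sketch:
# Measure reconstruction MSE on the held-out VAE test split
basic_vae.eval()
test_mse = 0.0
with torch.no_grad():
    for batch in test_loader:
        batch_data = batch[0]
        reconstructed, mu, logvar = basic_vae(batch_data)
        test_mse += nn.MSELoss()(reconstructed, batch_data).item()
print(f"Held-out reconstruction MSE: {test_mse / len(test_loader):.4f}")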

Now that we have trained our BasicVAE to accurately reconstruct the Adult dataset, we can use it to generate synthetic data. We want to generate more samples of the positive class (individuals who earn over 50k) in order to balance the classes and remove the bias from our model.
To do this, we select all the samples from our VAE dataset where income is the positive class (earning more than 50k). We then encode these samples into the latent space. As we have only chosen samples of the positive class to encode, this latent space will reflect properties of the positive class, which we can sample from to create synthetic data.
We sample 15000 new points from this latent space and decode these latent vectors back into the input data space as our synthetic data points.
# Build a dataframe of the VAE training data (features plus the income label as the final column)
sample_df = pd.DataFrame(X_train)
# Create column names
col_number = sample_df.shape[1]
col_names = [str(i) for i in range(col_number)]
sample_df.columns = col_names
# Define the feature value to filter on
feature_value = 1.0  # The final column is income, and 1 means over 50k
# Select only the positive-class samples (income over 50k)
selected_samples = sample_df[sample_df[col_names[-1]] == feature_value]
selected_samples = selected_samples.values
selected_samples_tensor = torch.tensor(selected_samples, dtype=torch.float32)
basic_vae.eval()  # Set model to evaluation mode
with torch.no_grad():
    mu, logvar = basic_vae.encode(selected_samples_tensor)
    latent_vectors = basic_vae.reparameterize(mu, logvar)
# Compute the mean latent vector for the positive class
mean_latent_vector = latent_vectors.mean(dim=0)
num_samples = 15000  # Number of new samples
latent_dim = 8
# Sample around the mean latent vector with a small amount of Gaussian noise, then decode
latent_samples = mean_latent_vector + 0.1 * torch.randn(num_samples, latent_dim)
with torch.no_grad():
    generated_samples = basic_vae.decode(latent_samples)
Now that we have generated synthetic data for the positive class, we can combine it with the original training data to create a balanced synthetic dataset.
# Convert the generated tensor to a dataframe
new_data = pd.DataFrame(generated_samples.numpy())
# Create column names
col_number = new_data.shape[1]
col_names = [str(i) for i in range(col_number)]
new_data.columns = col_names
# Separate the synthetic features from the synthetic income label (final column)
X_synthetic = new_data.drop(col_names[-1], axis=1)
y_synthetic = np.asarray([1 for _ in range(0, X_synthetic.shape[0])])
# Combine the original training data with the synthetic positive samples
X_synthetic_train = np.concatenate([X_model_train, X_synthetic.values], axis=0)
y_synthetic_train = np.concatenate([y_model_train, y_synthetic], axis=0)
mapping = {1: '>50K', 0: '<=50K'}
map_function = np.vectorize(lambda x: mapping[x])
# Apply mapping
y_mapped = map_function(y_synthetic_train)
# Plot the class distribution of the balanced training set
plt.figure(figsize=(8, 6))
plt.hist(y_mapped, bins=2, edgecolor="black")
plt.title('Distribution of Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

We can now use our balanced synthetic training dataset to retrain our random forest classifier. We can then evaluate this new model on the original test data to see how effective our synthetic data is at reducing model bias.
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_synthetic_train, y_synthetic_train)
# Make predictions on the original test set
y_pred = rf_classifier.predict(X_model_test)
cm = confusion_matrix(y_model_test, y_pred)
# Create heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="YlGnBu", xticklabels=["Negative", "Positive"], yticklabels=["Negative", "Positive"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Our new classifier, trained on the balanced synthetic dataset, makes fewer errors on the original test set than our original classifier trained on the imbalanced dataset; the overall error rate is now reduced to 14%.

However, we have not been able to reduce the discrepancy in errors by a significant amount: our error rate for the positive class is still 36%. This could be due to the following reasons:
- We have discussed how one of the benefits of VAEs is learning a continuous latent space. However, if the majority class dominates, the latent space may skew towards the majority class.
- The model may not have properly learned a distinct representation for the minority class due to the lack of data, making it hard to sample accurately from that region.
In this tutorial we have introduced and built a BasicVAE architecture which can be used to generate synthetic data that improves classification accuracy on an imbalanced dataset.
Follow for future articles where I’ll show how we can construct more sophisticated VAE architectures that address the above problems with imbalanced sampling, and more.
[1] Villalobos, P., Ho, A., Sevilla, J., Besiroglu, T., Heim, L., & Hobbhahn, M. (2024). Will we run out of data? Limits of LLM scaling based on human-generated data.
[2] Becker, B. & Kohavi, R. (1996). Adult [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.