The Synthetic Data vault (SDV), a step-by-step guide to creating synthetic data

Data from the real world is usually expensive, messy and restricted by privacy laws. Synthetic data offers a solution—and it’s already widely used:

The AI generated text is used to train LLMs

Fraud systems simulate edge cases

Fake images are used to train vision models

SDV The open-source Python Library (Synthetic Data vault) generates tabular data with machine learning. The library creates synthetic data by learning patterns in real data.

We’ll generate synthetic data in this tutorial using SDV.

Install the SDV library first:

CSVHandler import sdv.io.local

CSVHandler()
FOLDER_NAME = '.' # The data must be in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

We then import the required module, and we connect to our local directory containing dataset files. The CSV files are read from the folder specified and stored as DataFrames in pandas. We access the dataset by using [‘data’].

Import Metadata from sdv.metadata
metadata = Metadata.load_from_json('metadata.json')

Importing the metadata is next. The metadata, which is stored as a JSON-formatted file, tells SDV what to do with your data. This metadata includes:

It is important to note that the word “you” means “people”. Table name
It is important to note that the word “you” means “you”. The primary key
It is important to note that the word “you” means “you”. Data type The columns (e.g. numerical, datetime and categorical)
Alternatives to the Standard Option Column formats Like datetime patterns and ID patterns
Table (for multi-table setups)

This is an example of a metadata.json file:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}

Import Metadata from sdv.metadata

metadata = Metadata.detect_from_dataframes(data)

We can also use SDV to infer metadata automatically. The results are not guaranteed to be complete or accurate, and you may need to update them if they differ.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

We can use SDV with the original dataset and metadata to create synthetic data. Models learn the patterns and structure of your dataset, and use that information to generate synthetic data.

You can set the number of rows that you want to display using num_rows argument.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)

SDV also offers tools for evaluating the quality of synthetic data, by comparing them to an original dataset. The best place to begin is by creating an The quality of the report

SDV has plotting tools built in that let you compare synthetic data to real data. You can import, for instance. get_column_plot You can also find out more about us on our website. sdv.evaluation.single_table To create comparison plots, you can select columns to compare.

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name="Sales",
    metadata=metadata
)
   
fig.show()

We can observe that the distribution of the ‘Sales’ column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons—such as visualizing the average monthly sales trends across both datasets.

import pandas as pd
Import matplotlib.pyplot into plt

# Make sure that the 'Date columns' are in datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format="%d-%m-%Y")
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format="%d-%m-%Y")

# Remove 'Month'from the year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

Calculate average sales by grouping by "Month"
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean()Renamed to 'Synthetic Sales Average'

# Combine the two dataframes into one DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label="Actual Average Sales", marker="o")
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label="Synthetic Average Sales", marker="o")

Plt.title ('Average Sales Comparison: Synthetic vs Actual')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
Plt.ylim (bottom=0). # The y-axis begins at 0.
plt.tight_layout()
plt.show()

Both datasets have very similar average monthly sale figures, and only minor differences.

We demonstrated in this tutorial how you can prepare your data for the generation of synthetic data using SDV. SDV creates high-quality, synthetic data by training its model with your data. This closely mimics real data patterns and distributions. In addition, we explored the evaluation and visualization of synthetic data to verify that key metrics such as monthly trends and sales distribution remain constant. Synthetic data is a great way to solve privacy and accessibility issues while also enabling powerful data analytics and machine learning workflows.

Take a look at the Notebook on GitHub. The researchers are the sole owners of all credit. Also, feel free to follow us on Twitter Join our Facebook group! 95k+ ML SubReddit Subscribe Now our Newsletter.

I’m a Civil Engineering graduate (2022) at Jamia Millia Islamia in New Delhi. I am very interested in Data Science and especially Neural networks and how they can be applied in different areas.

The Synthetic Data vault (SDV), a step-by-step guide to creating synthetic data

OpenAI’s GPT-5.4 Cyber: A Finely Tuned Model for Verified Security Defenders

Code Implementation for an AI-Powered Pipeline to Detect File Types and Perform Security Analysis with OpenAI and Magika

TabPFN’s superior accuracy on tabular data sets is achieved by leveraging in-context learning compared to Random Forest or CatBoost

Moonshot AI Researchers and Tsinghua Researchers propose PrfaaS, a cross-datacenter KVCache architecture that rethinks how LLMs can be served at scale.

Discovering the Exploration of “My First Robots” Kit: Empowering Next Generation of Engineers

Anthropic’s New Product Aims To Handle The Hard Part Of Building AI Agents

OpenAI supports a bill that would limit liability for AI-enabled mass deaths or financial disasters

Ed Zitron gets paid to love AI. Ed Zitron is also paid to hate AI

You want a different kind of work trip? Try a Robotics Hotel

Top Insights

An Implementation of a Complete Empirical Framework for Benchmarking Reasoning Methods in Fashionable Agentic AI Techniques

Astronomers are Using Artificial intelligence to Unlock Secrets about Black Holes

Latest News

AI CEOs think they can be everywhere at once

OpenAI’s GPT-5.4 Cyber: A Finely Tuned Model for Verified Security Defenders

The Synthetic Data vault (SDV), a step-by-step guide to creating synthetic data

Related Posts