Data from the real world is usually expensive, messy and restricted by privacy laws. Synthetic data offers a solution—and it’s already widely used:
- The AI generated text is used to train LLMs
- Fraud systems simulate edge cases
- Fake images are used to train vision models
SDV The open-source Python Library (Synthetic Data vault) generates tabular data with machine learning. The library creates synthetic data by learning patterns in real data.
We’ll generate synthetic data in this tutorial using SDV.
Install the SDV library first:
CSVHandler import sdv.io.local
CSVHandler()
FOLDER_NAME = '.' # The data must be in the same directory
data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']
We then import the required module, and we connect to our local directory containing dataset files. The CSV files are read from the folder specified and stored as DataFrames in pandas. We access the dataset by using [‘data’].
Import Metadata from sdv.metadata
metadata = Metadata.load_from_json('metadata.json')
Importing the metadata is next. The metadata, which is stored as a JSON-formatted file, tells SDV what to do with your data. This metadata includes:
- It is important to note that the word “you” means “people”. Table name
- It is important to note that the word “you” means “you”. The primary key
- It is important to note that the word “you” means “you”. Data type The columns (e.g. numerical, datetime and categorical)
- Alternatives to the Standard Option Column formats Like datetime patterns and ID patterns
- Table (for multi-table setups)
This is an example of a metadata.json file:
{
"METADATA_SPEC_VERSION": "V1",
"tables": {
"your_table_name": {
"primary_key": "your_primary_key_column",
"columns": {
"your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
"date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
"category_column": { "sdtype": "categorical" },
"numeric_column": { "sdtype": "numerical" }
},
"column_relationships": []
}
}
}
Import Metadata from sdv.metadata
metadata = Metadata.detect_from_dataframes(data)
We can also use SDV to infer metadata automatically. The results are not guaranteed to be complete or accurate, and you may need to update them if they differ.
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)
We can use SDV with the original dataset and metadata to create synthetic data. Models learn the patterns and structure of your dataset, and use that information to generate synthetic data.
You can set the number of rows that you want to display using num_rows argument.
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
salesDf,
synthetic_data,
metadata)
SDV also offers tools for evaluating the quality of synthetic data, by comparing them to an original dataset. The best place to begin is by creating an The quality of the report
SDV has plotting tools built in that let you compare synthetic data to real data. You can import, for instance. get_column_plot You can also find out more about us on our website. sdv.evaluation.single_table To create comparison plots, you can select columns to compare.
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
real_data=salesDf,
synthetic_data=synthetic_data,
column_name="Sales",
metadata=metadata
)
fig.show()

We can observe that the distribution of the ‘Sales’ column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons—such as visualizing the average monthly sales trends across both datasets.
import pandas as pd
Import matplotlib.pyplot into plt
# Make sure that the 'Date columns' are in datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format="%d-%m-%Y")
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format="%d-%m-%Y")
# Remove 'Month'from the year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)
Calculate average sales by grouping by "Month"
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean()Renamed to 'Synthetic Sales Average'
# Combine the two dataframes into one DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label="Actual Average Sales", marker="o")
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label="Synthetic Average Sales", marker="o")
Plt.title ('Average Sales Comparison: Synthetic vs Actual')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
Plt.ylim (bottom=0). # The y-axis begins at 0.
plt.tight_layout()
plt.show()

Both datasets have very similar average monthly sale figures, and only minor differences.
We demonstrated in this tutorial how you can prepare your data for the generation of synthetic data using SDV. SDV creates high-quality, synthetic data by training its model with your data. This closely mimics real data patterns and distributions. In addition, we explored the evaluation and visualization of synthetic data to verify that key metrics such as monthly trends and sales distribution remain constant. Synthetic data is a great way to solve privacy and accessibility issues while also enabling powerful data analytics and machine learning workflows.
Take a look at the Notebook on GitHub. The researchers are the sole owners of all credit. Also, feel free to follow us on Twitter Join our Facebook group! 95k+ ML SubReddit Subscribe Now our Newsletter.



