Close Menu
  • AI
  • Content Creation
  • Tech
  • Robotics
AI-trends.todayAI-trends.today
  • AI
  • Content Creation
  • Tech
  • Robotics
Trending
  • AI CEOs think they can be everywhere at once
  • OpenAI’s GPT-5.4 Cyber: A Finely Tuned Model for Verified Security Defenders
  • Code Implementation for an AI-Powered Pipeline to Detect File Types and Perform Security Analysis with OpenAI and Magika
  • TabPFN’s superior accuracy on tabular data sets is achieved by leveraging in-context learning compared to Random Forest or CatBoost
  • Moonshot AI Researchers and Tsinghua Researchers propose PrfaaS, a cross-datacenter KVCache architecture that rethinks how LLMs can be served at scale.
  • OpenMythos – A PyTorch Open Source Reconstruction of Claude Mythos, where 770M Parameters match a 1.3B Transformator
  • This tutorial will show you how to run PrismML Bonsai 1Bit LLM using CUDA, Benchmarking and Chat with JSON, RAG, GGUF.All 128 weights have the same FP16 scaling factor. 1 bit (sign) + 16/128 bits (shared scale) = 1.125 bpw Compare Memory for Bonsai 1.7B:?It is 14.2 times smaller than Q1_0_g128!
  • NVIDIA Releases Ising – the First Open Quantum AI Model Family For Hybrid Quantum-Classical Systems
AI-trends.todayAI-trends.today
Home»Tech»The Synthetic Data vault (SDV), a step-by-step guide to creating synthetic data

The Synthetic Data vault (SDV), a step-by-step guide to creating synthetic data

Tech By admin26/05/20254 Mins Read
Facebook Twitter LinkedIn Email
Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data
Step-by-Step Guide to Creating Synthetic Data Using the Synthetic Data
Share
Facebook Twitter LinkedIn Email

Data from the real world is usually expensive, messy and restricted by privacy laws. Synthetic data offers a solution—and it’s already widely used:

  • The AI generated text is used to train LLMs
  • Fraud systems simulate edge cases
  • Fake images are used to train vision models

SDV The open-source Python Library (Synthetic Data vault) generates tabular data with machine learning. The library creates synthetic data by learning patterns in real data.

We’ll generate synthetic data in this tutorial using SDV.

Install the SDV library first:

CSVHandler import sdv.io.local

CSVHandler()
FOLDER_NAME = '.' # The data must be in the same directory

data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']

We then import the required module, and we connect to our local directory containing dataset files. The CSV files are read from the folder specified and stored as DataFrames in pandas. We access the dataset by using [‘data’].

Import Metadata from sdv.metadata
metadata = Metadata.load_from_json('metadata.json')

Importing the metadata is next. The metadata, which is stored as a JSON-formatted file, tells SDV what to do with your data. This metadata includes:

  • It is important to note that the word “you” means “people”. Table name
  • It is important to note that the word “you” means “you”. The primary key
  • It is important to note that the word “you” means “you”. Data type The columns (e.g. numerical, datetime and categorical)
  • Alternatives to the Standard Option Column formats Like datetime patterns and ID patterns
  • Table (for multi-table setups)

This is an example of a metadata.json file:

{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
Import Metadata from sdv.metadata

metadata = Metadata.detect_from_dataframes(data)

We can also use SDV to infer metadata automatically. The results are not guaranteed to be complete or accurate, and you may need to update them if they differ.

from sdv.single_table import GaussianCopulaSynthesizer

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)

We can use SDV with the original dataset and metadata to create synthetic data. Models learn the patterns and structure of your dataset, and use that information to generate synthetic data.

You can set the number of rows that you want to display using num_rows argument.

from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    salesDf,
    synthetic_data,
    metadata)

SDV also offers tools for evaluating the quality of synthetic data, by comparing them to an original dataset. The best place to begin is by creating an The quality of the report

SDV has plotting tools built in that let you compare synthetic data to real data. You can import, for instance. get_column_plot You can also find out more about us on our website. sdv.evaluation.single_table To create comparison plots, you can select columns to compare.

from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name="Sales",
    metadata=metadata
)
   
fig.show()

We can observe that the distribution of the ‘Sales’ column in the real and synthetic data is very similar. To explore further, we can use matplotlib to create more detailed comparisons—such as visualizing the average monthly sales trends across both datasets.

import pandas as pd
Import matplotlib.pyplot into plt

# Make sure that the 'Date columns' are in datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format="%d-%m-%Y")
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format="%d-%m-%Y")

# Remove 'Month'from the year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)

Calculate average sales by grouping by "Month"
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean()Renamed to 'Synthetic Sales Average'

# Combine the two dataframes into one DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label="Actual Average Sales", marker="o")
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label="Synthetic Average Sales", marker="o")

Plt.title ('Average Sales Comparison: Synthetic vs Actual')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
Plt.ylim (bottom=0). # The y-axis begins at 0.
plt.tight_layout()
plt.show()

Both datasets have very similar average monthly sale figures, and only minor differences.

We demonstrated in this tutorial how you can prepare your data for the generation of synthetic data using SDV. SDV creates high-quality, synthetic data by training its model with your data. This closely mimics real data patterns and distributions. In addition, we explored the evaluation and visualization of synthetic data to verify that key metrics such as monthly trends and sales distribution remain constant. Synthetic data is a great way to solve privacy and accessibility issues while also enabling powerful data analytics and machine learning workflows.


Take a look at the Notebook on GitHub. The researchers are the sole owners of all credit. Also, feel free to follow us on Twitter Join our Facebook group! 95k+ ML SubReddit Subscribe Now our Newsletter.


I’m a Civil Engineering graduate (2022) at Jamia Millia Islamia in New Delhi. I am very interested in Data Science and especially Neural networks and how they can be applied in different areas.

dat synthetic data
Share. Facebook Twitter LinkedIn Email
Avatar
admin
  • Website

Related Posts

OpenAI’s GPT-5.4 Cyber: A Finely Tuned Model for Verified Security Defenders

20/04/2026

Code Implementation for an AI-Powered Pipeline to Detect File Types and Perform Security Analysis with OpenAI and Magika

20/04/2026

TabPFN’s superior accuracy on tabular data sets is achieved by leveraging in-context learning compared to Random Forest or CatBoost

20/04/2026

Moonshot AI Researchers and Tsinghua Researchers propose PrfaaS, a cross-datacenter KVCache architecture that rethinks how LLMs can be served at scale.

20/04/2026
Top News

Discovering the Exploration of “My First Robots” Kit: Empowering Next Generation of Engineers

Anthropic’s New Product Aims To Handle The Hard Part Of Building AI Agents

OpenAI supports a bill that would limit liability for AI-enabled mass deaths or financial disasters

Ed Zitron gets paid to love AI. Ed Zitron is also paid to hate AI

You want a different kind of work trip? Try a Robotics Hotel

Load More
AI-Trends.Today

Your daily source of AI news and trends. Stay up to date with everything AI and automation!

X (Twitter) Instagram
Top Insights

An Implementation of a Complete Empirical Framework for Benchmarking Reasoning Methods in Fashionable Agentic AI Techniques

19/11/2025

Astronomers are Using Artificial intelligence to Unlock Secrets about Black Holes

11/06/2025
Latest News

AI CEOs think they can be everywhere at once

20/04/2026

OpenAI’s GPT-5.4 Cyber: A Finely Tuned Model for Verified Security Defenders

20/04/2026
X (Twitter) Instagram
  • Privacy Policy
  • Contact Us
  • Terms and Conditions
© 2026 AI-Trends.Today

Type above and press Enter to search. Press Esc to cancel.