Simulation and generation of synthetic data

1) Definitions and objectives

Synthetic data - artificially generated datasets that preserve the statistical and/or causal properties of the original data without disclosing specific records.
Simulation - modeling processes/environments with formal rules (stochastic, discrete-event, agent-based, causal) to obtain data and what-if scenarios.

What for:

Privacy and compliance: fewer PII/PHI/PCI risks.
Covering rare events, "tails" of distributions, stress tests.
R&D acceleration: sandboxes for Dev/QA/ML without access to production data.
Experimentation and model training where collecting real data is expensive or impossible.

2) When to use and when not

Suitable: cold start, data shortage, high privacy risks, expensive A/B, simulation of policies/prices/loads, pipeline testing.
Caution/not suitable: regulatory reporting, forensic audits, and rare domain artifacts where local patterns are critical and easily distorted.

3) Taxonomy of generation methods

3.1 Statistical and classical: bootstrapping, permutations, empirical distributions, copula approaches (Gaussian/Vine/Archimedean) to preserve correlations.
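
As an illustration of the copula approach, here is a minimal two-column Gaussian copula sketch using only the Python standard library. The function names are hypothetical, and the marginals are resampled from the observed values through the empirical quantile function, so this version can emit real values verbatim; a privacy-sensitive setup would likely need smoothed or parametric marginals instead.

```python
import random
from statistics import NormalDist

def ranks_to_normal_scores(values):
    """Map each value to a standard-normal score via its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    nd = NormalDist()
    for rank, idx in enumerate(order, start=1):
        scores[idx] = nd.inv_cdf(rank / (n + 1))  # stays strictly inside (0, 1)
    return scores

def pearson(a, b):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def empirical_quantile(sorted_values, u):
    """Inverse empirical CDF: the order statistic at probability u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def fit_sample_copula(x, y, n_samples, rng):
    """Fit a Gaussian copula on (x, y) and draw n_samples synthetic pairs."""
    rho = pearson(ranks_to_normal_scores(x), ranks_to_normal_scores(y))
    residual = max(0.0, 1.0 - rho ** 2) ** 0.5
    xs, ys = sorted(x), sorted(y)
    nd = NormalDist()
    pairs = []
    for _ in range(n_samples):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + residual * rng.gauss(0, 1)  # correlated normal pair
        pairs.append((empirical_quantile(xs, nd.cdf(z1)),
                      empirical_quantile(ys, nd.cdf(z2))))
    return pairs
```

The dependence structure lives entirely in `rho`, while each column keeps its own marginal shape; extending this to more columns would require a full correlation matrix and a Cholesky factor.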

3.2 Generative models (ML):

GAN/CTGAN/TVAE for tabular data;
VAE/Normalizing Flows for continuous spaces;
Diffusion models for images/audio/time series;
LLM approaches for texts/dialogues (with guardrails and filters).
3.3 Causal simulators: structural causal models (SCM), causal graphs, do(X) interventions.
3.4 Discrete-event/regular/Monte Carlo: process modeling (logistics, call centers, exchanges, M/M/1 and M/G/k queues).
3.5 Agent-based: populations of agents with behavior rules (markets, games, user trajectories).
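
To make the discrete-event item concrete, the sketch below simulates an M/M/1 queue (Poisson arrivals, exponential service, one FIFO server) and emits one synthetic record per customer. The function name and record fields are illustrative assumptions, not part of any particular framework.

```python
import random

def simulate_mm1(arrival_rate, service_rate, n_customers, seed=0):
    """Simulate a single-server FIFO queue and return per-customer records."""
    rng = random.Random(seed)
    t_arrival = 0.0
    server_free_at = 0.0
    records = []
    for customer_id in range(n_customers):
        t_arrival += rng.expovariate(arrival_rate)   # exponential inter-arrival time
        start = max(t_arrival, server_free_at)       # wait if the server is busy
        service = rng.expovariate(service_rate)      # exponential service time
        server_free_at = start + service
        records.append({
            "id": customer_id,
            "arrival": t_arrival,
            "wait": start - t_arrival,
            "service": service,
        })
    return records
```

For a stable queue (arrival rate below service rate) the mean waiting time can be checked against the closed-form value λ/(μ(μ-λ)), which gives a cheap sanity test for the simulator before it is used for what-if scenarios.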

4) Types of data and specifics

Tabular: categories/numbers/dates; marginal distributions, dependencies, rare values are important.
Time series: trends/seasonality/noise, lag correlation, events and regimes; regime generation (HMM/HSMM), diffusion models by segment.
Graphs and networks: degree distributions, clusters/communities, motifs; Erdős-Rényi, Barabási-Albert, graph GAN/VAE models.
Text/log data: synthetic user requests and tickets; de-identification and toxicity/leakage control are required.
Images/audio: domain conditions (resolution, noise), class balance.
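
For the time-series case, HMM-style regime switching can be sketched as a two-state Markov chain that selects the noise level and baseline, plus a daily seasonal term. All regime names, means, and transition probabilities below are hypothetical values chosen for illustration.

```python
import math
import random

def generate_regime_series(n_steps, seed=0):
    """Generate (t, regime, value) tuples from a two-regime Markov chain."""
    rng = random.Random(seed)
    # Hypothetical regimes: (mean, std) of the value, and probability of staying.
    params = {"calm": (100.0, 2.0), "burst": (160.0, 15.0)}
    stay = {"calm": 0.98, "burst": 0.90}
    regime = "calm"
    series = []
    for t in range(n_steps):
        if rng.random() > stay[regime]:              # leave the current regime
            regime = "burst" if regime == "calm" else "calm"
        mean, std = params[regime]
        seasonal = 10.0 * math.sin(2 * math.pi * t / 24)  # 24-step seasonality
        series.append((t, regime, mean + seasonal + rng.gauss(0, std)))
    return series
```

Keeping the regime label in the output makes it possible to validate regime durations and per-regime statistics separately from the overall distribution.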

5) Privacy and protection

Risk metrics: probability of record linkage/re-identification, resistance to membership inference, protection against attribute inference.
Differential privacy (DP): DP-SGD, PATE, post-processing with ε-budget; privacy report (ε, δ, sensitivity).
PII redaction: tokenization/masking before training; blocklists/filters in LLM generation.
Policies and logs: who trained which synthetic model on which data; retention periods.
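
DP-SGD and PATE need a training framework, but the ε-budget idea can be shown with a much simpler baseline: a Laplace-noised histogram that is then sampled. This sketch assumes the category list is fixed and public (not derived from the data) and that neighboring datasets differ by adding or removing one record, so the L1 sensitivity of the counts is 1. The floating-point Laplace sampler here is not hardened; a production system would more likely rely on a vetted DP library.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_histogram(values, categories, epsilon, rng):
    """Release noisy counts for a fixed category list under budget epsilon."""
    counts = Counter(values)
    scale = 1.0 / epsilon  # sensitivity 1 divided by epsilon
    # Clamping at zero is post-processing and does not consume budget.
    return {c: max(0.0, counts.get(c, 0) + laplace_noise(scale, rng))
            for c in categories}

def sample_from_histogram(noisy_counts, n, rng):
    """Draw n synthetic category values proportional to the noisy counts."""
    cats = list(noisy_counts)
    weights = [noisy_counts[c] for c in cats]
    return rng.choices(cats, weights=weights, k=n)
```

Because sampling only touches the noisy counts, any number of synthetic rows can be drawn without spending more than the single ε used for the release.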

6) Quality and utility of synthetics

Metrics:

Statistical proximity: KS/χ²/Wasserstein distance (WD), PSI, coverage of categories/rare values.
Dependencies and relationships: correlations/mutual information (MI), copula distance.
Utility test: train the model on synthetics and test on real data (Train on Synthetic, Test on Real, TSTR), and vice versa (TRTS).
Downstream stability: stability of business metrics/feature importance.
Fairness and biases: parity metrics, before/after bias comparison.
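
Two of the statistical proximity metrics, the two-sample KS statistic and PSI, are small enough to sketch directly. The binning scheme (equal-width bins over the range of the real data, with a small floor on empty bins) is one common choice and is an assumption here, not a standard.

```python
import math
from bisect import bisect_right

def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    r, s = sorted(real), sorted(synth)
    points = sorted(set(r) | set(s))
    return max(abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
               for x in points)

def psi(real, synth, n_bins=10, eps=1e-6):
    """Population Stability Index over equal-width bins spanning the real data."""
    lo, hi = min(real), max(real)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column

    def shares(data):
        counts = [0] * n_bins
        for v in data:
            # Values outside the real range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        return [max(c / len(data), eps) for c in counts]

    p, q = shares(real), shares(synth)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Both functions return 0 for identical samples and grow as the synthetic distribution drifts away from the real one, so they can be wired into acceptance thresholds per column.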

Calibration: tune generation hyperparameters until the utility/privacy thresholds are met.
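
A TSTR check can be sketched end to end on toy data: fit a generator on real training data, train a classifier on the synthetic output, and score it on a real holdout, with train-on-real as the baseline. Everything here is a stand-in chosen to keep the example dependency-free: the "real" data are two Gaussian blobs, the generator is a per-class Gaussian fit, and the classifier is nearest-centroid.

```python
import random
from statistics import mean, stdev

def make_real(n, rng):
    """Toy 'real' data: two 2-D Gaussian classes centered at (0,0) and (2,2)."""
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        center = 2.0 if label else 0.0
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows

def toy_generator(train, n, rng):
    """Stand-in synthetic generator: independent per-class Gaussians."""
    by_label = {}
    for x, label in train:
        by_label.setdefault(label, []).append(x)
    labels = list(by_label)
    weights = [len(by_label[l]) for l in labels]
    stats = {l: [(mean([p[d] for p in pts]), stdev([p[d] for p in pts]))
                 for d in range(2)]
             for l, pts in by_label.items()}
    synth = []
    for _ in range(n):
        l = rng.choices(labels, weights=weights)[0]
        synth.append((tuple(rng.gauss(m, s) for m, s in stats[l]), l))
    return synth

def fit_centroids(rows):
    """Nearest-centroid 'model': one mean point per class."""
    return {l: tuple(mean([x[d] for x, lab in rows if lab == l]) for d in range(2))
            for l in {lab for _, lab in rows}}

def accuracy(centroids, rows):
    def predict(x):
        return min(centroids,
                   key=lambda l: sum((a - b) ** 2 for a, b in zip(x, centroids[l])))
    return sum(predict(x) == l for x, l in rows) / len(rows)

def tstr_report(seed=0):
    rng = random.Random(seed)
    train, test = make_real(1000, rng), make_real(1000, rng)
    synth = toy_generator(train, 1000, rng)
    return {"tstr": accuracy(fit_centroids(synth), test),
            "trtr": accuracy(fit_centroids(train), test)}
```

The gap between `tstr` and `trtr` is the quantity to threshold during calibration; a large gap suggests the generator lost signal that the downstream task depends on.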

7) Domain constraints and rules

Hard business invariants: amounts ≥ 0, balance preservation, ID uniqueness, referential integrity.
Geo/time: valid calendar patterns, time zones, holidays.
Causal relationships: preservation of do-relationships in interventions.
Constraint-aware generation: post-filters, rejection sampling, differentiable constraints.
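
Rejection sampling is the simplest of these constraint-aware options: draw from an unconstrained generator and keep only rows that satisfy the invariants. The payment schema, the proposal distributions, and the function names below are hypothetical.

```python
import random

def propose_payment(rng):
    """Hypothetical unconstrained generator; may emit invalid rows."""
    amount = round(rng.gauss(50, 40), 2)
    fee = round(rng.gauss(1.5, 1.0), 2)
    return {"amount": amount, "fee": fee, "net": round(amount - fee, 2)}

def is_valid(row):
    """Hard invariants: non-negative amount, fee between 0 and the amount."""
    return row["amount"] >= 0 and 0 <= row["fee"] <= row["amount"]

def generate_valid(n, seed=0, max_tries=100_000):
    """Return n valid rows with unique ids, plus the observed rejection rate."""
    rng = random.Random(seed)
    rows, rejected = [], 0
    for _ in range(max_tries):
        if len(rows) == n:
            break
        row = propose_payment(rng)
        if is_valid(row):
            row["id"] = len(rows)  # sequential ids are unique by construction
            rows.append(row)
        else:
            rejected += 1
    if len(rows) < n:
        raise RuntimeError("too many rejections; revisit the generator")
    return rows, rejected / (len(rows) + rejected)
```

The rejection rate is worth logging: a high rate means the filter is reshaping the distribution substantially, which is usually a signal to move the constraint into the generator itself.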

8) What-if scenarios and stress tests

Monte Carlo: distribution of KPI outcomes with varying inputs.
Causal interventions: price/limit/rule change and uplift/risk assessment.
Load simulations: traffic profiles, bursts, pipeline fault tolerance.
Rare events: fraud, DDoS, "black swans" (oversampling tails).
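
A causal intervention on price can be sketched with a tiny structural causal model in which season drives both price and demand. Setting `do_price` cuts the season-to-price edge, which is what distinguishes do(price) from simply filtering observed rows by price. The coefficients and variable names are invented for illustration.

```python
import random

def simulate(n, seed=0, do_price=None):
    """Sample n rows from a toy SCM; do_price forces price regardless of season."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        season = rng.random() < 0.3                  # exogenous: high season
        if do_price is None:
            # Observational regime: price depends on season.
            price = (12.0 if season else 10.0) + rng.gauss(0, 0.5)
        else:
            price = do_price                         # intervention do(price)
        demand = max(0.0, 200 - 10 * price + (40 if season else 0) + rng.gauss(0, 5))
        rows.append({"season": season, "price": price,
                     "demand": demand, "revenue": price * demand})
    return rows

def mean_of(rows, key):
    return sum(r[key] for r in rows) / len(rows)
```

Comparing `mean_of(simulate(n, do_price=10.0), "demand")` with the same call at `do_price=12.0` estimates the causal effect of the price change, which in this toy model is 10 units of demand per unit of price. In the observational data, higher prices coincide with high season and therefore with higher demand, so a naive correlation would understate the effect of price.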

9) Integration into pipelines and MLOps

Versioning: datasets, seeds, generation configs, model weights; SemVer semantics for releases.
Lineage: from synthetics back to sources (at a level of abstraction that exposes no PII).
Tests and contracts: DQ rules for synthetics, privacy checks in CI.
Cataloging: metadata about methods, hyperparameters, ε-budget, utility estimates.
Automation: DAG for generator training, batch release, drift monitoring.
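
One lightweight way to tie versioning, lineage, and cataloging together is a manifest written next to every synthetic batch. The field names below are assumptions rather than an established schema; the point is that the hash is computed over a canonical serialization, so identical inputs always yield an identical fingerprint.

```python
import hashlib
import json

def build_manifest(generator_name, version, seed, config, source_snapshot_id):
    """Describe one synthetic batch and fingerprint the description."""
    payload = {
        "generator": generator_name,
        "version": version,                       # SemVer string, e.g. "1.2.0"
        "seed": seed,
        "config": config,                         # hyperparameters, epsilon, etc.
        "source_snapshot_id": source_snapshot_id,  # opaque reference, no PII
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    payload["manifest_hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return payload
```

Storing the hash in the catalog lets a CI check confirm that a batch was produced from the recorded seed and config rather than from an untracked run.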

10) Stack and implementation patterns (solution classes)

Tabular/relational: copulas/CTGAN/TVAE/flows; FK-enabled generators.
Time series: state-space/ARIMA/VAR, diffusion/time-series GANs, regime switching.
Graphs: generators with structure invariants, GNN-VAE/GAN.
Text/LLM: prompts with rules and dictionaries, RAG framing on de-identified materials, detox/redaction.
Simulators: discrete-event frameworks, agent libraries, scenario config engines.

(Choose tools with support for privacy, constraint-aware generation and reporting.)

11) Validation and acceptance

Stat suite: before/after comparison of distributions and dependencies.
TSTR/TRTS: utility thresholds on targets.
Privacy suite: MIA/AIA tests, epsilon reports, surrogate k-anonymity.
Business invariants: automatic checks (amounts, balances, graph connectivity).
User acceptance: review by domain owners, visual sanity checks.
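
As a cheap complement to full MIA/AIA tests in the privacy suite, a distance-to-closest-record (DCR) check flags memorization: synthetic rows that sit much closer to the training set than to a real holdout, or that copy training rows exactly. This is a heuristic sketch, not a formal privacy guarantee, and it assumes numeric features that are already on comparable scales.

```python
import math
from statistics import median

def nearest_distance(row, others):
    """Euclidean distance from row to its closest neighbor in others."""
    return min(math.dist(row, other) for other in others)

def dcr_report(train, holdout, synth):
    """Compare how close synthetic rows are to training vs. holdout data."""
    to_train = [nearest_distance(s, train) for s in synth]
    to_holdout = [nearest_distance(s, holdout) for s in synth]
    return {
        "exact_copies": sum(d == 0.0 for d in to_train),
        "median_dcr_train": median(to_train),
        "median_dcr_holdout": median(to_holdout),
    }
```

A generator that has not memorized its inputs should show similar medians for both columns and zero exact copies; the brute-force neighbor search is quadratic, so larger datasets would need an index.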

12) Legal and ethical aspects

Coordination with lawyers: purpose of use, cross-border transfers, retention.
Licensing and IP: rights to synthetics derived from training materials; a usage policy per model.
Ethics and fairness: do not amplify discrimination; document risks/biases.
Communication: explicit labeling of synthetics in systems/reports.

13) Antipatterns

"We generate everything with an LLM" without privacy checks and invariants.
Ignoring tails: synthetics smooth out rare cases → failures in production.
No utility validation: beautiful distributions, but useless for tasks.
PII leaks: training on raw data and no DP/filters.
Unfixed seeds/versions: non-reproducibility, disputed results.
Lack of causality: simulations look "beautiful" but respond incorrectly to what-if questions.

14) Implementation Roadmap

1. Discovery: goals (utility/privacy), targets, risks, invariants, owners.
2. MVP: one domain (for example, payments/sessions), basic generator + privacy filters, stat suite + TSTR.
3. Scale: support for FK/graphs/time series, constraint-aware generation, DP with an ε-budget, catalog/lineage.
4. Hardening: causal/agent simulations, stress tests, pipeline chaos scenarios.
5. Optimization: cost-aware generation, active tail improvement, automatic selection of hyperparameters.

15) Pre-release checklist

PII/secrets cleared, legal terms of use documented.
Fixed seeds/versions, metadata and lineage.
Passed stat suite (distributions/dependencies) and business invariants.
Passed TSTR/TRTS on key tasks with utility thresholds.
Completed privacy tests (MIA/AIA); ε-budget accounted for and documented (if DP).
Configured drift monitoring and periodic retraining of generators.
Synthetics are explicitly labeled in BI/API, unauthorized export is prohibited.

16) Scenario templates

Tabular sales: copula + post-filters for VAT/currencies/calendar → discount stress test.
Traffic/sessions: agent behavior model + diffusion time series → queue/load test.
Fraud cases: tail oversampling + graph generation of links → scoring debugging.
Support: LLM-generated synthetic tickets with de-identification → router training.
Logistics: discrete-event simulation of warehouses/couriers → SLA/cost KPIs.

Bottom line: simulation and synthetic data are an engineering discipline, not "generation for the sake of generation." Combine privacy (DP/redaction), utility (TSTR/TRTS), causality, and domain constraints with a reproducible MLOps loop; synthetics then become a safe accelerator of research, testing, and decision-making.
