Synthetic Data: Examples – Realistic – using AI (SYNDERAI), pronounced /ˈsɪn.də.raɪ/
© HL7 Europe | Main Contributor: Dr. Kai U. Heitmann | Privacy Policy • LGPL-3.0 license
Analyzing the list of conditions and their frequencies generated by Synthea (from MITRE), which SYNDERAI has used since July 2025 to create the first sets of synthetic data (package versions 1.0+ and 2.0+), we can observe that the distribution is not comparable to the disease distribution of a typical population.
Future data generation processes need to reflect the public health conditions of populations in Europe much better than the synthetic data funds used until April 2026. The updated funds used from May 2026 on will be reflected in SYNDERAI package versions 3.0+.
The top entries are not clinical diagnoses at all — they are social/administrative findings:
| Issue | Count | % |
|---|---|---|
| Medication review due | 881,421 | 23% |
| Stress (finding) | 397,885 | 10.4% |
| Full-time employment | 358,819 | 9.4% |
| Part-time employment | 235,285 | 6.2% |
| Social isolation | 138,970 | 3.6% |
Together these consume ~57% of all records, crowding out clinically meaningful disease burden. These reflect US administrative coding and SDOH (Social Determinants of Health) frameworks that don't map to European health information systems.
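A frequency table of this kind can be reproduced with a few lines of Python. The following is a minimal sketch, assuming the condition descriptions have already been read from the Synthea export into a plain list; the function name `condition_shares` and the sample values are illustrative and not part of SYNDERAI.

```python
from collections import Counter

def condition_shares(descriptions):
    """Return (description, count, percent) tuples sorted by descending count."""
    counts = Counter(descriptions)
    total = sum(counts.values())
    return [(name, n, 100.0 * n / total) for name, n in counts.most_common()]

# Illustrative input only; real input would hold one entry per condition record.
sample = (
    ["Medication review due"] * 5
    + ["Stress (finding)"] * 3
    + ["Hypertension"] * 2
)

for name, n, pct in condition_shares(sample):
    print(f"{name}: {n} ({pct:.1f}%)")
```

Sorting by count makes it easy to see at a glance how much of the record volume the top entries consume.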
Comparing the output to authoritative European sources (WHO Europe, Eurostat, ECDC):
| Condition | In Synthea output | Real European prevalence |
|---|---|---|
| Hypertension | 1.10% | ~30–45% of adults |
| Diabetes type 2 | 0.24% | ~8–10% of adults |
| Depression / Major depressive disorder | ~0.001% (43 records!) | ~15–20% lifetime |
| COPD | Virtually absent | ~5–10% of adults >40 |
| Dementia / Alzheimer's | Not visible | ~7% of >65 population |
| Musculoskeletal/arthritis | Minimal | Leading cause of disability in EU |
| Anxiety disorders | Very low | ~14% of EU population |
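To put a rough number on these gaps, the observed share can be compared with the midpoint of the reference range. The sketch below uses only the two rows of the table that have numeric values on both sides; the helper `shortfall_factor` is a hypothetical name. Note the caveat that the observed figure is a share of condition records while the reference figure is a population prevalence, so the ratio is an indicator of magnitude rather than an exact measure.

```python
# Values copied from the comparison table (percent).
# Caveat: observed = share of condition records, reference = population prevalence,
# so the resulting ratio is only a rough indicator of the gap.
comparison = {
    "Hypertension": (1.10, (30.0, 45.0)),
    "Diabetes type 2": (0.24, (8.0, 10.0)),
}

def shortfall_factor(observed_pct, reference_range):
    """Ratio of the reference-range midpoint to the observed share."""
    low, high = reference_range
    return ((low + high) / 2.0) / observed_pct

for condition, (observed, reference) in comparison.items():
    factor = shortfall_factor(observed, reference)
    print(f"{condition}: under-represented by a factor of about {factor:.1f}")
```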
Drug overdose appears at 0.47% — a distinctly American epidemiological signature. European drug-related morbidity patterns differ substantially by country and substance.
Cancer is the second leading cause of death in Europe (after cardiovascular disease), yet it barely registers in the dataset — reflecting Synthea's default US-centric module coverage.
Synthea is modular and highly configurable. Here are the concrete levers to pull:
Every Synthea disease is defined in a JSON module under src/main/resources/modules/. Each module contains incidence and prevalence probability tables, often stratified by age and sex. These need to be re-parameterized using European data sources:
Target data sources:
eurostat.ec.europa.eu
Modules to prioritize for recalibration:
| Module | Action |
|---|---|
| hypertension.json | Dramatically increase prevalence onset probabilities |
| diabetes.json | Increase type 2 prevalence; adjust onset age distribution |
| copd.json | Increase prevalence, especially post-50 age brackets |
| depression.json | Massively increase (currently near-zero output is broken) |
| anxiety.json | Same as depression |
| alzheimers_dementia.json | Increase prevalence for >65 age cohorts |
| lung_cancer.json, colorectal_cancer.json, breast_cancer.json | Calibrate to ECDC/WHO Europe incidence rates |
| opioid_addiction.json | Reduce prevalence; shift to alcohol use disorder patterns |
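Re-parameterizing a module mostly means editing transition probabilities in its JSON. The following is a minimal sketch, assuming the state in question uses a `distributed_transition` list with numeric `distribution` values; states that use other transition types would need separate handling. The module fragment, its state names, and its probabilities are invented for illustration and do not come from a real Synthea module.

```python
import copy

def scale_transition(module, state_name, target, factor):
    """Return a copy of `module` in which the probability of moving from
    `state_name` to `target` is multiplied by `factor` (capped at 1.0).
    The remaining branches are rescaled so the distribution still sums to 1."""
    updated = copy.deepcopy(module)
    branches = updated["states"][state_name]["distributed_transition"]
    old_p = next(b["distribution"] for b in branches if b["transition"] == target)
    new_p = min(1.0, old_p * factor)
    rest = 1.0 - old_p
    for branch in branches:
        if branch["transition"] == target:
            branch["distribution"] = new_p
        elif rest > 0:
            branch["distribution"] *= (1.0 - new_p) / rest
    return updated

# Hypothetical fragment: names and numbers are placeholders, not real module content.
module = {
    "name": "Example",
    "states": {
        "Onset_Check": {
            "type": "Simple",
            "distributed_transition": [
                {"distribution": 0.02, "transition": "Onset"},
                {"distribution": 0.98, "transition": "No_Onset"},
            ],
        }
    },
}

recalibrated = scale_transition(module, "Onset_Check", "Onset", 10)
print(recalibrated["states"]["Onset_Check"]["distributed_transition"])
```

Working on a deep copy keeps the original module untouched, which makes it easier to compare the output of the default and the recalibrated variant.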
The social_determinants_of_health.json and related modules encode very US-specific SDOH concepts. Options:
care_goals.json or similar administrative modules. Cap it or disable it unless you need it for specific use cases.
Synthea uses demographic CSV files to drive population age/sex/geographic distributions. SYNDERAI replaced the US Census–based defaults with European equivalents (see Principles).
The European base in SYNDERAI needs to be tuned further towards data derived from Eurostat population projections or national census data for your target country/region. This affects:
An older age pyramid will automatically increase the prevalence of age-associated diseases like dementia, COPD, and cardiovascular disease even before you touch the clinical modules.
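The effect can be illustrated with simple arithmetic: the population-wide (crude) prevalence is the age-share-weighted mean of the age-specific prevalence, so shifting population share into older groups raises it even when the age-specific rates stay the same. The sketch below uses invented age groups, shares, and rates purely for illustration.

```python
def crude_prevalence(age_shares, age_specific_prevalence):
    """Population-wide prevalence as the age-share-weighted mean of
    age-specific prevalence."""
    if abs(sum(age_shares.values()) - 1.0) > 1e-9:
        raise ValueError("age shares must sum to 1")
    return sum(age_shares[g] * age_specific_prevalence[g] for g in age_shares)

# All numbers below are invented for illustration.
prevalence_by_age = {"0-39": 0.00, "40-64": 0.01, "65+": 0.07}
younger_pyramid = {"0-39": 0.55, "40-64": 0.30, "65+": 0.15}
older_pyramid = {"0-39": 0.45, "40-64": 0.34, "65+": 0.21}

younger = crude_prevalence(younger_pyramid, prevalence_by_age)
older = crude_prevalence(older_pyramid, prevalence_by_age)
print(f"younger pyramid: {younger:.4f}, older pyramid: {older:.4f}")
```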
keep_patients and Simulation Configuration
You can also configure the simulation to run against specific age cohorts in order to over-represent elderly populations (important for the EU aging burden).
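As a sketch of what such a run could look like, the helper below only assembles a command line and does not execute it. It assumes the `run_synthea` launcher with its `-p` (population size), `-a` (age range), and `-k` (keep-patients module) options; check these against the usage text of your Synthea version before relying on them. The function name `synthea_command` is hypothetical.

```python
def synthea_command(population, min_age, max_age, keep_module=None):
    """Build (but do not run) a run_synthea invocation restricted to an age range.

    The flags used here are assumptions to verify against your Synthea version.
    """
    cmd = ["./run_synthea", "-p", str(population), "-a", f"{min_age}-{max_age}"]
    if keep_module is not None:
        cmd += ["-k", keep_module]  # path to a keep-patients module (hypothetical)
    return cmd

print(" ".join(synthea_command(1000, 65, 100)))
```

Returning a list rather than a string keeps the arguments ready for `subprocess.run` without any shell quoting issues.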
Some conditions common in Europe may have no Synthea module at all. You'll need to author new JSON modules for:
Synthea uses US life tables for mortality. Replace with Eurostat life tables per country:
src/main/resources/cdc_growth_charts.json ← replace mortality assumptions
European life expectancy patterns and cause-of-death distributions differ from the US — cardiovascular disease dominates earlier, while cancer patterns differ by region.
After recalibration, validate the output distribution by computing the Kullback-Leibler divergence or a simpler chi-square goodness-of-fit test against your target European prevalence table. You can automate this as part of your data generation pipeline:
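```python
import numpy as np
from scipy.stats import chisquare

# observed = Synthea output frequencies (placeholder values for illustration)
observed_counts = np.array([110.0, 24.0, 5.0])
# expected = European prevalence rates from WHO/Eurostat (placeholder values)
expected_rates = np.array([0.35, 0.09, 0.07])
# chisquare needs equal totals, so rescale the expected rates to the observed total
expected_counts = expected_rates / expected_rates.sum() * observed_counts.sum()
stat, p = chisquare(f_obs=observed_counts, f_exp=expected_counts)
```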
Iterate on module parameters until the KL divergence is minimized across the top 20–30 conditions by burden.
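The KL divergence itself can be computed in the same spirit. This is a minimal sketch, assuming P is the European reference distribution and Q is the Synthea output over the same ordered list of conditions; the function name `kl_divergence` and the input values are illustrative. Conditions that are entirely absent from the output would make the divergence infinite, so the sketch clips Q at a small `eps`, which is a pragmatic choice rather than the only option.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """D_KL(P || Q) in nats after normalising both inputs to probability vectors."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    p = p / p.sum()
    q = np.clip(q / q.sum(), eps, None)  # avoid division by zero for empty categories
    mask = p > 0  # terms with p == 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Placeholder values for illustration: reference shares vs. generated counts.
reference = [0.5, 0.5]
generated = [25, 75]
print(kl_divergence(reference, generated))
```

A value of 0 means the two distributions match exactly; larger values indicate a worse fit.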
From April 2026 on, the SYNDERAI synthetic funds will be regenerated with recalibrated JSON modules for high-priority conditions (e.g., hypertension or depression), together with a validation script that measures distribution fit against a European reference dataset.
| Priority | Action | Impact |
|---|---|---|
| 🔴 High | Recalibrate hypertension, diabetes, depression, COPD modules | Fixes biggest epidemiological gaps |
| 🔴 High | Replace US demographics CSV with European age pyramid | Cascading effect on all age-stratified conditions |
| 🔴 High | Suppress or cap "Medication review due" + employment SDOH codes | Removes ~57% noise at top of distribution |
| 🟡 Medium | Recalibrate cancer modules with ECDC incidence data | Critical for mortality realism |
| 🟡 Medium | Replace US life tables with Eurostat equivalents | Fixes mortality and comorbidity chains |
| 🟡 Medium | Reduce opioid/drug overdose; increase alcohol use disorder | Shifts substance abuse to EU patterns |
| 🟢 Lower | Author new modules for EU-specific conditions | Fills gaps not covered by default Synthea |
| 🟢 Lower | Validate with KL divergence against GBD/WHO reference data | Ensures quantitative accuracy |