Synthetic Data: Examples – Realistic – using AI (SYNDERAI), pronounced /ˈsɪn.də.raɪ/
© HL7 Europe | Main Contributor: Dr. Kai U. Heitmann | Privacy Policy • LGPL-3.0 license
Analyzing the list of conditions and their frequency generated by Synthea from Mitre that SYNDERAI used from July 2025 on to create the first set of Synthtic data (Packages version 1.0+ and 2.0+), we can observe that the dirtibution is not comaprable to typical population based diseases.
Future data generation processes need to reflect public health conditions of populations in Europe much better than the synthetic data funds used until April 2026. The updated funds used from May 2026 on will be reflected in SYNDEAI packages version 3.0+.
The top entries are not clinical diagnoses at all — they are social/administrative findings:
| Issue | Count | % |
|---|---|---|
| Medication review due | 881,421 | 23% |
| Stress (finding) | 397,885 | 10.4% |
| Full-time employment | 358,819 | 9.4% |
| Part-time employment | 235,285 | 6.2% |
| Social isolation | 138,970 | 3.6% |
Together these consume ~57% of all records, crowding out clinically meaningful disease burden. These reflect US administrative coding and SDOH (Social Determinants of Health) frameworks that don't map to European health information systems.
Comparing the output to authoritative European sources (WHO Europe, Eurostat, ECDC):
| Condition | In Synthea output | Real European prevalence |
|---|---|---|
| Hypertension | 1.10% | ~30–45% of adults |
| Diabetes type 2 | 0.24% | ~8–10% of adults |
| Depression / Major depressive disorder | ~0.001% (43 records!) | ~15–20% lifetime |
| COPD | Virtually absent | ~5–10% of adults >40 |
| Dementia / Alzheimer's | Not visible | ~7% of >65 population |
| Musculoskeletal/arthritis | Minimal | Leading cause of disability in EU |
| Anxiety disorders | Very low | ~14% of EU population |
Drug overdose appears at 0.47% — a distinctly American epidemiological signature. European drug-related morbidity patterns differ substantially by country and substance.
Cancer is the second leading cause of death in Europe (after cardiovascular disease), yet it barely registers in the dataset — reflecting Synthea's default US-centric module coverage.
Synthea is modular and highly configurable. Here are the concrete levers to pull:
Every Synthea disease is defined in a JSON module under src/main/resources/modules/. Each module contains incidence and prevalence probability tables, often stratified by age and sex. These need to be re-parameterized using European data sources:
Target data sources:
eurostat.ec.europa.euModules to prioritize for recalibration:
| Module | Action |
|---|---|
hypertension.json |
Dramatically increase prevalence onset probabilities |
diabetes.json |
Increase type 2 prevalence; adjust onset age distribution |
copd.json |
Increase prevalence, especially post-50 age brackets |
depression.json |
Massively increase — currently near-zero output is broken |
anxiety.json |
Same as depression |
alzheimers_dementia.json |
Increase prevalence for >65 age cohorts |
lung_cancer.json, colorectal_cancer.json, breast_cancer.json |
Calibrate to ECDC/WHO Europe incidence rates |
opioid_addiction.json |
Reduce prevalence; shift to alcohol use disorder patterns |
The social_determinants_of_health.json and related modules encode very US-specific SDOH concepts. Options:
care_goals.json or similar administrative modules. Cap it or disable it unless you need it for specific use cases.Synthea uses demographic CSV files to drive population age/sex/geographic distributions. SYNDERAI replaced the US Census–based defaults with European equivalents (see Principles).
The European base in SYNDERAI need to get tweaked towards data derived from Eurostat population projections or national census data for your target country/region. This affects:
An older age pyramid will automatically increase the prevalence of age-associated diseases like dementia, COPD, and cardiovascular disease even before you touch the clinical modules.
keep_patients and Simulation ConfigurationYou can also define the simulation to run against specific age cohorts to over-represent elderly populations (important for EU aging burden).
Some conditions common in Europe may have no Synthea module at all. You'll need to author new JSON modules for:
Synthea uses US life tables for mortality. Replace with Eurostat life tables per country:
src/main/resources/cdc_growth_charts.json ← replace mortality assumptions
European life expectancy patterns and cause-of-death distributions differ from the US — cardiovascular disease dominates earlier, while cancer patterns differ by region.
After recalibration, validate the output distribution by computing the Kullback-Leibler divergence or a simpler chi-square goodness-of-fit test against your target European prevalence table. You can automate this as part of your data generation pipeline:
python
from scipy.stats import chisquare
import pandas as pd
# expected = European prevalence rates (from WHO/Eurostat)
# observed = Synthea output frequencies
stat, p = chisquare(f_obs=observed_counts, f_exp=expected_counts)
Iterate on module parameters until the KL divergence is minimized across the top 20–30 conditions by burden.
From April 2026 on the SYNDERAI synthetic funds will be re-generated with specific recalibrated JSON modules for a high-priority condition (e.g., hypertension or depression), or draft a validation script to measure distribution fit against a European reference dataset.
| Priority | Action | Impact |
|---|---|---|
| 🔴 High | Recalibrate hypertension, diabetes, depression, COPD modules | Fixes biggest epidemiological gaps |
| 🔴 High | Replace US demographics CSV with European age pyramid | Cascading effect on all age-stratified conditions |
| 🔴 High | Suppress or cap "Medication review due" + employment SDOH codes | Removes ~57% noise at top of distribution |
| 🟡 Medium | Recalibrate cancer modules with ECDC incidence data | Critical for mortality realism |
| 🟡 Medium | Replace US life tables with Eurostat equivalents | Fixes mortality and comorbidity chains |
| 🟡 Medium | Reduce opioid/drug overdose; increase alcohol use disorder | Shifts substance abuse to EU patterns |
| 🟢 Lower | Author new modules for EU-specific conditions | Fills gaps not covered by default Synthea |
| 🟢 Lower | Validate with KL divergence against GBD/WHO reference data | Ensures quantitative accuracy |
We created both deliverables: a recalibrated European Hypertension Synthea module (here) and a Python validation script (here) with statistical fit testing against WHO/Eurostat reference data.
hypertension_europe.json — A drop-in Synthea module replacing the US default. Key changes from the US original: age-stratified onset rates from EHIS 2019 (ranging from 0.45%/yr at age 18–34 to 1.7%/yr at 55–64), ESC/ESH-preferred medications (ramipril → losartan → amlodipine instead of lisinopril/HCTZ), European BP control targets (<130/80), and complication rates calibrated against GBD 2019 Europe. Install by placing it under src/main/resources/modules/.
validate_synthea_europe.py — Run against any conditions.csv with:
python3 validate_synthea_europe.py --input conditions.csv --population 100000 --output results.csv --charts
The numbers are striking — only 1 out of 38 reference conditions falls within the acceptable ±20% band:
| Global Metric | Value | Interpretation |
|---|---|---|
| KL Divergence | 4.01 | Very high (>0.5 = poor) |
| Jensen-Shannon Div. | 0.21 | Moderate fit — recalibration required |
| Weighted MAE | 12.6 pp | 12.6 percentage point average error |
The three sharpest failure modes surfacing from your data:
1. Respiratory infections are wildly over-represented — Viral sinusitis at 9.78% vs an EU reference of 0.8% (+1,122%), and acute pharyngitis at 5.51% vs 0.7% (+687%). These are acute, self-limiting conditions that accumulate many records per patient lifetime in Synthea's default encounter model, inflating them far beyond population prevalence.
2. Eight conditions are completely absent — COPD, Depression, Anxiety disorder, Osteoarthritis, Osteoporosis, Prostate cancer, Dementia, and Parkinson's disease generate zero records. These represent some of the largest sources of disability-adjusted life years in Europe.
3. Metabolic and cardiovascular burden is severely compressed — Hyperlipidaemia sits at 1.64% vs a 22% EU reference (−93%), Metabolic syndrome at 2.7% vs 25% (−89%), and Hypertension at 4.78% vs 32% (−85%). These are the foundational chronic conditions that drive most downstream morbidity in any European cohort.
The hypertension_europe.json module directly corrects the hypertension gap. The same calibration logic can be applied to the other missing/under-sampled modules following the same pattern.


| SNOMED | Condition | Category | ICD-10 | Synthea Count | Synthea Prev % | EU Ref Prev % | Deviation % | Status | Source |
|---|---|---|---|---|---|---|---|---|---|
| 59621000 | Essential hypertension | Cardiovascular | I10 | 42144 | 4.781 | 32.0 | -85.1 | ⬇ UNDER | WHO Europe 2023; EHIS 2019 |
| 414545008 | Ischemic heart disease | Cardiovascular | I20-I25 | 26525 | 3.009 | 4.5 | -33.1 | ⬇ UNDER | ESC Atlas of Cardiology 2021; GBD 2019 Europe |
| 22298006 | Myocardial infarction | Cardiovascular | I21 | 3442 | 0.391 | 1.8 | -78.3 | ⬇ UNDER | ESC Atlas 2021; EuroHeart 2022 |
| 84114007 | Heart failure | Cardiovascular | I50 | 197 | 0.022 | 2.2 | -99.0 | ⬇ UNDER | ESC Heart Failure Atlas 2021 |
| 49436004 | Atrial fibrillation | Cardiovascular | I48 | 1353 | 0.154 | 2.5 | -93.9 | ⬇ UNDER | ESC/EHRA 2020; EuroObservational Research Programme |
| 44054006 | Diabetes mellitus type 2 | Metabolic | E11 | 9103 | 1.033 | 7.5 | -86.2 | ⬇ UNDER | IDF Diabetes Atlas 10th Ed. 2021; WHO Europe 2022 |
| 15777000 | Prediabetes | Metabolic | R73.09 | 41340 | 4.69 | 8.0 | -41.4 | ⬇ UNDER | IDF Atlas 2021; Diabetes Care Europe estimates |
| 162864005 | Obesity (BMI 30+) | Metabolic | E66 | 51883 | 5.886 | 17.0 | -65.4 | ⬇ UNDER | WHO Europe Obesity Report 2022; Eurostat EHIS 2019 |
| 55822004 | Hyperlipidaemia | Metabolic | E78 | 14429 | 1.637 | 22.0 | -92.6 | ⬇ UNDER | EAS Dyslipidaemia survey; Eurostat EHIS 2019 |
| 237602007 | Metabolic syndrome | Metabolic | E88.81 | 23838 | 2.704 | 25.0 | -89.2 | ⬇ UNDER | IDF; DECODE Study; GBD 2019 Europe |
| 13645005 | COPD | Respiratory | J44 | 0 | 0.0 | 6.5 | -100.0 | ❌ MISSING | ERS White Book 2013; BOLD Europe 2022; GBD 2019 |
| 195967001 | Asthma | Respiratory | J45 | 151 | 0.017 | 7.0 | -99.8 | ⬇ UNDER | GINA 2023; ERS / European Lung Foundation survey |
| 444814009 | Viral sinusitis | Respiratory | J01 | 86161 | 9.775 | 0.8 | 1121.9 | ⬆ OVER | EPOS 2020 guidelines; GP consultation rates EU |
| 195662009 | Acute viral pharyngitis | Respiratory | J02.9 | 48576 | 5.511 | 0.7 | 687.3 | ⬆ OVER | ECDC Antimicrobial Resistance 2022; Eurostat health care use |
| 35489007 | Depression | Mental Health | F32-F33 | 0 | 0.0 | 6.9 | -100.0 | ❌ MISSING | ECNP/EBC 2023 Size and Burden of Mental Disorders in Europe |
| 197480006 | Anxiety disorder | Mental Health | F40-F41 | 0 | 0.0 | 6.5 | -100.0 | ❌ MISSING | ECNP/EBC 2023; WHO Mental Health Atlas Europe 2022 |
| 80583007 | Severe anxiety (panic) | Mental Health | F41.0 | 17858 | 2.026 | 2.0 | 1.3 | ✅ OK | ECNP/EBC 2023; ESEMeD study |
| 36923009 | Major depression single episode | Mental Health | F32 | 43 | 0.005 | 4.0 | -99.9 | ⬇ UNDER | ECNP/EBC 2023; ESEMeD; GBD 2019 Europe |
| 396275006 | Osteoarthritis | Musculoskeletal | M15-M19 | 0 | 0.0 | 9.5 | -100.0 | ❌ MISSING | EULAR / Arthritis Research UK; GBD 2019 Europe |
| 69896004 | Rheumatoid arthritis | Musculoskeletal | M05-M06 | 321 | 0.036 | 0.8 | -95.4 | ⬇ UNDER | EULAR 2022; Eurostat EHIS 2019 |
| 278860009 | Chronic low back pain | Musculoskeletal | M54.5 | 16647 | 1.889 | 8.0 | -76.4 | ⬇ UNDER | EuroSpine / Airaksinen 2006; GBD 2019; Eurostat EHIS 2019 |
| 82423001 | Chronic pain | Musculoskeletal | G89.29 | 24045 | 2.728 | 19.0 | -85.6 | ⬇ UNDER | Breivik et al. 2006 European Pain Survey; Pain Alliance Europe 2022 |
| 64572001 | Osteoporosis | Musculoskeletal | M81 | 0 | 0.0 | 5.6 | -100.0 | ❌ MISSING | IOF European Audit 2021; Kanis et al. 2021 |
| 254837009 | Malignant neoplasm of breast | Oncology | C50 | 2280 | 0.259 | 1.4 | -81.5 | ⬇ UNDER | ECIS/IARC GLOBOCAN 2020 Europe; ECDC Cancer Burden |
| 363418001 | Malignant neoplasm of prostate | Oncology | C61 | 0 | 0.0 | 1.1 | -100.0 | ❌ MISSING | ECIS/IARC GLOBOCAN 2020 Europe |
| 363406005 | Colorectal cancer | Oncology | C18-C20 | 990 | 0.112 | 0.7 | -84.0 | ⬇ UNDER | ECIS/IARC GLOBOCAN 2020 Europe |
| 254637007 | Non-small cell lung cancer | Oncology | C34 | 1068 | 0.121 | 0.5 | -75.8 | ⬇ UNDER | ECIS/IARC GLOBOCAN 2020 Europe; ERS Lung Cancer Report |
| 230690007 | Cerebrovascular accident (stroke) | Neurological | I63-I64 | 819 | 0.093 | 1.6 | -94.2 | ⬇ UNDER | ESO European Stroke Organisation; GBD 2019 Europe |
| 26929004 | Alzheimer disease | Neurological | G30 | 4887 | 0.554 | 1.5 | -63.0 | ⬇ UNDER | Alzheimer Europe Dementia Monitor 2022; Eurostat Ageing |
| 56193007 | Dementia | Neurological | F00-F03 | 0 | 0.0 | 2.0 | -100.0 | ❌ MISSING | Alzheimer Europe 2022; WHO Europe NCD Report 2022 |
| 32798002 | Parkinson disease | Neurological | G20 | 0 | 0.0 | 0.3 | -100.0 | ❌ MISSING | European Parkinson's Disease Association; GBD 2019 |
| 431855005 | Chronic kidney disease stage 1 | Renal | N18.1 | 18151 | 2.059 | 3.0 | -31.4 | ⬇ UNDER | ERA-EDTA 2020; KDIGO CKD Prevalence in Europe |
| 431856006 | Chronic kidney disease stage 2 | Renal | N18.2 | 16177 | 1.835 | 3.1 | -40.8 | ⬇ UNDER | ERA-EDTA 2020 |
| 433144002 | Chronic kidney disease stage 3 | Renal | N18.3 | 11820 | 1.341 | 3.9 | -65.6 | ⬇ UNDER | ERA-EDTA 2020; CKD Prognosis Consortium Europe |
| 68496003 | Polyp of colon | Gastrointestinal | K63.5 | 10188 | 1.156 | 2.5 | -53.8 | ⬇ UNDER | European Society of Gastrointestinal Endoscopy 2022; ESGE |
| 40055000 | Chronic sinusitis | Gastrointestinal | J32 | 21536 | 2.443 | 1.2 | 103.6 | ⬆ OVER | EPOS 2020; Fokkens et al. Rhinology 2020 |
| 307426000 | Acute infective cystitis | Infectious | N30.0 | 9885 | 1.121 | 1.5 | -25.2 | ⬇ UNDER | EAU Guidelines on Urological Infections 2023; ECDC |
| 10939881000119105 | Unhealthy alcohol drinking behaviour | Substance Use | F10 | 19579 | 2.221 | 7.6 | -70.8 | ⬇ UNDER | WHO Europe Alcohol & Health Status Report 2022; Eurostat EHIS |