Synthetic Data Examples – Realistic – using AI (SYNDERAI), pronounced /ˈsɪn.də.raɪ/

Caveats (warnings)

© HL7 Europe | Main Contributor: Dr. Kai U. Heitmann | Privacy Policy • LGPL-3.0 license

Problem Description

Analyzing the list of conditions and their frequencies generated by Synthea (from MITRE), which SYNDERAI has used since July 2025 to create the first sets of synthetic data (package versions 1.0+ and 2.0+), shows that the distribution is not comparable to typical population-based disease patterns.

Future data generation processes need to reflect the public health conditions of populations in Europe much better than the synthetic data funds used until April 2026. The updated funds, used from May 2026 on, will be reflected in SYNDERAI packages version 3.0+.

Analysis of the Distribution Problems

1. Administrative & Social Findings Dominate Clinical Reality

The top entries are not clinical diagnoses at all — they are social/administrative findings:

| Issue | Count | % |
|---|---|---|
| Medication review due | 881,421 | 23% |
| Stress (finding) | 397,885 | 10.4% |
| Full-time employment | 358,819 | 9.4% |
| Part-time employment | 235,285 | 6.2% |
| Social isolation | 138,970 | 3.6% |

Together these consume ~57% of all records, crowding out clinically meaningful disease burden. These reflect US administrative coding and SDOH (Social Determinants of Health) frameworks that don't map to European health information systems.

2. Major European Burden Diseases Are Severely Underrepresented

Comparing the output to authoritative European sources (WHO Europe, Eurostat, ECDC):

| Condition | In Synthea output | Real European prevalence |
|---|---|---|
| Hypertension | 1.10% | ~30–45% of adults |
| Diabetes type 2 | 0.24% | ~8–10% of adults |
| Depression / Major depressive disorder | ~0.001% (43 records!) | ~15–20% lifetime |
| COPD | Virtually absent | ~5–10% of adults >40 |
| Dementia / Alzheimer's | Not visible | ~7% of >65 population |
| Musculoskeletal/arthritis | Minimal | Leading cause of disability in EU |
| Anxiety disorders | Very low | ~14% of EU population |
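
To make the gap concrete, the shortfall can be expressed as the ratio between the lower bound of the European prevalence range and the share observed in the Synthea output. The sketch below uses only the hypertension and type 2 diabetes figures from the comparison; the function and variable names are illustrative. The two columns use different denominators (share of condition records versus share of adults), so the ratio is only a rough indicator.

```python
# Figures taken from the comparison: (share in Synthea output in %, lower
# bound of the real European prevalence in %). Names are illustrative only.
comparison = {
    "Hypertension": (1.10, 30.0),
    "Diabetes type 2": (0.24, 8.0),
}


def underrepresentation_factor(observed_pct: float, expected_pct: float) -> float:
    """Return how many times lower the observed share is than the expected one."""
    return expected_pct / observed_pct


factors = {
    name: round(underrepresentation_factor(obs, exp), 1)
    for name, (obs, exp) in comparison.items()
}
print(factors)
```

Both conditions come out more than an order of magnitude below the conservative end of the European range.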

3. US Opioid Crisis Artifacts

Drug overdose appears at 0.47% — a distinctly American epidemiological signature. European drug-related morbidity patterns differ substantially by country and substance.

4. Cancer Is Nearly Absent

Cancer is the second leading cause of death in Europe (after cardiovascular disease), yet it barely registers in the dataset — reflecting Synthea's default US-centric module coverage.


How to Modify Synthea for European Populations

Synthea is modular and highly configurable. Here are the concrete levers to pull:

1. Recalibrate Disease Module Prevalence Parameters

Every Synthea disease is defined in a JSON module under src/main/resources/modules/. Each module contains incidence and prevalence probability tables, often stratified by age and sex. These need to be re-parameterized using European data sources:

Target data sources:

Modules to prioritize for recalibration:

| Module | Action |
|---|---|
| hypertension.json | Dramatically increase prevalence onset probabilities |
| diabetes.json | Increase type 2 prevalence; adjust onset age distribution |
| copd.json | Increase prevalence, especially post-50 age brackets |
| depression.json | Massively increase; the current near-zero output is broken |
| anxiety.json | Same as depression |
| alzheimers_dementia.json | Increase prevalence for >65 age cohorts |
| lung_cancer.json, colorectal_cancer.json, breast_cancer.json | Calibrate to ECDC/WHO Europe incidence rates |
| opioid_addiction.json | Reduce prevalence; shift to alcohol use disorder patterns |
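
As a minimal sketch of what recalibrating a module could look like, the snippet below raises the probability of one branch of a `distributed_transition` and rescales the remaining branches so the distribution still sums to 1. The helper `set_transition_probability`, the example module, and the value 0.35 are hypothetical placeholders, not calibrated figures. Real modules often use conditional, complex, or lookup-table transitions that need their own handling.

```python
import copy
import json


def set_transition_probability(module: dict, state: str, target: str, prob: float) -> dict:
    """Return a copy of `module` where `state` moves to `target` with
    probability `prob`; the other branches are rescaled to keep the sum at 1."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0 and 1")
    updated = copy.deepcopy(module)
    branches = updated["states"][state]["distributed_transition"]
    others = [b for b in branches if b["transition"] != target]
    if len(others) == len(branches):
        raise KeyError(f"{state} has no branch to {target}")
    rest = sum(b["distribution"] for b in others)
    for b in branches:
        if b["transition"] == target:
            b["distribution"] = prob
        elif rest > 0:
            b["distribution"] = b["distribution"] * (1.0 - prob) / rest
    return updated


# Hypothetical, heavily simplified module used only for illustration.
example = {
    "name": "Example Condition",
    "states": {
        "Initial": {
            "type": "Initial",
            "distributed_transition": [
                {"distribution": 0.02, "transition": "Onset"},
                {"distribution": 0.98, "transition": "Terminal"},
            ],
        },
        "Onset": {"type": "Simple", "direct_transition": "Terminal"},
        "Terminal": {"type": "Terminal"},
    },
}

# 0.35 is a placeholder, not a calibrated European value.
recalibrated = set_transition_probability(example, "Initial", "Onset", 0.35)
print(json.dumps(recalibrated["states"]["Initial"], indent=2))
```

Working on a deep copy keeps the original module untouched, which makes it easier to compare the output distributions before and after a change.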

2. Suppress or Reweight US-Centric SDOH Modules

The social_determinants_of_health.json and related modules encode very US-specific SDOH concepts. Options:
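
One possible approach, sketched here under assumptions, is to leave the modules untouched and filter the dominant administrative and social findings out of the exported condition records afterwards. The column name `DESCRIPTION`, the sample rows, and the prefix matching are assumptions for illustration; matching on terminology codes would likely be more robust.

```python
import csv
import io

# Findings named in the distribution analysis. Matching by description
# prefix is an assumption; matching on codes would be more robust.
SUPPRESSED_PREFIXES = (
    "Medication review due",
    "Stress (finding)",
    "Full-time employment",
    "Part-time employment",
    "Social isolation",
)


def drop_suppressed(rows, column="DESCRIPTION"):
    """Yield only rows whose description does not start with a suppressed prefix."""
    for row in rows:
        if not row[column].startswith(SUPPRESSED_PREFIXES):
            yield row


# Tiny in-memory stand-in for a conditions CSV export (hypothetical rows).
sample = io.StringIO(
    "PATIENT,DESCRIPTION\n"
    "p1,Medication review due (situation)\n"
    "p1,Hypertension\n"
    "p2,Full-time employment (finding)\n"
    "p2,Diabetes mellitus type 2 (disorder)\n"
)
kept = list(drop_suppressed(csv.DictReader(sample)))
print([row["DESCRIPTION"] for row in kept])
```

Filtering after generation only removes the noise from the output. It does not raise the share of clinical conditions per patient, so it complements module recalibration rather than replacing it.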

3. Adjust the Demographic Input Files

Synthea uses demographic CSV files to drive population age/sex/geographic distributions. SYNDERAI replaced the US Census–based defaults with European equivalents (see Principles).

The European base in SYNDERAI needs to be tuned towards data derived from Eurostat population projections or national census data for the target country/region. This affects:

An older age pyramid will automatically increase the prevalence of age-associated diseases like dementia, COPD, and cardiovascular disease even before you touch the clinical modules.
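
The mechanism can be illustrated with a small calculation: crude prevalence is the age-specific prevalence weighted by the population share of each age band. In the sketch below, only the 7% figure for the 65+ band comes from the comparison earlier in this document; the age bands, the other rates, and both pyramids are hypothetical values chosen to show the effect.

```python
def crude_prevalence(age_shares: dict, age_specific_prevalence: dict) -> float:
    """Weight age-specific prevalence by the population share of each age band."""
    total = sum(age_shares.values())
    return sum(
        share / total * age_specific_prevalence[band]
        for band, share in age_shares.items()
    )


# Hypothetical numbers, except the 7% prevalence in the 65+ band.
dementia_by_age = {"0-39": 0.0, "40-64": 0.005, "65+": 0.07}
younger_pyramid = {"0-39": 0.55, "40-64": 0.30, "65+": 0.15}
older_pyramid = {"0-39": 0.45, "40-64": 0.33, "65+": 0.22}

print(round(crude_prevalence(younger_pyramid, dementia_by_age), 4))
print(round(crude_prevalence(older_pyramid, dementia_by_age), 4))
```

With identical age-specific rates, the older pyramid yields a higher crude prevalence purely because more of the population sits in the high-risk band.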

4. Modify the keep_patients and Simulation Configuration

You can also define the simulation to run against specific age cohorts to over-represent elderly populations (important for EU aging burden).
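
A cohort-restricted run could be scripted along the following lines. The sketch only assembles the command line and does not execute it. It assumes the `run_synthea` launcher with its `-p` (population size), `-a` (age range as `min-max`), and `-s` (seed) options; check these against the Synthea version in use. The helper name `synthea_command` is hypothetical.

```python
import shlex
from typing import List, Optional


def synthea_command(population: int, min_age: int, max_age: int,
                    seed: Optional[int] = None) -> List[str]:
    """Build a run_synthea argument list restricted to one age cohort."""
    if not 0 <= min_age <= max_age:
        raise ValueError("expected 0 <= min_age <= max_age")
    args = ["./run_synthea", "-p", str(population), "-a", f"{min_age}-{max_age}"]
    if seed is not None:
        args += ["-s", str(seed)]
    return args


# Example: an elderly cohort of 1000 patients with a fixed seed.
cmd = synthea_command(population=1000, min_age=65, max_age=100, seed=42)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.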

5. Create New Modules for Europe-Specific Conditions

Some conditions common in Europe may have no Synthea module at all. You'll need to author new JSON modules for:

6. Recalibrate Mortality and Life Tables

Synthea uses US life tables for mortality. Replace with Eurostat life tables per country:

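src/main/resources/cdc_growth_charts.json  ← US (CDC) reference data; this file holds growth charts rather than life tables, so locate the mortality assumptions in your Synthea version before replacing them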

European life expectancy patterns and cause-of-death distributions differ from the US — cardiovascular disease dominates earlier, while cancer patterns differ by region.

7. Validate Against Reference Prevalence Targets

After recalibration, validate the output distribution by computing the Kullback-Leibler divergence or a simpler chi-square goodness-of-fit test against your target European prevalence table. You can automate this as part of your data generation pipeline:

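```python
import numpy as np
from scipy.stats import chisquare

# observed = Synthea output frequencies (placeholder counts)
observed_counts = np.array([110, 24, 5])
# expected = European prevalence rates from WHO/Eurostat (placeholder rates)
expected_rates = np.array([0.35, 0.09, 0.05])
# chisquare needs both arrays to share the same total, so scale the rates
expected_counts = expected_rates / expected_rates.sum() * observed_counts.sum()
stat, p = chisquare(f_obs=observed_counts, f_exp=expected_counts)
```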

Iterate on module parameters until the KL divergence is minimized across the top 20–30 conditions by burden.
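
The KL divergence itself can be computed in a few lines. The sketch below treats the observed Synthea counts as distribution P and the European reference rates as Q, which is an assumption about the intended direction. The function name, the smoothing constant `eps`, and the sample values are placeholders.

```python
import numpy as np


def kl_divergence(observed_counts, expected_rates, eps: float = 1e-12) -> float:
    """Return D_KL(P || Q) in nats, with P the observed and Q the expected distribution."""
    # eps avoids log(0) and division by zero for conditions absent on one side
    p = np.asarray(observed_counts, dtype=float) + eps
    q = np.asarray(expected_rates, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))


# Placeholder values for three conditions; replace with real counts and rates.
observed = [110, 24, 5]
expected = [0.35, 0.09, 0.05]
print(f"KL divergence: {kl_divergence(observed, expected):.4f}")
```

A value of 0 means the two normalized distributions match exactly; tracking this number across recalibration rounds shows whether parameter changes move the output in the right direction.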


Priority Actions Summary

From April 2026 on, the SYNDERAI synthetic funds will be regenerated with recalibrated JSON modules for high-priority conditions (e.g., hypertension or depression), and a validation script will measure the distribution fit against a European reference dataset.

| Priority | Action | Impact |
|---|---|---|
| 🔴 High | Recalibrate hypertension, diabetes, depression, COPD modules | Fixes biggest epidemiological gaps |
| 🔴 High | Replace US demographics CSV with European age pyramid | Cascading effect on all age-stratified conditions |
| 🔴 High | Suppress or cap "Medication review due" + employment SDOH codes | Removes ~57% noise at top of distribution |
| 🟡 Medium | Recalibrate cancer modules with ECDC incidence data | Critical for mortality realism |
| 🟡 Medium | Replace US life tables with Eurostat equivalents | Fixes mortality and comorbidity chains |
| 🟡 Medium | Reduce opioid/drug overdose; increase alcohol use disorder | Shifts substance abuse to EU patterns |
| 🟢 Lower | Author new modules for EU-specific conditions | Fills gaps not covered by default Synthea |
| 🟢 Lower | Validate with KL divergence against GBD/WHO reference data | Ensures quantitative accuracy |