Synthetic Data Examples – Realistic – using AI (SYNDERAI), pronounced /ˈsɪn.də.raɪ/

Caveats (warnings)

Problem Description

An analysis of the conditions and their frequencies generated by Synthea (from MITRE), which SYNDERAI has used since July 2025 to create the first sets of synthetic data (package versions 1.0+ and 2.0+), shows that the distribution is not comparable to typical population-based disease distributions.

Future data generation processes need to reflect the public health conditions of European populations much better than the synthetic data funds used until April 2026 do. The updated funds, used from May 2026 on, will be reflected in SYNDERAI packages version 3.0+.

Analysis of the Distribution Problems

1. Administrative & Social Findings Dominate Clinical Reality

The top entries are not clinical diagnoses at all — they are social/administrative findings:

Issue	Count	%
Medication review due	881,421	23%
Stress (finding)	397,885	10.4%
Full-time employment	358,819	9.4%
Part-time employment	235,285	6.2%
Social isolation	138,970	3.6%

Together these consume ~57% of all records, crowding out clinically meaningful disease burden. These reflect US administrative coding and SDOH (Social Determinants of Health) frameworks that don't map to European health information systems.

2. Major European Burden Diseases Are Severely Underrepresented

Comparing the output to authoritative European sources (WHO Europe, Eurostat, ECDC):

Condition	In Synthea output	Real European prevalence
Hypertension	1.10%	~30–45% of adults
Diabetes type 2	0.24%	~8–10% of adults
Depression / Major depressive disorder	~0.001% (43 records!)	~15–20% lifetime
COPD	Virtually absent	~5–10% of adults >40
Dementia / Alzheimer's	Not visible	~7% of >65 population
Musculoskeletal/arthritis	Minimal	Leading cause of disability in EU
Anxiety disorders	Very low	~14% of EU population
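
To make the size of these gaps concrete, the shortfall can be expressed as the ratio between the lower bound of the reference range and the share seen in the Synthea output. The sketch below uses only the three rows of the table that carry numbers on both sides. It assumes the two columns are comparable at all: the output share is a percentage of condition records, while the reference is a population prevalence, so the resulting factors are indicative only. The dictionary layout and the function name are illustrative.

```python
# Shortfall factor = lower bound of the European reference range
#                    divided by the share in the Synthea output.
# Values are copied from the table above (percentages).
rows = {
    "Hypertension": (1.10, 30.0),
    "Diabetes type 2": (0.24, 8.0),
    "Depression": (0.001, 15.0),
}


def shortfall_factor(output_pct, reference_low_pct):
    """Return how many times lower the output share is than the reference."""
    if output_pct <= 0:
        raise ValueError("output_pct must be positive")
    return reference_low_pct / output_pct


factors = {name: shortfall_factor(out, ref) for name, (out, ref) in rows.items()}

for name, factor in factors.items():
    print(f"{name}: roughly {factor:,.0f}x below the reference lower bound")
```

Even against the most conservative end of each reference range, the output is lower by one to four orders of magnitude, which is why recalibration rather than fine-tuning is needed.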

3. US Opioid Crisis Artifacts

Drug overdose appears at 0.47% — a distinctly American epidemiological signature. European drug-related morbidity patterns differ substantially by country and substance.

4. Cancer Is Nearly Absent

Cancer is the second leading cause of death in Europe (after cardiovascular disease), yet it barely registers in the dataset — reflecting Synthea's default US-centric module coverage.

How to Modify Synthea for European Populations

Synthea is modular and highly configurable. Here are the concrete levers to pull:

1. Recalibrate Disease Module Prevalence Parameters

Every Synthea disease is defined in a JSON module under src/main/resources/modules/. Each module contains incidence and prevalence probability tables, often stratified by age and sex. These need to be re-parameterized using European data sources:

Target data sources:

Eurostat (Health Statistics) — eurostat.ec.europa.eu
ECDC (European Centre for Disease Prevention and Control)
WHO Europe Global Health Observatory
Global Burden of Disease (IHME) — EU-specific country profiles
National health registries (e.g., NHS Digital for England, RKI for Germany, INSEE for France)

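Modules to prioritize for recalibration (file names are indicative; check them against the modules shipped with the Synthea version in use):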

Module	Action
`hypertension.json`	Dramatically increase prevalence onset probabilities
`diabetes.json`	Increase type 2 prevalence; adjust onset age distribution
`copd.json`	Increase prevalence, especially post-50 age brackets
`depression.json`	Massively increase; the current near-zero output is unrealistic
`anxiety.json`	Massively increase, as for depression
`alzheimers_dementia.json`	Increase prevalence for >65 age cohorts
`lung_cancer.json`, `colorectal_cancer.json`, `breast_cancer.json`	Calibrate to ECDC/WHO Europe incidence rates
`opioid_addiction.json`	Reduce prevalence; shift to alcohol use disorder patterns
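
As a minimal sketch of what re-parameterizing a module can look like, the function below changes one branch of a `distributed_transition` and rescales the remaining branches so the probabilities still sum to 1. The toy module, its state names, and the 0.35 target are placeholders rather than Synthea defaults or validated European values, and `set_transition_probability` is a hypothetical helper, not part of Synthea. Real modules may use other transition types (for example `complex_transition` or lookup tables) that this sketch does not handle.

```python
import copy


def set_transition_probability(module, state_name, target, new_prob):
    """Return a copy of `module` in which the branch of `state_name` leading
    to `target` has probability `new_prob`; the other branches are rescaled
    proportionally so the distribution still sums to 1."""
    if not 0.0 <= new_prob <= 1.0:
        raise ValueError("new_prob must be between 0 and 1")
    updated = copy.deepcopy(module)
    branches = updated["states"][state_name]["distributed_transition"]
    others = [b for b in branches if b["transition"] != target]
    if len(others) == len(branches):
        raise KeyError(f"no branch to {target!r} in state {state_name!r}")
    old_rest = sum(b["distribution"] for b in others)
    for branch in branches:
        if branch["transition"] == target:
            branch["distribution"] = new_prob
        elif old_rest > 0:
            branch["distribution"] *= (1.0 - new_prob) / old_rest
    return updated


# Toy module: names and probabilities are placeholders, not Synthea defaults.
toy_module = {
    "name": "Toy Hypertension",
    "states": {
        "Initial": {
            "type": "Initial",
            "distributed_transition": [
                {"distribution": 0.05, "transition": "Onset_Hypertension"},
                {"distribution": 0.95, "transition": "Terminal"},
            ],
        },
        "Onset_Hypertension": {"type": "Simple", "direct_transition": "Terminal"},
        "Terminal": {"type": "Terminal"},
    },
}

recalibrated = set_transition_probability(
    toy_module, "Initial", "Onset_Hypertension", 0.35
)
```

Working on a deep copy keeps the original module untouched, which makes it easier to compare the default and recalibrated versions side by side.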

2. Suppress or Reweight US-Centric SDOH Modules

The social_determinants_of_health.json and related modules encode very US-specific SDOH concepts. Options:

Disable the module entirely by removing it from the module path or setting all transition probabilities to 0 for non-applicable codes.
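Replace SDOH coding with European equivalents, for example ICPC-2 or ICD-10 (the WHO edition or a national modification such as ICD-10-GM) as used in EU primary care systems. Note that ICD-10-CM is the US clinical modification and is not the right target here.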
Reduce the "Medication review due" entry — this is generated at extremely high frequency by the care_goals.json or similar administrative modules. Cap it or disable it unless you need it for specific use cases.
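
Where changing the modules is not practical, a cap can also be applied as a post-processing step on the exported records. The sketch below assumes rows with a `DESCRIPTION` column, as in a CSV export of conditions; the sample rows are made up, and `cap_description` is a hypothetical helper rather than a Synthea feature.

```python
import csv
import io


def cap_description(rows, description, max_share):
    """Keep rows so that `description` makes up at most `max_share` of the
    result. Rows with other descriptions are always kept; surplus matching
    rows are dropped in input order."""
    if not 0.0 <= max_share < 1.0:
        raise ValueError("max_share must be in [0, 1)")
    other = sum(1 for r in rows if r["DESCRIPTION"] != description)
    # Largest k with k / (k + other) <= max_share; int() rounds down, so
    # any floating-point error can only make the cap slightly stricter.
    allowed = int(max_share * other / (1.0 - max_share))
    kept, used = [], 0
    for row in rows:
        if row["DESCRIPTION"] == description:
            if used >= allowed:
                continue
            used += 1
        kept.append(row)
    return kept


# Made-up sample data for illustration only.
SAMPLE_CSV = """PATIENT,DESCRIPTION
p1,Medication review due
p1,Hypertension
p2,Medication review due
p2,Medication review due
p3,Diabetes type 2
p3,Medication review due
p4,Medication review due
p4,COPD
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
capped = cap_description(rows, "Medication review due", max_share=0.5)
```

Capping after generation leaves patient timelines intact but hides the cause, so it is best treated as a stopgap until the generating module itself is adjusted.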

3. Adjust the Demographic Input Files

Synthea uses demographic CSV files to drive population age/sex/geographic distributions. SYNDERAI replaced the US Census–based defaults with European equivalents (see Principles).

The European base in SYNDERAI needs to be tuned further towards data derived from Eurostat population projections or national census data for the target country or region. This affects:

Age pyramid (Europe is generally older than the US default)
Sex ratios
Urban/rural splits

An older age pyramid will automatically increase the prevalence of age-associated diseases like dementia, COPD, and cardiovascular disease even before you touch the clinical modules.
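
The effect of the age pyramid alone can be illustrated with a small calculation: age-specific prevalence is held fixed and only the age mix changes. All figures below are illustrative placeholders, not Eurostat or Synthea values; the 7% for the oldest band simply echoes the dementia figure quoted earlier.

```python
# Age-specific prevalence is identical in both scenarios.
age_specific_prevalence = {"0-39": 0.00, "40-64": 0.01, "65+": 0.07}

# Two hypothetical age mixes (shares of the total population).
younger_pyramid = {"0-39": 0.55, "40-64": 0.30, "65+": 0.15}
older_pyramid = {"0-39": 0.45, "40-64": 0.34, "65+": 0.21}


def crude_prevalence(age_shares, prevalence_by_age):
    """Population-wide prevalence as the share-weighted average of the
    age-specific prevalences."""
    if abs(sum(age_shares.values()) - 1.0) > 1e-9:
        raise ValueError("age shares must sum to 1")
    return sum(share * prevalence_by_age[band] for band, share in age_shares.items())


younger = crude_prevalence(younger_pyramid, age_specific_prevalence)
older = crude_prevalence(older_pyramid, age_specific_prevalence)

print(f"younger pyramid: {younger:.2%}, older pyramid: {older:.2%}")
```

In this toy setup the crude prevalence rises by about a third purely because more people fall into the oldest band, without any change to the clinical modules.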

4. Modify the `keep_patients` and Simulation Configuration

You can also configure the simulation to run against specific age cohorts in order to over-represent elderly populations (important for the EU aging burden).
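
A minimal sketch of how such a run could be assembled from a pipeline script is shown below. It only builds the argument list and does not launch anything. It assumes the `run_synthea` launcher with its `-p` (population size), `-a` (age range), `-s` (seed), and `-k` (keep-patients module) options; check these against the usage text of the Synthea version in use. The function name and the example values are hypothetical.

```python
def synthea_command(population, min_age, max_age, seed=None, keep_module=None):
    """Build the argument list for a Synthea run restricted to an age range."""
    if min_age > max_age:
        raise ValueError("min_age must not exceed max_age")
    cmd = ["./run_synthea", "-p", str(population), "-a", f"{min_age}-{max_age}"]
    if seed is not None:
        cmd += ["-s", str(seed)]
    if keep_module is not None:
        # Path to a keep-patients module; the caller supplies it.
        cmd += ["-k", keep_module]
    return cmd


# Example: 1,000 patients aged 65 to 100 with a fixed seed.
elderly_run = synthea_command(1000, 65, 100, seed=42)
print(" ".join(elderly_run))
```

The resulting list can be handed to `subprocess.run` from the Synthea checkout directory once the options have been verified.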

5. Create New Modules for Europe-Specific Conditions

Some conditions common in Europe may have no Synthea module at all. You'll need to author new JSON modules for:

Tick-borne encephalitis (Central/Eastern Europe)
Mediterranean diet–related conditions (Southern Europe)
Country-specific screening programs (e.g., cervical cancer screening timelines differ significantly from the US)
Occupational diseases (relevant in industrial regions)
Alcohol use disorder — recalibrate with WHO Europe alcohol consumption data, which varies enormously by country
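
As a starting point for authoring such a module, the sketch below assembles a very small state machine in Python and serializes it to JSON. The state types (`Initial`, `Delay`, `ConditionOnset`, `Terminal`) follow Synthea's generic module format as far as it is assumed here, but the result has not been validated against Synthea. The clinical code is a deliberate placeholder and the onset probability is invented, so both must be replaced before use. `build_condition_module` and `undefined_targets` are hypothetical helpers.

```python
import json


def build_condition_module(name, display, code, yearly_onset_probability):
    """Build a minimal module: wait a year, then either start the condition
    with the given probability or wait another year."""
    p = yearly_onset_probability
    return {
        "name": name,
        "states": {
            "Initial": {"type": "Initial", "direct_transition": "Wait_One_Year"},
            "Wait_One_Year": {
                "type": "Delay",
                "exact": {"quantity": 1, "unit": "years"},
                "distributed_transition": [
                    {"distribution": p, "transition": "Onset"},
                    {"distribution": 1 - p, "transition": "Wait_One_Year"},
                ],
            },
            "Onset": {
                "type": "ConditionOnset",
                "codes": [{"system": "SNOMED-CT", "code": code, "display": display}],
                "direct_transition": "Terminal",
            },
            "Terminal": {"type": "Terminal"},
        },
    }


def undefined_targets(module):
    """Return transition targets that do not correspond to a defined state."""
    states = module["states"]
    targets = set()
    for state in states.values():
        if "direct_transition" in state:
            targets.add(state["direct_transition"])
        for branch in state.get("distributed_transition", []):
            targets.add(branch["transition"])
    return targets - set(states)


# Placeholder code and probability: replace with real, sourced values.
tbe_module = build_condition_module(
    "Tick-Borne Encephalitis (sketch)", "Tick-borne encephalitis",
    "PLACEHOLDER-CODE", 0.0001,
)
module_json = json.dumps(tbe_module, indent=2)
```

Checking for undefined transition targets before writing the file catches the most common authoring mistake, a misspelled state name, early.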

6. Recalibrate Mortality and Life Tables

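Synthea uses US life tables for mortality. Replace them with Eurostat life tables per country. The file below holds the CDC growth charts (pediatric height and weight percentiles) rather than mortality assumptions, so treat it as a further US-centric default to review and locate the mortality inputs in the Synthea version you run:

src/main/resources/cdc_growth_charts.json  ← CDC growth charts (US reference), not mortality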

European life expectancy patterns and cause-of-death distributions differ from the US — cardiovascular disease dominates earlier, while cancer patterns differ by region.

7. Validate Against Reference Prevalence Targets

After recalibration, validate the output distribution by computing the Kullback-Leibler divergence or a simpler chi-square goodness-of-fit test against your target European prevalence table. You can automate this as part of your data generation pipeline:

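```python
import numpy as np
from scipy.stats import chisquare

# observed = Synthea output frequencies per condition (placeholder counts)
observed_counts = np.array([110, 24, 5])
# expected = European prevalence rates from WHO/Eurostat (placeholder rates)
expected_rates = np.array([0.35, 0.09, 0.07])
# chisquare needs both arrays to share the same total, so scale the rates
expected_counts = expected_rates / expected_rates.sum() * observed_counts.sum()
stat, p = chisquare(f_obs=observed_counts, f_exp=expected_counts)
```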

Iterate on module parameters until the KL divergence is minimized across the top 20–30 conditions by burden.
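
The KL divergence mentioned above can be computed with the standard library alone. The sketch below measures D(P || Q) with P as the observed distribution and Q as the reference; that direction, the normalization of both inputs to sum to 1 over the compared conditions, and the `eps` floor for zero reference values are all choices made for this illustration. Normalizing prevalences this way treats conditions as mutually exclusive categories, which is a simplification. The input values are placeholders.

```python
import math


def kl_divergence(observed_counts, expected_rates, eps=1e-12):
    """D(P || Q) in nats, with P from observed counts and Q from expected
    rates, both normalized over the conditions present in observed_counts."""
    total_obs = sum(observed_counts.values())
    total_exp = sum(expected_rates[c] for c in observed_counts)
    if total_obs <= 0 or total_exp <= 0:
        raise ValueError("totals must be positive")
    divergence = 0.0
    for condition, count in observed_counts.items():
        p = count / total_obs
        q = expected_rates[condition] / total_exp
        if p > 0:  # terms with p == 0 contribute nothing
            divergence += p * math.log(p / max(q, eps))
    return divergence


# Placeholder inputs for illustration only.
observed = {"hypertension": 110, "diabetes_t2": 24, "dementia": 5}
reference = {"hypertension": 0.35, "diabetes_t2": 0.09, "dementia": 0.07}

fit = kl_divergence(observed, reference)
print(f"KL divergence: {fit:.4f} nats")
```

A value of 0 means the normalized distributions match exactly; tracking this number across recalibration rounds gives a single figure to drive the iteration.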

Priority Actions Summary

From April 2026 on, the SYNDERAI synthetic data funds will be regenerated with recalibrated JSON modules for high-priority conditions (e.g., hypertension or depression), accompanied by a validation script that measures distribution fit against a European reference dataset.

Priority	Action	Impact
🔴 High	Recalibrate hypertension, diabetes, depression, COPD modules	Fixes biggest epidemiological gaps
🔴 High	Replace US demographics CSV with European age pyramid	Cascading effect on all age-stratified conditions
🔴 High	Suppress or cap "Medication review due" + employment SDOH codes	Removes ~57% noise at top of distribution
🟡 Medium	Recalibrate cancer modules with ECDC incidence data	Critical for mortality realism
🟡 Medium	Replace US life tables with Eurostat equivalents	Fixes mortality and comorbidity chains
🟡 Medium	Reduce opioid/drug overdose; increase alcohol use disorder	Shifts substance abuse to EU patterns
🟢 Lower	Author new modules for EU-specific conditions	Fills gaps not covered by default Synthea
🟢 Lower	Validate with KL divergence against GBD/WHO reference data	Ensures quantitative accuracy