Synthetic Data Examples – Realistic – using AI (SYNDERAI), pronounced /ˈsɪn.də.raɪ/

Principles

Introduction

One of the recurring challenges in developing, testing, and validating HL7 FHIR-based systems is the availability of medically realistic, safe, and standards-compliant sets of example/test data. To address this, the xShare Project integrates synthetic data generated by the SYNDERAI initiative, which is coordinated by HL7 Europe.

SYNDERAI provides synthetic, high-quality HL7 FHIR instances that replicate real-world clinical records without exposing personal health information. These instances are used across the xShare toolbox — from transformation to visualization to sharing — enabling fully privacy-compliant test workflows, aligned with the European Electronic Health Record Exchange Format (EEHRxF) and General Data Protection Regulation (GDPR, EU 2016/679) principles.

SYNDERAI Design

SYNDERAI datasets are designed to

The project generates not just HL7 FHIR JSON instance files but also metadata, test coverage indicators, and placeholders for multilingual expansion and narrative descriptions.

Within xShare, SYNDERAI synthetic data is

This data enables xShare Adoption Sites and developers to

Major Achievements

Following this design, several results have been achieved. European patient cohorts and providers were compiled with a degree of realism in their properties. Clinical "stories" were combined with these individuals to create use-case-based data sets, which are subsequently turned into the appropriate example/test instances. Artificial Intelligence (AI) was added at any of these levels where appropriate, e.g. prompting to populate data fragments based on a given demographic and clinical context, or to produce human-readable text from existing granular data.

[Image: Depositphotos.com © Robert Marmion]

Patient cohorts, Provider crowds

Demographics

Large collections of human names and base demographics in Europe were created, featuring gender, age, and address with geospatial coordinates. They were subsequently recombined to create use-case-bound health care cases or linked to our featured personas.

[Figure: Example synthetic data in tabular format for further use]
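As an illustration of how such a tabular row might be carried into an HL7 FHIR instance, the following minimal sketch maps one row to a FHIR R4 Patient resource. The column names (id, family, given, gender, birth_date, city, country, lat, lon) and the example row are assumptions made for this sketch and are not taken from the SYNDERAI data sets. The geospatial coordinates are attached with the standard FHIR geolocation extension.

```python
import json

# Standard FHIR extension for geographic coordinates on an Address.
GEOLOCATION_URL = "http://hl7.org/fhir/StructureDefinition/geolocation"


def row_to_patient(row: dict) -> dict:
    """Turn one tabular demographic row into a minimal FHIR R4 Patient.

    The row keys are hypothetical column names chosen for this sketch.
    """
    return {
        "resourceType": "Patient",
        "id": row["id"],
        "name": [{"family": row["family"], "given": [row["given"]]}],
        "gender": row["gender"],  # FHIR administrative gender code
        "birthDate": row["birth_date"],  # ISO date, YYYY-MM-DD
        "address": [
            {
                "city": row["city"],
                "country": row["country"],
                "extension": [
                    {
                        "url": GEOLOCATION_URL,
                        "extension": [
                            {"url": "latitude", "valueDecimal": row["lat"]},
                            {"url": "longitude", "valueDecimal": row["lon"]},
                        ],
                    }
                ],
            }
        ],
    }


# Invented example row for illustration only.
example_row = {
    "id": "example-0001",
    "family": "Example",
    "given": "Alex",
    "gender": "female",
    "birth_date": "1980-05-17",
    "city": "Utrecht",
    "country": "NL",
    "lat": 52.0907,
    "lon": 5.1214,
}

patient = row_to_patient(example_row)
print(json.dumps(patient, indent=2))
```

A real instance would additionally need to conform to the relevant Implementation Guide profile, which this sketch does not attempt.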

Background Story: In 2020, Kai Heitmann started the 25tipster initiative (25 Thousand International Patient Summary Test Records) to produce 25,000 International Patient Summaries as synthetic data. While that initiative did not fully conclude, all of its materials and methodology were taken over and refined.

Stratification

Stratification [1] plays a role when moving towards more realistic examples. A strato-proximity match was applied for further processing: synthetic individuals of the same nationality and gender and of almost the same age were chosen and matched with the clinical data funds (see "Clinical relationship" later on this page), assigning them findings, diagnoses, and more.
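The matching idea can be sketched as follows. This is a minimal illustration, not the SYNDERAI implementation: it assumes that both the synthetic individuals and the clinical records carry nationality, gender, and age, that "almost the same age" means an age gap of at most two years, and that each individual is used only once. The class names, the function name, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Individual:
    ident: str
    nationality: str
    gender: str
    age: int


@dataclass(frozen=True)
class ClinicalRecord:
    ident: str
    nationality: str
    gender: str
    age: int


def strato_proximity_match(
    records: List[ClinicalRecord],
    individuals: List[Individual],
    max_age_gap: int = 2,  # assumed tolerance for "almost the same age"
) -> Dict[str, Optional[str]]:
    """Map each clinical record id to the id of the closest-aged unused
    individual in the same nationality/gender stratum, or None."""
    used = set()
    result: Dict[str, Optional[str]] = {}
    for record in records:
        candidates = [
            p
            for p in individuals
            if p.ident not in used
            and p.nationality == record.nationality
            and p.gender == record.gender
            and abs(p.age - record.age) <= max_age_gap
        ]
        if not candidates:
            result[record.ident] = None
            continue
        # Smallest age gap wins; the id breaks ties deterministically.
        best = min(candidates, key=lambda p: (abs(p.age - record.age), p.ident))
        used.add(best.ident)
        result[record.ident] = best.ident
    return result


# Invented sample data for illustration only.
people = [
    Individual("p1", "NL", "female", 44),
    Individual("p2", "NL", "female", 47),
    Individual("p3", "DE", "male", 61),
]
cases = [
    ClinicalRecord("c1", "NL", "female", 45),
    ClinicalRecord("c2", "NL", "female", 46),
    ClinicalRecord("c3", "DE", "male", 70),
]
matches = strato_proximity_match(cases, people)
print(matches)
```

The greedy, first-come assignment keeps the sketch short; a production matcher might instead optimize the assignment across all records at once.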

Providers

tbd

Proximity

tbd

Clinical relationship

[Image: Depositphotos.com © Yevhen Shkolenko]

Synthea

The Synthea methodology from MITRE Corporation was used to create a basic clinical data fund. However, many parts of the data reflect American circumstances, e.g. medication codes. Diagnoses, lab values, vital signs, immunizations, and more were taken from these funds and partially mapped in context, for example code-wise to SNOMED CT or some national code systems.

Personas

We invented personas based on granular facts.

AI in SYNDERAI

Artificial Intelligence is applied only to fragments and parts of the complete "story" that SYNDERAI tells, not to invent whole stories. For example, the Lab Report is based on realistic lab values projected onto European citizens/patients across Europe, stratified [1] by demographic and clinical factors to reach close clinical coverage. Only the normal lab value ranges for the strata used are provided by concise calls to the AI API.

Artificial Intelligence was used to create or add fragments and parts of health care data. Examples include normal ranges for lab data based on age, gender, and diagnoses, a structured medication dosage for drugs taken by the synthetic individual, or the invention of short human-readable text for discharge summaries for our personas.
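The lab reference range case could look roughly like the sketch below. It only illustrates the pattern of a concise, narrowly scoped request whose answer is checked and placed into a FHIR Observation.referenceRange element. The prompt wording, the function names, and the JSON answer format are assumptions. The model call is replaced by a stub (fake_model) so that the example runs without any AI service, and the numbers the stub returns are placeholders, not SYNDERAI output.

```python
import json
from typing import Callable, Dict


def build_range_prompt(loinc_code: str, test_name: str, unit: str, stratum: Dict) -> str:
    """Build a short prompt that asks only for a normal range (hypothetical wording)."""
    return (
        f"Give the normal reference range for {test_name} (LOINC {loinc_code}) "
        f"in {unit} for a {stratum['gender']} patient aged {stratum['age']}. "
        'Answer with JSON only: {"low": <number>, "high": <number>}.'
    )


def to_reference_range(answer: str, unit: str) -> Dict:
    """Validate the model answer and shape it as a FHIR referenceRange entry."""
    data = json.loads(answer)
    low, high = float(data["low"]), float(data["high"])
    if low >= high:
        raise ValueError("implausible range: low must be below high")

    def quantity(value: float) -> Dict:
        # UCUM is the unit system FHIR expects for quantities.
        return {
            "value": value,
            "unit": unit,
            "system": "http://unitsofmeasure.org",
            "code": unit,
        }

    return {"low": quantity(low), "high": quantity(high)}


def fake_model(prompt: str) -> str:
    """Stub standing in for the AI API; returns placeholder values."""
    return '{"low": 13.5, "high": 17.5}'


def reference_range_for(
    ask: Callable[[str], str], loinc_code: str, test_name: str, unit: str, stratum: Dict
) -> Dict:
    prompt = build_range_prompt(loinc_code, test_name, unit, stratum)
    return to_reference_range(ask(prompt), unit)


stratum = {"gender": "male", "age": 52}
prompt = build_range_prompt("718-7", "Hemoglobin", "g/dL", stratum)
ref_range = reference_range_for(fake_model, "718-7", "Hemoglobin", "g/dL", stratum)
print(json.dumps(ref_range, indent=2))
```

Passing the model call in as a function keeps the validation step testable on its own and makes it easy to reject answers that are not well-formed before they reach a generated instance.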

The Human Text

Artificial Intelligence helps to invent human text for typical Hospital Discharge Summary sections based on granular synthetic but realistic facts such as medication, lab results, and diagnoses.

The xShare Yellow Button Story

SYNDERAI plays a critical role in enabling safe, reusable, and standards-aligned testing of the Yellow Button components. It supports transparency, traceability, and validation across the xShare toolbox — and provides a solid foundation for future testbeds, certification frameworks, and developer onboarding efforts in support of the EHDS.

Compared to the earlier xShare D3.3 deliverable, the SYNDERAI component has evolved significantly. What was previously described as a conceptual asset is now implemented and actively integrated into xShare workflows. The sets of example/test data are openly available, continuously expanded, and aligned with HL7 FHIR Implementation Guides such as the International/European Patient Summary, the European Laboratory Report, and the European Hospital Discharge Summary.

It is now validated through real-world use in visualization rendering via vi7eti, Smart Health Link generation, and IHE Connectathon/Plugathon test case development. These advancements make SYNDERAI a critical enabler for privacy-preserving, standards-compliant, and repeatable validation processes across the toolbox.


[1] Stratification of clinical trials is the partitioning of subjects and results by a factor other than the treatment given. – see Wikipedia https://en.wikipedia.org/wiki/Stratification_(clinical_trials)