During his keynote lecture at a 2022 conference on medical images, Alex Frangi displayed two scans of the vasculature of the same subject’s brain. Although the luminous tangle of blood vessels appeared to be all but identical, even to this audience of radiologists and computer science experts, only one of them depicted the real vessels measured on a human subject. The other had been synthesized by a computer algorithm, mimicking what might be captured from a real human subject using magnetic resonance angiography.

“I asked, which of these is real and which is synthetic?” says Frangi, who directs the Center for Computational Imaging & Simulation Technologies in Biomedicine at the University of Leeds in the United Kingdom. “It’s very, very difficult to tell.”

The creation of a brain image real enough to fool experts was far from an academic curiosity. Rather, the feat was central to a flourishing new field of synthetic data, which could change the way patients are diagnosed, how clinical trials are conducted and especially how artificial intelligence–driven tools—proliferating across the medical landscape—are trained and perform. 

AI tools require vast amounts of data to train their algorithms, and health care produces that data in torrential quantities. But health systems and other guardians of data are reluctant to share that information, in large part because of privacy concerns. It has also, historically, been extremely difficult for researchers to put most of that data to use. 

Synthetic data, used to produce fake brains and other not-quite-real medical artifacts, can help solve that problem, giving researchers access to potentially unlimited numbers of images and histories they can use to train AI models, which in turn can diagnose illnesses, or model and predict how diseases such as COVID-19 affect populations. Another exciting use of synthetic data is taking off in the pharmaceutical or medical device industries, where “digital twins”—virtual subjects in clinical trials’ control arms—reduce the need for real humans. 

In some of these cases, the manufactured data is purely numerical, providing the statistical parameters that make up a unique health profile. In other cases, it is visual, approximating scans, photos or other medical imaging. Yet the aim for most synthetic data is the same—to create exquisitely accurate models representing human subjects to further science while keeping people out of harm’s way. The possibilities of that approach, and its limits, are only beginning to emerge. 

The implications of the medical sector’s data gap came into harsh relief during the COVID pandemic. In theory, a disease that has infected more than 600 million people globally would create a robust data trail, one well suited to training AI for key pandemic-management tasks: spotting the signs of COVID in patients and building models that help predict the spread and impact of the disease.

But multiple studies examining the flurry of COVID AI models came to the same conclusion: They were ineffectual in the fight against the virus. A large review published in Nature Machine Intelligence combed through hundreds of deep-learning models trained on chest X-rays, chest computed tomography scans and other medical images to diagnose COVID or predict patient risk. After closely examining 62 of these tools, the authors concluded that because of a high risk for bias or other methodological flaws, not one was fit for clinical use. Another review, published in The BMJ, which examined models using any type of clinical input data (not just medical imaging), took a similarly dim view of hundreds of newly minted diagnostic and prognostic tools for COVID.

Deep-learning algorithms hone their proficiency through practicing with an astronomical number of records, which must, in turn, be clean and neatly labeled. Yet with COVID, deep wells of data that met the needed criteria—substantive, freely accessible and well formatted—have proved almost impossible to come by. Privacy rules have thrown up another barrier.   

There’s also the laborious process of preparing the data. “If you’re trying to identify cancers in the lungs, for example, you need someone to manually annotate those images, indicating the pixels that correspond to, say, carcinogenic nodules,” says Frangi. This makes collecting medical images costly and tedious, requiring the participation of radiologists as well as AI experts.

Without extensive reserves of meticulous patient records, the resulting AI model can be unpredictable or incomplete. For example, several research groups used chest scans of children not infected with COVID-19 to teach algorithms what non-COVID cases looked like. But instead of learning to identify who did or did not have COVID, the AI tools learned to identify children.

At the heart of the issue was “not having access to the desired data, or having data that were not suitably formatted or documented,” according to a 2021 report from the Alan Turing Institute, the United Kingdom’s national center for data science and AI.

The creation of a brain image detailed enough to fool experts was far from an academic curiosity.

Yet even AI trained on images that are clean and neatly labeled may still face challenges in the field. In 2020, Google Health researchers tried out an AI diagnostic tool for diabetic retinopathy, a condition that affects people with diabetes and can cause blindness. In the laboratory, the tool worked with a data set of eye photographs and achieved 90% accuracy in diagnosing the condition. But in the real-world setting of 11 clinics in Thailand, the deep-learning system’s shortcomings became apparent. It frequently didn’t know what to make of photos of patients’ eyes snapped in poor lighting conditions and wasn’t able to make a diagnosis.

Synthetic data might help with many of these issues. By generating records that are in line with real human examples, it can produce an ample, well-annotated training database that may avoid privacy concerns. And synthetic data engines can be calibrated to produce a wider array of examples that reflect what AI tools are likely to encounter when meeting real-life data in the wild. 

To produce synthetic data or images, researchers often use a model known as a generative adversarial network, or GAN. The GAN trains on a large set of real images and learns to produce data that is statistically similar. A neural network called the generator creates outputs—for example, photos of artificial faces—that are as realistic as possible. A second network, called the discriminator, compares those generated images with real examples from the training data and tries to decide whether they are genuine or fake. Based on that feedback, the generator tweaks its parameters for creating new images, continuing until the discriminator can no longer tell the difference between real and artificial.
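The adversarial loop described above can be sketched in miniature. The following is a hedged illustration, not a production GAN: the “data” are just numbers drawn from a Gaussian rather than images, the generator is a simple linear map, the discriminator a one-feature logistic classifier, and all learning rates and constants are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real data": samples from the distribution the generator must imitate.
REAL_MEAN, REAL_STD = 4.0, 1.0

# Generator: fake = a * z + b, with noise z ~ N(0, 1).
a, b = 1.0, 0.0
# Discriminator: D(x) = sigmoid(w * x + c), its estimate that x is real.
w, c = 0.1, 0.0

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

lr, batch = 0.03, 128
for step in range(5000):
    real = rng.normal(REAL_MEAN, REAL_STD, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - d_real) * real) - np.mean(d_fake * fake))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))

    # Generator step: adjust a, b so the discriminator calls fakes "real"
    # (the non-saturating objective: maximize log D(fake)).
    d_fake = sigmoid(w * fake + c)
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

samples = a * rng.normal(0.0, 1.0, 10_000) + b
print(f"generated mean ~ {samples.mean():.2f} (target {REAL_MEAN})")
```

After training, the generator’s samples cluster near the real distribution’s mean even though it never sees the real data directly; all of its learning signal arrives through the discriminator’s feedback, exactly the dynamic the paragraph above describes.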

Synthetic data generated in this way can produce more training images with greater variation, including examples on the margins of clarity that mirror real-world data. While the GAN approach is relatively new, researchers have already used it to create photorealistic synthetic data for skin lesions, pathology slides and chest X-rays. Early in the pandemic, a study carried out by researchers from the Maharaja Agrasen Institute of Technology in New Delhi created synthetic chest X-ray images of COVID-19 patients to supplement a scarce set of real radiographic images. They found that adding the synthetic images increased a diagnostic AI’s ability to detect patients who had COVID. 

Philip Payne, founding director of the Institute for Informatics at Washington University in St. Louis, led a team that created what he believes to be the largest synthetic medical data set ever assembled. The set was based on data in the National COVID Cohort Collaborative (N3C), which pooled patient-level data representing 13 million patients from 72 institutions. The synthetic data mirrors the characteristics of the original subjects, but because of the way the set was created, its records do not correspond exactly to any real patients.


The techniques the researchers used represent one approach to using patient data sets without running into privacy concerns. The usual way to share patient records for research is to “de-identify” the patients, removing characteristics such as names, phone numbers and birth dates that could be used to pin down a person’s identity. The Health Insurance Portability and Accountability Act, or HIPAA, includes a privacy rule that spells out many of these requirements. Some regulations were temporarily lifted during the pandemic, and sources such as the Open COVID-19 Data Curation Group, which collated international patient-level de-identified data, emerged. But health care teams are trained to protect their patients, and invitations to contribute patients’ data to national efforts met resistance, even when oversight was relaxed. “Combining data can create a lot of anxiety for the stewards or owners of that data,” Payne says.

Researchers needed elements of real patient data—neighborhoods in a city where COVID infection was highest, for example, or the dates when an infection had peaked. But ersatz records that mirrored the situation accurately might work just as well as the real thing. To create the N3C data set, Payne’s team partnered with Israeli startup MDClone on a computational approach that takes detailed information—not only geographic information but also medical data including body mass index, kidney function and blood pressure—from the records of real patients. “You create new synthetic patients that are replicas of the source patients,” Payne says. “The individual measurements, while not identical, don’t have statistically significant differences.” 
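One common way to build such statistical replicas can be sketched as follows. This is an illustrative stand-in, not MDClone’s proprietary method, and the cohort values below are simulated: fit the joint statistics of a real cohort (here, the mean and covariance of BMI, a kidney-function measure and blood pressure), then sample entirely new records from the fitted distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a real cohort: columns are BMI, eGFR (kidney function)
# and systolic blood pressure. Simulated values, not real patient data.
n = 500
real = rng.multivariate_normal(
    mean=[27.0, 85.0, 125.0],
    cov=[[16.0, -5.0, 10.0],
         [-5.0, 225.0, -20.0],
         [10.0, -20.0, 180.0]],
    size=n,
)

# Fit the cohort's joint statistics...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and sample brand-new "patients" from the fitted distribution.
synthetic = rng.multivariate_normal(mu, cov, size=n)

print("real means     :", np.round(real.mean(axis=0), 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))

# No synthetic record is a copy of any real record: the smallest
# distance between a synthetic row and a real row stays above zero.
min_gap = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2).min()
print(f"closest synthetic-to-real distance: {min_gap:.3f}")
```

The two cohorts match statistically, but, as Payne describes, no individual synthetic measurement corresponds exactly to a source patient.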

Payne’s team was excited by the prospect that this huge, national data set would enable the research team to undertake “really big, predictive analytics projects—such as trying to ascertain which COVID patients are at risk of requiring ventilator support and who is going to get very severe disease,” he says. “Researchers would have enough data to really be able to answer those questions.” 

Two studies, in the Journal of the American Medical Informatics Association and the Journal of Medical Internet Research, confirmed that the synthetic data provided an accurate representation of the real patients on which it was based. But Payne and his team also wanted to assess whether the fake data effectively maintained the privacy of patients. So researchers looked for data in the public domain involving the patients from whom the synthetic data was derived. This included demographic data, census data, voting records, information about what foods people bought in grocery stores and financial data. They then tried to combine that information with the fake data to re-identify particular patients. According to Payne, the synthetic data did a better job than standard de-identification methods at reducing the chance that someone could be re-identified.

The N3C data set of actual patients has already been used by researchers, for instance in probing which factors may predict the development of long COVID, or why some immunocompromised patients experience breakthrough infections after vaccination. The synthetic data based on the N3C can be used to answer similar questions.

Payne believes the work with COVID-19 could lead to a much broader use of synthetic data. “We’ve seen how data-sharing is the lifeblood of a rapid, agile public health response,” he says. “Synthetic data is at the forefront of being able to do that while managing privacy and confidentiality.” 

Synthetic data may also revolutionize how new drugs and medical devices are developed, which typically requires testing first in the lab, then on animals and finally on people in multiple phases of clinical trials. The process can take as long as 15 years and may cost hundreds of millions of dollars, says Frangi. Much of that expense goes into the logistics of identifying, recruiting, following and supporting trial subjects. “Digital twins”—computationally derived versions of human subjects—are showing promise for populating trials’ control arms and could speed research, control costs and give more patients access to experimental medicines.


Suppose you want to run a clinical trial involving 2,000 people, says Charles Fisher, co-founder and CEO of Unlearn.AI, a trial-design company pioneering the use of digital twins. Ordinarily, that would mean recruiting 1,000 patients for the experimental arm and 1,000 for the control arm. Unlearn uses computational modeling to deploy human trial subjects more efficiently—say, with 1,500 patients receiving the experimental treatment and just 500 getting the control, with synthetic versions of patients used to bolster the control group. “We can’t completely eliminate real human control groups, but we can make them smaller,” says Fisher, a biophysicist and a former data scientist for Pfizer. 

Unlearn uses control arms of previous clinical trials and observational studies to create an AI model that is appropriate to the disease under investigation. It then collects relevant health information for all of the patients in the new trial. For patients with Huntington’s disease, for example, that might include not only general health data but also results from tests such as the Unified Huntington’s Disease Rating Scale–Total Motor Score. 

The model then derives a digital twin by simulating how the real patient, assigned to the trial’s experimental arm, would likely have fared if they had been in the control arm. One trial subject effectively becomes two, with the investigators gathering data on what actually happens when the person receives the experimental treatment, and Unlearn’s model predicting what would have happened if that subject had received the control treatment. The difference is taken as a measure of effectiveness for the new drug or device. “We’re not imagining the medical records of some hypothetical person,” says Fisher. “We’re actually predicting the future medical records of a specific person.” 

To gauge the model’s accuracy, researchers also compare its projections for the synthetic control group with the real-world responses of the human control group. If there’s a discrepancy between the two, the model can be adjusted. Using this method, researchers can assign a larger proportion of total subjects to the experimental arm while using fewer real patients as controls.
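Under simplifying assumptions, the twin arithmetic described above reduces to three steps: fit an outcome model on historical control-arm data, predict each treated patient’s counterfactual control outcome, and average the observed-minus-predicted differences. The sketch below uses a plain linear model and simulated data; it is an illustration of the general idea, not Unlearn’s actual model.

```python
import numpy as np

rng = np.random.default_rng(7)

def control_outcome(baseline, noise):
    # Hypothetical disease-progression score under standard care.
    return 2.0 * baseline[:, 0] - 1.0 * baseline[:, 1] + 5.0 + noise

# 1. Historical control-arm data (past trials, observational studies).
n_hist = 1000
X_hist = rng.normal(size=(n_hist, 2))            # baseline covariates
y_hist = control_outcome(X_hist, rng.normal(0, 0.5, n_hist))

# 2. Fit the "digital twin" model: here, least-squares regression from
#    baseline covariates to the outcome under standard care.
A = np.column_stack([X_hist, np.ones(n_hist)])
coef, *_ = np.linalg.lstsq(A, y_hist, rcond=None)

# 3. New trial: every treated patient also gets a predicted twin outcome,
#    i.e. what the model says would have happened in the control arm.
TRUE_EFFECT = 2.0                                 # what the drug really adds
n_trial = 500
X_trial = rng.normal(size=(n_trial, 2))
y_treated = control_outcome(X_trial, rng.normal(0, 0.5, n_trial)) + TRUE_EFFECT

twin_pred = np.column_stack([X_trial, np.ones(n_trial)]) @ coef
effect_estimate = np.mean(y_treated - twin_pred)
print(f"estimated treatment effect: {effect_estimate:.2f} (true {TRUE_EFFECT})")
```

Each treated subject contributes both an observed outcome and a predicted control outcome, so one trial subject effectively becomes two; the remaining real control arm is still needed to check and recalibrate the model, as described above.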

One trial subject effectively becomes two.

Major pharmaceutical companies are beginning to show interest in experimenting with virtual patients in trials. Unlearn has entered a multiyear collaboration with Merck KGaA, Darmstadt, Germany, to accelerate late-stage clinical trials in immunology and perhaps other therapeutic areas.

Unlearn has received a regulatory qualification from the European Medicines Agency that describes the applicability of its approach for phase 2 and phase 3 clinical trials. The U.S. Food and Drug Administration may also be open to such trials. In 2020, it approved a new indication for a drug for atrial fibrillation patients based on this kind of trial data. 

But many questions about synthetic data remain to be resolved. One problem is that there is not yet any regulatory framework for developing AI models, says Faisal Mahmood, associate professor of pathology at Harvard Medical School. Another issue is to what extent clinicians will actually adopt AI tools trained by synthetic data. Mahmood notes a reluctance on the part of clinicians to use the 200 “software as a medical device” products already approved by the FDA, mostly because those tools tend to require changing well established and commonly used clinical workflows. “It will take a fundamental redesign of clinical workflows in order for AI to have a role,” he says.

In the meantime, those who believe artificial data could have a broad impact are working to refine their models. Working on one disease at a time, Unlearn is looking at how to create digital twins for clinical trials for treating such diseases as Alzheimer’s and rheumatoid arthritis. The ultimate goal is to have one model that works for everyone, Fisher says, “with the ability to simulate any person’s future health outcomes under treatment A, B or C.” 

Accomplishing that, and exploring other frontiers for synthetic data, will require overcoming technological challenges. Better machine learning algorithms, faster computers and more and better data will help speed progress. For now, the quality of medical data lags far behind data in other industries. But synthetic data research demonstrates how much can be done even with substandard data, raising hopes for a future in which ever more sophisticated efforts might finally crack medicine’s persistent data problem and help revolutionize medical AI.