
AI-generated Medical Data Can Sidestep Usual Ethics Review, Universities Say (nature.com)
An anonymous reader shares a report: Medical researchers at some institutions in Canada, the United States and Italy are using data created by artificial intelligence (AI) from real patient information in their experiments without the need for permission from their institutional ethics boards, Nature has learnt.
To generate what is called 'synthetic data', researchers train generative AI models using real human medical information, then ask the models to create data sets with statistical properties that represent, but do not include, human data.
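A minimal sketch of the general idea (illustrative only; real systems use far more capable generative models, and every number below is made up): fit a simple model to real records, then sample new rows that share the joint statistics without copying any individual row.

```python
# Minimal illustration of synthetic data generation: "train" a simple
# generative model (here, a multivariate Gaussian) on real records,
# then sample new rows with similar statistical properties.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for real patient records (rows = patients, cols = measurements).
real = rng.normal(loc=[120.0, 200.0, 50.0], scale=[15.0, 40.0, 12.0], size=(500, 3))

# "Train": estimate the parameters of the joint distribution.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# "Generate": sample synthetic rows that mimic the statistics of the real set.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print(synthetic.mean(axis=0), mean)  # similar summaries, no shared rows
```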
Typically, when research involves human data, an ethics board must review how studies affect participants' rights, safety, dignity and well-being. However, institutions including the IRCCS Humanitas Research Hospital in Milan, Italy, the Children's Hospital of Eastern Ontario (CHEO) in Ottawa and the Ottawa Hospital, both in Canada, and Washington University School of Medicine (WashU Medicine) in St. Louis, Missouri, have waived these requirements for research involving synthetic data.
The reasons the institutions use to justify this decision differ. However, the potential benefits of using synthetic data include protecting patient privacy, being more easily able to share data between sites and speeding up research, says Khaled El Emam, a medical AI researcher at the CHEO Research Institute and the University of Ottawa.
Holy shit, the logic fail here. (Score:5, Insightful)
To generate what is called 'synthetic data', researchers train generative AI models using real human medical information, then ask the models to create data sets with statistical properties that represent, but do not include, human data.
We're literally in a world where AI is allowed such carte blanche that medical records that need ethical review to be included in a study can just be flung to the AI as training data to side-step the ethical review? Are you fucking kidding me?
You'd still be using the actual medical data of real human beings up front as the training data. Or has the universe simply declared that AI training on anything is perfectly fine because it's AI, so logic and ethics have no need to step into the conversation? What the actual fuck?
Re: (Score:1)
"We're literally in a world where AI is allowed such carte blanche that medical records that need ethical review to be included in a study can just be flung to the AI as training data to side-step the ethical review? Are you fucking kidding me?"
What do you mean "are you fucking kidding me"? No one is telling you that, you've made it up.
The waivers are for synthetic data. Nowhere does it say that the AI training did any "side-step" of "ethical review".
If personal information is NOT included in synthetic data, what is there for an ethics board to review?
Re: (Score:3)
I think you missed the point where the AI is trained on real medical data from real people.
So the AI is fed (and stores) real patient medical data. This is done without any ethical review.
The fact that it regurgitates "synthetic" data is irrelevant.
Re: Holy shit, the logic fail here. (Score:2)
If you feed real patient data into an AI, that would require ethical approval. If you use the AI afterwards, it would not.
Re: (Score:2)
The AI would have to prove that there is no individual patient identification possible in the output.
That's probably impossible.
Re: (Score:3)
We're literally in a world where AI is allowed such carte blanche that medical records that need ethical review to be included in a study can just be flung to the AI as training data
No. It says synthetic data, which belongs to no person, is exempt from the review. There are obviously no privacy concerns for synthetic data which does not belong to anyone. Research involving developing an AI by processing and training it on a real person's or persons' medical records would obviously require ethical review.
Re: Holy shit, the logic fail here. (Score:2, Troll)
Re: (Score:3, Informative)
How do you make the synthetic data, dipshit? AI copies real data..
Hey, stupid idiot. AI copies nothing. AI is trained on real data, but the output is new data imputed by a generation algorithm: it is the reverse of a pattern match or categorization, and synthetic data is not a collection of new data from a person. I'd say it is also questionable how they can show the data is truly representative.
In any case: the ethics review is only about data being measured and collected from patients as part of a study.
Re: (Score:2)
The AI is fed (and stores) real patient medical data. This is done without any ethical review.
The fact that it regurgitates "synthetic" data is irrelevant.
Re: (Score:2)
The AI is fed (and stores) real patient medical data.
That is an irrelevant observation.
New research that only uses existing data is already exempt from review.
You have yet to explain why new synthetic data based on previous studies should require a review.
Re: (Score:2)
All new research requires ethics review.
They have to show that the patient data is sufficiently de-identified.
That is fairly easy for most data sets.
However, since no one can know how the AI derived the synthetic data, it's impossible to prove that it's de-identified.
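A minimal sketch of the kind of heuristic screen people run instead (all data here is invented for illustration): compute each synthetic row's distance to its closest real row. It can flag obvious memorization, but, as the parent says, it cannot prove de-identification.

```python
# Heuristic check, not a proof: measure how close each synthetic row
# comes to its nearest real row. Tiny distances suggest memorization;
# comfortable distances still don't *prove* de-identification.
import numpy as np

def min_distance_to_real(synthetic: np.ndarray, real: np.ndarray) -> np.ndarray:
    # Pairwise Euclidean distances; one nearest-real-record distance per synthetic row.
    diffs = synthetic[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

rng = np.random.default_rng(1)
real = rng.normal(size=(200, 4))
synthetic = rng.normal(size=(200, 4))

dcr = min_distance_to_real(synthetic, real)
print(f"closest approach to a real record: {dcr.min():.3f}")  # 0.0 would mean an exact copy
```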
Re:Holy shit, the logic fail here. (Score:5, Interesting)
There's a much deeper potential Ethical Holy Shit here; this is a big can of worms being opened. That of synthetic (read "made up") data with unverifiable claims to credibility, produced through AI from a real data set.
The reason medical data is needed in research is to _verify_ the effectiveness of treatments, drugs, or patterns that are later ultimately used to justify treatment plans or decisions about what medicines to legalize, prescribe and use.
Feeding real data to AI and then asking it to "make up a data set that's the same" breaks the link between real hard data collected by physicians and clinicians, which is _verifiable_ and _trustable_. There is no way aside from repeating the work with real data sets to verify that the data in the "synthetic data" set is representative of the real world.
If this synthetic data set is used to make decisions, significant risk can be introduced in the AI process, as 1) no verification is possible and 2) the ability to "ask AI" to "help clean up the data" opens the door to either open abuse or accidental skewing of the data. Even non-malicious but hopeful AI prompts may cause the AI to deliver the results the researcher or pharmaceutical company "wants to find".
That could lead to incorrect results, desired or not, which would lead to bad decisions about patient treatment or diagnosis, wasting money at best and harming patients at worst.
Ethically, I think this is standing unroped on a slippery slope uphill of a cliff...
Re:Holy shit, the logic fail here. (Score:5, Insightful)
The purported claim is that "because the AI-generated data do not include data from actual humans, they do not need ethics review to use."
But if the data only represent actual patients in a "statistical" sense (whatever that means), how can the research be CERTAIN that it has captured appropriate signals or effects that are observed in such data? And I say this as a statistician who has over a decade of experience in statistical analysis of clinical trials.
There is a fundamental principle at work here, one that researchers cannot take the better part of both ways of the argument: any meaningful inference must be drawn on real world data, and if such data is taken from humans, it must pass an ethics board review. If one argues that AI-generated data doesn't need the latter because it is a fabrication, then it doesn't meet the standard for meaningful inference. If one argues that it does meet the standard, then no matter how the data was transformed from real-world patient sources, it requires ethics board review.
In biostatistics, we use models to analyze data to detect potential effects, draw hypotheses or make predictions, and test those hypotheses to make probabilistic statements--i.e., statistical inferences--about the validity of those hypotheses. This is done within a framework that obeys mathematical truth, so that as long as certain assumptions about the data are met, the results are meaningful. But what "statistically naive" people consistently fail to appreciate, especially in their frenzy to "leverage" AI everywhere, is that those assumptions are PRETTY FUCKING IMPORTANT and using an LLM to generate "new" data from existing, real-world data is like making repeated photocopies of an original--placing one model on top of another model. LLMs will invent signals where none originally existed. LLMs will fail to capture signals where one actually existed.
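The photocopy analogy is easy to demonstrate with a toy simulation (everything below is illustrative, not from the article): repeatedly refit a simple model to its own synthetic output and watch the estimated spread drift away from the truth.

```python
# Toy "repeated photocopies": refit a Gaussian to its own synthetic
# samples, generation after generation. Each refit layers sampling
# noise on top of the last, so the estimate drifts from the truth.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=300)  # "real" data, true sd = 1.0

for generation in range(1, 6):
    mu, sigma = data.mean(), data.std()          # fit the current generation
    data = rng.normal(mu, sigma, size=300)       # sample the next "photocopy"
    print(f"gen {generation}: estimated sd = {sigma:.3f}")
```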
Re: (Score:2)
I'm not sure if you really understood the advantage here.
Let's say I want to sample the ages of Slashdotters. If I ask you for your age, it is very personal data, so selecting some random Slashdotters and asking their age is a no-go. Now let's instead just count. We create a histogram with one bin per age, and everyone of that age increments the counter by one. You usually get some kind of Gaussian from that. When I now want to sample from the age distribution of Slashdotters, I don't choose one of you and ask for your age; I just draw from the histogram.
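A minimal sketch of that idea in Python (the ages are invented for illustration): share only the per-age counts, then sample synthetic ages from the resulting distribution without ever pointing at a specific person.

```python
# Histogram idea: publish only per-age counts, then draw "synthetic
# Slashdotters" from that distribution, never from individual records.
import numpy as np

rng = np.random.default_rng(7)
ages = rng.normal(45, 10, size=1000).astype(int)   # stand-in for real (private) ages

bins = np.arange(ages.min(), ages.max() + 1)
counts = np.array([(ages == b).sum() for b in bins])
probs = counts / counts.sum()                      # the histogram is all that's shared

synthetic_ages = rng.choice(bins, size=1000, p=probs)
print(synthetic_ages.mean(), ages.mean())          # similar distribution, no lookups
```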
Re: (Score:2)
What you describe is essentially a form of bootstrapping, which is a legitimate statistical method. However, there are important limitations that cannot be overlooked.
First, the constructed data are still being created from real data. Ethics is not just about preserving patient privacy, although that is a very important aspect. It's also about taking into consideration how the data will be used. Does the patient consent to this use, and if they are unable to consent, how should this be taken into consideration?
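For contrast, here is what classical bootstrapping looks like (a toy sketch with made-up data). Note that every resampled row is a verbatim real record, which is exactly why the ethics question doesn't simply vanish.

```python
# Classical bootstrap: resample the *real* rows with replacement.
# Every row in the bootstrap sample is a real patient's record.
import numpy as np

rng = np.random.default_rng(3)
real = rng.normal(size=(100, 3))                   # stand-in for real records
idx = rng.integers(0, len(real), size=len(real))   # indices drawn with replacement
bootstrap_sample = real[idx]                       # still real patients' data
print(bootstrap_sample.shape)
```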
Re: (Score:2)
The point is that training a model can capture the essence of the data and allow you to draw samples that have the same characteristics as the original data while protecting individuals. If you want to argue that you don't want the data to be used at all (e.g. because you think the science could create things you don't like), it doesn't matter whether we talk about creating models or using real data with the necessary guardrails to protect individuals; you would want to stop the whole thing. If you start with wanting the science stopped, that is a different argument.
Re: (Score:2)
I think you're arguing that you want to correlate data inside a record and want to examine outliers (which may be relevant). Then there is just no way around using real data, but you should take care with privacy measures and your ethics committee. Using a generative model will create plausible records from a plausible distribution, but will by definition contain very little data about outliers.
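That outlier point is easy to show concretely (a toy sketch; the "lab values" are invented): fit a Gaussian to skewed data, and the extreme values, which are often the clinically interesting ones, all but disappear from the synthetic sample.

```python
# Outlier suppression: fit a Gaussian to skewed "real" lab values,
# then sample synthetic ones. The bulk looks similar, but the
# extreme values mostly vanish from the synthetic set.
import numpy as np

rng = np.random.default_rng(5)
real = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # skewed, heavy right tail
synthetic = rng.normal(real.mean(), real.std(), size=10_000)

print("real values > 10:     ", (real > 10).sum())       # roughly a hundred
print("synthetic values > 10:", (synthetic > 10).sum())  # likely zero or one
```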
Re: (Score:2)
We're literally in a world where AI is allowed such carte blanche that medical records that need ethical review to be included in a study can just be flung to the AI as training data to side-step the ethical review? Are you fucking kidding me?
The argument is that revealing a person's health information to another person is a violation of personal medical privacy, but this isn't revealing it to a person.
There are multiple cans of worms opened here (some of them being discussed by the other commenters), but it seems a reasonable argument that it's not a HIPAA violation (...except that we've sometimes seen a carefully worded query cause some LLMs to output their training data).
Re: (Score:2)
There's generally lots (and lots) of oversight for the training part. Many universities and health research centres have even set up review committees to approve AI projects before they can be submitted to the regular ethics review.
I think the idea here is that once the model is trained you can use it to produce data that's better anonymized. Many different kinds of medical data are re-identifiable in theory. Patterns of vessels in your MRI, a panel of a hundred blood protein levels, DNA, whatever, so that kind of data can never be fully anonymized.
Re: (Score:2)
Everything, of course, but in this case AI is not being used for medicine, only to assist in patient privacy. /. is truly filled with idiots. Do we not teach reading any more?
Burdensome ethics rules slow research, so... (Score:2)
...researchers find a workaround
Unfortunately, this AI-fabricated "statistical" data may be useless or worse
Medical research is really hard with real data, some of which is bad or skewed
Using AI-invented, "kinda close" data makes it worse
Human Data (Score:5, Insightful)
Typically, when research involves human data, an ethics board must review how studies affect participants' rights, safety, dignity and well-being.
It's the initial physical collection of the data that puts people at risk. And once an ethics board has reviewed these procedures for safety and anonymization, there isn't really an issue for the post-collection information. There's not much one can do with my cholesterol level as one of many points on a graph that can come back and hurt me.
What we really need to worry about is the AI bullshit factor. The NIH or FDA can examine a spreadsheet for formula errors, but what goes through ChatGPT is better chewed than any cow's cud.
Re: (Score:2)
Why do you think ChatGPT should be used? Is that the only form of AI you know?
Irresponsible (Score:2)
Re: (Score:2)
Yes. Honestly, there should be an "ethics" review of every piece of research involving data prior to publication. The inquiry should be how the data is protected procedurally to ensure high-quality, accurate, representative data and sound reasoning in terms of logic, mathematics, and statistics, with disclosure of uncertainties and potential for error.
fake data (Score:2)
Sure, fake data is just what we need. Yup. Yup.
I'd say no (Score:2)
Considering there's no human in the loop, I'd argue these need INTENSIVE ethics reviews, not "none".
That's fucked up.
The AI shell game (Score:2)
Sidestep copyright? Check.
Sidestep all other IP law? Check.
Put content designers/lawyers/coders/professionals out of work because screw human dignity for more pointless concentration of capital? Check.
And now...
Sidestep HIPAA and other medical ethics laws? Check.
They cry, "It's AI. It's different. Believe us, we're doing something NEW and UNIQUE."
Bollocks. Training data is simply created data, with the same protections it had when it was created. Creation is new. Actual derivative works are new. Generating is not.
putting the ethics to the side (Score:2)
Putting the ethics question to the side, how is it possible to have a high degree of confidence that the synthetic data is statistically similar enough to the original training data, specifically in the way(s) that may matter for its intended use? The only way I can think of is comparing it to the original data, in which case the ethical question can no longer be ignored.
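That catch is visible even in the simplest fidelity check. A toy sketch (made-up numbers, using SciPy's two-sample KS test): the real data has to sit on one side of the comparison, so the validation step itself touches the sensitive records.

```python
# Any fidelity check (here, a two-sample Kolmogorov-Smirnov test on one
# column) needs the real data on one side of the comparison.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(9)
real = rng.normal(loc=200.0, scale=40.0, size=1_000)     # stand-in real column
synthetic = rng.normal(real.mean(), real.std(), size=1_000)

stat, p_value = ks_2samp(real, synthetic)                # touches the real records
print(f"KS statistic = {stat:.3f}, p = {p_value:.3f}")   # small stat => similar shapes
```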
It reminds me of a trick on one of the old West Wing episodes, when the press secretary is asked if the President has considered calling
Great, but pointless. What does the IRB say? (Score:2)
Also, not whoring for +5 Informative here, but what does the IRB have to say about this [wikipedia.org]?
Because unless American universities are telling researchers they can skip IRB, it doesn't matter what they say. And they are opening themselves up to an immediate and punishing lawsuit if they are allowing human research to skip IRB. Like it's probably on the docket right now.
Of course, such research can crop up in other, less ethical, countries. Oh, who am I kidding? US ethics are shot to hell right now as well.