Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material (404media.co)
samleecole writes: The LAION-5B machine learning dataset used by Google, Stable Diffusion, and other major AI products has been removed by the organization that created it after a Stanford study found that it contained 3,226 suspected instances of child sexual abuse material, 1,008 of which were externally validated.
LAION told 404 Media on Tuesday that out of "an abundance of caution," it was taking down its datasets temporarily "to ensure they are safe before republishing them." According to a new study by the Stanford Internet Observatory shared with 404 Media ahead of publication, the researchers found the suspected instances of CSAM through a combination of perceptual and cryptographic hash-based detection and analysis of the images themselves.
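For readers wondering what "perceptual and cryptographic hash-based detection" looks like in practice, here is a minimal sketch; it is not the Stanford Internet Observatory's actual pipeline. It assumes the Python Pillow and imagehash packages, and the two hash sets are hypothetical stand-ins for the known-CSAM hash lists that clearinghouses distribute to vetted partners.

```python
# Minimal sketch of hash-based screening, not the actual study pipeline.
import hashlib

import imagehash            # perceptual hashing (pip install ImageHash)
from PIL import Image       # pip install Pillow

KNOWN_MD5S = set()          # hypothetical: exact hashes of known material
KNOWN_PHASHES = set()       # hypothetical: perceptual hashes (hex strings)
MAX_PHASH_DISTANCE = 8      # Hamming distance treated as a perceptual match

def matches_known_material(path: str) -> bool:
    """True if the file matches a known hash, exactly or perceptually."""
    # Cryptographic hash: catches byte-identical copies only.
    with open(path, "rb") as f:
        if hashlib.md5(f.read()).hexdigest() in KNOWN_MD5S:
            return True

    # Perceptual hash: survives resizing and re-encoding, so matches are
    # judged by Hamming distance rather than exact equality.
    phash = imagehash.phash(Image.open(path))
    return any(phash - imagehash.hex_to_hash(h) <= MAX_PHASH_DISTANCE
               for h in KNOWN_PHASHES)
```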
State of the art (Score:3)
Generative AI models can't distinguish abusive from non-abusive content. If you're training a model, I'd think one of the first steps would be classifying the content so you can filter these things out.
Re:State of the art (Score:5, Insightful)
When humans themselves can have trouble distinguishing between a picture a parent takes of their toddler playing in the bathtub and actual pornographic content, how do you expect AI to do it?
Re:State of the art (Score:4, Insightful)
Is the picture of a toddler playing in the bathtub pornographic? To the parent? To others?
How will anyone be trained for this, including humans? Could the parent not also be the abuser? If so, should all pictures of children in a bathtub be considered pornographic?
I think shariah law says they should.
Shariah law isn't practiced in the western world. Yet. We're working our way up to it slowly.
Re: (Score:2)
Because I don't expect AI to be making judgement calls like that based on the context of the photo. We know AI isn't at that level yet. But you would think it could find child and nudity together, filter for that, and flag it. Then a human could look further. The fact that we are training AI on photos that include child porn means the AI will normalize child porn, and that is a major issue.
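A rough sketch of the pre-filter described above: flag an image when a "child" signal and a "nudity" signal fire together, pull it from training, and queue it for human review. The two scores are assumed to come from hypothetical classifiers; they are not real library calls.

```python
# Illustrative pre-filter, with both classifier scores as assumed inputs.
from dataclasses import dataclass
from typing import Optional

NSFW_THRESHOLD = 0.7    # hypothetical cut-off for the nudity classifier
MINOR_THRESHOLD = 0.5   # hypothetical cut-off for the apparent-minor classifier

@dataclass
class ReviewFlag:
    url: str
    nsfw_score: float
    minor_score: float

def prefilter(url: str, nsfw_score: float, minor_score: float) -> Optional[ReviewFlag]:
    """Return a flag (exclude from training, send to a human) when both
    signals fire; return None when the image passes the filter."""
    if nsfw_score >= NSFW_THRESHOLD and minor_score >= MINOR_THRESHOLD:
        return ReviewFlag(url, nsfw_score, minor_score)
    return None
```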
Re: (Score:3)
Then a human could look further.
That would work if they were using a few dozen or a few hundred pictures. Even a few thousand. But they're not, they're using billions of pictures.
There's no viable answer here that doesn't involve tossing a huge percentage of their training images.
Re: (Score:2)
If they can't check their sources, then they should not use the data.
Congratulations, you've just invalidated the entire science of statistics.
Your mother must be proud.
Re: (Score:1)
There's no viable answer here that doesn't involve tossing a huge percentage of their training images.
Yeah, but that's not a good argument against doing so. Everyone's been training their AIs on massive piles of unverified data, which is a pathologically bad idea, and this is just one tiny example of the many reasons why.
Re: (Score:2)
It's not an argument against anything, it's reality. Having humans check each image is impossible, no matter what you or anyone else thinks about it. Argue against the entire idea, sure, and many people do, but advocating that they do something that literally cannot be done just makes you look stupid or crazy.
Re: (Score:1)
Come on, it's just a measly few billion web-scraped images, and we have a global employment crisis that people are worried AI will make worse. How is it not obvious to you that these two problems are also each other's solutions?
Re: (Score:2)
Everyone's been training their AIs on massive piles of unverified data, which is a pathologically bad idea
Thank you! I totally agree with you. Training these systems with piles of unverified data is akin to throwing shit at the wall and selling whatever sticks.
We need a way to curate the data at scale. We don't have an answer to that problem. I don't think we even know how to have a technical conversation about how to do it. But this is a start (culling a large percentage of the images).
Not great. Perhaps not even good. But we have to start somewhere and see how we can make it better.
Re: (Score:2)
Training these systems with piles of unverified data is akin to throwing shit at the wall and selling whatever sticks.
I've met people who were deeply misguided, yet I didn't come out of those conversations thinking that governments aren't necessary, that plastic is dangerous, that redheads don't have souls, or that people in general do have souls.
I can tolerate a significant amount of bad data. Why should we build LLMs that require perfect input data? (Why should we improve the data rather than the LLM?)
Re: (Score:2)
I've met people who were deeply misguided, yet I didn't come out of those conversations thinking that governments aren't necessary, that plastic is dangerous, that redheads don't have souls, or that people in general do have souls.
Good for you. I have no clue how this is relevant to my post.
I can tolerate a significant amount of bad data.
Any system with enough complexity (a human included) is required to tolerate a % of bad data. Depending on the task at hand, the margin of error will vary, however.
Tell me your requirements, and I might tell you the margin of error needed to efficiently meet those requirements.
Why should we build LLMs that require perfect input data?
I never made the claim that we require perfect input data. Only that we need to have a way to curate it.
These are two distinct propositions (from both technical and epistemological standpoints).
Re: (Score:2)
Okay, fair enough. But the pretty good data with errors I'm talking about is an analogy to LAION-5B. I thought you were using hyperbole and I meant to be discussing the same proposition--imperfect data.
Re: (Score:1)
That's the hubris of someone who has already been fed a relatively carefully curated set of experiences for nearly two decades but has forgotten that this is the case and taken for granted the benefits of said experiences. I guarantee you that if you had seen those images while your brain was still developing, it would have fucked you up too.
Re: (Score:2)
There's no viable answer here that doesn't involve tossing a huge percentage of their training images.
Until better solutions exist, this is an acceptable trade-off for most systems. Either you throw out a large percentage of the training images (those that specifically contain a child and potential nudity), or, as in this case, you end up throwing out the entire data set.
We just don't have a good answer yet. But that doesn't (and shouldn't) stop us from taking a hit (in lost training data) in order to save as much as possible.
And if excluding these images is so unacceptable, we must ask ourselves what t
Re: (Score:2)
When humans themselves can have trouble distinguishing between a picture a parent takes of their toddler playing in the bathtub and actual pornographic content, how do you expect AI to do it?
Don't let the human race be defined by a bunch of moronic Karens.
Re: (Score:2)
When humans themselves can have trouble distinguishing between a picture a parent takes of their toddler playing in the bathtub and actual pornographic content, how do you expect AI to do it?
Most humans can do this, and it is certain that the images encountered in the data set are not of babies in bathtubs. These were images of sexual abuse. I am not sure what the confusion is here.
Now, there is a way to train an AI, but it means training it with actual images of sexual abuse. Horrifically, law enforcement agencies have data sets of these images (which is how they track perpetrators).
So, in theory, an AI system could be trained to recognize them. There are legal implications here, however, because
Re: (Score:2)
Under UK law, child sexual abuse images are anything that a paedophile finds stimulating. That can include things like clothing catalogues and clothed images stolen from social media, if they are stored for the purpose of arousing the accused.
For an AI, presumably the owners don't want it to generate such material on demand. I don't know if it's better to remove it from the training material, or try to adjust the prompts to prevent it being generated.
Documenting war crimes vs. plain old child porn (Score:2)
When it comes to classifying images of abuse as legal or illegal, think about "napalm girl" from Vietnam and other wartime journalism:
If someone who was of an evil mind took a 9 year old girl, burned her with napalm, took a photo that was meant to "look like a Vietnam War photo," and uploaded it to some corner of the dark web or for that matter the "open web," I wouldn't be surprised if that photo was deemed illegal. In fact, I would be shocked if it wasn't. But "The Terror of War," a Pulitzer-prize-winning photo docum
Re: (Score:2)
This is why you want curated image sets to train AI.
You don't need to ask permission if the image was publicly available at no cost in the first place. You do need to ask permission for specialized data sets (e.g. artists, news organizations, and stock artwork sites).
Scraping Facebook and Instagram is just going to put you in a heap of trouble. Especially from child-protection groups, because even if the AI "does know" because an image was properly tagged, e.g. "child in bathtub", that may result in the AI l
Re: (Score:2)
They have filters. But it's a dataset of five billion images. It doesn't take a very high false negative rate for 1k instances to slip through. And it's impossible to have humans manually validate five billion images (and even *they* would have a false negative rate as well).
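A back-of-the-envelope illustration of that point; every number below is an assumption picked for the example, not a figure from the study.

```python
# Assumed numbers only: dataset size is roughly LAION-5B's, the rest is made up.
total_images = 5_850_000_000            # approximate size of the dataset
assumed_problematic_in_crawl = 100_000  # hypothetical count before filtering
assumed_filter_recall = 0.99            # hypothetical: filter catches 99%

slip_through = assumed_problematic_in_crawl * (1 - assumed_filter_recall)
share = slip_through / total_images
print(f"{slip_through:,.0f} images slip through ({share:.7%} of the dataset)")
# -> 1,000 images slip through (0.0000171% of the dataset)
```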
LAION (Score:3, Informative)
Re: (Score:3)
If you are monetizing it, you should definitely be held liable.
If your process is so broken that you cannot verify what you are selling, you shouldn't be selling it. But then, it's tech bros, and "move fast and break things" is the new design paradigm.
Re: (Score:2)
Definitely. If there is verified material in there, they should go to prison for commercial distribution.
Re: (Score:2)
I think you'd be at home in the Chinese Central Committee.
Here, in civilized legal systems, we have the legal concept of mens rea.
Re: (Score:2)
Bravo. But for how much longer will this be true?
I continue to be amazed at how the tide has turned in America. We used to be against all this totalitarian police state absolutist shit.
Re: (Score:2)
Holding companies accountable for the products they sell is totalitarian?
Re:LAION (Score:4, Insightful)
Every ISP and search engine makes money from child pornography. Should all those companies be shut down and their owners & employees jailed?
In this case, they identified a problem with about 0.00006% of their data set and they are taking action to correct it. But you want blood?
Karma is a bitch. I hope you get held to the same standard of perfection someday so you can learn something about life.
Re: (Score:2)
If a thief hides a body in your car, and the police prove you didn't do it, should you still go to fucking jail for tampering with a body because you drove to work?
Think before you fucking post, man.
Re: (Score:2)
Fuck the Fire Department, by Vincent E. L. (with lyrics and funk):
https://www.youtube.com/watch?... [youtube.com]
Re:LAION (Score:5, Insightful)
People lose track of the fact that the problem is the abuse itself.
Look, that man is robbing an old lady! Get out your phone, take a picture for police! vs
Look, CS abuse is happening! Cover your eyes and run lest CSAM get encoded in the neural structure of your memory!
If you take it too far, then at a certain point focusing on stopping the proliferation of what is in fact the evidence of a crime becomes a cover-up for the original, real crime, which is the abuse.
Re: (Score:2)
For CP possession and distribution? I doubt it. In most jurisdictions they do not have to prove intent for that. Yes, that is fundamentally broken.
Re:LAION (Score:4, Informative)
In most jurisdictions they do not have to prove intent for that.
You absolutely must. There is no CP indictment that doesn't start with the words, "knowingly possessed or distributed..."
Re: (Score:2)
Although I might make the case that if you're selling on a library of pictures, the company as a whole ought to "know" what the pictures they are selling *are*. For all sorts of legal reasons including copyright, CSAM, and just basic quality control/fitness for purpose.
Just because you sold me a couple million bolts doesn't mean you can disclaim them being a specific grade of bolt suitable for a specific purpose. And companies are held liable for messing that up all the time.
Re: (Score:2)
Although I might make the case that if you're selling on a library of pictures, the company as a whole ought to "know" what the pictures they are selling *are*.
You could try to make that legal argument, but your chance of succeeding is very low. That's like trying to sue Google because they have revenge porn for 24 hours before someone catches it. Your criminal liability begins the second you know about it, not a second before.
Just because you sold me a couple million bolts doesn't mean you can disclaim them being a specific grade of bolt suitable for a specific purpose. And companies are held liable for messing that up all the time.
Not criminally, that's the distinction.
You don't need mens rea to be liable for harming someone.
Re:LAION (Score:4, Insightful)
They are not selling it. It's also not a dataset of images, it's a dataset of links to images. Precisely because they wanted to limit liability in cases like this (or copyright).
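For context, a sketch of what "a dataset of links" means in practice: the release is a set of metadata tables with a URL, a caption, and a few scores per row, and downloading the images is a separate step. The file name and the exact column names below are illustrative and vary between LAION releases.

```python
# Illustrative only: file name and column names are assumptions, not the
# exact schema of any particular LAION release.
import pandas as pd

df = pd.read_parquet("laion_metadata_part_00000.parquet")  # hypothetical path

# Each row is a pointer plus a caption; fetching the actual images is a
# separate bulk-download step (commonly done with a tool like img2dataset).
print(df[["URL", "TEXT"]].head())

# Rows tagged as likely unsafe can be dropped before anything is downloaded,
# which is the kind of filtering LAION says it already applied.
kept = df[df["NSFW"] == "UNLIKELY"]
```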
Re: LAION (Score:2)
The issue, though, is that a lot of image-generation companies used this dataset to train their own products.
So if they go down, they are going to drag LAION down and vice versa.
Re: (Score:2)
LAION is a non-profit founded by a bunch of academics to make open large datasets available.
Re: (Score:2)
"Ooops sorry, my image crawling algorithm that was pointing to www.underage-lolitas.com collected some child porn... bad, BAD automated algorithm!!! Btw, here's our image generator that uses those images"
Re: (Score:1)
But on the plus side, now we know why the only thing you can get all these image generators to generate reliably is porn.
Re: (Score:3)
Your argument essentially boils down to saying that all the people with the horrible job of actually looking at and verifying these kinds of pictures must never create anything, because their creative process is influenced by their experiences and the things they have seen.
Re: (Score:2)
only if "all people" also take text prompts to "create" illegal porn.
Re: (Score:2)
Your statement is obvious nonsense. Stop anthropomorphizing software.
Re: (Score:2)
It is the exact same argument being made for why AI should be allowed to create new books after 'reading' a treasure trove of existing books, because that is how humans do it.
Re: (Score:2)
Humans write books based on what they think other humans will find interesting.
Re: So every Stable Diffusion Model is illegal? (Score:2)
This argument has been beaten harder than a dead horse. There is no conclusive evidence that current AI models think and create exactly like humans do.
And even if they did, we would still be enforcing laws and enacting punishments on the offending companies that allowed this to happen. If a company or school had their artists study CSAM for creative output, they would also be in major hot water.
LOL! (Score:3)
What was that company motto? Do no evil? I see their new direction is... exciting.
This is why ... (Score:2)
It's also interesting to note that the individual site moderation depended upon to filter out such undesirable content appears to be failing miserably. CP makes it through. But Trump opinions, "OMG! Block them!" It says something about the values of the moderators.
Re: (Score:2)
which "trump opinions" are "blocked"
Those that we will never see after his being banned from Twitter.
Re:This is why ... (Score:4, Informative)
I'm not sure I would call it failing miserably. They detected around 3k potentially problematic images, of which 1k were verified to really BE a problem, out of a dataset of nearly 6 billion images. That's a very, very, very tiny fraction. It shouldn't be there at all, true, but failing miserably should be a higher threshold.
Re: (Score:3)
>the kiddie porn could have been labeled as such, excluded from training and perhaps even used to raise an alarm.
I'm not even sure about that. Kiddie porn is such a hot button that if I ever had any land on my server for any reason other than "I saw that guy over there download it officer, here are the log files"... I'd destroy it.
"Tag it and call the cops" if there's anything other than rock-solid documentation showing exactly who (who isn't you) is guilty just seems like too big a risk. Possession is
Re: (Score:1)
It's also interesting to note that the individual site moderation depended upon to filter out such undesirable content appears to be failing miserably. CP makes it through. But Trump opinions, "OMG! Block them!" It says something about the values of the moderators*.
*Awaiting my -1 Troll score.
Well, when you start paraphrasing Hitler, I can see why people would want the "opinions" removed from their site. Let's take a look at the latest from the great Orange one.
“They’re destroying the blood of our country. That’s what they’re doing. They’re destroying our country. They don’t like it when I said that — and I never read Mein Kampf,” said Trump, referencing Hitler’s manifesto. “They could be healthy, they could be very unhealthy, they could
That's not the only problem (Score:2)