
Largest Dataset Powering AI Images Removed After Discovery of Child Sexual Abuse Material (404media.co)

samleecole writes: The LAION-5B machine learning dataset used by Google, Stable Diffusion, and other major AI products has been removed by the organization that created it after a Stanford study found that it contained 3,226 suspected instances of child sexual abuse material, 1,008 of which were externally validated.

LAION told 404 Media on Tuesday that out of "an abundance of caution," it was taking down its datasets temporarily "to ensure they are safe before republishing them." According to a new study by the Stanford Internet Observatory shared with 404 Media ahead of publication, the researchers found the suspected instances of CSAM through a combination of perceptual and cryptographic hash-based detection and analysis of the images themselves.
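
According to the summary, the detection combined cryptographic hashing (exact matches against known material) with perceptual hashing (near-duplicate matches that survive resizing or re-encoding). A minimal Python sketch of that general idea, assuming a locally held set of known-bad hashes and a placeholder distance threshold; the actual hash databases, tools, and thresholds used by the researchers are not described here:

    # Hash-based screening sketch. KNOWN_BAD_SHA256 / KNOWN_BAD_PHASHES are
    # placeholders; in practice such lists are maintained by child-safety
    # organizations and are not distributed with a dataset.
    import hashlib

    from PIL import Image
    import imagehash  # pip install ImageHash

    KNOWN_BAD_SHA256 = set()      # exact (cryptographic) hashes of known material
    KNOWN_BAD_PHASHES = []        # perceptual hashes of known material
    PHASH_DISTANCE_THRESHOLD = 6  # assumed Hamming-distance cutoff

    def screen_image(path):
        """Return a match reason if the image matches known material, else None."""
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() in KNOWN_BAD_SHA256:
                return "exact cryptographic hash match"
        phash = imagehash.phash(Image.open(path))
        for bad in KNOWN_BAD_PHASHES:
            if phash - bad <= PHASH_DISTANCE_THRESHOLD:  # Hamming distance
                return "near-duplicate perceptual hash match"
        return None

Cryptographic hashes only catch byte-identical files; the perceptual step is what catches re-encoded or slightly altered copies, at the cost of needing a tuned distance threshold.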

  • by Virtucon ( 127420 ) on Wednesday December 20, 2023 @11:56AM (#64093617)

    A generative AI model can't distinguish abusive from non-abusive content. If you're training a model, I'd think one of the first steps would be classifying the content so these things can be filtered out.

    • by Calydor ( 739835 ) on Wednesday December 20, 2023 @12:10PM (#64093675)

      When humans themselves can have trouble distinguishing between a picture a parent takes of their toddler playing in the bathtub and actual pornographic content, how do you expect AI to do it?

      • by dirk ( 87083 )

        Because I don't expect AI to be making judgement calls like that based on the context of the photo; we know AI isn't at that level yet. But you would think it could find child and nudity together, filter for that, and flag it. Then a human could look further. The fact that we are training AI on photos that include child porn means the AI will normalize child porn, and that is a major issue.
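
        As a rough sketch of that flag-and-review idea: run a nudity detector and an apparent-age detector, and route images that trip both to a human reviewer instead of straight into the training set. The two classifier callables and both thresholds below are hypothetical placeholders, not real APIs:

            # Flag-for-review sketch; nsfw_score and minor_score are assumed,
            # caller-supplied classifiers returning probabilities in [0, 1].
            from typing import Callable

            NSFW_THRESHOLD = 0.7    # assumed cutoff for the nudity detector
            MINOR_THRESHOLD = 0.5   # assumed cutoff for the apparent-age detector

            def route_image(image_bytes: bytes,
                            nsfw_score: Callable[[bytes], float],
                            minor_score: Callable[[bytes], float]) -> str:
                """Decide whether an image goes to training, human review, or the NSFW path."""
                nsfw = nsfw_score(image_bytes)
                minor = minor_score(image_bytes)
                if nsfw >= NSFW_THRESHOLD and minor >= MINOR_THRESHOLD:
                    return "human_review"  # child + nudity both flagged: a person looks further
                if nsfw >= NSFW_THRESHOLD:
                    return "adult_filter"  # ordinary NSFW filtering path
                return "train"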

        • by taustin ( 171655 )

          Then a human could look further.

          That would work if they were using a few dozen or a few hundred pictures. Even a few thousand. But they're not, they're using billions of pictures.

          There's no viable answer here that doesn't involve tossing a huge percentage of their training images.

          • If they can't check their sources, then they should not use the data. Automation difficulty is not an excuse. They make something in order to SELL it; why should we care how hard it is to make it legally?
            • by taustin ( 171655 )

              If they can't check their sources, then they should not use the data.

              Congratulations, you've just invalidated the entire science of statistics.

              Your mother must be proud.

              • Way to misinterpret my comment, congratulations. Have you seen many statistics papers where the data sources are not included? Have you seen many papers where authors present statistics on topics like rape, murder, or disabilities without having to go through piles of ethics forms and LOTS of questions about the use of the data? You think AI should be immune to all that?
          • There's no viable answer here that doesn't involve tossing a huge percentage of their training images.

            Yeah, but that's not a good argument against doing so. Everyone's been training their AIs on massive piles of unverified data, which is a pathologically bad idea, and this is just one tiny example of the many reasons why.

            • by taustin ( 171655 )

              It's not an argument against anything, it's reality. Having a human check each image is impossible, no matter what you or anyone else thinks about it. Argue against the entire idea, sure, and many people do, but advocating that they do something that literally cannot be done just makes you look stupid or crazy.

              • Come on, it's just a measly few billion web-scraped images and we have an employment crisis globally which people are worried AI will make worse. How is it not obvious to you that these two problems are also each other's solutions?

            • Everyone's been training their AIs on massive piles of unverified data, which is a pathologically bad idea

              Thank you! I totally agree with you. Training these systems with piles of unverified data is akin to throwing shit at the wall and selling whatever sticks.

              We need a way to curate the data at scale. We don't have an answer to that problem. I don't think we even know how to have a technical conversation about how to do it. But this is a start (culling a large % of images).

              Not great. Perhaps not even good. But it has to start somewhere, and then we can see how to make it better.

              • by piojo ( 995934 )

                Training these systems with piles of unverified data is akin to throwing shit at the wall and selling whatever sticks.

                I've met people who were deeply misguided, yet I didn't come out of those conversations thinking that governments aren't necessary, that plastic is dangerous, that redheads don't have souls, or that people in general do have souls.

                I can tolerate a significant amount of bad data. Why should we build LLMs that require perfect input data? (Why should we improve the data rather than the LLM?)

                • I've met people who were deeply misguided, yet I didn't come out of those conversations thinking that governments aren't necessary, that plastic is dangerous, that redheads don't have souls, or that people in general do have souls.

                  Good for you. I have no clue how this is relevant to my post.

                  I can tolerate a significant amount of bad data.

                  Any system with enough complexity (a human included) has to tolerate a percentage of bad data. The margin of error will vary depending on the task at hand, however.

                  Tell me your requirements, and I might tell you the margin of error needed to efficiently meet those requirements.

                  Why should we build LLMs that require perfect input data

                  I never made the claim that we require perfect input data. Only that we need to have a way to curate it.

                  These are two distinct propositions (from both technical and epistemic standpoints).

                  • by piojo ( 995934 )

                    Okay, fair enough. But the pretty good data with errors I'm talking about is an analogy to LAION-5B. I thought you were using hyperbole and I meant to be discussing the same proposition--imperfect data.

                • That's the hubris of someone who has been fed a relatively carefully curated set of experiences for nearly two decades, but has forgotten that this is the case and takes the benefits of those experiences for granted. I guarantee you that if you had seen those images while your brain was still developing, it would have fucked you up too.

          • There's no viable answer here that doesn't involve tossing a huge percentage of their training images.

            Until better solutions exist, this is an acceptable trade-off for most systems. Either you throw out a large percentage of the training images (specifically those containing a child and potential nudity), or, as in this case, you end up throwing out the entire data set.

            We just don't have a good answer yet. But that shouldn't stop us from taking a hit (in loss of training data) in order to save as much as possible.

            And if excluding these images is so unacceptable, we must ask ourselves what t

      • One can check the sources, and if they are questionable, refrain from publishing. Not that complicated. But it costs time and human effort, and the latter is increasingly unpopular in these mega-scale projects.
      • When humans themselves can have trouble distinguishing between a picture a parent takes of their toddler playing in the bathtub and actual pornographic content, how do you expect AI to do it?

        Don't let the human race be defined by a bunch of moronic Karens.

      • When humans themselves can have trouble distinguishing between a picture a parent takes of their toddler playing in the bathtub and actual pornographic content, how do you expect AI to do it?

        Most humans can do this, and it is certain that the images encountered in the data set are not of babies in bathtubs. These were images of sexual abuse. I am not sure what the confusion is here.

        Now, there is a way to train an AI for this... but it means training it with actual images of sexual abuse. Horrifically, law enforcement agencies have data sets of these images (which is how they track perpetrators).

        So, in theory, an AI system could be trained to recognize them. There are legal implications here, however, because

        • by AmiMoJo ( 196126 )

          Under UK law, child sexual abuse images are anything that a paedophile finds stimulating. That can include things like clothing catalogues and clothed images stolen from social media, if they are stored for the purpose of arousing the accused.

          For an AI, presumably the owners don't want it to generate such material on demand. I don't know if it's better to remove it from the training material, or try to adjust the prompts to prevent it being generated.

    • When it comes to classifying abuse imagery as legal or illegal, think about "napalm girl" from Vietnam and other wartime journalism:

      If someone who was of an evil mind took a 9 year old girl, burned her with napalm, took a photo that was meant to "look like a Vietnam War photo," and uploaded it to some corner of the dark web or for that matter the "open web," I wouldn't be surprised if that photo was deemed illegal. In fact, I would be shocked if it wasn't. But "The Terror of War," a Pulitzer-prize-winning photo docum

      • by Kisai ( 213879 )

        This is why you want curated image sets to train AI.

        You don't need to ask permission if the image was publicly available at no cost in the first place. You do need to ask permission for specialized data sets (e.g. artists, news organizations, and stock artwork sites).

        Scraping Facebook and Instagram is just going to put you in a heap of trouble, especially from child-protection groups, because even if the AI "does know" because an image was properly tagged, e.g. "child in bathtub", that may result in the AI l

    • by Rei ( 128717 )

      They have filters. But it's a dataset of five billion images. It doesn't take a very high false negative rate for 1k instances to slip through. And it's impossible to have humans manually validate five billion images (and even *they* would have a false negative rate as well).
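
      For a sense of scale (the dataset size and instance count come from the summary above; the raw-crawl figure is an illustrative assumption):

          # Back-of-the-envelope numbers behind the parent comment.
          dataset_size = 5_850_000_000     # images referenced by LAION-5B
          suspected = 3_226                # suspected CSAM instances found by the study

          print(suspected / dataset_size)  # ~5.5e-07, roughly one image in 1.8 million

          # Even a filter that misses only 1% of prohibited images would leak ~3,200
          # of them if the raw crawl had contained ~320,000 (an assumed figure).
          print(0.01 * 320_000)            # 3200.0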

  • LAION (Score:3, Informative)

    by phantomfive ( 622387 ) on Wednesday December 20, 2023 @11:58AM (#64093627) Journal
    The LAION-5B dataset has 5.8 billion images, and it was automatically collected. It's not a hand-curated collection. (Note that this also raises questions of fair use, but that's a different topic).
    • by zlives ( 2009072 )

      If you are monetizing it, you should definitely be held liable.
      If your process is so broken that you cannot verify what you are selling, you shouldn't be selling it. But then it's tech bros, and "move fast and break things" is the new design paradigm.

      • by gweihir ( 88907 )

        Definitely. If there is verified material in there, they should go to prison for commercial distribution.

        • Are you serious? Fucking terrifying.
          I think you'd be at home in the Chinese Central Committee.

          Here, in civilized legal systems, we have the legal concept of mens rea.
          • Here, in civilized legal systems, we have the legal concept of mens rea.

            Bravo. But for how much longer will this be true?

            I continue to be amazed at how the tide has turned in America. We used to be against all this totalitarian, police-state, absolutist shit.
            • Holding companies accountable for the products they sell is totalitarian?

              • Re:LAION (Score:4, Insightful)

                by ihadafivedigituid ( 8391795 ) on Wednesday December 20, 2023 @01:23PM (#64093947)
                Holding companies accountable for the products they sell is totalitarian?

                Every ISP and search engine makes money from child pornography. Should all those companies be shut down and their owners & employees jailed?

                In this case, they identified a problem with about 0.00006% of their data set and they are taking action to correct it. But you want blood?

                Karma is a bitch. I hope you get held to the same standard of perfection someday so you can learn something about life.
              • Don't be a fucking moron.

                If a thief hides a body in your car, and the police prove you didn't do it, should you still go to fucking jail for tampering with a body because you drove to work?

                Think before you fucking post, man.
              • Re:LAION (Score:5, Insightful)

                by javaman235 ( 461502 ) on Wednesday December 20, 2023 @01:57PM (#64094077)

                People lose track of the fact that the problem is the abuse itself.
                "Look, that man is robbing an old lady! Get out your phone, take a picture for the police!" vs.
                "Look, CS abuse is happening! Cover your eyes and run, lest CSAM get encoded in the neural structure of your memory!"
                If you take it too far, at a certain point focusing on stopping the proliferation of what is in fact the evidence of a crime becomes a cover-up for the original, real crime, which is the abuse.

          • by gweihir ( 88907 )

            For CP possession and distribution? I doubt it. In most jurisdictions they do not have to prove intent for that. Yes, that is fundamentally broken.

            • Re:LAION (Score:4, Informative)

              by DamnOregonian ( 963763 ) on Wednesday December 20, 2023 @01:47PM (#64094013)
              You doubt wrong.

              In most jurisdictions they do not have to prove intent for that.

              You absolutely must. There is no CP indictment that doesn't start with the words "knowingly possessed or distributed..."

              • by jp10558 ( 748604 )

                Although I might make the case that if you're selling on a library of pictures, the company as a whole ought to "know" what the pictures they are selling *are*. For all sorts of legal reasons including copyright, CSAM, and just basic quality control/fitness for purpose.

                Just because you sold me a couple million bolts doesn't mean you can disclaim them being a specific grade of bolt suitable for a specific purpose. And companies are held liable for messing that up all the time.

                • Although I might make the case that if you're selling on a library of pictures, the company as a whole ought to "know" what the pictures they are selling *are*.

                  You could try to make that legal argument, but your chance of succeeding is very low. That's like trying to sue Google because they have revenge porn for 24 hours before someone catches it. Your criminal liability begins the second you know about it, not a second before.

                  Just because you sold me a couple million bolts doesn't mean you can disclaim them being a specific grade of bolt suitable for a specific purpose. And companies are held liable for messing that up all the time.

                  Not criminally, that's the distinction.

                  You don't need mens rea to be liable for harming someone.

      • The really fascinating thing is that a dataset can be useful while not having the images labeled. You'd kind of expect that would result in overtraining, or outright wrongness.
        • They are labeled using the alt attribute as a caption. Apparently they also use an image classifier to make sure image and caption match, which adds all kinds of interesting questions about the validity of the labelling, but it is labeled.
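
          For illustration, a minimal sketch of that kind of image/caption agreement check, using an open CLIP model via the Hugging Face transformers library; the specific model name and the 0.28 cosine-similarity cutoff are assumptions here, not LAION's exact pipeline:

              # Keep an (image, alt-text) pair only if CLIP scores them as a plausible match.
              import torch
              from PIL import Image
              from transformers import CLIPModel, CLIPProcessor

              model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
              processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
              SIMILARITY_THRESHOLD = 0.28  # assumed cutoff

              def keep_pair(image_path: str, alt_text: str) -> bool:
                  """Return True if the alt text plausibly describes the image."""
                  inputs = processor(text=[alt_text], images=Image.open(image_path),
                                     return_tensors="pt", padding=True, truncation=True)
                  with torch.no_grad():
                      img = model.get_image_features(pixel_values=inputs["pixel_values"])
                      txt = model.get_text_features(input_ids=inputs["input_ids"],
                                                    attention_mask=inputs["attention_mask"])
                  sim = torch.nn.functional.cosine_similarity(img, txt).item()
                  return sim >= SIMILARITY_THRESHOLD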
      • Re:LAION (Score:4, Insightful)

        by Vintermann ( 400722 ) on Wednesday December 20, 2023 @12:53PM (#64093835) Homepage

        They are not selling it. It's also not a dataset of images; it's a dataset of links to images, precisely because they wanted to limit liability in cases like this (or copyright).

        • The issue, though, is that a lot of image-gen companies used this dataset to train their own products.
          So if they go down, they are going to drag LAION down, and vice versa.

      • by ceoyoyo ( 59147 )

        LAION is a non-profit founded by a bunch of academics to make open large datasets available.

    • If I automate the firing of a machine gun, am I not liable for the damage? Is "automation" here an excuse for lack of responsibility?
      "Oops, sorry, my image-crawling algorithm that was pointing at www.underage-lolitas.com collected some child porn... bad, BAD automated algorithm!!! Btw, here's our image generator that uses those images."
    • But on the plus side, now we know why the only thing you can get all these image generators to generate reliably is porn.

  • by TheMiddleRoad ( 1153113 ) on Wednesday December 20, 2023 @12:11PM (#64093681)

    What was that company motto? Do no evil? I see their new direction is... exciting.

  • ... the unsupervised training models of most of today's systems are doomed to fail. Just because it's "on the Internet" doesn't mean that it's correct or legal. Had this been a manually tagged data set, the kiddie porn could have been labeled as such, excluded from training and perhaps even used to raise an alarm.

    It's also interesting to note that the individual site moderation depended upon to filter out such undesirable content appears to be failing miserably. CP makes it through. But Trump opinions, "OMG! Block them!" It says something about the values of the moderators*.

    *Awaiting my -1 Troll score.

    • Re:This is why ... (Score:4, Informative)

      by Calydor ( 739835 ) on Wednesday December 20, 2023 @12:32PM (#64093757)

      I'm not sure I would call it failing miserably. They detected around 3k potentially problematic images, of which 1k were verified to really BE a problem, out of a dataset of nearly 6 billion images. That's a very, very, very tiny fraction. It shouldn't be there at all, true, but "failing miserably" should have a higher threshold.

    • >the kiddie porn could have been labeled as such, excluded from training and perhaps even used to raise an alarm.

      I'm not even sure about that. Kiddie porn is such a hot button that if I ever had any land on my server for any reason other than "I saw that guy over there download it officer, here are the log files"... I'd destroy it.

      "Tag it and call the cops" if there's anything other than rock-solid documentation showing exactly who (who isn't you) is guilty just seems like too big a risk. Possession is

    • It's also interesting to note that the individual site moderation depended upon to filter out such undesirable content appears to be failing miserably. CP makes it through. But Trump opinions, "OMG! Block them!" It says something about the values of the moderators*.

      *Awaiting my -1 Troll score.

      Well when you start paraphrasing Hitler, I can see why people would want the "opinions" removed from their site. Let's take a look at the latest from the great Orange one.

      “They’re destroying the blood of our country. That’s what they’re doing. They’re destroying our country. They don’t like it when I said that — and I never read Mein Kampf,” said Trump, referencing Hitler’s manifesto. “They could be healthy, they could be very unhealthy, they could

  • I got to thinking - everyone's tripping over that category but what else could be in it? By a stretch, diagrams or accurate depictions of bomb-building or a meth lab or something? Maybe realistic depictions of a person or fictional character whose likeness actually is trademarked/copyrighted or whatever? What else could be in there?
    • I'd love to see these "uncopyrightable" AI generations create things too close to existing, highly protected IP; it sounds like that's going to be the nuclear-fission equivalent in the copyright world.
