AI Technology

MIT Apologizes, Permanently Pulls Offline Huge Dataset That Taught AI Systems To Use Racist, Misogynistic Slurs (theregister.com) 128

MIT has taken offline its highly cited dataset that trained AI systems to potentially describe people using racist, misogynistic, and other problematic terms. From a report: The database was removed this week after The Register alerted the American super-college. And MIT urged researchers and developers to stop using the training library, and to delete any copies. "We sincerely apologize," a professor told us. The training set, built by the university, has been used to teach machine-learning models to automatically identify and list the people and objects depicted in still images. For example, if you show one of these systems a photo of a park, it might tell you about the children, adults, pets, picnic spreads, grass, and trees present in the snap. Thanks to MIT's cavalier approach when assembling its training set, though, these systems may also label women as whores or bitches, and Black and Asian people with derogatory language. The database also contained close-up pictures of female genitalia labeled with the C-word.

Applications, websites, and other products relying on neural networks trained using MIT's dataset may therefore end up using these terms when analyzing photographs and camera footage. The problematic training library in question is 80 Million Tiny Images, which was created in 2008 to help produce advanced object detection techniques. It is, essentially, a huge collection of photos with labels describing what's in the pics, all of which can be fed into neural networks to teach them to associate patterns in photos with the descriptive labels. So when a trained neural network is shown a bike, it can accurately predict a bike is present in the snap. It's called Tiny Images because the pictures in the library are small enough for computer-vision algorithms of the late 2000s and early 2010s to digest.
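
To make the mechanics concrete, here is a minimal, hypothetical sketch of how an image-label corpus of this kind is typically consumed. It uses PyTorch, random tensors stand in for the 32x32 pictures, and the three-word label vocabulary is invented for illustration; it shows generic supervised image classification, not MIT's actual training pipeline.

    # Minimal sketch (not MIT's pipeline): pairs of (image, label index) are fed to a
    # classifier so it learns to associate pixel patterns with whatever labels the
    # dataset contains -- including any offensive ones left in it.
    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    label_vocab = ["bicycle", "tree", "person"]   # hypothetical label set
    num_classes = len(label_vocab)

    # Stand-in for the real corpus: 256 random 32x32 RGB "images" with random labels.
    images = torch.rand(256, 3, 32, 32)
    labels = torch.randint(0, num_classes, (256,))
    loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

    # A tiny convolutional classifier sized for 32x32 inputs.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 8 * 8, num_classes),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):  # brief demo run
        for batch_images, batch_labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch_images), batch_labels)
            loss.backward()
            optimizer.step()

    # After training on real data, model(image) would score every label in the
    # vocabulary, which is why a poisoned label set propagates into applications.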

This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward

    Hope so. Let's keep the internet indelible, if for no other reason than to piss off the easily offended. We can't let these meddlesome fools take over the world

    • by goombah99 ( 560566 ) on Wednesday July 01, 2020 @01:41PM (#60251240)

      Any AI trained on a sanitized database will never be able to pass the Turing test. Nor will it be very good at processing a lot of corpora.

      • I like this variation of "you can't tell jokes any more", despite scores of successful comedians demonstrating otherwise.
      • by vlad30 ( 44644 )
        Sarcasm

        Movie people are liberal-minded and know what to think. Just use movies as a training set, just as many humans do.

        Let's see: historically accurate movies are now defined as racist, as are movies made even a few years back.

        That leaves you with the current crop of SJW-inspired versions.

        The AI now believes the world is full of LGBT and all the other letters. Every couple is married and interracial, but we are all Jewish, Italians are funny and have large families, and the Chinese are all good guys and wh

  • Now, we are teaching AIs how to be racist/misogynist.

    This was one that MIT really fucked up.
    • by Brain-Fu ( 1274756 ) on Wednesday July 01, 2020 @01:20PM (#60251144) Homepage Journal

      This was clearly a problem of quality control. Some pranksters injected these images into the database, against guidelines. It is hard to filter for this sort of thing, since every image must be examined by a human.

      Now, if only they had an AI system that was trained to detect and filter out submissions like that.....

    • Re: (Score:3, Insightful)

      We're not teaching AI to be anything.

      The problem with AI is that nobody is teaching it anything, meaning there is no parent to shape the growth of learning. And unfiltered learning is exactly what you see here, where the connotations are simply excluded from the data set.

      Female genitalia labeled with the "C" word is technically correct but misses the connotation that is learned socially. Without "meaning" it would be easy to see how an AI would refer to women with that word. Technically correct, but completely vo

      • Then it's not AI.

        Oh right... I keep forgetting I am on Slashdot, where

        if (a == b) { print("We Just Invented AI!"); }
        else        { print("We Just Invented AI!"); }

        The things we call AI are a joke!

          • Yes, I am more than well aware of all the efforts around the world to keep calling things what they are not... even if you have to go down the route of altering the definitions of things to fit your narrative.

            If you have to remove a "data set" to resolve a problem of bias then you just do not have any intelligence in the program. Just like with humans... you don't remove data to remove bias... you ADD fucking data!

            Are they adding data here? No... Why? Because this is not fucking AI to any degree that ADDing data r

            • ...you ADD fucking data!

              So to fix a problem with bias, you point the program to pornhub? Not that I'm objecting.

              • Lol, as crazy as it sounds, that is exactly what you do. If you actually have an AI you teach it... you don't just add/subtract data...

                You give a kid a massive multiplication table, but you still have to teach! Also, programs are always biased; the trick is making sure their bias works in our favor. For example... we would want a program to have multiple biases... just like humans. We just want those biases to all be the ones that help make humanity better vs the biases that make humanity worse. A good bias is

            • altering the definitions

              The English language does not have an official authority for definitions. Popular use, alone, alters definitions. This enrages people who prefer the old definitions, but their rage does nothing to prevent the phenomenon.

              The phrase "Artificial Intelligence" is a good example of this, because when John McCarthy originally coined the phrase, it had a really broad meaning that covered several classes of algorithm that existed at the time. The intent was along the lines of "a non-intel

        • Then it's not AI.

          You've been told dozens of times that "AI" is an academic department at Universities that may or may not be part of the Computer Science program.

          It is not a descriptive claim about the intellectual capabilities of a machine.

          You can't comprehend that, because your own "intelligence" is merely an artificial category applied solely due to your taxonomic classification; very much like that machine in that way, in fact.

      • by jythie ( 914043 )
        Well, that is the modern trend. The domains of AI that get all the funding (i.e. ones that are most applicable to shopping recommendations and other such search problems) really come down to combining lots of data with lots of processing power and tweaking things here and there till something passes. Understanding what is happening inside the model is out of fashion
    • Did we really? Seriously, out of a database of 80 million images, how many were offensive? Do you really expect a few MIT students to manually inspect all these images for offense? They probably just scraped the internet for them. Also, from the article, the database included words like pedophile, child molester, molester, and rape suspect, all of which I assume were associated with men. What exactly does a child molester look like?

      An article headline and summary made out to look like someone was against the

    • by Shaitan ( 22585 )

      This IS one MIT really fucked up. A dataset like this doesn't teach AI to be racist, it provides the source material to allow an AI to understand what slang and racist terms identify. An example of a valid use case is differentiating between classic hip-hop music and Klan speech in a massive library of audio recordings.

      These monsters pushing erasure and the rewriting of history need to be stopped. It is a revolution that ends in a single party solution with a heavily walled and divided disarmed populace and

      • Burn History, Burn everything we don't like, burn anything that offends me.

        Yea, that is pretty much what all people in the wrong like to do. Silencing people, cancel culture, and SJW's are all busy spinning their wheels because none of their arguments ever stand up to any meaningful scrutiny.

        The First Lie usually wins the argument so lie first and lie often!

        Make sure they are called a racist, misogynist, homophobe, xenophobe, bigoted prick, anything so long as you do it before they have a chance to prove y

      • A dataset like this doesn't teach AI to be racist, it provides the source material to allow an AI to understand what slang and racist terms identify.

        Actually, good point. You have to have negatives as well as positives, especially if you want an AI/person to discern the difference.

        • by Shaitan ( 22585 )

          "Actually, good point. You have to have negatives as well as positive, esp. if you want an AI/person to discern the difference."

          Yes, they are all just labels and images and a linkage between the two. There is nothing innately evil or racist about a grouping of letters. That is just data. Racism is about intent not the labels which are used. Since "AI" as it exists today lacks the ability to have intent, it is the intent of the creator that matters. Then there is also prejudice/bias but no matter how bad the

    • Except they weren't. This is more whiny xian "facts are racist" garbage. They can't deal with the real world so they have to make up lies about invisible men in the sky talking to them. They'd rather have no technology than have technology that tells the damn truth. Their entire belief system is based on lies.

  • Comment removed based on user account deletion
    • by DarkOx ( 621550 )

      It all depends on where the data comes from. Did the data even all come from people who knew they were producing training data, or do you think someone should be charged if they deliberately fill out a reCAPTCHA wrong?

    • Could have just been a bot that scraped various websites for the images. A little early for pitchforks, I think.

      But this does sound like the kind of prank that 4chan or some similar site would pull. They did something similar to a chat bot AI on Twitter several years ago.

      Incidentally the best way to prevent this in the future is to have a crazy racist AI that would be good at detecting these things.
  • Now we have seen everything!

    • Re: (Score:3, Insightful)

      by hey! ( 33014 )

      As someone who's spent decades working with databases, it surprises me that anyone would be mystified by a database being "racist". Nothing could be more commonplace than a bad system distorting a decision-making process. A system is only as good as people's ability to recognize when it has problems.

      Sure: a database isn't intelligent or self-critical, so it can't *have* racist opinions. But that doesn't mean it can't *embody* or even *enforce* racist attitudes.

      The classic example is redlining, where black nei

  • by Anonymous Coward

    This looks for all the world to me to be almost exactly what Orwell wrote in fiction. ANY history that violates the accepted standards MUST be erased, rephrased, torn down, or otherwise sanitized, or you are going to get yourself torched, figuratively or literally.

  • by Tangential ( 266113 ) on Wednesday July 01, 2020 @01:18PM (#60251124) Homepage
    This is a golden opportunity. Rather than retiring this AI training system, it should be used to develop highly specialized AIs whose whole purpose in life is to answer calls from telemarketers.
    • This is a golden opportunity. Rather than retiring this AI training system, it should be used to develop highly specialized AIs whose whole purpose in life is to answer calls from telemarketers.

      That's too nice. I'd just direct them to a suitable Rickroll.

    • by PPH ( 736903 )

      answer calls from telemarketers

      Lenny from the 'hood.

  • "The database also contained close-up pictures of female genitalia labeled with the C-word". Scientific research is tedious and lonely. Please don't judge.
    • "The database also contained close-up pictures of female genitalia labeled with the C-word". Scientific research is tedious and lonely. Please don't judge.

      How about if, instead of the "C-word", they said pussy? Is that less bad?

  • Rap music (Score:5, Insightful)

    by ichthus ( 72442 ) on Wednesday July 01, 2020 @01:20PM (#60251140) Homepage

    these systems may also label women as whores or bitches

    So, basically, they used rap music lyrics in the language training.

  • This explains so much. [wikipedia.org]
  • "AI Systems use Racially Insensitive Language and the Problem is Unsolvable" because you're not allowed to use insensitive language in training sets to train models what not to do.

    Wasn't there a story a couple years ago about pictures of black folks being pulled out of training sets because AI mischaracterized them? And then a story from earlier this year about a black guy misidentified by facial recognition?

    I guess we'll hear it's systemic racism because the AI systems are poorly trained with only politic

    • and the COMPAS AI that didn't give many black offenders bail because it turned out that black offenders were more likely to have committed crimes associated with offenders jumping bail. But that was racist too.

    • As with all Machine Learning situations... IF you have garbage training data, you will have poor performance on real data.

  • Search and Replace (Score:3, Interesting)

    by nwaack ( 3482871 ) on Wednesday July 01, 2020 @01:37PM (#60251212)
    Apparently search and replace doesn't work in this database? The snowflake cancel culture strikes again!
  • Well, how is an AI supposed to pass the Turing test without being misogynistic and racist?

  • by DaveV1.0 ( 203135 ) on Wednesday July 01, 2020 @01:47PM (#60251266) Journal
    The unpopular truth is that it isn't racist or sexist. It is acting like members of the groups it is offending.

    these systems may also label women as whores or bitches

    Like how rap songs label women as whores and bitches? Like how many women refer to other women as whores and bitches?

    Black and Asian people with derogatory language

    Like how Black and Asian people will call their friends words considered derogatory language, in songs and in person?

    The database also contained close-up pictures of female genitalia labeled with the C-word.

    The "C-word" isn't that offensive in many English speaking countries, especially when referring to female genitalia. That literally falls under "talk dirty to me"

    • It really seems to me that this dataset could also be used to teach the network what words are bad and which words to avoid. Like, I know all the words that are being described here even though some of them aren't being fully spelled out. I also know not to use them, and I know that people who DO use them either are part of some in-group that I'm not, or they're jerks. I've been able to learn how to discern those two groups.

      When I was in grade 9, my French teacher told us that she would teach us French swea

      • by xonen ( 774419 ) on Wednesday July 01, 2020 @03:21PM (#60251664) Journal

        The problem with banning words is that other words will pop up spontaneously to replace them. It's a cat-and-mouse game that can't be won.

        • The problem with banning words is that other words will pop up spontaneously to replace them. It's a cat-and-mouse game that can't be won.

          And they can all be found in a thesaurus. Or a dictionary. Or urban dictionary. If that is how you want to enrich your data.
          Not in direct image-word associations used to train dumb-as-dirt AI pattern recognition systems.

          What if AI is not dumb as dirt? (It is, but what if?) Is there value in this form of knowledge then?
          In what circumstances is it useful to train an adult, a child, an animal, or a machine with picture flash cards labeled c**t or n****r?

          When is that EVER ok, or even in the remotest way usefu

    • The unpopular truth is that it isn't racist or sexist. It is acting like members of the groups it is offending.

      Like how rap songs label women as whores and bitches? Like how many women refer to other women as whores and bitches?

      Who in their right mind would even think to train an image recognition system using rap lyrics? Like they fed it rap music videos... connecting the speech recognition of rap music and the images of a music video... to learn some useful fucking thing, and you suppose that is likely what happened? Are you for real?

      Human beings labeled the pictures in the dataset, and some of them were being fucking dildos. This isn't some "I learned it from you dad" moment for reflection bullshit, it's a g.d. computer,

      • Are you saying that the urban culture from which rap arose is somehow inferior and its word choices should be ignored?

        Are you saying that urban culture should be ignored by AI, effectively creating an AI that is biased towards Eurocentric normative culture?

        Be careful what you wish for; you might find yourself being called a racist for saying rap music and urban slang shouldn't be used to train AIs.
  • AI Safety (Score:2, Interesting)

    by edi_guy ( 2225738 )

    I did not RTFA but it does seem like some rules around AI safety are in order. Unexpected outcomes like the one mentioned will get worse and worse unless the researchers take AI safety more seriously. I like posts from Robert Miles https://www.youtube.com/channe... [youtube.com] who discusses these matters on Computerphile and his own channel. He is of the opinion that AI researchers need to start taking this seriously now, even though it seems like the work isn't that far along. A comparable and topical field might

  • I mean, if I search for pictures of 'pick your expletive', shouldn't I get pictures?
    Is the purpose to allow people to find what they want, or to label things that exist?
    OBVIOUSLY you should not label things in an offensive manner, but if you have a poorly educated person who legitimately is looking for information on female anatomy because they have a medical problem and have always called it 'my c-word', why shouldn't the AI be able to tell what the person is asking about?

    • I disagree; I have the freedom and liberty to label things offensively if I so choose. If I want to call Biden a "demented half-dead meat puppet" or Trump a "goth-eyed orange orangutan" and you are offended, that is your tough shit.

  • This sounds hilarious. Imagine someone making an application based on this data set:

    Maybe a blind guy is using a dating app, and his phone is verbally captioning the images using a computerized voice.
    1. User loads app.
    2. Computerized voice narrates what is on the screen: "Next icon. Previous icon disabled. Like icon. Email icon. Image of sweet milf twerking."
    3. User clicks next.
    4. Computerized voice says: "Next icon. Previous icon. Like icon. Email icon. Image of fat c*nt in a bikini."

    Or a messaging app th

  • Always teaching kids words they shouldn't learn.

  • Thank you for protecting us from bad words at the expense of scientific progress.
    Doing god's work.
  • So it seems a big issue with this is, in part, what the database has been used for, which is probably more than planned. Certainly the inclusion of blatantly racist labels is not something that helps in any way. However, reading this I can see there is a real need to come to an agreement on what things should be excluded from certain databases but must be included in others. For example, the comment by Prabhu and Birhane "You don’t need to include racial slurs, pornographic images, or pictures of children"

  • I don't expect this opinion to be very popular, but any AI that doesn't understand racism, misogyny, misandry, homophobia, or vulgar language will be pretty fucking useless in 2020.
  • ^ this ^ Also, why so many half-formed and tangential arguments on the board when what's problematic are the implications and actual harm stupid humans inflict in the array of ways they choose to apply/use AI? TL;DR: So, you give a chainsaw to a 3-year-old to cut a cake? GTFO
  • So Tay was trained by MIT?
  • Sure those are terms you don't use in polite society.

    They are also correct, in the sense that these are words that are sometimes used to describe the people or objects shown in those pictures.

    We just don't teach our AIs good behaviour and which things to say and which not. That may be a necessary future step: to teach synonyms, including labels stating that this is a synonym it should understand but not use (a rough sketch of that idea follows this comment).

    Except, of course, in its proper context. Most insults are descriptive terms for other things that a
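
A rough sketch of the "understand but never use" idea from the comment above, assuming nothing about the dataset itself: offensive synonyms stay in the recognition vocabulary so the system can map them to a neutral canonical label, but only the neutral term is ever emitted. The word lists here are hypothetical placeholders, not labels from the dataset.

    # Hypothetical sketch: keep offensive synonyms recognizable, but never output them.
    OFFENSIVE_SYNONYMS = {
        # placeholder entries standing in for a curated mapping
        "slur_for_woman": "woman",
        "crude_anatomy_term": "vulva",
    }

    def canonical_label(raw_label: str) -> str:
        """Return a label that is safe to emit: offensive synonyms are understood
        (recognized and mapped) but never repeated verbatim."""
        return OFFENSIVE_SYNONYMS.get(raw_label.lower(), raw_label)

    if __name__ == "__main__":
        for raw in ["tree", "crude_anatomy_term", "bicycle"]:
            print(f"{raw!r} -> {canonical_label(raw)!r}")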

  • If they're removing sources that "...label women as whores or bitches...", isn't that exclusionary of black rap culture?

  • Let trained networks be aware of the words, but make sure they know not to use them. Unless they're really mad.
