MIT Removes Huge Dataset That Teaches AI Systems To Use Racist, Misogynistic Slurs (theregister.com)

An anonymous reader quotes a report from The Register: MIT has taken offline its highly cited dataset that trained AI systems to potentially describe people using racist, misogynistic, and other problematic terms. The database was removed this week after The Register alerted the American super-college. MIT also urged researchers and developers to stop using the training library, and to delete any copies. "We sincerely apologize," a professor told us. The training set, built by the university, has been used to teach machine-learning models to automatically identify and list the people and objects depicted in still images. For example, if you show one of these systems a photo of a park, it might tell you about the children, adults, pets, picnic spreads, grass, and trees present in the snap. Thanks to MIT's cavalier approach when assembling its training set, though, these systems may also label women as whores or bitches, and Black and Asian people with derogatory language. The database also contained close-up pictures of female genitalia labeled with the C-word. Applications, websites, and other products relying on neural networks trained using MIT's dataset may therefore end up using these terms when analyzing photographs and camera footage.

The problematic training library in question is 80 Million Tiny Images, which was created in 2008 to help produce advanced object-detection techniques. It is, essentially, a huge collection of photos with labels describing what's in the pics, all of which can be fed into neural networks to teach them to associate patterns in photos with the descriptive labels. So when a trained neural network is shown a bike, it can accurately predict a bike is present in the snap. It's called Tiny Images because the pictures in the library are small enough for the computer-vision algorithms of the late 2000s and early 2010s to digest. Today, the Tiny Images dataset is used to benchmark computer-vision algorithms along with the better-known ImageNet training collection. Unlike ImageNet, though, no one, until now, has scrutinized Tiny Images for problematic content.
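To make concrete what "feeding labeled photos into a neural network" means, here is a minimal sketch in PyTorch of the kind of supervised training the summary describes. It is not MIT's actual pipeline: it uses torchvision's CIFAR-10 (a small, hand-curated subset drawn from Tiny Images) as a stand-in, and the model and hyperparameters are illustrative choices only.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # CIFAR-10: 60,000 32x32 color images in 10 classes, originally
    # hand-curated out of the 80 Million Tiny Images collection.
    train_set = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=64, shuffle=True)

    # A deliberately small convolutional classifier; illustrative only.
    model = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 10),  # one output score per class label
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    # One pass over the data: the network is pushed toward whatever label
    # the dataset attached to each image, with no notion of whether that
    # label is apt or offensive.
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()

The point of the sketch is the final loop: the model optimizes toward the given labels and nothing else, which is exactly why bad labels become bad predictions.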

  • SAME PAGE DUPE! (Score:5, Informative)

    by pz ( 113803 ) on Wednesday July 01, 2020 @10:35PM (#60252706) Journal

    Apparently the Slashdot editors are so rabid in their outrage that they can't help themselves, and the same story of a database so large that there's no way it can be 100% vetted for contamination appears twice on the front page at the same time. The first instance isn't even halfway down.

    • And with the same righteousness, the Catholic priests burned all books they considered blasphemous. Again.
      • That's because if you're not absolutely pure and inoffensive in every sense nowadays, you're worthless garbage and deserve to be purged by history.

        And with THAT intro, I introduce you to School History, 2020 version, the COMPLETE World Edition for every grade:
        In the beginning, there was .... The end.

        Thanks! I'll be expecting my $250 fee for you reading my book in the mail Any Day Now. Remember, Only Buy and Think Pure -- ignore all inferior imitations! They're worthless by definition!
    • by Kohath ( 38547 )

      ...story of a database so large that there's no way it can be 100% vetted for contamination...

      You'd think they could train an AI to find all the problems.

      If only training AIs with such material wasn't an unbreakable taboo that transcends engineering and science and research and basic inquiry....

      Maybe the whole machine learning field should be ended, just in case. Someone might be offended someday if it isn't.

      • Re:SAME PAGE DUPE! (Score:4, Interesting)

        by Antique Geekmeister ( 740220 ) on Thursday July 02, 2020 @12:37AM (#60252854)

        Avoiding "all the problems" also eliminates the useful data. Education, age, gender and health affect the _likelihood_ of various forms of competence, even though they may not be dominant factors. Being able to differentiate based on capability is inevitably going to correlate with factors hat are considered illegal, such as gender and age. Training the unacceptable criteria out of the AI would make it unable to evaluate capability or likelihood of professional progress.

        • by AmiMoJo ( 196126 )

          This type of argument is built on the assumption that you want to hire the very best and then give them near-zero support to develop. They need to already have all the skills and have innate talent and ability, because the company sure as hell isn't going to treat them as anything more than a commodity.

          In other words it's the type of company that anyone actually good will run from fast. Good candidates will be able to sniff this kind of thing out at the interview, and will have options to move on if they som

          • > This type of argument is built on the assumption that you want to hire the very best and then give them near-zero support to develop.

            What? No. It's based on not wasting time hiring someone who's likely to take off within months or only a few years for other reasons. It's enormously wasteful when someone who is, say, my age, starts at a new company with decades of deep knowledge in the field, a true scholar and leader in the art, and then retires for medical reasons. Or when someone fresh out of college, new to the field, marries and moves to support their partner's higher paying career.

            • by AmiMoJo ( 196126 )

              It's enormously wasteful when someone who is, say, my age, starts at a new company with decades of deep knowledge in the field, a true scholar and leader in the art, and then retires for medical reasons. Or when someone fresh out of college, new to the field, marries and moves to support their partner's higher paying career.

              In other words you are discriminating by age and thereby guaranteeing you discard many of the best candidates. Obviously you don't pay well or it would make sense for them to stay on instead of moving to support their partner's higher paying career.

              • Goodness. You're reading a lot of negative behavior into what I've said.

                > In other words you are discriminating by age

                It's very difficult not to. If you're not aware that your benefits like good parental leave are appealing to younger married people, or to women approaching "a certain age" who are running out of time to have children, or that strong health insurance and life insurance are major factors for older candidates, then you can't set priorities for the benefits you provide. And the budget for t

          • by Bert64 ( 520050 )

            Companies are most concerned with short term profits, while employees are concerned with their own career.

            If you hire staff who need training, you have to pay for that training and accept inferior work (in either quality or quantity) until the training has been completed, as well as run the risk that the staff will leave as soon as the training has concluded.
            Companies only generally want to hire such staff if they are significantly cheaper than already trained staff.
            Once those staff have been trained, they

      • An AI that spots offensive material...I'll bet somebody could find a use for that. (A crude, no-AI first pass is sketched after the thread below.)
    • Pity we can't take dupe-posting editors offline for being so offensively stupid and blind.

  • They didn't use any Robosexual references in their training dataset.

  • Delete all copies (Score:4, Insightful)

    by cygnusvis ( 6168614 ) on Wednesday July 01, 2020 @10:47PM (#60252726)

    delete all copies

    That's not how the internet works.

  • Problematic (Score:4, Insightful)

    by J Story ( 30227 ) on Wednesday July 01, 2020 @10:58PM (#60252754) Homepage

    I see two problems with this. First, limiting the dataset will have the effect of limiting the communication of information. I knew a person whose first language was Russian, and even though his English seemed excellent to me, he once told me of his frustration in fully expressing his thoughts in English. Second, to the extent that such communication is possible, it seems to me that all that will happen is that the pejorative is expunged from the language while the actual concept is not, resulting in language that falls victim to the next little-Hitler witch hunt of the perpetually offended.

    • Then give the computer a thesaurus, FFS; it doesn't need someone to creatively caption a picture of a vagina to figure out what a cunt is any more than a person does.

    • by Shotgun ( 30919 )

      Negro -> colored -> black -> African American

      cripple -> disabled

      Retarded -> mentally challenged

      The PC culture has made a game out of changing names of things, trying to get away from what they are. As soon as people get used to the new name and it takes on the original negative connotations, they move to change again. They think they're helping somebody.

  • The database was removed this week after The Register alerted the American super-college.

    "Because they hid the data rather than left it available with a caveat, knuckling under to changes in the blowing wind, their status as a "super" college, pursuing the highest of investigation unencumbered, has also heen revoked."

  • "It said the dataset is near!"
  • Apparently Slashdot's dupe database got deleted too.

  • by Cylix ( 55374 ) on Thursday July 02, 2020 @01:07AM (#60252904) Homepage Journal

    Time for some old fashioned book burning.

  • by Kohlrabi82 ( 1672654 ) on Thursday July 02, 2020 @01:16AM (#60252922)

    Language is a means to express and transmit thoughts, and a varied language is part of your culture. Artificially limiting what you can do with it will reduce the ways people can communicate. And it is a slippery slope: what is acceptable today might be offensive to the next generation of SJWs tomorrow, and ultimately you end up with a caveman language because every single word is offensive, killing culture (which, arguably, is what the left-wing Maoists want).

    Part of growing up to being an adult in a free society is learning to deal with people or ideas you find "offensive". Making the society less free and cultured will not make you less offended by ideas, because the next one is just around the corner. The problem is in *your* head, not the society or language.

    • by Z00L00K ( 682162 )

      I reserve the right to feel offended, but that doesn't justify a right to censor.

    • "It was intended that when Newspeak had been adopted once and for all and Oldspeak forgotten, a heretical thought — that is, a thought diverging from the principles of Ingsoc — should be literally unthinkable, at least so far as thought is dependent on words." - George Orwell, The Principles Of Newspeak

    • I actively teach my 10-year-olds how and when to use curse words (both through direct discussion and via my personal behavior). We talk about it directly, frankly.

      They can say "damn/damn it" or "shit" when an injury occurs.

      They know they can't drop the F bomb yet (but they can text WTF). In front of us at least; I really don't care, it's a very useful word, maybe the most useful.

      They are unaware of the C word, I'd like to keep it that way.

      Bitch is a female dog (the kids know this), but "son of a bitch" is al

      • by ledow ( 319597 )

        "If you understand what it means and use it in context, it's fine."

        That's my rule for my kid. Too many adults waste far too much time not swearing in front of kids who are trying not to swear in front of their parents. Wake up and allow their use, so long as it's proportionate, in context, and not in polite company, etc.

        But just because I don't want my kid saying a particular word or not doesn't mean we should purge it from history.

    • by AmiMoJo ( 196126 )

      That's got nothing to do with the issue here, though. The problem is that the AI ends up translating things badly.

      Microsoft had that issue many years ago. Some bit of software got translated into Portuguese I think it was, and "woman" in English got translated to "bitch" in the target language.

      Google Translate had issues with it too. When translating Japanese it would get genders mixed up all the time and end up saying borderline offensive stuff as a result. In Japanese gender is often not apparent from

  • Silly college kids. But I laughed when I read this.

  • Can we remove the dataset that teaches slashdot editors to make reposts?
  • We have it all archived on /b/.

  • The database was removed this week after The Register alerted the American super-college.

    What the hell is a 'super-college'?
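A footnote to the thread's suggestion that an AI be trained to find the problems: the first pass arguably needs no AI at all. Tiny Images' labels were drawn from WordNet nouns, so a plain blocklist sweep over the label vocabulary would surface the worst entries. The sketch below is hypothetical: the one-label-per-line labels.txt file and the BLOCKLIST contents are invented for illustration and are not the dataset's real metadata format.

    # Hypothetical first-pass screen: flag any dataset label that appears
    # on a blocklist. The one-label-per-line file format and the blocklist
    # contents are invented for illustration.
    BLOCKLIST = {"slur_one", "slur_two"}  # in practice, a curated wordlist

    def flagged_labels(path):
        """Yield (line_number, label) pairs whose label is blocklisted."""
        with open(path, encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                label = line.strip().lower()
                if label in BLOCKLIST:
                    yield lineno, label

    if __name__ == "__main__":
        for lineno, label in flagged_labels("labels.txt"):
            print(f"line {lineno}: blocklisted label {label!r}")

A sweep like this only catches offensive label text, not offensive image content, so it is a starting point rather than a full vetting — which is part of the grandparent's point about a dataset too large to vet completely.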
