MIT Removes Huge Dataset That Teaches AI Systems To Use Racist, Misogynistic Slurs (theregister.com)

An anonymous reader quotes a report from The Register: MIT has taken offline its highly cited dataset that trained AI systems to potentially describe people using racist, misogynistic, and other problematic terms. The database was removed this week after The Register alerted the American super-college. MIT also urged researchers and developers to stop using the training library, and to delete any copies. "We sincerely apologize," a professor told us. The training set, built by the university, has been used to teach machine-learning models to automatically identify and list the people and objects depicted in still images. For example, if you show one of these systems a photo of a park, it might tell you about the children, adults, pets, picnic spreads, grass, and trees present in the snap. Thanks to MIT's cavalier approach when assembling its training set, though, these systems may also label women as whores or bitches, and Black and Asian people with derogatory language. The database also contained close-up pictures of female genitalia labeled with the C-word. Applications, websites, and other products relying on neural networks trained using MIT's dataset may therefore end up using these terms when analyzing photographs and camera footage.

The problematic training library in question is 80 Million Tiny Images, which was created in 2008 to help produce advanced object-detection techniques. It is, essentially, a huge collection of photos with labels describing what's in the pics, all of which can be fed into neural networks to teach them to associate patterns in photos with the descriptive labels. So when a trained neural network is shown a bike, it can accurately predict a bike is present in the snap. It's called Tiny Images because the pictures in the library are small enough for the computer-vision algorithms of the late 2000s and early 2010s to digest. Today, the Tiny Images dataset is used to benchmark computer-vision algorithms along with the better-known ImageNet training collection. Unlike ImageNet, though, no one, until now, has scrutinized Tiny Images for problematic content.
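To make concrete what "feeding labeled photos into a neural network" means, here is a minimal sketch in PyTorch of the kind of supervised training the summary describes. It is not MIT's actual pipeline: it uses torchvision's CIFAR-10 (a small, hand-curated subset drawn from Tiny Images) as a stand-in, and the model and hyperparameters are illustrative choices only.

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # CIFAR-10: 60,000 32x32 color images in 10 classes, originally
    # hand-curated out of the 80 Million Tiny Images collection.
    train_set = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=64, shuffle=True)

    # A deliberately small convolutional classifier; illustrative only.
    model = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 10),  # one output score per class label
    )
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    # One pass over the data: the network is pushed toward whatever label
    # the dataset attached to each image, with no notion of whether that
    # label is apt or offensive.
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()

The point of the sketch is the final loop: the model optimizes toward the given labels and nothing else, which is exactly why bad labels become bad predictions.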

  • SAME PAGE DUPE! (Score:5, Informative)

    by pz ( 113803 ) on Wednesday July 01, 2020 @10:35PM (#60252706) Journal

    Apparently the Slashdot editors are so rabid in their outrage that they can't help themselves, and the same story of a database so large that there's no way it can be 100% vetted for contamination appears twice on the front page at the same time. The first instance isn't even halfway down.

    • And with the same righteousness, the Catholic priests burned all books they considered blasphemous. Again.
      • That's because if you're not absolutely pure and inoffensive in every sense nowadays, you're worthless garbage and deserve to be purged by history.

        And with THAT intro, I introduce you to School History, 2020 version, the COMPLETE World Edition for every grade:
        In the beginning, there was .... The end.

        Thanks! I'll be expecting my $250 fee for you reading my book in the mail Any Day Now. Remember, Only Buy and Think Pure -- ignore all inferior imitations! They're worthless by definition!
    • by Kohath ( 38547 )

      ...story of a database so large that there's no way it can be 100% vetted for contamination...

      You'd think they could train an AI to find all the problems.

      If only training AIs with such material wasn't an unbreakable taboo that transcends engineering and science and research and basic inquiry....

      Maybe the whole machine learning field should be ended, just in case. Someone might be offended someday if it isn't.

      • Re:SAME PAGE DUPE! (Score:4, Interesting)

        by Antique Geekmeister ( 740220 ) on Thursday July 02, 2020 @12:37AM (#60252854)

        Avoiding "all the problems" also eliminates the useful data. Education, age, gender and health affect the _likelihood_ of various forms of competence, even though they may not be dominant factors. Being able to differentiate based on capability is inevitably going to correlate with factors hat are considered illegal, such as gender and age. Training the unacceptable criteria out of the AI would make it unable to evaluate capability or likelihood of professional progress.

        • by AmiMoJo ( 196126 )

          This type of argument is built on the assumption that you want to hire the very best and then give them near-zero support to develop. They need to already have all the skills and have innate talent and ability, because the company sure as hell isn't going to treat them as anything more than a commodity.

          In other words it's the type of company that anyone actually good will run from fast. Good candidates will be able to sniff this kind of thing out at the interview, and will have options to move on if they som

          • > This type of argument is built on the assumption that you want to hire the very best and then give them near-zero support to develop.

            What? No. It's based on not wasting time hiring someone who's likely to take off within months or only a few years for other reasons. It's enormously wasteful when someone who is, say, my age, starts at a new company with decades of deep knowledge in the field, a true scholar and leader in the art, and then retires for medical reasons. Or when someone fresh out of college, new to the field, marries and moves to support their partner's higher paying career.

            • by AmiMoJo ( 196126 )

              It's enormously wasteful when someone who is, say, my age, starts at a new company with decades of deep knowledge in the field, a true scholar and leader in the art, and then retires for medical reasons. Or when someone fresh out of college, new to the field, marries and moves to support their partner's higher paying career.

              In other words you are discriminating by age and thereby guaranteeing you discard many of the best candidates. Obviously you don't pay well or it would make sense for them to stay on instead of moving to support their partner's higher paying career.

              • Goodness. You're reading a lot of negative behavior into what I've said.

                > In other words you are discriminating by age

                It's very difficult not to. If you're not aware that your benefits like good parental leave are appealing to younger married people, or to women approaching "a certain age" who are running out of time to have children, or that strong health insurance and life insurance are major factors for older candidates, then you can't set priorities for the benefits you provide. And the budget for t

          • by Bert64 ( 520050 )

            Companies are most concerned with short term profits, while employees are concerned with their own career.

            If you hire staff who need training, you have to pay for that training and accept inferior work (in either quality or quantity) until the training has been completed, as well as run the risk that the staff will leave as soon as the training has concluded.
            Companies only generally want to hire such staff if they are significantly cheaper than already trained staff.
            Once those staff have been trained, they

      • An AI that spots offensive material...I'll bet somebody could find a use for that. (A crude, no-AI first pass is sketched after the thread below.)
    • Pity we can't take dupe-posting editors offline for being so offensively stupid and blind.

  • They didn't use any Robosexual references in their training dataset.

  • Delete all copies (Score:4, Insightful)

    by cygnusvis ( 6168614 ) on Wednesday July 01, 2020 @10:47PM (#60252726)

    delete all copies

    That's not how the internet works.

  • Problematic (Score:4, Insightful)

    by J Story ( 30227 ) on Wednesday July 01, 2020 @10:58PM (#60252754) Homepage

    I see two problems with this. First, limiting the dataset will have the effect of limiting the communication of information. I knew a person whose first language was Russian, and even though his English seemed excellent to me, he once told me of his frustration in fully expressing his thoughts in English. Second, to the extent that such communication is possible, it seems to me that all that will happen is that the pejorative is expunged from the language while the actual concept is not, resulting in language that falls victim to the next little-Hitler witch hunt of the perpetually offended.

    • Then give the computer a thesaurus, FFS; it doesn't need someone to creatively caption a picture of a vagina to figure out what a cunt is any more than a person does.

    • by Shotgun ( 30919 )

      Negro -> colored -> black -> African American

      cripple -> disabled

      Retarded -> mentally challenged

      The PC culture has made a game out of changing names of things, trying to get away from what they are. As soon as people get used to the new name and it takes on the original negative connotations, they move to change again. They think they're helping somebody.

  • The database was removed this week after The Register alerted the American super-college.

    "Because they hid the data rather than left it available with a caveat, knuckling under to changes in the blowing wind, their status as a "super" college, pursuing the highest of investigation unencumbered, has also heen revoked."

  • "It said the dataset is near!"
  • Apparently Slashdot's dupe database got deleted too.

  • by Cylix ( 55374 ) on Thursday July 02, 2020 @01:07AM (#60252904) Homepage Journal

    Time for some old fashioned book burning.

  • by Kohlrabi82 ( 1672654 ) on Thursday July 02, 2020 @01:16AM (#60252922)

    Language is a means to express and transmit thoughts, and a varied language is part of your culture. Artificially limiting what you can do with it will reduce the ways people can communicate. And it is a slippery slope: what is acceptable today might be offensive to the next generation of SJWs tomorrow, and ultimately you end up with a caveman language because every single word is offensive, killing culture (which, arguably, is what the left-wing Maoists want).

    Part of growing up to being an adult in a free society is learning to deal with people or ideas you find "offensive". Making the society less free and cultured will not make you less offended by ideas, because the next one is just around the corner. The problem is in *your* head, not the society or language.

    • by Z00L00K ( 682162 )

      I reserve the right to feel offended, but that doesn't justify a right to censor.

    • "It was intended that when Newspeak had been adopted once and for all and Oldspeak forgotten, a heretical thought — that is, a thought diverging from the principles of Ingsoc — should be literally unthinkable, at least so far as thought is dependent on words." - George Orwell, The Principles Of Newspeak

    • I actively teach my 10-year-olds how and when to use curse words (both through direct discussion and via my personal behavior). We talk about it directly, frankly.

      They can say "damn/damn it" or "shit" when an injury occurs.

      They know they can't drop the F bomb yet (but they can text WTF). In front of us at least; I really don't care, it's a very useful word, maybe the most useful.

      They are unaware of the C word, I'd like to keep it that way.

      Bitch is a female dog (the kids know this), but "son of a bitch" is al

      • by ledow ( 319597 )

        "If you understand what it means and use it in context, it's fine."

        That's my rule for my kid. Too many adults waste far too much time not swearing in front of kids who are trying not to swear in front of their parents. Wake up and allow their use, so long as it's proportionate, in context, and not in polite company, etc.

        But just because I don't want my kid saying a particular word or not doesn't mean we should purge it from history.

    • by AmiMoJo ( 196126 )

      That's got nothing to do with the issue here, though. The problem is that the AI ends up translating things badly.

      Microsoft had that issue many years ago. Some bit of software got translated into Portuguese I think it was, and "woman" in English got translated to "bitch" in the target language.

      Google Translate had issues with it too. When translating Japanese it would get genders mixed up all the time and end up saying borderline offensive stuff as a result. In Japanese gender is often not apparent from

  • Silly college kids. But I laughed when I read this.

  • Can we remove the dataset that teaches slashdot editors to make reposts?
  • We have it all archived on /b/.

  • The database was removed this week after The Register alerted the American super-college.

    What the hell is a 'super-college'?
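A footnote to the thread's suggestion that an AI be trained to find the problems: the first pass arguably needs no AI at all. Tiny Images' labels were drawn from WordNet nouns, so a plain blocklist sweep over the label vocabulary would surface the worst entries. The sketch below is hypothetical: the one-label-per-line labels.txt file and the BLOCKLIST contents are invented for illustration and are not the dataset's real metadata format.

    # Hypothetical first-pass screen: flag any dataset label that appears
    # on a blocklist. The one-label-per-line file format and the blocklist
    # contents are invented for illustration.
    BLOCKLIST = {"slur_one", "slur_two"}  # in practice, a curated wordlist

    def flagged_labels(path):
        """Yield (line_number, label) pairs whose label is blocklisted."""
        with open(path, encoding="utf-8") as fh:
            for lineno, line in enumerate(fh, start=1):
                label = line.strip().lower()
                if label in BLOCKLIST:
                    yield lineno, label

    if __name__ == "__main__":
        for lineno, label in flagged_labels("labels.txt"):
            print(f"line {lineno}: blocklisted label {label!r}")

A sweep like this only catches offensive label text, not offensive image content, so it is a starting point rather than a full vetting — which is part of the grandparent's point about a dataset too large to vet completely.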
