Microsoft AI China Technology

Microsoft Quietly Deletes Largest Public Face Recognition Data Set (ft.com) 52

Microsoft has quietly pulled from the internet its database of 10 million faces [Editor's note: the link may be paywalled; alternative source], which has been used to train facial recognition systems around the world, including by military researchers and Chinese firms such as SenseTime and Megvii. From a report: The database, known as MS Celeb, was published in 2016 and described by the company as the largest publicly available facial recognition data set in the world, containing more than 10m images of nearly 100,000 individuals. The people whose photos were used were not asked for their consent; their images were scraped off the web from search engines and videos under the terms of the Creative Commons license that allows academic reuse of photos.

Microsoft, which took down the database days after the FT reported on its use by companies, said: "The site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed." Two other data sets have also been taken down since the FT report was published in April, including the Duke MTMC surveillance data set built by Duke University researchers, and a Stanford University data set called Brainwash.


  • by Anonymous Coward

    The 'net is not supposed to forget. Let's see if it's true

  • Increased scrutiny from Congress?

    • by Holi ( 250190 )
      pretty sure the government is one of the "academics" who were using the database.
  • You can't take something down from the internet. Once it's up, it's there forever.

    • by Anonymous Coward

      I used to think this way, but over the last several years I've changed my mind. There have been plenty of things that were once on the internet that I can no longer find. I agree that it may exist on someone's computer or data storage. It may even exist somewhere on the internet. However, it has become so obscure and hard to find that it basically no longer exists.

      There was a time when Trump's House of Wings seemed to completely disappear.

      CAPTCHA - oppress

      • by mysidia ( 191772 )

        There have been plenty of things that were once on the internet that I can no longer find.

        The question is: What exactly are you looking for that you can no longer find; is it substantial and important really, and what exactly have you put into your efforts to find it so far?

        For example: Have you posted a request for it somewhere?
        Did you include a bounty? Perhaps the bounty you are offering was too low to surface the materials, at the time you were looking for it.

    • by mysidia ( 191772 )

      Do you have a link to the mirror? I am much more interested in a data set now since it has had some wide value/contribution toward advancement in AI, and
      that Microsoft now thinks it's worth deleting the site ---- my suspicion would be this means MS now has technology they wish to productize and sell that this dataset helped develop, and now they have reason to try to withdraw public resources so they couldn't be used to help competitors develop something similar or better in the future, perha

      • Re: Taken down? (Score:3, Informative)

        by Anonymous Coward

        https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech

  • by WoodstockJeff ( 568111 ) on Thursday June 06, 2019 @12:21PM (#58719632) Homepage

    ... you have already given up control of it.

    "We didn't give permission for someone to scrape our pictures off our website." Yes, you did. If you didn't want people to use it, don't post it.

    "We only wanted academics to use it!" Then you shouldn't have made it public, and access should have been limited to academics.

    "We didn't want our academic research to be used for [insert bad thing here]!" You wanted research money to pay for doing that research... who do you think gave you that money?

    • I can understand some of the concern... a lot of folks hold the opinion that "academics" is purely blue-sky knowledge-for-its-own-sake research. They don't think about what happens after that.

      As an anecdote, I'll tell a story, that starts well before "machine learning" was done by machines. Instead, it was done by professional academics, literally paid to sit and think all day. One academic in particular spent a good amount of time working on a particular pattern-matching algorithm by trying various combina

    • "We didn't give permission for someone to scrape our pictures off our website." Yes, you did. If you didn't want people to use it, don't post it.

      In the case of celebrities, I suspect the vast majority of their pictures on the web (not by number of copies of the picture but by number of different individual pictures, which is what matters for training a facial recognition algorithm) were put online without their consent. The photos were taken by paparazzi and posted using the "newsworthy" exclusion to pers

    • ... you have already given up control of it.

      Who says you published it? Your smug superior argument falls down when someone else takes a photo of you and posts it. You didn't take the photo, you didn't give consent, yet there you are, for all to see, recognise, track, whatever. Even if you did, is it reasonable to expect everyone to have always understood the consequences of data sharing? The internet never forgets, so anything you did a decade ago when the net was a different animal still affects you now.

      Our laws and social norms are *not* built to d

  • }}} The people whose photos were used were not asked for their consent, their images were scraped off the web from search engines {{{ --- could this be why the BingPreview spider seemed to ignore robots.txt instructions, and downloaded images that it should not have been accessing?
  • A database of faces with names attached is trivially easy to collect.

    This will set back facial recognition by hours.

  • by Rosco P. Coltrane ( 209368 ) on Thursday June 06, 2019 @12:25PM (#58719664)

    - They deleted the stash of photos - Proof?

    - It was collected by "an employee who no longer works here" - Why, how convenient. Pay me 10 cents each time I hear that one and I'll be a rich man...

    • by drinkypoo ( 153816 ) <drink@hyperlogos.org> on Thursday June 06, 2019 @01:42PM (#58720248) Homepage Journal

      - They deleted the stash of photos - Proof?

      TFS doesn't even say they deleted the photos!

      Microsoft has quietly pulled from the internet its database of 10 million faces

      Microsoft, which took down the database

      It was run by an employee that is no longer with Microsoft and has since been removed.

      The only claim made in TFS is that they removed it from the internet, not that they deleted it. Granted, ATFA [engadget.com] repeatedly uses the word "deleted", but it offers no citations which suggest that it was actually deleted.

      TL;DR: Nobody said it was "deleted", except Engadget, and they do not back that claim up in any way. (Maybe there's something in the paywalled article, but fuck paywalled articles. I'm sure I could get access to the text somehow, but I don't want to.)

      • by Holi ( 250190 )
        "but fuck paywalled articles"

        because ad sponsored news is such a great way to get unbiased information.
        • because ad sponsored news is such a great way to get unbiased information.

          If it's subscriber-sponsored, it's going to be biased towards the majority of subscribers' opinions, and I'm not the majority of subscribers any more than I am a major corporation.

          • because ad sponsored news is such a great way to get unbiased information.

            If it's subscriber-sponsored, it's going to be biased towards the majority of subscribers' opinions, and I'm not the majority of subscribers any more than I am a major corporation.

            I'm intrigued... what do you trust for information? You have to trust somebody. I get your concern, but if ad-sponsored and subscriber-funded sources are out... what is in?

  • More likely, training databases for Nazi concentration camps.

  • They have the working math from the dataset.
    Once the math is perfected, the dataset is of no use.
    A dataset is not the resulting work that was created.

