The Internet News

The Internet Archive Has Saved Over 10,000,000,000,000,000 Bytes of the Web

Posted by Soulskill
from the or-one-kilo-library-of-congress dept.
An anonymous reader writes "Last night, the Internet Archive threw a party; hundreds of Internet Archive supporters, volunteers, and staff celebrated that the site had passed the 10,000,000,000,000,000 byte mark for archiving the Internet. As the non-profit digital library, known for its Wayback Machine service, points out, the organization has thus now saved 10 petabytes of cultural material." The announcement coincided with the release of an 80-terabyte dataset for researchers and, for the first time, the complete literature of a people: the Balinese.
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Saturday October 27, 2012 @10:24PM (#41794099)

    How much of that is porn, I wonder.

    • by martin-boundary (547041) on Saturday October 27, 2012 @10:25PM (#41794109)
      If only one of those files is an MP3, the RIAA is going to have an orgasm.
      • by Xtifr (1323) on Saturday October 27, 2012 @10:47PM (#41794203) Homepage

        They have over 1.5 million unique audio files in the Live Music Archive alone. I know because I helped them count. (That's unique files, not counting the duplicates in different formats.) If the RIAA has anything to say about it, they're seriously slacking.

        • by Anonymous Coward

          Sweet! Did you guys save my Geocities page too?

          • by GofG (1288820) on Sunday October 28, 2012 @02:01AM (#41794787)

            There is a torrent on thepiratebay of every single geocities site. It's an archive, but I've downloaded it. What was your site? I'll rar it up for you.

            • by Anonymous Coward
              Sorry my finger slipped when I went to mod this up. Can I undo!
          • by Xtifr (1323)

            Probably. I found a copy of my first-ever homepage, which actually predated Geocities, and was probably even more useless than your average Geocities page. :)

          • by mcgrew (92797) *

            Nope, as good as Archive.org is, most of the pre-2000 stuff is gone forever. Much of my old gaming site is there, but not all of it. The only surviving page of Janet "Kneel" Harriott's Yello There is one I posted on my gaming site. Very little of mcgrew.info survives.

        • by pongo000 (97357)

          They have over 1.5 million unique audio files in the Live Music Archive alone.

          Since they can't be copied per the terms of the TOS, what good do they serve? Why bother counting something you technically can't access?

          • Re: (Score:2, Insightful)

            by Anonymous Coward

            Because eventually they WILL be accessible when copyright runs out. But if nobody other than the 'rightsholders' has copies, that wouldn't matter: they could trivially remaster them, then have copyright over the remasters for another century after destroying the originals so they could never get out.

          • by Xtifr (1323)

            I think you must be looking at the wrong part of the Archive. Everything in the Live Music section and the Netlabels section is public domain or licensed under a CC license or equivalent. The media collections are separate from the Wayback Machine.

          • by mcgrew (92797) *

            Since they can't be copied per the terms of the TOS

            What are you talking about? A lot of friends of mine host their music on Archive.org.

      • If only one of those files is an MP3, the RIAA is going to have an orgasm.

        Correction: Evilgasm.

    • by dohzer (867770)

      4kB should be enough for anyone.

    • by Mikkeles (698461)

      They're exaggerating; I know there are only 256 bytes, so I think they're counting duplicates!

    • by jc42 (318812)

      How much of that is porn, I wonder.

      Actually, only about half. The other half is lolcats.

    • 70%
  • by Anonymous Coward

    Well, I guess they didn't have time to write much, being busy dealing with Orcs and Balrogs.

    What about the Thorinim?

  • Indeed! (Score:3, Funny)

    by Frosty Piss (770223) * on Saturday October 27, 2012 @10:35PM (#41794139)

    And nothing of value was saved...

  • Yes, but... (Score:5, Funny)

    by Lordfly (590616) on Saturday October 27, 2012 @10:43PM (#41794181) Homepage Journal

    I need a car analogy about the Library of Congress before I can understand that number.

  • by guttentag (313541) on Saturday October 27, 2012 @11:05PM (#41794255) Journal
    For instance, note the archived film [archive.org] "Dating: Do's and Don'ts" (1949). It begins thus:

    How do you choose a date? Whose company would you enjoy?

    Well, one thing you can consider is looks. Woody thought of Janice and how good looking she was. He'd really have to rate to date her. Yes, he'd enjoy that, except... Well, it's too bad Janice always acts so superior. She'd make a fellow feel awkward and bored.

    Well, perhaps someone who doesn't feel so superior. There's Betty. And yet, it just doesn't seem as if she'd be much fun.

    What about Anne? She knows how to have a good time, and how to make the fellow with her relax, too. Yes, that's what a boy likes.

    Yes, the Internet now provides everything you ever needed to know but were afraid to ask.

    • by Anonymous Coward

      How do you choose a date? Whose company would you enjoy? ... Well, it's too bad Janice always acts so superior. She'd make a fellow feel awkward and bored. ... What about Anne? She knows how to have a good time, and how to make the fellow with her relax, too. Yes, that's what a boy likes.

      Yes Janice, get your head out of your ass. You could take a few tips from Anne -- she's a pro.

  • by pongo000 (97357) on Saturday October 27, 2012 @11:30PM (#41794335)

    ...since the TOS specifically prohibits copying data from the site:

    "Our terms of use specify that users of the Wayback Machine are not to copy data from the collection. If there are special circumstances that you think the Archive should consider, please contact info at archive dot org. "

    Warrick hasn't been taking new requests for months (and I'm sure it's more of a research tool than an actual service for the public), and the site effectively blocks attempts to back up data using wget. It makes me wonder who (or what) this archive really serves, because it's most certainly not the general public.

    • Besides the fact that it is not open, how are you supposed to compute against it? Download the 80TB and run it on your private data center? It should be fully open and available on a platform like AWS so people can actually use it.
    • by Xtifr (1323)

      A) You can read it just like you can read normal webpages on the main web, most of which also don't allow you to copy them.
      B) The Archive is more than just the Wayback Machine. They also have what is almost certainly the world's largest digital collection of public domain and CC-licensed media files in their media collections.

      • by Anonymous Coward

        On A: reading webpages IS copying them. Any attempt at distinction, given the technical details, is INSANE.

  • I have never understood why the few archive sites I have been to never back up the entire web site, instead of just a few important pages and images. I can understand not accessing pages that are supposed to be secure, but all other pages should be fair game. This is most important for product knowledge. Sometimes a company takes down its site and images. It would be nice to have an archive to go to.

  • by Anonymous Coward on Sunday October 28, 2012 @12:11AM (#41794471)

    10,000,000,000,000,000 Bytes = 8.88 Petabytes

    • Re: (Score:3, Informative)

      by Anonymous Coward

      looks like you forgot to spell pebibytes correctly

    • by AmiMoJo (196126)

      They should have made a pebibyte 1,000,000,000,000,000. Trying to redefine petabyte was stupid.

      • by Anonymous Coward on Sunday October 28, 2012 @04:45AM (#41795199)

        You have that backwards: kilo, mega, giga, tera and so forth are base-ten prefixes, and have been for quite a bit longer than people have been misusing them to refer to base-2 numbers. As such, it made more sense to leave them consistent with everything else and make new prefixes for the binary numbers.
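
The arithmetic behind this sub-thread can be checked directly. What follows is a minimal sketch, assuming the SI definition of peta (10^15) and the IEC definition of pebi (2^50); the constant and function names are illustrative only and do not come from the thread.

```python
# Compare decimal (SI) and binary (IEC) readings of the headline figure.
TOTAL_BYTES = 10**16  # 10,000,000,000,000,000 bytes, as stated in the headline

PETABYTE = 10**15  # SI prefix: peta = 10^15 bytes
PEBIBYTE = 2**50   # IEC prefix: pebi = 2^50 bytes (1,125,899,906,842,624)


def to_petabytes(n_bytes: int) -> float:
    """Express a byte count in decimal petabytes (PB)."""
    return n_bytes / PETABYTE


def to_pebibytes(n_bytes: int) -> float:
    """Express a byte count in binary pebibytes (PiB)."""
    return n_bytes / PEBIBYTE


if __name__ == "__main__":
    print(f"{to_petabytes(TOTAL_BYTES):.2f} PB")   # 10.00 PB
    print(f"{to_pebibytes(TOTAL_BYTES):.2f} PiB")  # 8.88 PiB
```

Both numbers describe the same byte count: the summary's "10 petabytes" uses the decimal prefix, while the 8.88 figure in the parent comment only holds if the unit is read as pebibytes.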

  • by linebackn (131821) on Sunday October 28, 2012 @12:25AM (#41794515)

    I don't know if they have done anything about this recently, but there was a problem with domain parking sites putting up a robots.txt that instructs Archive.org to delete or suppress any archives of the site that was there previously. I have run into a few sites like that. If someone dies and their site goes with them, it isn't right for some squatter to remove their work from history.

    And I wish I could pull up historic copies of the original altavista.digital.com.
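
The robots.txt concern in this comment can be illustrated with Python's standard `urllib.robotparser`. This is a sketch only: the robots.txt contents and the URL are hypothetical, and `ia_archiver` is assumed here to be the user-agent name a crawler for the Archive would match; neither detail is taken from the thread, and the code says nothing about how the Archive actually treats previously captured pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of the kind a domain-parking page might serve.
# The "ia_archiver" agent name is an assumption made for illustration.
PARKED_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    old_page = "http://example.com/old-page.html"  # placeholder URL
    print(is_blocked(PARKED_ROBOTS_TXT, "ia_archiver", old_page))
    print(is_blocked(PARKED_ROBOTS_TXT, "SomeOtherBot", old_page))
```

The point of the complaint is that the new owner of a domain controls this file, so a rule written by a squatter ends up governing pages written by someone else.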

  • Download Link? (Score:5, Interesting)

    by mysidia (191772) on Sunday October 28, 2012 @12:27AM (#41794521)

    How nice of them to do the archiving and release such a large dataset.

    Where can I download the file?

  • by TechyImmigrant (175943) on Sunday October 28, 2012 @12:34AM (#41794541) Journal

    It looks like they've copied my website and are therefore infringing my copyright.

    But I won't be suing them because I don't mind, because I'm not Apple.

  • What the hell (Score:5, Interesting)

    by nuckfuts (690967) on Sunday October 28, 2012 @02:46AM (#41794907)
    are they using for backups?
  • I know the prefix invokes unpleasant connotations, but it also means 10^15.

    • (This is in reference to the headline.)

    • by tehcyder (746570)

      I know the prefix invokes unpleasant connotations, but it also means 10^15.

      When I see the word "peta" I think of naked supermodels in public protesting about animals, or something. Call me superficial but I'm prepared not to worry about the animals they're insulting if I get to see more naked supermodels.

  • Where's my page then? (Score:3, Informative)

    by AndyKron (937105) on Sunday October 28, 2012 @05:56AM (#41795395)
    With all those pages stored why does it always tell me that page can't be found?
  • Shame about the lack of images*; archive.org is the only remaining evidence of Cliff Bleszinski's Cat-Scan.com [archive.org]. The site doesn't have the same comedy value without all the scans of squished cats.

    *Yes, yes, I know that archiving images would require many extra fucktons of storage, but it would be worth it in some cases.

  • Private archive (Score:4, Interesting)

    by fa2k (881632) <pmbjornstad AT gmail DOT com> on Sunday October 28, 2012 @10:08AM (#41796467)

    It's great that archive.org is doing this, but it's such an important part of history that I thought I would do a mini-version for the pages I visit, just to be able to refer back to stuff. I've been using the Firefox addon called Shelve to save all pages I visit on my home computer for about 2 months now (at most one version for each day). It's a total of 5.8 GB. It's not useful for browsing though; I'd love it if it were better integrated with Firefox so that I could choose among all versions of each page. There's sometimes some excellent information on university pages or cheap hosting that could be 10 years old, and you never really know how long it's going to stay up.

    Anyway, this may give some perspective too; 2 months of daily snapshots of slashdot, other news, some tech stuff and a little Facebook takes just 5.8 GB.

    • by fa2k (881632)

      It's a total of 5.8 GB.

      Seems I forgot the most important part: It's a total of over 6,000,000,000 bytes!!1

  • What OS and file system are they using to store all that data?