The Internet Archive Has Saved Over 10,000,000,000,000,000 Bytes of the Web


Posted by Soulskill on Saturday October 27, 2012 @11:14PM from the or-one-kilo-library-of-congress dept.

An anonymous reader writes "Last night, the Internet Archive threw a party; hundreds of Internet Archive supporters, volunteers, and staff celebrated that the site had passed the 10,000,000,000,000,000 byte mark for archiving the Internet. As the non-profit digital library, known for its Wayback Machine service, points out, the organization has thus now saved 10 petabytes of cultural material." The announcement coincided with the release of an 80-terabyte dataset for researchers and, for the first time, the complete literature of a people: the Balinese.

This discussion has been archived. No new comments can be posted.


Relevance of byte count (Score:5, Funny)

by Anonymous Coward writes: on Saturday October 27, 2012 @11:24PM (#41794099)

How much of that is porn, I wonder.

- Re:Relevance of byte count (Score:5, Funny)
  
  by martin-boundary ( 547041 ) writes: on Saturday October 27, 2012 @11:25PM (#41794109)
  
  If only one of those files is an MP3, the RIAA is going to have an orgasm.
  
  - Re:Relevance of byte count (Score:5, Insightful)
    
    by Xtifr ( 1323 ) writes: on Saturday October 27, 2012 @11:47PM (#41794203) Homepage
    
    They have over 1.5 million unique audio files in the Live Music Archive alone. I know because I helped them count. (That's unique files, not counting the duplicates in different formats.) If the RIAA has anything to say about it, they're seriously slacking.
    
    - Re: (Score:1)
      
      by Anonymous Coward writes:
      
      Sweet! Did you guys save my Geocities page too?
      - Re:Relevance of byte count (Score:5, Funny)
        
        by GofG ( 1288820 ) writes: on Sunday October 28, 2012 @03:01AM (#41794787)
        
        There is a torrent on thepiratebay of every single geocities site. It's an archive, but I've downloaded it. What was your site? I'll rar it up for you.
        
        
        Re: (Score:1)
        
        by Anonymous Coward writes:
        
        Sorry, my finger slipped when I went to mod this up. Can I undo?
        
        Re:Relevance of byte count (Score:5, Interesting)
        
        by GofG ( 1288820 ) writes: on Sunday October 28, 2012 @05:00AM (#41795083)
        
        No, go ahead and mod me down. Every time I post, I look at my user ID and think "GOD FUCKING DAMNIT IF I HAD WAITED LIKE TEN MINUTES I WOULD HAVE HAD A PALINDROME AUAUUUUUUGGGHHH"
        I deserve all the downmods I get, accidental or otherwise.
        
        
        Re: (Score:1)
        
        by maxwell demon ( 590494 ) writes:
        
        You probably wouldn't have gotten a palindrome. Instead, you'd be even more angry for having missed the palindrome even more closely.
        
        Re: (Score:1)
        
        by GofG ( 1288820 ) writes:
        
        I'm only one off of a palindrome. I don't think it's possible to be any closer.
      - Re: (Score:3)
        
        by Xtifr ( 1323 ) writes:
        
        Probably. I found a copy of my first-ever homepage, which actually predated Geocities, and was probably even more useless than your average Geocities page. :)
      - Re: (Score:2)
        
        by mcgrew ( 92797 ) * writes:
        
        Nope, as good as Archive.org is, most of the pre-2000 stuff is gone forever. Much of my old gaming site is there, but not all of it. The only surviving page of Janet "Kneel" Harriott's Yello There is one I posted on my gaming site. Very little of mcgrew.info survives.
    - Re: (Score:2)
      
      by pongo000 ( 97357 ) writes:
      
      They have over 1.5 million unique audio files in the Live Music Archive alone.
      Since they can't be copied per the terms of the TOS, what good do they serve? Why bother counting something you technically can't access?
      - Re: (Score:2, Insightful)
        
        by Anonymous Coward writes:
        
        Because eventually they WILL be accessible when copyright runs out. But if nobody other than the 'rightsholders' has copies, that won't matter: they could trivially remaster them, destroy the originals so they could never get out, and then hold copyright over the remasters for another century.
        
        Re:Relevance of byte count (Score:5, Funny)
        
        by Raenex ( 947668 ) writes: on Sunday October 28, 2012 @05:48AM (#41795205)
        
        when copyright runs out
        Thanks for the laugh.
        
      - Re: (Score:2)
        
        by Xtifr ( 1323 ) writes:
        
        I think you must be looking at the wrong part of the Archive. Everything in the Live Music section and the Netlabels section is public domain or licensed under a CC license or equivalent. The media collections are separate from the Wayback Machine.
      - Re: (Score:2)
        
        by mcgrew ( 92797 ) * writes:
        
        Since they can't be copied per the terms of the TOS
        What are you talking about? A lot of friends of mine host their music on Archive.org.
  - Re: (Score:2)
    
    by girlintraining ( 1395911 ) writes:
    
    If only one of those files is an MP3, the RIAA is going to have an orgasm.
    Correction: Evilgasm.
- Re: (Score:1)
  
  by dohzer ( 867770 ) writes:
  
  4kB should be enough for anyone.
- Re: (Score:3)
  
  by Mikkeles ( 698461 ) writes:
  
  They're exaggerating; I know there are only 256 bytes, so I think they're counting duplicates!
- Re: (Score:2)
  
  by jc42 ( 318812 ) writes:
  
  How much of that is porn, I wonder.
  Actually, only about half. The other half is lolcats.
- Re: (Score:1)
  
  by bikubarat ( 2757441 ) writes:
  
  70%
Balinese, huh? (Score:2, Funny)

by Anonymous Coward writes:

Well, I guess they didn't have time to write much, being busy dealing with Orcs and Balrogs.
What about the Thorinim?
- Re: (Score:2)
  
  by K. S. Kyosuke ( 729550 ) writes:
  
  Hey, be glad it's not written in Palinese. Now *that* would have been nasty.
Indeed! (Score:3, Funny)

by Frosty Piss ( 770223 ) * writes: on Saturday October 27, 2012 @11:35PM (#41794139)

And nothing of value was saved...

Yes, but... (Score:5, Funny)

by Lordfly ( 590616 ) writes: on Saturday October 27, 2012 @11:43PM (#41794181) Journal

I need a car analogy about the Library of Congress before I can understand that number.

- Re: (Score:2)
  
  by MangoCats ( 2757129 ) writes:
  
  It's like the Library of Congress stuffed floor to ceiling with Service Manuals?
- Re:Yes, but... (Score:4, Interesting)
  
  by Squeeself ( 729802 ) writes: on Sunday October 28, 2012 @12:25AM (#41794317)
  
  I know this was in jest, but in this case, unlike so many other times this joke is made, it's slightly relevant. A quick Google turned up the following incomplete info http://www.quora.com/Library-of-Congress/How-much-data-does-the-library-of-congress-actually-represent [quora.com] which states tape storage capacity of the Library of Congress circa 2011 at 4.5 petabytes. The answer, then, is that this is approximately 2 Library of Congresses of data, which is just a tad bit much to fit in the trunk of your car. It's going to take a few trips to the Library and back to move that data around.
  
  - Re: (Score:2)
    
    by Voyager529 ( 1363959 ) writes:
    
    this is approximately ~2 Library of Congresses of data, which is just a tad bit much to fit in the trunk of your car. It's going to take a few trips to the Library and back to move that data around.
    In books, yes. In 32GByte MicroSD cards, it might be possible to do it in one trip with a large enough vehicle.
- Re: (Score:1)
  
  by thygate ( 1590197 ) writes:
  
  Well, it's about 30 libraries of congress and 3 SUVs, plus or minus a minivan.
  - Re: (Score:2)
    
    by jimmydevice ( 699057 ) writes:
    
    At what tape density?
- Re: (Score:3)
  
  by deblau ( 68023 ) writes:
  
  If you live in Vancouver, it's roughly the number of nanometers you would cover on a round trip drive to the Library of Congress.
- Re: (Score:2)
  
  by oodaloop ( 1229816 ) writes:
  
  It's roughly 13,000 VW Beetles filled with telephone books.
- Re: (Score:1)
  
  by metalmaster ( 1005171 ) writes:
  
  You mean like Hurricane Sandy?
- Re: (Score:1)
  
  by ixnaay ( 662250 ) writes:
  
  The First Council of the Druids will find a way to recover the data.
  - Re: (Score:2)
    
    by Voyager529 ( 1363959 ) writes:
    
    The First Council of the Druids will find a way to recover the data.
    And when they do, they will be known as the Disk Druids.
Indispensable reference for slashdotters (Score:5, Insightful)

by guttentag ( 313541 ) writes: on Sunday October 28, 2012 @12:05AM (#41794255) Journal

For instance, note the archived film [archive.org] "Dating: Do's and Don'ts" (1949) It begins thus:
How do you choose a date? Whose company would you enjoy?

Well, one thing you can consider is looks. Woody thought of Janice and how good looking she was. He'd really have to rate to date her. Yes, he'd enjoy that, except... Well, it's too bad Janice always acts so superior. She'd make a fellow feel awkward and bored.

Well, perhaps someone who doesn't feel so superior. There's Betty. And yet, it just doesn't seem as if she'd be much fun.

What about Anne? She knows how to have a good time, and how to make the fellow with her relax, too. Yes, that's what a boy likes.
Yes, the Internet now provides everything you ever needed to know but were afraid to ask.

- Re: (Score:1)
  
  by Anonymous Coward writes:
  
  How do you choose a date? Whose company would you enjoy? ... Well, it's too bad Janice always acts so superior. She'd make a fellow feel awkward and bored. ... What about Anne? She knows how to have a good time, and how to make the fellow with her relax, too. Yes, that's what a boy likes.
  Yes, Janice, get your head out of your ass. You could take a few tips from Anne -- she's a pro.
- Re: (Score:1, Informative)
  
  by Aldanga ( 1757414 ) writes:
  
  Incorrect. A kibibyte is 1024 bytes, while a kilobyte is 1000 bytes. [wikipedia.org]
  I don't usually care enough to point out the distinction, but since you did, I figured a correction was appropriate.
  - - Re: (Score:2)
      
      by 91degrees ( 207121 ) writes:
      
      If you are referring to storage sizes in relation to computers, be it RAM, disk sizes, etc., it is correct to express them in powers of 2.
      No it's not. It's sometimes convenient to do so, especially for RAM, but the prefixes used are defined by the SI and recognised by a large number of international organisations including the IEEE.
      
      Yes, marketing people find this useful. But it's also recognised as correct by many engineers. It's actually quite useful. Using a certain type of modulation, a 1KHz signal
      - Re: (Score:1)
        
        by 91degrees ( 207121 ) writes:
        
        For storage sizes they're defined based on powers of two, which overrides the SI definition because more specific rules always override more general ones.
        Which rules? Where is a kilobyte defined as 1024 bytes by any organisation with any influence?
        
        And why base it on powers of two? It's illogical. The only time you're forced into a power of two is in the address space available to a CPU.
All of which is rather useless... (Score:5, Interesting)

by pongo000 ( 97357 ) writes: on Sunday October 28, 2012 @12:30AM (#41794335)

...since the TOS specifically prohibits copying data from the site:
"Our terms of use specify that users of the Wayback Machine are not to copy data from the collection. If there are special circumstances that you think the Archive should consider, please contact info at archive dot org. "
Warrick hasn't been taking new requests for months (and I'm sure it's more of a research tool than an actual service for the public), and the site effectively blocks attempts to back up data using wget. It makes me wonder who (or what) this archive really serves, because it's most certainly not the general public.

- Re: (Score:1)
  
  by happyscientist ( 2508556 ) writes:
  
  Besides the fact that it is not open, how are you supposed to compute against it? Download the 80TB and run it on your private data center? It should be fully open and available on a platform like AWS so people can actually use it.
- Re: (Score:3)
  
  by Xtifr ( 1323 ) writes:
  
  A) You can read it just like you can read normal webpages on the main web, most of which also don't allow you to copy them.
  B) The Archive is more than just the Wayback Machine. They also have what is almost certainly the world's largest digital collection of public domain and CC-licensed media files in their media collections.
  - Re: (Score:1)
    
    by Anonymous Coward writes:
    
    On A: reading webpages IS copying them. Any attempt at distinction, given the technical details, is INSANE.
- Re: (Score:1)
  
  by maxwell demon ( 590494 ) writes:
  
  No, -h always takes the largest applicable unit. Thus it would report 9 Petabytes. No wait, 8 Petabytes, because it always rounds down.
They Should Copy All Of The Web Site (Score:2)

by eugene ts wong ( 231154 ) writes:

I have never understood why the few archive sites I have been to never back up the entire web site, instead of just a few important pages and images. I can understand not accessing pages that are supposed to be secure, but all other pages should be fair game. This is most important for product knowledge. Sometimes a company takes down its site and images. It would be nice to have an archive to go to.
- Re: (Score:1)
  
  by fustakrakich ( 1673220 ) writes:
  
  Very confusing [illinois.edu]
looks like you forgot to add '-h' switch (Score:4, Insightful)

by Anonymous Coward writes: on Sunday October 28, 2012 @01:11AM (#41794471)

10,000,000,000,000,000 Bytes = 8.88 Petabytes
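For anyone who wants to check the arithmetic, here is a minimal sketch of the two conversions being argued about in this thread. The function names are made up for illustration; the only inputs are the headline byte count and the standard definitions of the decimal prefix (1 PB = 10^15 bytes) and the binary prefix (1 PiB = 2^50 bytes).

```python
# Sketch: express the headline byte count in decimal (SI) and binary (IEC) units.
# Function names are illustrative, not from any tool mentioned in the thread.

HEADLINE_BYTES = 10_000_000_000_000_000  # 10^16, the figure in the headline


def in_petabytes(n_bytes: int) -> float:
    """Decimal petabytes: 1 PB = 10**15 bytes."""
    return n_bytes / 10**15


def in_pebibytes(n_bytes: int) -> float:
    """Binary pebibytes: 1 PiB = 2**50 bytes."""
    return n_bytes / 2**50


if __name__ == "__main__":
    print(f"{in_petabytes(HEADLINE_BYTES):.2f} PB")   # 10.00 PB
    print(f"{in_pebibytes(HEADLINE_BYTES):.2f} PiB")  # 8.88 PiB
```

So the 8.88 figure is right as a number, but only if the unit is read as pebibytes rather than petabytes.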

- Re: (Score:3, Informative)
  
  by Anonymous Coward writes:
  
  looks like you forgot to spell pebibytes correctly
- Re: (Score:1)
  
  by AmiMoJo ( 196126 ) writes:
  
  They should have made a pebibyte 1,000,000,000,000,000. Trying to redefine petabyte was stupid.
  - Re:looks like you forgot to add '-h' switch (Score:5, Informative)
    
    by Anonymous Coward writes: on Sunday October 28, 2012 @05:45AM (#41795199)
    
    You have that backwards, kilo, mega, giga, tera and so forth are base ten prefixes and have been for quite a bit longer than people have been misusing them to refer to base 2 numbers. As such it made more sense to leave it consistent with everything else and make a new prefix for the binary numbers.
    
    - - Re: (Score:3)
        
        by swillden ( 191260 ) writes:
        
        As such it made more sense to leave it consistent with everything else and make a new prefix for the binary numbers.
        In the context of storage sizes they were well-established with the binary-based definitions, so changing them to be decimal-based isn't "leaving" anything.
        Not true. In the context of storage sizes they were well-established to be base-10 definitions from the dawn of the computer age up until the 1980s or so. Only in the last 30 years or so have we started using powers of two units, and then only for RAM. Up until then, RAM was measured in powers-of-10 words, and in disk-based storage base 10 was and still is the norm. Network data rates likewise are and always have been in powers-of-10 units.
        This is why it's useful to be careful to use the proper prefix
  - - Re: (Score:1)
      
      by maxwell demon ( 590494 ) writes:
      
      No. If they were to rename the decimal prefixes they would have to call it peDEbyte. Bi stands for binary after all. Incidentally, pede is French for "gay" in the sense of homosexual.
      Is there another meaning of "gay"?
      - Re: (Score:2)
        
        by mcgrew ( 92797 ) * writes:
        
        Is there another meaning of "gay"?
        Within living memory (my own memory in fact) "Gay" didn't mean "homosexual", it meant happy and carefree. The Christmas song "Deck the Halls" isn't about transvestites ("Don we now our gay apparel").
        
        Re: (Score:2)
        
        by sysrammer ( 446839 ) writes:
        
        Yeah, when I read through my old Heinlein, I get a chuckle nowadays. He tended to use gay fairly often...the old definition, of course.
Domain parkers deleting archives (Score:5, Informative)

by linebackn ( 131821 ) writes: on Sunday October 28, 2012 @01:25AM (#41794515)

I don't know if they have done anything about this recently, but there was a problem with domain parking sites putting up a robots.txt that instructs Archive.org to delete or suppress any archives of the site that was there previously. I have run into a few sites like that. If someone dies and their site goes with them, it isn't right for some squatter to remove their work from history.
And I wish I could pull up historic copies of the original altavista.digital.com.
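To make the robots.txt mechanism concrete, here is a rough sketch using Python's standard `urllib.robotparser`. It only shows how a blanket `Disallow` rule reads to a crawler that honors robots.txt. The user agent name `ia_archiver`, the sample robots.txt contents, and the example.com URL are assumptions for illustration, and the helper name `archiver_allowed` is hypothetical. How Archive.org actually applies such rules to already-archived pages is not shown here.

```python
# Sketch: how a robots.txt put up by a new domain owner reads to a crawler.
# Assumptions: the "ia_archiver" user agent name, the sample file contents,
# and the example.com URL are illustrative, not taken from a real parked site.
from urllib.robotparser import RobotFileParser

PARKED_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def archiver_allowed(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # The blanket Disallow covers every path on the site, old pages included.
    print(archiver_allowed(PARKED_ROBOTS_TXT, "http://example.com/old-page.html"))
```

The rule says nothing about who wrote the pages it now covers, which is the whole problem: the file speaks for the domain, not for the previous owner.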

Download Link? (Score:5, Interesting)

by mysidia ( 191772 ) writes: on Sunday October 28, 2012 @01:27AM (#41794521)

How nice of them to do the archiving and release such a large dataset.
Where can I download the file?

My Poor Infringed Copyright!! (Score:5, Funny)

by TechyImmigrant ( 175943 ) writes: on Sunday October 28, 2012 @01:34AM (#41794541) Homepage Journal

It looks like they've copied my website and are therefore infringing my copyright.
But I won't be suing them because I don't mind, because I'm not Apple.

- - Re: (Score:1)
    
    by archen ( 447353 ) writes:
    
    I'd be very interested in that. One thing I've started to wonder about is what will happen to my website after my death. Archive.org stopped archiving changes on my site in 2005 and it only did a so-so job of capturing things anyway. Ages after I'm gone, it's likely http websites may simply have gone away. I've started looking into services that will preserve my site for historical reasons, but I'd feel a lot better having it among a dedicated catalogue in a historical preservation.
- Re: (Score:2)
  
  by icebraining ( 1313345 ) writes:
  
  If you want to keep something private, maybe you shouldn't make it available to everyone on the Web?
What the hell (Score:5, Interesting)

by nuckfuts ( 690967 ) writes: on Sunday October 28, 2012 @03:46AM (#41794907)

are they using for backups?

- Re: (Score:3)
  
  by rubycodez ( 864176 ) writes:
  
  More disks, and they send a copy to euroarchive and the Library of Alexandria. In 2006, that copy & verify process to the remote site took two weeks.
  http://www.enterprisestorageforum.com/technology/features/article.php/3633256/The-Wayback-Machine-From-Petabytes-to-PetaBoxes.htm [enterprise...eforum.com]
- Re: (Score:2)
  
  by SlashDread ( 38969 ) writes:
  
  Why, the internet of course.
  - Re: (Score:2)
    
    by Trogre ( 513942 ) writes:
    
    Isn't that a bit like doing a Google search for "google" ?
Just fucking say Petabytes. (Score:3)

by Arancaytar ( 966377 ) writes: <arancaytar.ilyaran@gmail.com> on Sunday October 28, 2012 @04:21AM (#41794987) Homepage

I know the prefix invokes unpleasant connotations, but it also means 10^15.

- Re: (Score:2)
  
  by Arancaytar ( 966377 ) writes:
  
  (This is in reference to the headline.)
- Re: (Score:2)
  
  by tehcyder ( 746570 ) writes:
  
  I know the prefix invokes unpleasant connotations, but it also means 10^15.
  When I see the word "peta" I think of naked supermodels in public protesting about animals, or something. Call me superficial but I'm prepared not to worry about the animals they're insulting if I get to see more naked supermodels.
- Re: (Score:2)
  
  by tehcyder ( 746570 ) writes:
  
  They should print it all off, for safekeeping.
  We would then be able to get a more realistic Libraries of Congress measurement.
Where's my page then? (Score:3, Informative)

by AndyKron ( 937105 ) writes: on Sunday October 28, 2012 @06:56AM (#41795395)

With all those pages stored why does it always tell me that page can't be found?

Moar Pics! (Score:2)

by CodeheadUK ( 2717911 ) writes:

Shame about the lack of images*, archive.org is the only remaining evidence of Cliff Bleszinski's Cat-Scan.com [archive.org]. The site doesn't have the same comedy value without all the scans of squished cats.
*Yes, yes, I know that archiving images would require many extra fucktons of storage, but it would be worth it in some cases.
Private archive (Score:4, Interesting)

by fa2k ( 881632 ) writes: <pmbjornstad@noSPAm.gmail.com> on Sunday October 28, 2012 @11:08AM (#41796467)

It's great that archive.org is doing this, but it's such an important part of history that I thought I would do a mini-version for the pages I visit, just to be able to refer back to stuff. I've been using the Firefox addon called Shelve to save all pages I visit on my home computer for about 2 months now (at most one version for each day). It's a total of 5.8 GB. It's not useful for browsing though; I'd love it if it was better integrated with Firefox such that I could choose among all versions of each page. There's sometimes some excellent information on university pages or cheap hosting that could be 10 years old, and you never really know how long it's going to stay up.
Anyway, this may give some perspective too; 2 months of daily snapshots of slashdot, other news, some tech stuff and a little Facebook takes just 5.8 GB.

- Re: (Score:2)
  
  by fa2k ( 881632 ) writes:
  
  It's a total of 5.8 GB.
  Seems I forgot the most important part: It's a total of over 6,000,000,000 bytes!!1
What file system are they using (Score:2)

by QuietLagoon ( 813062 ) writes:

What OS and file system are they using to store all that data?
- Re: (Score:1)
  
  by maxwell demon ( 590494 ) writes:
  
  Given how incomplete the stored sites are, I guess most of the data is stored on /dev/null.
- Re: (Score:2)
  
  by tehcyder ( 746570 ) writes:
  
  10 Petabytes of information is insignificant. My corporate network has that much data, and backs up several hundred Terabytes nightly.
  data!=information
