Distributed Internet Backup System

deadfx writes "Since disk drives are cheap, backup should be cheap too. Of course it does not help to mirror your data by adding more disks to your own computer because a fire, flood, power surge, etc. could still wipe out your local data center. Instead, you should give your files to peers (and in return store their files) so that if a catastrophe strikes your area, you can recover data from surviving peers. The Distributed Internet Backup System (DIBS) is designed to implement this vision."

  • by caluml ( 551744 ) <slashdot@@@spamgoeshere...calum...org> on Friday January 31, 2003 @12:08PM (#5196541) Homepage
    The main problem with this approach (and for that matter Freenet) is that it is slow for all but the smallest files.

    Bandwidth is still the most precious commodity in computing. Once we get fibre to every house, then distributed storage will make sense.
    • by nano2nd ( 205661 ) on Friday January 31, 2003 @12:15PM (#5196618) Homepage
      You're right in that today's infrastructure isn't made for chuffing massive, hard-drive-sized hunks of data back and forth.

      But what about incremental backups?

      OK so you've got to get your base image uploaded -somehow- but after that, data changes very little on a daily basis and this level of data transfer to some secure backup repository won't be a problem at all with current bandwidth.
    • Maybe true at home, but not true for campus networks w/ Gb+ infrastructure in place.
    • by gmuslera ( 3436 ) on Friday January 31, 2003 @12:20PM (#5196656) Homepage Journal
      For internal networks where you have a lot of fast, well-connected servers, sparing a bit of bandwidth and disk space for a distributed backup across the LAN could be useful, especially when you can back up server data to workstations and so on.
    • Depends how much you need to back up.
      For home use it could be very useful, especially
      if you only back up changes (like rsync).

      The important stuff is things like:

      1. Your digital photo album.
      On average it probably grows >1 MByte/day.

      2. Personal email and documents.
      A few 100 KByte/day if you use an efficient document format
      and don't receive movies as attachments.

      3. System settings, list of installed software etc.
      Very small updates.

      By important I mean stuff you would be missing the day
      your house burns down.
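
To put the daily figures above in perspective, here is a rough back-of-the-envelope sketch of how long such an incremental upload would take. The uplink speeds and the 0.3 MB reading of "a few 100 KByte" are assumptions for illustration, not numbers from the post.

```python
def upload_seconds(megabytes: float, uplink_kbit_per_s: float) -> float:
    """Seconds needed to send `megabytes` (10**6 bytes each) over the uplink."""
    bits = megabytes * 1_000_000 * 8
    return bits / (uplink_kbit_per_s * 1_000)


# Daily change set: ~1 MB of photos plus ~0.3 MB of mail and documents.
# The 0.3 MB figure is an assumed reading of "a few 100 KByte".
daily_mb = 1.0 + 0.3

# Assumed upstream rates; real lines vary a lot.
uplinks = [("56k modem, ~33.6 kbit/s up", 33.6), ("broadband, 128 kbit/s up", 128.0)]

for name, kbit in uplinks:
    minutes = upload_seconds(daily_mb, kbit) / 60
    print(f"{name}: {minutes:.1f} min per day")
```

Under these assumptions the daily delta is a matter of minutes even on a modem; it is the initial full upload that hurts.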

    • Most PCs come with a recovery CD. Back up across the net only the stuff that isn't on the recovery CD. (Globally attrib everything as backed up when the PC is installed, and do incremental backups.)

      An alternative for home-built PCs: burn two CD-RW backup sets on alternate weeks, storing the previous week's collection at a buddy's home, in a safe deposit box, or in some other secure location, and do daily incremental backups online, with a discard option for any backup over two weeks old.

      One option with the collection of CD-RWs would be to keep them with whoever provides your online storage, so the CD-RWs could be put online to download across a broadband connection. This would be faster than overnight delivery, but not as fast as a courier across town.

      Just some ideas.

      • Ideally, you should be able to make your computer fail *COMPLETELY* and still be able to recover completely. The distributed backup plan seems to have different specific advantages for two specific groups of home users, but has the same overall beneficial results.

        For the average Joe with only one computer running that ancient copy of Windows98 on a P133, the massive amount of data-cruft is bound to be the weakest point of upgrading or even backing up. I've found that most families only have that one computer, and only have the option of backing up onto floppies. Usually their data can fit on one or two CDR/CDRW discs, but their system is also usually too old to get a CD burner to work reliably. In addition, they're just too stingy with the purse-strings to shell out the $100 or so for a decent, middle-of-the-pack drive, anyway. Sending critical data over the internet might be a better option, if a bit more time-consuming (no broadband, only 56k modem). Frequent backups like this have the potential to be substantially more reliable, not to mention far easier, than a pile of floppies, as you're ideally only sending the new data. I can't tell you how often I wished for something like this when working on a friend's/family's system across town and away from my own network.

        And that brings me to my second group that can really take advantage of something like this: Power-users with a small network running at home. My network has a file-server that stores *EVERYTHING* on it for backup purposes. It's got ISO's of all my software and OS's, drivers, stand-alone programs, documents, and media files. Currently, there's about 80GB of data on there. Backing up that data is a Travan-5 drive (10GB/tape, native) and 9 cartridges. At about 3 hours per tape, backing up to 9 TR-5 tapes takes days, not hours. There's two additional tapes for backup of the server's OS and configuration and it easily fits on one tape. But if there are any significant changes to the system, I rotate the tape so that there's always a working copy in case things go terribly wrong. That's a total of 11 tapes. They're not exactly cheap, but it's probably the least expensive backup I can find right now without going to removable HDs (I'm avoiding that solution as HDs are, in my opinion, less reliable and durable than tapes). Using this distributed backup plan would allow me to recover my server's OS from the single tape and retrieve the data from the network when I have time.

        The 2 desktops and 2 laptops can be fully recovered with an OS or system recovery cd and the rest is available on the server. In fact, I usually have one of each type of computer down at any given time for something-or-other. Having the data on the server allows me to blow away any of the systems I run at any time and completely recover the system to a working state in just over an hour.

        Actually, I had been setting up a distributed backup plan for my own server with some of my friends so we'd all have each others' server's backup. More accurately, the plan was to merge the changes between all the servers' data and share it between all of us in a manner similar to CVS. There's only 3 of us, but we're located all over the state and we all have broadband. 80GB of data is a large amount to initially transfer. Really, though, all we'd be transmitting is the changes we've made, which would limit the total bandwidth used. We'd probably only set it up for once per week in automatic mode to further decrease the load, with an option to manually update. In the event of a complete failure of one of the systems, there should be a copy from one of the other two servers that's no older than 1 week. As the storage requirements grow, each server can be updated with additional storage in sequence so that it recovers in a manner similar to how a RAID5 array rebuilds the data on a replaced drive.

        Unfortunately, neither of my two friends in question has the resources to afford the hardware and set up their own server to the reliability standards that I'm requiring, so it kind of fell through for now. I'm working with them on how to get everything running, and I may just maintain it for them from a remote console. They'll still host the server on their network and have access to it, of course. But the responsibility of maintaining the system may just have to lie with me.

        In short, it's not terribly difficult to implement a solution like this, but there are serious bandwidth concerns. If you're only doing this amongst your friends/peers, it's possible to mitigate the bandwidth issue by using a single removable hard disk to sneakernet the data to a fresh server. This allows for a much more reliable home network for power-users, and gives some peace-of-mind to the average user (and their power-user friends who fix their computer for them).

    • Bandwidth is still the most precious commodity in computing.

      never underestimate the bandwidth of a station wagon full of DLT tapes!

    • There is a solution to this - it is called rsync [anu.edu.au]
  • by Quarters ( 18322 ) on Friday January 31, 2003 @12:08PM (#5196547)
    I've got my terabyte array set up. Your "World of Warcraft" data will be completely secure on my backup node.

    Go ahead, send it.

    I'm waiting....
  • by ackthpt ( 218170 ) on Friday January 31, 2003 @12:08PM (#5196549) Homepage Journal
    All my data and software are backed up on crackers' computers.

    I'm not worried. %-)

  • do this with schools (Score:5, Interesting)

    by octalgirl ( 580949 ) on Friday January 31, 2003 @12:09PM (#5196554) Journal
    We do this with neighboring school districts. We also back up all buildings, over the WAN and at night, to a file on the hard drive of another building. We do this in two places, so backups criss-cross. Because of the size and time it takes, this can only happen at night and only one building per night, so there is a downside. But if a building goes down, I know I have a secondary (besides the tape in that building) to fall back on.
    • We also back up all buildings, over the WAN and at night, to a file on the hard drive of another building.

      I think this is also common among universities for registrar data. At the university I attended, there was a big-ass HP server at each corner of campus running replicated databases. A disaster would have to take out several square miles of land before all hope was lost for the data. This makes all but atomic or cosmic disasters survivable.
  • ...just share everything on a P2P network. Then, after a crash, just fire up your favorite client and go get your invaluable porn^H^H^H^H data files!
    • or... just set up a drbd [slackworks.com] "lan" mirror via vpn.

      the drawback to drbd is that it's not encrypted on the backup device. the advantage is that you can hook it up with something like heartbeat to have failover.

  • There are lots of .ISOs, MP3s and DivXs out there as my backup.
  • by Roarkk ( 303058 )

    What's with all of the "cut and paste" stories lately?

    One of the things I like about Slashdot is the different takes on existing news presented by user submissions. Lately, though, many stories seem to be just copied directly from the link's website.

  • Distributed Internet Backup System = Gnutella

    • There's a diff between gnutella and DIBS. Gnutella is pull, whereas your backup system would need to push. If I'm on Gnutella and I want my mp3s backed up, I can't guarantee that all of them will be. How many people are gonna like all the music that I like?

      And no, I don't like N'sync or Britney Spears.
  • Security? (Score:5, Interesting)

    by vano2001 ( 617789 ) on Friday January 31, 2003 @12:12PM (#5196580)
    What if it is sensitive data? Do you think even with all that cryptography and secure computing blabla people will trust storing their important files on other people's computers? I think not. There are companies who put their backups into safes ... ask *them* to put it online on a slashdot reader's PC. See what they answer. Freenet and similar networks are only good for general [public] domain data.
  • Just copy your drive to another drive, oh say an 80 gig drive or so [make sure it's at least 80 even if you're only backing up a gig of data, this is important]. Then just ship the drive to me. I'll watch over it like a hawk. Heck, I'll even throw it into my machine so that I can monitor the drive and make sure no one is tampering with the data. Also, I'll be storing my data on there as well, but you already knew that. And this way, it's distributed! So you get to use that word! I mean, "distributed computing" is an important field of research but it's starting to get to the level of mindless buzzword because non-CS people are using it so much. Certainly it'll never be as bad as the buzzwords that the software engineering managers throw around, but it's a problem.

    Hello extreme programming fans? Please leave the building.

  • by saskboy ( 600063 ) on Friday January 31, 2003 @12:12PM (#5196584) Homepage Journal
    As has been mentioned already, [no this is not redundant, because I am writing this myself] the potential for data being stolen is too great an issue to overlook. This is not a viable option because the potential for theft is too great, and no amount of encryption will make a difference. Encryption will always be broken.
    • True, but in some lights having a backup at all is insecure. But, if you have confidential information, I would not imagine you would choose this as your solution---I would envision that this would be more for stuff along the lines of personal photographs and other nostalgia...and perhaps term papers...in other words, the set of stuff that is not confidential and is not something like videos and mp3s that everyone wants to share. I guess for truly confidential data , your best security is locking your computer in a room, and your only backup would be that USB drive you carry around your neck, under your shirt, next to your gun.
    • Encryption will always be broken.

      That's not necessarily true. Algorithms can be mathematically shown to be at best brute-force crackable. With a long enough key, that could be shown to take at least as long as you're alive.

      And even if all encryption can be broken.. So? My mother's a school teacher. She works for hours every night on lesson plans for her third grade class, making sure she has dittos and lessons and things. She backs up regularly, because if the harddrive went down, *months* or years of hard work would be lost.

      It'd be nice to encrypt it if such data were sent over the net. But you know what? Who really cares? If my mom's lesson plans are decrypted, sure, maybe some enterprising third grader somewhere will get an advance peek at the next arithmetic test, but really, it makes no difference. Still, having off-site backups would be a *good* thing for my mom.

      If your data is mission-critical and MUST be kept secret, well, then you do what you have to do -- send tapes to Iron Mountain, or whatever, but for the other 90% of us, the photos of our friends, etc, are nice to have automatically backed up to some offsite node, but it really doesn't matter if somebody sneaks a peek supposing the encryption's broken.
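
The brute-force point in this comment can be made concrete with some rough arithmetic. This is only a sketch: the guess rate of 10^12 keys per second is an assumed figure for illustration, and it says nothing about attacks smarter than exhaustive search.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_exhaust(key_bits: int, guesses_per_second: float) -> float:
    """Years needed to try every key of the given length at a fixed guess rate."""
    return 2 ** key_bits / guesses_per_second / SECONDS_PER_YEAR


# Assumed attacker speed: 10**12 keys per second (hypothetical).
for bits in (56, 128):
    print(f"{bits}-bit key: {years_to_exhaust(bits, 1e12):.3g} years")
```

At that assumed rate a 56-bit keyspace falls in well under a year, while a 128-bit keyspace takes vastly longer than a human lifetime, which is the sense in which a long enough key outlasts you.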
  • by Michalson ( 638911 ) on Friday January 31, 2003 @12:12PM (#5196585)
    What is to say that the FBI/RIAA won't come to your house, claiming you have terrorist information/stolen music stored on your hard drive? And assuming it was true, would you be legally/criminally liable for it? This gives a whole new meaning to the excuse "well I was just holding it for a friend".
    • by kryzx ( 178628 ) on Friday January 31, 2003 @12:20PM (#5196652) Homepage Journal
      This is actually a good question. If I back up my music files on your computer, does that fall under "fair use"? Would whether you access them or not affect the legal position? Is it possible to build something like this so my files can only be accessed, or at least can only be decrypted, by me, and hence are not usable to the person providing the disk space? If so, would that change the legal implications?

      This raises all sorts of interesting questions. Unfortunately the answer to all of these questions is most likely "we won't know until it goes to court and there is a ruling to establish precedent."

      • Unfortunately I think it would be bad *either* way. Now since "stolen music" is somewhat debatable here on /., and most people aren't too worried about being charged with terrorism, I'll try something more clear cut: kiddie pron.

        Ruling 1: You are responsible for what is on your HD. Result: Someone backs up their illegal pics to your hard drive (you don't know this because it's encrypted), and you (innocent) get charged for it and sent to jail.

        Ruling 2: You are not responsible for encrypted content that appears to have been generated by this netbackup program. Result: Every pedophile's dream has come true. They simply encrypt their stuff and spoof it to look like someone else's backup file. They are now immune from prosecution because "it's someone else's". The same applies to anyone else who wants to store something illegal on a computer system.

        Obviously there needs to be a way to positively identify who "owns" what content on your hard drive before a system like this could become [legally] safe.

      • Is it possible to build something like this so my files can only be accessed, or at least can only be decrypted, by me, and hence are not usable to the person providing the disk space?

        If you had read the DIBS introduction on the linked page, you would have seen the following:

        Note that DIBS is a backup system not a file sharing system like Napster, Gnutella, Kazaa, etc. In fact, DIBS encrypts all data transmissions so that the peers you trade files with can not access your data.

  • I have a shell script that sends the contents of a directory on my home systems to a machine of mine at a hosting company in another state, and vice versa. Cron runs it on a nightly basis.

    I always figured it was a fairly common thing for "data-conscious geeks" to do.

    Of course this is aimed at users who don't have their own off-site servers.

    • As well as doing that (using ssh and rsync), I also tar and feather up /home and a few other important directories on a nightly basis, and copy them to my iPod.

      And I gave a friend of mine an FTP account on my system so he can copy his files to my system.

      One of these days I'll get off my ass and reinstall the DLT drive that I bought off eBay. I had it working for a year or two, but I had power and heat problems on my machine and took it out.
  • With this system all other P2P networks will go bye-bye
    Why bother searching for files when I have my friend's 200GB movie and mp3 collection backed up on my machine!
    It's not copying, it's a back-up! 8)

  • by PepperedApple ( 645980 ) on Friday January 31, 2003 @12:15PM (#5196613) Homepage
    It's not so much that I wouldn't trust someone not to break the encryption, but what if the person who's holding your backup copies gets tired of giving up disk storage and just deletes the software from his/her computer. Or what if their computer happens to be off when you want to retrieve the backup?
    • what if the person who's holding your backup copies gets tired of giving up disk storage and just deletes the software from his/her computer

      That's the same as a simple failure, which the software is designed to handle anyway. What's not clear from the documentation (and I'm too pressed for time to read the code right now) is whether it does The Right Thing when a peer comes back.

    • What the system needs is the concept of a heartbeat-based contract; i.e., a line in the partner data file which says that both machines will attempt to ping each other so often (every hour perhaps, or more often if they're both always online) and that if you don't hear from each other for a certain period (say, 48 hours, a week, a month, depending on circumstances and urgency), you can assume that they're gone and nuke their data (and vice versa).

      Ideally, the ping mechanism should have some sort of cryptographic handshaking so that the other party can't falsely claim that you were offline if they prematurely delete your data. (If the data is lost, there should be a mechanism for signalling this back to the data's owner so it can be replaced or the contract ended. Perhaps a reputation-based mechanism for dealing with cheats could also be useful.)
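
A minimal sketch of the heartbeat contract described in this comment, under stated assumptions: the class and function names are hypothetical and nothing here comes from DIBS itself. Pings are authenticated with an HMAC over a shared key, which is only a stand-in for the "cryptographic handshaking" the poster asks for; because both parties hold the same key, it would not prove to a third party who sent a ping, so a real design would likely want public-key signatures.

```python
import hashlib
import hmac


def sign_ping(shared_key: bytes, sender: str, timestamp_s: int) -> str:
    """Tag a ping so the receiver can check it really came from `sender`."""
    message = f"{sender}|{timestamp_s}".encode()
    return hmac.new(shared_key, message, hashlib.sha256).hexdigest()


def verify_ping(shared_key: bytes, sender: str, timestamp_s: int, tag: str) -> bool:
    """Constant-time comparison of the expected and received tags."""
    return hmac.compare_digest(sign_ping(shared_key, sender, timestamp_s), tag)


class HeartbeatContract:
    """One line of the (hypothetical) partner data file."""

    def __init__(self, peer: str, ping_interval_s: int, grace_period_s: int, start_s: int = 0):
        self.peer = peer
        self.ping_interval_s = ping_interval_s  # how often we promise to ping
        self.grace_period_s = grace_period_s    # silence after which the peer counts as gone
        self.last_heard_s = start_s             # time of the last verified ping

    def record_ping(self, shared_key: bytes, timestamp_s: int, tag: str) -> bool:
        """Accept a ping only if its tag verifies; return whether it was accepted."""
        if not verify_ping(shared_key, self.peer, timestamp_s, tag):
            return False
        self.last_heard_s = max(self.last_heard_s, timestamp_s)
        return True

    def ping_due(self, now_s: int, last_sent_s: int) -> bool:
        """True when it is our turn to send the next ping."""
        return now_s - last_sent_s >= self.ping_interval_s

    def may_delete_peer_data(self, now_s: int) -> bool:
        """True once the peer has been silent for longer than the grace period."""
        return now_s - self.last_heard_s > self.grace_period_s
```

With an hourly ping and a 48-hour grace period, `may_delete_peer_data` stays false as long as a verified ping arrived within the last two days. The reputation mechanism for cheats mentioned above is not modelled here.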
  • It's not a bad idea. The website talks more about security (PGP) and such, which would be my primary concern. (My porn, not theirs...)

    Seriously, though... Just as with P2P networks, it depends on a strong, diverse, and reliable mesh. Any natural disaster, bandwidth failure, or even power failure could wipe out most, if not all, of your peer backups. Tried and true remains for me.

  • affordable jukeboxes.

    People should be able to burn DVDs and have a keg-refrigerator sized juke box with a few hundred of these in it hooked up as a near-line SCSI device.

    You CAN get these but the cheap ones are 25 grand.

    Anyone know why they're so expensive? I'd love a non-volatile terabyte or two.

  • I certainly know my company would never give its confidential data to others to back up ... and isn't that the most important type of data?

    The obvious solution is to encrypt. BUT ... how long will the encryption of today last? If I have plans for a product that will last 15 years, I don't want the plans out there to be decrypted in 10. Also... where do I store my decryption key? If that gets lost, I might as well have no backup at all.
    • Compress it, encrypt it, split it into 1K chunks, and interleave it among backup servers indexed by hash value. Cracking the encryption and getting anything useful out of it will depend on knowing where each chunk belongs. The low-redundancy compressed plaintext will also help to make cryptanalysis difficult.
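
The pipeline in this comment could be sketched roughly as follows. Everything here is illustrative: the function names and the in-memory "servers" are made up, and the encryption step is deliberately left as a comment because Python's standard library has no suitable cipher (something like GPG would slot in between compressing and chunking).

```python
import hashlib
import zlib

CHUNK_SIZE = 1024  # the "1K chunks" from the comment


def split_chunks(blob: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut a byte string into consecutive pieces of at most `size` bytes."""
    return [blob[i:i + size] for i in range(0, len(blob), size)]


def backup(data: bytes, servers: list):
    """Compress, chunk, and spread chunks over servers chosen by chunk hash.

    Returns (manifest, stores). The manifest records chunk order and must stay
    with the owner; without it a server only sees unordered fragments.
    """
    packed = zlib.compress(data)
    # Encryption of `packed` would happen here (omitted in this sketch).
    stores = {name: {} for name in servers}
    manifest = []
    for chunk in split_chunks(packed):
        digest = hashlib.sha256(chunk).hexdigest()
        server = servers[int(digest, 16) % len(servers)]
        stores[server][digest] = chunk
        manifest.append((server, digest))
    return manifest, stores


def restore(manifest: list, stores: dict) -> bytes:
    """Fetch chunks in manifest order, reassemble, and decompress."""
    packed = b"".join(stores[server][digest] for server, digest in manifest)
    return zlib.decompress(packed)
```

Note that the manifest becomes as precious as the decryption key: lose it and the chunks cannot be reordered, which is the same key-storage worry raised in the parent comment.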
  • And what if (Score:4, Interesting)

    by Apparition-X ( 617975 ) on Friday January 31, 2003 @12:17PM (#5196632)
    I grant that personal backup is time consuming and it is tough to find a good method without resorting to expensive tape or hundreds of CDs. But as intriguing as this approach is, there seems like a lot of problems with it.

    What if the reason you need to do a recovery is because your system with internet access is toast? How long does it take to restore several hundred thousand files? What about peers that drop off the network, or that are only on sporadically (no, that never happens in peer to peer filesharing networks!).

    Even aside from the issues of speed of restoration, I can't imagine too many circumstances in which you want to rely on an internet network connection as a prerequisite for a successful restore... Although perhaps as a way of complementing existing backup methodologies (i.e. backup root and critical config information to tape or CD, and the rest of your schiznit to DIBS) this might have a place.
  • by mao che minh ( 611166 ) on Friday January 31, 2003 @12:19PM (#5196647) Journal
    I hereby volunteer to aid in the storage and backup duties of everyone's data that has at least three instances of the letter "x" or the string "britney" within its file name. This is because my backup scripts only save files that satisfy these requirements. In return, could someone please help me store my vast collection of Star Trek bloopers? It's just funny to hear Patrick Stewart cuss.

    Additionally, I extend a warm hand of support to Microsoft. I will accept any request by chairman Bill Gates to store sensitive files.

  • "Note that DIBS is a backup system not a file sharing system like Napster, Gnutella, Kazaa, etc. In fact, DIBS encrypts all data transmissions so that the peers you trade files with can not access your data."

    as much as the page says it isn't a file sharing system, it essentially is - a special-purpose, secure file-sharing system. as a p2p developer, i know that this system could be built off gnutella and benefit from some of the innovations occurring in gnutella land.
  • by 4/3PI*R^3 ( 102276 ) on Friday January 31, 2003 @12:25PM (#5196687)
    This is just the next evolutionary change in P2P. Encrypting data and exchanging the encryption key so that only those "in the know" can exchange files and the *AA groups don't know what you are trading.

    In the "Perfect Example of Talking Out of Both Sides Of Your Mouth" Department:

    This is posted on the home page:
    Note that DIBS is a backup system not a file sharing system like Napster, Gnutella, Kazaa, etc. In fact, DIBS encrypts all data transmissions so that the peers you trade files with can not access your data.[emphasis mine]

    This is posted on the documentation page:
    Make sure you give your gpg public key to any peers you want to trade files with.[emphasis mine]
    • First, I think you're misunderstanding the point of DIBS... a public key is required to encode, but doesn't do any good for decoding, so giving someone your public key only allows them to give you things you could decode.

      I wouldn't read too much into the fact that they say you're "trading files"... because that is, after all, what you're doing, even if you can't read the files that you received in trade.

      On the P2P thing, I'm not sure public key cryptosystems would be advantageous at all. First off, the public keys would uniquely identify the participants. On the other hand, if a P2P client were to generate its own keys, then it would be trivial for authorities to join the network and see the traffic unencrypted.

      There might be interest in "private" P2P, but that kind of defeats the purpose of P2P, right? Getting files from unknown sources and searching millions of clients worldwide?

      Napster would have been boring if it were just me and my friends.
  • by wfrp01 ( 82831 ) on Friday January 31, 2003 @12:26PM (#5196702) Journal
    Some nice folks at Stanford are also creating a different flavor of network backup called rdiff-backup. I'll just plagiarize the description from the homepage:

    rdiff-backup backs up one directory to another, possibly over a network. The target directory ends up a copy of the source directory, but extra reverse diffs are stored in a special subdirectory of that target directory, so you can still recover files lost some time ago. The idea is to combine the best features of a mirror and an incremental backup. rdiff-backup also preserves subdirectories, hard links, dev files, permissions, uid/gid ownership (if it is running as root), and modification times. Finally, rdiff-backup can operate in a bandwidth efficient manner over a pipe, like rsync. Thus you can use rdiff-backup and ssh to securely back a hard drive up to a remote location, and only the differences will be transmitted.

    The homepage also links to a project called duplicity [nongnu.org], which operates on a similar principle, but uses GnuPG to encrypt data to prevent spying/modification.
    • Some nice folks at Stanford are also creating a different flavor of network backup called rdiff-backup.

      Yes, it looks like a great solution, I've been looking into this lately. The only downside is that the remote system can access and change your data as it's not encrypted. The actual communication of the data can be wrapped in SSL or through a SSH tunnel, so that part is secure.

      You can only use it amongst people you trust, for non-personal data storage (unlike the linked article). I am presently trying to persuade a friend to implement this, or possibly rsync, to back up my large media drive.

      With rsync, however, we get another advantage...we both can access and add to the data store, with confidence that the data is pretty safe. Who needs p2p when you have all your friends' media available to you as well as your own? ;-) Get a new album, it gets copied across at some point during the night.

    • I use both rdiff-backup and duplicity, and I can't speak highly enough of them. Top notch software. Easy to use, well documented, and with great functionality.
  • by fudgefactor7 ( 581449 ) on Friday January 31, 2003 @12:27PM (#5196706)
    It's been discussed (and even tried) before; the problems were many, namely security, speed, and availability. One cannot guarantee any of those three very important variables. As a result it (the idea) died a horrible death--let's hope it dies again.
  • Fault Tolerance? (Score:3, Informative)

    by nlinecomputers ( 602059 ) on Friday January 31, 2003 @12:28PM (#5196712)
    I haven't had the chance to read the article yet. Just skimmed the site. How fault tolerant is this? What happens if I need my data and a chunk is on a member that is offline? Is the data stored redundantly?

  • This sounds good, except that mirrors of my massive pr0n collection could threaten the stability of the internet...nevermind the threat of uploading mine and the millions of other pervs out there!
  • by angry_beaver ( 458910 ) on Friday January 31, 2003 @12:29PM (#5196721)
    This should work a little differently.
    Why not stripe your data across many hosts, with parity data being stored on several? A central server would maintain a list of servers containing your data. In the event of a failure, you would simply fire up the client, which would contact this server for a list of your backup "devices" and start pulling in, reconstructing and decrypting the data.
    This would have a couple bonuses...

    1) You could stripe it across 100 machines, and have another 100 with parity data so that any 50% of the machines can be unavailable and you can still get your data back.

    2) Security - Rather than having a full copy of your data on their machine, each node only has a small subset of your data, and does not know where to find the rest of the data making reconstruction nearly impossible for the storage node. GPG would be used on top of this.
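
The striping idea above can be illustrated with the simplest possible parity scheme. This is a sketch only, with hypothetical function names: a single XOR parity block (as in RAID 4/5) survives the loss of exactly one node. Tolerating "any 50%" of nodes being unavailable, as the post proposes, would need a real erasure code such as Reed-Solomon, which this is not.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_with_parity(data: bytes, n_data: int) -> list:
    """Split data into n_data equal stripes (zero-padded) plus one parity block."""
    stripe_len = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(stripe_len * n_data, b"\0")
    stripes = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(n_data)]
    parity = reduce(xor_bytes, stripes)
    return stripes + [parity]


def recover(blocks: list, original_len: int) -> bytes:
    """Rebuild the data when at most one block (marked None) is missing."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one missing block")
    blocks = list(blocks)
    if missing:
        present = [block for block in blocks if block is not None]
        # XOR of all surviving blocks equals the missing one.
        blocks[missing[0]] = reduce(xor_bytes, present)
    return b"".join(blocks[:-1])[:original_len]
```

Each block would live on a different host, and the central server in the proposal would only need to remember which host holds which block index.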
    • You could call this RAIN (Redundant Array of Internet Nodes)!!!

      But I kind of like sticking with RAID, since this is exactly what the Feds will do to every person who participates in the scheme before they send you to federal "pound you in the ass" prison.
    • Yeah. this is kinda what I was alluding to in another post. Let's just pray that nobody compromises this central server - they would own us all.
      • The data will be encrypted with a key that you have to safeguard (CD, floppy, hardcopy, ...) before it is backed-up.
        The central server only knows where the bits 'n pieces are stored of your encrypted data, but it does not ever get the key to decrypt it. The worst that could happen when the server is compromised is that somebody else could get the full encrypted datastream, which is only a bit more useful than polling /dev/random

  • are what you give away.

  • And I don't want anyone else to have mine.

    What if you back up something illegal?

    I can keep all my files on CD-R's, CD-RW's, or DVD-R's.

    (not including MP3's movies etc stuff I can always get again)

    Hell I could keep them on Zip's if it weren't for some graphics I want to save.

    Just back up your data, you can reinstall your programs and OS later. Tarball your project files and burn them to a CD. Most projects will fit on a CD assuming you're not a photographer.
  • ... to an enterprise with multiple locations.

    Suppose you have corporate offices, an office on the other coast, and locations in 5 colos.

    With this, you could set up a distributed backup so that important files are distributed over all 7 sites. Since all these sites are yours, security is not such an issue.

    The biggest problem I see is that you have to put files in a specific directory to back them up. You'd have to write scripts to, say, back up a rarely changing database stored on a 15 disk RAID 10.


    1.) What level of RAID equivalent is this ?
    I.e., how many sites can die and still enable you to get your data back ? (This had better be more than one _in addition to the data source_ for this to be worthwhile.)

    2.) Can this be used to _mirror_ data - i.e., can I do a distributed backup and mirror the data seamlessly on another site?

    3.) Does all of the bandwidth for my files come from me, or is that distributed too in a peer to peer fashion ?
  • ...the internet ate my homework.
    I wanted to turn in that report but in was going for the night and his/her computer crashed!

    Granted this is only for the backup, but I can not see this being worthwhile effort without having MASSIVE amounts of bandwidth to toss around.
  • Magnetic disk is always 10-20 times more expensive than archival tape or CD. The former is $1 a gigabyte (new 200GB disks) and tape is about 7 cents a GB. Both are decreasing in price in concert.

    An hour of video is about $2 on disk and 10 cents on analog video tape.
    • most large corporations would probably rather you use disk for archiving, and tape for offsite storage. nothing beats the disk-to-disk copy speed when you're restoring a production server that just crashed. having to dig through tapes, and then seeking, and searching, and rewinding, and fast forwarding, and...., and...., and then not finding all of the data because some of the data was spanned to another tape can get frustrating.

      The time taken to find the data should also be factored into the cost of a whole backup solution. The more expensive disk option will save in the long run, since less time is taken to backup and restore the data.
  • by rindeee ( 530084 ) on Friday January 31, 2003 @12:36PM (#5196779)
    It was designed for use in low-bandwidth environments. Not only do you get the benefit of a distributed backup system, but you get inherent fault-tolerance, load-balancing, etc. Yes, over a low-bandwidth connection a file still takes a long time to copy, but OpenAFS is designed to accommodate this (not going into detail here, go to the OpenAFS site if you're curious). I am a fanatic OpenAFS user so I am somewhat biased. We have however implemented OpenAFS on a 1.4TB datastore at one of our customer sites (medical market) that has key data (a couple hundred Gig) distributed to 3 slave RO cells (again, read up on OpenAFS for answers). Rock solid reliability is an understatement.
  • Since disk drives are cheap, backup should be cheap too.

    Ah, if only this were true. (Actually, it begs the question. =) Every time I hear "disk is cheap" I try to correct the speaker - "disk drives are cheap".

    Long term storage and subsequent retrieval, which implies administration and a reasonable expectation of longevity on the backup medium, can be very expensive.

    I don't think I'd trust anything valuable and volatile to a bunch of mirrors that I don't have service agreements with. Maintaining lots of data is costly, and I don't expect Joe Mirror to pay for it for me.

  • dibs vs rsync (Score:5, Interesting)

    by bromoseltzer ( 23292 ) on Friday January 31, 2003 @12:37PM (#5196789) Homepage Journal
    I peer with another system at another institution using rsync. They rsync their files to a folder on my disk, and I rsync to a folder on theirs. No encryption, but very good performance - 128 kbps DSL upload is fine, running overnight.

    This requires a lot of trust, which is OK because I'm the sysadmin at both places.

    Without trust, you need DIBS-like encryption, which (probably) means no rsync-like differential backups, and you need a "safe" way to find partners.

    How about "DIBS-raid" where your data is spread over many peers? If a peer blows up, you can still recover, and no one peer should have a recognizable piece of your data.


    This .sig donated to Poets Against the War [poetsagainstthewar.org].

  • While this might not work so well in the public domain, I can see where it could be feasible in an enterprise backup scheme.

    Basically, your client can take advantage of peers to discover places to back up your data. Peers can be local (onsite backup) and remote (offsite backup), and when peers go offline, their data can be redistributed accordingly.
  • I don't see companies using this to back up valuable/private information on the greater internet. But what about those hundreds of workstations with large hard drives that your peons are using? Use the DIBS system to back up all your shared company data. It's still all on systems you own, behind your own firewalls, etc., but it gives you untold gigabytes of backup space that is at least as fast as a decent tape backup system, but inherently cheaper.

    The IT department could distribute the daemon to all workstations, and the users of the systems aren't even required to be aware of it.

    Sounds great to me!
  • by 4/3PI*R^3 ( 102276 ) on Friday January 31, 2003 @12:40PM (#5196806)
    Redundant Internet Archival Administration (RIAA)
    Multiple Peer Access Archive (MPAA)
    Duplicate Media Copy Archive (DMCA)
  • by Kaz Riprock ( 590115 ) on Friday January 31, 2003 @12:40PM (#5196810)

    People, people, people, realize that if there is a fire in your house that takes out your local copy of "The Sims Hot Date", then it is also going to burn up your serial number. Be sure when you send me your iso's that you include a text file with your serial numbers...for archival purposes.
  • I love the idea of this, although I would be more comfortable knowing that data was not only encrypted but that each file was multiplexed across multiple hosts. Even if the encryption was cracked, the cracker would not have the full picture. Does anyone know if this is the case?

    Security aside, I fear that we would see a similar situation to the one we encounter all too frequently on the P2P networks. Users set their download speed to the maximum possible, yet throttle back outgoing data to the absolute minimum, rendering them useless to others. I would hope that this won't happen, but I'm becoming cynical in my old age.

  • by Nintendork ( 411169 ) on Friday January 31, 2003 @12:44PM (#5196826) Homepage
    I'm the system (NT network) and network admin for a small startup telco (CLEC). We also have a lot of data from our data processing of payphone records for the midwest region and yet more data for the litigation we go through to fight SBC.

    I have about half a terabyte of sensitive, important data that needs to be backed up and stored securely offsite every day (This data is just the important stuff. No OS files, etc.) and archives of records stored on several CD-Rs that also need to be stored offsite. The only dependable(?) solution we can commit to is tape backup. We use an Exabyte EZ17 autoloader and Veritas Backup Exec.

    You guys wouldn't believe the nightmares I've gone through to get it running smoothly and keeping it there. 5 or so replaced EZ17s, 50 $80 tapes replaced, hours upon hours spent on the phone with Veritas because their software is buggy as hell and their open file option is a piece of shit written by another company (Veritas support was the one to tell me that!). My boss seems to think that we're the only ones that have issues with backups (He's the type that has no opinions. He KNOWS everything.), but I've talked with other administrators with a lot of servers and data using a plethora (Three Amigos vocabulary) of various backup products. We all agreed that backups are a pain in the ass.

  • Hivecache (Score:3, Interesting)

    by Glass of Water ( 537481 ) on Friday January 31, 2003 @12:52PM (#5196890) Journal
    This is similar to hivecache [hivecache.com]. I believe hivecache's in use in the wild. The difference is that hivecache seems to be specifically oriented to large enterprises.

    I think that people who worry about "putting their files on other people's machines" should go over the docs once more.

  • So what if your entire drive is backed up across a huge distributed network? And let's say Joe User had backed up cache files, etc. that contained personal info (credit numbers, child pr0n, etc). Joe User could become one screwed individual. It's a huge risk that the average user might be taking unknowingly...
  • Yes, I know, fire, flood etc. are the common reasons for not keeping the backups at the same location. But have you considered this one [the5thwave.com] ?

    You never know what can enter your server room =)
  • This sounds good, but it's not exactly original. I'm doing this right now (DFS and file replication) between our servers to replace our offsite backup service. And, I can tell you firsthand that it's as easy as 1-2-3 in Windows Server 2003 (no more ".Net" in the name).

    I will probably get modded as a Troll, but I have to honestly say that it has never been easy to accomplish this in Linux or even in Windows 2000. I hope Linux better supports this in the future -- it simply lost a place on five of our servers because of the pitiful support for DFS or DFS-like file replication. And I'm not talking about some custom server solution package, IT people should be able to add it easily to an existing server.
  • ..this is exactly one of the tenets of a good personal survival/preparedness plan. You exchange with a friend or relative in another geographical area a set of "basics". Basics as in long term stored food, extra clothing, various gear, copies of important legal documents, etc, etc. Whatever you consider to be important, and that is a personal variable. Then in case one of the two homes is destroyed in some manner, or you are forced to evacuate, you still have something to start over with and live on rather than losing ALL your day to day tangible wealth.

    Makes sense to do it with data as well. On a personal level with computing, it could be as simple as snail mailing burned cd's to each other, along with sending it over the net, but you can't beat that snail mail price and effectiveness for mass quantities, especially if all you have is dialup speed access. The important part is it should be "more" than just one building over, it really needs to be at least in another city as a minimum distance.
  • Any reason why this would be better than one of the following using cron:

    • Create an NFS connection between PC's and the backup host. Directly tar or copy files to the host via a simple backup script (same as a tape script, but pointing to a file on the host)
    • Tar files, then securecopy (SCP) them to a remote host - or even do so directly
    • You could even (in a pinch) use samba (smbd, smbclient) to connect two PC's, and run a backup script
    Just wondering... I'm actually looking at implementing some of these so it would be nice to know why this project is better.
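
    For reference, the tar-then-scp option above could look roughly like the sketch below. It is only an illustration: the function names are made up, and it assumes an `scp` binary on the PATH with key-based login already set up so cron can run it unattended.

```python
import subprocess
import tarfile
import time
from pathlib import Path


def make_archive(source_dir, dest_dir):
    """Create a timestamped .tar.gz of source_dir inside dest_dir."""
    source = Path(source_dir).resolve()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = Path(dest_dir) / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name
        tar.add(source, arcname=source.name)
    return archive


def scp_command(archive, remote):
    """Build the scp invocation; remote looks like 'user@host:/backups/'."""
    return ["scp", "-q", str(archive), remote]


def backup(source_dir, dest_dir, remote):
    """Tar the directory locally, then copy the archive to the remote host."""
    archive = make_archive(source_dir, dest_dir)
    # check=True raises CalledProcessError if scp exits non-zero
    subprocess.run(scp_command(archive, remote), check=True)
```

    A cron entry would then call something like `backup("/home/me/projects", "/tmp", "backup@example.org:/srv/backups/")` nightly (paths and host are placeholders). Note this gives you one unencrypted copy on one host you trust, which is the main thing DIBS is trying to improve on.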
  • by someguyintoronto ( 415253 ) on Friday January 31, 2003 @01:08PM (#5197006)
    Seriously, what would be the legal ramifications if illegal data was stored on someone else's computer?

    Would this backup system be an easy way to hide illegal content?

    What if the RIAA went after someone for keeping a bunch of legal MP3s?

    Too many cans... Too many worms...
  • by almaw ( 444279 ) on Friday January 31, 2003 @01:18PM (#5197089) Homepage
    Reasons why this is a truly impressively bad idea:
    • Poor availability: If you're storing it on home-type machines, typical availability is probably <50%. Assuming no hardware failure, if you store your data across four machines, you have a 6.25% chance that all four machines will be down at once and you can't get the data back when you want it.
    • Slow networks cause slow backup retrieval.
    • Most people want to back up all their data, as sifting through it to find the bits you do or don't want to backup is difficult. Now, once you've performed the initial backup, you can do incremental backups, which cuts bandwidth requirements, but you still have to initially transfer up to multiple gigabytes over a slow internet connection.
    • If a peer drops off the network, you must transfer all the data across to a new machine to maintain the same level of availability.
    • If it's properly distributed, you can place no guarantees on the quality of service (i.e. the speed/reliability). Peers can go away and never come back without warning. Data would have to be massively replicated (1000 to 1 or more) for it to be considered vaguely secure. If there is implied trust between peers (i.e. two people know each other and authorise the data movement), this problem is mitigated.
    • Massively prone to poor cryptography. If you use very strong cryptography, the system becomes very slow. You really need physical data separation for this.
    • Requires an internet connection. Won't work from behind firewalls, etc. This is pretty obvious, but is still a factor.
    • Bugs are difficult to fix, as you have to maintain backwards compatibility between versions. Hardware solutions (or simple software ones like mirroring) aren't so prone to bugs. Because this is a complex software solution, there are bound to be bugs. Anything that can go wrong will. :)
    • Due to the lower reliability of this system per node compared to say a RAID array, it's more expensive per megabyte. Note that it *has* to be lower - you're comparing the reliability of a HDD/tape in a normal backup scenario to a HDD+network+supporting computers.
    • Prolly lots of other stuff I've missed that other people have covered.
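
    The 6.25% figure in the availability point is just 0.5 to the fourth power. A quick sketch of that arithmetic, assuming (and this is an assumption, not something the list claims) that peers go up and down independently of each other and that each holds a full copy:

```python
def prob_all_down(availability, copies):
    """Chance that every one of `copies` independent peers is offline at once."""
    return (1.0 - availability) ** copies


def copies_needed(availability, target_unavailable):
    """Smallest number of full copies so that P(all down) <= target_unavailable."""
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    copies = 1
    while prob_all_down(availability, copies) > target_unavailable:
        copies += 1
    return copies


print(prob_all_down(0.5, 4))      # 0.0625, i.e. the 6.25% above
print(copies_needed(0.5, 0.001))  # 10
```

    Under that independence assumption, ten copies on 50%-available machines already push the "everything is down" case below 0.1%. Correlated outages (everyone switching off at night, say) would make the real number worse.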
  • Sure it wouldn't save me if my house burned down, but I'd like to find a tool that would do this easily and efficiently between machines in my house, keeping track of the free space available on each machine and deciding where to put the backup copies for me.

    I have plenty of storage to keep two copies of everything that matters, but it won't all fit in one place and it's a pain to try to figure out where I can back everything up, and to rearrange it when disk space gets too low on one machine. I'm imagining a program that would run on each machine, watching the space available and the list of "local" files that have been designated as important enough to back up. Each machine could then "negotiate" with the others to make sure that everything exists on at least two hard drives, and could notify me via e-mail that I need to buy more disk whenever there's not enough room for all of the backups. The database showing what files are where would need to be on all of the machines.
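
    The "negotiation" part could start as simple as a greedy placement pass like the sketch below. It is a toy under stated assumptions: the function name and data layout are invented, sizes are in arbitrary units, and a real tool would also need the e-mail notification and a replicated database.

```python
def place_backups(files, free_space, extra_copies=1):
    """Pick machines to hold backup copies of each file.

    files:      {name: (home_machine, size)}
    free_space: {machine: free units}
    Returns (placement, unplaced), where placement maps each file name to the
    machines chosen for its extra copies and unplaced lists files that did
    not fit anywhere (time to buy more disk).
    """
    free = dict(free_space)  # work on a copy so the caller's dict is untouched
    placement = {}
    unplaced = []
    # Largest files first, so big items are not squeezed out by small ones.
    for name, (home, size) in sorted(files.items(), key=lambda kv: -kv[1][1]):
        candidates = sorted(
            (m for m in free if m != home and free[m] >= size),
            key=lambda m: free[m],
            reverse=True,
        )
        if len(candidates) < extra_copies:
            unplaced.append(name)
            continue
        chosen = candidates[:extra_copies]
        for machine in chosen:
            free[machine] -= size
        placement[name] = chosen
    return placement, unplaced
```

    With `extra_copies=1` every placed file ends up on two drives: its home machine plus whichever other machine had the most room.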

    Of course, this wouldn't eliminate the need for *real* backups of the important stuff (e.g. finances), but that stuff tends to be small enough that I can burn it on a CD and put it in my safe deposit box. I have plenty of other stuff that is too big for CD, not quite important enough for off-site storage, but would be a real pain to lose just because a drive went down. For example, I recently thought I might have lost my MP3/Ogg collection, and it took me a long time to rip and encode that 25GB of music. As it turned out, the music was on a partition on the second HDD on my fileserver, not the first HDD, which was toast.

    It seems like this might be of significant use for small offices as well.

    Does anything like this exist?

  • After a few months of working on my thesis, I started to think [I know, I should have started to think before...]

    "What would happen if the building and computer burned down? My thesis on the hard drive would be lost and with it months of work! I would have to do this same damn thing all over again!"
    So I .tar.Z'd it up and ftp'd it (over a 56k line, typical of the internet back then) to a computer 400 miles away, just in case.

    It relieved a little of the anxiety. [OTOH, if any of your data causes you that much worry, a redundant backup will still not reduce your anxiety to zero.]

  • DIBS is a great idea, but it seems to me that a simpler solution would be to just cook up some shell / perl scripts that use gpg and rsync.

    However, if DIBS could imitate a network version of something like RAID striping, so that you could recover entire files from various portions stored on multiple hosts, and thereby increase the probability of getting all of your files back whenever you wanted them regardless of who happens to be online / accessible at the time - that would be cool! Although it seems to me that such a situation would require several times more disk space on the part of other computers, in order to store redundant copies, than the files require themselves - maybe such a system would require that you "donate" to the network 3 times more disk space than you want to use.
  • by emin ( 149044 ) on Friday January 31, 2003 @02:58PM (#5197906) Homepage
    A lot of people have pointed out issues related to security, bandwidth, efficiency, etc. My vision is that DIBS will be designed to take these things into account.

    For example, DIBS uses GPG to encrypt and sign all communications so that peers can't read the data they are storing for you and so that other people can't pretend to be you and store their files with your peers.

    Also, my vision is to include state-of-the-art erasure correction codes so DIBS uses redundancy efficiently. (Erasure correction codes are a generalization of the parity checks used by RAID.) In fact, I have already written a Python implementation of Reed-Solomon codes, available at www.csua.berkeley.edu/~emin/source_code/py_ecc. I haven't had time to put this into DIBS yet since I'm currently working on my PhD at MIT and that keeps me pretty busy.
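
    To make the parity idea concrete, here is the simplest possible erasure code, a single XOR parity block of the RAID kind. This is an illustrative sketch only, not the py_ecc API and not what DIBS does today; the function names are invented. It survives the loss of any one share, whereas Reed-Solomon codes generalize this to survive several.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(out):
            raise ValueError("all blocks must have the same length")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(data_blocks):
    """Turn k data blocks into k + 1 shares by appending one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(shares):
    """Rebuild the one missing share (marked None) from the surviving ones."""
    missing = [i for i, share in enumerate(shares) if share is None]
    if len(missing) != 1:
        raise ValueError("single parity can repair exactly one lost share")
    present = [share for share in shares if share is not None]
    repaired = list(shares)
    # XOR of all surviving shares equals the lost one, data or parity alike.
    repaired[missing[0]] = xor_blocks(present)
    return repaired
```

    Sending each of the k + 1 shares to a different peer costs only 1/k extra storage, compared with 100% extra for keeping one full second copy.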

    Incremental backup is another feature I'm planning to add. There are some issues with how incremental backup interacts with encryption and erasure correction. I think resolving these issues may take a little more thought so I might have to wait until I graduate, become a professor and get some grad students of my own to help me.

    A Slashdot post isn't the place to go into all the arguments for or against DIBS. However, I think distributed backup is a viable idea. While there are some serious issues, I believe that through clever engineering, we can solve them and create a cheap, simple, efficient, and secure backup system usable by anyone with a network connection.

    I decided to start writing a distributed backup prototype like DIBS in order to find out what the major issues are and how to address them. Sure, currently DIBS has some flaws, but it is a prototype written by a grad student. With more feedback from the community and some more development effort I believe DIBS can become a valuable tool. If you agree, I invite you to join the development effort, or try it out and tell me how you think it could be improved, or even take whatever parts you find useful and make something better. The project page is at sourceforge.

  • Already been done (Score:3, Informative)

    by FJ ( 18034 ) on Friday January 31, 2003 @03:06PM (#5197969)
    A similar product, Bacula [bacula.org], performs a similar function.
  • by rice_burners_suck ( 243660 ) on Friday January 31, 2003 @11:05PM (#5201536)
    I live in Indiana. My mother lives in Georgia. My father lives in Arizona. My grandmother lives in Quebec. My aunt lives in Brazil. My brother lives in France. I have put together a datacenter in a closet in each of their houses. Each datacenter consists of two OpenBSD boxes serving as a multihost firewall and six FreeBSD boxes running the services I require. All of my data is mirrored daily to all of these centers. Most of my files are managed with CVS, too. Thus, I am confident that even in a disaster of biblical proportions, such as my toilet overflowing and damaging the hard drive, my data will be safe.
