ZFS Gets Built-In Deduplication

elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
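As a rough, purely illustrative sketch of the block-level approach described above (in Python, with an assumed 128 KB block size and a toy DedupStore class, not anything resembling ZFS's actual on-disk structures), each block is hashed with SHA-256 and stored only the first time that digest is seen:

    import hashlib

    BLOCK_SIZE = 128 * 1024  # illustrative block size, not ZFS's recordsize

    class DedupStore:
        """Toy block store: identical blocks are kept only once."""
        def __init__(self):
            self.blocks = {}   # SHA-256 digest -> stored block data
            self.files = {}    # file name -> list of block digests

        def write(self, name, data):
            digests = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                digest = hashlib.sha256(block).digest()
                self.blocks.setdefault(digest, block)  # duplicates map to one stored copy
                digests.append(digest)
            self.files[name] = digests

        def read(self, name):
            return b"".join(self.blocks[d] for d in self.files[name])

    store = DedupStore()
    store.write("a.img", b"x" * BLOCK_SIZE * 3)
    store.write("b.img", b"x" * BLOCK_SIZE * 3)  # identical content
    print(len(store.blocks))                     # 1: only one unique block stored

Two files of three identical blocks each end up occupying a single stored block, which is the whole point of the feature.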
  • Hash Collisions (Score:2, Interesting)

    by UltimApe ( 991552 ) on Monday November 02, 2009 @07:29PM (#29956720)

    Surely with the high volumes of data that ZFS is supposed to be able to handle, a hash collision may occur? A block is certainly larger than 256 bits. Do they just expect this never to happen?

    Although I suppose they could just be using the hash as a way to narrow down candidates for deduplication, doing a final bit-for-bit check before deciding the data is the same.
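    That "narrow down candidates, then compare" idea can be sketched in a few lines. The helper below is purely illustrative (it is not ZFS code): the SHA-256 digest only nominates a candidate block, and an optional byte-for-byte comparison decides whether it really is a duplicate.

    import hashlib

    def store_block(blocks, block, verify=True):
        """Return the digest under which `block` ends up stored in `blocks`."""
        digest = hashlib.sha256(block).digest()
        existing = blocks.get(digest)
        if existing is None:
            blocks[digest] = block        # first copy of this content
        elif verify and existing != block:
            # A genuine hash collision: refuse to silently alias different
            # data (a real implementation would store the block separately).
            raise RuntimeError("SHA-256 collision detected")
        return digest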

  • by icebike ( 68054 ) on Monday November 02, 2009 @07:38PM (#29956796)

    Imagine the amount of stuff you could (unreliably) store on a hard disk if massive de-duplication were built into the drive electronics. It could even do this quietly in the background.

    I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.

    But since the copy operation decompressed files on the fly, we couldn't simply copy everything: any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copying and deleting files, beginning with the smallest and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.

    De-duplication is pretty much the same thing: compression by recording and eliminating duplicates. But any minor automated update of some files runs the risk of changing them such that what was a duplicate must now be stored separately.

    This could trigger a similar situation where there was suddenly not enough room to store the same amount of data that was already on the device. (For some values of "suddenly" and "already").

    For archival data or OS components (executables, source code, etc.) which virtually never change, this would be great.

    But there is hell to pay somewhere down the road.

  • Re:Hash Collisions (Score:3, Interesting)

    by pclminion ( 145572 ) on Monday November 02, 2009 @07:47PM (#29956894)
    Oops. I didn't mean 10^-18 per-block, I meant 10^-18 for the entire filesystem. (Obviously it doesn't make sense the other way)
  • by ZerdZerd ( 1250080 ) on Monday November 02, 2009 @08:06PM (#29957078)

    I hope btrfs will get it. Or else you will have to add it :)

  • by icebike ( 68054 ) on Monday November 02, 2009 @08:15PM (#29957208)

    Bad design on Novell's part, but the problem persists in the de-duplicated world, where de-duplicating to memory only is not a solution.

    Imagine a hundred very large files containing largely the same content. Now imagine CHANGING just a few characters in each file via some automated process. Suddenly 100 files which were actually stored as ONE file balloon to 100 large files.

    On a drive that was already full, changing just a few characters (not adding any total content) could cause a disk full error.

    You really can't fake what you don't have. Either you have enough disk to store all of your data, or you run the risk of hindsight telling you it was a really bad design.

  • What I'm wondering about in all of this is what happens when you edit one of the files. Does it "reduplicate" them? And if so, isn't that inefficient in terms of the time needed to update a large file, in that it would need to recopy the file to another section of the disk in order to maintain the fact that there are two now-different copies?

  • by buchner.johannes ( 1139593 ) on Monday November 02, 2009 @08:39PM (#29957536) Homepage Journal

    I'm sure btrfs -- once fully implemented and tested -- will also have problems reaching the performance of reiser4.

  • Re:Hash Collisions (Score:2, Interesting)

    by dotgain ( 630123 ) on Monday November 02, 2009 @08:39PM (#29957546) Homepage Journal
    Just before the instruction you posted, I found this explanation in TFA:

    An enormous amount of the world's commerce operates on this assumption, including your daily credit card transactions. However, if this makes you uneasy, that's OK: ZFS provides a 'verify' option that performs a full comparison of every incoming block with any alleged duplicate to ensure that they really are the same, and ZFS resolves the conflict if not. To enable this variant of dedup, just specify 'verify' instead of 'on':

    I fail to see how someone can sit down and rationally decide whether their data will be more susceptible to hash collisions or not. While I would be very surprised if any two blocks on my computer hashed to the same value despite being different, it seems to me that someone's going to get hit by this sooner rather than later. And what a nasty way to find hash collisions! Who would have thought my aunt's chocolate cake recipe had the same SHA-256 as hello.jpg from goatse.cx!

    On the one hand, 2^256 is a damn big keyspace. I've heard people say a collision is about as likely as winning every lottery in the world simultaneously, and then doing it again next week. But give enough computers enough blocks and enough time, and find a SHA-256 collision you will. Depending on what kind of data it happens to, you might not even notice it.
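    For a sense of scale, the standard birthday-bound estimate P ≈ n²/2^257 can be computed directly. The snippet below is only an illustration, under the idealized assumption that SHA-256 behaves like a uniform random 256-bit function; it says nothing about deliberate attacks.

    import math

    def collision_probability(num_blocks, hash_bits=256):
        """Birthday bound: P(any two blocks share a digest) ~ n^2 / 2^(bits+1)."""
        return 2.0 ** (2 * math.log2(num_blocks) - (hash_bits + 1))

    # Even 2^48 unique 128 KB blocks (roughly 32 exbibytes of data) gives a
    # collision probability on the order of 10^-49.
    for n in (2 ** 32, 2 ** 40, 2 ** 48):
        print(f"{n} blocks -> p ~ {collision_probability(n):.1e}")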

  • by SLi ( 132609 ) on Monday November 02, 2009 @09:05PM (#29957918)

    Mod parent up. These are all legit deficiencies in ZFS that really need to be fixed at some point.

    Only if it's worth the cost. In a case I know about, XFS lacks filesystem shrinking too, and it has been asked for many times. It has been estimated that it would take a skilled XFS engineer months to implement. If it's so important that someone is willing to put up that money (or effort), it may happen; otherwise it will not. I'm sure the same applies to ZFS.

  • Par for the course.. (Score:5, Interesting)

    by Junta ( 36770 ) on Monday November 02, 2009 @09:08PM (#29957962)

    Any filesystem implementing copy-on-write, data dedupe, and/or compression already carries the risk of exhausting oversubscribed storage when compression ratios or data uniqueness turn out worse than anticipated. It's the reason NetApp filers make you acknowledge explicitly, when you enable these features, that you accept the risk of exhausting allocations if you use them to advertise more storage capacity than you actually have.

    You don't even need a fancy filesystem to expose yourself to this today:
    $ dd if=/dev/zero of=bigfile bs=1M seek=8191 count=1
    1+0 records in
    1+0 records out
    1048576 bytes (1.0 MB) copied, 0.00426769 s, 246 MB/s
    $ ls -lh bigfile
      8.0G 2009-11-02 20:06 bigfile
    $ du -sh bigfile
    1.0M bigfile

    This possibility has been around a long time and the world hasn't melted. Essentially, if someone is using these features, they should be well aware of the risks incurred.

  • by Animats ( 122034 ) on Monday November 02, 2009 @09:44PM (#29958384) Homepage

    I'd argue that file systems should know about and support three types of files:

    • Unit files. Unit files are written once, and change only by being replaced. Most common files are unit files. Program executables, HTML files, etc. are unit files. The file system should guarantee that if you open a unit file, you will always read a consistent version; it will never change underneath a read. Unit files are replaced by opening for write, writing a new version, and closing; upon close, the new version replaces the old. In the event of a system crash during writing, the old version of the file remains. If the writing program crashes before an explicit close, the old file remains. Unit files are good candidates for deduplication via hashing. While the file is open for writing, attempts to open for reading open the old version. This should be the default mode. (This would be a big convenience; you always read a good version. Good programs try to fake this by writing a new file, then renaming it to replace the old file, but most operating systems and file systems don't support atomic multiple rename, so there's a window of vulnerability. The file system should give you that for free; a sketch of the write-then-rename workaround follows this comment.)
    • Log files. Log files can only be appended to. UNIX supports this, with an open mode of O_APPEND. But it doesn't enforce it (you can still seek), and NFS doesn't implement it properly. Nor does Windows. Programs opening a log file for reading should be guaranteed that they will always read exactly out to the last write. In the event of a system crash during writing, log files may be truncated, but must be truncated at an exact write boundary; trailing off into junk is unacceptable. Deduplication via hashing probably isn't worth the trouble.
    • Managed files. Managed files are random-access files managed by a database or archive program. Random access is supported. The use of open modes O_SYNC, O_EXCL, or O_DIRECT during file creation indicates a managed file. Seeks while open for write are permitted, multiple opens access the same file, and O_SYNC and O_EXCL must work as documented. Deduplication via hashing probably isn't worth the trouble and is bad for database integrity.

    That's a useful way to look at files. Almost all files are "unit" files; they're written once and are never changed; they're only replaced. A relatively small number of programs and libraries use "managed" files, and they're mostly databases of one kind or another. Those are the programs that have to manage files very carefully, and those programs are usually written to be aware of concurrency and caching issues.

    Unix and Linux have the right modes defined. File systems just need to use them properly.
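    As referenced in the unit-file bullet above, here is a minimal user-space sketch of replace-on-close using a temporary file and an atomic rename. The helper name and the fsync policy are assumptions for illustration, not an existing filesystem API; os.replace() gives readers either the old or the new version, never a mix, as long as the temporary file lives on the same filesystem.

    import os
    import tempfile

    def replace_unit_file(path, data):
        """Atomically replace `path` with `data` (write a temp file, then rename)."""
        dirname = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dirname)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())   # push the new bytes to stable storage
            os.replace(tmp, path)      # atomic swap: readers see old or new, never a mix
        except BaseException:
            os.unlink(tmp)
            raise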

  • by binaryspiral ( 784263 ) on Monday November 02, 2009 @10:27PM (#29958892)

    Microsoft's SIS is a joke. A few folks have dedupe down to a science - Data Domain and NetApp.

    We virtualized our filers into an ESX 3.5 cluster and dropped the VMDK files onto a NetApp 3140, which deduped them to 18% of their original size. There was no performance impact; it was actually faster than our original servers and much more efficient.

    ROI - three months.

    Difficulty to implement dedup? A checkmark and the OK button.

  • BTRFS is better (Score:3, Interesting)

    by Theovon ( 109752 ) on Monday November 02, 2009 @11:18PM (#29959318)

    BTRFS started out as an also-ran, trying to duplicate a bunch of ZFS features for Linux (whose licensing isn't compatible with incorporating ZFS into the kernel). But then BTRFS took a number of things that were overly rigid about ZFS (shrinking volumes, block sizes, and some other stuff) and made them better, including totally unifying how data and metadata are stored. I'm sure there are a number of ways in which ZFS is still better (RAIDZ), but putting aside some of the enterprise features that most of us don't need, BTRFS is turning out to be more flexible, more expandable, more efficient, and better supported.

  • Re:well ... (Score:2, Interesting)

    by TrevorDoom ( 228025 ) on Monday November 02, 2009 @11:57PM (#29959562) Homepage

    My company used an X4500, and we discovered the bug that caused Sun to make the X4540: the Marvell SATA chipset in the X4500 had a serious firmware bug that was exacerbated by the Solaris x86 Marvell chipset driver.
    Under heavy small-block random IO intermingled with heavy sequential large-block IO, the box would kernel panic and hang; only a power cycle would reset it.

    Sun ended up refunding us the cost of the servers and providing us exceptionally large incentives to purchase Sun StorageTek storage.

    It wouldn't surprise me if the X4540 had similar issues, because they were rushing to replace the X4500 to try to minimize the possibility of bad PR over the X4500 being amazingly unstable.

    This is why I'll be waiting for FreeBSD to support this: it will probably have better SATA chipset drivers, and a lower chance of the system hanging because of the Solaris kernel drivers for the SATA chipset (never mind that it's a SATA chipset that Sun put onto its own board).

  • by jimicus ( 737525 ) on Tuesday November 03, 2009 @05:39AM (#29961394)

    If a hash were a replacement for data, that's all we'd need... goedelize the universe?

    Sometimes I just want to scream, or weep, or shoot everybody... or just drop to my knees and beg them to think - just a little tiny insignificant bit - think. Maybe it'll add up. Probably not, but it's the best I can do.

    Which is why ZFS allows you to specify using a proper file comparison rather than just a hash.

    It's unlikely you'll have a collision, considering it's a 256-bit hash, but as you allude to, that likelihood does go up somewhat when you're dealing with a filesystem which is designed to (and therefore presumably does) handle terabytes of information.

  • by sjames ( 1099 ) on Tuesday November 03, 2009 @10:34AM (#29963200) Homepage Journal

    The same people you call when your proprietary system breaks and you discover that the official tech support people can't find their posterior with both hands and a map. Most cities have a number of grief counselors ready to support you in your time of need. If it was really critical, try the suicide hotline.

  • Re:Does that mean... (Score:2, Interesting)

    by mr crypto ( 229724 ) on Tuesday November 03, 2009 @11:32AM (#29963854)

    De-dup also means some unexpected behavior. Want to copy a 5 GB file? Done in less than a second.

    Overwrite a section of a dup'ed file with new content? Suddenly you're using more disk space, or you could even get a "disk full" message even though you were just replacing data, not increasing its size in any obvious way.

    Trying to make space on a drive by deleting lots of big files that happen to be dup'ed? No effect.
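    A toy reference-counting model makes that last point concrete: deleting one of several deduplicated copies frees nothing, because every block is still referenced by the surviving copies. The class and block names below are purely illustrative, not how ZFS actually tracks its dedup table.

    from collections import Counter

    class RefCountedStore:
        def __init__(self):
            self.refs = Counter()   # block id -> reference count
            self.files = {}         # file name -> list of block ids

        def link(self, name, block_ids):
            self.files[name] = list(block_ids)
            self.refs.update(block_ids)

        def delete(self, name):
            for b in self.files.pop(name):
                self.refs[b] -= 1
                if self.refs[b] == 0:
                    del self.refs[b]   # only now does the block's space come back

        def blocks_in_use(self):
            return len(self.refs)

    store = RefCountedStore()
    store.link("a.vmdk", ["blk1", "blk2"])
    store.link("b.vmdk", ["blk1", "blk2"])   # a deduplicated copy of a.vmdk
    store.delete("b.vmdk")
    print(store.blocks_in_use())             # still 2: deleting the copy freed nothing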
