Sun Microsystems Software

ZFS Gets Built-In Deduplication

elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
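As a rough illustration of the block-level approach described above (a minimal Python sketch, not ZFS's actual implementation; the fixed block size and class names are assumptions for clarity), a dedup table can be thought of as a map from each block's SHA256 digest to a single stored copy:

    import hashlib

    BLOCK_SIZE = 128 * 1024  # illustrative fixed size; real ZFS blocks vary

    class DedupStore:
        """Toy block-level dedup: one stored copy per unique SHA256 digest."""

        def __init__(self):
            self.blocks = {}    # digest -> block bytes, stored exactly once
            self.refcount = {}  # digest -> number of references to that block

        def write(self, data: bytes) -> list:
            """Split data into blocks, store each unique block once,
            and return the list of digests describing the data."""
            digests = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                d = hashlib.sha256(block).hexdigest()
                if d not in self.blocks:       # first time we see this block
                    self.blocks[d] = block
                self.refcount[d] = self.refcount.get(d, 0) + 1
                digests.append(d)
            return digests

        def read(self, digests) -> bytes:
            return b"".join(self.blocks[d] for d in digests)

Writing the same VM image twice through something like this adds references but almost no new blocks, which is where the space savings for virtual machine images come from.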
  • Re:Hash Collisions (Score:4, Informative)

    by CMonk ( 20789 ) on Monday November 02, 2009 @07:32PM (#29956748)

    That is covered very clearly in the blog article referenced from the Register article. http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup [sun.com]

  • by iMaple ( 769378 ) * on Monday November 02, 2009 @07:38PM (#29956802)

    Windows Storage Server 2003 (yes, yes, I know it's from Microsoft) shipped with this feature, which is called Single Instance Storage:
    http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a [technet.com]

  • Re:Hash Collisions (Score:3, Informative)

    by Rising Ape ( 1620461 ) on Monday November 02, 2009 @07:44PM (#29956856)

    The probability of a hash collision for a 256 bit hash (or even a 128 bit one) is negligible.

    How negligible? Well, the probability of a collision is never more than N^2 / 2^h, where N is the number of blocks stored and h is the number of bits in the hash. So, if we have 2^64 blocks stored (a mere billion terabytes or so for 128-byte blocks), the probability of a collision is less than 2^(-128), or about 3×10^(-39). Hardly worth worrying about.

    And that's an upper limit, not the actual value.
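    Spelled out as a quick back-of-the-envelope check (illustrative Python only, using the figures above):

        # Birthday-style upper bound on the chance of any collision among
        # N stored blocks with an h-bit hash: p <= N^2 / 2^h.
        def collision_bound(num_blocks, hash_bits):
            return num_blocks ** 2 / 2 ** hash_bits

        print(collision_bound(2 ** 64, 256))   # ~2.9e-39 for SHA256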

  • Re:Hash Collisions (Score:5, Informative)

    by shutdown -p now ( 807394 ) on Monday November 02, 2009 @07:55PM (#29956960) Journal

    Before I left Acronis, I was the lead developer and designer for deduplication in Acronis Backup & Recovery 10 [acronis.com]. We also used SHA256 there, and naturally the possibility of a hash collision was investigated. After we did the math, it turned out that you're about 10^6 times more likely to lose data because of hardware failure (even considering RAID) than you are to lose it because of a hash collision.

  • by hapalibashi ( 1104507 ) on Monday November 02, 2009 @07:56PM (#29956982)
    Yes, Venti. I believe it originated in Plan9 from Bell Labs.
  • by HockeyPuck ( 141947 ) on Monday November 02, 2009 @07:59PM (#29957016)

    The advantages of SANs are easy to realize, and they aren't limited to a FibreChannel-vs-NAS (NFS/CIFS) choice: a SAN could be iSCSI, FCoE, FCIP, FICON, etc.

    -Storage consolidation compared with internal disk.
    -Fewer components in your servers that can break.
    -Server admins don't have to focus on storage except at the VolMgr/filesystem level.
    -Higher utilization (a web server might not need 500GB of internal disk).
    -Offloading storage-based functions (RAID in the array vs RAID on your server's CPU; I'd rather the CPU perform application work than calculate parity, rebuild failed disks, etc.). The benefit increases when you want to replicate to a DR site.

    This is not a ZFS vs SANs argument. I think ZFS running on SAN based storage is a great idea as ZFS replaces/combines two applications that are already on the host (volmgr & filesystem).

  • by Anonymous Coward on Monday November 02, 2009 @08:06PM (#29957072)

    How about this: you can't remove a top-level vdev without destroying your storage pool. That means that if you accidentally use the "zpool add" command instead of "zpool attach" to add a new disk to a mirror, you are in a world of hurt.

    How about this: after years of ZFS being around, you still can't add or remove disks from a RAID-Z.

    How about this: If you have a mirror between two devices of different sizes, and you remove the smaller one, you won't be able to add it back. The vdev will autoexpand to fill the larger disk, even if no data is actually written, and the disk that was just a moment ago part of the mirror is now "too small".

    How about this: the whole system was designed with the implicit assumption that your storage needs would only ever grow, with the result that in nearly all cases it's impossible to ever scale a ZFS pool down.

  • by Methlin ( 604355 ) on Monday November 02, 2009 @08:27PM (#29957380)
    Mod parent up. These are all legit deficiencies in ZFS that really need to be fixed at some point. Currently the only solution to these is to build a new storage pool, either on the same system or a different one, and export/import; that's a big PITA and potentially expensive. Off the top of my head I can't think of anyone that lets you do #2 except enterprise storage solutions and Drobo.
  • by ArsonSmith ( 13997 ) on Monday November 02, 2009 @08:28PM (#29957392) Journal

    No, you'd still have it stored at roughly the size of one file plus 100 blocks. You'd need a substantially large number of random changes across all 100 files to balloon up from 1x the file size to 100x the file size.
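    As a back-of-the-envelope illustration (the 1 GB file size and 128KB block size below are made-up figures, purely for illustration):

        # Space used by 100 deduped copies of one file, each copy with a
        # single modified block, versus 100 full copies (illustrative only).
        GB = 1024 ** 3
        file_size = 1 * GB
        block_size = 128 * 1024
        copies = 100
        changed_blocks_per_copy = 1

        deduped = file_size + copies * changed_blocks_per_copy * block_size
        print(deduped / GB)   # ~1.01 GB, versus 100 GB with no dedup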

  • by buchner.johannes ( 1139593 ) on Monday November 02, 2009 @08:33PM (#29957456) Homepage Journal

    From that link: it is file-based and a service indexes it (whereas in ZFS it is block-based and on-the-fly). And they first introduced it in Windows 2000 Server. Amazing. I'm sure it is an ugly hack, since Windows has no soft/hard links IIRC.

  • by hedwards ( 940851 ) on Monday November 02, 2009 @08:40PM (#29957554)
    ZFS is a copy-on-write filesystem; it already creates a temporary second copy so that the file system is always consistent, if not quite up to date. I'd venture to guess that the new version of the file, not being identical to the old file, would just be treated like a copy to a new name.
  • by Tynin ( 634655 ) on Monday November 02, 2009 @08:49PM (#29957714)
    Not sure when you tried building it, but I build cheap computers for friends / family, at least 2 or 3 computers a year. Almost a decade ago... maybe really only 8 years ago, all cheapo generic cases stopped having razor sharp edges. I used to get cuts all the time, but cheap cases, at least in the realm of having sharp edges, haven't been an issue in a long time. (I purchase all my cheapo cases from newegg these days)
  • by Trepidity ( 597 ) <[gro.hsikcah] [ta] [todhsals-muiriled]> on Monday November 02, 2009 @08:50PM (#29957720)

    If you're running a normal desktop or laptop, this isn't likely to be of great use in any case. There's non-negligible overhead in doing the deduplication process, and drive space at consumer-level sizes is dirt cheap, so it's only really worth doing this if you have a lot of block-level duplicate data. That might be the case if, e.g., you have 30 VMs on the same machine, each with a separate install of the same OS, but it is unlikely to be the case on a normal Mac laptop.

  • by Anonymous Coward on Monday November 02, 2009 @08:57PM (#29957810)

    I say unreliably, because years ago we had a Novell server that used an automated compression scheme. Eventually, the drive got full anyway, and we had to migrate to a larger disk.

    But since the copy operation de-compressed files on the fly, we couldn't copy them: any attempt to reference several large compressed files instantly consumed all remaining space on the drive. What ensued was a nightmare of copying and deleting files, beginning with the smallest and working our way up to the largest. It took over a day of manual effort before we freed up enough space to mass-move the remaining files.

    This is because you didn't use NetWare's tools to copy the files - the command-line NCOPY, for example, with /R and /RU (available when file compression was introduced with NetWare 4), would have copied the files in their compressed format, avoiding this (Link: http://support.novell.com/techcenter/articles/ana19940603.html [novell.com]). Using the Novell Client for Windows, I'd imagine that its Explorer shell integration would give you GUI tools too, though I no longer have a NetWare server to verify this, and always preferred the command line anyway :).

    No offense, but the scenario you describe is the result of ignorance, not poor design.

  • by jpmorgan ( 517966 ) on Monday November 02, 2009 @09:10PM (#29957986) Homepage
    You recall wrong. NTFS has long supported both hard links and a mechanism called 'reparse points,' which are much more powerful than simple symlinks.
  • by bertok ( 226922 ) on Monday November 02, 2009 @09:55PM (#29958518)

    Windows Storage Server 2003 (yes, yes, I know it's from Microsoft) shipped with this feature, which is called Single Instance Storage:
    http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a [technet.com]

    It's not even close to the same thing.

    We investigated this a while back, and it is basically a dirty, filthy hack on top of vanilla NTFS.

    First of all, it doesn't compare blocks or byte ranges, but entire files only. If two files are 99% identical, they are treated as different, and SIS won't merge them.

    Second, it uses a reparse point to merge the files, which has significant overhead - at least 4KB for each file, if I remember correctly. That is, SIS won't save you any disk space for small files, which are actually quite common on file servers. The overhead erases much of the benefit even for larger files, to the point that SIS will skip files smaller than 32KB by default.

    Third, it operates in the background, after files have been written. This means that files have to be written out in their entirety, read back in, compared byte-for-byte to another file, and then erased later. This is incredibly inefficient. On large file servers, the disk was thrashed like crazy.

    Lastly, we found that the copy-on-write mechanism immediately copied out the entire file if it was changed even slightly. For small files, this is not noticeable, but for large files it can be a massive performance hog. A 4KB write can potentially be translated into a multi-GB copy!

    Proper single-instancing systems use in-memory hash tables that are often partitioned using "file similarity" heuristics to prevent cache thrashing. Even more advanced systems can maintain single-instancing during replication and backups, reducing bandwidth requirements enormously. Take a look at the features of the Data Domain [datadomain.com] filers for an idea of what the current state of the art is.
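    For contrast with block-level dedup, whole-file single instancing is conceptually just "hash the whole file, keep one copy per digest". Here is a minimal, hypothetical Python sketch using hard links (this is not how SIS is implemented - SIS uses reparse points, as described above):

        import hashlib, os

        def file_digest(path):
            """SHA256 of an entire file, read in 1 MiB chunks."""
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            return h.hexdigest()

        def single_instance(paths):
            """Replace byte-identical files with hard links to one copy.
            Note the file-level limitation: files that differ by a single
            byte share nothing at all."""
            seen = {}
            for p in paths:
                d = file_digest(p)
                if d in seen:
                    os.remove(p)
                    os.link(seen[d], p)   # duplicate now points at survivor
                else:
                    seen[d] = p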

  • by KonoWatakushi ( 910213 ) on Monday November 02, 2009 @10:03PM (#29958598)

    How alarmist and uninformed; borderline FUD. The reality is as follows...

    First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now. Same with crypto.

    Second, mistakenly typing add instead of attach will result in a warning that the specified redundancy is different, and refuse to add it.

    Third, yes, you can't expand the width of a RAID-Z. You can still grow it, though, by replacing its drives with larger ones. Once the block pointer rewrite work is merged, removal will be possible, and expansion won't be far off either.

    Fourth, vdevs no longer autoexpand by default. If you want that behavior, you can set the autoexpand property to on.

    Last, there was no such assumption; it is simply a matter of priorities. If it were an easier problem, it would have been done long ago, but I'm happy to be patient, knowing that it will be done right. Most everyone who has seriously used ZFS will understand that the advantages hugely outweigh these minor nits, which are easily worked around.

  • by greg1104 ( 461138 ) <gsmith@gregsmith.com> on Monday November 02, 2009 @10:55PM (#29959144) Homepage

    How alarmist and uninformed; borderline FUD. The reality is as follows...

    First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.

    The bug report for this problem goes back to at least April of 2003 [opensolaris.org]. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.

  • by Anonymous Coward on Monday November 02, 2009 @11:47PM (#29959492)

    How alarmist and uninformed; borderline FUD. The reality is as follows...

    First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.

    The bug report for this problem goes back to at least April of 2003 [opensolaris.org]. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.

    This is called "block pointer (bp) rewrite" in ZFS parlance. It was talked about at SNIA 2009 (p. 18):

    http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf

    As well as Kernel Conference Australia 2009 (~40:00):

    http://blogs.sun.com/video/entry/kernel_conference_australia_2009_jeff

    Jeff Bonwick and Bill Moore said that it'd be committed by the end of this year (along with dedupe, which is done, and crypto). They're cutting it a bit close, but I think the issues the GP mentioned will not be a problem Real Soon Now.

  • by paulhar ( 652995 ) on Tuesday November 03, 2009 @03:28AM (#29960870)

    TCP overhead at 1GbE for a modern processor is negligible - you're only talking about processing 120MB/sec or so.

    Here is a document including a pretty graph: http://media.netapp.com/documents/tr-3628.pdf [netapp.com]

    "...enabling the TCP Offload Engine (TOE) on the Linux hosts did not noticeably affect performance on the IBM blade side."

  • by odie_q ( 130040 ) on Tuesday November 03, 2009 @06:57AM (#29961708)

    The trick isn't using /dev/zero; the trick is using the seek parameter. The dd command seeks nearly 8 GiB into a newly created file and writes something there. This creates a file that is 8 GiB large, but with no data (not zeros, just nothing at all) in the first 8191 MiB. The system doesn't actually write anything in that skipped region, and doesn't even allocate storage for it. If you read from those blocks, you will get generated zeros. This is called a sparse file.
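    A quick way to see this in action on a Unix-like system whose filesystem supports sparse files (the file name below is just a placeholder; block counts vary by filesystem):

        import os

        # Create a sparse file: seek ~8 GiB in, then write 1 MiB of data.
        with open("sparse.img", "wb") as f:
            f.seek(8191 * 1024 * 1024)       # skip the first 8191 MiB
            f.write(b"\x00" * (1024 * 1024))

        st = os.stat("sparse.img")
        print(st.st_size)                # logical size: 8 GiB
        print(st.st_blocks * 512)        # actual allocation: roughly 1 MiB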
