ZFS Gets Built-In Deduplication
elREG writes to mention that Sun's ZFS now has built-in deduplication utilizing a master hash function to map duplicate blocks of data to a single block instead of storing multiples. "File-level deduplication has the lowest processing overhead but is the least efficient method. Block-level dedupe requires more processing power, and is said to be good for virtual machine images. Byte-range dedupe uses the most processing power and is ideal for small pieces of data that may be replicated and are not block-aligned, such as e-mail attachments. Sun reckons such deduplication is best done at the application level since an app would know about the data. ZFS provides block-level deduplication, using SHA256 hashing, and it maps naturally to ZFS's 256-bit block checksums. The deduplication is done inline, with ZFS assuming it's running with a multi-threaded operating system and on a server with lots of processing power. A multi-core server, in other words."
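The block-level scheme the summary describes can be sketched in a few lines. This is a toy illustration in Python, not ZFS's actual on-disk logic: the 128 KB block size and the dict-as-storage are assumptions for the example. Each block is hashed with SHA-256, and a block is physically stored only if its hash hasn't been seen before.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # illustrative; real ZFS records are variable-sized

def dedup_store(data: bytes):
    """Store data as deduplicated blocks: {sha256 hex digest -> block}."""
    store = {}
    refs = []  # ordered list of digests that reconstructs the original data
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # duplicate blocks are stored once
        refs.append(digest)
    return store, refs

# Two "files" sharing most of their blocks
a = b"x" * BLOCK_SIZE * 4
b = b"x" * BLOCK_SIZE * 3 + b"y" * BLOCK_SIZE
store, refs = dedup_store(a + b)
# 8 logical blocks, but only 2 unique ones are physically stored
print(len(store))  # 2
```

The refs list is the analogue of block pointers: reading the "file" back is just concatenating the blocks the digests point at.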
Re:Hash Collisions (Score:4, Informative)
That is covered very clearly in the blog article referenced from the Register article. http://blogs.sun.com/bonwick/en_US/entry/zfs_dedup [sun.com]
Re:Any other file systems with that feature? (Score:5, Informative)
Windows Storage Server 2003 (yes, yes, I know it's from Microsoft) shipped with this feature, which is called Single Instance Storage.
http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a [technet.com]
Re:Hash Collisions (Score:3, Informative)
The probability of a hash collision for a 256 bit hash (or even a 128 bit one) is negligible.
How negligible? Well, the probability of a collision is never more than N^2 / 2^h, where N is the number of blocks stored and h is the number of bits in the hash. So, if we have 2^64 blocks stored (a mere billion terabytes or so for 128-byte blocks), the probability of a collision is less than 2^(-128), or under 10^(-38). Hardly worth worrying about.
And that's an upper limit, not the actual value.
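That bound is easy to check numerically. A quick sketch (Python, chosen here only because its integers are arbitrary-precision, so the huge exponents are exact):

```python
def collision_bound(n_blocks: int, hash_bits: int) -> float:
    """Birthday-style upper bound on collision probability: N^2 / 2^h."""
    return n_blocks ** 2 / 2 ** hash_bits

# 2^64 stored blocks, hashed with a 256-bit hash (e.g. SHA-256)
p = collision_bound(2 ** 64, 256)
print(p)  # 2^-128, roughly 2.9e-39
```

For comparison, uncorrectable-bit-error rates quoted for enterprise disks are on the order of 10^(-16) per bit read, so the hash is nowhere near the weakest link.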
Re:Hash Collisions (Score:5, Informative)
Before I left Acronis, I was the lead developer and designer for deduplication in Acronis Backup & Recovery 10 [acronis.com]. We also used SHA256 there, and naturally the possibility of a hash collision was investigated. After we did the math, it turned out that you're about 10^6 times more likely to lose data because of hardware failure (even considering RAID) than you are to lose it because of a hash collision.
Re:Any other file systems with that feature? (Score:2, Informative)
Re:More reason to be a ZFS fanboy (Score:4, Informative)
The advantages of SANs are easy to realize, and it need not be a FibreChannel-vs-NAS (NFS/CIFS) question, since a SAN could be iSCSI, FCoE, FCIP, FICON, etc.
-Storage Consolidation compared with internal disk.
-Fewer components in your servers that can break.
-Server admins don't have to focus on Storage except at the VolMgr/Filesystem level
-Higher Utilization (a WebServer might not need 500GB of internal disk).
-Offloading storage-based functions (RAID in the array vs. RAID on your server's CPU; I'd rather the CPU perform application work than calculate parity, rebuild failed disks, etc.). This benefit increases when you want to replicate to a DR site.
This is not a ZFS-vs-SAN argument. I think ZFS running on SAN-based storage is a great idea, as ZFS replaces/combines two applications that are already on the host (volume manager & filesystem).
Re:More reason to be a ZFS fanboy (Score:5, Informative)
How about this: you can't remove a top-level vdev without destroying your storage pool. That means that if you accidentally use the "zpool add" command instead of "zpool attach" to add a new disk to a mirror, you are in a world of hurt.
How about this: after years of ZFS being around, you still can't add or remove disks from a RAID-Z.
How about this: If you have a mirror between two devices of different sizes, and you remove the smaller one, you won't be able to add it back. The vdev will autoexpand to fill the larger disk, even if no data is actually written, and the disk that was just a moment ago part of the mirror is now "too small".
How about this: the whole system was designed with the implicit assumption that your storage needs would only ever grow, with the result that in nearly all cases it's impossible to ever scale a ZFS pool down.
Re:More reason to be a ZFS fanboy (Score:4, Informative)
Re:Wake me when they build it into the hard disk (Score:4, Informative)
No, you'd still have it stored at roughly the size of one file plus 100 changed blocks. You'd need a substantially large number of random changes across all 100 files to balloon up from 1x the file size to 100x.
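The space accounting above can be simulated. This is a toy model with a fixed block count and synthetic stand-in block contents (real ZFS records vary in size): 100 copies of a file, each with a handful of randomly changed blocks, deduplicate down to one file's worth of blocks plus the changes.

```python
import random

random.seed(42)
BLOCKS = 1000   # blocks per file (illustrative)
COPIES = 100
CHANGED = 5     # random blocks modified in each copy

base = [("orig", i) for i in range(BLOCKS)]  # stand-ins for block contents
unique = set(base)                           # what dedup actually stores
for c in range(COPIES):
    copy = list(base)
    for i in random.sample(range(BLOCKS), CHANGED):
        copy[i] = ("changed", c, i)  # new, unique block content
    unique.update(copy)

# Logical size: 100 files x 1000 blocks; physical: ~1 file + the changes
print(COPIES * BLOCKS, len(unique))  # 100000 1500
```

Stored blocks grow linearly with the number of changed blocks, not with the number of file copies, which is why the 100x blow-up needs changes in nearly every block.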
Re:Any other file systems with that feature? (Score:5, Informative)
From that link: it is file-based, and a service indexes it (whereas ZFS is block-based and on-the-fly). And they first introduced it in Windows Server 2000. Amazing. I'm sure it is an ugly hack, since Windows has no soft/hard links IIRC.
Re:Any other file systems with that feature? (Score:4, Informative)
Re:This is good news... (Score:3, Informative)
Re:This is good news... (Score:5, Informative)
If you're running a normal desktop or laptop, this isn't likely to be of great use in any case. There's non-negligible overhead in the deduplication process, and drive space at consumer-level sizes is dirt-cheap, so it's only really worth doing this if you have a lot of block-level duplicate data. That might be the case if, e.g., you have 30 VMs on the same machine, each with a separate install of the same OS, but it is unlikely to be the case on a normal Mac laptop.
Re:Wake me when they build it into the hard disk (Score:1, Informative)
This is because you didn't use NetWare's tools to copy the files - the command-line NCOPY, for example, with /R and /RU (available since file compression was introduced with NetWare 4), would have copied the files in their compressed format, avoiding this (Link: http://support.novell.com/techcenter/articles/ana19940603.html [novell.com]). Using the Novell Client for Windows, I'd imagine that its Explorer shell integration would give you GUI tools too, though I no longer have a NetWare server to verify this, and I always preferred the command line anyway :).
No offense, but the scenario you describe is the result of ignorance, not poor design.
Re:Any other file systems with that feature? (Score:5, Informative)
Re:Any other file systems with that feature? (Score:4, Informative)
Windows Storage Server 2003 (yes, yes, I know it's from Microsoft) shipped with this feature, which is called Single Instance Storage.
http://blogs.technet.com/josebda/archive/2008/01/02/the-basics-of-single-instance-storage-sis-in-wss-2003-r2-and-wudss-2003.a [technet.com]
It's not even close to the same thing.
We investigated this a while back, and it is basically a dirty, filthy hack on top of vanilla NTFS.
First of all, it doesn't compare blocks or byte ranges, but entire files only. If two files are 99% identical, they are still considered different, and SIS won't merge them.
Second, it uses a reparse point to merge the files, which has significant overhead, at least 4KB for each file, if I remember correctly. That is, SIS won't save you any disk space for small files, which is actually quite common on file servers. The overhead erases much of the benefit even for larger files, to the level that SIS will skip files smaller than 32KB by default.
Third, it operates in the background, after files have been written. This means that files have to be written out in their entirety, read back in, compared byte-for-byte to another file, and then erased later. This is incredibly inefficient. On large file servers, the disk was thrashed like crazy.
Lastly, we found that the copy-on-write mechanism immediately copied out the entire file if it was changed even slightly. For small files this is not noticeable, but for large files it can be a massive performance hog. A 4 KB write can potentially be translated into a multi-GB copy!
Proper single-instancing systems use in-memory hash tables that are often partitioned using "file similarity" heuristics to prevent cache thrashing. Even more advanced systems can maintain single-instancing during replication and backups, reducing bandwidth requirements enormously. Take a look at the features of the Data Domain [datadomain.com] filers for an idea of what the current state of the art is.
Re:More reason to be a ZFS fanboy (Score:3, Informative)
How alarmist and uninformed; borderline FUD. The reality is as follows...
First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now. Same with crypto.
Second, mistakenly typing add instead of attach will result in a warning that the specified redundancy is different, and the command will refuse to proceed.
Third, yes, you can't expand the width of a RAID-Z. You can still grow it, though, by replacing its drives with larger ones. Once the block-pointer-rewrite work is merged, removal will be possible, and expansion won't be far off either.
Fourth, vdevs no longer autoexpand by default. If you want that behavior, you can set the autoexpand property to on.
Last, there was no such assumption; it is simply a matter of priorities. If it were an easier problem, it would have been done long ago, but I'm happy to be patient, knowing that it will be done right. Most everyone who has seriously used ZFS will understand that the advantages hugely outweigh these minor nits, which are easily worked around.
Re:More reason to be a ZFS fanboy (Score:5, Informative)
How alarmist and uninformed; borderline FUD. The reality is as follows...
First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.
The bug report for this problem goes back to at least April of 2003 [opensolaris.org]. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.
Re:More reason to be a ZFS fanboy (Score:1, Informative)
How alarmist and uninformed; borderline FUD. The reality is as follows...
First, you can't remove a vdev yet, but development is in progress, and support is expected very soon now.
The bug report for this problem goes back to at least April of 2003 [opensolaris.org]. With that background, and that I've been hearing ZFS proponents suggesting this is coming "very soon now" for years without a fix, I'll believe it when I see it. Raising awareness that Sun's development priorities clearly haven't been toward any shrinking operation isn't FUD, it's the truth. Now, to be fair, that class of operations isn't very well supported on anything short of really expensive hardware either, but if you need these capabilities the weaknesses of ZFS here do reduce its ability to work for every use case.
This is called "block pointer (bp) rewrite" in ZFS parlance. It was talked about at SNIA 2009 (p. 18):
http://www.snia.org/events/storage-developer2009/presentations/monday/JeffBonwick_zfs-What_Next-SDC09.pdf
As well as Kernel Conference Australia 2009 (~40:00):
http://blogs.sun.com/video/entry/kernel_conference_australia_2009_jeff
Jeff Bonwick and Bill Moore said that it'd be committed by the end of this year (along with dedupe (done) and crypto). They're cutting it a bit close, but I think the issues mentioned by the GP will not be a problem Real Soon Now.
Re:More reason to be a ZFS fanboy (Score:3, Informative)
TCP overhead at 1GbE for a modern processor is negligible - you're only talking about processing 120MB/sec or so.
Here is a document including a pretty graph: http://media.netapp.com/documents/tr-3628.pdf [netapp.com]
"...enabling the TCP Offload Engine (TOE) on the Linux hosts did not noticeably affect performance on the IBM blade side."
Re:Par for the course.. (Score:3, Informative)
The trick isn't using /dev/zero; the trick is using the seek parameter. The dd command seeks nearly 8 GiB into a newly created file and writes something there. This creates a file that is 8 GiB in size but has no data (not zeros, just nothing at all) in the first 8191 MiB. The system doesn't actually write anything there and doesn't even allocate the storage. If you read from those blocks, you get generated zeros. This is called a sparse file.
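The same trick dd plays with seek can be reproduced in a few lines. A sketch in Python: the 64 MiB size is an arbitrary stand-in for the 8 GiB example, st_blocks is POSIX-specific, and whether the hole actually stays unallocated depends on the filesystem supporting sparse files.

```python
import os
import tempfile

# Create a sparse file the way `dd ... seek=8191` does: seek past the
# end of a new file and write; the skipped range becomes a hole.
SIZE = 64 * 1024 * 1024  # 64 MiB apparent size (stand-in for 8 GiB)

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.seek(SIZE - 1)   # skip to the last byte...
        f.write(b"\0")     # ...and write a single byte there
    apparent = os.path.getsize(path)           # logical length: 64 MiB
    allocated = os.stat(path).st_blocks * 512  # physical storage, usually tiny
    print(apparent, allocated)
finally:
    os.remove(path)
```

Reading anywhere in the hole returns generated zeros, exactly as the parent describes; only the one written block (plus metadata) occupies disk.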