A Look at Data Compression
With the new year fast approaching, many of us face the unenviable task of backing up last year's data to make room for more of the same. To that end, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
Re:Speed (Score:5, Informative)
This is a surprisingly big subject (Score:5, Informative)
Re:More time = More compression (Score:1, Informative)
Re:Compressia (Score:2, Informative)
Re:Speed (Score:5, Informative)
Nothing to see here (Score:5, Informative)
I can't believe TFA made /. The only thing more defective than the benchmark data set (Hint: who cares how much a generic compressor can save on JPEGs?) is the absolutely hilarious part where the author just took "fastest" for each compressor and then tried to compare the compression. Indeed, StuffIt did what I consider the only sensible thing for "fastest" in an archiver, which is to just not even try to compress content that is unlikely to get significant savings. Oddly, the list for fastest compression is almost exactly the reverse of the list for best compression on every test. The "efficiency" is a metric that illuminates nothing. An ROC plot of rate vs compression for each test would have been a good idea; better would be to build ROC curves for each compressor, but I don't see that happening anytime soon.
I wouldn't try to draw any conclusions from this "study". Given the methodology, I wouldn't wait with bated breath for parts two and three of the study, where the author actually promises to try to set up the compressors for reasonable compression, either.
Ouch.
Maximum Compression has efficiency comparisons (Score:5, Informative)
Re:Why compress in the first place? (Score:3, Informative)
Parity reconstruction
Think of it like the year 2805, where scientists can regrow someone's arm if they happen to lose it
Unicode support? (Score:3, Informative)
I've been using 7-Zip for this reason, and also because it compresses well while also working on Windows and Linux.
Re:Speed (Score:3, Informative)
gzip -9. (My fastest computer is 900MHz AMD Duron.)
For quick backups, gzip; or gzip -6.
For REALLY quick stuff, gzip -1.
When I want the most space saved, I (rarely) use bzip2 because rar, while useful for splitting files and retaining recovery metadata, is far too slow for my taste 99% of the time.
Really, disk space is so cheap these days that Getting the Backup Done is more important than saving (on average) a few megs here and there.
But if you Really Need that last few megs of free space, this is an OK guide to which compressor does that the best -- even if it takes *days.*
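As a rough illustration of that speed/size trade-off, here's a minimal sketch comparing gzip's fastest and best settings (the file names and test data are made up for the example, not from any of the posts above):

```shell
# Generate some highly compressible test data (illustrative only).
seq 1 200000 > sample.txt

gzip -1 -c sample.txt > fast.gz   # -1: fastest, least compression
gzip -9 -c sample.txt > best.gz   # -9: slowest, best compression

# best.gz should come out no larger than fast.gz,
# usually noticeably smaller.
wc -c sample.txt fast.gz best.gz
```

How big the gap between -1 and -9 is depends heavily on the data, which is why the default (-6) is usually a reasonable middle ground.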
Completely out of context (Score:5, Informative)
This test is worthless (Score:4, Informative)
Compression bad for data's health. (Score:1, Informative)
Because data backups corrupt, but often they do not corrupt all the way.
Which leaves the possibility open for partial recovery. Especially if only part of the data is needed, this can be "good enough".
However if the entire data set is compressed and a part of it corrupts it can make it very difficult to recover the data that is still uncorrupted. In this case think of data compression as a low-grade form of data encryption.
That's not to say that you CAN'T compress data in a safe way, it's just that you have to be very smart about it.
Case in point.
Let's say you're a Unix/Linux user. You have nice choices between tar, dd, dump, cpio, and other data copying utilities, each with its own strengths and weaknesses.
Then you have different compression techniques to choose from: bzip2, zip, gzip, rar, etc.
So let's say you choose gzip and tar, good old standbys that do a good job and are almost universally recognized and supported.
However, the directory tree you want to back up is bigger than the medium you're backing up to. Say you have an 8 gig directory tree and you're backing up to 650 meg cdroms.
So the kneejerk response is to:
tar czf - source | split -b 650m - backup-
So that will create a bunch of backup-aa, backup-ab, backup-ac, etc. files that are 650 megs each, which is a nice size for burning to cdroms.
However, if one of the cdroms is burned incorrectly or gets lost, then when you go:
cat backup-* | tar zxf -
Then you have hosed all your data.
So instead of compressing the tarball and THEN splitting it, you split the uncompressed tarball first and compress each piece afterwards. Use a smaller chunk size to make the files easier to handle; since the pieces compress at different rates, you can then pack them neatly into cdrom-sized nuggets.
tar cf - source | split -b 50m - backup-
then do the gzip and copy with a simple script or whatnot.
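That "simple script" might look something like this minimal sketch (the backup- prefix and 50 meg chunk size come from the commands above; everything else is an assumption):

```shell
# Split the uncompressed tarball into 50 MB chunks.
tar cf - source | split -b 50m - backup-

# Compress each chunk independently, so a bad disc only
# costs you that one chunk, not the whole archive.
for chunk in backup-??; do
    gzip "$chunk"
done
```

Each backup-??.gz can now be gunzipped on its own, independent of the others.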
Now if you have a missing cdrom or part of the cdrom is toast...
you gunzip all the remaining files...
cat backup-* | tar xf -
it will bitch when it comes across a partially-there file and exit.
Then, with some manual work, you remove the chunks that have already been extracted successfully and finish up with the rest. Tar will complain that it's missing the starting point for some data; ignore that, because when it reaches the next file header it will happily finish up with what it has left.
You still lose some data, but the rest of the data is easily recoverable.
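Putting those recovery steps together, a sketch might look like this (chunk names are illustrative, and it assumes the chunks were gzipped individually as described):

```shell
# Decompress every surviving chunk; a corrupt chunk only fails
# its own gunzip instead of poisoning the whole stream.
for f in backup-*.gz; do
    gzip -d "$f" || echo "corrupt chunk: $f, skipping"
done

# Concatenate what's left and extract.  Data that straddled
# the bad chunk is lost; tar picks up at the next file header.
cat backup-?? | tar xf -
```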
Also keep in mind that different data compresses differently.
If you have uncompressed audio data it makes more sense to compress it to ogg, mp3, or flac before backing it up. Likewise with images, it makes sense to aggressively compress them to png (lossless) or jpeg (lossy) before backing them up. You'll get much better compression ratios. (However, seeing that most of us get this type of data off the net, it's probably already compressed.)
Re:Nothing to see here (Score:3, Informative)
As a matter of practicality, right now you need zip or gzip, and bzip2 is gaining ground. If you're going to create new content, you should offer both bz2 and zip. In the future, maybe you should use 7z or sit instead, it depends on the rate of adoption. Personally, I don't think zip will ever die.
Generally, you will pick a special-purpose compressor for lossy compression, and a GPC for lossless compression. Your audio compressor will probably be MP3 or OGG, your images will probably be JPG, videos will be MPG. It's not efficient to use MP3 compression on your images; it's designed with different constraints. Either for the same bitrate the image is much worse quality, or for the same quality the file will be much larger than necessary. The same goes for lossless compressors too: FLAC works much better than ZIP on audio data, but I would bet that if you used a BMP file as the source, FLAC would probably be bad and ZIP would probably be average.
If you want to compress 300 files of various types, you need a GPC. That doesn't mean that the GPC doesn't have special-purpose algorithms built into it, it just means that on-average it will perform better than a special-purpose compressor.
Kolmogorov complexity [wikipedia.org], or at least an estimate thereof, is what you're talking about. For any specific dataset, the Kolmogorov complexity is the minimum size of compressed data + decompressor. It can't [wikipedia.org] be calculated, but it is a measure of performance for any combination of compressor and dataset. For WAVs, you will probably see this:
K(FLAC, WAVs) < K(GPC, WAVs)
However, for an evenly-distributed general dataset of generic binary files, TXT, JPG, PDF, TIF, PNG, MP3, WAV, and MPG, you will probably find that for any SPC (special-purpose compressor for any of the individual data types):
K(GPC, dataset) < K(SPC, dataset)
The analysis is kind of silly (Score:2, Informative)
Lossless data compression is a pretty well studied subject. Shannon started it back in the 1940s and plenty of research has gone into it since.
There are basically three ways to do lossless compression: Huffman, Arithmetic, and LZW. Technically Huffman can achieve the best of the three, however it's generally the worst because of implementation issues (it would take a lot of processing to do rigorous Huffman encoding).
Arithmetic coding is generally better but is difficult to implement. I think IBM is the company that actually sells an arithmetic coder (I could be wrong though).
LZW is by far the best of the three (you can read online how it works), but alas it is patented and anyone who gives away free copies of it will get sued.
I know for a fact that gzip uses Huffman, which would explain its lackluster performance. I haven't researched it further, but I would not be surprised if the three proprietary compression programs which "won" this review use LZW. I also wouldn't be surprised if they pay a good amount to LZW's patent holders (Unisys I think).
I'd be interested to see how gzip performs on its "maximum compression" setting. Like I said earlier, Huffman can achieve the theoretical limit on compression where LZW cannot.
Wrong, wrong, wrong, wrong, wrong. (Score:3, Informative)
Huffman coding and arithmetic coding are both entropy encoding algorithms. While perfectly fine compression algorithms in their own right, they're also commonly used to squeeze the last bits of entropy out of a data stream produced by another compression or transformation algorithm. Arithmetic coding suffers from chilling effects caused by IBM patents, and so isn't as commonly used as it might be. An unencumbered alternative is range encoding, which gives performance not too far off that of arithmetic coding. Range encoding and arithmetic coding are both variants of the same basic technique of entropy encoding. That said, the compression difference between Huffman coding and arithmetic coding is minimal. I think (though I'm not entirely sure) entropy encoding might be a subset of a larger family of algorithms called Markov modelling.
LZW is a refinement of LZ78, a dictionary coding algorithm; LZSS is a variant of the related LZ77. Similarly, the DEFLATE algorithm is based on LZ77, another form of dictionary coding. gzip uses DEFLATE, as do zip and PNG. DEFLATE first compresses the stream with an LZ77 variant, then compresses the resulting stream with Huffman coding to squeeze out remaining redundancy. LZW is no longer covered by patents, at least not here in Europe.
So what you wrote about Huffman coding, arithmetic coding, and LZW was largely misinformed. There are two broad lossless methods, entropy encoding and dictionary coding, with Huffman coding and arithmetic coding representing the former and LZW representing the latter. Some compression algorithms combine the two, DEFLATE being an example.