
A Look at Data Compression 252

With the new year fast approaching, many of us face the unenviable task of backing up last year's data to make room for more of the same. With that in mind, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
This discussion has been archived. No new comments can be posted.

  • Re:Speed (Score:5, Informative)

    by sedmonds ( 94908 ) on Monday December 26, 2005 @02:42PM (#14340397) Homepage
    There seems to be a compression speed section on page 12 - Aggregated Results - ranging from gzip (really fast) to WinRK (really slow).
  • There are some amazing compression programs out there; the trouble is they tend to take a while and consume lots of memory. PAQ [fit.edu] gives some impressive results, but the latest benchmark figures [maximumcompression.com] are regularly improving. Let's not forget that compression is not much good unless it is integrated into a usable tool. 7-zip [7-zip.org] seems to be the new archiver on the block at the moment. A closely related, but different, set of tools is the archivers [freebsd.org], of which there are many, with some older formats still not supported by open source tools.
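    For anyone who hasn't played with it yet, driving 7-zip from the command line is simple enough. A minimal sketch, assuming the p7zip package is installed and ~/documents is just a placeholder path:
    # create an archive with 7-zip's default LZMA settings
    7z a backup.7z ~/documents
    # or trade time for space with the maximum preset
    7z a -mx=9 backup-max.7z ~/documents
    # extract, preserving the directory structure
    7z x backup.7z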
  • by Anonymous Coward on Monday December 26, 2005 @03:35PM (#14340681)
    Looks like you can still grab a copy of it here [www.sac.sk].
  • Re:Compressia (Score:2, Informative)

    by Insurgent2 ( 615836 ) on Monday December 26, 2005 @03:58PM (#14340809)
  • Re:Speed (Score:5, Informative)

    by sshore ( 50665 ) on Monday December 26, 2005 @04:09PM (#14340861)
    They do it to sell more ad impressions. Each time you go to the next page you load a new ad.
  • Nothing to see here (Score:5, Informative)

    by Anonymous Coward on Monday December 26, 2005 @04:15PM (#14340892)

    I can't believe TFA made /. The only thing more defective than the benchmark data set (Hint: who cares how much a generic compressor can save on JPEGs?) is the absolutely hilarious part where the author just took "fastest" for each compressor and then tried to compare the compression. Indeed, StuffIt did what I consider the only sensible thing for "fastest" in an archiver, which is to just not even try to compress content that is unlikely to get significant savings. Oddly, the list for fastest compression is almost exactly the reverse of the list for best compression on every test. The "efficiency" is a metric that illuminates nothing. An ROC plot of rate vs compression for each test would have been a good idea; better would be to build ROC curves for each compressor, but I don't see that happening anytime soon.

    I wouldn't try to draw any conclusions from this "study". Given the methodology, I wouldn't wait with bated breath for parts two and three of the study, where the author actually promises to try to set up the compressors for reasonable compression, either.

    Ouch.

  • by bigbigbison ( 104532 ) on Monday December 26, 2005 @04:18PM (#14340900) Homepage
    Since the original site seems to be really slow and split into a billion pages, those who aren't aware of it might want to look at MaximumCompression [maximumcompression.com] since it has tests for several file formats and also has a multiple file compression test that is sorted by efficiency [maximumcompression.com]. A program called SBC [netfirms.com] does the best, but the much more common WinRAR [rarlab.com] comes in a respectable third.
  • by DeadboltX ( 751907 ) on Monday December 26, 2005 @04:42PM (#14341031)
    Sounds like you need to introduce yourself to the world of par2 ( http://www.quickpar.org.uk/ [quickpar.org.uk] )

    Parity reconstruction

    Think of it like the year 2805, where scientists can regrow someone's arm if they happen to lose it.
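    To make that concrete, here's a rough sketch of adding parity data to a set of split archive chunks, assuming the command-line par2 tool (par2cmdline) rather than the QuickPar GUI, and purely hypothetical file names:
    # create recovery blocks covering roughly 15% of the data
    par2 create -r15 recovery.par2 backup-*.gz
    # later, check whether anything is damaged or missing
    par2 verify recovery.par2
    # and rebuild any damaged or missing chunks from what survives
    par2 repair recovery.par2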
  • Unicode support? (Score:3, Informative)

    by icydog ( 923695 ) on Monday December 26, 2005 @05:01PM (#14341105) Homepage
    Is there any mention of Unicode support? I know that WinZip is out of the question for me because I can't compress anything with Chinese filenames in it. They either won't work at all, or they'll compress but the filenames turn into garbage. Even though the data stays intact, that doesn't help much if it's a binary with no intelligible filename.

    I've been using 7-Zip for this reason, and also because it compresses well while also working on Windows and Linux.
  • Re:Speed (Score:3, Informative)

    by Wolfrider ( 856 ) <kingneutron@@@gmail...com> on Monday December 26, 2005 @05:22PM (#14341187) Homepage Journal
    Yah, when I'm running backups and it has to Get Done in a reasonable amount of time with decent space savings, I use
    gzip -9. (My fastest computer is a 900MHz AMD Duron.)

    For quick backups, gzip; or gzip -6.

    For REALLY quick stuff, gzip -1.

    When I want the most space saved, I (rarely) use bzip2 because rar, while useful for splitting files and retaining recovery metadata, is far too slow for my taste 99% of the time.

    Really, disk space is so cheap these days that Getting the Backup Done is more important than saving (on average) a few megs here and there.

    But if you Really Need that last few megs of free space, this is an OK guide to which compressor does that the best -- even if it takes *days.*
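    For the curious, a quick way to see that speed/size trade-off on your own data - just a sketch, with /data standing in for whatever you actually back up:
    # fastest, biggest output
    time tar cf - /data | gzip -1 > backup-fast.tar.gz
    # gzip's default level
    time tar cf - /data | gzip -6 > backup-default.tar.gz
    # slowest, smallest output
    time tar cf - /data | gzip -9 > backup-best.tar.gz
    ls -l backup-*.tar.gz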
  • by EdMcMan ( 70171 ) <moo.slashdot2.z.edmcman@xoxy.net> on Monday December 26, 2005 @05:36PM (#14341252) Homepage Journal
    It's a crime that the submitter didn't mention this was with the fastest compression settings.
  • by Dwedit ( 232252 ) on Monday December 26, 2005 @06:04PM (#14341355) Homepage
    They are testing 7-zip at the FAST setting, which does a poor job compared to the BEST setting.
  • by Anonymous Coward on Monday December 26, 2005 @09:32PM (#14342206)
    Why is it bad?

    Because data backups corrupt, but often they do not corrupt all the way.

    Which leaves the possibility open for partial recovery. Especially if only part of the data is needed, this can be "good enough".

    However if the entire data set is compressed and a part of it corrupts it can make it very difficult to recover the data that is still uncorrupted. In this case think of data compression as a low-grade form of data encryption.

    That's not to say that you CAN'T compress data in a safe way, it's just that you have to be very smart about it.

    Case in point.

    Let's say you're a Unix/Linux user. You have nice choices between tar, dd, dump, cpio, and other forms of data copying utilities, each with their own strengths and weaknesses.

    Then you have different compression techniques to choose from: bzip2, zip, gzip, rar, etc.

    So let's say you choose gzip and tar, which are good old standbys that do a good job and are almost universally recognized and supported.

    However, the directory tree you want to back up is bigger than the medium you're backing up to. Say you have an 8 GB directory tree and you're backing up to 650 MB CD-ROMs.

    So the kneejerk response is to:
    tar czf - source | split -b 650m - backup-
    That will create a bunch of backup-aa, backup-ab, backup-ac, etc. files of 650 MB each, which is a nice size for burning to CD-ROMs.

    However, if one of the cdroms is burned incorrectly or gets lost, then when you go:
    cat backup-* | tar zxf -

    Then you have hosed all your data.

    So instead of compressing along with the tarball and THEN splitting, you tar, then split, then compress. Use smaller chunks to make the pieces easier to handle; since the chunks now compress at different rates, you can pack them neatly into CD-ROM-sized nuggets.
    tar cf - source | split -b 50m - backup-
    then gzip each chunk and copy with a simple script or whatnot (a rough sketch is at the end of this comment).

    Now if you have a missing cdrom or part of the cdrom is toast...
    you gunzip all the remaining files...
    cat backup-* | tar xf -
    it will bitch when it comes across a partially-there file and exit.

    Then, with some manual work, you remove the chunks that have already been restored and finish up with the rest. Tar will complain that it's missing the starting point for some data and skip it, but when it reaches the next file header it will happily finish up with what it has left.

    You still lose some data, but the rest of the data is easily recoverable.

    Also keep in mind that different data compresses differently.

    If you have uncompressed audio data, it makes more sense to compress it to ogg, mp3, or flac before backing it up. Likewise with images, it makes sense to aggressively compress them to png (lossless) or jpeg (lossy) before backing them up. You'll get much better compression that way. (However, seeing that most of us get this type of content off the net, it's probably already compressed.)
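    As promised above, a rough sketch of the split-then-compress approach - directory names and chunk sizes are only placeholders, adjust to taste:
    # tar first, split into 50 MB chunks, no compression yet
    tar cf - /home/user/source | split -b 50m - backup-
    # compress each chunk on its own, so one bad CD only costs you that chunk
    for chunk in backup-*; do
        gzip "$chunk"
    done
    # then burn the resulting backup-*.gz files, packing them onto CDs as they fit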
  • by Meostro ( 788797 ) on Monday December 26, 2005 @09:53PM (#14342279) Homepage Journal
    If you even take the most basic/well studied Lempel-Ziv and Huffman algorithms you'll quickly find cases where each would be preferred over another.
    That's sort of the point of this test though, to see which of the general-purpose compressors (GPC) is going to give you the best overall results. Yes, you should use FLAC [sourceforge.net] for WAVs, and probably StuffIt [stuffit.com] for JPEGs, but what is your best choice if you're going to have just one, or just a few? I don't want 200 different compressors for 200 different content types, I want one.

    As a matter of practicality, right now you need zip or gzip, and bzip2 is gaining ground. If you're going to create new content, you should offer both bz2 and zip. In the future, maybe you should use 7z or sit instead, it depends on the rate of adoption. Personally, I don't think zip will ever die.
    And since different algm's identify different patterns in the file their compressing, certain files will be compressed better by different algorithms and do much worse on the next file. Besides, we're not even getting into any discussion of lossy/lossless algm's here. (Think jpeg vs bmp).
    Generally, you will pick a special-purpose compressor for lossy compression, and a GPC for lossless compression. Your audio compressor will probably be MP3 or OGG, your images will probably be JPG, videos will be MPG. It's not efficient to use MP3 compression on your images, it's designed with different constraints. Either for the same bitrate the image is much worse quality, or for the same quality the file will be much larger than necessary. The same goes for lossless compressors too, FLAC works much better than ZIP on audio data, but I would bet if you used a BMP file as the source for compression FLAC would probably be bad and ZIP would probably be average.

    If you want to compress 300 files of various types, you need a GPC. That doesn't mean the GPC doesn't have special-purpose algorithms built into it; it just means that, on average over mixed content, it will perform better than a single special-purpose compressor.

    Kolmogorov complexity [wikipedia.org], or at least an estimate thereof, is what you're talking about. For any specific dataset, the Kolmogorov complexity is the minimum size of compressed data + decompressor. It can't [wikipedia.org] be calculated, but it is a measure of performance for any combination of compressor and dataset. For WAVs, you will probably see this:
    K(FLAC, WAVs) < K(GPC, WAVs)

    However, for an evenly-distributed general dataset of generic binary files, TXT, JPG, PDF, TIF, PNG, MP3, WAV, and MPG, you will probably find that for any SPC (special-purpose compressor for any of the individual data types):
    K(GPC, dataset) < K(SPC, dataset)
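    An easy way to sanity-check that intuition at home - just a sketch, assuming flac and gzip are installed and sample.wav is any uncompressed PCM file you have lying around:
    # special-purpose lossless compressor, built for audio
    flac --best -o sample.flac sample.wav
    # general-purpose lossless compressor on the same data
    gzip -9 -c sample.wav > sample.wav.gz
    # compare the sizes; the FLAC file should come out far smaller
    ls -l sample.wav sample.flac sample.wav.gz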
  • by speedplane ( 552872 ) on Monday December 26, 2005 @11:28PM (#14342653) Homepage

    Lossless data compression is a pretty well studied subject. Shannon started it back in the 40's and plenty of research has gone into it since.

    There are basically three ways to do lossless compression: Huffman, Arithmetic, and LZW. Technically Huffman can achieve the best of the three, however it's generally the worst because of implementation issues (it would take a lot of processing to do rigorous Huffman encoding).

    Arithmetic coding is generally better but is difficult to implement. I think IBM is the company that actually sells an arithmetic coder (I could be wrong though).

    LZW is by far the best of the three (you can read online how it works), but alas it is patented and anyone who gives away free copies of it will get sued.

    I know for a fact that gzip uses Huffman, which would explain its lackluster performance. I haven't researched it further, but I would not be surprised if the three proprietary compression programs which "won" this review use LZW. I also wouldn't be surprised if they pay a good amount to LZW's patent holders (Unisys, I think).

    I'd be interested to see how gzip performs on its "maximum compression" setting. Like I said earlier, Huffman can achieve the theoretical limit on compression where LZW cannot.

  • by hereticmessiah ( 416132 ) on Tuesday December 27, 2005 @03:30AM (#14343461) Homepage

    Huffman coding and arithmetic coding are both entropy encoding algorithms. While perfectly fine compression algorithms in their own right, they're also commonly used to squeeze the last bits of entropy out of a data stream produced by another compression or transformation algorithm. Arithmetic coding suffers from chilling effects caused by IBM patents, and so isn't as commonly used as it might be. An unencumbered alternative is range encoding, which gives performance not far off that of arithmetic coding. Range encoding and arithmetic coding are both variants of the same basic technique of entropy encoding. That said, the compression difference between Huffman coding and arithmetic coding is minimal. I think (though I'm not entirely sure) entropy encoding might be a subset of a larger family of algorithms called Markov modelling.

    LZW is a refinement of LZ78, and is a dictionary coding algorithm. Similarly, the DEFLATE algorithm is based on LZ77 (which has variants such as LZSS), another form of dictionary coding. gzip uses DEFLATE, as do zip and PNG. DEFLATE first compresses the stream with an LZ77 variant, and then compresses the resulting stream with Huffman coding to squeeze out some remaining redundancy. LZW is no longer covered by patents, at least not here in Europe.

    So what you wrote about Huffman coding, arithmetic coding, and LZW was largely misinformed. There are two main families of lossless methods: entropy encoding and dictionary coding, with Huffman coding and arithmetic coding representing the former and LZW representing the latter. Some compression algorithms combine the two, DEFLATE being an example.
