Exhaustive Data Compressor Comparison

crazyeyes writes "This is easily the best article I've seen comparing data compression software. The author tests 11 compressors: 7-zip, ARJ32, bzip2, gzip, SBC Archiver, Squeez, StuffIt, WinAce, WinRAR, WinRK, and WinZip. All are tested using 8 filesets: audio (WAV and MP3), documents, e-books, movies (DivX and MPEG), and pictures (PSD and JPEG). He tests them at different settings and includes the aggregated results. Spoilers: WinRK gives the best compression but operates slowest; ARJ32 is fastest but compresses least."
  • Maximum compression? (Score:1, Informative)

    by Anonymous Coward on Sunday April 22, 2007 @10:18PM (#18836261)
    http://www.maximumcompression.com/ [maximumcompression.com] ?
  • Skip the blogspam (Score:5, Informative)

    by Anonymous Coward on Sunday April 22, 2007 @10:19PM (#18836271)

    as it's slashdotted

    this site
    http://www.maximumcompression.com/ [maximumcompression.com]
    has been up for years and runs tests on all the compressors with various input sources; it's much more comprehensive
  • This is nothing new (Score:1, Informative)

    by Anonymous Coward on Sunday April 22, 2007 @10:20PM (#18836277)
    I remember people doing MUCH more exhaustive comparisons (30+ programs) back in the BBS days. Yes... it was a much simpler time.
  • by MBCook ( 132727 ) <foobarsoft@foobarsoft.com> on Sunday April 22, 2007 @10:23PM (#18836297) Homepage

    I read this earlier today through the firehose. It was interesting, but the graphs are what struck me. It seems to me all the graphs should have been XY plots instead of pairs of histograms. That way you could easily see the relationship between compression ratio and time taken. Their "metric" for showing this, basically multiplying the two numbers, is pretty bogus and isn't nearly as easy to compare. With an XY plot the four corners are all meaningful: one is slow with poor compression, two are good at just one thing (compression or speed), and the last is the sweet spot of good compression and good speed. It's easy to tell the two opposing corners apart (good compression vs. good speed), whereas with the article's metric they could look very similar. A quick sketch of the idea follows below.
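
    For what it's worth, here is a minimal matplotlib sketch of that kind of XY plot. The numbers are placeholders rather than the article's data; substitute your own measurements.

        # Plot average compression rate (%) against compression time (s) so the
        # speed/ratio trade-off is visible at a glance. Placeholder values only.
        import matplotlib.pyplot as plt

        results = {                 # name: (compression rate %, seconds)
            "gzip":  (13.0, 121.0),
            "bzip2": (17.0, 600.0),
            "WinRK": (23.2, 5400.0),
        }

        for name, (rate, seconds) in results.items():
            plt.scatter(seconds, rate)
            plt.annotate(name, (seconds, rate))

        plt.xlabel("Time to compress fileset (s)")
        plt.ylabel("Average compression rate (%)")
        plt.show()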

    Still, interesting to see. The popular formats are VERY well established at this point (ZIP on Windows and Mac (StuffIt seems to be fading fast), and GZIP and BZIP2 on Linux). They are so common (especially with ZIP support built into Windows since XP and also built into OS X) I don't think we'll see them replaced any time soon. Of course, with CPU power getting cheaper and cheaper we are seeing formats that are more and more compressed to begin with (MP3, H.264, DivX, JPEG, etc.), so these utilities are becoming less and less necessary. I no longer need to stuff files on floppies (I've got the net, DVD-Rs, and flash drives). Heck, if you look at some of the formats they "compressed" (at like 4% max), you almost might as well use TAR.

  • Re:duh (Score:3, Informative)

    by aarusso ( 1091183 ) on Sunday April 22, 2007 @10:27PM (#18836321)
    Well, since the dawn of ages I've seen ZIP vs. ARJ and bzip2 vs. gzip.
    What's the point? Same programs compressing the same data on a different computer.

    I use gzip for big files (takes less time)
    I use bzip2 for small files (compresses better)
    I use zip to send data to Windows people (a rough sketch of this split follows below)
    I really, really miss ARJ32. It was my favorite back in the DOS days.
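
    A rough standard-library sketch of that split (function names are mine):

        import bz2
        import gzip
        import shutil
        import zipfile

        def compress_big_file(path):
            # gzip: fast, decent ratio -- good for large files
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)

        def compress_small_file(path):
            # bzip2: slower, but usually a better ratio
            with open(path, "rb") as src, bz2.open(path + ".bz2", "wb") as dst:
                dst.write(src.read())

        def compress_for_windows(path):
            # zip: the safe choice when the recipient is on Windows
            with zipfile.ZipFile(path + ".zip", "w", zipfile.ZIP_DEFLATED) as zf:
                zf.write(path)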
  • by 644bd346996 ( 1012333 ) on Sunday April 22, 2007 @10:35PM (#18836393)
    These days, file compression is pretty much only used for large downloads. In those instances, you really have to use the gzip, pkzip, or bzip2 format so that your users can extract the file.

    Yes, having a good compression algorithm is nice, but unless you can get it to partially supplant zip, you'll never make much money off it. Also, most things these days don't need to be compressed. Video and audio are already encoded with lossy compression, web pages are so full of crap that compressing them is pointless, and hard drives are big enough. That said, I haven't seen any recent research on whether compressing entire filesystems helps reduce the hard-drive bottleneck. Still, I suspect it's not worth the effort.
  • by 644bd346996 ( 1012333 ) on Sunday April 22, 2007 @10:42PM (#18836443)
    TAR is not a compressor.
  • Exhaustive?! (Score:5, Informative)

    by jagilbertvt ( 447707 ) on Sunday April 22, 2007 @11:12PM (#18836625)
    It seems odd that they didn't include executables/DLLs in the comparison (whereas maximumcompression.com does). I also find it odd that they are compressing items that normally don't compress very well with most data compression programs (DivX/MPEGs/JPEGs/etc.). I'm guessing this is why 7-zip ranked a bit lower than most.

    I did some comparisons last year and found 7-zip to do the best job for what I needed (great compression ratio without requiring days to complete). The article also doesn't take into account the network speed at which the file is going to be transmitted. I use 7-zip for pushing application updates and such to remote offices (most over 384k/768k WAN links). Compressing with 7-zip has saved users quite a bit of time compared to WinRAR or WinZip.

    I would definitely recommend checking out maximumcompression.com (As others have, as well) over this article. It goes into a lot greater detail.
  • by moronoxyd ( 1000371 ) on Sunday April 22, 2007 @11:22PM (#18836681)
    > UM, yeah, the dataset includes WAV files. Try flac [sourceforge.net].
    > Then you will have exhausted a little more of the compression programs available.

    You are aware that all the tools tested are general purpose compressors, and FLAC is not, aren't you?

    Otherwise, you would also have to talk about WavPack, Monkey's Audio, Shorten and others.
    And those are only the lossless audio codecs. What about lossy codecs?

    What about all those different formats for pictures? They compress data as well.
    And what about the different video codecs? ...
  • by BluhDeBluh ( 805090 ) on Sunday April 22, 2007 @11:39PM (#18836775)
    It's closed-source and proprietary, though. Someone needs to make an open-source RAR compressor - the problem is you can't use the official code to do that (the licence specifically forbids it), but you could use unrarlib [unrarlib.org] as a basis...
  • by Anonymous Coward on Sunday April 22, 2007 @11:56PM (#18836855)
    Book: "Digital Compression for Multimedia: Principles & Standards", Morgan Kaufmann Publishers Inc.

    Interesting algorithms; I suppose the patents have expired. Key items:
    • Tail-biting LZ77.
    • Lempel-Ziv-Yokoo LZY 1992, Kiyohara and Kawabata 1996.
    • LZ78SEP.
    • LZWEP.
    • LZYEP.
    No War, Peace Again!
  • by Anonymous Coward on Monday April 23, 2007 @12:03AM (#18836897)
    Save yourself 24 pages of crap, here's the punchline:

    Aggregate Results

    Overall, WinRK was the champion at compressing the filesets. It had an average compression rate of 23.2%. It was 9% better at overall compression than its closest rival, SBC Archiver, which had an average compression rate of 21.3%.

    The poorest compressors overall, at default settings, were the trio of WinZip, gzip and ARJ32. They only had average compression rates of about 13%. ...

    However, gzip was the undisputed speed champion. It took just over 121 seconds to process the complete fileset collection, which weighed in at over 1.6GB. It was over a third faster than the runners-up, ARJ32 and WinZip.

    The other compressors were pretty slow at their normal compression settings. However, WinRK was extremely slow, compared to the others. It took almost 1.5 hours to compress the entire fileset collection. ...

    The most efficient data compressor for the aggregated results was gzip. Its super-fast compression speed, coupled with its average compression rate allowed it to become the undisputed overall efficiency champion. ARJ32 and WinZip were also very efficient compressors. They were more than twice as efficient as their nearest rivals, StuffIt and bzip2.

    The other compressors may have been good at certain files, but overall, they were pretty inefficient. The most inefficient compressor overall was WinRK, by a large margin. No matter how good it was at compressing files, its extremely slow compression speed totally killed its efficiency ratings.

    Conclusion

    WinRK was the best compressor in most filesets it encountered. So, it was not surprising that it was the overall compression champion. However, its compression prowess was offset by its abysmally slow speed. Even with a really fast system, it still took ages to compress the filesets. On several occasions, it took more than 18 minutes to compress just 200MB of files. Thanks to this flaw, it had the dubious honour of being the most inefficient compressor as well.

    SBC Archiver, which was just slightly poorer than WinRK at compression, was much faster at the job. Although it was nowhere near the top of the speed rankings, its faster speed allowed it to attain a moderate efficiency ranking.

    WinRAR, which is a favourite of many Internet users, displayed a surprisingly bland performance at default settings. Although it had a pretty good overall compression rate of just under 19%, it was very slow at its default settings. That made it the third most-inefficient compressor. Surprising, isn't it?

    In contrast, another perennial favourite, WinZip, which had a lower overall compression rate of 13%, managed to attain a much higher efficiency rating because it was able to compress the filesets much faster than WinRAR. Quite surprising, since many users have abandoned it for WinRAR in view of its rather dated compression algorithm.

    StuffIt is a dark horse. It has a pretty good compression rate overall but an unimpressive compression speed. However, its amazing performance with JPEG files cannot be denied. JPEG files are undeniably StuffIt's forte. No other compressor even comes within a light year of it.

    gzip and ARJ32 are both the fastest and the worst compressors of the lot. They have unimpressive overall compression rates but more than make up for it with their tremendous compression speeds. Therefore, it isn't surprising to see them garner the top two spots in compressor efficiency. However, we would still recommend a GUI alternative like WinZip. It is almost as efficient as gzip and ARJ32 and far more user-friendly.

    Based on our results, we can only come to one conclusion. If you do not like to change the settings of your data compressors and want a good, fast and user-friendly data compressor, then WinZip is the best one for the job.

    So there you have it - the results of the Normal Compression Test.
  • Re:duh (Score:5, Informative)

    by morcego ( 260031 ) on Monday April 23, 2007 @12:03AM (#18836901)

    So you already knew WinRK gave the best compression? I didn't; never even heard of it. My money would have been on bzip2.


    I agree with you on the importance of this article but ... bzip2 ? C'mon.
    Yes, I know it is better than gzip, and it is also supported everywhere. But it is much worse than the "modern" compression algorithms.

    I have been using LZMA for some time now for things I need to store longer, and getting good results. It is not on the list, but should give results a little bit better than RAR. Too bad it is only fast when you have a lot of memory.

    For short/medium time storage, I use bzip2. Online compression, gzip (zlib), of course.
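
    In case it's useful, a minimal sketch of that LZMA workflow using Python's built-in lzma module (the input file name is hypothetical):

        import lzma

        data = open("archive_me.tar", "rb").read()   # hypothetical input file

        packed = lzma.compress(data, preset=9)       # higher preset: better ratio, more RAM
        unpacked = lzma.decompress(packed)

        assert unpacked == data
        print(f"{len(data)} -> {len(packed)} bytes")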
  • by Rosyna ( 80334 ) on Monday April 23, 2007 @12:10AM (#18836939) Homepage
    By default, StuffIt won't even bother to compress MP3 files. That's why it shows an increase in file size (for the archive headers) and why it has the fastest throughput (it's not trying to compress). If you change the option, the results will be different.

    I imagine some other codecs also have similar options for specific file types.
  • Re:duh (Score:3, Informative)

    by yppiz ( 574466 ) * on Monday April 23, 2007 @12:38AM (#18837087) Homepage
    Another problem is that gzip has compression levels ranging from -1 (fast, minimal) to -9 (slow, maximal), and I suspect he only tested the default, which is either -6 or -7.

    I wouldn't be surprised if many of the other compression tools have similar options.
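
    A rough illustration of that spread, using Python's zlib (the same DEFLATE engine gzip uses); the input file name is hypothetical:

        import time
        import zlib

        data = open("some_large_file", "rb").read()   # hypothetical input

        for level in (1, 6, 9):                       # fast ... default ... maximal
            start = time.perf_counter()
            packed = zlib.compress(data, level)
            elapsed = time.perf_counter() - start
            print(f"level {level}: {len(packed)} bytes in {elapsed:.2f}s")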

    --Pat
  • Re:duh (Score:5, Informative)

    by timeOday ( 582209 ) on Monday April 23, 2007 @01:07AM (#18837201)

    I agree with you on the importance of this article but ... bzip2 ? C'mon.
    Well, now I know.

    Here's a scatterplot [theknack.net] of resulting file sizes and compression times from the text compression data (lower is better), and as my luck would have it, bzip2 is really the only one that's out of line - i.e. the furthest from the Pareto frontier [wikipedia.org]. But then, looking at the same data with file sizes plotted in the range of [0.0, 1.0] [theknack.net], it seems like there's a major case of diminishing returns for the expensive algorithms anyway. If you care at all about compression time, good ol' gzip is still a pretty decent choice!

  • by timeOday ( 582209 ) on Monday April 23, 2007 @01:12AM (#18837239)

    It was interesting, but the graphs are what struck me. It seems to me all the graphs should have been XY plots instead of pairs of histograms.
    Yup. [theknack.net].
  • Re:duh (Score:4, Informative)

    by Compact Dick ( 518888 ) on Monday April 23, 2007 @01:30AM (#18837349) Homepage

    LZMA ... is not on the list
    7-Zip [included in the test] is based on LZMA [wikipedia.org].
  • Re:duh (Score:4, Informative)

    by OmnipotentEntity ( 702752 ) on Monday April 23, 2007 @01:40AM (#18837399) Homepage
    7zip's default compression is LZMA, FYI.
  • by MCTFB ( 863774 ) on Monday April 23, 2007 @03:36AM (#18837879)
    Most modern compression utilities out there mix and match the same few algorithms for general-purpose lossless compression.

    With the exception of compressors that use arithmetic coding (which has patents out the wazoo covering just about every form of it), virtually all compressors use some form of Huffman compression. In addition, many use some form of LZW compression before executing the Huffman compression. That is pretty much it for general purpose compression.

    Of course, if you know the nature of the data you are compressing you can come up with a much better compression scheme.

    For instance, with XML, if you have a schema handy, you can do some really heavy optimization since the receiving side of the data probably already has the schema handy which means you don't need to bother sending some sort of compression table for the tags, attributes, element names, etc.

    Likewise, with FAX machines, run length encoding is used heavily because of all the sequential white space that is indicative of most fax documents. Run length encoding of white space can also be useful in XML documents that are pretty printed.
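
    As a toy illustration of run-length encoding applied to runs of white space (the function name and the "~" escape marker are mine, not any real format):

        def rle_spaces(text):
            # Replace each run of two or more spaces with a "~N" marker;
            # everything else passes through unchanged.
            out, run = [], 0
            for ch in text:
                if ch == " ":
                    run += 1
                    continue
                if run:
                    out.append(f"~{run}" if run > 1 else " ")
                    run = 0
                out.append(ch)
            if run:
                out.append(f"~{run}" if run > 1 else " ")
            return "".join(out)

        print(rle_spaces("<doc>\n        <item>  hello  </item>\n</doc>"))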

    Most compression algorithms that are very expensive to compress are usually pretty cheap to decompress. If you are providing a file for millions of people to download, it doesn't matter if it takes 5 days to compress the file if it still only takes 30 seconds for a user to decompress it. However, when doing peer to peer communication with rapidly generated data, you need the compression to be fast if you use any at all.

    Nevertheless, most general-purpose lossless compression formats are more or less clones of each other once you get down to analyzing what algorithms they use and how they are used.
  • by kyz ( 225372 ) on Monday April 23, 2007 @06:55AM (#18838591) Homepage
    Wow, are you speaking beyond your ken. When you say "some form of LZW compression", you should have said "some form of LZ compression" - either Lempel and Ziv's 1977 (sliding window) or 1978 (dictionary slots) papers on data compression by encoding matched literal strings. LZW is "some form" of LZ78 compression which, apart from GIFs, almost nobody uses. It's too fast and not compressy enough. Most things use LZH (LZ77 + Huffman), specifically DEFLATE, the kind used in PKZIP, partly because the ZIP file format is still very popular and partly because zlib is a very popular free library that can be embedded into anything.
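
    For concreteness, a minimal example of DEFLATE (LZ77 + Huffman) through Python's zlib binding:

        import zlib

        text = b"<item>pretty-printed XML</item>\n" * 100   # highly repetitive input

        deflated = zlib.compress(text, 9)     # level 9: slowest, best ratio
        restored = zlib.decompress(deflated)

        assert restored == text
        print(f"{len(text)} bytes -> {len(deflated)} bytes")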

    Fax machines use a static Huffman encoding. They've never used run-length encoding. Run-length encoding is nothing compared to how efficiently LZ77 or LZ78 would handle pretty-printed XML.

    Compression algorithms vary on both their compression and decompression speed. LZ77 is slow to compress and fast to decompress. Arithmetic coding and PPM are slow both compressing and decompressing.
  • by Fweeky ( 41046 ) on Monday April 23, 2007 @10:24AM (#18839915) Homepage
    RAR has recovery records (settable percentage of each archive dedicated to ECC, default off) and recovery volumes (dedicated files with PAR-like recovery capabilities). "Keep broken files" can be used to extract from broken or truncated archives.
  • by swilver ( 617741 ) on Monday April 23, 2007 @06:06PM (#18846367)
    Read up on arithmetic coding. Basically it works by creating a huge floating-point number. For example, let's say you want to encode a stream like this: "ABBBBABBBB". Statistically, A has a 20% chance of occurring, while B has an 80% chance of occurring. With Huffman you could encode this (obviously) as 0111101111, which takes 10 bits. Huffman encoding, being limited to whole bits, has no way to take advantage of the fact that "B" occurs 80% of the time.

    With arithmetic coding, however, you'd encode each character according to the exact probability it has of occurring and write the result as a fractional number between 0 and 1. For example, if you want to encode an "A", you'd pick a number between 0.0 and 0.2 (the lower 20% of the range); if you want to encode a "B", you'd use a number between 0.2 and 1.0 (the upper 80% of the range).

    What you keep track of during encoding is the upper and lower bound of this number. So, when I want to encode the first "A", my lower bound is 0.0 and the upper bound is 0.2. The next character to encode is "B". We already know the range we can pick from is 0.0 - 0.2, but to encode a "B" we need to pick a number in the upper 80% of this range, so 0.04 to 0.20 (picking a number between 0.0 and 0.04 would encode another "A").

    The next letter, another "B", would use a range 0.072 - 0.200. The 3rd "B" would narrow the range to 0.0976-0.2000. The 4th "B" narrows it to 0.11808 - 0.2000.

    At some point, the upper and lower bound will have a few most significant digits in common that cannot change anymore. When this occurs, you can start writing these out as part of your compressed stream. For example, when we encode the 6th character (the 2nd "A"), the range becomes 0.118080 - 0.134464. The first two digits (0.1) can't change anymore now, so we can write them out, and just continue narrowing the range further for subsequent data to be compressed.

    At some point, there'll be no more data to be compressed, and you then just pick a number (as convenient as possible) between the upper and lower bound you have established, write it out and end the stream. The process is the same when doing this with binary floating point numbers.
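
    A tiny encoder-only sketch of that range narrowing (floats for clarity; real coders use integer ranges and renormalisation, and the function name is mine):

        def arithmetic_encode(message, p_a=0.2):
            low, high = 0.0, 1.0
            for ch in message:
                width = high - low
                if ch == "A":                       # A owns the lower 20% of the range
                    high = low + width * p_a
                else:                               # B owns the upper 80%
                    low = low + width * p_a
                print(f"{ch}: [{low:.6f}, {high:.6f})")
            return (low + high) / 2                 # any value in the final range will do

        print("encoded value:", arithmetic_encode("ABBBBABBBB"))

    Running it reproduces the intermediate ranges worked through above (0.04 - 0.20, 0.072 - 0.200, and so on).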
