Exhaustive Data Compressor Comparison 305
crazyeyes writes "This is easily the best article I've seen comparing data compression software. The author tests 11 compressors: 7-zip, ARJ32, bzip2, gzip, SBC Archiver, Squeez, StuffIt, WinAce, WinRAR, WinRK, and WinZip. All are tested using 8 filesets: audio (WAV and MP3), documents, e-books, movies (DivX and MPEG), and pictures (PSD and JPEG). He tests them at different settings and includes the aggregated results. Spoilers: WinRK gives the best compression but operates slowest; ARJ32 is fastest but compresses least."
Re:Screw speed, size reduction: gimme compatibilit (Score:2, Interesting)
I have to admit I switched over/back to ZIP about a year ago for everything for exactly this reason. yeah, it meant a lot of my old archives increased in size (sometimes by quite a bit), but knowing that anything anywhere can read the archive makes up for it. ZIP creation and decoding is supported natively by Mac and Windows and most Linux distros right from the GUI, so it makes it brain-dead simple to deal with.
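The "anything anywhere can read it" point extends to scripting too: ZIP support ships in every mainstream language's standard library. As a hypothetical illustration (my own sketch, not from the comment), Python's zipfile module can round-trip an archive with nothing installed beyond the interpreter:

```python
import zipfile

# Create a ZIP archive with standard DEFLATE compression, which every
# mainstream unzip tool (Windows Explorer, macOS Finder, Linux file
# managers) can read natively. Filenames here are made up for the demo.
with zipfile.ZipFile("notes.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("readme.txt", "Anything, anywhere, can read this.")

# Reading it back needs nothing beyond the standard library either.
with zipfile.ZipFile("notes.zip") as zf:
    text = zf.read("readme.txt").decode()
print(text)  # → Anything, anywhere, can read this.
```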
Pizzachish: setting a new standard in languages (Score:2, Interesting)
I have been thinking about creating a new language with about 60 or so words. The idea is that you don't need a lot of words when you can figure out the meaning by context. Strong points are that the language would be very easy to pick up, and you would get that invigorating feeling of talking like a primitive cave man.
As an example of the concept, we have the words walk and run. They are a bit too similar to be worth wasting one of our precious few 60 words. Effectively, one could be dropped, with the other taking on a broader meaning, without any real repercussions. The words sit and shit are also fairly similar. When you have a guest over, you can say something like, "Please, shit down." Because of context, it would be all okay. Just remember, there is a difference between shitting on the toilet and shitting in the toilet.
Re:What's the point of compressing JPEG,MP3,DivX e (Score:5, Interesting)
Re:Skip the blogspam (Score:3, Interesting)
Agreed completely. (Score:5, Interesting)
Getting stuff out of some of those formats now is a real irritation. I haven't run into a case yet that's been totally impossible, but sometimes it's taken a while, or turned out to be a total waste of time once I've gotten the archive open.
Now, I try to always put a copy of the decompressor for whatever format I use (generally just tar + gzip) onto the archive media, in source form. The entire source for gzip is under 1MB, trivial by today's standards, and if you really wanted to cut size and only put the source for deflate on there, it's only 32KB.
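The tar + gzip approach described above can be sketched with Python's standard library alone (filenames and contents here are hypothetical, not from the comment): the decompressor source rides along inside the same media as the payload.

```python
import tarfile

# Hypothetical sketch: bundle a payload and a copy of the decompressor
# source into one gzip-compressed tar, so the archive media carries the
# means of its own extraction.
with open("payload.txt", "w") as f:
    f.write("data worth keeping for decades\n")
with open("gzip-source.c", "w") as f:
    f.write("/* imagine the ~32KB deflate source here */\n")

with tarfile.open("archive.tar.gz", "w:gz") as tar:
    tar.add("payload.txt")
    tar.add("gzip-source.c")

with tarfile.open("archive.tar.gz") as tar:
    names = sorted(tar.getnames())
print(names)  # → ['gzip-source.c', 'payload.txt']
```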
It may sound tinfoil-hat, but you can't guarantee what the computer field is going to look like in a few decades. I had self-expanding archives, made using Compact Pro on a 68k Mac, thinking they'd make the files easy to recover later, which hasn't helped me at all now -- a modern (Intel) Mac won't touch them (although to be fair a PPC Mac will run OS 9, which will, and allegedly there's a Linux utility that will unpack Compact Pro archives, although maybe not self-expanding ones).
Given the rate at which bandwidth and storage space are expanding, I think the market for closed-source, proprietary data compression schemes should be very limited; there's really no good reason to use them for anything that you're storing for an unknown amount of time. You don't have to be a believer in the "infocalypse" to realize that operating systems and entire computing-machine architectures change over time, and what's ubiquitous today may be unheard of in a decade or more.
Re:duh (Score:5, Interesting)
Missing part 3 of 10? No problem!
Of course, I'm a holder of a license for Rar from way back when. I like it.
Re:Pizzachish: setting a new standard in languages (Score:3, Interesting)
you might be interested in this:
http://www.tokipona.org/ [tokipona.org]
Didn't have Tridge's rzip... (Score:3, Interesting)
http://samba.org/junkcode/ [samba.org]
Tridge is one of the smart guys behind samba. And rzip is pretty clever for certain things. Just ask google.
Re:What's the point of compressing JPEG,MP3,DivX e (Score:5, Interesting)
Both methods do the same thing: they statistically analyse all the data, then re-encode it so the most common values are encoded in a smaller way than the least common values.
Huffman's main limitation is that each value compressed needs to consume at least one whole bit. Arithmetic coding can effectively spend a fraction of a bit per value, packing several values into a single bit. That's why arithmetic coding can always match or beat Huffman: it isn't bound by Huffman's one-bit-per-symbol floor.
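To see that one-bit floor concretely, here is a small illustration (my own sketch, not from the comment): a Huffman code built over a heavily skewed distribution still assigns the overwhelmingly common symbol a full bit, so the average can never drop below 1 bit/symbol, while the Shannon entropy, which arithmetic coding can approach, is well under a bit.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Build Huffman code lengths for a {symbol: probability} dict."""
    # Heap items: (probability, tiebreak, {symbol: code_length_so_far}).
    heap = [(p, i, {s: 0}) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: length + 1 for s, length in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.95, "b": 0.03, "c": 0.02}
lengths = huffman_code_lengths(probs)
avg_bits = sum(probs[s] * lengths[s] for s in probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
# Every Huffman code word is at least 1 bit long, so avg_bits >= 1.0,
# even though the entropy of this source is only about a third of a bit.
print(lengths, round(avg_bits, 2), round(entropy, 2))
```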
However, Huffman is NOT patented, while most forms of arithmetic coding, including the one used in the JPEG standard, ARE patented. The authors of Stuffit did nothing special - they just paid the patent fee. Now they just unpack the Huffman-encoded JPEG data and re-encode it with arithmetic coding. If you take some JPEGs that are already compressed with arithmetic coding, Stuffit can do nothing to make them better. But 99.9% of JPEGs are Huffman coded, because it would be extortionately expensive for, say, a digital camera manufacturer, to get a JPEG arithmetic coding patent license.
So Stuffit doesn't have remarkable code, they just paid money to get better compression that 99.9% of people specifically avoid because they don't think it's worth the money.
What are they actually measuring? (Score:3, Interesting)
Having said that, do I really care in practice whether algorithm A is 5% faster than algorithm B? I personally do not; I care whether the person receiving the files can open them. So the second problem with the article is that it is one computer user on his own, whereas in the real world you would just distribute
LZMA is used in 7-zip (Score:1, Interesting)
However, it is the algorithm used in 7-Zip. It is represented in this test.
Speaking as a person with interest in 64K intros, LZMA is an awesome, awesome algorithm if you need fast decompression and *small decompression code*. A carefully hand-tuned implementation of an LZMA decompressor would be less than 2K of assembly code, and could perhaps be crammed into 1K by a sufficiently clever hacker. This is an order of magnitude smaller than most algorithms that can give comparable compression performance.
The high compression of LZMA comes from combining two basic, well-proven compression ideas: sliding dictionaries (LZ77-style matching) and Markov models (i.e. the thing used by every compression algorithm that uses an arithmetic encoder or similar order-0 entropy coder as its last stage). LZMA is awesome because the contexts used in its model are segregated according to what the bits are used for. Folding that knowledge right into the model results in a simple but very effective compression scheme.
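As a rough illustration of what the sliding dictionary buys (a sketch using Python's stdlib lzma bindings, not the hand-tuned assembly decompressor the parent describes), LZMA crushes data with long-range repetition while the decompression call stays trivial:

```python
import lzma

# Data with lots of long-range repetition: a sliding-dictionary coder's
# favourite food.
data = b"the quick brown fox jumps over the lazy dog. " * 2000

compressed = lzma.compress(data, preset=9)
restored = lzma.decompress(compressed)

assert restored == data
# The repeated phrase collapses to back-references, so the compressed
# stream is a tiny fraction of the input size.
print(len(data), len(compressed))
```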