A Look at Data Compression
With the new year fast approaching, many of us face the unenviable task of backing up last year's data to make room for more of the same. That being said, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
WinRK is excellent (Score:5, Interesting)
Windows only (Score:3, Interesting)
Actually (Score:5, Interesting)
Unix compressors (Score:5, Interesting)
Why compress in the first place? (Score:1, Interesting)
Speed aside [and speed would be a huge concern if you insisted on compression], I just don't understand the desire for compression in the first place.
As the administrator, your fundamental obligation is data integrity. If you compress, and if the compressed file store is damaged [especially if the header information on a compressed file - or files - is damaged], then you will tend to lose ALL of your data.
On the other hand, if your file store is ASCII/ANSI text, then even if file headers are damaged, you can still read the raw disk sectors and recover most of your data [might take a while, but at least it's theoretically do-able].
In this day and age, when magnetic storage is like $0.50 to $0.75 per GIGABYTE, I just can't fathom why a responsible admin would risk the possible data corruption that could come with compression.
Input type? (Score:3, Interesting)
Also, do any of you know any lossless algorithms for media (movies, images, music, etc)? Most algorithms perform poorly in this area, but I thought that perhaps there were some specifically designed for this.
Re:Why compress in the first place? (Score:5, Interesting)
Because when you are storing Petabytes of information it makes a difference in cost.
Besides, all the problems you mention with data corruption can be solved by backing up the information more than once. Any place that puts a high value on its info is going to have multiple backups in multiple places anyway. The most useful application of compression is in archiving old customer records. Being mostly text, you can easily get above 50% compression ratios. Also, these are going to be backed up to tape (not disk). Being able to reduce the volume of tapes being stored by 50% can save a lot of money for a large organization.
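A quick sketch of that point with Python's zlib (the sample record is made up, but it mimics the repetitive, field-heavy shape of archived customer data, which is why text records routinely beat 50% savings):

```python
import zlib

# Hypothetical customer record: plain text with repeated field names
# and formatting, which is exactly what deflate exploits.
record = ("name=John Doe; account=12345; status=active; "
          "balance=0.00; last_order=none\n") * 1000
raw = record.encode()

compressed = zlib.compress(raw, level=9)
saved = 1 - len(compressed) / len(raw)
print(f"saved {saved:.0%} of the original size")

# Round-trip to confirm nothing was lost.
assert zlib.decompress(compressed) == raw
```

Real records vary more than this toy example, so the savings will be lower, but mostly-text data clearing 50% is the common case, not the exception.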
Re:More time = More compression (Score:2, Interesting)
Not only time, but also how much memory the algorithm uses, though the author did not mention how much memory each one needs. gzip, for instance, does not use much, but others, like rzip (http://rzip.samba.org/ [samba.org]), use a lot. rzip may use up to 900MB during compression.
I did a test compressing a 4GB tar archive with rzip, which resulted in a compressed file of 2.1 GB. gzip at max compression gave about 2.7 GB.
So one should choose an algorithm based on need and, of course, availability of source code. Using a proprietary, closed-source compression algorithm with no open source alternative implementation is begging for trouble down the road.
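The memory/ratio tradeoff is visible even inside plain zlib, which exposes its internal memory budget through the memLevel parameter. This is only an illustration of the knob, not a model of rzip (whose appetite comes from its huge history window); the sample data is made up:

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 5000

def compress_with(mem_level):
    # memLevel ranges 1..9; lower values use less memory for the
    # internal match-finding tables, usually at some cost in ratio.
    c = zlib.compressobj(level=9, memLevel=mem_level)
    return c.compress(data) + c.flush()

tight = compress_with(1)   # smallest memory budget
roomy = compress_with(9)   # largest memory budget
print(len(tight), len(roomy))
```

On this trivially repetitive input the gap is small; on messier real-world data the low-memory setting gives up noticeably more ratio, which is the same tradeoff rzip pushes to the extreme.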
Re:Why compress in the first place? (Score:5, Interesting)
The solution to this issue is popular on usenet, since it's common for large files to be damaged. There's a utility called par2 that allows recovery information to be sent, and it's extremely effective. It's format-neutral, but most large binaries are sent as multi-part RAR archives. par2 can handle just about any damage that occurs, up to and including missing files.
Most of the time however, when it's simply someone downloading something it is only necessary to detect damage so they can download it again. All the formats I have experience with can detect damage, and it's common for MD5 and SHA1 sums to be sent separately anyway for security reasons.
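For that detect-only case, the standard hashlib module is all you need; a minimal sketch (the payload stands in for a real downloaded archive):

```python
import hashlib

payload = b"example archive contents"

# Digests a sender might publish alongside the file.
md5 = hashlib.md5(payload).hexdigest()
sha1 = hashlib.sha1(payload).hexdigest()

# A single flipped bit changes the digest, so damage is detected --
# though a bare checksum can only detect it; repair is par2's job.
damaged = b"Example archive contents"
assert hashlib.md5(damaged).hexdigest() != md5
print("download damaged, fetch it again")
```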
Re:More time = More compression (Score:5, Interesting)
So I would consider gzip the best performer by this criterion. After all, if I cared most about space savings, I'd have picked the best mode, not the fast mode. All this article suggests is that a few archivers are REALLY lousy at FAST compression.
If my requirements were realtime compression (maybe for streaming multimedia) then I wouldn't be bothered with some mega-compression algorithm that takes 2 minutes per MB to pack the data.
Might I suggest a better test? If interested in best compression, run each program in the mode that optimizes purely for compression ratio. On the other hand, if interested in realtime compression, tweak each algorithm's parameters so that they all run in the same (relatively fast) time, and then compare compression ratios.
With the huge compression of multimedia files, I'd also want the reviewers to state explicitly that the compression was verified to be lossless. I've never heard of some of these proprietary apps, but if they're getting significant ratios out of...
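That methodology is easy to sketch with the stdlib compressors (zlib, bz2, and lzma stand in for the reviewed apps; the sample data is invented): run each in its best mode, time it, report the ratio, and verify the round trip is lossless, exactly as the parent asks.

```python
import bz2
import lzma
import time
import zlib

data = b"multimedia-ish sample payload " * 20000

# Each codec in its maximum-compression mode.
codecs = {
    "zlib": (lambda d: zlib.compress(d, 9), zlib.decompress),
    "bz2":  (lambda d: bz2.compress(d, 9), bz2.decompress),
    "lzma": (lambda d: lzma.compress(d, preset=9), lzma.decompress),
}

for name, (comp, decomp) in codecs.items():
    t0 = time.perf_counter()
    out = comp(data)
    elapsed = time.perf_counter() - t0
    # Verify the compression was actually lossless before trusting the ratio.
    assert decomp(out) == data
    print(f"{name}: ratio={len(out)/len(data):.4f}, time={elapsed:.2f}s")
```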
Re:Why compress in the first place? (Score:4, Interesting)
Not all data is stored as ASCII and/or ANSI. Compressing the data can make it more secure, not less:
1. It takes up fewer sectors of a drive, so it is less likely to get corrupted.
2. It can contain extra data to recover from bad bits.
3. It allows you to make redundant copies without using any more storage space.
Let's say you have some ASCII files you want to store. Using almost any compression method, you can probably store three copies of a file in the same amount of disk space as one uncompressed copy.
You are far more likely to recover a full data set from three copies of compressed file than from one copy of an uncompressed file.
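A sketch of that arithmetic with zlib (the ASCII sample is invented and unusually repetitive; real text won't shrink this far, but it typically clears the 3:1 ratio the argument needs):

```python
import zlib

original = b"plain ASCII log data, line after line of it\n" * 2000

one_plain = len(original)
three_compressed = 3 * len(zlib.compress(original, 9))

# Three redundant compressed copies vs. one uncompressed copy.
print(f"{three_compressed} bytes (x3 compressed) vs {one_plain} bytes (x1 plain)")
```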
Also, we do not have unlimited bandwidth and unlimited storage everywhere. Lossless video, image, and audio files take up a lot of space. For some applications MP3, Ogg, MPG, and JPEG just don't cut it.
So yes, compression is still important.
small mistake (Score:5, Interesting)
Since WinZip does not handle .7z, .ace, or .rar files, it has lost much of its appeal for me. With my old serial no longer working, I now have absolutely no reason to use it. Now when I need a compressor for Windows, I choose WinAce & 7-Zip. Between those two programs, I can compress and decompress just about any format you're likely to encounter online.
Re:Speed (Score:3, Interesting)
If your file starts out at 250 MB, it might be worth it. However, if you start with a 2.5 GB file, then it's almost certainly not -- especially once you take the closed-source and undocumented nature of the compression algorithm into account.
Re:Why compress in the first place? (Score:0, Interesting)
Re:Input type? (Score:3, Interesting)
No one ever looks at rzip (Score:4, Interesting)
Decompression Speed (Score:4, Interesting)
JPG compression (Score:5, Interesting)
I'd heard the makers of StuffIt were claiming this, but I was sceptical; it's good to see independent confirmation.
Why does ANYBODY Bother with WinZip? (Score:4, Interesting)
Proprietary, costs money...
I use ZipGenius - handles 20 compression formats including RAR, ACE, JAR, TAR, GZ, BZ, ARJ, CAB, LHA, LZH, RPM, 7-Zip, OpenOffice/StarOffice Zip files, UPX, etc.
You can encrypt files with one of four algorithms (CZIP, Blowfish, Twofish, Rijndael AES).
If you set an antivirus path in ZipGenius options, the program will prompt you to perform an AV scan before running the selected file.
It has an FTP client, TWAIN device image importing, file splitting, convert RAR into SFX, converts any Zip archive into an ISO image file, etc.
And it's totally free.
Re:Speed (Score:5, Interesting)
another case is when you only have 100 MB of space available and only a zzzxxxyyy archiver can compress the data into that 100 MB, while gzip -9 leaves you with 102 MB.
so it really depends if you need it or not. sometimes you need it, mostly you don't.
but bashing on the issue "like nobody ever needs it" is certainly wrong.
Lest We Forget - Philip W. Katz (Score:4, Interesting)
Phillip W. Katz, better known as Phil Katz (November 3, 1962 – April 14, 2000), was a computer programmer best known as the author of PKZIP, a program for compressing files that ran under the PC operating system DOS.
http://en.wikipedia.org/wiki/Phil_Katz [wikipedia.org]
Re:Nice Comparison... (Score:2, Interesting)
Speed is better than bzip2's, and compression is top class, beaten only by 7-Zip and other LZMA compressors (which require much more time and memory). The problem is that decompression runs at the same speed as compression, unlike bzip2/gzip/zip, where decompression is much faster.
The review quoted above is totally useless because 7-Zip, for example, was run with a 32 KB dictionary. Given a 200 MB dictionary it really starts to perform quite well! I would not be surprised if 7-Zip came out the winner there with a better compression parameter.
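The dictionary-size effect is easy to demonstrate with Python's lzma module, which lets you set the LZMA2 dictionary size in a custom filter chain (the sizes and data here are illustrative): two identical random blocks sit 300 KB apart, so only a dictionary bigger than that gap can match the second block against the first.

```python
import lzma
import random

# Two identical ~300 KB pseudo-random blocks back to back: the only
# redundancy is a match at a distance of 300,000 bytes.
rng = random.Random(0)
block = rng.randbytes(300_000)
data = block * 2

def xz(dict_size):
    filt = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filt)

small = xz(1 << 16)  # 64 KB dictionary: can't reach back to the first block
large = xz(1 << 24)  # 16 MB dictionary: second block becomes one long match
print(len(small), len(large))
```

With the small dictionary the output stays near the full 600 KB; with the large one it drops to roughly half, which is the parent's point about benchmark parameters mattering.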
Re:Why compress in weird formats? (Score:3, Interesting)
Re:Speaking of Comparisons (Score:3, Interesting)
I knew I had seen this story before, but it wasn't here. This article was up on Digg three days ago [digg.com] with only three Diggs to its name (at the time of this writing), but it's front-page news here? Interesting, to say the least...
I predict that this Digg [digg.com] will become frontpage Slashdot news shortly. It was quite popular (914 diggs so far) and it's hit the three-day mark...
I know, this is all so OT, but it's no worse than whining about duplicate postings here...
Oh, the irony here is just too much to take without laughing! My comment gets hammered with the REDUNDANT pummel when I point out that /. is being REDUNDANT in posting old Diggs? Man, it just doesn't get any better than this to make a point.
Moderators: did you catch the not-so-subtle play I made here by quoting ALL of my original message? In case you didn't, I'm being REDUNDANTLY sarcastic...
Enjoy!