A Look at Data Compression
With the new year fast approaching, many of us look to the unenviable task of backing up last year's data to make room for more of the same. That being said, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
Speed (Score:3, Insightful)
More time = More compression (Score:5, Insightful)
The one surprising thing I found in the article was that two virtually unknown contenders - WinRK and Squeez - did so well. One disappointing omission, and an obvious follow-up question, is how more well-known applications such as WinZip or WinRAR (which have more mass appeal) stack up against them with their configurable higher-compression options.
Nice Comparison... (Score:5, Insightful)
I personally use 7-Zip. It doesn't perform the best, but it is free software and it includes a command-line component that is nice for shell scripts.
Open formats and long-term accessibility (Score:5, Insightful)
The same can't be said for WinRK. Therefore, if you plan to want access to your data for a long period of time, you should carefully consider whether the format will be accessible.
Why compress in weird formats? (Score:5, Insightful)
I generally prefer gzip/7-Zip.
The reasoning is simple: I can use the results cross-platform without special, costly software. A few extra bytes of space are secondary.
For many files, I also find buying a larger disk a cheaper option than spending hours compressing/uncompressing files. So I generally only compress files that are very compressible and that I don't think I will need.
Re:Why compress in the first place? (Score:5, Insightful)
Sometimes, people have to download things.
Re:Why compress in the first place? (Score:3, Insightful)
Compressing files with a good compression program does not increase the chance of them being corrupted.
And the majority of files people send to each other, etc., aren't simply ASCII files (even if yours are).
The other advantage of using a compression program is the majority of them create archives and allow you to consolidate all the related files.
A good archive/compression program will add a couple of percent of redundancy data, which can substantially increase data integrity, above and beyond what you have by simply storing an ASCII file uncompressed.
My concern with all the 'new' compression programs is that they, unlike Zip, haven't survived the test of time. I've recovered damaged zip archives in the past and they have come through mostly intact. I've used archive/compression programs like ARJ with options to recover data even if there are multiple bad sectors on a hard drive or floppy disk. How many of the new compression programs have the tools available to adequately recover every possible byte of data?
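To make the integrity point a bit more concrete, here is a minimal sketch, assuming Python's standard `gzip` module stands in for "a compression program". It only shows detection, not recovery: the gzip format stores a CRC-32 of the uncompressed data, so a damaged archive is at least noticed instead of silently producing bad data. The payload and the choice of which byte to flip are made up for illustration.

```python
import gzip

# Hypothetical payload standing in for a real backup file.
original = b"backup payload " * 1000
archive = bytearray(gzip.compress(original))

# A gzip member ends with 8 bytes: CRC-32 of the data, then its length.
# Flip the bits of one CRC byte to simulate damage on disk or in transit.
archive[-8] ^= 0xFF

try:
    gzip.decompress(bytes(archive))
    detected = False
except OSError:  # gzip.BadGzipFile is a subclass of OSError
    detected = True

print("corruption detected:", detected)
```

Actually repairing the damage needs real recovery data, which this sketch does not add.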
Re:Speed (Score:3, Insightful)
Exactly! We compress -terabytes- here at wr0k, and we use gzip for -nearly- everything (some of the older scripts use "compress",
Why? 'cause it's fast. 20% of space just isn't worth the time needed to compress/uncompress the data. I tried to be modern (and cool) by using bzip2. Yes, it's great, saves lots of space, etc., but the time required to compress/uncompress is just not worth it. I.e., if you need to compress/decompress 15-20 gigs per day, bzip2 just isn't there yet.
Also, look at what Google is using---they probably store more data than most other corps, but they still use gzip (I think, from some description, somewhere).
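For anyone who wants to check the space-versus-time trade-off on their own data, a rough sketch along these lines may help. It assumes Python's standard `zlib` (the DEFLATE algorithm gzip uses), `bz2` and `lzma` modules as stand-ins for the command-line tools, and the sample data is a made-up log line repeated many times, so the numbers it prints say nothing about real workloads.

```python
import bz2
import lzma
import time
import zlib

def measure(name, compress, decompress, data):
    """Return (name, ratio, compress_seconds, decompress_seconds) for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    mid = time.perf_counter()
    restored = decompress(packed)
    end = time.perf_counter()
    assert restored == data  # the round trip must be lossless
    return name, len(data) / len(packed), mid - start, end - mid

# Hypothetical sample input: repetitive log-like text, roughly 1 MB.
sample = b"2005-12-27 12:00:00 GET /index.html 200 some-user-agent\n" * 20_000

results = [
    measure("zlib (gzip's DEFLATE)", zlib.compress, zlib.decompress, sample),
    measure("bz2", bz2.compress, bz2.decompress, sample),
    measure("lzma", lzma.compress, lzma.decompress, sample),
]

for name, ratio, c_time, d_time in results:
    print(f"{name:24s} ratio {ratio:8.1f}:1  "
          f"compress {c_time:.3f}s  decompress {d_time:.3f}s")
```

Swap `sample` for a slice of your own files to get figures that mean something for your situation; compression and decompression times are reported separately because they often differ a lot.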
Re:Speed (Score:3, Insightful)
Re:Why compress in weird formats? (Score:3, Insightful)
Re:Speed (Score:5, Insightful)
Because it makes a hell of a lot of sense. (Score:5, Insightful)
Let's say you have a 200MB file to send. You could just send the 200MB file, with no guarantees that it will reach the destination uncorrupted. Or, you could use a compression program and bring it down to 100MB. In this case, even if you lost the first transfer, you could transfer it a second time. Then we look at PAR. You compress the 200MB file and split the result into ten 10MB files. Then, you could include 10% parity - if any one of your files is bad, you'd be able to reconstruct it with the parity file, with only 110MB of transfer. PAR2 goes even further by breaking down each file into smaller units.
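The parity idea can be sketched in a few lines. Note the assumption: this uses plain XOR parity, which can rebuild exactly one lost block, and is not the Reed-Solomon coding that PAR and PAR2 actually use. The function names and the payload are hypothetical, chosen only to mirror the "ten files plus 10% parity" example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def split_blocks(data, count):
    """Split data into `count` equal-size blocks, zero-padding the last one."""
    size = -(-len(data) // count)  # ceiling division
    padded = data.ljust(size * count, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(count)]

def recover(blocks, parity):
    """Rebuild the single missing block (marked None) from the rest plus parity."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can repair exactly one lost block")
    survivors = [b for b in blocks if b is not None]
    repaired = list(blocks)
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired

payload = bytes(range(256)) * 40      # stand-in for the compressed archive
blocks = split_blocks(payload, 10)    # ten data blocks
parity = xor_blocks(blocks)           # one extra block, about 10% overhead

damaged = list(blocks)
damaged[3] = None                     # pretend block 3 was lost in transit
restored = recover(damaged, parity)
assert b"".join(restored)[:len(payload)] == payload
```

Losing two blocks defeats this scheme, which is one reason real tools use stronger codes and let you choose how much recovery data to generate.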
Besides transfer times and correction for network transfers, compression can also increase speeds of transfer to media. If you have an LTO tape drive that can only write to tape at 20MB/sec, you'll only ever get 20MB/sec. Add compression to the drive, and you could theoretically get 40MB/sec to tape with 2:1 compression. That means faster backups, and faster restores. On-board compression in the drives takes all the load off the CPU - but even if you use the CPU for it, modern CPUs are fast enough to handle it.
Not to mention, it takes a lot less tape to make compressed backups. I don't know what world you live in, but in mine, I don't have unlimited slots in the library and I don't want to swap tapes twice a day. Handling tapes is detrimental to their lives; you really want to touch them as little as possible.
Data corruption isn't caused by compression. If it's going to happen, it'll happen regardless. While your point is true that it MAY be more difficult to recover from a corrupt file, that's not the right methodology. If your backups are that valuable, you'd make multiple copies - plain and simple.
I can't fathom why a responsible and well informed admin would avoid compression.
accuracy test missing (Score:2, Insightful)
There's an article in there somewhere? (Score:5, Insightful)
Re:Just use DiskDoubler (Score:3, Insightful)
Re:Because it makes a hell of a lot of sense. (Score:3, Insightful)
Media cost is not the only argument for compression - in fact I didn't mention media price at all. I did mention the library capacity, however - and getting an even bigger library is a lot more expensive a prospect than the $.75 you quoted per GB. Did you read the whole part of my post about speeds? If I can restore that database in half the time because of compression, that means less down time and less money lost. (Although, the money-lost factor doesn't really apply at a government institution; we're not selling anything.)
"If you backup more than once UNCOMPRESSED, you can recover almost anything because it is VERY unlikely that a bad sector will occur in the exact same spot or even in the same file (assuming the one file does not take up most of the specific media.)"
Wouldn't this apply to a compressed backup, too? You're assuming here that the file was unchanged in between the two backups - thus it would apply to any data, compressed or not.
"Alternatively, use PAR files to recover - as long as you're willing to add the extra space and time - which sort of obviates the advantage of compression, doesn't it?"
No - it simply lowers the compression ratio a bit. If you're getting 2:1 compression and add 10% pars, you're still looking at a 1.8:1 compression ratio, but with recoverability.
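The arithmetic behind that figure, as a small sketch. The function name and parameters are made up for illustration, and the 200MB input is borrowed from the earlier example in this thread.

```python
def effective_ratio(original_mb, compression_ratio, parity_fraction):
    """Ratio of original size to (compressed size + parity overhead)."""
    compressed_mb = original_mb / compression_ratio
    stored_mb = compressed_mb * (1 + parity_fraction)
    return original_mb / stored_mb

# 200MB compressed 2:1 is 100MB; 10% parity brings it to 110MB stored.
ratio = effective_ratio(200, 2.0, 0.10)
print(f"{ratio:.2f}:1")  # prints "1.82:1"
```

So the parity data costs roughly a tenth of the savings, not all of them.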
----
Within every IT budget, you must balance out the speed, recoverability, and cost of your backup solution.
In your solution of never using compression (since, as you said, no admin should do that) you lose a lot of speed in backups and restores. Speed of recovery is a key factor in many environments. It's often the top question asked in discussions of new backup solutions. You talk about this as an important point, yet excluding compression could double your restore times, or more. Not to mention backup speeds - if you can take your backups in half the time, you effectively double the number of servers you could back up in the same amount of time. Or, you reduce the amount of time servers are busy with backups.
Recoverability is big - you want your backups to be reliable. Most of the time, any corruption is unacceptable, be it in a compressed file or not. It's either good or you throw it out and go back to the previous backup. Many IT shops are doing multiple backups these days - backup to disk first, then to tape. Then take snapshots of those tapes and bring them off-site. Compressed or not, testing your backups and ensuring you have no problems with hardware is much more effective than using uncompressed backups and performing forensics on them if they're bad. Speaking of which, I don't see why compressed data would be less recoverable.
Finally, you have cost. Yes, even when data recoverability is a key factor, you still have to consider cost. So, what makes more sense? Using uncompressed backups that will back up and restore slower, cost a lot more for media and library capacity, and cause more personnel overhead for swapping tapes - or using compression and cutting all that in half? You'd rather lose all that on the off chance that MAYBE you could recover more of your data, on the off chance that NONE of your other backups are good? I don't know any responsible IT manager who could agree with you.
A proper backup and recovery plan with periodic testing and multiple copies held on-site and off is a much more effective solution than betting on forensic recovery of uncompressed data.
Hey, I'm not claiming that compression is always right in every situation. That's far fro
Re:Speed (Score:5, Insightful)
The pages are shamefully loaded with ads! I could barely find the next-page links at the bottom of the window! At first, I thought a "Google Ad" link labeled "compression" might be the next page, and clicked on it! And the true link is oddly hidden in small print, in a corner beneath a large table of PriceGrabber comparison results.
The article is basically unreadable, I'd say, due to the ads.
Compressing jpegs (Score:1, Insightful)
Also, for a lot of applications, compression speed is not important; decompression speed is. If you're distributing software, it's not much of a problem if it takes a lot of time to compress, but if the install takes ages because the decompression is too slow, it does matter.
Embarrassing ads - This is an ad cash-grab (Score:3, Insightful)
Please, editors, check the sites out first. If it's 90% ads and impossible to navigate without clicking ads accidentally, it's just some loser's cash-grab site.
Re:Speed (Score:2, Insightful)