A Look at Data Compression

With the new year fast approaching, many of us face the unenviable task of backing up last year's data to make room for more of the same. With that in mind, rojakpot has taken a look at some of the data compression programs available and has a few insights that may help when looking for the best fit. From the article: "The best compressor of the aggregated fileset was, unsurprisingly, WinRK. It saved over 54MB more than its nearest competitor - Squeez. But both Squeez and SBC Archiver did very well, compared to the other compressors. The worst compressors were gzip and WinZip. Both compressors failed to save even 200MB of space in the aggregated results."
This discussion has been archived. No new comments can be posted.

  • Speed (Score:3, Insightful)

    by mysqlrocks ( 783488 ) on Monday December 26, 2005 @02:37PM (#14340361) Homepage Journal
    No talk of the speed of compression/decompression?
  • by bigtallmofo ( 695287 ) on Monday December 26, 2005 @02:39PM (#14340369)
    For the most part, the summary of the article seems to be that the more time a compression application takes to compress your files, the smaller they end up.

    The one surprising thing I found in the article was that two virtually unknown contenders - WinRK and Squeez - did so well. The obvious follow-up question is how better-known applications such as WinZip or WinRAR (which have a more mass-appeal audience) would stack up against them with their configurable higher-compression options turned on.

  • Nice Comparison... (Score:5, Insightful)

    by Goo.cc ( 687626 ) * on Monday December 26, 2005 @02:43PM (#14340398)
    but I was surprised to see that the reviewer was using XP Professional Service Pack 1. I actually had to double check the review date to make sure that I wasn't reading an old article.

    I personally use 7-Zip. It doesn't perform the best, but it is free software and it includes a command-line component that is nice for shell scripts.
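
    As a rough sketch of that kind of scripting - assuming the 7z binary is installed and on your PATH; the paths are placeholders:

        # Minimal sketch: drive 7-Zip's command-line tool from a script.
        # Assumes the `7z` binary is on PATH; the paths below are placeholders.
        import subprocess

        def archive(paths, out="backup.7z"):
            # "a" adds files to an archive; "-mx=9" asks for maximum compression.
            subprocess.run(["7z", "a", "-mx=9", out, *paths], check=True)

        archive(["reports/", "logs/"])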
  • by ahziem ( 661857 ) on Monday December 26, 2005 @02:48PM (#14340432) Homepage
    A key benefit of the PKZIP and tarball formats is that they will remain accessible for decades, maybe hundreds of years. These formats are open (non-proprietary), widely implemented, and supported by free (as in freedom) software.

    The same can't be said for WinRK. Therefore, if you expect to need access to your data for a long time, you should carefully consider whether the format will still be readable.
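
    As a small illustration of that openness, both formats can be read with nothing beyond the Python standard library (the file names below are placeholders):

        # Both formats are readable with just the standard library; file names are placeholders.
        import tarfile, zipfile

        with tarfile.open("backup.tar.gz", "r:gz") as tar:
            print(tar.getnames())        # list the archived paths

        with zipfile.ZipFile("backup.zip") as zf:
            print(zf.namelist())         # same idea for a .zip archive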
  • by canuck57 ( 662392 ) on Monday December 26, 2005 @02:54PM (#14340470)

    I generally prefer gzip/7-Zip.

    The reasoning is simple: I can use the results cross-platform without special, costly software. A few extra bytes of space is secondary.

    For many files, I also find buying a larger disk a cheaper option than spending hours compressing/uncompressing files. So I generally only compress files I don't think I will need that are very compressible.

  • by ArbitraryConstant ( 763964 ) on Monday December 26, 2005 @03:00PM (#14340508) Homepage
    "I just don't understand the desire for compression in the first place."

    Sometimes, people have to download things.
  • by topham ( 32406 ) on Monday December 26, 2005 @03:03PM (#14340521) Homepage
    I'd call you a troll, but I think you were being honest.

    Compressing files with a good compression program does not increase the chance of it being corrupted.

    And the majority of files people send to each other aren't simply ASCII files (even if yours are).

    The other advantage of using a compression program is that most of them create archives, letting you consolidate all the related files.

    A good archive/compression program will add a couple of percent of redundancy data, which can substantially increase data integrity - above and beyond what you get by simply storing an ASCII file uncompressed.

    My concern with all the 'new' compression programs is that they, unlike Zip, haven't survived the test of time. I've recovered damaged zip archives in the past and they have come through mostly intact. I've used archive/compression tools like ARJ with options that let you recover data even if there are multiple bad sectors on a hard drive or floppy disk. How many of the new compression programs have the tools available to adequately recover every possible byte of data?
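
    For what it's worth, the per-file CRCs in a zip archive at least make a basic integrity check easy. A minimal sketch - the archive name is a placeholder, and this only detects damage, it doesn't repair it:

        # Basic integrity check using the CRCs stored in a zip archive.
        # "old-backup.zip" is a placeholder; this detects corruption, it does not repair it.
        import zipfile

        with zipfile.ZipFile("old-backup.zip") as zf:
            bad = zf.testzip()   # re-reads every member and verifies its CRC
            print("all members OK" if bad is None else f"first bad member: {bad}")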
  • Re:Speed (Score:3, Insightful)

    by Anonymous Coward on Monday December 26, 2005 @03:08PM (#14340550)
    No talk of the speed of compression/decompression?

    Exactly! We compress -terabytes- here at wr0k, and we use gzip for -nearly- everything (some of the older scripts use "compress", .Z, etc.)

    Why? 'Cause it's fast. An extra 20% of space just isn't worth the time needed to compress/uncompress the data. I tried to be modern (and cool) by using bzip2 - yes, it's great, saves lots of space, etc. - but the time required to compress/uncompress is just not worth it. I.e., if you need to compress/decompress 15-20 gigs per day, bzip2 just isn't there yet.

    Also, look at what google is using---they probably store more data than most other corps, but they still use gzip (I think, from some description, somewhere).
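
    As a rough sketch of the gzip-versus-bzip2 trade-off described above (the input file is a placeholder for your own data):

        # Rough benchmark sketch: compare ratio and wall time for two compressors.
        # "sample.dat" stands in for whatever data you actually archive.
        import bz2, gzip, time

        data = open("sample.dat", "rb").read()
        for name, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
            start = time.perf_counter()
            size = len(compress(data))
            print(f"{name}: {size / len(data):.1%} of original, "
                  f"{time.perf_counter() - start:.1f}s")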
  • Re:Speed (Score:3, Insightful)

    by Arainach ( 906420 ) on Monday December 26, 2005 @03:20PM (#14340610)
    The article summary quoted is completely misleading. The most important graph is the final one on page 12, Compression Efficiency, where gzip is once again the obvious king. Sure, WinRK may compress decently, but it takes an eternity to do so and is impractical for everyday use, which is where tools like gzip and ARJ32 come in - incredible compression for the speed at which they operate. Besides - who really needs that last 54MB in these days of 4.7GB DVDs and 160GB hard drives?
  • by _Shorty-dammit ( 555739 ) on Monday December 26, 2005 @03:41PM (#14340720)
    haha, yeah, 7-zip isn't 'weird' at all. I like how you try to make it sound like it's just as pervasive as something like gzip, even though 7-zip's a pretty much unknown format.
  • Re:Speed (Score:5, Insightful)

    by Luuvitonen ( 896774 ) on Monday December 26, 2005 @03:52PM (#14340772)
    3 hours 47 minutes with WinRK versus 3 minutes 16 seconds with gzip. Is it really worth watching the progress bar that much longer for a file that's 200MB smaller?
  • by cbreaker ( 561297 ) on Monday December 26, 2005 @04:51PM (#14341069) Journal
    If you're familiar with Usenet, you've probably encountered PAR files from time to time. A PAR file is a parity file which can be used to reconstruct lost data. It works sort of like a RAID, but with files as the units instead of disks.

    Let's say you have a 200MB file to send. You could just send the 200MB file, with no guarantee that it will reach the destination uncorrupted. Or you could use a compression program and bring it down to 100MB; in that case, even if the first transfer failed, you could afford to send it a second time. Then we get to PAR. You compress the 200MB file into ten 10MB pieces and include 10% parity - if any one of those pieces is bad, you can reconstruct it from the parity file, for only 110MB of total transfer. PAR2 goes even further by breaking each file down into smaller units.
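
    A toy sketch of the parity idea - real PAR/PAR2 uses Reed-Solomon codes over many recovery blocks rather than a single XOR, so treat this purely as an illustration:

        # Toy parity illustration: one XOR block can rebuild any single lost piece.
        # Real PAR/PAR2 uses Reed-Solomon codes and many recovery blocks.
        from functools import reduce

        def xor(a, b):
            return bytes(x ^ y for x, y in zip(a, b))

        pieces = [b"piece-one.", b"piece-two.", b"piece-3..."]   # equal-sized pieces
        parity = reduce(xor, pieces)

        lost = pieces.pop(1)                            # lose one piece in transit
        assert reduce(xor, pieces + [parity]) == lost   # survivors + parity rebuild it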

    Besides transfer times and error correction for network transfers, compression can also increase transfer speeds to media. If you have an LTO tape drive that can only write to tape at 20MB/sec, you'll only ever get 20MB/sec. Add compression in the drive and you could theoretically get 40MB/sec to tape with 2:1 compression. That means faster backups and faster restores. On-board compression in the drive takes the load off the CPU - but even if you do the compression on the CPU, modern processors are fast enough to handle it.

    Not to mention, it takes a lot less tape to make compressed backups. I don't know what world you live in, but in mine, I don't have unlimited slots in the library and I don't want to swap tapes twice a day. Handling tapes is detrimental to their lifespan; you really want to touch them as little as possible.

    Data corruption isn't caused by compression. If it's going to happen, it'll happen regardless. While it's true that a corrupt compressed file MAY be more difficult to recover, that's not the right way to think about it. If your backups are that valuable, you make multiple copies - plain and simple.

    I can't fathom why a responsible and well informed admin would avoid compression.
  • by Grimwiz ( 28623 ) on Monday December 26, 2005 @05:05PM (#14341120) Homepage
    A suitable level of paranoia would suggest that it would be good to decompress the compressed files and verify that they produce the identical dataset. I did not see this step in the overview.
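
    Something along these lines (a sketch only; the path pairs are placeholders) would cover that verification step:

        # Verify a restore by hashing each original file and its restored copy.
        # The path pairs are placeholders for whatever was backed up.
        import hashlib
        from pathlib import Path

        def digest(path):
            return hashlib.sha256(Path(path).read_bytes()).hexdigest()

        pairs = [("data/db.dump", "restore/db.dump")]
        for original, restored in pairs:
            assert digest(original) == digest(restored), f"mismatch: {original}"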
  • by cbreaker ( 561297 ) on Monday December 26, 2005 @05:10PM (#14341141) Journal
    All I see is ads. I think I found a paragraph that looked like it may have been the article, but every other word was underlined with an ad-link, so I didn't think that was it.
  • by fbjon ( 692006 ) on Monday December 26, 2005 @05:19PM (#14341180) Homepage Journal
    I prefer DoubleSpace [wikipedia.org] for maximum file-destroying activity.
  • by cbreaker ( 561297 ) on Monday December 26, 2005 @06:12PM (#14341397) Journal
    I don't understand why you think compression automatically destroys the chance of recovery, or how encoding in ASCII is better. What's the thing about "sectors"? I never said using a compressed volume on a hard disk was a good idea. Compressed files can be recovered too, you know. If you have the forensic expertise to recover a corrupted uncompressed file, chances are you'd also be able to recover the data from a compressed one.

    The cost of media is not the only argument for compression - in fact, I didn't mention media price at all. I did mention library capacity, however, and getting an even bigger library is a far more expensive prospect than the $.75 per GB you quoted. Did you read the part of my post about speeds? If I can restore that database in half the time because of compression, that means less downtime and less money lost. (Although the money-lost factor doesn't really apply at a government institution; we're not selling anything.)

    "If you backup more than once UNCOMPRESSED, you can recover almost anything because it is VERY unlikely that a bad sector will occur in the exact same spot or even in the same file (assuming the one file does not take up most of the specific media.)"

    Wouldn't this apply to a compressed backup, too? You're assuming here that the file was unchanged in between the two backups - thus it would apply to any data, compressed or not.

    "Alternatively, use PAR files to recover - as long as you're willing to add the extra space and time - which sort of obviates the advantage of compression, doesn't it?"

    No - it simply lowers the compression ratio a bit. If you're getting 2:1 compression and add 10% pars, you're still looking at a 1.8:1 compression ratio, but with recoverability.
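
    Working through those numbers with the 200MB file from the earlier example:

        # 2:1 compression plus 10% parity still comes out around 1.8:1.
        original_mb, compressed_mb = 200, 100
        with_parity_mb = compressed_mb * 1.10           # 110 MB actually stored or sent
        print(f"{original_mb / with_parity_mb:.2f}:1")  # ~1.82:1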

    ----

    Within every IT budget, you must balance out the speed, recoverability, and cost of your backup solution.

    In your solution of never using compression (since, as you put it, no admin should use it), you lose a lot of speed in backups and restores. Speed of recovery is a key factor in many environments; it's often the top question asked when discussing new backup solutions. You mention it as an important point, yet excluding compression could double your restore times or more. Not to mention backup speeds - if you can take your backups in half the time, you effectively double the number of servers you can back up in the same window, or you reduce the amount of time servers are busy with backups.

    Recoverability is big - you want your backups to be reliable. Most of the time, any corruption is unacceptable, be it in a compressed file or not. It's either good or you throw it out and go back to the previous backup. Many IT shops are doing multiple backups these days - backup to disk first, then to tape, then take snapshots of those tapes and bring them off-site. Compressed or not, testing your backups and ensuring you have no problems with hardware is much more effective than using uncompressed backups and performing forensics on them if they're bad. Speaking of which, I don't see why compressed data would be less recoverable.

    Finally, you have cost. Yes, even when data recoverability is a key factor, you still have to consider cost. So, what makes more sense? Using uncompressed backups that back up and restore slower, cost a lot more in media and library capacity, and create more personnel overhead for swapping tapes - or using compression and cutting all of that in half? Would you rather lose all that on the off chance that MAYBE you could recover more of your data, on the off chance that NONE of your other backups are good? I don't know any responsible IT manager who would agree with you.

    A proper backup and recovery plan with periodic testing and multiple copies held on-site and off is a much more effective solution than betting on forensic recovery of uncompressed data.

    Hey, I'm not claiming that compression is always right in every situation. That's far from it.
  • Re:Speed (Score:5, Insightful)

    by MilenCent ( 219397 ) <johnwhNO@SPAMgmail.com> on Monday December 26, 2005 @07:16PM (#14341665) Homepage
    Don't you mean ads?

    The pages are shamefully loaded with ads! I could barely find the next-page links at the bottom of the window! At first, I thought a "Google Ad" link labeled "compression" might be the next page, and clicked on it! And the true link is oddly hidden in small print, in a corner beneath a large table of PriceGrabber comparison results.

    The article is basically unreadable, I'd say, due to the ads.
  • Compressing jpegs (Score:1, Insightful)

    by Anonymous Coward on Monday December 26, 2005 @08:32PM (#14341995)
    It's rather pointless to compare how well gzip and the others compress JPEGs, because JPEG data is already entropy-coded internally, so there's almost no redundancy left for a general-purpose compressor to remove.
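
    A quick way to see this - random bytes stand in here for already entropy-coded image data, since both look essentially incompressible to a general-purpose compressor:

        # Already-compressed data looks nearly random, and gzip can't shrink random bytes.
        # os.urandom stands in for the entropy-coded payload of a JPEG.
        import gzip, os

        data = os.urandom(1_000_000)
        print(len(gzip.compress(data)) / len(data))   # ~1.0, often slightly above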

    Also, for a lot of applications compression speed is not what matters - decompression speed is. If you're distributing software, it's not much of a problem if compressing takes a long time, but if the install takes ages because decompression is too slow, it does matter.
  • by dr_skipper ( 581180 ) on Monday December 26, 2005 @08:38PM (#14342016)
    This is sad. Over and over, Slashdot is posting stories that are nothing more than some lame tech review and dozens of ads. I really believe people are generating sites with crap technical content, packing them with ads, and submitting them to Slashdot hoping to win the impression/click lottery.

    Please, editors, check the sites out first. If it's 90% ads and impossible to navigate without clicking ads accidentally, it's just some loser's cash-grab site.
  • Re:Speed (Score:2, Insightful)

    by Killall -9 Bash ( 622952 ) on Monday December 26, 2005 @08:40PM (#14342025)
    If I didn't click on any ads on pages 1 through 14, will I click on one on that 15th page?
