Use BitTorrent To Verify, Clean Up Files
jweatherley writes "I found a new (for me at least) use for BitTorrent. I had been trying to download beta 4 of the iPhone SDK for the last few days. First I downloaded the 1.5GB file from Apple's site. The download completed, but the disk image would not verify. I tried to install it anyway, but it fell over on the gcc4.2 package. Many things are cheap in India, but bandwidth is not one of them. I can't just download files > 1GB without worrying about reaching my monthly cap, and there are Doctor Who episodes to be watched. Fortunately we have uncapped hours in the night, so I downloaded it again. md5sum confirmed that the disk image differed from the previous one, but it still wouldn't verify, and fell over on gcc4.2 once more. Damn." That's not the end of the story, though — read on for a quick description of how BitTorrent saved the day in jweatherley's case.
jweatherley continues: "I wasn't having much success with Apple, so I headed off to the resurgent Demonoid. Sure enough they had a torrent of the SDK. I was going to set it up to download during the uncapped night hours, but then I had an idea. BitTorrent would be able to identify the bad chunks in the disk image I had downloaded from Apple, so I replaced the placeholder file that Azureus had created with the corrupt SDK disk image, then reimported the torrent file. Sure enough it checked the file and declared it 99.7% complete. A few minutes later I had a valid disk image and installed the SDK. Verification and repair of corrupt files is a new use of BitTorrent for me; I thought I would share a useful way of repairing large, corrupt, but widely available files."
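For the curious, the check described above boils down to hashing the downloaded file piece by piece and comparing against the SHA-1 digests stored in the .torrent's info dictionary. A minimal Python sketch of that idea (single-file torrents only; the file names at the bottom are placeholders, not the actual SDK torrent):

    import hashlib

    def bdecode(data, i=0):
        """Minimal bencode decoder: returns (value, index just past it)."""
        c = data[i:i+1]
        if c == b'i':                              # integer: i<digits>e
            j = data.index(b'e', i)
            return int(data[i+1:j]), j + 1
        if c == b'l':                              # list: l ... e
            out, i = [], i + 1
            while data[i:i+1] != b'e':
                v, i = bdecode(data, i)
                out.append(v)
            return out, i + 1
        if c == b'd':                              # dictionary: d ... e
            out, i = {}, i + 1
            while data[i:i+1] != b'e':
                k, i = bdecode(data, i)
                v, i = bdecode(data, i)
                out[k] = v
            return out, i + 1
        j = data.index(b':', i)                    # byte string: <length>:<bytes>
        n = int(data[i:j])
        return data[j+1:j+1+n], j + 1 + n

    def check_pieces(torrent_path, file_path):
        """Report how many pieces of file_path match the torrent's piece hashes."""
        with open(torrent_path, 'rb') as f:
            meta, _ = bdecode(f.read())
        info = meta[b'info']
        piece_len = info[b'piece length']
        hashes = info[b'pieces']                   # concatenated 20-byte SHA-1 digests
        good = total = 0
        with open(file_path, 'rb') as f:
            for off in range(0, len(hashes), 20):
                piece = f.read(piece_len)
                total += 1
                if hashlib.sha1(piece).digest() == hashes[off:off+20]:
                    good += 1
        print("%d/%d pieces valid (%.1f%%)" % (good, total, 100.0 * good / total))

    # check_pieces("sdk.torrent", "sdk.dmg")       # placeholder names

A real client performs the same check and then fetches only the pieces that fail, which is exactly why the recheck reported 99.7% complete.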
Anonymous Coward (Score:3, Informative)
It's like getting parity files over on usenet to fix that damned
Scheduling (Score:4, Informative)
Been using bittorrent and rsync for this for years (Score:5, Informative)
Especially with our slow links in Australia, or worse yet, on dialup (if I go back far enough).
Before bittorrent I would use rsync. That required me to download the large file to a server in the US on a fast connection, then rsync my local copy against the server's copy to fix whatever was corrupt in my copy.
It works beautifully.
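(For the curious, the rsync leg is just a delta transfer against the known-good copy: something along the lines of rsync -vP --inplace fastserver:/pub/big.iso big.iso, with the host and path made up here, re-fetches only the blocks whose rolling checksums disagree instead of the whole file.)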
Re:Nice (Score:5, Informative)
Ok, you load torrentB in your favorite Bittorrent client and start it up. It will automatically create 0-sized files with the names in filesetB (at least, all the clients I know of do that). Stop the transfer of torrentB, and substitute the 0-sized files in filesetB with the corresponding files in filesetA (this may require some renaming). When you restart torrentB, your Bittorrent client will recheck the whole of filesetB, keeping the valid parts so it doesn't have to download them again. Voilà! You have migrated files from one torrent to another.
Note: You should make sure that the files you are substituting in are the same files you want to download through torrentB or, at least, keep a copy around until you see that the restart check accepts most of their contents.
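If anyone wants to script the substitution step, a rough Python sketch is below; the directory names are made up, and it only copies files whose names already match, so anything that needs renaming is still a manual job before you restart torrentB:

    import os, shutil

    SRC = "/downloads/filesetA"      # data from the old torrent (path made up)
    DST = "/downloads/filesetB"      # placeholders created by the client (path made up)

    for name in os.listdir(DST):
        candidate = os.path.join(SRC, name)
        if os.path.isfile(candidate):
            shutil.copy2(candidate, os.path.join(DST, name))   # overwrite the 0-sized placeholder
            print("replaced placeholder:", name)
        else:
            print("no match for", name, "- rename by hand if it exists under another name")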
Re:Any MD5s on Apple's page? (Score:3, Informative)
Re:What broken software were you using? (Score:5, Informative)
Re:What broken software were you using? (Score:3, Informative)
I'd bet $100 that if he did the same download over HTTPS, thus preventing any software in the path from meddling with the packet contents, it would come out perfect.
Re:Scheduling (Score:3, Informative)
Re:Hardware Failure is your bigger concern (Score:4, Informative)
While I agree that bad RAM is most likely the issue, it's still possible that bad RAM in a router, or something goofy going on in one (such as the firmware bug described), could have caused problems. The bits were mangled before they were written to the disk, and they could have been mangled by anything that processed them on the way from Apple's website to his HD, including Apple's web server and the HD itself. The fact that embedded devices tend to be more reliable does not mean they never break or do weird things.
Re:What broken software were you using? (Score:4, Informative)
Oh, and TCP checksumming isn't perfect.
Re:What broken software were you using? (Score:3, Informative)
Re:the new dr. who sucks... (Score:3, Informative)
Re:What broken software were you using? (Score:5, Informative)
I actually saw this happen once ... the astronomically unlikely [1] actually happened: TCP accepted the corrupt packet. I'm sure it will never happen again. Fortunately, rsync caught it on the next run.
One problem I ran into once with a certain Intel NIC was that a certain data pattern was always being corrupted. TCP always caught it and dropped the packet, and there was no progress beyond that point because the hardware defect always corrupted that same data pattern. It turned out to be a run of zeros followed by a certain data byte (I tried a different data byte and different run lengths, and those never got corrupted). What the NIC did was drop 4 bytes and put 4 bytes of garbage at the end; I suspect it was a clock synchronization error. I got around the problem by adding the -z option to rsync (which I normally would not have done with an ISO of mostly compressed files). Another way would have been to do the rsync through ssh, either as a session agent (as rsync itself can do) or over a forwarded port (how I do it now for a lot of things).
[1] ... approximately 1 in 2^31-1 chance that the TCP checksum will happen to match when the data is wrong (varying depending on what causes the error in the first place) ... which approaches astronomically unlikely. Take 1 terabyte of random bits and calculate the CRC-32 checksum for each 256-byte block: that is 2^32 checksums drawn from only 2^32 possible CRC-32 values, so when you sort them you will find 2 (or more) data blocks with the same checksum (or a repeating pattern in your RNG). In fact the birthday bound says you should expect your first collision after only a hundred thousand or so blocks.
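You don't even need the terabyte to see a collision; a throwaway Python loop along these lines finds a repeated CRC-32 after on the order of 80,000 random blocks, in a couple of seconds:

    import os, zlib

    seen = {}                            # crc -> first block that produced it
    tries = 0
    while True:
        block = os.urandom(256)          # a random 256-byte block
        crc = zlib.crc32(block)
        tries += 1
        if crc in seen and seen[crc] != block:
            print("collision after %d blocks (crc 0x%08x)" % (tries, crc))
            break
        seen.setdefault(crc, block)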
Agreed. Since it is, at the very least, software's responsibility to detect and fix the problem when it happens, the famous finger of fault points at the software.
Your $100 is safe.
Re:!new (Score:5, Informative)
First of all, scene releases are _never_ compressed; the rar is always created with the store setting (-m0), which makes it basically equivalent to the unix split program. If a file is to be compressed, it is done with a zip archive, and the zip archive is placed inside the rar archive. This is because plain rar archives can be created/extracted easily with FOSS software, but rar compression/decompression cannot easily be done that way. This was more of an issue before Alexander Roshal released source code (note: not FOSS) for decompressing rar archives.
Second, people often have partial or complete scene releases on hand and are unwilling to unrar them (often because they sit on an intermediary, like a shell account somewhere where the law isn't a problem).
Third, people follow "the scene" and try to download the exact releases that are chosen by the social customs of the scene (I am not going to detail those here); thus, "breaking up" (i.e., altering) the original scene release is seen as rude.
Fourth, the archives are split at precise sizes so that fitting them onto physical media works better; typically the part size is a rough factor of 698 MB (CD-R), ~4698 MB (DVD) or ~8500 MB (dual-layer DVD).
Fifth, archives are split due to poor data integrity on some transfer protocols (though this is largely historical nowadays); redownloading a corrupted 14.3 MB archive is easier than redownloading a 350 MB file.
Sixth, traffic on this scale is measured in terabytes, with some releases being tens, or sometimes hundreds, of gigabytes in size. Thus there are efficiency arguments for archive splitting: effective use of connections, limited efficiency of software (sftp scales remarkably poorly, though that is beginning to change - not that sftp is used everywhere), use of multiple coordinated machines and so on. This is an incomplete list of reasons; it is almost as though every time a new challenge is presented to the scene, splitting in some way helps to solve it.
AC because I'm not stupid enough to expose my knowledge of this either to law enforcement, or to the scene (who might just hand me over for telling you this - it has been done). Suffice to say that this is more complex than you understand, and that even this level of incomplete explanation is rare.
Re:What broken software were you using? (Score:4, Informative)
Re:Nice (Score:2, Informative)
We still use them, on usenet anyways.
Re:Nice (Score:2, Informative)
For example, if you are missing a total of 3 blocks (one block from 3 different files) you only need to download a very small par2 file that says "+3 blocks" and it will repair the three missing blocks. Of course, if you are missing a lot more data, even entire files, you can get several of the larger "+128" par files and it'll repair everything (assuming there is enough parity data). Often you can even request additional parity blocks, but that's only necessary if you have a *really* crappy nntp provider.
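(For anyone who has not used them: with the par2cmdline tool the workflow is roughly "par2 create -r10 backup.par2 bigfile.iso" to generate about 10% parity, "par2 verify backup.par2" to check, and "par2 repair backup.par2" to rebuild missing or damaged blocks - the file names and redundancy level here are only illustrative.)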
The first rule (Score:5, Informative)
Re:Nice (Score:3, Informative)
With multipart binary posts, a single file is split up across many posts - between fifteen and fifty, let's say. It's common for usenet providers to miss some of the posts, so folks are sometimes left with incomplete/corrupt files. Enter the small, spanned archive formats. It's quite common to see up to 10% parity per usenet posting, especially for large files. Small split-set sizes make for easy reposting, as well.
In regard to the grandparent, this likely relates to why said torrents are healthier: folks can bypass the leeching process and go straight to seeding. The only other means of really sharing on such binary groups would be posting (or reposting) stuff for folks, and due to ever-shrinking server retention, a lot of the binary groups look down on heavy posting.
I gave up on usenet years ago, though; well, not really. My ISP gave up on it, and I was too much of a bum to pay someone for decent service. I would encourage anyone to check if their ISP offers usenet access if they're into P2P and don't like the "2P" part that much.
Chance of CRC clashes is much higher (Score:2, Informative)
First, as rdebath argues, you only get a 16-bit checksum on TCP segments.
And furthermore, if you start calculating checksums of random data, the odds are better than 50% that you will get a collision (two chunks of data with the same checksum) within a few hundred tries (this is known as the "birthday paradox" in cryptography). Of course, to be absolutely sure of a collision you would need to try 65537 values; but you reach a very high probability of a clash much sooner than intuition suggests.
See birthday attack [wikipedia.org] for the math.
Re:Nice (Score:4, Informative)
Besides that, there is the information-theory problem too. If the hash is 128 bits long then, on average, one file in every 2^128 will share any given hash value. This might seem negligible if you are only comparing a few files (such as all the files ever created by man), but against the roughly 2^8000000 candidate files (every possible file of about a megabyte) we were going to hash, it is actually quite substantial: that works out to about 2^7999872 colliding files per hash value.
Re:!new (Score:5, Informative)
Re:Nice (Score:2, Informative)
A checksum is not unique.
A 32 MB file is 2^28 bits long, so there are 2^(2^28) possible such files - and on average 2^(2^28 - 32) of them, an astronomically large number, produce any given 32-bit checksum.
Using "given" data to help narrow the search is a bad idea as well - there is no guarantee that the given data is correct, unless you have individual checksums for them. Bittorrent does do checksumming on each individual chunk (I believe), so you could narrow your search space to the size of the incomplete and missing chunks only. The existing data in incomplete chunks would be almost useless, since you don't know if that data is correct. But you can start your search assuming it's correct, (it probably is, mostly) and speed things up.
But the bottom line is that checksums are smaller than the data they verify. Much smaller. Consider a simple example of a 2-bit checksum on an 8-bit chunk of data. Our checksum simply counts the ones and rolls over (keeping only the low two bits).
00000000 : 00 (0 ones)
00011010 : 11 (3 ones)
10110111 : 10 (6 ones - 110 is truncated to 10)
On average there are 64 ways to get any particular checksum.
2^8 (possible data values) / 2^2 (possible checksum values) = 2^6 = 64. And that's with 25% of our data length duplicated in checksums.
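A few lines of Python make that counting concrete (the "checksum" here is the toy popcount scheme from the example above, not anything standard):

    from collections import Counter

    def toy_checksum(byte):
        """Count the 1 bits in an 8-bit value and keep only the low 2 bits."""
        return bin(byte).count("1") & 0b11

    counts = Counter(toy_checksum(b) for b in range(256))
    print(counts)
    # 256 inputs / 4 checksum values = 64 collisions per value on average
    # (this particular checksum isn't perfectly uniform: 72, 64, 56 and 64)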
A checksum is a check. It is not a guarantee nor is it a blueprint from which you can reconstruct the original data. In certain cases it would be feasible - if you're downloading a thesis about Skittles, and it's corrupted, you could perform a brute force search (like you described) on the (small - it's just text) data, and then sort the matches by # of times "Skittles" is present, then by the % of data that is ASCII. You then hand verify the top 20 results or so, and you'll probably have it.
The same could theoretically be applied to AVIs by enforcing the AVI frame structure (throw out checksum matches that don't generate valid AVI files), attempting to grab audio out of the generated files, and then doing a frequency analysis of the audio - rank the results in terms of % of audio that falls within normal listening ranges (since it's almost guaranteed that audio in an AVI will be compressed in a lossy format).
You could do analysis of the video frames and such too. But the bottom line is that it's a HUGE undertaking - just redownload the damned thing, pay for it, or write your own paper. If it's vital data then go ahead, brute force it and waste your life away.
PAR2 files are neat - they give you chunkettes of parity data at different offsets. This allows you to potentially patch holes in data and reconstruct the original files. WinRAR (and other programs) do give you the option to create a recovery record that's placed in the original RAR (or whatever format) files. The problem is that you then have to download the recovery data. With PAR files, you don't download them unless you need them. The downside is that availability then becomes a problem.