Large File Problems in Modern Unices
david-currie writes "Freshmeat is running an article that talks about the problems with the support for large files under some operating systems, and possible ways of dealing with these problems. It's an interesting look into some of the kinds of less obvious problems that distro-compilers have to face."
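For context, here is a minimal sketch (not code from the article) of what "large file support" means at the source level on a 32-bit Linux/glibc system: compiling with _FILE_OFFSET_BITS=64 makes off_t 64 bits wide, so a program can seek and write past the 2 GB mark. The file name is made up.

    /* lfs_demo.c -- minimal sketch: with large-file support enabled,
     * a 32-bit program can seek past the old 2 GB (2^31) boundary.
     * Build: gcc -D_FILE_OFFSET_BITS=64 lfs_demo.c -o lfs_demo
     * (Without that define, off_t is only 32 bits on a 32-bit system
     * and this offset does not even fit in the type.)
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("big.dat", O_RDWR | O_CREAT, 0644);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        off_t target = (off_t)1 << 31;                       /* 2 GiB */
        if (lseek(fd, target, SEEK_SET) == (off_t)-1) {
            perror("lseek");
            return 1;
        }
        /* Writing one byte here leaves a sparse file just over 2 GB. */
        if (write(fd, "x", 1) != 1) { perror("write"); return 1; }

        close(fd);
        return 0;
    }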
Not really that groundbreaking... (Score:5, Interesting)
Re:Why large files (Score:5, Interesting)
Wrong point of view. (Score:0, Interesting)
It's an old and well-known problem that programmers and users tend to keep very large files out of laziness and logical errors.
However, it's also an old and well-known fact that large files are bad for performance per se, for several reasons:
Think about it.
Re:Why large files (Score:5, Interesting)
Oh, you're still not convinced? Well, look at it this way: when, in the future, will you ever need to burn a DVD?
Well? A typical single-sided DVD-R holds around 4 GB of data (somewhat more); if you use both sides, you can get more than 8 GB of data on it. That's way bigger than 2 GB, no? Now, how big must your image be before you burn it there? Well?
Right...
Re:Wrong point of view. (Score:1, Interesting)
Q: Why large files? A: Disk images too (Score:2, Interesting)
I have almost all of my older system images available to inspect. The loopback devices under Linux are tailor-made for this type of thing.
I am puzzled as to why you mention seek times. Surely you would agree that seek time should grow only logarithmically with file size, the particular factors depending on the filesystem. Any deviation from that theoretical ideal is the fault of a particular OS's implementation. In my experience the effect is not significant.
(user dmanny on wife's machine, ergo posting as AC)
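As an illustration of the parent's use case, here is a minimal sketch, assuming a Linux system built with large-file support, of creating a multi-gigabyte sparse image file that could afterwards be attached to a loop device (losetup or mount -o loop). The file name and size are made up.

    /* make_image.c -- sketch: create a 4 GiB sparse image file suitable
     * for formatting and loop-mounting later.
     * Build: gcc -D_FILE_OFFSET_BITS=64 make_image.c -o make_image
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        const off_t image_size = (off_t)4 * 1024 * 1024 * 1024;   /* 4 GiB */

        int fd = open("disk.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* ftruncate extends the file to the full image size without
         * allocating data blocks, so the image stays sparse until
         * something (mkfs, a guest OS, ...) fills it in. */
        if (ftruncate(fd, image_size) != 0) { perror("ftruncate"); return 1; }

        close(fd);
        return 0;
    }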
Re:Why large files (Score:3, Interesting)
Can anyone give a good reason for needing files larger than 2 GB?
Forensic analysis of disk images. And yes, from experience I can tell you that half the file tools on RedHat (like, say, Perl) aren't compiled to support >2GB files.
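A minimal sketch of the failure mode described above: on a 32-bit system, a tool built without large-file support cannot even open() a file larger than 2 GB; the call fails with EOVERFLOW. This is illustrative only, not the actual code of any Red Hat tool.

    /* check_lfs.c -- sketch of how a non-LFS binary fails on a >2 GB file.
     * Build *without* -D_FILE_OFFSET_BITS=64 on a 32-bit system to see it.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 2;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            if (errno == EOVERFLOW)
                fprintf(stderr, "%s: bigger than 2 GB and this binary lacks LFS\n", argv[1]);
            else
                perror("open");
            return 1;
        }
        printf("%s opened fine\n", argv[1]);
        close(fd);
        return 0;
    }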
The O/S should do it and do it well. (Score:3, Interesting)
2) Instead of 10 different applications each writing code to split up an otherwise sound model, why not have 1 operating system make provisions for dealing with large files?
3) You are going to need the bigger files with all those 32-bit wchar_ts and 64-bit time_ts you've got!
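As a quick probe of the type widths being discussed, here is a tiny sketch that prints how wide off_t, time_t, and wchar_t are for a given build; the answers vary with platform and compile flags.

    /* type_sizes.c -- print the widths of the relevant types.
     * On 32-bit glibc, off_t stays 32 bits unless the build asks for more
     * (-D_FILE_OFFSET_BITS=64); time_t and wchar_t depend on the platform.
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <time.h>
    #include <wchar.h>

    int main(void)
    {
        printf("off_t   : %zu bits\n", sizeof(off_t)   * 8);
        printf("time_t  : %zu bits\n", sizeof(time_t)  * 8);
        printf("wchar_t : %zu bits\n", sizeof(wchar_t) * 8);
        return 0;
    }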
Re:huh? (Score:2, Interesting)
"of the kinds" really adds nothing to the meaning here, nor does "have to"
Thus we have: "It's an interesting look into some of the less obvious problems that distro-compilers face."
The same sentence, but much cleaner!
Thanks! I'll be here all week.
Re:Why large files (Score:3, Interesting)
Re:Wrong point of view. (Score:5, Interesting)
> fragmentation: large files increase the fragmentation of most file systems
What kind of fragmentation?
Small files lead to more internal fragmentation.
Large files are more likely to consist of many fragments, but if you split the same data into small files, those files are themselves just fragments of that data.
>entropy pollution
What kind of entropy? Are you speaking of compression algorithms?
Compression ratios are actually better with large files than with small files, because similarities across file boundaries can be found. That is why gzip (or bzip2) compresses a single large tar file. (Simple test: try zip on many files, and then zip without compression followed by compression of the resulting archive.)
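A rough sketch of that test in C, using zlib's DEFLATE (the same algorithm gzip and zip use): two buffers with overlapping content compress smaller as one stream than as two separate streams. The buffers and sizes are toy values invented for illustration; link with -lz.

    /* xfile_compress.c -- cross-file-boundary compression demo (toy data). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <zlib.h>

    /* Return the compressed size of buf[0..len), or 0 on error. */
    static uLong compressed_size(const Bytef *buf, uLong len)
    {
        uLongf out_len = compressBound(len);
        Bytef *out = malloc(out_len);
        if (!out || compress(out, &out_len, buf, len) != Z_OK)
            out_len = 0;
        free(out);
        return out_len;
    }

    int main(void)
    {
        enum { N = 64 * 1024 };
        static Bytef a[N], b[N], both[2 * N];

        /* Two "files" with heavily overlapping content. */
        for (int i = 0; i < N; i++)
            a[i] = b[i] = (Bytef)("large files\n"[i % 12]);
        memcpy(both, a, N);
        memcpy(both + N, b, N);

        uLong separate = compressed_size(a, N) + compressed_size(b, N);
        uLong together = compressed_size(both, 2 * N);

        printf("compressed separately: %lu bytes\n", (unsigned long)separate);
        printf("compressed as one    : %lu bytes\n", (unsigned long)together);
        return 0;
    }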
>data pollution
How should limiting file size improve that situation? Then people just tend to store the data in lots of small files. What a success. People will waste space whether there is a file size limit or not.
>These limits are there for very good reasons and in my opinion they are even much too big.
Actually, they are there for historical reasons.
And should a DB spread all its tables over thousands of files instead of having only one table in one file and mmapping this single file into memory? Should a raw video stream be fragmented into several files to circumvent a file limit?
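A minimal sketch of that "one table file, mmap it" pattern, assuming Linux/glibc with large-file support compiled in; table.db is a hypothetical file name, and on a 32-bit machine the mapping is still bounded by address space even when the file size is not.

    /* db_mmap.c -- map one big table file into memory.
     * Build: gcc -D_FILE_OFFSET_BITS=64 db_mmap.c -o db_mmap
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("table.db", O_RDONLY);        /* hypothetical table file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Map the whole table read-only; the kernel pages it in on demand,
         * so even a multi-gigabyte file need not fit in RAM at once. */
        void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        printf("mapped %lld bytes at %p\n", (long long)st.st_size, base);

        munmap(base, (size_t)st.st_size);
        close(fd);
        return 0;
    }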
>[...] original K&R Unix [...] was much faster than modern systems
Faster? In what respect?
Re:Why large files (Score:3, Interesting)
The real issue we ran up against was compression... we wanted to have the original and interim data files available on disk for a while in case of reprocessing. The processing would generally take up 10x as much space as the original data file, so you compressed everything. Except that gzip can't handle files >2GB (at the time an Alpha could, but we didn't want to touch it). Nor can zip. So we had to use compress. Yay. (bzip could handle it, but was decided against by the powers that be.)
Compression of large files is still an issue, unless you want to split them up. Unless you download a beta version, gzip still can't handle it. As I understand it, zip won't ever be able to do it. There are some fringe compressors that can handle large files, but, well, they're fringe.
Re:Wrong point of view. (Score:3, Interesting)
Sure, splitting the data into a lot of smaller files is going to reduce the fragmentation slightly, but it is not going to improve your performance, because the price of accessing many different files is going to be higher than the price of the fragmentation.
In the next two arguments you managed to make two opposite statements both incorrect. That is actually quite impressive.
First you say large files increase the entropy of the data stored on the disk. That is wrong as long as you compare with the same data stored in different files. Of course, if the number of files on the disk is held constant, smaller files will lead to less entropy, but most people actually want to store some data on their disks.
Then you say large files are highly redundant, which is the opposite of having high entropy as claimed in your previous argument. And in reality the redundancy does not tend to increase with file size, though it might of course depend on the format of the file.
All in all you are saying that people shouldn't store much data on their disks, and the little data they do store should be as compact as possible, while still allowing it to be compressed even further when doing backups. You might as well have said people shouldn't use their disks at all.
Finally, claiming older Unix versions were faster is ridiculous; first of all, they ran on different hardware, and surely on that hardware they were slower than today's systems. And even if you managed to port an ancient Unix version to modern hardware, I'm sure it wouldn't beat modern systems at today's tasks. Which DVD player would you suggest for K&R Unix?
Error Prevention (Score:3, Interesting)
The 31-bit limit on time_t overflows this century (in 2038); 63 bits outlasts the probable life of the Universe, so it is unlikely to run into trouble.
That is the best argument I know for a 64 bit file size; in the long run it is one less thing to worry about.
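The arithmetic behind that, as a tiny sketch (using 365.25-day years, so the figures are approximate):

    /* time_limits.c -- how long a signed 31-bit vs. 63-bit second counter
     * from 1970 lasts. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const double seconds_per_year = 365.25 * 24 * 60 * 60;

        double years_31 = (double)INT32_MAX / seconds_per_year;   /* ~68 years */
        double years_63 = (double)INT64_MAX / seconds_per_year;   /* ~2.9e11 years */

        printf("31-bit time_t: ~%.0f years after 1970 (i.e. year %d)\n",
               years_31, 1970 + (int)years_31);
        printf("63-bit time_t: ~%.2e years\n", years_63);
        return 0;
    }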
I can't believe this...superSynchronicity??? (Score:3, Interesting)
I have noticed that on the SAME DAY some folks have asked questions about the 2 GB file size limit in HP-UX on comp.sys.hp.hpux!! Apparently, HP-UX's default tar and cpio don't support files over 2 GB, either. Not even in HP-UX 11i. I never thought HP-UX stank this bad...
How does Linux on x86 stack up? I decided not to use it for this backup, since I had my Blade 100, but would it have worked? Oh, btw, is there finally a command implemented on Linux like "share" (exists in Solaris) to share directories via NFS, or do I still need to edit /etc/exports by hand?
Lack of large filesystems more of a problem (Score:3, Interesting)
Many servers now have over 2 TB of physical capacity behind a single filesystem storage device. Unfortunately, filesystem size is still a very significant limitation, and this problem is much more commonly encountered than file size limitations.
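For what it's worth, one common source of a 2 TB ceiling is a 32-bit count of 512-byte sectors, as used by the MBR partition table and older block layers; whether that is the exact limit the parent hit is an assumption. The arithmetic:

    /* two_tb.c -- where a 2 TB ceiling typically comes from:
     * 2^32 sectors of 512 bytes each is exactly 2 TiB. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t max_sectors = UINT64_C(1) << 32;   /* 32-bit sector number */
        uint64_t sector_size = 512;                 /* bytes */
        uint64_t limit = max_sectors * sector_size;

        printf("limit = %llu bytes = %llu GiB\n",
               (unsigned long long)limit,
               (unsigned long long)(limit >> 30));
        return 0;
    }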
Re:Error Prevention (Score:3, Interesting)
Start with 64-bit, but make it 63-bit. If the 64th bit is on, then there's another 64-bit value following which is prepended to the value (making it a 126-bit address -- again, reserve one bit for another 64-bit descriptor).
Chances are it won't ever need the additional descriptors, since 64 bits is a lot, but it would solve the problem once and for all.
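A sketch of how such an extensible size field might be encoded; the layout, macro names, and helper function here are guesses at the poster's idea rather than any existing on-disk format. Each 64-bit word carries 63 bits of payload, and its top bit says whether another word of higher-order bits follows.

    /* extensible_size.c -- sketch of a 63-bits-per-word extensible size field. */
    #include <stdio.h>
    #include <stdint.h>

    #define CONT_BIT  (UINT64_C(1) << 63)
    #define PAYLOAD   (~CONT_BIT)               /* low 63 bits */

    /* Encode a (up to 126-bit) value given as hi:lo into words[];
     * returns how many 64-bit words were used. */
    static int encode(uint64_t value_lo, uint64_t value_hi, uint64_t words[2])
    {
        if (value_hi == 0 && (value_lo & CONT_BIT) == 0) {
            words[0] = value_lo;                          /* fits in 63 bits */
            return 1;
        }
        words[0] = (value_lo & PAYLOAD) | CONT_BIT;       /* low 63 bits + flag */
        words[1] = (value_hi << 1) | (value_lo >> 63);    /* next 63 bits */
        return 2;
    }

    int main(void)
    {
        uint64_t words[2];
        int n = encode(UINT64_C(3) << 62, 0, words);      /* needs two words */
        printf("encoded in %d word(s): %016llx %016llx\n", n,
               (unsigned long long)words[0],
               n > 1 ? (unsigned long long)words[1] : 0ULL);
        return 0;
    }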