
Large File Problems in Modern Unices

david-currie writes "Freshmeat is running an article that talks about the problems with the support for large files under some operating systems, and possible ways of dealing with these problems. It's an interesting look into some of the kinds of less obvious problems that distro-compilers have to face."
  • by CoolVibe ( 11466 ) on Sunday January 26, 2003 @10:59AM (#5161560) Journal
    The problem is nonexistent in the BSDs, which use the large-file (64-bit) versions anyway. If your OS (like Linux) doesn't use the 64-bit versions by default, you have to pass a certain -D flag. Whoopdiedoo. Not so hard. Recompile and be happy.
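
    For the curious, here is a minimal sketch of what that -D flag buys you on Linux/glibc; the file name is made up, and you would build it with something like cc -D_FILE_OFFSET_BITS=64 -o lfs_demo lfs_demo.c:

    /* Minimal large-file sketch, assuming Linux/glibc and a build with
       -D_FILE_OFFSET_BITS=64. With the flag, off_t is 64 bits and
       fseeko/ftello can address offsets past 2 GB; without it, the
       seek below fails on a 32-bit system. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(void)
    {
        FILE *f = fopen("bigfile.dat", "wb");      /* hypothetical file name */
        if (f == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }

        /* Seek 3 GiB into the file, well past the old 2^31 - 1 byte limit. */
        off_t target = (off_t)3 * 1024 * 1024 * 1024;
        if (fseeko(f, target, SEEK_SET) != 0) {
            perror("fseeko");
            return EXIT_FAILURE;
        }
        fputc('x', f);                             /* creates a sparse 3 GiB file */
        printf("now at offset %lld\n", (long long)ftello(f));

        fclose(f);
        return EXIT_SUCCESS;
    }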
  • Re:Why large files (Score:5, Interesting)

    by Anonymous Coward on Sunday January 26, 2003 @11:02AM (#5161574)
    Real analytical work can easily produce files this large. Analyses of structures with more than half a million elements and several million degrees of freedom can EASILY produce output of over two gigs. Yes, these results can and should be split, but sometimes it makes sense to keep them together as a matter of convenience. Plus, there IS a small performance hit when dealing with multiple files in most of the major FEA packages.
  • Wrong point of view. (Score:0, Interesting)

    by Krapangor ( 533950 ) on Sunday January 26, 2003 @11:08AM (#5161611) Homepage
    There is not a problem with support for large files in Unix systems; there is a problem with incompetent people using overly large files on Unix systems.
    It's an old and well-known problem that programmers and users tend to keep very large files out of laziness and logical errors.
    However, it's also an old and well-known fact that large files are bad for performance per se, for several reasons:
    • fragmentation: large files increase the fragmentation of most file systems, at least any system that uses single indexed trees/B-trees and nonlinear hashes
    • entropy pollution: large files increase the overall entropy on the hard disk, leading to worse compression ratios for backup and maintenance
    • data pollution: the use of large files tempts users to store all kinds of redundant, reducible, linear, and irrelevant data, wasting storage space and I/O time
    So I don't see why admins should provide a "work-around" for the file size limits. These limits are there for very good reasons, and in my opinion they are even much too big. You should always remember that the original K&R Unix had only 12 bits for file size storage and was much faster than modern systems; in fact, it ran on 2.2 MHz processors and 32 kB of RAM, which wouldn't be sufficient for even a Linux or Windows XP bootloader.
    Think about it.
  • Re:Why large files (Score:5, Interesting)

    by CoolVibe ( 11466 ) on Sunday January 26, 2003 @11:13AM (#5161636) Journal
    Raw video can easily exceed 2 GB in size. Why raw video? Because (as others have said) it's easier to edit. Then you encode to MPEG-2, which shrinks the size somewhat (usually still bigger than 2 GB; ever dumped a DVD to disk?), so it'll be "small" enough to burn onto a DVD or some such. Oh, editing 3 hours of raw wave data also chews away at disk space. And since you need to READ the data back from the media to see if it looks nice, you need support for those big files there as well. Right, now why don't we need files bigger than 2 GB again? Well?

    Oh, you're still not convinced? Well, look at it this way: sooner or later you are going to want to burn a DVD.

    A typical single-sided DVD-R holds around 4 GB of data (somewhat more); if you use both sides, you can get more than 8 GB onto it. That's way bigger than 2 GB, no? Now, how big must your image be before you burn it on there? Well?

    Right...

  • by Anonymous Coward on Sunday January 26, 2003 @11:16AM (#5161657)
    As others have noted, there are plenty of good reasons to have files greater than two gigs, including video editing and scientific research. The file size limits aren't there for a very good reason at all. Someone years ago had to weigh making small files take up a huge amount of room by using 64-bit addresses that would allow multi-terabyte files against using 32-bit addresses that would keep small files small and create a 2 GB file limit. At the time, it made perfect sense because nobody was using files anywhere near 2 GB... But now they are.
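
    To put rough numbers on that tradeoff, here is a quick back-of-the-envelope sketch (plain C, nothing system-specific) of where the 2 GB figure comes from:

    /* A signed 32-bit file offset tops out at 2^31 - 1 bytes (about 2 GiB);
       a signed 64-bit offset tops out at 2^63 - 1 bytes (about 8 EiB). */
    #include <stdio.h>

    int main(void)
    {
        long long max32 = 2147483647LL;           /* 2^31 - 1 */
        long long max64 = 9223372036854775807LL;  /* 2^63 - 1 */

        printf("32-bit signed offset limit: %lld bytes (%.1f GiB)\n",
               max32, (double)max32 / (1024.0 * 1024 * 1024));
        printf("64-bit signed offset limit: %lld bytes (about 8 EiB)\n", max64);
        return 0;
    }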
  • by Anonymous Coward on Sunday January 26, 2003 @11:17AM (#5161661)
    While almost all the examples given are good, I don't think anyone has mentioned complete disk images. I have recently had to do this in order to recover from a hardware issue (a drive cable failure resulted in the loss of the MBR, nasty) and on a TiVo unit that had a bad drive.

    I have most of my older system images available to inspect. The loopback devices under Linux are tailor-made for this type of thing.


    I am puzzled as to why you mention the seek times. Surely you would agree that the seek time should be only inversely geometrically related to size, the particular factors depending on the filesystem. Any deviation from the theoretical ideal is the fault of a particular OS's implementation. My experience is that this is not significant.

    (user dmanny on wife's machine, ergo posting as AC)

  • Re:Why large files (Score:3, Interesting)

    by bourne ( 539955 ) on Sunday January 26, 2003 @11:21AM (#5161679)

    Can anyone give a good reason for needing files larger than 2gb?

    Forensic analysis of disk images. And yes, from experience I can tell you that half the file tools on RedHat (like, say, Perl) aren't compiled to support >2GB files.

  • 1) Splitting up a big file turns an elegant solution into an inelegant nightmare.

    2) Instead of 10 different applications writing code to support splitting up an otherwise sound model, why not have the one operating system provide for dealing with large files?

    3) You are going to need the bigger files with all those 32-bit wchar_ts and 64-bit time_ts you've got!

  • Re:huh? (Score:2, Interesting)

    by RumpRoast ( 635348 ) on Sunday January 26, 2003 @11:44AM (#5161792)
    Actually, you changed the meaning of that sentence. I think what we really object to is:
    "It's an interesting look into some
    of the kinds of less obvious problems that distro-compilers have to face."

    "of the kinds" really adds nothing to the meaning here, nor does "have to"

    Thus we have:

    "It's an interesting look into some of the less obvious problems that distro-compilers face."

    The same sentence, but much cleaner!

    Thanks! I'll be here all week.

  • Re:Why large files (Score:3, Interesting)

    by bunratty ( 545641 ) on Sunday January 26, 2003 @11:49AM (#5161821)
    Over Christmas and New Year's, I helped my wife run a simulation of 1000 different patients for an academic pharmacokinetics paper. The run took ten days and had an input file of about 1.5 GB. If her computer were faster, or she had access to more computers, she would have wanted to simulate more patients and would easily have needed support for files larger than 4 GB. As CPUs get faster and hard disks get larger, there will be much more demand for these large files, as well as for more than 4 GB per process.
  • by Yokaze ( 70883 ) on Sunday January 26, 2003 @11:52AM (#5161843)
    I'm not a specialist in this matter, so maybe you can enlighten me as to where I am wrong or have misunderstood you.

    > fragmentation: large files increase the fragmentation of most file systems
    What kind of fragmentation?

    Small files lead to more internal fragmentation.
    Large files are more likely to consist of many fragments, but if you split the same data into small files, those files are themselves just fragments of the same data.

    >entropy pollution
    What kind of entropy? Are you speaking of compression algorithms?

    Compression ratios are actually better with large files than with many small ones, because similarities that span file boundaries can be exploited. That is why you gzip (or bzip2) a single large tar file. (Simple test: zip a bunch of files normally, then zip them with compression turned off and compress the resulting archive; compare the sizes.)

    >data pollution
    How would limiting file size improve that situation? People would simply store the data in lots of small files instead. What a success. People will waste space whether there is a file size limit or not.

    >These limits are there for very good reasons, and in my opinion they are even much too big.

    Actually, they are there for historical reasons.
    And should a DB spread all its tables over thousands of files, instead of keeping one table in one file and mmapping that single file into memory (see the sketch after this comment)? Should a raw video stream be fragmented into several files just to get around a file size limit?

    >[...] original K&R Unix [...] was much faster than modern systems

    Faster? In what respect?
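
    For what it's worth, here is a minimal sketch of the one-big-file-plus-mmap pattern mentioned above, assuming a POSIX system and a build with -D_FILE_OFFSET_BITS=64; the file name and the "record" offset are made up:

    /* Sketch: map one large table file into memory instead of splitting it
       across many small files. On a 32-bit machine the whole file may not
       fit in the address space, so a real database would map windows of it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    int main(void)
    {
        int fd = open("table.db", O_RDONLY);       /* hypothetical table file */
        if (fd < 0) { perror("open"); return EXIT_FAILURE; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return EXIT_FAILURE; }

        /* Map the whole file; the kernel pages it in on demand, so a
           multi-gigabyte table does not have to fit in RAM at once. */
        char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

        /* Touch a byte a few gigabytes in: random access with no explicit seek. */
        off_t offset = (off_t)3 * 1024 * 1024 * 1024;
        if (offset < st.st_size)
            printf("byte at 3 GiB: %d\n", data[offset]);

        munmap(data, (size_t)st.st_size);
        close(fd);
        return EXIT_SUCCESS;
    }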
  • Re:Why large files (Score:3, Interesting)

    by Zathrus ( 232140 ) on Sunday January 26, 2003 @11:53AM (#5161846) Homepage
    In my previous job we regularly processed credit data files of >2 GB. All the data is processed serially (as someone else mentioned), so seek time is not an issue (nor is it an issue in a binary data file: seek to 1.4 GB. Done. Next.).

    The real issue we ran up against was compression... we wanted to have the original and interim data files available on disk for a while in case of reprocessing. The processing would generally take up 10x as much space as the original data file, so you compressed everything. Except that gzip can't handle files >2 GB (at the time an alpha version could, but we didn't want to touch it). Nor can zip. So we had to use compress. Yay. (bzip could handle it, but was decided against by the powers that be.)

    Compression of large files is still an issue, unless you want to split them up. Unless you download a beta version, gzip still can't handle it. As I understand it, zip won't ever be able to. There are some fringe compressors that can handle large files, but, well, they're fringe.
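
    If you are stuck with tools that choke, one workaround is to call zlib directly and stream the data in chunks. A minimal sketch follows (the file names are made up; build with -D_FILE_OFFSET_BITS=64 and link with -lz; whether the result actually gets past 2 GB still depends on your program and zlib both being built with large-file support):

    /* Stream a large file through zlib's gzopen/gzwrite instead of relying
       on the gzip binary. Error handling is kept minimal for brevity. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
        FILE *in = fopen("huge.dat", "rb");        /* hypothetical >2 GB input */
        gzFile out = gzopen("huge.dat.gz", "wb");
        if (in == NULL || out == NULL) {
            fprintf(stderr, "open failed\n");
            return EXIT_FAILURE;
        }

        char buf[64 * 1024];
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
            if (gzwrite(out, buf, (unsigned)n) == 0) {   /* 0 means error */
                fprintf(stderr, "gzwrite failed\n");
                return EXIT_FAILURE;
            }
        }

        gzclose(out);
        fclose(in);
        return EXIT_SUCCESS;
    }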
  • by kasperd ( 592156 ) on Sunday January 26, 2003 @12:00PM (#5161881) Homepage Journal
    I sure hope that was a joke, because otherwise it would be one of the most clueless comments I have seen.

    Sure, splitting data into a lot of smaller files is going to reduce fragmentation slightly, but it is not going to improve your performance, because the cost of accessing many different files is going to be higher than the cost of the fragmentation.

    In the next two arguments you managed to make two opposite statements, both incorrect. That is actually quite impressive.

    First you say large files increase the entropy of the data stored on the disk, which is wrong as long as you compare against the same data stored in different files. Of course, if the number of files on the disk is held constant, smaller files will lead to less entropy, but most people actually want to store some data on their disks.

    Then you say large files are highly redundant, which is the opposite of having high entropy as claimed in your previous argument. In reality, redundancy does not tend to increase with file size, though it may of course depend on the format of the file.

    All in all, you are saying that people shouldn't store much data on their disks, and the little data they do store should be as compact as possible, while still allowing it to be compressed even further when doing backups. You might as well have said people shouldn't use their disks at all.

    Finally, claiming older Unix versions were faster is ridiculous; first of all, they ran on different hardware, and on that hardware they were surely slower than today's systems. Even if you managed to port an ancient Unix version to modern hardware, I'm sure it wouldn't beat modern systems at today's tasks. Which DVD player would you suggest for K&R Unix?

  • Error Prevention (Score:3, Interesting)

    by Veteran ( 203989 ) on Sunday January 26, 2003 @12:13PM (#5161941)
    One of the ways to keep errors from creeping into programs is to put limits on things so high that you can never reach them in the practical world.

    The 31-bit limit on time_t overflows in this century (in 2038); 63 bits outlasts the probable life of the Universe, so it is unlikely to run into trouble.

    That is the best argument I know for a 64-bit file size; in the long run it is one less thing to worry about.
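
    A quick worked version of that arithmetic (plain C, nothing system-specific):

    /* 2^31 seconds past 1970 lands in January 2038, while 2^63 seconds
       is on the order of 3 * 10^11 years. */
    #include <stdio.h>

    int main(void)
    {
        double secs_per_year = 365.25 * 24 * 3600;   /* ~3.16e7 seconds */
        double t31 = 2147483648.0;                   /* 2^31 */
        double t63 = 9223372036854775808.0;          /* 2^63 */

        printf("2^31 s = %.1f years -> 1970 + 68 = ~2038\n", t31 / secs_per_year);
        printf("2^63 s = %.2e years\n", t63 / secs_per_year);
        return 0;
    }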
  • by haggar ( 72771 ) on Sunday January 26, 2003 @12:23PM (#5162003) Homepage Journal
    I had a problem with HP-UX apparently not wanting to transfer files larger than 2 GB via NFS (when the NFS server is on HP-UX 11.0). I had to back up a Solaris computer's hard disk using dd across NFS. This usually works when the NFS server is Solaris. However, last Friday it failed when the server was set up on HP-UX. I had to resort to my little Blade 100 as the NFS server, and I had no problems with it.

    I noticed that on the SAME DAY some folks asked questions about the 2 GB file size limit in HP-UX on comp.sys.hp.hpux!! Apparently, HP-UX's default tar and cpio don't support files over 2 GB either. Not even in HP-UX 11i. I never thought HP-UX stank this bad...

    How does Linux on x86 stack up? I decided not to use it for this backup, since I had my Blade 100, but would it have worked? Oh, by the way, does Linux finally have a command like "share" (which exists in Solaris) to share directories via NFS, or do I still need to edit /etc/exports and then restart the NFS daemon (or send it a SIGHUP)?
  • by mauriceh ( 3721 ) <mhilarius&gmail,com> on Sunday January 26, 2003 @01:45PM (#5162431)
    A much bigger problem is that Linux filesystems have a capacity limit of 2 TB.
    Many servers now have a physical capacity of over 2 TB on a single filesystem storage device.
    Unfortunately, this is still a very significant limitation.
    This problem is much more commonly encountered than file size limits.
  • Re:Error Prevention (Score:3, Interesting)

    by Thing 1 ( 178996 ) on Sunday January 26, 2003 @04:44PM (#5163383) Journal
    One of the ways to keep errors from creeping into programs is to put limits on things so high that you can never reach them in the practical world.
    Anyone ever thought of a variable-bit filesystem?

    Start with 64-bit, but make it 63-bit. If the 64th bit is on, then there's another 64-bit value following which is prepended to the value (making it a 126-bit address -- again, reserve one bit for another 64-bit descriptor).

    Chances are it won't ever need the additional descriptors, since 64 bits is a lot, but it would solve the problem once and for all.
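
    A rough sketch of the continuation-bit encoding described above: 63 data bits per 64-bit word, with the top bit meaning "another, higher-order word follows". This is not any real filesystem's on-disk format, and it uses GCC/Clang's unsigned __int128 purely for the demo:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode v into words[]; returns the number of 64-bit words used. */
    static int encode_size(unsigned __int128 v, uint64_t words[3])
    {
        int n = 0;
        do {
            words[n] = (uint64_t)(v & 0x7FFFFFFFFFFFFFFFULL);  /* low 63 bits */
            v >>= 63;
            if (v != 0)
                words[n] |= 1ULL << 63;    /* continuation: more bits follow */
            n++;
        } while (v != 0 && n < 3);
        return n;
    }

    static unsigned __int128 decode_size(const uint64_t *words, int n)
    {
        unsigned __int128 v = 0;
        for (int i = n - 1; i >= 0; i--)   /* later words hold higher-order bits */
            v = (v << 63) | (words[i] & 0x7FFFFFFFFFFFFFFFULL);
        return v;
    }

    int main(void)
    {
        uint64_t w[3];
        unsigned __int128 size = (unsigned __int128)1 << 70;  /* a "file" bigger than 2^64 bytes */
        int n = encode_size(size, w);
        printf("encoded in %d word(s); round-trips: %s\n",
               n, decode_size(w, n) == size ? "yes" : "no");
        return 0;
    }

    A size below 2^63 still fits in a single word with the top bit clear, so the common case costs nothing extra.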
