Google Switching To EXT4 Filesystem 348

Posted by timothy on Thursday January 14, 2010 @04:50PM from the make-money-with-open-source dept.

An anonymous reader writes "Google is in the process of upgrading their existing EXT2 filesystem to the new and improved EXT4 filesystem. Google has benchmarked three different filesystems — XFS, EXT4 and JFS. In their benchmarking, EXT4 and XFS performed equally well. However, in view of the easier upgrade path from EXT2 to EXT4, Google has decided to go ahead with EXT4."

This discussion has been archived. No new comments can be posted.

Time for a backup? (Score:5, Informative)

by Itninja ( 937614 ) writes: on Thursday January 14, 2010 @04:52PM (#30770502) Homepage

I guess now is as good as any to go through my Gmail and Google Docs and make local backups. I'm sure my info is safe, but I have been through these types of 'upgrades' at work before and every once in a while....well, let's just say backups are never a bad idea.

- Re:Time for a backup? (Score:5, Funny)
  
  by fuzzyfuzzyfungus ( 1223518 ) writes: on Thursday January 14, 2010 @04:54PM (#30770526) Journal
  
  Not to worry. It's all in the cloud, right?
  
  - Re:Time for a backup? (Score:5, Funny)
    
    by castironpigeon ( 1056188 ) writes: on Thursday January 14, 2010 @05:02PM (#30770658)
    
    Uh huh, the mushroom cloud.
    
    - Re:Time for a backup? (Score:5, Funny)
      
      by paradigm82 ( 959074 ) writes: on Thursday January 14, 2010 @05:16PM (#30770932)
      
      It's probably nothing, probably. But I'm getting a small discrepancy in the file sizes...no, no, it's well within acceptable limits. Continue to stage 2.
      
      - Re:Time for a backup? (Score:4, Informative)
        
        by XaXXon ( 202882 ) writes: <xaxxon@gmail. c o m> on Thursday January 14, 2010 @07:26PM (#30772678) Homepage
        
        Half life.
        
        
        Re:Time for a backup? (Score:4, Funny)
        
        by pz ( 113803 ) writes: on Thursday January 14, 2010 @07:59PM (#30773102) Journal
        
        Based on the movie 2001:
        HAL: "Sorry about this, I know it's a bit silly...just a moment...just a moment... I've just picked up a fault in the AE35 unit. It's going to go 100% failure within 72 hours."
        Dave:"It's still within operational limits right now?"
        HAL:"Yes. And it will stay that way till it fails."
        I don't have my copy of the book handy to check the original dialogue.
        
    - Re:Time for a backup? (Score:5, Funny)
      
      by Anonymous Coward writes: on Thursday January 14, 2010 @05:39PM (#30771314)
      
      Wait a minute. I'm a manager, and I've been reading a lot of case studies and watching a lot of webcasts about The Cloud. Based on all of this glorious marketing literature, I, as a manager, have absolutely no reason to doubt the safety of any data put in The Cloud.
      The case studies all use words like "secure", "MD5", "RSS feeds" and "encryption" to describe the security of The Cloud. I don't know about you, but that sounds damn secure to me! Some Clouds even use SSL and HTTP. That's rock solid in my book.
      And don't forget that you have to use Web Services to access The Cloud. Nothing is more secure than SOA and Web Services, with the exception of perhaps SaaS. But I think that Cloud Services 2.0 will combine the tiers into an MVC-compliant stack that uses SaaS to increase the security and partitioning of the data.
      My main concern isn't with the security of The Cloud, but rather with getting my Indian team to learn all about it so we can deploy some first-generation The Cloud applications and Web Services to provide the ultimate platform upon which we can layer our business intelligence and reporting, because there are still a few verticals that we need to leverage before we can move to The Cloud 2.0.
      
      - Re:Time for a backup? (Score:4, Funny)
        
        by naglep ( 709515 ) writes: on Friday January 15, 2010 @08:02AM (#30777408) Journal
        
        A comment very very similar to this one has appeared on slashdot cloud articles before - almost verbatim, I'd say. Can't help but wonder if you're one of those losers who keep logs of comments they like so they can copy/paste them later.
        
- Re: (Score:2, Insightful)
  
  by Anonymous Coward writes:
  
  Oh fuck off. It's not like Google is going to upgrade their entire multiply-redundant infrastructure all at once. And ext4 is a very conservative and stable FS. The "upgrade" process is to simply mount your old ext3 volume as ext4, and let new writes take advantage of ext4 features. If Google is actually still using ext2 rather than ext3, ext4 will be significantly *more* reliable. Not as good as XFS for preserving data integrity, but better than ext2.
  - Re:Time for a backup? (Score:5, Funny)
    
    by Itninja ( 937614 ) writes: on Thursday January 14, 2010 @05:28PM (#30771112) Homepage
    
    Jeez, calm down junior! No need to open a can of fanboi on me....
    
  - Re: (Score:2)
    
    by gmuslera ( 3436 ) writes:
    
    Data integrity (and replication) is managed in a layer over the fs, so the journaling could be an unneeded hit to the performance. Probably that's why they didn't upgrade to ext3 a long while ago.
  - Re: (Score:3, Insightful)
    
    by lymond01 ( 314120 ) writes:
    
    If Google is actually still using ext2 rather than ext3, ext4 will be significantly *more* reliable.
    It ain't the destination, it's the journey that worries me.
  - - Re: (Score:3, Informative)
      
      by Ginger Unicorn ( 952287 ) writes:
      
      Presumably you're a troll, since the link that you gave explicitly states the following:
      
      It was a firefox exploit, not a problem with google's servers
      Only 60 people were affected. "mass email deletions" indeed.
    - - Re: (Score:3, Informative)
        
        by tytso ( 63275 ) * writes:
        
        >I mount these read-only in the interests of security, but that means, of course,
        >that I can't have journalling on them, which precludes the use of ext3 or 4.
        #1. you can mount ext3 file systems read-only. The journal doesn't preclude a ro mount.
        #2. ext4 supports running without a journal. Google engineers contributed that code to ext4 last year.
- Re: (Score:2)
  
  by Monkeedude1212 ( 1560403 ) writes:
  
  It sounds like EXT4 is fully compatible with 2 and 3, so even an EXT2 drive can be mounted as EXT4, which means the chances for failure are seriously reduced.
  But I totally hear what you're saying. Whenever you upgrade Anything, nothing is SUPPOSED to go wrong.
  However, It always does.
- Re:Time for a backup? (Score:5, Funny)
  
  by tool462 ( 677306 ) writes: on Thursday January 14, 2010 @05:17PM (#30770942)
  
  I usually let the bit-gods decide what data I have that is important enough to save. Over the years the bit-gods have taught me that:
  Music files: not important, Styx crossed the Styx to /dev/null in 2002
  Essay written for sophomore year high school english: Important, I assume to haunt me in some future political race.
  Porn collection: Like the subject matter within, it swells impressively, explodes, then enters a refractory period until it's ready to build up again.
  C++ program that graphs the Mandelbrot set: Important. I like feeling like an explorer navigating the cardioid's canyons.
  Photos of my children: Not important. If I need more baby photos, I can just have more babies.
  
- Re: (Score:3, Insightful)
  
  by at_slashdot ( 674436 ) writes:
  
  "backups are never a bad idea."
  Depends, for example you reduce the security of data with the number of backups you keep (you could encrypt them but that has its own problems).
- - Re: (Score:2)
    
    by spydum ( 828400 ) writes:
    
    Actually, they could. It's not like you pay anything for it.
    - Re: (Score:2, Insightful)
      
      by BenLeeImp ( 1347831 ) writes:
      
      True, but they do make money off of your data. I'm pretty sure they will go to great lengths to protect their source of revenue.
  - Re: (Score:2)
    
    by berashith ( 222128 ) writes:
    
    Is the beta over yet? I don't give good SLAs on retention and recovery to dev systems.
- - Re: (Score:2)
    
    by Itninja ( 937614 ) writes:
    
    Free loaders? You like GRUB or LILO? I don't get it.
Slashdotted already? (Score:2)

by ccandreva ( 409807 ) writes:

Looks like Digitizor already melted.
- Re: (Score:2)
  
  by lalena ( 1221394 ) writes:
  
  Yeah, it was down by the time there were 2 posts in /.
- Re: (Score:2)
  
  by spazdor ( 902907 ) writes:
  
  Must be all that journalizing the webserver's gotta do.
Use of commas. (Score:4, Funny)

by Anonymous Coward writes: on Thursday January 14, 2010 @04:57PM (#30770572)

Eats, shoots and leaves. Read it.

- Re:Use of commas. (Score:5, Funny)
  
  by schon ( 31600 ) writes: on Thursday January 14, 2010 @05:03PM (#30770692)
  
  Maybe it was submitted by William Shatner?
  
  - Re: (Score:2)
    
    by natehoy ( 1608657 ) writes:
    
    Nope, that can't be it. There aren't any exclamation points.
- - Re: (Score:2, Informative)
    
    by Darth Sdlavrot ( 1614139 ) writes:
    
    Why do I put a comma before the and in a list?
    I would say "I have a cat, a dog, and two goats."
    But you would say "I have a cat, a dog and two goats."
    The English language is so damned weird...but AC is right, illegal use of commas. That's a 15 karma penalty. 1st down.
    I too add the comma in lists of discrete items -- not sure where I learned it.
    If some items are connected or related in some way that's distinct from the other items in the list I'd omit the comma. Not a great example: "I have a cat, a daughter and a son, a car and a motorcycle, and a swimming pool."
    I notice that the Brits (and Canucks, Aussies, etc.) tend to always omit the comma.
    Could be an Americanism?
    (And I suspect you really write it, not "say" it.)
  - Re: (Score:2, Informative)
    
    by AvitarX ( 172628 ) writes:
    
    There is no hard rule on this, and both can be ambiguous in different circumstances.
    http://en.wikipedia.org/wiki/Serial_comma [wikipedia.org]
Digitizor link useless (Score:5, Informative)

by autocracy ( 192714 ) writes: <`slashdot2007' `at' `storyinmemo.com'> on Thursday January 14, 2010 @04:58PM (#30770592) Homepage

I managed to ease a pageview out of it. That said, the /. summary says all they say, and you're all better served by the source they point to, which is what SHOULD have been in the article summary instead of the Digitizor site.
See http://lists.openwall.net/linux-ext4/2010/01/04/8 [openwall.net]

- Re: (Score:2)
  
  by ShadowRangerRIT ( 1301549 ) writes:
  
  Mod parent Informative please. It's a good link, particularly with the /.ing of the original article link.
Ted Ts'o (Score:5, Informative)

by RPoet ( 20693 ) writes: on Thursday January 14, 2010 @04:59PM (#30770604) Journal

They have Ted Ts'o [h-online.com] of Linux filesystem fame working for them now.

Btrfs? (Score:3, Interesting)

by Wonko the Sane ( 25252 ) * writes: on Thursday January 14, 2010 @05:00PM (#30770616) Journal

I guess they didn't consider btrfs ready enough for benchmarking yet.

- Re: (Score:3, Funny)
  
  by fuzzyfuzzyfungus ( 1223518 ) writes:
  
  I wonder if oracle is really bttr about their rejection?
- Re:Btrfs? (Score:5, Informative)
  
  by Paradigm_Complex ( 968558 ) writes: on Thursday January 14, 2010 @05:11PM (#30770838)
  
  From kernel.org's BTRFS page [kernel.org]:
  Btrfs is under heavy development, and is not suitable for any uses other than benchmarking and review. The Btrfs disk format is not yet finalized, but it will only be changed if a critical bug is found and no workarounds are possible.
  It's ready for benchmarking, it's just not ready for widespread use yet. If Google was looking for a filesystem to make a switch to in the near future, BTRFS simply isn't an option quite yet.
  
  It's really easy at this point to move from EXT2 to EXT4 (I believe you can simply remount the partition as the new filesystem, maybe change a flag or two, and away you go). It's basically free performance. If Google is convinced it's stable, there isn't much reason not to do this. It could act as an interim filesystem until something significantly better - such as BTRFS - gets to the point where it's dependable. The fact BTRFS was not mentioned here doesn't mean it's completely ruled out.
  
  - - Re: (Score:2)
      
      by Korin43 ( 881732 ) writes:
      
      It sounds like just mounting an ext2 partition as ext4 should give some performance increase, but it won't be able to use extents, which are apparently a big deal.
    - Re:Btrfs? (Score:5, Informative)
      
      by StarHeart ( 27290 ) * writes: on Thursday January 14, 2010 @05:48PM (#30771442)
      
      You don't have to start from scratch. You just have to enable the extents feature. It won't auto convert the old stuff, but any time something is changed it will be made into an extent.
      
- Re: (Score:2, Insightful)
  
  by Tubal-Cain ( 1289912 ) writes:
  
  The chances of them using it would be pretty much nil. They are switching from ext2, and ext4's been "done" for over a year now. I'm sure they have a few benchmarks of btrfs, just not on as large of a scale as these tests were.
Comment removed (Score:4, Interesting)

by account_deleted ( 4530225 ) writes: on Thursday January 14, 2010 @05:00PM (#30770626)

Comment removed based on user account deletion

- It's Not Hans (Score:5, Interesting)
  
  by TheNinjaroach ( 878876 ) writes: on Thursday January 14, 2010 @05:06PM (#30770752)
  
  I too have abandoned using ReiserFS but it's not about the horrible crime Hans committed. It's about the fact I don't think the company that he owned (who developed ReiserFS) has a great future, so I foresee maintenance problems with that filesystem. Sure, somebody else can continue their work but I'm not going to hold my breath.
  
  - Re: (Score:2)
    
    by slimjim8094 ( 941042 ) writes:
    
    So it's indirectly about the horrible crime Hans committed. Since it's because of that that his company has a poor future, and won't be maintaining Reiser for very long.
  - Re: (Score:2)
    
    by Enderandrew ( 866215 ) writes:
    
    ReiserFS is in mainline, and is maintained by the kernel developers. Reiser and Namesys all but abandoned it, which is one of many factors that kept the newer Reiser4 out of mainline, even though Reiser4 was superior to ReiserFS in many ways.
    - Re: (Score:2)
      
      by Rich0 ( 548339 ) writes:
      
      ReiserFS is in mainline, and is maintained by the kernel developers.
      So is OS/2 HPFS. On the one hand that shows that ReiserFS will probably be supported almost forever. On the other hand, I'm not sure I'd be rolling it out for new deployments or applications unless you're in a very tight niche.
  - Re:It's Not Hans (Score:4, Informative)
    
    by diegocg ( 1680514 ) writes: on Thursday January 14, 2010 @06:15PM (#30771790)
    
    Reiserfs has been undermaintained for a long time AFAIK. When Hans started working on reiser4, he completely forgot about adding needed features to v3. The reiserfs disk format may be good, but the codebase is outdated. Ext4 has an ancient disk format in many ways, but the codebase is scalable, it uses delayed allocation, the block allocator is solid, xattrs are fast, etc etc. Reiserfs still uses the BKL, the xattr support that Suse added is said to be slow and not very pretty, it had problems with error handling etc etc...
    
  - Re: (Score:3, Interesting)
    
    by mqduck ( 232646 ) writes:
    
    Personally, I think Hans should have been allowed to continue his work on ReiserFS while incarcerated. Better to let a guilty man contribute to society than do nothing but rot in prison, no?
  - Re: (Score:3, Informative)
    
    by rwa2 ( 4391 ) * writes:
    
    I'm still running reiser3, and probably holding out for reiser4... it's been confusing since the benchmarks for the next-gen fs's have been all over the place, but some look promising:
    http://www.debian-administration.org/articles/388#comment_127 [debian-adm...ration.org]
    I've always run software RAIDs to crank out a bit more performance out of the slowest part of my system, and reiserfs3 has always worked better out of the box. I'd spent long hours tuning EXT3 stripe widths and directory indexes and stuff, and EXT3 always came out
- Re:No ReiserFS? (Score:4, Insightful)
  
  by pdbaby ( 609052 ) writes: on Thursday January 14, 2010 @05:06PM (#30770756)
  
  ...or maybe the fact that he's no longer involved brings up questions about its future direction. I'm sure they took a look at reiserfs previously
  
- Re:No ReiserFS? (Score:4, Funny)
  
  by Anonymous Coward writes: on Thursday January 14, 2010 @05:06PM (#30770764)
  
  ...maybe they felt it wasn't cutting edge enough.
  
- Re: (Score:2)
  
  by Icarium ( 1109647 ) writes:
  
  I'd imagine contacting a prison for tech support could be a bit awkward.
  (Yes, I know it's lame)
- Re: (Score:2)
  
  by icepick72 ( 834363 ) writes:
  
  The association is too close in this case because a murderer's name is part of the file system name. If the product had been named something else the association wouldn't be there. Might as well stock the shelves with Bernardo Bath Oil and Dahmer Doodads. How well do you think that would go in the eyes of the corporate world? So it's not because the creator of the filesystem committed a crime, it's because the product has an unsavoury name - those are two distinct and unrelated issues.
  - Re:No ReiserFS? (Score:5, Funny)
    
    by jspenguin1 ( 883588 ) writes: <jspenguin@gmail.com> on Thursday January 14, 2010 @05:35PM (#30771236) Homepage
    
    They need to change the name... How about
    Object-oriented
    Journalled
    File
    System?
    
  - Re:No ReiserFS? (Score:4, Interesting)
    
    by mqduck ( 232646 ) writes: <mqduck AT mqduck DOT net> on Thursday January 14, 2010 @07:51PM (#30772988)
    
    So it's not because the creator of the filesystem committed a crime, it's because the product has an unsavoury name
    Actually, it's more likely because the creator and main developer of the filesystem is suddenly gone. As I understand it, he wasn't a very friendly guy (surprise!) and drove others away from the project.
    
- Re: (Score:2)
  
  by KlomDark ( 6370 ) writes:
  
  // Came here for the Reiser reference //// Not leaving disappointed! ////// Oops, this aint Fark...
- Re: (Score:3, Funny)
  
  by gmuslera ( 3436 ) writes:
  
  To make the move to this new filesystem, they hired Ted Ts'o (current maintainer of ext4). Hans wasn't available at the moment, and it would be bad to have a famous employee that, well, did evil.
  - - Re: (Score:3, Insightful)
      
      by TheRaven64 ( 641858 ) writes:
      
      They've never hired anyone from the Windows ME team though, only people who did the sort of everyday low-grade evil, nothing too heinous.
Google doesn't need journaling? (Score:4, Interesting)

by Paradigm_Complex ( 968558 ) writes: on Thursday January 14, 2010 @05:00PM (#30770634)

The main advantage of EXT3 over EXT2 is that, with journaling, if you ever need to fsck the data, it goes a LOT quicker. It's interesting to note that Google never felt it needed that functionality.

Additionally, I was under the impression that Google used massive numbers of commodity consumer-grade harddrives, as opposed to high-grade stuff which I presume is less likely to err. Couple this fact with the massive amount of data Google is working with and there has got to be a lot of filesystem errors, no?

Can anyone else with experience with big database stuff hint as to why Google would not need to fsck their data (often enough for EXT3 to be worthwhile)? Is it cheaper just to overwrite the data from some backup elsewhere at this scale? How do they know the backup is clean without fscking that?

- Re:Google doesn't need journaling? (Score:5, Informative)
  
  by spydum ( 828400 ) writes: on Thursday January 14, 2010 @05:06PM (#30770768)
  
  Replicas stored across multiple servers -- if one is corrupted or unavailable requiring fsck, who cares? Ask the next server in line for the data.
  
- Re: (Score:2)
  
  by 42forty-two42 ( 532340 ) writes:
  
  First, google's servers each have their own battery [cnet.com], so it's unlikely that all the servers in a DC will go down at once. If only a few go down, their redundancy means that it's not a big deal - they can wait for the fsck. And moreover, even if an entire DC goes down (eg, due to cooling loss) they have the redundancy needed to deal with entire datacenter failures - with that kind of redundancy, fscking is only a minor inconvenience (plus with a cooling failure they might have time to sync and umount before p
- Re: (Score:2)
  
  by ls671 ( 1122017 ) * writes:
  
  I always felt that fscking the data taking data that is already on the disk (the journal) into account was weaker than fscking the data independently (no journal). Or at least that it would bring more possibilities of errors (e.g. errors in the journal itself). It may very well be an unjustified impression that I have but at least it seems logical at first glance; A simpler file system means less risk of bugs, etc.
  http://slashdot.org/comments.pl?sid=1511104&cid=30770742 [slashdot.org]
  - Re: (Score:2, Informative)
    
    by amRadioHed ( 463061 ) writes:
    
    If you lost power while the journal was being written and it was incomplete then the journal entry would just be discarded and your filesystem itself would be fine, it would just be missing the changes from the last operation before the crash.
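The point above (an incomplete journal entry is discarded at recovery, so the filesystem only loses the last operation) can be sketched roughly as follows. This is a toy model, not how ext3/ext4 actually lay out their journal: the entry format, `make_entry`, `replay_journal`, and the CRC-based commit record are all assumptions made for illustration.

```python
import zlib

def make_entry(txid, changes, complete=True):
    """Build a toy journal entry: a payload plus an optional commit record."""
    payload = repr(sorted(changes.items())).encode()
    entry = {"txid": txid, "changes": changes, "payload": payload}
    if complete:
        # The commit record is written last; a crash before this point
        # leaves the entry without one.
        entry["commit_crc"] = zlib.crc32(payload)
    return entry

def replay_journal(disk, journal):
    """Apply committed entries to `disk`, stopping at the first incomplete one."""
    applied = []
    for entry in journal:
        crc = entry.get("commit_crc")
        if crc is None or crc != zlib.crc32(entry["payload"]):
            break  # incomplete or damaged entry: discard it and everything after
        disk.update(entry["changes"])
        applied.append(entry["txid"])
    return applied

disk = {"inode_7": "old"}
journal = [
    make_entry(1, {"inode_7": "new"}),
    make_entry(2, {"inode_9": "half-written"}, complete=False),  # power lost here
]
applied = replay_journal(disk, journal)
print(applied, disk)
```

In this sketch transaction 2 never reaches the "disk", which matches the claim that only the changes from the last operation before the crash go missing.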
- Re: (Score:2, Informative)
  
  by crazyvas ( 853396 ) writes:
  
  They use fast replication techniques to restore disk servers (chunkservers in GFS terminology) when they fail.
  The failure could be because of a component failure, disk corruption, or even simply a killed process. The detection is done via checksumming (as opposed to fscking), which also takes care of detecting higher-level issues that fscking might miss.
  Yes, it is much cheaper for them to overwrite data from another replica (3 replicas for all chunkservers is the default) using their fast re-re
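A minimal sketch of the idea described above: detect a bad copy by checksum rather than by fsck, then serve and repair from another replica. Nothing here is Google's code. The `Replica` class, `read_chunk`, and the use of SHA-256 are assumptions for illustration, with SHA-256 standing in for whatever checksum a real chunkserver uses.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Stand-in checksum; a real system may use something much cheaper."""
    return hashlib.sha256(data).hexdigest()

class Replica:
    """Hypothetical chunk store: maps chunk_id -> bytes."""
    def __init__(self, name, chunks):
        self.name = name
        self.chunks = dict(chunks)

    def read(self, chunk_id):
        return self.chunks[chunk_id]

def read_chunk(chunk_id, expected, replicas):
    """Return (data, names of repaired replicas), trying replicas in order."""
    bad = []
    good = None
    for replica in replicas:
        try:
            data = replica.read(chunk_id)
        except KeyError:
            bad.append(replica)  # chunk missing on this replica
            continue
        if checksum(data) == expected:
            good = data
            break
        bad.append(replica)  # checksum mismatch: treat the copy as corrupt
    if good is None:
        raise IOError(f"no valid replica for chunk {chunk_id}")
    for replica in bad:
        replica.chunks[chunk_id] = good  # overwrite from the good copy
    return good, [replica.name for replica in bad]

payload = b"chunk-0 contents"
expected = checksum(payload)
a = Replica("a", {0: b"chunk-0 c0rrupt"})
b = Replica("b", {0: payload})
c = Replica("c", {0: payload})
data, repaired = read_chunk(0, expected, [a, b, c])
print(repaired)
```

The design choice this illustrates is that no per-disk consistency check is needed on the read path: a mismatch just moves the request to the next replica, and the bad copy is overwritten afterwards.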
- - Re:Google doesn't need journaling? (Score:5, Interesting)
    
    by tytso ( 63275 ) * writes: on Thursday January 14, 2010 @07:55PM (#30773056) Homepage
    
    So there's a major problem with Soft Updates, which is that you can't be sure that data has hit the disk platter and is on stable store unless you issue a barrier operation, which is very slow. What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption! The disk drive, especially modern ones with large caches, can reorder writes which are sent to the disk, sometimes (with the right pathological workloads) for minutes at a time. You won't notice this problem if you just crash the kernel, or even if you hit the reset button. But if you pull the plug or otherwise cause the system to drop power, data in the disk's write cache won't necessarily be written to disk. The problem that we saw with journal checksums and ext4 only showed up on a power drop, because there was a missing barrier operation, so this is not a hypothetical consideration.
    In addition, if you have a very heavy write workload, the Soft Updates code will need to burn a fairly large amount of memory tracking the dependencies and burn quite a bit of CPU figuring out which dependencies need to be rolled back. I'm a bit suspicious of how well they perform and how much CPU they steal from applications --- which granted, may not show up in benchmarks which are disk bound. But if the applications or the large number of jobs running on a shared machine are trying to use lots of CPU as well as disk bandwidth, this could very much be an issue.
    BTW, while I was doing some quick research for this reply, it seems that NetBSD is about to drop Soft Updates in favor of a physical block journaling technology (WAPBL), according to Wikipedia. They didn't give a reference for this, nor did they say why NetBSD was planning on dropping Soft Updates, but there is a description of the replacement technology here: http://www.wasabisystems.com/technology/wjfs [wasabisystems.com]. But if Soft Updates is so great, why is NetBSD replacing it and why did FreeBSD add a file system journaling alternative to UFS?
    
    - - Re: (Score:3, Informative)
        
        by tytso ( 63275 ) * writes:
        
        What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption!
        Journaling, and every other filesystem, has exactly the same problem. If consistency is required, YOU MUST DISABLE THE CACHE, unless it is battery-backed, or you are willing to depend on your UPS. This is the penalty we take for devices which lie to the OS about flush operations and the like.
        Yes, there were, in the bad old days, devices which lied when the OS sent a flush cache command, and in order to get a better Winbench score, they would cheat and not actually flush the cache. But that hasn't been true for quite a while, even for commodity desktop/laptop drives. It's quite easy to test; you just time how many single-sector writes, each followed by a cache flush command, you can send per second. In practice, it won't be more than, oh, 50-60 write barriers per second. In general, if you
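The timing test described above can be approximated from user space. The sketch below is only an approximation under stated assumptions: it rewrites one 512-byte block in an ordinary file and calls `os.fsync` after each write, so it goes through the filesystem rather than sending flush commands to the drive directly. Whether each `fsync` reaches the drive's write cache depends on the filesystem and mount options, and the function name `flushes_per_second` is made up for this example.

```python
import os
import tempfile
import time

def flushes_per_second(path, duration=1.0, block_size=512):
    """Count how many write-then-fsync cycles complete in `duration` seconds."""
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    count = 0
    try:
        start = time.monotonic()
        while time.monotonic() - start < duration:
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same block each time
            os.write(fd, block)
            os.fsync(fd)  # ask the OS to push the data to stable storage
            count += 1
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return count / elapsed

with tempfile.TemporaryDirectory() as tmp:
    rate = flushes_per_second(os.path.join(tmp, "probe.bin"), duration=0.5)
    print(f"{rate:.0f} write+fsync cycles per second")
```

To test a particular disk, point `path` at a file on that disk; the temporary directory used here may live on tmpfs or an SSD, where the number says little about a rotating drive. A rate far above what the drive's rotation speed allows would suggest the flush is not reaching the platter.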
    - - Re: (Score:3, Interesting)
        
        by tytso ( 63275 ) * writes:
        
        So I'm an engineer, and not an academic. I'm not trying to get a Ph.D. The whole Keep it Simple, Stupid principle is an important one, especially as you say, "Journalling and Soft Updates have similar performance characteristics."
        If sometimes Journalling posts better benchmarks, and sometimes Soft Updates produces better results, but Soft Updates is hideously more complex, thus inhibiting new features such as ACLs and Extended Attributes (which appeared in BSD much later than Linux, and I think Soft Up
        
        Re:Google doesn't need journaling? (Score:5, Informative)
        
        by Jeff- ( 95113 ) writes: on Friday January 15, 2010 @03:31AM (#30776116) Homepage
        
        There's a lot of misinformation in this thread about softupdates. I only have so much time to reply so I'll hit a few key points. I'm the author of journaling extensions to softupdates so I have some experience in this area.
        This notion that softupdates was so complex and so inhibited new features in ffs is bogus. I've seen it repeated a few times. There simply was not much pressure for these features and the filesystem metadata did not support it until ufs2. The total amount of code dedicated to extended attributes in softupdates can't be more than 100 lines. ffs sees fewer features because we have fewer developers period.
        Furthermore, softupdates is just a different approach. It is no more complex than journaling. When I review a sophisticated journaling implementation such as xfs I see more lines of code dedicated to journaling and transaction management than softupdates requires for dependency tracking. I have worked on a number of production filesystems and while softdep is definitely not trivial, neither were any of the others unless you compare to synchronous ufs. I think a lot of people who are familiar with COW and Journaling are looking at this unfairly because they already know another system and forget how long it took to become comfortable with it.
        In cpu benchmarks softdep costs more than async ffs, this is true. However, rollbacks are actually quite infrequent because our buffercache attempts to write buffers without dependencies first. Generally there are enough of those which satisfy dependencies on other buffers that you can keep the pipeline busy. Looking at the code size and depth in any modern filesystem it's clear that a lot of cpu is involved. Are journal blocks not consuming memory? Is the transaction tracking free? Most dependency structures are quite small compared to generating a copy of a metadata block for a journal write.
        NetBSD abandoned softdep for something much simpler because they didn't have the resources to fix the bugs in it and they didn't incorporate fixes from FreeBSD. Their journaling implementation is similar to our gjournal which is mostly filesystem agnostic and does full block logging in a very simple fashion.
        The journaled filesystem project was started simply to get rid of fsck. I think this hybrid solution is very promising. It gives us a place to issue barriers which can affect arbitrary numbers of filesystem operations. The journal write overhead is much lower than with traditional journals.
        And regarding benchmarks; FreeBSD doesn't really have a comparably developed journaling filesystem to benchmark softdep against. I think it's unreasonable to compare linux with ext4 to FreeBSD with ffs+softdep for purposes of evaluating the filesystem design. Too many other factors come into play.
        You can read more about softdep journaling at http://jeffr_tech.livejournal.com/
        Thanks,
        Jeff
        
As impressively as each other?! WTF?! (Score:4, Funny)

by Anonymous Coward writes: on Thursday January 14, 2010 @05:04PM (#30770714)

From TFA:
In their benchmarking, EXT4 and XFS performed, as impressively as each other.
WTF kind of retarded sentence is that?! Did Rob Smith help you write that article?!
In their benchmarking of EXT4 and XFS, EACH performed as impressively as THE OTHER.

- Re:As impressively as each other?! WTF?! (Score:4, Informative)
  
  by mqduck ( 232646 ) writes: <mqduck AT mqduck DOT net> on Thursday January 14, 2010 @07:55PM (#30773060)
  
  Simply removing the second comma would make the sentence entirely correct:
  "In their benchmarking, EXT4 and XFS performed as impressively as each other."
  Adding "each" would make it a bit clearer, but the meaning is already obvious. I don't know why you think it has to be "THE other".
  
- - Re: (Score:2)
    
    by fm6 ( 162816 ) writes:
    
    I think you meant to say, "Well a monster that gigantic could only be defeated by an even equally gigantic monster!"
Still on ext2 on servers (Score:4, Insightful)

by ls671 ( 1122017 ) * writes: on Thursday January 14, 2010 @05:05PM (#30770742) Homepage

We are still using ext2 on servers. Now I have an argument; if Google is still using ext2 maybe we aren't so foolish. We might update some day but it is not yet a priority. With UPS and proper failover and backup procedures in place, I can't remember when a journaling file system would have helped us in any way. They seem great for desktops/laptops, though.

- Re: (Score:2)
  
  by Bill, Shooter of Bul ( 629286 ) writes:
  
  Seriously? Being able to recover your data faster isn't a consideration? Or do you have a big SAN for all of the critical application data?
XFS performance highly variable (Score:4, Interesting)

by bzipitidoo ( 647217 ) writes: <bzipitidoo@yahoo.com> on Thursday January 14, 2010 @05:14PM (#30770898) Journal

I've used XFS on a RAID1 setup with SATA drives, and found the performance of the delete operation extremely dependent on how the partition was formatted.
I saw times of up to 5 minutes to delete a Linux kernel source tree on a partition that was formatted XFS with the defaults. Have to use something like sunit=64, swidth=64, and even then it takes 5 seconds to rm -rf /usr/src/linux. I've heard that SAS drives wouldn't exhibit this slowness. Under Reiserfs on the same system, the delete took 1 second. Anyway, XFS is notorious for slow delete operations.
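For anyone who wants to compare delete times on their own partitions, here is a minimal sketch of a timing harness. It is not the kernel-tree test described above: it builds a small synthetic tree of files and times a recursive delete. The function names, tree shape, and file sizes are made up for illustration, and it does not flush caches, so treat the numbers as rough.

```python
import os
import shutil
import tempfile
import time


def build_tree(root, dirs=20, files_per_dir=50, size=1024):
    """Create a synthetic tree of many small files under root."""
    payload = b"x" * size
    for d in range(dirs):
        subdir = os.path.join(root, f"dir{d:03d}")
        os.makedirs(subdir)
        for f in range(files_per_dir):
            with open(os.path.join(subdir, f"file{f:03d}"), "wb") as fh:
                fh.write(payload)
    return dirs * files_per_dir


def time_delete(parent=None):
    """Build a tree under parent, then time its recursive removal.

    parent should be a directory on the filesystem under test;
    None falls back to the system temp directory.
    Returns (file_count, seconds_for_delete).
    """
    root = tempfile.mkdtemp(dir=parent)
    count = build_tree(root)
    start = time.perf_counter()
    shutil.rmtree(root)
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    count, elapsed = time_delete()
    print(f"deleted {count} files in {elapsed:.3f}s")
```

Pass a directory on the XFS (or other) mount as `parent` to time that filesystem, and repeat the run a few times, since a single measurement can be noisy.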

- Re: (Score:2)
  
  by ShadowRangerRIT ( 1301549 ) writes:
  
  For a lot of modern corporate data storage situations, deletion isn't really important. My company uses an in-house write-once file system (no idea what it's based on), because by and large, the cost of storing old data is negligible next to the advantages of being able to view an older version of the dataset, completely remove fragmentation from the picture, etc. I suspect deletion operations are fairly uncommon at Google; in the rare cases where it is necessary, it is quite possible they just copy the data they
- Re: (Score:2, Interesting)
  
  by Anonymous Coward writes:
  
  Mounting with nobarrier will change those 5 minutes to 5 seconds, but then don't turn off your computer during the delete.
GFS (Score:4, Insightful)

by jonpublic ( 676412 ) writes: on Thursday January 14, 2010 @05:16PM (#30770926)

I thought google had their own file system named the google files system.
http://labs.google.com/papers/gfs.html [google.com]

- Re: (Score:2, Insightful)
  
  by jonpublic ( 676412 ) writes:
  
  I should probably read my own posts before hitting submit.
    - Re: (Score:2, Funny)
      
      by FlyingBishop ( 1293238 ) writes:
      
      I meant never.
- Re:GFS (Score:4, Informative)
  
  by joib ( 70841 ) writes: on Thursday January 14, 2010 @06:26PM (#30771918)
  
  I believe GFS uses a local fs on each node to take care of, well, all the stuff that a normal local fs like ext3 does. GFS only does the distributed stuff on top of that.
  
Windows Driver (Score:2)

by pgn674 ( 995941 ) writes:

Might this prompt someone at Google to make an installable file system driver for Windows for EXT4? Right now there is none, I think because of differing inode sizes and some extra features that EXT4 demands over EXT2.
- Re:Windows Driver (Score:5, Insightful)
  
  by fuzzyfuzzyfungus ( 1223518 ) writes: on Thursday January 14, 2010 @05:38PM (#30771296) Journal
  
  I can't imagine why it would.
  
  To the best of my knowledge, Google uses pretty much no Windows servers themselves (at least not for any of their public-facing products; they almost certainly have some kicking around), and "a vast number of instances of custom in-house server applications" is among the least plausible environments for a Windows server deployment, so that is unlikely to change.
  
  On the desktop side, Google has a bunch of stuff that runs on Windows, but it all communicates with Google's servers over various ordinary web protocols and stores local files with the OS-provided filesystem. The benefits of EXT4 on Windows would have to be pretty damn compelling for them to start requiring a kernel driver install and a spare unformatted partition.
  
  I suppose it is conceivable that some Google employee might decide to do it, for more or less inscrutable reasons; but it would have no connection at all to Google's broader operation or strategy.
  
Ubuntu 9.10? (Score:5, Interesting)

by GF678 ( 1453005 ) writes: on Thursday January 14, 2010 @05:36PM (#30771252)

Gee, I hope they're not using Ubuntu 9.10 by any chance: http://www.ubuntu.com/getubuntu/releasenotes/910 [ubuntu.com]
There have been some reports of data corruption with fresh (not upgraded) ext4 file systems using the Ubuntu 9.10 kernel when writing to large files (over 512MB). The issue is under investigation, and if confirmed will be resolved in a post-release update. Users who routinely manipulate large files may want to consider using ext3 file systems until this issue is resolved. (453579)
The damn bug is STILL not fixed apparently. Some people get the corruption, and some don't. Scares me enough to not even try using ext4 just yet, and I'm still surprised Canonical was stupid enough to have ext4 as the default filesystem in Karmic.
Then again, perhaps Google knows what they're doing.

- Re: (Score:2)
  
  by Nimey ( 114278 ) writes:
  
  Then again, perhaps Google knows what they're doing.
  More so than your average Slashdotter, I expect.
- Re:Ubuntu 9.10? (Score:5, Insightful)
  
  by Lennie ( 16154 ) writes: on Thursday January 14, 2010 @06:22PM (#30771878)
  
  They employ the main developer of ext2, ext3 and ext4.
  
  He probably knows a lot about it.
  
- Re:Ubuntu 9.10? (Score:4, Informative)
  
  by tytso ( 63275 ) * writes: on Friday January 15, 2010 @02:03AM (#30775742) Homepage
  
  So Canonical has never reported this bug to LKML or to the linux-ext4 list as far as I am aware. No other distribution has complained about this > 512MB bug, either. The first I heard about it is when I scanned the Slashdot comments.
  Now that I know about it, I'll try to reproduce it with an upstream kernel. I'll note that in 9.04, Ubuntu had a bug which, as far as I know, must have been caused by their screwing up some patch backports. Only Ubuntu's kernel had a bug where rm'ing a large directory hierarchy would have a tendency to cause a hang. No one was able to reproduce it on an upstream kernel.
  I will say that I don't ever push patches to Linus without running them through the XFS QA test suite (which is now generalized enough that it can be used on a number of file systems other than just XFS). If it doesn't have a "write a 640 MB file and make sure it isn't corrupted" test, we can add it, and then all of the file systems which use the XFSQA test suite can benefit from it.
  (I was recently proselytizing the use of the XFS QA suite to some Reiserfs and BTRFS developers. The "competition" between file systems is really more of a fanboy/fangirl thing than at the developer level. In fact, Chris Mason, the head btrfs developer, has helped me with some tricky ext3/ext4 bugs, and in the past couple of years I've been encouraging various companies to donate engineering time to help work on btrfs. With the exception of Hans Reiser, who has in the past accused me of trying to actively sabotage his project --- not true as far as I'm concerned --- we all are a pretty friendly bunch and work together and help each other out as we can.)
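The "write a big file and make sure it isn't corrupted" idea can be sketched in a few lines of user-space code. To be clear, this is not a test from the XFS QA suite; the function name and the data pattern are made up for illustration. Also note that the read-back will usually be served from the page cache, so a real test would need to remount or drop caches before verifying.

```python
import hashlib
import os
import tempfile


def write_and_verify(path, total_bytes, chunk_size=1024 * 1024):
    """Write total_bytes of deterministic data to path, fsync it,
    then read it back and compare SHA-256 digests.

    Returns True when the data read back matches what was written.
    """
    written = hashlib.sha256()
    remaining = total_bytes
    counter = 0
    with open(path, "wb") as fh:
        while remaining > 0:
            n = min(chunk_size, remaining)
            # Each chunk gets its own pattern so misplaced blocks show up too.
            seed = hashlib.sha256(counter.to_bytes(8, "big")).digest()
            chunk = (seed * (n // len(seed) + 1))[:n]
            fh.write(chunk)
            written.update(chunk)
            remaining -= n
            counter += 1
        fh.flush()
        os.fsync(fh.fileno())

    read_back = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            read_back.update(block)
    return written.hexdigest() == read_back.hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bigfile.bin")
        ok = write_and_verify(target, 640 * 1024 * 1024)
        print("match" if ok else "MISMATCH")
```

Point `target` at a directory on the filesystem you actually want to exercise; the temp directory used here is only a convenient default.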
  
  - Re: (Score:3, Interesting)
    
    by RoboRay ( 735839 ) writes:
    
    Yeah, they've got their own custom OS... Goobuntu.
Downtime (Score:2, Interesting)

by Joucifer ( 1718678 ) writes:

Is this why Google was down for about 30 minutes today? Did anyone else even experience this or was it a local issue?
- Not A Nerd? (Score:3, Insightful)
  
  by TheNinjaroach ( 878876 ) writes:
  
  News for nerds. Stuff that matters.
  
  Not that I RTFA or anything, but I find it interesting that XFS and EXT4 both appear to be equally impressive with benchmarks, and it's implied they are both better than JFS. You must not be a nerd.
  - Re:Not A Nerd? (Score:4, Interesting)
    
    by MBGMorden ( 803437 ) writes: on Thursday January 14, 2010 @05:11PM (#30770846)
    
    I too found it interesting, because it basically alleviates any need for me to worry about "upgrading" to ext4. My current Linux systems use an ext3 /boot partition and everything else xfs. Given some of the press ext4 has gotten lately, I just trust xfs more, and knowing that I'm not really giving up any performance is a huge plus.
    Truthfully though, where the heck are the meta-data based filesystems that we were promised? I'd love to be able to, on a filesystem level, instantly pull up a folder view of all videos - or all images. Or all images of my dog. Or all images outdoors. Or all images of my dog outdoors.
    Basically, just the ability to organize via an arbitrary number of categorized tags.
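As a rough illustration of the "arbitrary number of categorized tags" idea, here is a minimal in-memory sketch: an index from tag to the set of paths carrying it, with a query that intersects those sets. `TagIndex` and the example paths are hypothetical, and a real filesystem would have to persist this index and keep it in sync with the files.

```python
from collections import defaultdict


class TagIndex:
    """Maps each tag to the set of file paths carrying that tag."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, path, *tags):
        """Attach one or more tags to a path."""
        for t in tags:
            self._by_tag[t].add(path)

    def query(self, *tags):
        """Return the paths that carry every one of the given tags."""
        if not tags:
            return set()
        sets = [self._by_tag.get(t, set()) for t in tags]
        return set.intersection(*sets)


if __name__ == "__main__":
    idx = TagIndex()
    idx.tag("/photos/rex_park.jpg", "image", "dog", "outdoors")
    idx.tag("/photos/rex_sofa.jpg", "image", "dog")
    idx.tag("/photos/beach.jpg", "image", "outdoors")
    idx.tag("/videos/trip.mp4", "video", "outdoors")
    print(sorted(idx.query("image", "dog", "outdoors")))
```

The "folder view" from the comment then becomes a query: all images of the dog outdoors is the intersection of the three tag sets.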
    
    - Re: (Score:3, Interesting)
      
      by Hurricane78 ( 562437 ) writes:
      
      I tried TagFS, and I found the main problem is that the tagging is way too much work to get to the level of tagging I want.
      Also I avoid XFS, since it keeps huge amounts of (log?) data in RAM. So on a power failure, it’s goodbye data.
      XFS is for servers with battery backup. Not for normal home computers.
      I also tried JFS, and I got corruption with it. So I avoid it too.
      I wish I could use ZFS... especially the scrubbing functionality.
      - Re:Not A Nerd? (Score:4, Interesting)
        
        by smash ( 1351 ) writes: on Thursday January 14, 2010 @08:18PM (#30773284) Homepage Journal
        
        You can use ZFS. Just run FreeBSD or OpenSolaris. The amount of software that runs on Linux but not FreeBSD (particularly if you're talking about open-source) is exceedingly minimal.
        
    - Re: (Score:3, Informative)
      
      by Archangel Michael ( 180766 ) writes:
      
      Truthfully though, where the heck are the meta-data based filesystems that we were promised
      I suspect that once we get over the BLOCK LEVEL DEVICE (BLD) paradigm, and into SSDs that are NOT mimicking BLD, we'll have something closer to what you want.
      The problem with moving from BLD is that we've been using them for so long that I'm not sure there is any good way to make the switch to straight linear addressing of memory for ALL storage.
      In fact, I would suspect that our idea of "booting" is necessarily going
      - Re: (Score:3, Interesting)
        
        by marcansoft ( 727665 ) writes:
        
        SSD (NAND Flash) is still a block device. In fact, it's even "more" block, insomuch as it requires a filesystem a lot more aware of blocks, their limitations, and the proper way of using them (wear leveling, error correction, etc). It also uses larger blocks and also addresses groups of blocks for certain operations (erase). You either need a Flash-specific filesystem, or a translation to a more typical block device via a flash translation layer (FTL). Furthermore, I'm not aware of a single NAND Flash devic
        
        Re: (Score:3, Interesting)
        
        by TheRaven64 ( 641858 ) writes:
        
        Everything you say is true about Flash, but not about SSDs in general. Flash can be written to one byte at a time, but then it is stuck in that state until it is erased. The circuitry for erasing is bigger than the circuitry for writing, so it is shared among a group of bytes in a cell. These can be any size, but there are trade-offs. The smaller you make them, the more copies of the erase circuit are needed, so the fewer bytes of storage you get per area of die size (and per dollar). The larger you make t
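The write-once-until-erased behaviour described above can be modelled with a toy class. This is a simplified sketch, not how any particular chip or controller works: it assumes programming can only clear bits (1 to 0) and that erase resets the whole group of bytes at once. `FlashBlock` and the 16-byte size are invented for illustration.

```python
class FlashBlock:
    """Toy model of one erase unit: bytes start erased (all bits set)."""

    def __init__(self, size=16):
        self.cells = bytearray([0xFF] * size)
        self.erase_count = 0

    def program(self, offset, value):
        """Program a single byte. Bits can only go from 1 to 0, so the
        stored result is the AND of the old and new values.

        Returns True if the byte now holds exactly the requested value.
        """
        self.cells[offset] &= value
        return self.cells[offset] == value

    def erase(self):
        """Erase works on the whole block, never on a single byte."""
        for i in range(len(self.cells)):
            self.cells[i] = 0xFF
        self.erase_count += 1


if __name__ == "__main__":
    block = FlashBlock()
    print(block.program(0, 0xA5))  # first write to an erased byte
    print(block.program(0, 0x5A))  # overwrite without erase
    block.erase()
    print(block.program(0, 0x5A))  # write after erasing the whole block
```

In this model the cost of rewriting one byte is an erase of the entire block, which is the trade-off between erase-unit size and die area that the comment is describing.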
  - Re: (Score:2)
    
    by gazbo ( 517111 ) writes:
    
    As another home user I too find it illuminating which FS benchmarks best for Google's workload.
- Give us a +-0 Counterbalance (Score:3, Interesting)
  
  by itomato ( 91092 ) writes:
  
  When does black become white?
  #CCCCCC or #888888
  Is there overlap with Flamebait?
  When does an otherwise 'troll' moderation-worthy comment lose out on status that could validate 19 responses, with 50% scoring +2?
  Sometimes a troll is a troll, but sometimes it's just a shadow.
  - Re: (Score:2)
    
    by Nadaka ( 224565 ) writes:
    
    what about all the people who don't even bother to log in to post as AC?
  - Re: (Score:3, Informative)
    
    by Captain Splendid ( 673276 ) writes:
    
    Or, you could stop being lazy and go tweak your preferences, thereby saving the rest of us from your whining.
- Re:I upgraded from ext3 to ext4 and (Score:4, Informative)
  
  by jjohnson ( 62583 ) writes: on Thursday January 14, 2010 @06:21PM (#30771862) Homepage
  
  When you run data centres around the world that are collectively the most powerful supercomputer known to man, you too can get a front page story on /. announcing your upgrade.
  Until then, STFU.
  
- Re:Has Ted Cooked the Benchmarks Again? (Score:5, Informative)
  
  by tytso ( 63275 ) * writes: on Thursday January 14, 2010 @08:11PM (#30773226) Homepage
  
  So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications that don't use fsync() will also lose information after a buggy proprietary Nvidia video driver crashes your machine, regardless of whether you are using XFS or ext4.
  If you are talking about the change to _ext3_ to use data=writeback, that was a change that Linus made, not me, and ext4 has always defaulted to data=ordered. Linus thought that since the vast majority of Linux machines are single-user desktop machines, the performance hit of data=ordered, which is designed to prevent exposure of uninitialized data blocks after a crash, wasn't worth it. I and other file system engineers disagreed, but Linus's kernel, Linus's rules. I pushed a patch to ext3 which makes the default a config option, and as far as I know the enterprise distros plan to use this config option to keep the defaults the same as before for ext3.
  Since it was my choice, I actually changed the defaults for ext4 to use barrier=1, which Andrew Morton vetoed for ext3 because, again, he didn't think it was worth the performance hit. But with ext4, the benefits of delayed allocation and extents are so vast that they completely dominate the performance hit of turning on write barriers. That is what most of the performance benefits for ext4 come from, and it is very much a huge step forward compared to ext3.
  So with respect, you don't know what you are talking about.
  -- Ted
  
    - Re: (Score:3, Informative)
      
      by tytso ( 63275 ) * writes:
      
      So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications...
      Stop blaming the applications for a filesystem problem Ted. The excuse doesn't wash no matter how many times you use it, and no, XFS does not have it.
      http://en.wikipedia.org/wiki/XFS#Delayed_allocation [wikipedia.org]
      Any other questions? At the very least the applications are non-portable in the sense that they were depending on behavior not guaranteed by POSIX. XFS, btrfs, ZFS, and many if not most modern file systems do delayed allocation. It's one of the basic file system tricks to improve performance.
      - Re: (Score:3, Informative)
        
        by tytso ( 63275 ) * writes:
        
        So before I tried agitating for programmers to fix their buggy applications, I had already implemented both the heuristic that XFS uses (if you truncate a file descriptor, add an implicit fsync on the close of that fd), and in addition I had implemented another heuristic (if you rename on top of an existing file, fsync the source file of the rename). This was to work around buggy applications, and as you can see, ext4 does even more than XFS does.
        At the end of the day, though, the heuristic can sometimes
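For reference, the application-side pattern under discussion (don't rely on the filesystem's heuristics; fsync the new file before renaming it over the old one) can be sketched roughly as follows. `atomic_write` is an illustrative helper, not code from any of the applications mentioned, and it assumes a POSIX system where a directory can be opened and fsync'ed.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data so that a crash leaves either
    the complete old contents or the complete new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
            fh.flush()
            # Force the new data to stable storage BEFORE the rename.
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself is durable (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "settings.conf")
        atomic_write(target, b"version=1\n")
        atomic_write(target, b"version=2\n")
        with open(target, "rb") as fh:
            print(fh.read())
```

The explicit fsync before the rename is the step the "buggy" applications skip; with it, the result no longer depends on whether the filesystem adds an implicit flush on rename or truncate.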
