Google Switching To EXT4 Filesystem

An anonymous reader writes "Google is in the process of upgrading their existing EXT2 filesystem to the new and improved EXT4 filesystem. Google has benchmarked three different filesystems — XFS, EXT4 and JFS. In their benchmarking, EXT4 and XFS performed equally well. However, in view of the easier upgrade path from EXT2 to EXT4, Google has decided to go ahead with EXT4."
  • Time for a backup? (Score:5, Informative)

    by Itninja ( 937614 ) on Thursday January 14, 2010 @04:52PM (#30770502) Homepage
    I guess now is as good a time as any to go through my Gmail and Google Docs and make local backups. I'm sure my info is safe, but I have been through these types of 'upgrades' at work before, and every once in a while... well, let's just say backups are never a bad idea.
  • by autocracy ( 192714 ) <slashdot2007@sto ... .com minus berry> on Thursday January 14, 2010 @04:58PM (#30770592) Homepage

    I managed to ease a pageview out of it. That said, the /. summary says all they say, and you're all better served by the source they point to, which is what SHOULD have been in the article summary instead of the Digitizor site.

    See http://lists.openwall.net/linux-ext4/2010/01/04/8 [openwall.net]

  • Ted Ts'o (Score:5, Informative)

    by RPoet ( 20693 ) on Thursday January 14, 2010 @04:59PM (#30770604) Journal

    They have Ted Ts'o [h-online.com] of Linux filesystem fame working for them now.

  • by Anonymous Coward on Thursday January 14, 2010 @05:00PM (#30770618)

    Phoronix has the story

    http://www.phoronix.com/scan.php?page=news_item&px=Nzg4MA

  • by spydum ( 828400 ) on Thursday January 14, 2010 @05:06PM (#30770768)

    Replicas are stored across multiple servers -- if one is corrupted, or unavailable because it needs an fsck, who cares? Ask the next server in line for the data.

  • Re:Btrfs? (Score:5, Informative)

    by Paradigm_Complex ( 968558 ) on Thursday January 14, 2010 @05:11PM (#30770838)
    From kernel.org's BTRFS page [kernel.org]:

    Btrfs is under heavy development, and is not suitable for any uses other than benchmarking and review. The Btrfs disk format is not yet finalized, but it will only be changed if a critical bug is found and no workarounds are possible.

    It's ready for benchmarking; it's just not ready for widespread use yet. If Google is looking for a filesystem to switch to in the near future, BTRFS simply isn't an option quite yet.

    It's really easy at this point to move from EXT2 to EXT4 (I believe you can simply remount the partition as the new filesystem, maybe change a flag or two, and away you go). It's basically free performance. If Google is convinced it's stable, there isn't much reason not to do this. It could act as an interim filesystem until something significantly better - such as BTRFS - gets to the point where it's dependable. The fact BTRFS was not mentioned here doesn't mean it's completely ruled out.
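
    A minimal C sketch of that upgrade path (the device and mount point are made-up examples; requires root): mounting an existing ext2 volume with the ext4 driver, which is essentially all the "remount as the new filesystem" step amounts to:

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
            /* Mount an existing ext2 volume with the ext4 driver. No
             * reformat is needed: the ext4 code reads the older ext2/ext3
             * on-disk layout in place. Features like extents only apply
             * after being enabled (e.g. tune2fs -O extents) and only to
             * data written afterwards. */
            if (mount("/dev/sdb1", "/mnt/data", "ext4", 0, NULL) != 0) {
                perror("mount");
                return 1;
            }
            puts("ext2 volume mounted via the ext4 driver");
            return 0;
        }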

  • Re:Use of commas. (Score:2, Informative)

    by Darth Sdlavrot ( 1614139 ) on Thursday January 14, 2010 @05:29PM (#30771126)

    Why do I put a comma before the and in a list?

    I would say "I have a cat, a dog, and two goats."

    But you would say "I have a cat, a dog and two goats."

    The English language is so damned weird...but AC is right, illegal use of commas. That's a 15 karma penalty. 1st down.

    I too add the comma in lists of discrete items -- not sure where I learned it.

    If some items are connected or related in some way that's distinct from the other items in the list I'd omit the comma. Not a great example: "I have a cat, a daughter and a son, a car and a motorcycle, and a swimming pool."

    I notice that the Brits (and Canucks, Aussies, etc.) tend to always omit the comma.

    Could be an Americanism?

    (And I suspect you really write it, not "say" it.)

  • by amRadioHed ( 463061 ) on Thursday January 14, 2010 @05:32PM (#30771182)

    If you lost power while the journal was being written and it was incomplete, the journal entry would just be discarded and your filesystem itself would be fine; it would just be missing the changes from the last operation before the crash.
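
    A toy C sketch of why that is safe (the structures and magic value are invented for illustration, not the real ext3/ext4 journal format): recovery replays a transaction only if its commit record made it to disk, so a half-written entry is simply skipped:

        #include <stdint.h>
        #include <stdio.h>

        #define COMMIT_MAGIC 0xC0FFEE42u  /* invented marker value */

        struct txn {
            uint32_t seq;          /* transaction sequence number */
            uint32_t commit_magic; /* written last, after all data blocks */
        };

        /* Replay in order; stop at the first transaction whose commit
         * record never reached the disk. Nothing already committed is
         * lost, and nothing is left half-applied. */
        static void replay(const struct txn *log, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                if (log[i].commit_magic != COMMIT_MAGIC) {
                    printf("txn %u incomplete: discarded\n", (unsigned)log[i].seq);
                    return;
                }
                printf("txn %u replayed\n", (unsigned)log[i].seq);
            }
        }

        int main(void)
        {
            /* Power was lost while txn 2's commit record was being written. */
            struct txn log[] = { { 1, COMMIT_MAGIC }, { 2, 0 } };
            replay(log, 2);
            return 0;
        }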

  • Re:Use of commas. (Score:2, Informative)

    by AvitarX ( 172628 ) <me&brandywinehundred,org> on Thursday January 14, 2010 @05:33PM (#30771198) Journal

    There is no hard rule on this, and both can be ambiguous in different circumstances.

    http://en.wikipedia.org/wiki/Serial_comma [wikipedia.org]

  • Re:Btrfs? (Score:5, Informative)

    by StarHeart ( 27290 ) * on Thursday January 14, 2010 @05:48PM (#30771442)

    You don't have to start from scratch. You just have to enable the extents feature. It won't auto-convert the old files, but any time something is changed it will be written using extents.
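
    A small C sketch of how to check for that (the device path is an example; the field offsets follow the documented ext2/3/4 superblock layout): read the superblock and test the extents incompat flag, which is what tune2fs -O extents turns on:

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        #define SB_OFFSET        1024   /* superblock lives 1 KiB into the device */
        #define MAGIC_OFFSET     56     /* s_magic, 0xEF53 for ext2/3/4 */
        #define INCOMPAT_OFFSET  96     /* s_feature_incompat */
        #define INCOMPAT_EXTENTS 0x0040 /* EXT4_FEATURE_INCOMPAT_EXTENTS */

        int main(void)
        {
            uint8_t sb[1024];
            int fd = open("/dev/sdb1", O_RDONLY); /* example device */
            if (fd < 0 || pread(fd, sb, sizeof(sb), SB_OFFSET) != (ssize_t)sizeof(sb)) {
                perror("read superblock");
                return 1;
            }
            close(fd);

            uint16_t magic = sb[MAGIC_OFFSET] | (sb[MAGIC_OFFSET + 1] << 8);
            uint32_t incompat = sb[INCOMPAT_OFFSET]
                              | (sb[INCOMPAT_OFFSET + 1] << 8)
                              | ((uint32_t)sb[INCOMPAT_OFFSET + 2] << 16)
                              | ((uint32_t)sb[INCOMPAT_OFFSET + 3] << 24);

            if (magic != 0xEF53) {
                fprintf(stderr, "not an ext2/3/4 superblock\n");
                return 1;
            }
            printf("extents feature: %s\n",
                   (incompat & INCOMPAT_EXTENTS) ? "enabled" : "disabled");
            return 0;
        }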

  • by crazyvas ( 853396 ) on Thursday January 14, 2010 @05:50PM (#30771460)
    They use fast replication techniques to restore disk servers (chunkservers in GFS terminology) when they fail.

    The failure could be because of a component failure, disk corruption, or even simply killing the process. The detection is done via checksumming (as opposed to fscking), which also takes care of detecting higher-level issues that fscking might miss.

    Yes, it is much cheaper for them to overwrite data from another replica (3 replicas per chunk is the default) using their fast re-replication techniques than to try to fsck.

    Check this paper out (see pdf link at bottom of page) under "Section 5: Fault Tolerance and Diagnosis" for more info:
    http://labs.google.com/papers/gfs.html [google.com]
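
    A rough C sketch of the checksum-instead-of-fsck idea (per the paper, each chunk is broken into 64 KB blocks, each with its own 32-bit checksum verified on read; the CRC routine below is a generic CRC-32 stand-in, not Google's code):

        #include <stdint.h>
        #include <stdio.h>

        /* Generic (reflected) CRC-32; stands in for "a 32-bit checksum". */
        static uint32_t crc32(const uint8_t *p, size_t n)
        {
            uint32_t c = 0xFFFFFFFFu;
            for (size_t i = 0; i < n; i++) {
                c ^= p[i];
                for (int k = 0; k < 8; k++)
                    c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
            }
            return ~c;
        }

        int main(void)
        {
            uint8_t block[32] = "chunk data";  /* stand-in for a 64 KB block */
            uint32_t stored = crc32(block, sizeof(block));

            block[3] ^= 0x01;                  /* simulate silent corruption */

            /* On mismatch a chunkserver reports the block bad and refetches
             * a clean copy from another replica -- no fsck involved. */
            if (crc32(block, sizeof(block)) != stored)
                puts("corrupt: re-replicate from a good replica");
            else
                puts("block ok");
            return 0;
        }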

  • Re:Well (Score:3, Informative)

    by Captain Splendid ( 673276 ) <capsplendid@@@gmail...com> on Thursday January 14, 2010 @06:07PM (#30771708) Homepage Journal
    Or, you could stop being lazy and go tweak your preferences, thereby saving the rest of us from your whining.
  • Re:It's Not Hans (Score:4, Informative)

    by diegocg ( 1680514 ) on Thursday January 14, 2010 @06:15PM (#30771790)

    Reiserfs has been undermaintained for a long time, AFAIK. When Hans started working on reiser4, he completely stopped adding needed features to v3. The reiserfs disk format may be good, but the codebase is outdated. Ext4 has an ancient disk format in many ways, but the codebase is scalable: it uses delayed allocation, the block allocator is solid, xattrs are fast, etc. Reiserfs still uses the BKL, the xattr support that SUSE added is said to be slow and not very pretty, it has had problems with error handling, etc.

  • by jjohnson ( 62583 ) on Thursday January 14, 2010 @06:21PM (#30771862) Homepage

    When you run data centres around the world that collectively form the most powerful supercomputer known to man, you too can get a front-page story on /. announcing your upgrade.

    Until then, STFU.

  • Re:GFS (Score:4, Informative)

    by joib ( 70841 ) on Thursday January 14, 2010 @06:26PM (#30771918)

    I believe GFS uses a local fs on each node to take care of, well, all the stuff that a normal local fs like ext3 does. GFS only does the distributed stuff on top of that.

  • Re:Not A Nerd? (Score:2, Informative)

    by jlund ( 73067 ) on Thursday January 14, 2010 @06:51PM (#30772240)

    Truthfully though, where the heck are the metadata-based filesystems that we were promised? I'd love to be able to, on a filesystem level, instantly pull up a folder view of all videos - or all images. Or all images of my dog. Or all images outdoors. Or all images of my dog outdoors.

    Basically, just the ability to organize via an arbitrary number of categorized tags.

    You must be referring to WinFS [wikipedia.org]... Oh wait, it never shipped, but it's "in development."
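
    Not a metadata filesystem, but a hedged C sketch of how such tagging can be approximated today with extended attributes on any xattr-capable filesystem (ext4, XFS, ...); the user.tags name and the filename are conventions invented here:

        #include <stdio.h>
        #include <string.h>
        #include <sys/types.h>
        #include <sys/xattr.h>

        int main(void)
        {
            const char *path = "dog-outdoors.jpg";  /* example file */
            const char *tags = "dog,outdoors";

            /* Attach a free-form tag list to the file itself. */
            if (setxattr(path, "user.tags", tags, strlen(tags), 0) != 0) {
                perror("setxattr");
                return 1;
            }

            char buf[256];
            ssize_t n = getxattr(path, "user.tags", buf, sizeof(buf) - 1);
            if (n < 0) {
                perror("getxattr");
                return 1;
            }
            buf[n] = '\0';
            printf("%s tagged: %s\n", path, buf);

            /* "All images of my dog outdoors" then becomes a scan (or an
             * index) over user.tags instead of a fixed folder hierarchy. */
            return 0;
        }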

  • by XaXXon ( 202882 ) <xaxxon.gmail@com> on Thursday January 14, 2010 @07:26PM (#30772678) Homepage

    Half life.

  • Re:Not A Nerd? (Score:3, Informative)

    by Archangel Michael ( 180766 ) on Thursday January 14, 2010 @07:32PM (#30772762) Journal

    Truthfully though, where the heck are the metadata-based filesystems that we were promised?

    I suspect that once we get over the BLOCK LEVEL DEVICE (BLD) paradigm, and into SSDs that are NOT mimicking BLD, we'll have something closer to what you want.

    The problem with moving from BLD is that we've been using block devices for so long that I'm not sure there is any good way to make the switch to straight linear addressing of memory for ALL storage.

    In fact, I suspect that our idea of "booting" is necessarily going to have to change, from a BLD bootstrap to just a memory move from slower to faster memory (SSD to RAM to L3, L2, and on-die cache).

    It is going to need a different way of looking at how we use storage, from near to far off, from slow to fast(er).

    We're gonna have to index memory somehow, and track the bits.

  • by mqduck ( 232646 ) <(ten.kcudqm) (ta) (kcudqm)> on Thursday January 14, 2010 @07:55PM (#30773060)

    Simply removing the second comma would make the sentence entirely correct:

    "In their benchmarking, EXT4 and XFS performed as impressively as each other."

    Adding "each" would make it a bit clearer, but the meaning is already obvious. I don't know why you think it has to be "THE other".

  • by tytso ( 63275 ) * on Thursday January 14, 2010 @08:11PM (#30773226) Homepage

    So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications that don't use fsync() will also lose information after a buggy proprietary Nvidia video driver crashes your machine, regardless of whether you are using XFS or ext4.

    If you are talking about the change to _ext3_ to use data=writeback, that was a change that Linus made, not me, and ext4 has always defaulted to data=ordered. Linus thought that since the vast majority of Linux machines are single-user desktop machines, the performance hit of data=ordered, which is designed to prevent exposure of uninitialized data blocks after a crash, wasn't worth it. I and other file system engineers disagreed, but Linus's kernel, Linus's rules. I pushed a patch to ext3 which makes the default a config option, and as far as I know the enterprise distros plan to use this config option to keep the defaults the same as before for ext3.

    Since it was my choice, I actually changed the defaults for ext4 to use barrier=1, which Andrew Morton vetoed for ext3 because, again, he didn't think it was worth the performance hit. But with ext4, the benefits of delayed allocation and extents are so vast that they completely dominate the performance hit of turning on write barriers. That is where most of the performance benefits of ext4 come from, and it is very much a huge step forward compared to ext3.

    So with respect, you don't know what you are talking about.

    -- Ted

  • by Ginger Unicorn ( 952287 ) on Thursday January 14, 2010 @09:46PM (#30774124)
    Presumably you're a troll, since the link that you gave explicitly states the following:
    1. It was a Firefox exploit, not a problem with Google's servers.
    2. Only 60 people were affected. "Mass email deletions" indeed.
  • Re:It's Not Hans (Score:3, Informative)

    by rwa2 ( 4391 ) * on Thursday January 14, 2010 @11:13PM (#30774800) Homepage Journal

    I'm still running reiser3, and probably holding out for reiser4... it's been confusing since the benchmarks for the next-gen fs's have been all over the place, but some look promising:
    http://www.debian-administration.org/articles/388#comment_127 [debian-adm...ration.org]

    I've always run software RAIDs to crank a bit more performance out of the slowest part of my system, and reiserfs3 has always worked better out of the box. I'd spent long hours tuning EXT3 stripe widths and directory indexes and stuff, and EXT3 always came out slower and more wasteful of space.

    Here's a handful of numbers from bonnie++ from my 4-disk raid10 (standard bonnie++ columns: sequential output per-char, block, and rewrite; sequential input per-char and block; random seeks; each as throughput plus %CPU, with the per-test latencies on the second row):

    EXT3fs: 4G 246 97% 61403 29% 39928 11% 1512 95% 166253 24% 525.3 10%
    Latency: 87699us 4739ms 644ms 54683us 69023us 302ms

    Reiser3: 4G 264 97% 65732 31% 44530 15% 1447 95% 164567 34% 557.9 18%
    Latency: 33368us 4201ms 4061ms 21967us 134ms 118ms

  • by tytso ( 63275 ) * on Friday January 15, 2010 @01:18AM (#30775536) Homepage

    >I mount these read-only in the interests of security, but that means, of course,
    >that I can't have journalling on them, which precludes the use of ext3 or 4.

    #1. you can mount ext3 file systems read-only. The journal doesn't preclude a ro mount.

    #2. ext4 supports running without a journal. Google engineers contributed that code to ext4 last year.

  • by tytso ( 63275 ) * on Friday January 15, 2010 @01:38AM (#30775636) Homepage

    What Soft Updates apparently does is assume that once the data is sent to the disk, it is safely on the disk. But that's not a true assumption!

    Journaling, and every other filesystem, has exactly the same problem. If consistency is required, YOU MUST DISABLE THE CACHE, unless it is battery-backed, or you are willing to depend on your UPS. This is the penalty we pay for devices which lie to the OS about flush operations and the like.

    Yes, there were, in the bad old days, devices which lied when the OS sent a flush cache command; in order to get a better WinBench score, they would cheat and not actually flush the cache. But that hasn't been true for quite a while, even for commodity desktop/laptop drives. It's quite easy to test; you just time how many single-sector writes, each followed by a cache flush command, you can send per second (see the sketch at the end of this comment). In practice, it won't be more than, oh, 50-60 write barriers per second. In general, if you use a reputable disk drive, it supports real cache flush commands. My personal favorites are Seagate Momentus drives for laptops, and I can testify to the fact that they all handle cache flush commands correctly; I have quite a collection and it's really not hard to test.

    The big difference between journalling and soft updates is we can batch potentially hundreds of metadata updates into a single journal transaction, and send down a single write barrier every few seconds. The journal commit is an all-or-nothing sort of thing, but that gives us reliability _and_ performance.

    The problem with soft updates is that the relative ordering of most (if not all) metadata writes is important, and putting a write barrier between each metadata operation is Slow And Painful. Yes, you can disable the write cache, but then you give up a huge amount of performance as a result. With journaling we get the performance benefits of the write cache, but we only have to pay the cost of enforcing write ordering through a barrier once every few seconds.

    Of course, there are workloads where soft updates plus a disabled write cache might be superior. If you have a very metadata-intensive workload that also happens to call fsync() between nearly every metadata operation, then it would probably do better than a physical block journalling solution that used barrier writes but ran with the write cache enabled. But in the general case, if you take a more normal workload where fsync()'s aren't happening _that_ often, and compare physical block journalling with a write cache and barrier ops against a Soft Updates approach with the write cache disabled, I'm pretty sure the physical block journalling approach will end up benchmarking better.
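
    A C sketch of the timing test described above (the device path is a placeholder; point it at a scratch disk and run as root, as it overwrites sector 0): a drive that honours flush commands will manage only a few dozen of these per second, while one that lies will report thousands:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <time.h>
        #include <unistd.h>

        int main(void)
        {
            /* WARNING: overwrites sector 0 of the target device. */
            int fd = open("/dev/sdX", O_WRONLY);  /* scratch disk only! */
            if (fd < 0) {
                perror("open");
                return 1;
            }

            char sector[512];
            memset(sector, 0xAB, sizeof(sector));

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);

            int ops = 0;
            double elapsed;
            do {
                if (pwrite(fd, sector, sizeof(sector), 0) != (ssize_t)sizeof(sector)) {
                    perror("pwrite");
                    return 1;
                }
                fsync(fd);  /* forces the drive to flush its write cache */
                ops++;
                clock_gettime(CLOCK_MONOTONIC, &t1);
                elapsed = (t1.tv_sec - t0.tv_sec)
                        + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            } while (elapsed < 5.0);

            printf("%.1f flushed writes/sec\n", ops / elapsed);
            close(fd);
            return 0;
        }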

  • Re:Ubuntu 9.10? (Score:4, Informative)

    by tytso ( 63275 ) * on Friday January 15, 2010 @02:03AM (#30775742) Homepage

    So Canonical has never reported this bug to LKML or to the linux-ext4 list as far as I am aware. No other distribution has complained about this > 512MB bug, either. The first I heard about it is when I scanned the Slashdot comments.

    Now that I know about it, I'll try to reproduce it with an upstream kernel. I'll note that in 9.04, Ubuntu had a bug which, as far as I know, must have been caused by their screwing up some patch backports. Only Ubuntu's kernel had a bug where rm'ing a large directory hierarchy would have a tendency to cause a hang; no one was able to reproduce it on an upstream kernel.

    I will say that I don't ever push patches to Linus without running them through the XFS QA test suite (which is now generalized enough that it can be used on a number of file systems other than just XFS). If it doesn't have a "write a 640 MB file and make sure it isn't corrupted" test, we can add one, and then all of the file systems which use the XFS QA test suite can benefit from it.

    (I was recently proselytizing the use of the XFS QA suite to some Reiserfs and BTRFS developers. The "competition" between file systems is really more of a fanboy/fangirl thing than anything at the developer level. In fact, Chris Mason, the head btrfs developer, has helped me with some tricky ext3/ext4 bugs, and in the past couple of years I've been encouraging various companies to donate engineering time to help work on btrfs. With the exception of Hans Reiser, who has in the past accused me of trying to actively sabotage his project --- not true as far as I'm concerned --- we are all a pretty friendly bunch and work together and help each other out as we can.)

  • by Jeff- ( 95113 ) on Friday January 15, 2010 @03:31AM (#30776116) Homepage

    There's a lot of misinformation in this thread about softupdates. I only have so much time to reply so I'll hit a few key points. I'm the author of journaling extensions to softupdates so I have some experience in this area.

    This notion that softupdates was so complex that it inhibited new features in ffs is bogus. I've seen it repeated a few times. There simply was not much pressure for these features, and the filesystem metadata did not support them until ufs2. The total amount of code dedicated to extended attributes in softupdates can't be more than 100 lines. ffs sees fewer features because we have fewer developers, period.

    Furthermore, softupdates is just a different approach. It is no more complex than journaling. When I review a sophisticated journaling implementation such as xfs I see more lines of code dedicated to journaling and transaction management than softupdates requires for dependency tracking. I have worked on a number of production filesystems and while softdep is definitely not trivial, neither were any of the others unless you compare to synchronous ufs. I think a lot of people who are familiar with COW and Journaling are looking at this unfairly because they already know another system and forget how long it took to become comfortable with it.

    In cpu benchmarks softdep costs more than async ffs, this is true. However, rollbacks are actually quite infrequent, because our buffer cache attempts to write buffers without dependencies first. Generally there are enough of those which satisfy dependencies on other buffers that you can keep the pipeline busy. Looking at the code size and depth in any modern filesystem, it's clear that a lot of cpu is involved. Are journal blocks not consuming memory? Is the transaction tracking free? Most dependency structures are quite small compared to generating a copy of a metadata block for a journal write.

    NetBSD abandoned softdep for something much simpler because they didn't have the resources to fix the bugs in it and they didn't incorporate fixes from FreeBSD. Their journaling implementation is similar to our gjournal which is mostly filesystem agnostic and does full block logging in a very simple fashion.

    The journaled filesystem project was started simply to get rid of fsck. I think this hybrid solution is very promising. It gives us a place to issue barriers which can affect arbitrary numbers of filesystem operations. The journal write overhead is much lower than with traditional journals.

    And regarding benchmarks: FreeBSD doesn't really have a comparably developed journaling filesystem to benchmark softdep against. I think it's unreasonable to compare Linux with ext4 to FreeBSD with ffs+softdep for purposes of evaluating the filesystem design. Too many other factors come into play.

    You can read more about softdep journaling at http://jeffr_tech.livejournal.com/

    Thanks,
    Jeff

  • by tytso ( 63275 ) * on Sunday January 17, 2010 @10:53AM (#30798274) Homepage

    > > So I'm not sure what you're talking about. If you're talking about delayed allocation, XFS has it too, and the same buggy applications...

    > Stop blaming the applications for a filesystem problem Ted. The excuse doesn't wash no matter how many times you use it, and no, XFS does not have it.

    > http://en.wikipedia.org/wiki/XFS#Delayed_allocation [wikipedia.org]

    Any other questions? At the very least the applications are non-portable in the sense that they were depending on behavior not guaranteed by POSIX. XFS, btrfs, ZFS, and many if not most modern file systems do delayed allocation. It's one of the basic file system tricks to improve performance.

  • by tytso ( 63275 ) * on Sunday January 17, 2010 @10:18PM (#30803638) Homepage

    So before I tried agitating for programmers to fix their buggy applications, I had already implemented the heuristic that XFS uses (if you truncate a file through a file descriptor, add an implicit fsync on the close of that fd), plus another heuristic of my own (if you rename on top of an existing file, fsync the source file of the rename). This was to work around buggy applications, and as you can see, ext4 does even more than XFS does.

    At the end of the day, though, the heuristic can sometimes get things wrong, and sometimes the heuristic will be too aggressive in forcing fsync()'s when it's not really necessary, which is why it's good to at least try to educate application programmers about something which even you agree shouldn't be a new thing.

    (For example, if you don't fsync, and you want to run your application on another OS, like say, Solaris, you will be very sad.)

    But it wasn't backside covering: although most people don't seem to realize it, FIRST I added the heuristics to work around the buggy code, and THEN I agitated for people to fix their d*mn code. But application programmers don't like being told that they are wrong, so this seems to be a case of "blame/shoot the messenger" --- with me having been cast into the role of the messenger.
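
    For reference, a minimal C sketch of the write/fsync/rename pattern being asked for (filenames are examples):

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        /* Write new contents to a temp file, fsync it, then rename it over
         * the original. The data reaches the disk before the rename, so a
         * crash leaves either the old file or the new one intact -- never
         * a zero-length file, with or without delayed allocation. */
        static int save_atomically(const char *path, const char *tmp,
                                   const char *data, size_t len)
        {
            int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return -1;

            if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                unlink(tmp);
                return -1;
            }
            close(fd);

            return rename(tmp, path);  /* atomic replace */
        }

        int main(void)
        {
            const char *text = "new config\n";
            if (save_atomically("app.conf", "app.conf.tmp",
                                text, strlen(text)) != 0) {
                perror("save");
                return 1;
            }
            return 0;
        }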
