A Look At the Workings of Google's Data Centers

Doofus brings us a CNet story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting: "'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
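
Taken at face value, those first-year numbers make hardware failure a daily routine rather than an exception. A quick back-of-the-envelope sketch in Python, assuming the roughly 1,800-machine cluster size that comes up in the comments below:

```python
# Rough arithmetic on Dean's first-year-per-cluster figures.
# The 1,800-machine cluster size is an assumption taken from the discussion below.
machines = 1800
machine_failures_first_year = 1000
days = 365

failures_per_day = machine_failures_first_year / days
p_machine_dies_today = machine_failures_first_year / (machines * days)

print(f"~{failures_per_day:.1f} machine failures per cluster per day")
print(f"~{p_machine_dies_today:.3%} chance a given machine dies on a given day")
```

That works out to between two and three dead machines per cluster per day, which is consistent with Dean's "something is going to die every day" remark.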
  • A surprisingly lengthy and revealing blog posting indeed. Quite informative and interesting.

    While Google uses ordinary hardware components for its servers ...
    I would like to point out that the networking details were vastly overlooked. Information about the servers is interesting, but when you're networking such a vast number of computers together, I would be more interested in a quick graphic of how the IP addresses are laid out over 'a typical' cluster of 1,800 machines.

    I understand distributed computing and I understand distributed searching. But the fact of the matter is that at some point at the top of the chain, you're usually transferring very large amounts of data--no matter how tall your 'network pyramid' is. The coding itself is no simple feat but I have heard rumors [gigaom.com] that Google was building their own 10-Gigabit ethernet switches since they couldn't find any on the market. You'll notice a lot of sites are just speculating [nyquistcapital.com] but it certainly is a nontrivial problem to network clusters of thousands of computers with more than 200,000 in the whole lot and not require some serious switch/hub/networking hardware to back it.
    • by magarity ( 164372 ) on Saturday May 31, 2008 @09:32AM (#23609099)
      a quick graphic of how the IP addresses are laid out over 'a typical' cluster of 1,800 machines
       
      I'll bet they don't mess with TCP/IP - that's way too slow and bulky. Think InfiniBand or some other switched fabric instead of hierarchical.
      • Re: (Score:3, Interesting)

        by arktemplar ( 1060050 )
        Agreed, but their interconnect topology is what should be interesting, not just the hardware. After all, with simple topologies there is a limit to how efficiently things scale. I have been doing some work on parallel processing for supercomputers as my undergrad thesis, and believe me, the major thing that differs among the top 100 or so supercomputers is their interconnect topology, not just their hardware.

        Also, their search algorithm is based on eigenvalues, I think - a very, very profitable algorithm to parallelize.
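
For context on that remark: the commenter is presumably referring to PageRank, which is the dominant eigenvector of the (damped) link matrix and is classically computed by power iteration. A toy sketch on a made-up four-page graph, not Google's implementation:

```python
# Power iteration for PageRank on an invented 4-page web graph.
# Damping factor 0.85 is the value from the original PageRank paper.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}   # page -> pages it links to
n, d = len(links), 0.85

rank = [1.0 / n] * n
for _ in range(50):                               # iterate until it settles
    new = [(1 - d) / n] * n
    for page, outs in links.items():
        share = rank[page] / len(outs)
        for target in outs:
            new[target] += d * share
    rank = new

print([round(r, 3) for r in rank])   # pages with several inbound links (0 and 2) dominate
```

The iteration itself is embarrassingly parallel over pages, which is presumably what the parent means by "profitable to parallelize".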
      • Re: (Score:3, Informative)

        by agristin ( 750854 )
        No, they use gigabit and 10G Ethernet. InfiniBand is the opposite of cheap commodity hardware: it's expensive per port and not commodity.

        Google has a two-vendor policy; I know some of their network gear for gig-e and 10G-e is Force10. Google and Force10 are both involved in 802.3ba (40G and 100G): Force10 is on the IEEE committee and Google is one of the customers with demand - they may have a seat on the committee, but I don't know all the members.
      • Re: (Score:3, Insightful)

        by Anonymous Coward
        Bwaahhahhahah. Are you kidding?

        1) TCP/IP isn't really slow and bulky. It's one of the best protocols ever designed. With only minimal enhancements to the original protocol as designed, a modern host can achieve nearly line speed 10Gbit with pretty minimal CPU. We can push 900+Mbyte/sec from a single host. If you need more bandwidth, then do channel bonding.

        2) InfiniBand? That costs at least $250-500 per node, plus more for switches. Google is not going to spend that kind of money for the limited benefits.
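
The 900+ MB/s figure is consistent with what the protocol overhead allows. A small sketch of the arithmetic, assuming a standard 1500-byte MTU and no TCP options:

```python
# Back-of-the-envelope: theoretical TCP payload rate on 10 Gigabit Ethernet
# with a standard 1500-byte MTU (no jumbo frames, no TCP options).
LINK_BPS = 10e9                      # 10 Gbit/s line rate
MTU = 1500                           # IP packet size
ETH_OVERHEAD = 8 + 14 + 4 + 12       # preamble + header + FCS + inter-frame gap
IP_TCP_HEADERS = 20 + 20             # IPv4 + TCP headers

payload = MTU - IP_TCP_HEADERS
wire_bytes = MTU + ETH_OVERHEAD
efficiency = payload / wire_bytes

goodput_mb = LINK_BPS / 8 * efficiency / 1e6
print(f"efficiency ~{efficiency:.1%}, max goodput ~{goodput_mb:.0f} MB/s")
# -> roughly 95% efficiency, ~1187 MB/s of payload, so 900+ MB/s per host
#    is well within what the protocol itself permits.
```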
        • Re: (Score:3, Interesting)

          by dfj225 ( 587560 )
          TCP is slow and bulky if dropped packets are a very rare thing. Confirming delivery of every packet results in a lot of wasted communication for the vast majority.

          My guess is that they use something else for internal communication. You can always recover from errors at the application level instead of forcing every packet to be confirmed.

          TCP is great for general communication over the Internet and not so great for specialized cases where performance is important, like at Google.
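
As a sketch of what "recover from errors at the application level" can look like, here is a minimal UDP request with application-side retry. The address, port and message format are invented for illustration; nothing here is claimed about what Google actually runs.

```python
# Sketch of pushing reliability up to the application: fire a UDP request
# and retry on timeout instead of relying on TCP's per-connection machinery.
import socket

def udp_request(payload: bytes, addr=("10.0.0.1", 9999), retries=3, timeout=0.2):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for attempt in range(retries):
            sock.sendto(payload, addr)
            try:
                reply, _ = sock.recvfrom(65535)
                return reply                 # got an answer, done
            except socket.timeout:
                continue                     # lost request or reply; just resend
        raise TimeoutError(f"no reply after {retries} attempts")
    finally:
        sock.close()
```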
    • Re: (Score:3, Interesting)

      by dodobh ( 65811 )
      AFAIK, Google uses Force10 switches for the networking infrastructure. Details are confidential though. I learnt this from the Force10 sales guy convincing me to buy their hardware.
      • by kd5ujz ( 640580 )
        I have some DoD-level Linksys routers I will sell you. Details about their use in government infrastructure are confidential, but I can assure you they use them. :P
        • by dodobh ( 65811 )
          Given that the guy is asking me to contact Google as a reference, I suspect that he isn't lying.

          From http://code.google.com/soc/2008/freebsd/about.html [google.com] :
          Relevance to Google : Google has many tens of thousands of FreeBSD-based devices helping to run its production networks (Juniper, Force10, NetApp, etc..), MacOS X laptops, and the occasional FreeBSD network monitoring or test server.
      • Is this satire? I honestly can't tell. I hope so.
    • Re: (Score:3, Informative)

      Here's what they used [c63.be] in 1998... A Wikipedia article explains a bit of what they're doing now [wikipedia.org]...
    • While Google uses ordinary hardware components for its servers...
      I thought that was an interesting quote as well, but for different reasons. Once I read about the failure rates I thought maybe the vendor wouldn't enjoy being mentioned. But sure enough, there it was: good ol' Intel. I just wish they would have specified a bit more as to what they consider "ordinary hardware".
    • I've done a couple of clusters of 2,200 machines per cluster (small for Google). I'd bet Google does geographic IP addressing, using the RFC1918 10.0.0.0/8 network. We did. With 40 or 80 servers in a rack we drew L3 boundaries pretty easily at every rack or so. Since L3 switching at the edge is cheap and fast, solves scaling at L2, and L3 routing protocols have quick, predictable ways to route around failure, it was easy to aggregate. If you can subnet and supernet, you too can build huge networks for clusters (a rough sketch of that kind of layout follows below).
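
A rough illustration of the kind of layout the parent describes, using Python's ipaddress module. The per-rack prefix length and rack counts are assumptions for the example, not anything disclosed by Google:

```python
# Hypothetical addressing sketch: carve RFC1918 10.0.0.0/8 into per-rack /25s
# (126 usable addresses, enough for 40-80 machines plus switch/management
# interfaces), with an L3 boundary at each rack.
import ipaddress

site = ipaddress.ip_network("10.0.0.0/8")
racks_per_cluster = 45                       # ~1,800 machines / 40 per rack (assumed)

rack_subnets = site.subnets(new_prefix=25)   # generator over all /25s in the /8
cluster = [next(rack_subnets) for _ in range(racks_per_cluster)]

for i, net in enumerate(cluster[:3]):
    hosts = list(net.hosts())
    print(f"rack {i}: {net}  gateway {hosts[0]}  first host {hosts[1]}")
```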
    • by inKubus ( 199753 )
      Yeah, but they just have lots of load-balancing. Their index is just a huge lookup table (inverted index). Instead of actually asking the database for something like you might be thinking, they use a column-oriented database to store a lookup table of pages. So basically every search is two fast operations (find the word (search term), then go to the exact location in the DB and return the page), whereas actually building the lookup table takes forever. That's the genius of it (although it's 10 year o
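
A toy version of that build-once/look-up-fast split, with made-up documents; real inverted indexes add compression, ranking and sharding on top:

```python
# Toy inverted index: build once (the slow part), query by direct lookup (fast).
from collections import defaultdict

docs = {
    1: "google data center hardware fails every day",
    2: "software provides the reliability the hardware lacks",
    3: "cheap hardware plus fault tolerant software",
}

# Build phase: word -> set of document ids (the expensive part).
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(term):
    # Query phase: one dictionary lookup, then fetch the postings list.
    return sorted(index.get(term, ()))

print(search("hardware"))   # [1, 2, 3]
print(search("software"))   # [2, 3]
```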
  • by Anonymous Coward
    At what point is skimping on hardware because the system is failure tolerant costlier than using more reliable hardware?
    • by Vectronic ( 1221470 ) on Saturday May 31, 2008 @08:25AM (#23608801)
      Interesting, but I would probably venture a guess: never.

      Unless of course you are talking about P2s and ISA cards. And it's not really a matter of "reliability", I don't think - it could easily be argued that a $200 [component] is just as reliable as a $500 [component]. I think mostly what they are doing is buying three of something cheaper instead of one of something greater.

      Component A: cheaper, less cutting edge (generally more reliable).

      Component B: has 3 times the power, 3 times the load, costs 3 times as much.

      If a single component A fails, there are still two running (depending on the component), so a 33% loss in performance and a third of the total cost to replace (making it something like a sixth of the cost compared to component B).

      If component B fails: 100% loss, complete downtime, 100% expense (relatively).
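
The trade-off sketched above, with illustrative numbers (the prices and the failure probability are invented, and independent failures are assumed):

```python
# Option A: three cheap units at 1/3 the capacity each.
# Option B: one big unit with 3x the capacity, for the same total spend.
unit_cost_a, units_a = 200, 3
unit_cost_b, units_b = 600, 1
p_down = 0.05                     # assumed chance any one unit is down

print("same spend:", unit_cost_a * units_a == unit_cost_b * units_b)   # True

# Capacity left after a single failure:
print("A after one failure:", (units_a - 1) / units_a)   # 0.67 -> a 33% hit
print("B after one failure:", 0.0)                        # total outage

# Probability of a complete outage (all units down at once):
print("A total outage:", p_down ** units_a)               # 1.25e-4
print("B total outage:", p_down ** units_b)               # 5e-2
```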
      • by The Second Horseman ( 121958 ) on Saturday May 31, 2008 @09:00AM (#23608945)
        It depends on the kind of applications you're running. Google is something of a singular case. A lot of businesses need to run a lot of small servers for dissimilar applications, not similar ones. If you're talking about business apps that don't play well together on a single server and you virtualize them, you can get a pair of 8-core servers (something like an HP Proliant DL380 G5) with an extra NIC, fibre channel HBA and 32 GB of RAM, plus local SAS drives.

        You can easily run a dozen large VMs on one of those with room to spare (assuming some of them have 2GB or 3GB of RAM allocated to them). If you limit it to ten per box, that's twenty VMs, and you can migrate servers between them or fail them over in case of a fault. Those DL380s (if you have dynamic power savings turned on) can average under 400 watts of power draw each - so 40 watts per server. In our environment, we've got 5 hosts running a ton of VMs, some of which don't have to fail over (layer 4-7 switch, also a VM), so we're getting closer to 25 or 30 watts per VM. We'd have the SAN array anyway for our primary data storage, so that wasn't much of an extra. We're using fewer data center network ports, and fewer fibre channel ports. We've actually been able to triple the number of "servers" we're running while bringing energy use down as we've retired more older servers and replaced them with VMs. And it's been a net increase in fault tolerance as well.
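
The power arithmetic from that comment, spelled out; the VMs-per-host figure in the second half is an assumption chosen to land in the quoted 25-30 W range:

```python
# Consolidation arithmetic using the comment's own figures.
host_watts = 400            # average draw per virtualization host
vms_per_host = 10
print(host_watts / vms_per_host, "W per VM")       # 40.0 W per VM

# With denser packing, per-VM draw falls further; 14 VMs/host is assumed here.
hosts, total_vms = 5, 5 * 14
print(hosts * host_watts / total_vms, "W per VM")  # ~28.6 W per VM
```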

        • Re: (Score:3, Insightful)

          by jacobsm ( 661831 )
          First let me state that I'm a mainframe systems programmer and a true believer of this technology. IMHO Google should start looking at mainframe based virtualization instead of the server farms they currently depend on.

          One z10 complex with 64 CPUs and 1.5 TB of memory can support thousands of Linux instances, all communicating with each other using HiperSockets technology. HiperSockets uses microcode to enable communication between environments without going out to the actual network.

          A z10 processor complex is as
          • Isn't a single box running thousands of virtual environments, which are then running clustering software, just a tad redundant?

            Anyway, it's far cheaper and better bang for the buck for Google to use cheap, nasty hardware than your exotic stuff.
            Remember that even if they did use what you're suggesting, they'd still need thousands of them.
          • by SuperQ ( 431 ) *
            So how much does that z10 cost? What's the physical footprint? 1.5T of RAM is a good target for comparison.

            HP DL160 G5: $6672 USD
            Low-power dual quad-core, 4x 500GB disks, 32GB RAM.

            Say Google gets a good discount for quantity, maybe 25%.. $5000 each.

            That seems like a simple enough commodity server these days.. A rack of 40 machines would come out to $200,000 USD, add another $50k for misc stuff and switching gear (rack and core)

            Each rack now has 1.25TB of RAM, 320 cores, 80TB of disk (who needs FC or iSCSI
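
The rack totals above check out; here is the same arithmetic written down, using only the figures from that comment:

```python
# The rack math from the comment above, spelled out (all figures theirs).
machines = 40
price_each = 5000            # assumed ~25% quantity discount off $6,672
ram_gb, cores, disk_tb = 32, 8, 4 * 0.5

print("servers:", machines * price_each)             # $200,000
print("with rack/switching:", machines * price_each + 50_000)
print("RAM (TB):", machines * ram_gb / 1024)         # 1.25 TB
print("cores:", machines * cores)                    # 320
print("disk (TB):", machines * disk_tb)              # 80 TB
```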
            • by drsmithy ( 35869 )

              Say Google gets a good discount for quantity, maybe 25%.. $5000 each.

              They'll get a lot more than that. Heck, we typically get 25% and we're nobody - we maybe buy 50-60 servers a year from Dell.

      • You forgot to incorporate the lifetime of components into your calculation. All else being equal, a longer-lived but more expensive component should work out cheaper in the long run and/or at large scale.
      • by Znork ( 31774 ) on Saturday May 31, 2008 @09:58AM (#23609249)
        I think mostly what they are doing, is buying 3 of something cheaper, instead of one of something greater.

        From what it looks like they're doing exactly what I do for myself; skip the extraneous crap and simply rack motherboards as they are.

        In that case we're not talking 3 of something cheaper; you could probably get up towards 5-10 of something cheaper. Then consider that best price/performance is not generally what is bought, and the difference is even wider.

         Of course, it's not going to happen in the average corporation, where most involved parties prefer covering their ass by buying conventional branded products. Point out to your average corporate purchaser or technical director that you could reduce CPU cycle costs to 1/25th, and that you could provide storage at 1/100th of the current per-gigabyte cost, and they'll whine 'but we're an _enterprise_, we can't buy consumer-grade stuff or build it ourselves'.

        Ten years ago people brought obsolete junk from work home to play with. These days I'm considering bringing obsolete stuff from home to work because the stuff I throw out is often better than low-prioritized things at work.
        • by mrbooze ( 49713 )

          Of course, it's not going to happen in the average corporation, where most involved parties prefer covering their ass by buying conventional branded products.

          It's not *just* ass-covering, although there's definitely some of that. Average corporations also do not *remotely* employ enough IT staff to be doing the sort of constant maintenance and replacement that Google is doing, not to mention the engineers doing testing and design of the specialized architecture, etc. And IT is often one of the first groups up against the wall when it's time to shore up numbers for the fiscal year.

          I've worked with managers who believed very much in the commodity hardware philo

    • It's a lot easier and cheaper to make failure-tolerant software if you're looking at system functionality on a cluster/datacentre level than it is to ensure all your hardware is bulletproof.
      Hardware will fail - it's up to the intelligence of the overlaid systems to mitigate that.
      • Re: (Score:3, Insightful)

        by Anpheus ( 908711 )
        You're also paying through the nose for every extra nine of uptime.

        That's not to say it's impossible, IBM, HP, any of the "big iron" companies can offer you damn near 100% uptime without major changes to your software.

        But be prepared to pull out the checkbook. You know, the REALLY BIG one that is only suitable for writing lots of zeroes and grand prize giveaways.
    • by dotancohen ( 1015143 ) on Saturday May 31, 2008 @08:53AM (#23608905) Homepage

      At what point is skimping on hardware because the system is failure tolerant costlier than using more reliable hardware?
      Google is not skimping on hardware. They are simply not trusting hardware to be reliable. Actually, they are buying twice as much hardware as they would otherwise need, according to TFA. Er, not that I read it or anything, I swear,....
      • Actually, they are buying twice as much hardware as they would otherwise need, according to TFA. Er, not that I read it or anything, I swear,....

        Don't worry, your secret is safe with us.

        Real Slashdotters not only fail to read TFAs, but they also completely miss any and all relevant information in other people's posts.
        Therefore, someone may latch onto your claim that Google is not skimping on hardware and try to argue that they, in fact, do. Your admission to having read TFA will go completely unnoticed.

        And before you ask yourself how come I noticed it: I didn't.
        And besides, I'm new here.

        • by enoz ( 1181117 )

          And besides, I'm new here.
          Lying about your age for karma? I've seen your 6-digit UID.
      • by SpinyNorman ( 33776 ) on Saturday May 31, 2008 @09:12AM (#23609005)
        You could say that Google is taking advantage of the fact that hardware is unreliable to reduce cost.

        With server farms the size of Google's, failures are going to occur daily regardless of how "fault-tolerant" your hardware is. Nothing is 100% failure free. Given that failures will occur, you need fault tolerance in your software, and if your software is fault tolerant, then why waste money on overpriced "fault-tolerant" hardware? If you can buy N cheapo servers for the price of one hardened one, then you'll typically have N times the CPU power available, and the software makes both setups look equally reliable.
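
One way to see why the trade works, with invented uptime figures and the optimistic assumption that the cheap servers fail independently:

```python
# Illustrative only: shared power/cooling events violate the independence
# assumption in practice, which is exactly why Dean's failure list matters.
p_cheap_down = 0.01        # 99% uptime per cheap server (assumed)
p_hardened_down = 1e-5     # 99.999% uptime for the hardened box (assumed)
n = 5                      # cheap servers bought for the same money

p_all_cheap_down = p_cheap_down ** n   # the service only dies if all N are down
print(f"hardened box unavailable:  {p_hardened_down:.0e}")
print(f"cheap cluster unavailable: {p_all_cheap_down:.0e}")   # 1e-10
# ...and the N cheap boxes deliver roughly N times the aggregate CPU.
```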
    • by TheRaven64 ( 641858 ) on Saturday May 31, 2008 @09:21AM (#23609049) Journal
      It depends on how much downtime costs you. If Google is down for five seconds, no one will notice - they will just assume that their link is slow, blame their ISP, and hit refresh. If a telecom's billing system or a bank's transactional system is down for five seconds then they are likely to lose a lot of money. The only difference between doing this kind of thing in hardware and software is the fail-over time and the cost. Google take a slower fail-over time in exchange for lower costs. For them and for 99.9% of businesses, it makes perfect sense. The remaining 0.1% are the reason IBM's mainframe division is so profitable.
  • Hard drive failures (Score:2, Interesting)

    by pacroon ( 846604 )
    When looking at it on that massive scale, you really get the idea of just how fragile a hard drive really is. I wonder how much money the new generations of data storage are going to cost for large corporations like Google. And not to mention how existing corporations will handle it once those devices go from "supercomputer" territory to mainstream hardware.
    • Probably close to the same; if anything, probably cheaper.

      I would imagine that Google wouldn't adopt SSDs until they were financially viable, which probably won't be too long. They will be about the same price per GB as HDDs, and eventually cheaper, making for greater profit for as long as HDDs are being sold (a 200GB HDD costs $50, 6 months later a 200GB HDD costs $10, etc.).

      Then, if SSDs are more reliable and the same price, that's also less expense.
      • by Firehed ( 942385 )
        Also consider the power and heat output of SSDs as compared to spinning disks. SSDs tend to have lower power requirements (which adds up very quickly when you're dealing with tens of thousands of machines) and as such tend to put out less heat (meaning the HVACs won't get reamed as hard, and should therefore also use less power). Assuming the reliability is decent, the additional premium per drive will probably pay for itself rather quickly considering the drop in price of operating the system as a result
      • by Thing 1 ( 178996 )

        Then, if SSDs are more reliable and the same price, thats also less expense.

        Actually, I'd say at twice the price, SSDs would be less expensive over their lifetime. (I'm not sure where the break-even point is, but Seagate warrants for 5 years, and most flash media has a 50-year average write cycle, so 10x probably isn't far off? I'll stick with 2x for my argument though.)
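
A crude lifetime-cost comparison along the lines of the parent's "twice the price" scenario; every number here is an assumed illustration, not a period-accurate quote:

```python
# Lifetime cost = purchase price (including replacements) + electricity.
kwh_price = 0.10          # $/kWh (assumed)
years = 5
hours = years * 365 * 24

def lifetime_cost(drive_price, watts, replacements=0):
    energy = watts * hours / 1000 * kwh_price
    return drive_price * (1 + replacements) + energy

hdd = lifetime_cost(drive_price=50, watts=8, replacements=1)  # assume one swap
ssd = lifetime_cost(drive_price=100, watts=1)
print(f"HDD over {years}y: ${hdd:.0f}   SSD: ${ssd:.0f}")
# With these assumptions the 2x-priced SSD already comes out ahead,
# before counting the reduced cooling load mentioned above.
```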

    • Comment removed based on user account deletion
      • by jimicus ( 737525 )

        Indeed and if you have 200 thousand servers running, they must be employing at least a couple of dozen people to run around hot datacenters all day and replace hard drives. Neither the hard drives nor the people will be cheap.

        By all accounts, they don't bother with individual machine repairs. A dead rack might get repaired or replaced, but an individual node will simply be marked as dead and left there. The rack itself will get maintenance as and when it no longer has enough functioning servers to merit keeping it going.

    • When looking at it on that massive scale, you really get the idea of just how fragile a hard drive really is.

      Less than you might think from the summary; reading further down the article you find: "The company has a small number of server configurations, some with a lot of hard drives and some with few".
  • by throatmonster ( 147275 ) on Saturday May 31, 2008 @08:36AM (#23608857)
    The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.
    • by William Robinson ( 875390 ) on Saturday May 31, 2008 @08:48AM (#23608889)

      The hardware failures I can understand, but needing to rewire the data center after it's been wired once, and the fact that half of them overheat? Those sound like problems that should be addressed in the engineering and installation phases of the datacenter.

      Each machine has a smoke detector installed right on top of it. The maintenance director stands at the gate of the data center with a pistol in both hands. As soon as the alarm sounds, a batch of maintenance engineers rushes towards the faulty machine with keyboard, hard disk, mouse, motherboard and other components. The faulty components are replaced to the rhythm of drumbeats they have rehearsed thousands of times. The crew has to rewire the machine, reboot, and be back at the gate with the burnt machine in less than 5 minutes, or they are shot dead.

      The trouble is, because of this time limit, the maintenance engineers simply pull the machine out of the rack without disconnecting any wires. And that's why rewiring is needed.

      • by tenco ( 773732 )
        Wish I had mod points. Thanks for saving my day :)
      • by inKubus ( 199753 )
        This is funny. But, there are rumors that they might be implementing robotic...overlords...to swap servers automatically.
      • by hoggoth ( 414195 )
        Thanks! I just rolled on the floor and laughed, then drank milk and forced it out my nose onto my brand new keyboard. In my parents' basement. With a real-doll covered in hot grits.
    • Re: (Score:2, Interesting)

      by mdenham ( 747985 )
      Yeah, the overheating part could be solved by investing in more racks, and then putting half as many units on each rack.

      This also allows for future throughput improvements from a single unit, and probably would cost less than the two days' downtime every overheat (racks are relatively cheap, time isn't).
      • Racks are cheap, but have you ever seen the bill for floor space?

        From what I read, Google uses simple desktop computers.
        These machines are designed on the assumption that they'll sit idle 99.9% of the time. If you ramp up the load on such a machine, things start to get real noisy real quick, and if you keep them at such a high load for a long time, they simply break. (IBM NetVista comes to mind...)

        Trouble is, buying machines designed with such a load in mind costs twice as much and the f
    • The overheating definitely surprised me, especially 50%... but they weren't very explicit about what exactly "rewiring" entailed. Perhaps it's simply a preventative measure: instead of waiting for it to fail, just replace it all yearly, which I would imagine wouldn't be too costly when you are buying miles of wire wholesale, and it can all be done machine by machine and only takes "2 days". Or maybe it just means disconnecting and re-wiring, like stepping (disconnect, move from Connection A to B) the connecti
    • Re: (Score:3, Informative)

      by SuperQ ( 431 ) *
      The problem comes from requirements changing. "Sorry, we designed this building for X load, now you're using X+10% load so we have to add additional cooling units to keep up"

      I had this problem at the University where I worked a while ago. We rolled in a nice new SGI Altix machine. We had enough power, but the cooling system couldn't move enough cubic feet of air into the one part of the room where the box was. As soon as you reach capacity, temps skyrocket.
  • by Enleth ( 947766 ) <enleth@enleth.com> on Saturday May 31, 2008 @08:51AM (#23608899) Homepage
    I've been managing a dorm network consisting of two "servers" (routing, PPPoE, some services like network printing, etc.), a single industrial rack-mounted switch and dozens of consumer switches spread all over the building.

    And they failed. And then they failed again. And again. Sometimes completely, but usually just a single port, or just "a bit" - it looked as if the switch was working, but every packet (or every n-th packet, or every packet bigger than x) got mangled, misdirected or whatever. Or sometimes packets appeared just out of the blue (probably partial leftovers from the cache) and a few of them made enough sense to be received and reported. Sometimes a switch with no network cables attached to it started blinking its lights - sometimes on two ports, sometimes just on a single one.

    Well, I could go on for hours, but you get the idea. What happens at Google happens everywhere, they just have some nice numbers.

    Regardless, the article is quite entertaining to read for a networking geek ;)
    • Re: (Score:3, Informative)

      No it isn't. If a machine has a 90% chance of working flawlessly for ten years and you only have one, odds are everything will always work. If you have ten, odds are one will die within that ten years. Things are different at large scale, and failure prediction is an important part of creating such a big cluster. But yes, even on a small scale you should always plan for failure.
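
The parent's odds, worked out; the 90%-per-decade survival figure is theirs, and the 1,800-machine count is borrowed from the cluster size mentioned earlier in the thread:

```python
# Probability that at least one machine dies within the decade.
p_survives_decade = 0.90

for machines in (1, 10, 1800):
    p_at_least_one_dies = 1 - p_survives_decade ** machines
    print(f"{machines:>5} machines: {p_at_least_one_dies:.3f}")
# 1 -> 0.100, 10 -> 0.651, 1800 -> effectively 1.000
```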
    • by Bender0x7D1 ( 536254 ) on Saturday May 31, 2008 @11:39AM (#23609871)

      Sounds like you have dust in your cables. I would recommend you clean the inside of your cables with compressed air so the bits don't get stuck on the lint and other stuff in there. The bits travel very fast, so even small dust particles can be a problem.

  • by howardd21 ( 1001567 ) on Saturday May 31, 2008 @08:54AM (#23608907) Homepage
    The fact that they attribute success to the software did not surprise me; the chunk-and-shard approach (not mentioned in the article) has been known for some time. But the fact that the GFS architecture works with BigTable and MapReduce was interesting, and that it handles many data/content types. What this creates is not only a structure that scales in volume and size, but also a sustainable business model. As new content types are added, regardless of size or type, they can generally be indexed appropriately. I am looking forward to searching more within types like video and audio, or even medical records like X-rays or MRI results. The possibilities are staggering.
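
For readers unfamiliar with the programming model mentioned there, here is a single-process toy word count in the MapReduce style. It only illustrates the map/shuffle/reduce shape, not GFS, BigTable or Google's actual framework:

```python
# Toy MapReduce-style word count over made-up documents.
from collections import defaultdict
from itertools import chain

documents = ["cheap hardware fails", "software hides the failures",
             "cheap software wins"]

def map_phase(doc):
    # Emit (key, value) pairs; here: (word, 1) for each word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all values by key, as the framework would do between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

grouped = shuffle(chain.from_iterable(map_phase(d) for d in documents))
counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts["cheap"], counts["software"])   # 2 2
```

In the real system the map and reduce functions run on thousands of the cheap machines discussed above, and the framework reschedules work when one of them dies.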
  • Hardware is cheap (Score:4, Interesting)

    by Ritz_Just_Ritz ( 883997 ) on Saturday May 31, 2008 @09:09AM (#23608989)
    It's always going to be cheaper to use anthill labor on this type of problem. Even relatively powerful 1RU and .5RU servers are dirt cheap these days. Hell, I was able to buy a pile of .5RU machines for one of my projects this week. I can't believe how cheap things have gotten:

    quad-core xeon @2.66ghz
    4gb RAM
    2 x 500gig barracudas (RAID1)
    dual gigabit ether
    CentOS 5.1
    US$1100 per unit

    They are all stashed behind a Foundry ServerIron to load balance the cluster. So far, it seems to scale VERY well and increasing capacity is as simple as tossing another US$1k server on the pile.

    Cheers,
    • Would you mind letting me in on where you bought this stuff?
    • Just for comparison, where I work we bought a SuperMicro with the same specs about two years ago for about $3500. This was just as the new Xeons and ICHs were coming out, so that number is a bit higher than it would have been if we had waited a couple of months.
  • Seems that the whole server/complex monitoring aspect was left off. With 100K servers per complex, how do they even know which ones are broken? How do they even find them on the floor and in the racks?
    • Re: (Score:3, Informative)

      by Gazzonyx ( 982402 )
      Most rackmounts that I've seen have an 'identify' LED that you can have blink (I assume you can automate this with SNMP and management software).
    • Well, there are two ways (that my employer uses - I'd guess Google does the same):

      Most server systems worth their salt have fault indicators that turn on when there is a hardware failure or perhaps even a watchdog timeout. Probably they do periodic walk-throughs to look for fault lights.

      The second is proper asset management. A machine identified as broken has a record in an asset database that describes the location of the machine, the location being something like (data center, row, rack, RU up from the
      • by inKubus ( 199753 )
        Looking at the "snapshot", I see some twisted pair running out of the Mobos in their "rack". Perhaps you could wire up all the fault light outputs on a rack of mobos to a simple multiple-and logic circuit which has a big siren light connected to it with a relay. Then you know what Rack to go to, after that just find the light.
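
A minimal sketch of the asset-record idea from this sub-thread, mapping a host to its physical location so a dead node can be found on the floor; all field names and values are invented for illustration:

```python
# Hypothetical asset database record and lookup.
from dataclasses import dataclass

@dataclass(frozen=True)
class AssetRecord:
    hostname: str
    datacenter: str
    row: int
    rack: int
    rack_unit: int        # RU counted up from the bottom of the rack

inventory = {
    "node-4217": AssetRecord("node-4217", "dc-or-1", row=12, rack=7, rack_unit=31),
}

def locate(hostname):
    r = inventory[hostname]
    return f"{r.datacenter} / row {r.row} / rack {r.rack} / RU {r.rack_unit}"

print(locate("node-4217"))
```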
  • Open Source as a business model scales very well, especially if your technology is so tightly integrated with your business goals. No matter how small you start out, your business is never going to be limited by an Open Source platform, because by the time you're big enough to have those problems no one, except yourself, is going to be able to help you anyway.

  • I look at their data farm - and its complexity.. and cannot help but wonder how Google has organized itself for their thousands of employees to properly maintain it. No one employee can know every piece of it - or is it simply so simple that every employee knows all of it?

    Then.. we realize that our own lifespans and lives are as prone to failure as the servers in their datacenters. Our lifespans are short and everyone has problems.. So Google has mastered the ability to make us interchangeable.

    WE ARE ANTS!
    • What makes Google's datacenters different is that they are ALL the same.

      It's not like a typical datacenter where cluster X is for ESX Server, Y is for the financial system, Z is Win2k3, and Q is AIX. Every unit in a Google rack is just another piece of typical hardware running the same OS, the same software, and configured the same way. I suspect there may be some sort of 'controller node' for some number of worker machines, but even then, each controller node is just like any other controller node.

      Each machin
      • I suspect there may be some sort of 'controller node' for some number of worker machines, but even then, each controller node is just like another controller node.

        ...and suddenly I'm wondering if slashdot shouldn't 'borgify' the Google logo instead of Bill...
  • by swb ( 14022 ) on Saturday May 31, 2008 @12:33PM (#23610307)
    He was my friend in high school and roommate in college for a year. Smartest guy I've ever met in my life, easily smarter than any other PhDs I've known, including people I know with Harvard post-med school doctorates.

  • This whole article is a persuasive argument for mainframes. Continually servicing cheap hardware that fails is labor-expensive, and replacing all those failed components costs real money. They're only cheap the first time you buy them. I don't see where they're saving money, though the guy does sound like he's having a ball directing armies of lackeys all through these vast, twisty datacenters.
