A Look At the Workings of Google's Data Centers
Doofus brings us a CNET story about a discussion from Google's Jeff Dean spotlighting some of the inner workings of the search giant's massive data centers. Quoting:
"'Our view is it's better to have twice as much hardware that's not as reliable than half as much that's more reliable,' Dean said. 'You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day.' Bringing a new cluster online shows just how fallible hardware is, Dean said. In each cluster's first year, it's typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will "go wonky," with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span, Dean said. And there's about a 50 percent chance that the cluster will overheat, taking down most of the servers in less than 5 minutes and taking 1 to 2 days to recover."
And the Network That Connects These Clusters? (Score:5, Insightful)
I understand distributed computing and I understand distributed searching. But the fact of the matter is that at some point near the top of the chain you're transferring very large amounts of data--no matter how tall your 'network pyramid' is. The coding itself is no simple feat, but I have heard rumors [gigaom.com] that Google was building their own 10-Gigabit Ethernet switches because they couldn't find suitable ones on the market. You'll notice a lot of sites are just speculating [nyquistcapital.com], but it is certainly a nontrivial problem to network clusters of thousands of computers--more than 200,000 machines in total--without some serious switch/hub/networking hardware to back it.
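A crude aggregate-bandwidth estimate shows why off-the-shelf switching gets painful. The 200,000-server figure is from the comment above; the per-server NIC speed and the acceptable oversubscription ratio are my own assumptions, not Google's actual topology.

    # Crude estimate of core switching capacity needed for a large fleet.
    # NIC speed and oversubscription ratio are illustrative assumptions.
    servers = 200_000                 # rough fleet size mentioned above
    nic_gbps = 1                      # assume 1 Gbit/s per server NIC
    oversubscription = 10             # assume 10:1 oversubscription is tolerable

    aggregate_gbps = servers * nic_gbps
    core_gbps_needed = aggregate_gbps / oversubscription
    ports_10g = core_gbps_needed / 10

    print(f"Aggregate edge bandwidth: {aggregate_gbps:,} Gbit/s")
    print(f"Core capacity at 10:1 oversubscription: {core_gbps_needed:,.0f} Gbit/s")
    print(f"Equivalent 10GbE core ports: {ports_10g:,.0f}")

Even with heavy oversubscription you end up needing thousands of 10GbE core ports, which is exactly the class of gear the rumors say Google couldn't buy off the shelf at the time.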
Re:And the Network That Connects These Clusters? (Score:5, Insightful)
I'll bet they don't mess with TCP/IP - that's way too slow and bulky. Think InfiniBand or some other switched fabric instead of hierarchical.
Re: (Score:3, Interesting)
Also, their search algorithm is based on eigenvalues, I think; it's a very, very profitable algorithm to parallelize.
Re: (Score:3, Informative)
Google has a two-vendor policy; I know some of their network gear for GigE and 10GbE is Force10. Google and Force10 are both involved in 802.3ba (40G and 100G): Force10 is on the IEEE committee, and Google is one of the customers driving demand. They may have a seat on the committee as well; I don't know all the members.
Re: (Score:3, Insightful)
1) TCP/IP isn't really slow and bulky. It's one of the best protocols ever designed. With only minimal enhancements to the original protocol, a modern host can achieve nearly line-rate 10Gbit with pretty minimal CPU usage. We can push 900+ MByte/sec from a single host (see the quick goodput math after this list). If you need more bandwidth, then do channel bonding.
2) InfiniBand? That costs at least $250-500 per node, plus more for switches. Google is not going to spend that kind of money for the limited benefits.
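For what it's worth, the 900+ MByte/sec figure is well under the simple header-overhead ceiling. This sketch assumes a standard 1500-byte MTU and typical 20-byte IP and TCP headers; real throughput also depends on NIC offloads, window tuning, and so on.

    # Payload efficiency of TCP/IPv4 over Ethernet at a 1500-byte MTU.
    # Ignores retransmits and TCP options; figures are a rough ceiling only.
    link_gbps = 10
    mtu = 1500
    ip_header, tcp_header = 20, 20
    eth_overhead = 14 + 4 + 8 + 12     # header + FCS, preamble, inter-frame gap

    payload = mtu - ip_header - tcp_header
    wire_bytes = mtu + eth_overhead
    efficiency = payload / wire_bytes

    goodput_MBps = link_gbps * 1e9 / 8 * efficiency / 1e6
    print(f"Theoretical 10GbE TCP goodput: ~{goodput_MBps:.0f} MB/s")  # ~1190 MB/s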
Re: (Score:3, Interesting)
My guess is that they use something else for internal communication. You can always recover from errors at the application level instead of forcing every packet to be confirmed.
TCP is great for general communication over the Internet and not so great for specialized cases where performance is important, like at Google.
Re: (Score:2)
Re: (Score:3, Interesting)
Re: (Score:2)
Re: (Score:2)
From http://code.google.com/soc/2008/freebsd/about.html [google.com] :
Relevance to Google: Google has many tens of thousands of FreeBSD-based devices helping to run its production networks (Juniper, Force10, NetApp, etc.), MacOS X laptops, and the occasional FreeBSD network monitoring or test server.
Re: (Score:2)
Re: (Score:3, Informative)
Re: (Score:2)
Re: (Score:2)
No 'enterprise grade' parts.
Re: (Score:2)
Re: (Score:2)
Failure tolerance vs. failure prevention (Score:1, Interesting)
Re:Failure tolerance vs. failure prevention (Score:4, Insightful)
Unless of course you are talking about P2s and ISA cards. And it's not really a matter of "reliability," I don't think; it could easily be argued that a $200 [component] is just as reliable as a $500 [component]. I think what they are mostly doing is buying three of something cheaper instead of one of something greater (rough numbers sketched below):
Component A: cheaper, less cutting-edge (generally more reliable).
Component B: has 3 times the power, takes 3 times the load, costs 3 times as much.
If a single component A fails, there are still two running (depending on the component), so you take a 33% loss in performance and pay a third of the total cost to replace it (making it maybe a sixth of the cost compared to replacing component B).
If component B fails: 100% loss, complete downtime, 100% of the expense (relatively speaking).
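To put hedged numbers on that trade-off: the prices and the annual failure probability below are invented for illustration, and failures are treated as independent and unrepaired, so this is only a toy model of the idea.

    # Toy comparison: three cheap boxes vs. one big box of equal total cost.
    # Prices and failure probabilities are made-up for illustration.
    cheap_price, big_price = 200, 600
    p_fail_year = 0.05                      # assumed annual failure odds per box

    # Probability all three cheap boxes fail during the year (independent, unrepaired)
    p_all_three_dead = p_fail_year ** 3
    print(f"Chance all 3 cheap boxes are dead: {p_all_three_dead:.4%}")

    # With one big box, a single failure means 100% capacity loss
    print(f"Chance the single big box is dead: {p_fail_year:.2%}")

    # Cost to restore service after one failure
    print(f"Replace one cheap box: ${cheap_price}  vs  replace the big box: ${big_price}")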
Re:Failure tolerance vs. failure prevention (Score:5, Interesting)
You can easily run a dozen large VMs on one of those with room to spare (assuming some of them have 2GB or 3GB of RAM allocated to them). If you limit it to ten per box, that's twenty VMs, and you can migrate servers between the hosts or fail them over in case of a fault. Those DL380s (if you have dynamic power savings turned on) can average under 400 watts of power draw each - so 40 watts per server. In our environment, we've got 5 hosts running a ton of VMs, some of which don't have to fail over (the layer 4-7 switch is also a VM), so we're getting closer to 25 or 30 watts per VM. We'd have the SAN array anyway for our primary data storage, so that wasn't much of an extra. We're using fewer data center network ports and fewer Fibre Channel ports. We've actually been able to triple the number of "servers" we're running while bringing energy use down, as we've retired more older servers and replaced them with VMs. And it's been a net increase in fault tolerance as well.
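The watts-per-VM arithmetic from that post, written out. The host count, wattage, and ten-VMs-per-host limit are the figures given above; the 14-VMs-per-host line is my own assumption to show how the "25 or 30 watts per VM" figure could arise.

    # Reproducing the consolidation math from the post above.
    hosts = 5
    watts_per_host = 400        # average draw with dynamic power savings on
    vms_per_host = 10           # the self-imposed limit mentioned above

    total_watts = hosts * watts_per_host
    total_vms = hosts * vms_per_host
    print(f"{total_watts / total_vms:.0f} W per VM at {vms_per_host} VMs/host")  # 40 W

    # If hosts actually carry more VMs (assumed 14 here), the per-VM figure drops:
    total_vms_loaded = hosts * 14
    print(f"{total_watts / total_vms_loaded:.0f} W per VM at 14 VMs/host")       # ~29 W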
Re: (Score:3, Insightful)
One z10 complex with 64 CPUs and 1.5 TB of memory can support thousands of Linux instances, all communicating with each other using HiperSockets technology. HiperSockets uses microcode to enable communication between environments without going out to the actual network.
A z10 processor complex is as
Re: (Score:2)
Anyway, it's far cheaper and better bang for the buck for Google to use cheap, nasty hardware than your exotic stuff.
Remember that even if they did use what you're suggesting, they'd still need thousands of them.
Re: (Score:2)
Re: (Score:2)
The smallest Linux image can run in 8MB of RAM, so if I get 100 servers with 64GB of RAM I could theoretically fire up 800,000 Linux images on them. Does this mean my 100 servers are as powerful as your five z10s?
Re: (Score:2)
HP DL160 G5: $6672 USD
Low-power dual quad-core, 4x 500GB disk, 32GB RAM.
Say Google gets a good discount for quantity, maybe 25%, so $5000 each.
That seems like a simple enough commodity server these days. A rack of 40 machines would come out to $200,000 USD; add another $50k for misc stuff and switching gear (rack and core). Quick per-unit math is sketched below.
Each rack now has 1.25TB of RAM, 320 cores, 80TB of disk (who needs FC or iSCSI
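Writing that rack math out, using the poster's own prices (the per-core and per-TB figures at the end are just division, not anything Google has published):

    # Rack-level capacity and cost, using the figures from the post above.
    servers_per_rack = 40
    price_per_server = 5000          # after the assumed 25% quantity discount
    misc_and_switching = 50_000

    ram_gb = servers_per_rack * 32
    cores = servers_per_rack * 8      # dual quad-core
    disk_tb = servers_per_rack * 4 * 0.5

    rack_cost = servers_per_rack * price_per_server + misc_and_switching
    print(f"Rack: {ram_gb/1024:.2f} TB RAM, {cores} cores, {disk_tb:.0f} TB disk")
    print(f"Total ${rack_cost:,}  ->  ${rack_cost/cores:,.0f} per core, "
          f"${rack_cost/disk_tb:,.0f} per TB of disk")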
Re: (Score:2)
Say Google gets a good discount for quantity, maybe 25%, so $5000 each.
They'll get a lot more than that. Heck, we typically get 25% and we're nobody - we maybe buy 50-60 servers a year from Dell.
Re: (Score:2)
Even at the scale of a couple of racks (an average 12x12 colo cage at Savvis or some place), you get many times the compute/storage capacity of a z10.
It's interesting technology, but completely wasteful when it comes to $/work
Re: (Score:1)
Re:Failure tolerance vs. failure prevention (Score:5, Insightful)
From the looks of it, they're doing exactly what I do for myself: skip the extraneous crap and simply rack motherboards as they are.
In that case we're not talking 3 of something cheaper; you could probably get up towards 5-10 of something cheaper. Then consider that best price/performance is not generally what is bought, and the difference is even wider.
Of course, it's not going to happen in the average corporation, where most involved parties prefer covering their ass by buying conventional branded products. Point out to your average corporate purchaser or technical director that you could reduce CPU-cycle costs to 1/25th, and provide storage at 1/100th of the current per-gigabyte cost, and they'll whine, 'but we're an _enterprise_, we can't buy consumer-grade stuff or build it ourselves'.
Ten years ago people brought obsolete junk from work home to play with. These days I'm considering bringing obsolete stuff from home to work because the stuff I throw out is often better than low-prioritized things at work.
Re: (Score:2)
Of course, it's not going to happen in the average corporation, where most involved parties prefer covering their ass by buying conventional branded products.
It's not *just* ass-covering, although there's definitely some of that. Average corporations also do not *remotely* employ enough IT staff to be doing the sort of constant maintenance and replacement that Google does, not to mention the engineers testing and designing the specialized architecture, etc. And IT is often one of the first groups up against the wall when it's time to shore up the numbers for the fiscal year.
I've worked with managers who believed very much in the commodity hardware philo
Re:Failure tolerance vs. failure prevention (Score:5, Insightful)
Hardware will fail - it's up to the intelligence of the overlaid systems to mitigate that.
Re: (Score:3, Insightful)
That's not to say it's impossible; IBM, HP, or any of the "big iron" companies can offer you damn near 100% uptime without major changes to your software.
But be prepared to pull out the checkbook. You know, the REALLY BIG one that is only suitable for writing lots of zeroes and grand prize giveaways.
Re: (Score:2)
One of the things big iron provides is a clear update cycle without sacrificing those 9s, as well. You don't have to worry about whether or not the latest batch of Dell machines is going to have bad capacitors that add 10% to your expenses. No, you pay for all of the potential costs up front, at once, for high reliability.
For a lot of big businesses this makes a lot of sense to them. It's reliable, it doesn't depend on network
Re: (Score:2)
Yes there are flaws, but we're discussing what pros exist. There are definite cons here, and I would be very inclined to agree with you that de
Re:Failure tolerance vs. failure prevention (Score:5, Funny)
Re:Failure tolerance vs. failure prevention (Score:5, Funny)
Don't worry, your secret is safe with us.
Real Slashdotters not only fail to read TFAs, but they also completely miss any and all relevant information in other people's posts.
Therefore, someone may latch onto your claim that Google is not skimping on hardware and try to argue that they, in fact, do. Your admission to having read TFA will go completely unnoticed.
And before you ask yourself how come I noticed it: I didn't.
And besides, I'm new here.
Re: (Score:2)
Re:Failure tolerance vs. failure prevention (Score:5, Insightful)
With server farms the size of Google's, failures are going to occur daily regardless of how "fault-tolerant" your hardware is. Nothing is 100% failure-free. Given that failures will occur, you need fault tolerance in your software, and if your software is fault-tolerant, then why waste money on overpriced "fault-tolerant" hardware? If you can buy N cheapo servers for the price of one hardened one, then you'll typically have N times the CPU power available, and the software makes both setups look equally reliable.
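A toy model of that last point: if requests can be served by any of N replicas, the service is only down when all N are down at once. The probabilities below are invented for illustration and assume independent failures.

    # Software replication making cheap hardware look reliable (toy model).
    # Probabilities are invented; failures are assumed independent.
    p_cheap_down = 0.01      # assume a cheap box is down 1% of the time
    p_hardened_down = 0.001  # assume a "fault-tolerant" box is down 0.1% of the time

    for n in (1, 2, 3):
        p_service_down = p_cheap_down ** n
        print(f"{n} cheap replicas: service unavailable {p_service_down:.6%} of the time")

    print(f"1 hardened box:  service unavailable {p_hardened_down:.6%} of the time")
    # In this toy model, two cheap replicas already beat the one expensive box
    # (0.01% downtime vs 0.1%).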
Re: (Score:2)
When you're running at ginormous scale, you don't really care about local RAID all that much, especially if the data is massively replicated across the datacenter.
In fact, you may not even care if an individual machine breaks down - you just unplug it because it ha
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re:Failure tolerance vs. failure prevention (Score:5, Interesting)
Hard drive failures (Score:2, Interesting)
Re: (Score:2)
I would imagine that Google won't adopt SSDs until they are financially viable, which probably won't be too long. They will be about the same price per GB as HDDs, and eventually cheaper, which means greater profit for as long as HDDs are being sold (a 200GB HDD costs $50, six months later a 200GB HDD costs $10, etc.).
Then, if SSDs are more reliable and the same price, that's also less expense.
Re: (Score:2)
Re: (Score:2)
Actually, I'd say at twice the price, SSDs would be less expensive over their lifetime. (I'm not sure where the break-even point is, but Seagate warrants drives for 5 years, and most flash media has something like a 50-year average write-cycle lifetime, so 10x probably isn't far off? I'll stick with 2x for my argument, though.)
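A rough lifetime-cost comparison along those lines. Every number here (prices, lifetimes, replacement labor, the 10-year window) is an assumption for illustration, not measured data.

    # Toy total-cost-of-ownership comparison over a 10-year window.
    # All figures are assumptions for illustration.
    hdd_price, hdd_life_years = 100, 5        # replaced once over 10 years
    ssd_price, ssd_life_years = 200, 10       # "twice the price", lasts the window
    replacement_labor = 50                    # assumed cost of a tech swapping a drive

    hdd_tco = (10 / hdd_life_years) * hdd_price + (10 / hdd_life_years - 1) * replacement_labor
    ssd_tco = (10 / ssd_life_years) * ssd_price

    print(f"HDD 10-year cost: ${hdd_tco:.0f}")   # $250
    print(f"SSD 10-year cost: ${ssd_tco:.0f}")   # $200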
Re: (Score:1)
Re: (Score:2)
Indeed, and if you have 200 thousand servers running, they must be employing at least a couple of dozen people to run around hot datacenters all day replacing hard drives. Neither the hard drives nor the people will be cheap.
By all accounts, they don't bother with individual machine repairs. A dead rack might get repaired or replaced, but an individual node will simply be marked as dead and left there. The rack itself will get maintenance as and when it no longer has enough functioning servers to merit keeping it going.
Re: (Score:2)
Less than you might think from the summary. Reading further down the article, you find: "The company has a small number of server configurations, some with a lot of hard drives and some with few."
Overheating and rewiring? (Score:4, Interesting)
Re:Overheating and rewiring? (Score:4, Funny)
Each machine has a smoke detector installed right on top of it. The maintenance director stands at the gate of the data center with a pistol in both hands. As soon as the alarm sounds, a batch of maintenance engineers rushes toward the faulty machine with keyboard, hard disk, mouse, motherboard, and other components. The faulty components are replaced to the rhythm of drumbeats they have rehearsed thousands of times. The crew has to rewire the machine, reboot, and be back at the gate with the burnt machine in less than 5 minutes, or they are shot dead.
The trouble is, because of this time limit, the maintenance engineers simply pull the machine out of the rack without disconnecting any wires. And that's why rewiring is needed.
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Interesting)
This also allows for future throughput improvements from a single unit, and it probably would cost less than the two days of downtime per overheat (racks are relatively cheap, time isn't).
Re: (Score:2)
From what I read, Google uses simple desktop computers.
These machines are designed to sit idle 99.9% of the time, and everything about them reflects that. If you ramp up the load on such a machine, things start to get real noisy real quick. If you keep it at such a high load for a long time, it simply breaks. (IBM NetVista comes to mind...)
Trouble is, buying machines designed with such a load in mind costs twice as much and the f
Re: (Score:2, Informative)
Re: (Score:2)
Re: (Score:3, Informative)
I had this problem at the University where I worked a while ago. We rolled in a nice new SGI Altix machine. We had enough power, but the cooling system couldn't move enough cubic feet of air into the one part of the room where the box was. As soon as you reach capacity, temps skyrocket.
Re: (Score:2)
Right. But when was the last time you were unable to pull up Google's search page? At the end of the day, that's all that matters.
BTW, I'd bet good money that a "broadcast engineering truck" costs 25X what Google pays per CPU cycle.
Re: (Score:2)
Re: (Score:2)
It's the same everywhere, regardless of scale (Score:4, Interesting)
And they failed. And then they failed again. And again. Sometimes completely, but usually just a single port, or just "a bit" - it looked as if the switch was working, but every packet (or every n-th packet, or every packet bigger than x) got mangled, misdirected, or whatever. Sometimes packets appeared out of the blue (probably partial leftovers from the cache), and a few of them made enough sense to be received and reported. Sometimes a switch with no network cables attached to it started blinking its lights - sometimes on two ports, sometimes just on one.
Well, I could go on for hours, but you get the idea. What happens at Google happens everywhere, they just have some nice numbers.
Regardless, the article is quite entertaining to read for a networking geek
Re: (Score:3, Informative)
Re:It's the same everywhere, regardless of scale (Score:4, Funny)
Sounds like you have dust in your cables. I would recommend you clean the inside of your cables with compressed air so the bits don't get stuck on the lint and other stuff in there. The bits travel very fast, so even small dust particles can be a problem.
Re: (Score:2)
Re:It's the same everywhere, regardless of scale (Score:4, Informative)
And that's why. If you're using "smart hubs" or "dumb switches" (aka your $99 Linksys switch), then you're probably not going to have issues. All it does is store MAC tables and forward data to the appropriate ports. You probably also don't have multiple other network switches/hubs/routers hanging off of those devices somewhere downstream, and if you do, then it's very likely that you know what and where they are and can plan for them.
On the other hand, trying to manage an enterprise-class switch with advanced features can be a little more complicated, especially when you start allowing anybody to plug any other kind of network device into the switch. You can easily end up with spanning-tree loops, issues with frame sizing, cross-brand autonegotiation failures, and who knows what else. And that's before you even have to start worrying about bugs in various firmware revisions, or some enterprising "hax0r d00d" who passed Comp Sci 101 trying to do things that he shouldn't be doing and spoofing addresses to try to cover his tracks.
Re: (Score:2)
Aka, a "Layer 3" switch [wikipedia.org].
Re:It's the same everywhere, regardless of scale (Score:4, Informative)
1. You've been fantastically lucky.
2. You've not been in IT terribly long.
3. Your job doesn't involve network management and so your experience of what switches can do when they have a mind to is limited.
Solid-state simple dumb switches can and do fail, as can managed ones. If you're lucky, they fail in a fairly obvious fashion (eg. they just stop pushing packets on some or all ports).
If you're unlucky, they start spewing corrupt frames everywhere confusing the hell out of everything else on the network and you have to figure out exactly which switch is doing this and get rid of it.
Re: (Score:2)
Guess what? Every week some moron manages to make a loop, connect a switch to itself, connect two switches with a telephone cable, or do some other unspeakably ass-brained thing that makes CSI investigations look like a piece of cake compared to finding out what's wrong with the network.
And don't even g
Software architecture, Not hardware (Score:3, Interesting)
Hardware is cheap (Score:4, Interesting)
Quad-core Xeon @ 2.66GHz
4GB RAM
2x 500GB Barracudas (RAID 1)
Dual gigabit Ethernet
CentOS 5.1
US$1100 per unit
They are all stashed behind a Foundry ServerIron to load balance the cluster. So far, it seems to scale VERY well and increasing capacity is as simple as tossing another US$1k server on the pile.
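For illustration, the "toss another server on the pile" model boils down to something like the toy round-robin pool below. This is only a sketch of the idea, not how a Foundry ServerIron actually distributes traffic; the IP addresses are made up.

    # Minimal round-robin pool: adding capacity is just appending a backend.
    # A toy illustration of horizontal scaling, not real load-balancer logic.
    from itertools import cycle

    class Pool:
        def __init__(self, backends):
            self.backends = list(backends)
            self._rr = cycle(self.backends)

        def add(self, backend):
            self.backends.append(backend)
            self._rr = cycle(self.backends)   # rebuild rotation with the new member

        def pick(self):
            return next(self._rr)

    pool = Pool(["10.0.0.11", "10.0.0.12"])
    pool.add("10.0.0.13")                      # tossing another US$1k server on the pile
    print([pool.pick() for _ in range(6)])     # requests spread across all three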
Cheers,
Re: (Score:2)
Re: (Score:2)
How do they KNOW what to fix (Score:2)
Re: (Score:3, Informative)
Re: (Score:2)
Most server systems worth their salt have fault indicators that turn on when there is a hardware failure or perhaps even a watchdog timeout. Probably they do periodic walk-throughs to look for fault lights.
The second is proper asset management. A machine identified as broken has a record in an asset database that describes the location of the machine, the location being something like (data center, row, rack, RU up from the
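The kind of record being described might look something like this. The field names and values are hypothetical, invented just to make the idea concrete; they are not from any real asset system.

    # Hypothetical asset-database record for a machine flagged as broken.
    # Field names and values are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class AssetRecord:
        hostname: str
        datacenter: str
        row: int
        rack: int
        rack_unit: int        # RU counted up from the bottom of the rack
        status: str           # e.g. "in_service", "failed", "awaiting_repair"

    broken = AssetRecord("node-4711", "dc2", row=7, rack=3, rack_unit=22, status="failed")
    print(f"Replace drive in {broken.datacenter} row {broken.row} "
          f"rack {broken.rack} RU {broken.rack_unit}")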
Re: (Score:2)
Re: (Score:2)
I've never actually seen our main data center.
Just goes to prove (Score:2)
Re: (Score:2)
After reading about Google's infrastructure in the past, it's not just a matter of tools but of OSS operating system collections, probably an internal Google distribution. They might use Apache or even a kernel-based httpd, and their software authoring tools are probably Eclipse or Ant, as they seem to be big fans of Java.
This is a comment about how Google uses Ope
We are ants. (Score:2)
Then... we realize that our own lifespans and lives are as prone to failure as the servers in their datacenters. Our lifespans are short and everyone has problems... So Google has mastered the ability to make us interchangeable.
WE ARE ANTS!
Re: (Score:2)
It's not like a typical datacenter where cluster X is for ESX Server, Y is for the financial system, Z is Win2k3, and Q is AIX. Every unit in a Google rack is just another piece of typical hardware running the same OS and the same software, configured the same way. I suspect there may be some sort of 'controller node' for some number of worker machines, but even then, each controller node is just like every other controller node.
Each machin
Re: (Score:2)
Jeff Dean is the smartest guy I've ever met (Score:3, Interesting)
Re: (Score:2, Funny)
a case for mainframes (Score:2)
Re: (Score:2)
Re: (Score:1)
I'd like to see the traffic patterns for their data centers. Our University has a daily and weekly pattern, no surprise there, but I wonder how much their traffic changes through the night.
There is no 'night' and 'day' for a worldwide Internet-based organization such as Google. When you have night, someone else has day. Both of you use Google.
And the distribution of Google use is uniform across all the meridians of this world, all the time. If this is not globalization, then I don't know what is: no night, no day, no east, no west, no nothin', just steady-state Googling all the time, everywhere...
Re:Traffic Patterns for Google (Score:5, Insightful)
And even if you think of Google as a whole, it is significantly more popular in Europe and the US than it is in Asia, so you would still have uneven traffic rates.
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Funny)
Re: (Score:2)
As soon as you bump up the AC, you then need to redesign the cooling structures and the airflow, the power feeds will need upgrading, and then you need added cooling for the A/C and the power ducting itself....
Better to just lose a cluster every 6 months, and know that it'll happen that often and plan for it, than to try
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Good cooling designs also use evaporation to do the bulk of the work. On a large 20-story building at the university, they had 4 cooling towers that mostly just pushed air through waterfalls running over the heat-exchanger coils.
Re: (Score:2)
Re: (Score:2)
Concepts from functional programming have really helped out Google, but at the same time they introduce limitations (at least when considering the MapReduce/GFS framework).
I think the larger problem with parallel programming for multiple processors/cores really comes with finding a conceptual model for expressing the computation. Functional programming (in
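As a toy example of the functional style being discussed, here is word counting expressed as a map step and a reduce step. It is only a single-process sketch of the programming model itself, nothing like Google's actual MapReduce or GFS machinery.

    # Word count in the map/reduce style: "map" emits (key, 1) pairs,
    # "reduce" folds the values for each key. Toy, single-process sketch.
    from collections import defaultdict

    def map_phase(document):
        return [(word, 1) for word in document.split()]

    def reduce_phase(pairs):
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = [kv for doc in docs for kv in map_phase(doc)]   # map over all inputs
    print(reduce_phase(pairs))                              # {'the': 3, 'fox': 2, ...}

The appeal for parallelism is that the map calls are independent and the reduce is a fold over intermediate pairs, so both phases can be spread across many machines; the limitation the parent mentions is that not every computation fits this shape.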
Re: (Score:2)
The bit about Hadoop being supported mostly by Yahoo is news to me. I hadn't bothered to look into their funding.
Hearing this has me wondering what sort of organization and software structure Yahoo uses internally. They probably manage just as much information as Google, if not more. However, I have a feeling that their software is more of a hodge-podge than something built on top of a few parallel frameworks, as Google's is.
It does seem that video.google.com has become an