Huge Traffic On Wikipedia's Non-Profit Budget 240
miller60 writes "'As a non-profit running one of the world's busiest web destinations, Wikipedia provides an unusual case study of a high-performance site. In an era when Google and Microsoft can spend $500 million on one of their global data center projects, Wikipedia's infrastructure runs on fewer than 300 servers housed in a single data center in Tampa, Fla.' Domas Mituzas of MySQL/Sun gave a presentation Monday at the Velocity conference that provided an inside look at the technology behind Wikipedia, which he calls an 'operations underdog.'"
Re:Impressive (Score:5, Informative)
Re:The power of low standards (Score:2, Informative)
Changes are never just lost. When an error does happen and the action cannot be completed, it is rejected and the user is notified so they can retry what they were doing. You have vastly overstated the severity of such issues.
Re:Works great because it's not "Web 2.0" (Score:1, Informative)
You make things sound cheap and simple, but without the memcached and the squid clusters Wikipedia is using, the whole thing would require significantly more hardware than the foundation could afford.
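To make the point concrete, here is a minimal sketch of the read-through caching pattern memcached enables: serve rendered pages from cache and only do the expensive parse on a miss. This is illustrative only, not MediaWiki's actual code; the in-memory class stands in for a real memcached client, and `render_article` stands in for the wikitext-to-HTML parser.

```python
import hashlib

class FakeMemcache:
    """Tiny in-memory stand-in for a memcached client (illustrative only)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

def render_article(title):
    # Stand-in for the expensive wikitext-to-HTML rendering step.
    return f"<html><body><h1>{title}</h1></body></html>"

def get_page(cache, title):
    """Read-through caching: serve from cache, render only on a miss."""
    key = "page:" + hashlib.md5(title.encode("utf-8")).hexdigest()
    html = cache.get(key)
    if html is None:
        html = render_article(title)  # expensive path, taken once
        cache.set(key, html)
    return html

cache = FakeMemcache()
first = get_page(cache, "Squid (software)")   # cache miss: renders
second = get_page(cache, "Squid (software)")  # cache hit: cached copy
```

Every hit that memcached (or a Squid cache in front of it) absorbs is a page render and database round-trip that never touches the application servers, which is why removing those layers would mean buying far more hardware.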
Sure they do it without ads... (Score:4, Informative)
Sure, they do it without ad income. But they also do it without having to pay full salaries, colocation fees, or bandwidth costs... (I know they pay some of those, but they also get a metric buttload of in-kind contributions.)
When your costs are lower, and your standard of service (and content) malleable, it is easy to live on a smaller income.
Re:The power of low standards (Score:5, Informative)
A bank requires "six nines" of reliability (i.e., correct 99.9999% of the time) and probably wants even better than that.
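For scale, "six nines" leaves only about half a minute of allowed downtime per year, which is a quick bit of arithmetic:

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 31.56 million seconds

def max_downtime_seconds(nines):
    """Allowed downtime per year at the given number of nines of availability."""
    downtime_fraction = 10 ** (-nines)
    return SECONDS_PER_YEAR * downtime_fraction

print(round(max_downtime_seconds(6), 1))  # six nines: roughly 31.6 seconds/year
print(round(max_downtime_seconds(3), 0))  # three nines: roughly 8.8 hours/year
```

That gap is why a wiki, where a few minutes of read-only mode costs almost nothing, can run on a budget that no bank could.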
Re:Note to self (Score:5, Informative)
They're not all in Tampa; they have a bunch in the Netherlands and a few more in South Korea.
Re:Works great because it's not "Web 2.0" (Score:2, Informative)
If you haven't noticed, "Web 2.0" is a long-established buzzword [wikipedia.org] - which means it carries little meaning, but looks good in advertising. Just like "information superhighway", "enterprise feature" or "user friendly".
Re:Out like a light (Score:2, Informative)
We've never lost external power while we've been at Tampa, but if we did, there are diesel generators. Not that it would be a big deal if we lost power for a day or two. There's no serious problem as long as there's no physical damage to the servers, which we're assured is essentially impossible even with a direct hurricane strike, since the building is well above sea-level and there are no external windows.
Servers and locations (Score:2, Informative)
According to http://meta.wikimedia.org/wiki/Wikimedia_servers [wikimedia.org] Wikimedia (and by extension, Wikipedia):
"About 300 machines in Florida, 26 in Amsterdam, 23 in Yahoo!'s Korean hosting facility."
also: http://meta.wikimedia.org/wiki/Wikimedia_partners_and_hosts [wikimedia.org]
Re:I've always wondered... (Score:5, Informative)
Re:Tampa? (Score:2, Informative)
Re:Impressive (Score:5, Informative)
No, actually - the Wikimedia servers serve all Wikimedia projects (all the Wikipedias, Wikimedia Commons, all the other projects), but Uncyclopedia is part of Wikia, which is a private company owned by Jimmy Wales to do wikis and isn't actually linked to the Wikimedia Foundation in any way.
Re:Impressive (Score:5, Informative)
Single database, though. All the databases for all the projects are in Tampa - one master for English Wikipedia and two for all the other 700+ Wikimedia projects.
(They tried running the databases for Asian languages from the Yahoo!-sponsored datacentre in Seoul for a while, but it didn't actually work much faster than it did with everything in Tampa.)
What about the Internet Archive (Score:5, Informative)
Wikipedia's pretty impressive, but how about the Internet Archive [archive.org]? Also a non-profit that doesn't run ads, and not only do they, like Google and Yahoo, "download the Internet" on a regular basis, but the Archive makes backups! Plus, they have huge amounts of streaming audio and video (pd or creative-commons). The first time I ever heard the word "Petabyte" being discussed in practical, real world terms (as in, "we're taking delivery next month") was in connection with the Internet Archive. Several years ago. And it was being used in the plural! :)
They may not have as much incoming traffic as Wikipedia, but the sheer volume of data they manage is truly staggering. (Heck, they have multiple copies of Wikipedia!) When I download something from there, it's typically in the 80-150 MB range, and 1 or 2 GB in one go isn't unusual. I know I'm not the only one downloading, so their bandwidth bills must be pretty impressive.
The fact that these two sites manage to survive and thrive the way they do never ceases to amaze me.
Re:Wikipedia = much more traffic than slashdot (Score:2, Informative)
That's pretty obvious: Wikipedia has, literally, millions of articles covering every possible field, whereas /. is very limited in scope.
Re:Impressive (Score:5, Informative)
Re:Cached on servers all over the interweb? (Score:3, Informative)
It exists. It's called "validators". There are strong and weak validators. You can Vary on your validators, and thus have multiple copies of the same object in different forms (so given a text document, you can have it in different languages, compressed/uncompressed, etc.).
Your browser will then quite happily ask the origin server (which may not be the "origin" origin) for an object, providing validators (Last-Modified -> If-Modified-Since; ETag -> If-None-Match), which the origin (or the cache pretending to be the origin) can check against its local copy and then answer "yes, use your local copy" or "no, don't bother."
It's all there, right now, in HTTP/1.1. I swear. People just don't have a clue how to use caching; they've been bitten by the difference between "expiry" and "revalidation", and they just turn off all hope of caching. Maybe they're scared; maybe their job is to sell bits; maybe they're just clueless about it and turning off caching once fixed an obscure problem. In any case, it's right there in HTTP/1.1 and you can use it any time you like.
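The revalidation flow described above can be sketched in a few lines. This is a simplified model of the origin-side decision (not Squid's actual code): if the client's If-None-Match matches the current ETag, answer 304 with no body ("use your local copy"); otherwise send the full 200 response.

```python
def revalidate(request_headers, current_etag, body):
    """Handle a conditional GET: return (status, headers, body)."""
    if request_headers.get("If-None-Match") == current_etag:
        # Validator matches: client's cached copy is still good.
        return (304, {"ETag": current_etag}, b"")
    # Validator missing or stale: send the full object with its ETag.
    return (200, {"ETag": current_etag}, body)

etag = '"v42"'
page = b"<html>...</html>"

# Client holds the current version: 304, empty body, no bytes re-sent.
status, headers, resp = revalidate({"If-None-Match": '"v42"'}, etag, page)

# Client holds an old version: full 200 with the new object.
status2, headers2, resp2 = revalidate({"If-None-Match": '"v41"'}, etag, page)
```

A real cache does the same dance in both directions: it revalidates its own copy against the true origin with If-None-Match, and answers its clients' conditional requests locally, so most traffic never leaves the edge.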
Adrian
(I'm a Squid developer.)