Behind the Scenes At Google

Posted by CmdrTaco on Sunday April 03, 2005 @10:48AM from the they-should-document-the-cafeteria dept.

An anonymous reader writes "University of Washington TV presents "Behind the Scenes With Google." From the site: 'Search is one of the most important applications used on the internet and poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines. In this program, Jeff Dean of Google describes some of these challenges, discusses applications Google has developed, and highlights systems they've built, including GFS, a large-scale distributed file system, and MapReduce, a library for automatic parallelization and distribution of large-scale computation. He also shares some interesting observations derived from Google's web data.'"

UW mirror (Score:4, Informative)

by JoshuaDFranklin ( 147726 ) * writes: <joshuadfranklin.NOSPAM@ya h o o .com> on Sunday April 03, 2005 @10:54AM (#12126287) Homepage

Also hosted by CS at:
http://norfolk.cs.washington.edu/htbin-post/unrestricted/colloq/details.cgi?id=274 [washington.edu]
Jeff Dean
Abstract: Search is one of the most important applications used on the internet, but it also poses some of the most interesting challenges in computer science. Providing high-quality search requires understanding across a wide range of computer science disciplines, from lower-level systems issues like computer architecture and distributed systems to applied areas like information retrieval, machine learning, data mining, and user interface design. I'll describe some of the challenges in these areas and discuss some of the applications that Google has developed over the past few years. I'll also highlight some of the systems that we've built at Google, including GFS, a large-scale distributed file system, and MapReduce, a library for automatic parallelization and distribution of large-scale computation. Along the way, I'll share some interesting observations derived from Google's web data.

Jeff Dean joined Google in 1999 and is currently a Distinguished Engineer in Google's Systems Lab. While at Google he has worked on Google's crawling, indexing, query serving, and advertising systems, implemented several search quality improvements, and built various pieces of Google's distributed computing infrastructure. Prior to joining Google, he was at DEC/Compaq's Western Research Laboratory. He received a Ph.D. from the University of Washington in 1996 working with Craig Chambers on compiler optimization techniques for object-oriented languages.
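
For anyone who hasn't seen the MapReduce model named in the abstract, here is a rough single-process sketch in Python. It is not Google's library: it only illustrates the map, group-by-key, and reduce steps, with no parallelism or distribution. The function names (map_reduce, word_count_mapper, word_count_reducer) and the sample documents are made up for illustration.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Toy, single-process illustration of the map/shuffle/reduce pattern."""
    groups = defaultdict(list)
    # Map phase: each input record yields zero or more (key, value) pairs.
    for record in inputs:
        for key, value in mapper(record):
            # "Shuffle": collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}

def word_count_mapper(document):
    # Emit (word, 1) for every word in the document.
    for word in document.lower().split():
        yield word, 1

def word_count_reducer(word, counts):
    # Sum the partial counts for one word.
    return sum(counts)

# Hypothetical input documents, purely for illustration.
docs = ["the web is big", "the index is bigger"]
counts = map_reduce(docs, word_count_mapper, word_count_reducer)
```

In a real distributed setting the map calls and the reduce calls would run on many machines at once. The point of the pattern is that the user only writes the two small functions.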

Re:I use Google at work (Score:5, Informative)

by Anonymous Coward writes: on Sunday April 03, 2005 @11:09AM (#12126341)

Now I have some pretty important lists which I need to keep tight control over. The information really ought not be distributed outside my office. However, because of the nature of my business, I must do frequent searches using various search engines to fill in my lists.

If you want to keep something private, don't put it on the publicly accessible internet. Including searches. Duh.

How am I assured that my searches remain anonymous and secure with Google?

You aren't. Did you sign a contract to that effect? No.

And frankly, if you can find things with google, it isn't too secret.

Re:mediocre or no Linux support! (Score:3, Informative)

by Servo ( 9177 ) writes: <dstringf@noSPam.tutanota.com> on Sunday April 03, 2005 @11:15AM (#12126370) Journal

Like any tech company, they went with the biggest platform first. Gmail works on non-Windows browsers now. It just took them a while.

here is a transcript of the first 12 minutes (Score:3, Informative)

by Anonymous Coward writes: on Sunday April 03, 2005 @11:32AM (#12126464)

Here are the first 12 minutes typed out. i'm sorry i can't do the rest, but open the video and skip forward to 12:00 and go from there. i hope that these 12 minutes of my life typing this will save at least 2 other people 12 minutes of theirs.

(speech from this point...)
lots of people use google but i want to give you a flavour for what happens and what we are working on for our new systems and products. i'll focus on what are the interesting problems that crop up when you organize large amounts of information, like we do, and what you can do with lots of data and computational resources. i'll also talk about our engineering organization.

google has a mission statement that i like - to organize the world's information and make it universally accessible and useful. we've moved from web searching to mail and news and searching books by scanning/ocr'ing them. this mission statement covers everything and means we won't run out of work!

a lot of our issues are to do with scale. we have 4B webpages with an average of 10kb/page, and lots and lots of searches per second. it's a big problem but you solve it with lots of computers and disks and network them well.
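
To put the numbers from the talk in perspective, here is a quick back-of-the-envelope calculation of the raw corpus size. It assumes decimal units (1 kB = 1000 bytes) and ignores compression, index structures, and replication, none of which are quantified in the transcript.

```python
# Figures quoted in the talk.
PAGES = 4_000_000_000          # 4B web pages
AVG_PAGE_BYTES = 10 * 1000     # 10 kB per page (decimal kB is an assumption)

total_bytes = PAGES * AVG_PAGE_BYTES
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"raw page data: about {total_tb:.0f} TB")
```

That is only the raw page text. Anything stored more than once (for redundancy or capacity) multiplies it.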

dealing with scale comes about in a number of areas. hardware/network; what do you use. distributed systems; dealing with unreliable things. algorithms/structures; processing efficiently and in interesting ways. machine learning/info retrieval; improving quality of results by analyzing lots of data. user interfaces; we haven't done much on this yet but it would be interesting to provide new and interesting ways to navigate and refine the query by doing better things than just typing in new query words - i'd expect to see more developments in this area.

one thing we've made a decision about is that we tend to build on low cost commodity PCs. example setup: ibm eserver xseries 440, 8 2-ghz xeons, 64GB ram, 8TB disk = $758,000. we use this: 88 machines that total 176 2-ghz xeons, 176GB ram, ~7TB disk = $278,000. this is about 1/3 the price, with more cpu.
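
A small sketch of the price comparison above, using only the prices and RAM figures as quoted in the transcript. The currency is assumed to be US dollars, and the dictionary names are made up for illustration.

```python
# Figures as quoted in the transcript (prices assumed to be US dollars).
big_server = {"price": 758_000, "ram_gb": 64}
commodity_rack = {"price": 278_000, "ram_gb": 176}

# How much of the big server's price the commodity setup costs.
price_ratio = commodity_rack["price"] / big_server["price"]

# RAM bought per dollar, commodity relative to the big server.
ram_per_dollar_ratio = (
    (commodity_rack["ram_gb"] / commodity_rack["price"])
    / (big_server["ram_gb"] / big_server["price"])
)

print(f"price ratio: {price_ratio:.2f}")
print(f"RAM per dollar advantage: {ram_per_dollar_ratio:.1f}x")
```

The price ratio comes out a little over a third, which matches the "1/3x price" remark, and the commodity setup gets several times more RAM per dollar.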

google was founded in 97 by two people at stanford working on interesting ways to use the search, but needed new hardware to do this. they'd go to the loading dock and offer to set up machines for other research projects - but keep them for a while themselves to get work done. over time google was formed in 1999, and we've learned a lot since then - such as how to scale better and have good datacenter practices.

hosting centers were charging by the square foot, which is strange since their costs come from things like cooling and electricity, so we got good at putting a lot of servers in one place. we are now very good at setting up large clusters quickly, such as our gigantic 2001 datacenter move configured in 3 days.

if you have that many machines you have to worry about failure. one machine might fail every thousand days, but thousands of machines mean at least a failure a day. you have to deal with this in software with replication and redundancy. one nice property of dealing with this problem is that having six copies for capacity reasons also means we now have six copies available for distributed application and load balancing. a lot of the applications we deal with are read-only, which helps make handling so many queries easy.
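
The failure arithmetic above can be sketched as follows. This is a simplified model, assuming machines fail independently and each has a 1-in-1000 chance of failing on any given day. The function names are made up for illustration.

```python
def expected_failures_per_day(machines, mtbf_days):
    """Average number of machine failures per day across the fleet."""
    return machines / mtbf_days

def prob_at_least_one_failure(machines, mtbf_days):
    """Chance that at least one machine fails on a given day,
    assuming independent failures with daily probability 1/mtbf_days."""
    p_single = 1.0 / mtbf_days
    return 1.0 - (1.0 - p_single) ** machines

for fleet in (1_000, 10_000):
    avg = expected_failures_per_day(fleet, 1000)
    p_any = prob_at_least_one_failure(fleet, 1000)
    print(f"{fleet} machines: {avg:.0f} failures/day on average, "
          f"P(at least one) = {p_any:.4f}")
```

With a thousand machines you average one failure a day, and once you are into the many thousands a failure-free day is essentially never going to happen, which is why it has to be handled in software.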

Re:GFS (Score:2, Informative)

by warkda rrior ( 23694 ) writes: on Sunday April 03, 2005 @11:41AM (#12126501) Homepage

RedHat has something called GFS -- the Global File System [redhat.com].

Re:GFS (Score:5, Informative)

by AKAImBatman ( 238306 ) * writes: <akaimbatman@gmaYEATSil.com minus poet> on Sunday April 03, 2005 @11:42AM (#12126508) Homepage Journal

Ok, I looked it up. You're confusing Sistina's (now Red Hat) Global File System with the Google File System. The two ARE NOT THE SAME.

Here's Red Hat:

http://www.redhat.com/software/rha/gfs/ [redhat.com]

Here's Google:

http://www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf [rochester.edu] (PDF)
http://64.233.161.104/search?q=cache:m0TMQYgIlIoJ:www.cs.rochester.edu/sosp2003/papers/p125-ghemawat.pdf+Google+File+System&hl=en&client=safari [64.233.161.104] (HTML)

Re:Fsking video format. (Score:4, Informative)

by LuckyStarr ( 12445 ) writes: on Sunday April 03, 2005 @12:05PM (#12126640)

$ man mplayer /dumpstream

Download the .asx File, look inside. This is your URL. Have fun.
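
If you'd rather not eyeball the .asx file, here is a minimal Python sketch that pulls the stream URLs out of it. It assumes the usual ASX layout, where each stream is listed in a ref element with an href attribute. The sample file contents and the example.invalid URL are made up for illustration.

```python
import re
from typing import List

def stream_urls_from_asx(asx_text: str) -> List[str]:
    """Return every href found in <ref ...> elements of an ASX playlist.

    ASX tag and attribute names are case-insensitive and the files are not
    always well-formed XML, so a tolerant regex is used instead of a parser.
    """
    pattern = re.compile(r'<ref\s+href\s*=\s*"([^"]+)"', re.IGNORECASE)
    return pattern.findall(asx_text)

# Hypothetical playlist contents, for illustration only.
sample = '''<ASX version="3.0">
  <Entry>
    <Ref href="mms://example.invalid/talk.asf"/>
  </Entry>
</ASX>'''

urls = stream_urls_from_asx(sample)
print(urls)
```

The resulting URL is what you hand to mplayer with -dumpstream (plus -dumpfile to name the output).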

Equal Time (Score:5, Informative)

by DanielMarkham ( 765899 ) writes: on Sunday April 03, 2005 @12:30PM (#12126785) Homepage

Hey -- I love Google. Use it every day, and I think they're doing some really neat stuff. But this was an hour-long commercial for Google -- to me it looked designed to recruit from college campuses. While I think it's great that Google does this (it sure sounds like a great way to get cheap qualified labor), is it really new or interesting? Or even geeky? So we have redundant clustering, LISP-like patterns, and issues of dealing with BIG stuff. Hasn't the industry already done all of this, like dozens of times? You can't tell me VISA International doesn't handle this size data, or that General Motors doesn't have some of the same scaling issues. I read somewhere that Wal-Mart has one of the biggest computer systems in the world. To me the signal-to-noise ratio was too out of whack to make it worth an hour of my time. Just my opinion folks.

Re:Few women in CS. (Score:1, Informative)

by Anonymous Coward writes: on Sunday April 03, 2005 @01:33PM (#12127126)

screenshot, since it's slashdotted [photobucket.com]

Re:50% female is the goal (Score:3, Informative)

by Flamesplash ( 469287 ) writes: on Sunday April 03, 2005 @01:56PM (#12127250) Homepage Journal

They are. Neither they nor I stated otherwise. This is exactly why the engr said it would be impossible. Being able to sway 1500 competent female engrs is not exactly doable, especially since google is growing a lot now too. They have high standards for their hiring in general; they often accept a number of false negatives in hiring because they don't want to waste resources on a potential false positive.

MiMMS (Score:3, Informative)

by Kristoffer Lunden ( 800757 ) writes: on Sunday April 03, 2005 @01:57PM (#12127256) Homepage

Found directly in Ubuntu's repositories; you probably have it in many others too:

MiMMS, formerly called "mmsclient", is a simple client to download
streaming audio and/or video media from the internet using the MMS
protocol (i.e. from mms:// type URLs, generally found in asx files).
Downloaded streams can then be replayed offline at your leisure,
using any compatible media player of your choice.

mimms mms://media-wm.cac.washington.edu/ifs/uw_cse05_google_1300k.asf

Of course, a torrent would be even better - for their bandwidth's sake. :)

Re:Google innovates? It's news to me. (Score:1, Informative)

by Anonymous Coward writes: on Sunday April 03, 2005 @07:22PM (#12129208)

Interesting, I did a comparison. Now, I happen to own a Via Rhine based NIC so I did a vague search on the string "Linux via-rhine" (without the quotes) on Google, Teoma and Vivisimo. The string doesn't imply anything except that I want my results to show something about Linux related to via-rhine (or vice versa).

In the Teoma Results [teoma.com] the first hit [titelstory.com] of 51,600 was a forum post where someone asked "Trying to install LMD 10 using a Via motherboard with onboard Rhine NIC configuration asks for additional parameters - anyone know these please?" It was about someone having problems with an onboard Rhine chip on Mandrake, very short, not much detail and not particularly interesting (nor would it have been helpful even if I was having problems with it).

In the Vivisimo Results [vivisimo.com] the first hit [iu.edu] of 51,600 was to a mailing-list post where the topic was "VIA Rhine problem in 2.4". Someone was having an obscure problem with a D-Link dfe-530tx, probably not what I'm looking for. Ironically there was a link in that post to what turned out to be the first hit on Google, and the mailing list post was actually an answer from a company employee at Scyld Software, which brings us to Google.

In the Google Results [google.com] the first hit [scyld.com] of 99,400 was to "Linux Drivers for PCI Ethernet Chips". The link was to a page at Scyld Software, a Linux company. It had information about several Linux kernel drivers (including Via Rhine/II) along with usage instructions, module settings, support options and diagnostic programs - not to mention a direct link to the driver source code. I could learn a lot from this hit, including the fact that I can use the via-rhine driver for both Rhine and Rhine II chips.

What I found most interesting about all this was not the results (they speak for themselves) but rather the number of hits. Teoma and Vivisimo had the exact same number of hits, which leads me to believe that they share the same index but filter the results differently (indeed, the second hit on Vivisimo was the same as the first one on Teoma). I admit, Vivisimo has a really cool interface, especially the "Clustered Results" thing, but the quality of the hits isn't nearly as good as Google's, so neither of them is a Google replacement yet. Well, that's my conclusion based on this shallow test anyway.

Oh, and thanks for the links btw - they're going into my collection.
