Google's Technology Explored
RobotWisdom writes "Internetnews offers a moderately detailed peek at Google's technology. For example, they use stripped-down Red Hat on a massively redundant network, and they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page." Additional analysis on InformationWeek and C|Net. From the article: "As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box."
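To make the routing idea in the quote concrete, here is a minimal sketch of sending a query to whichever complete replica set is least busy. Everything in it (the `ReplicaSet` class, the `in_flight` counter, the set names) is a made-up assumption for illustration; the article does not describe how Google actually picks a set.

```python
from dataclasses import dataclass


@dataclass
class ReplicaSet:
    """One complete copy of the index (hypothetical model, not Google's)."""
    name: str
    in_flight: int = 0    # queries currently being served by this set
    healthy: bool = True  # False if the set is down (e.g. datacenter fire)


def pick_replica(replicas):
    """Return the healthy replica set with the fewest in-flight queries."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica set available")
    return min(candidates, key=lambda r: r.in_flight)


def route(query, replicas):
    """Assign a query to a replica set and record the extra load."""
    chosen = pick_replica(replicas)
    chosen.in_flight += 1
    return chosen.name


# Assumed example data: three replica sets, one of them offline.
replicas = [
    ReplicaSet("set-a", in_flight=3),
    ReplicaSet("set-b", in_flight=1),
    ReplicaSet("set-c", healthy=False),
]
```

With this data, `route("search tree", replicas)` picks `set-b`, because it is healthy and has the lightest load. The same replication that provides the fail-safe also spreads the traffic.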
PigeonRank(TM) (Score:5, Funny)
http://www.google.com/technology/pigeonrank.html [google.com]
Re:PigeonRank(TM) (Score:5, Funny)
That was pre-IPO.
We'd like you to meet Bubba [dailymail.co.uk]. Bubba's fully vested, and as this article [dailymail.co.uk] says, he's, uh... he's grown somewhat.
Not Bubba! (Score:2)
Re:PigeonRank(TM) (Score:5, Funny)
Google Lunar (Score:4, Funny)
http://www.google.com/jobs/lunar_job.html [google.com]
a snippet:
/. effect (Score:4, Funny)
Re:/. effect (Score:5, Interesting)
One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.
Re:/. effect (Score:2, Informative)
Re:/. effect (Score:3, Funny)
Is Dick Cheney in the IT business now?
considering.... (Score:3, Insightful)
Re:/. effect (Score:2, Insightful)
while true; do wget www.google.com; done
seems better to me.
Re:/. effect (Score:5, Funny)
open browser at www.google.com
get a drinking duck thing that bobs up and down hitting F5 every second
seems better to me.
Re:/. effect (Score:2)
Re:/. effect (Score:2)
Truly Amazing. (Score:5, Interesting)
Also Amazing: How much we miss (Score:5, Interesting)
Just think about how vast and extensive Google's search is, and then think about how little of the World's knowledge and creative achievement it actually can access.
The quantity and breadth of human knowledge is breathtaking, no?
Re:Also Amazing: How much we miss (Score:5, Insightful)
Only a few years ago it could take forever to find any kind of decent information on some topics online or even in libraries. Today, I go to wiki and I'm almost assured to have a FAIRLY reliable source for information, as it's cross checked by peers who have some kind of a personal interest in the subject.
However, there's a downside.
Back when I was in school, researching a subject typically meant going through encyclopedia after encyclopedia, which wasn't a bad thing. I learned quite a bit by being FORCED to over-research topics. Today, I can generally straight-shoot to whatever I need to find, giving my brain a good set of blinders to everything else along the way.
Re:Also Amazing: How much we miss (Score:3, Interesting)
Technology has the ability to improve everyone's collective IQ, but also has the ability to dumb down the populace. Kind of like TV. I remember tutoring an elementary student when I was a high school student back in '95 or so, and he couldn't do simple math (addition, subtraction, etc.) without his calculator. Sad...
Re:Also Amazing: How much we miss (Score:3, Insightful)
That's the problem. It isn't reliable. For example, one local journalist got burned badly by using that piece of crap to do research during the election.
Correction: It's "often" reliable.
You want a better source?
Sorry, you won't find one. Not a single one at least.
What you're speaking of is not a problem with Wikipedia, that's a problem with a journalist who doesn't know how to properly research a subject. If a journalist relies on any single source to be per
Re:Also Amazing: How much we miss (Score:3, Funny)
What more could you need?
Re:Also Amazing: How much we miss (Score:2, Insightful)
Well, I think you haven't studied enough if you think this. When you start to realize we actually know very little, then you're getting somewhere.
Re:Also Amazing: How much we miss (Score:3, Interesting)
"We have an embarrassment of riches in that we're able to store more than we can access. Capacities continue to double each year, while access times are improving at 10 percent per year. So, we have a vastly larger storage pool, with a relatively narrow pipeline into it." -- Jim Gray, Microsoft Research.
Re:Also Amazing: How much we miss (Score:3, Interesting)
I think it might be pretty amazing to find out what we can't easily access, even that which is published on the net. A simple example: you can't differentiate "net" from ".net" on google, and net is an extremely common word so it is next to useless as a qualifier if you're searching for info on the ".net" equivalent to anything common. Or try searching for the smiley face: ":-)". While those may be trivial and uninteresting specific examples, they illustrate at least one area where "you can't find it throu
Laziness, ignorance or (Score:2, Insightful)
Re:Laziness, ignorance or (Score:5, Insightful)
Re:Laziness, ignorance or (Score:3, Insightful)
Re:Truly Amazing. (Score:2)
I was in a restaurant and my girlfriend ordered a small orange juice while I ordered a large one.
She drank hers rather quickly and for some reason, I thought the glasses looked odd even though mine was larger. So, I poured my large orange juice in her empty glass, and (go figure) it fit.
Even after demonstrating this
That's an English pint, you yob. (Score:2)
interesting (Score:3, Funny)
So that's why I can search on the result page for my original query and find nothing. And all this time I was blaming Internet Explorer!
Re:interesting (Score:3, Interesting)
Comment removed (Score:5, Funny)
Re:Whats really impressive (Score:5, Funny)
Reminds me of a radio interview I once heard with the Google founders. The host was curious about what the "I'm feeling lucky!" button was about. She claimed she typed in "Google" into the search box and clicked "I'm feeling lucky!", and nothing happened, so it didn't work!
Re: (Score:3, Informative)
did for me (Score:3, Informative)
An "I'm Feeling Lucky" search means less time searching for web pages and more time looking at them.
from the "I'm Feeling LuckyTM" button [google.com]. Guess they changed it.
Re:Whats really impressive (Score:2)
Re:Whats really impressive (Score:2)
Re:Whats really impressive (Score:3, Interesting)
A set which contains all sets which do not contain themselves may be a conundrum, but a catalog that lists all catalogs that do not list themselves is merely impossible (trivially impossible, in fact). There are plenty of things that can be
Picked up a Microsoftie (Score:2, Informative)
http://news.yahoo.com/news?tmpl=story&u=/zd/20050303/tc_zd/146950
blog:
http://mark-lucovsky.blogspot.com/2005/02/shipping-software.html
Meltdown? (Score:4, Interesting)
Gee.. I wish our
Re:Meltdown? (Score:4, Funny)
Gee.. I wish our
It is my belief that data center fires are caused by slashdot every day!
More useless search results? (Score:4, Insightful)
Even pages that come up in my search results now that contain my query don't even have anything to do with what I am looking for. Isn't this just adding to the problem?
How about a "Did you mean?" option that compares against related topics instead of spelling?
Re:More useless search results? (Score:5, Informative)
the word "tree" may either refer to a data structure (binary, B-, red-black, etc.) or to the stuff forests are made of. If my query is "search tree", the words search and tree may show up on a page about people searching for some kind of a tree and on pages about search trees. Assuming they're both popular classes of pages, you're going to end up with some mishmash of results from both classes.
Instead, the clustering algorithm might notice (based on other words that appear on the pages, for example) that pages with 'search' and 'tree' in them fall into two classes. That doesn't help if "search tree" is all it has to go by. But now if I add the words "data structure" to the query, it knows which class of pages I'm interested in, because many pages about binary trees contain the words "data structure" whereas almost none about the quest for trees do. Now it can return pages from the right cluster that it knows are relevant, even if they don't contain the words "data structure" in them.
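A toy version of that disambiguation step, for anyone who wants to see it in code. The clusters, context words, and page titles are all invented for illustration, and scoring by simple word overlap is my assumption, not whatever Google's clustering really does.

```python
# Hypothetical clusters: the context words seen on each cluster's pages,
# plus some page titles belonging to that cluster.
CLUSTERS = {
    "data structures": {
        "context": {"search", "tree", "binary", "node", "data",
                    "structure", "algorithm"},
        "pages": ["Red-black trees explained", "B-tree insertion walkthrough"],
    },
    "forestry": {
        "context": {"search", "tree", "forest", "oak", "leaf", "hiking"},
        "pages": ["Finding the oldest oak", "Tree-spotting trails"],
    },
}


def best_cluster(query):
    """Return the cluster whose context words overlap the query most.

    Returns None when the best score is tied, i.e. the query is ambiguous.
    """
    words = set(query.lower().split())
    scores = {name: len(words & c["context"]) for name, c in CLUSTERS.items()}
    top = max(scores.values())
    winners = [name for name, score in scores.items() if score == top]
    return winners[0] if len(winners) == 1 else None


def pages_for(query):
    """Return pages from the winning cluster, or all pages if ambiguous."""
    name = best_cluster(query)
    if name is None:
        return [p for c in CLUSTERS.values() for p in c["pages"]]
    return CLUSTERS[name]["pages"]
```

Here `best_cluster("search tree")` is `None` (both clusters match equally), while `best_cluster("search tree data structure")` is `"data structures"`. Note that the returned page titles never contain the words "data structure" themselves.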
no AND needed (Score:5, Interesting)
they're starting to have success with automatic clustering of concepts, so that pages can match even if none of the words in your query actually appear on the page.
From the help guide [google.co.uk]:
By default, Google only returns pages that include all of your search terms.
Which of these is correct? If it's the summary, is there any way to turn this behaviour off? I find it immensely annoying.
Re:no AND needed (Score:5, Informative)
I think what they mean is that they are working on search algorithms that will implement this, not that they have already made it publicly available. They want it to work first, and be released second. The problem that you have cropping up most likely occurs with pages that put info in the metadata, so the terms don't show up in the page itself.
Re:no AND needed (Score:3, Insightful)
For example, try searching for "miserable failure" on Google. The first result is George Bush's biography on www.whitehouse.gov.
However, the term "miserable failure" doesn't actually show up (yet) in the biography. But, pages that POINT to the biography do include those terms.
As a result, pages can match your search quer
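The anchor-text effect described above can be sketched in a few lines: index each target page under the words used in links that point to it, so a page can match a query it never contains. The URLs and function names are hypothetical, and this is only a rough model of the idea, not Google's ranking.

```python
from collections import defaultdict


def build_anchor_index(links):
    """Index target URLs by the words in inbound link text.

    links: iterable of (source_url, anchor_text, target_url) tuples.
    """
    index = defaultdict(set)
    for _source, anchor, target in links:
        for word in anchor.lower().split():
            index[word].add(target)
    return index


def search(index, query):
    """Return targets whose inbound anchor text covers every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Assumed example data: two blogs link to the same page with the same text.
links = [
    ("http://blog-one.example/post", "miserable failure", "http://target.example/bio"),
    ("http://blog-two.example/post", "miserable failure", "http://target.example/bio"),
    ("http://blog-two.example/post", "great recipe", "http://food.example/soup"),
]
index = build_anchor_index(links)
```

`search(index, "miserable failure")` returns only `http://target.example/bio`, even though nothing here ever looked at the text of that page.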
Re:no AND needed (Score:2)
"allintext:" before everything works. Thanks to a helpful AC.
Oops (Score:5, Funny)
Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cooking" is a good match even though it contains none of the query words.
One word: cooking.
I'm sure the principle is sound. I just think the example is a leetle bit flawed.
Re:Oops (Score:2)
1) The connection between "courses" and "class".
2) The connection between "Bay Area" and "Berkeley".
Re:Oops (Score:4, Interesting)
The company also is applying machine learning to its system to give better results. Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cuisine" is a good match even though it contains none of the query words.
FYI.
Re:Oops (Score:3, Interesting)
Too celver for their own good? (Score:3, Insightful)
I hate that. Don't you hate that? When you type in a search keyword, isn't it because you want that keyword to appear in the documents you find?
This "find tangentially related documents" feature will be fine so long as they make it optional and set it to be off by default. Otherwise, I don't want their idea of what pages I should be looking at polluting my results list.
I call "innovation for the sake of innovation".
"Celver"? Did I say "celver"? (Score:2)
Re:"Celver"? Did I say "celver"? (Score:2)
Re:Too celver for their own good? (Score:3, Insightful)
That's how it works...
Re:Too celver for their own good? (Score:2)
I do agree it should be an option though. Google (in my opinion) has been pretty good about not being obtrusive, so I su
Yeah, I noticed that (Score:4, Informative)
I was mightily impressed, and not just because it means more people read my stuff. Or at least surf to it.
Re:Yeah, I noticed that (Score:3, Informative)
Re:Yeah, I noticed that (Score:2, Interesting)
I can't reproduce this with another term. I wonder whether this was a manual fix by Google programmers.
Re:Yeah, I noticed that (Score:2)
Video about some of the backend stuff (Score:5, Interesting)
Google: A Behind-the-Scenes Look [uwtv.org].
Re:Video about some of the backend stuff (Score:2)
Re:Video about some of the backend stuff (Score:2)
Re:Video about some of the backend stuff (Score:2)
Want the dean.pdf without a USENIX account? (Score:2, Informative)
Question... (Score:4, Interesting)
Do they share these patches with everyone else?
Re:Question... (Score:2, Insightful)
;i was wondering the same thing. do modifications of this sort fall under the GPL? if so, isn't google required to share them with the public [gnu.org], or are "patches" not considered "modifications" to the software?
;treehead
Re:Question... (Score:2)
Re:Question... (Score:5, Informative)
The GPL does not force them to do anything unless they wish to redistribute the software.
Re:Question... (Score:2)
Re:Question... (Score:2)
Re:Question... (Score:2, Funny)
Sure? (Score:5, Funny)
Re:Sure? (Score:2)
Or...maybe not
gCluster (Score:5, Informative)
sourceforge is begging for something like this..
Their engineer desktops have special google builds of linux which help them compile things insanely fast with g4, ie hacked p4 (Perforce).
They also have one of the best intranet sites I've seen. Lots of info and services the employees can use, apart from email.
The internal blogs really help with keeping track of projects you're not working on, and what others are doing. Their mailing lists are often useful too; for example, there's a lost and found, for sale, and biking partners list. All kinds of useful little stuff, taking care of the people with nice little things. Lots of reading too.
-- Robi
kernel patches? (Score:5, Insightful)
and the obvious question:
where are the patches?
Anybody know? This is not a GPL question, just an ethical one.
Re:kernel patches? (Score:2)
Re:kernel patches? (Score:3, Interesting)
On their own servers, then they're obeying the rules.
The question is: Do they use these patches on the search appliances they sell, and does that count as "distribution"? I honestly don't know the answer to that question, and I'd like to think Google has sharp legal advisors to go with their sharp technical people.
Re:kernel patches? (Score:2)
You mean GPL. The other "rules" are pretty subjective. I for one, would like to see Google act in favor of the community assuming the kernel patches are not the core of their technology.
The question is: Do they use these patches on the search appliances they sell, and does that count as "distribution"? I honestly don't know the answer to that question, and I'd like to think Google has sharp legal advisors to go with their sharp technical people.
Re:kernel patches? (Score:2, Insightful)
Re:kernel patches? (Score:3, Insightful)
They'll tell you as soon as you point out where or how they are distributing them (yes, that's why it wasn't a GPL question).
Why should Google be "ethical"? Likely these modifications are part of their IP trove, which keeps them ahead of the (already heated up) competition.
If you don't like the way someone uses the software you're giving away then perhaps you shouldn't give it away, or maybe it's just that the license is flawed. It's dumb to expect people who run billion-dollar
Re:kernel patches? (Score:3, Insightful)
The only question is whether or not Google is selling these patches as part of their appliances.
"The text you entered was not found." (Score:5, Interesting)
The main flaw I've found in Google's results has been when it returns pages without one of my query words, which doesn't respond to the sense of my query. Sometimes it's changed page content at the same URL, so I go back and get the "cached" page, if it exists. The cached pages reveal in their headings whether the page matched only because the query word was found only in another page linking to the returned page. I'd like their immediate results to show that distinction, and to have links in the results to click around those pages related by my complete query. The current click/back/"cache" combinations are frustratingly disconnected, conflicting with Google's otherwise smooth immediacy.
Re:"The text you entered was not found." (Score:2)
When I ask for Google.com, I'd like Google.com, not Google.random_country! (but I know, it's not a bug, it's a feature)
Re:"The text you entered was not found." (Score:2)
Google Maps - Designed to protect data centres (Score:5, Funny)
Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.
"You don't have just one data center," he said, "you have multiples."
The real idea behind Google Maps is that as the server catches fire, it uses its last cycles to send an email to the nearest fire chief and include a map. I think it would also throw in a GMail invite as an incentive.
Question -- Is any of this considered P2P? (Score:2, Interesting)
How much of what Google is doing -- the clustering, the redundancy, the sub-categorization -- how much of this (if any) could be described -- could fit under the mantle of "Peer-to-Peer"? Is anything that Google is doing here remotely considered P2P? (Even if the P2P is what's going on on their own, in-house servers?)
Obviously, I ask this because of the upcoming supreme court case. And I ask because it struck me as I read the article t
Re:Question -- Is any of this considered P2P? (Score:2, Insightful)
no matches, yet matches (Score:2)
Let me guess... the pages that match just happen to point to advertisers?
Frugal Google (Score:3, Insightful)
They say being frugal is a virtue, which Google has, evidently. What is the lesson here? Holding down the cost and being innovative never fail. I guess.
MapReduce (Score:2, Informative)
A lot of this stuff is an application of SAN/RAID/failover technology, which is cool (and we've never seen it so pervasively implemented), but not horribly revolutionary. I think the slickest thing they've developed, though it might not get the most attention, is their MapReduce [google.com] framework. The abstract from their paper:
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a _map_ function that processes a key/value pair to generate a set of i
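For a feel of the programming model, here is a toy single-process word count written in the map/reduce style. It is only a sketch: `run_mapreduce` is a hypothetical helper I made up, and the real framework distributes this work across many machines, which this does not attempt.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map step: emit an intermediate (word, 1) pair for each word."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: merge all values emitted for the same key."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map over every input pair, group by key, then reduce each group.

    inputs: iterable of (key, value) pairs, e.g. (doc_id, text).
    """
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Assumed example data: two tiny "documents".
docs = [("d1", "a b a"), ("d2", "b c")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user only writes `map_fn` and `reduce_fn`; the grouping in the middle is the part a framework takes over.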
Google and it's 1980's search literal-mindedness (Score:3, Insightful)
So, WHY, I ask, is google only now getting around to using these techniques?
Re:Google and it's 1980's search literal-mindednes (Score:3, Insightful)
It's not a great example, but my mind seems to have gone temporarily blank of words that have many synonyms :(
"we can't crawl as fast as we would like" (Score:3, Interesting)
Re:"we can't crawl as fast as we would like" (Score:2)
Because this already can be specified in html metadata:
Crawling rate... (Score:2)
well, we could introduce a setting into robots.txt where we can tell google how often they can spider your site...
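Something along these lines already exists as the non-standard `Crawl-delay` directive, which gives a delay in seconds between requests. Not every crawler honors it, and nothing in this thread says Google does. As a sketch, Python's standard `urllib.robotparser` can read it; the robots.txt content and the `ExampleBot` name below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to wait 10 seconds per request.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def may_fetch(robots_txt, user_agent, url):
    """Return True if user_agent is allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


delay = crawl_delay_for(ROBOTS_TXT, "ExampleBot")
```

A polite crawler would sleep for `delay` seconds between requests to the site when the value is not `None`.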
And Debian in Gmail servers? (Score:2)
Posted in my blog [bioinformatica.info].
Re:Impressive technology but the algorithms aren't (Score:3, Interesting)
Here's another great idea you inspired that they could also never do (being a commercial company themselves and all).
When I am searching I virtually always want to do one of two distinct things:
1) Search only commercial sites for a product to purchase.
2) Search everything but commercial sites for information.
There really should be a "$" flag that you could add (or at least a "!$" flag) to control whether you see commercial or non-commercial sites in the results list
Re:Impressive technology but the algorithms aren't (Score:2)
hardware (Score:2, Interesting)
Parts fail left and right, and nobody bothers to fix them. The software hides all this from the users.
Google even checksums the data, on the assumption that it is frequently getting corrupted by all the junk hardware they buy.
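The checksumming idea is easy to sketch: keep a checksum next to each copy of the data, verify it on read, and fall back to another copy when it doesn't match. CRC32 and the function names here are my own assumptions for illustration; the comment doesn't say what checksum or layout Google actually uses.

```python
import zlib


def store(data):
    """Return the data together with its CRC32 checksum."""
    return data, zlib.crc32(data)


def read_first_intact(copies):
    """Return the first copy whose checksum still matches its data.

    copies: iterable of (data, checksum) pairs, one per replica.
    """
    for data, checksum in copies:
        if zlib.crc32(data) == checksum:
            return data
    raise IOError("all copies failed checksum verification")


# Assumed example: one replica has a flipped byte, the other is fine.
good = store(b"index shard")
corrupted = (b"index shaRd", good[1])
recovered = read_first_intact([corrupted, good])
```

`recovered` is the clean `b"index shard"`: the corrupted replica is skipped silently, which is how the software can hide flaky hardware from users.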
Re:define: cheap machines (Score:5, Interesting)
Anyhow, the article mentioned that in these early datacentres they experienced something like a 25% hardware failure rate, but that it didn't matter because the software worked around it and the hardware was cheap.
Here's a link to the page [topix.net] where I read all this neat stuff. It's probably mostly about the same stuff as the article we've all just slashdotted, but I won't be able to tell for a while....