The Man Behind Google's Ranking Algorithm

Posted by CmdrTaco on Sun Jun 03, 2007 09:45 AM
from the dear-god-no-more-seo-spam-please dept.
nbauman writes "New York Times interview with Amit Singhal, who is in charge of Google's ranking algorithm. They use 200 "signals" and "classifiers," of which PageRank is only one. "Freshness" defines how many recently changed pages appear in a result. They assumed old pages were better, but when they first introduced Google Finance, the algorithm couldn't find it because it was too new. Some topics are "hot". "When there is a blackout in New York, the first articles appear in 15 minutes; we get queries in two seconds," said Singhal. Classifiers infer information about the type of search, whether it is a product to buy, a place, company or person. One classifier identifies people who aren't famous. Another identifies brand names. A final check encourages "diversity" in the results, for example, a manufacturer's page, a blog review, and a comparison shopping site."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • by Anonymous Coward on Sunday June 03 2007, @09:53AM (#19371217)
    Pigeon Rank?
  • Amit Singhal ... (Score:5, Informative)

    by WrongSizeGlass (838941) on Sunday June 03 2007, @09:55AM (#19371235) Homepage
    ... is not to be confused with Amit Singh [kernelthread.com], who also works at Google and has authored an excellent book on Mac OS X, Mac OS X Internals [osxbook.com].
  • by dwater (72834) on Sunday June 03 2007, @09:58AM (#19371261)
    > They use 200 "signals" and "classifiers," of which PageRank is only one.

    How many did they expect PageRank to be? In the words of someone immortal, "There can be only one."
    • Now, what if they cut out pagerank completely, would their search results still be just as good?
      • Re: (Score:3, Interesting)

        From the results I've been getting lately, they seem to be dropping page rank in preference to how many times the words 'google adwords' appear on the page, or more precisely the code for generating them. Totally worthless pages but obviously not worthless for google's bottom line. This story obviously reflects one thing and one thing only, the growing perception in the public's eye of the deteriorating quality of google's results, hence yet another marketing fluff piece, to try to convince them, it just ain'
  • Feature Request (Score:5, Insightful)

    by rueger (210566) * on Sunday June 03 2007, @10:13AM (#19371383) Homepage
    My ongoing gripe with Google is the number of times when the first page is filled with shopping sites, "review" pages, and click-through pages that exist only to grab you on the way to where you really want to go.

    I would love a switch, or even a subscription, that would allow me to filter these usually useless types of pages and instead show me pages with real content.

    • Re: (Score:3, Funny)

      Haven't had much trouble with the click through sites but when looking for some information on anything that can potentially be sold (or even, as I recently experienced, has been sold in the not too distant past but hasn't been in the last five years), the shopping sites are a real problem

      This item you're searching for hasn't been in inventory for 6 years since nobody makes it anymore, would you like to read a review ? : be the first to write one !

      Yay.
    • Try a more specific query, or try a query that excludes "review", "sale", "price", or whatever you like.

      I find that most queries give me what I want right away (eg paris hilton), and those that don't (eg lindsay lohan) do give me what I want after narrowing down the sites returned (eg lindsay lohan drunk car -herbie -vomit -intitle:"fan site").
    • I'm fairly confident that the feature you want is one Google is trying very hard to provide. I doubt adding a switch somewhere is the problem.
    • Re:Feature Request (Score:5, Informative)

      by SilentStrike (547628) on Sunday June 03 2007, @01:42PM (#19372959) Homepage
      This probably does what you want.

      http://www.givemebackmygoogle.com/ [givemebackmygoogle.com]

      It just negates a whole lot of affiliate sites.

      This is part of the query it feeds to Google.

      -inurl:(kelkoo|bizrate|pixmania|dealtime|pricerunner|dooyoo|pricegrabber|pricewatch|resellerratings|ebay|shopbot|comparestoreprices|ciao|unbeatable|shopping|epinions|nextag|buy|bestwebbuys)
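A minimal sketch of how a query like the one above could be assembled, assuming the site simply appends one `-inurl:(a|b|c)` clause to whatever the user typed. The function name `build_exclusion_query` and the shortened block list are hypothetical; nothing here is taken from givemebackmygoogle.com itself, and whether a given search engine honors this syntax is not something the sketch can guarantee.

```python
def build_exclusion_query(terms, blocked_url_fragments):
    """Append a single -inurl:(a|b|c) clause to the user's search terms."""
    base = " ".join(terms)
    if not blocked_url_fragments:
        return base
    clause = "-inurl:(" + "|".join(blocked_url_fragments) + ")"
    return f"{base} {clause}"


# Shortened, illustrative block list (a subset of the fragments quoted above).
BLOCKED = ["kelkoo", "bizrate", "pricegrabber", "ebay", "nextag"]

query = build_exclusion_query(["digital", "camera"], BLOCKED)
print(query)
```

Keeping the block list as plain data makes it easy to add or drop fragments without touching the query-building logic.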
    • Re:Feature Request (Score:5, Informative)

      by quiddity (106640) on Sunday June 03 2007, @01:51PM (#19373039)
      Firefox extension: http://www.customizegoogle.com/ [customizegoogle.com] lets you filter out URLs from the results (plus dozens of other useful things).

      You can filter out Wikipedia mirrors (using that extension) with the list here: http://meta.wikimedia.org/wiki/Mirror_filter [wikimedia.org]
  • by Xoq jay (1110555) on Sunday June 03 2007, @10:14AM (#19371387)
    Pagerank is the source of all wisdom in google... but there is so much more... Like string searching & matching algos, file searching.. you name it.. Just the other day I was searching for books about Google's algorithms... I found zero interesting stuff.. They keep their algorithms secret and out of the public domain... (like they should..). we praise Pagerank, but if we knew what other stuff is there, we would all be members of Church of Google (http://www.thechurchofgoogle.org/) :P
    • Re: (Score:2, Informative)

    • How does it work (Score:5, Informative)

      by Anonymous Coward on Sunday June 03 2007, @03:11PM (#19373743)
      It is rather simple (I am an insider).

      Google breaks pages into words. Then, for every word it keeps a set which contains all the pages (by hash ID) that contain that word. A set is a data structure with O(1) lookup.

      When you search for "linux+kernel" google just does the set union operation on the two sets.

      Now a "word" is not just a word. If Google sees that many people use the combination linux+kernel, a new word is created, the linux+kernel word, and it has a set of all the pages that contain it. So when you search for linux+kernel+ppp we find the union of the linux+kernel set and the "ppp" set.

      So every time you search, you make it better for google to create new words. And this is part of the power of this search engine. A new search engine will need some time to gather that empirical data.

      Of course, there are ranks of sets. For example, for the word "ppp" there are, say, two sets. The pages of high rank that contain the word ppp, and the pages of low rank. When you search for ppp+chap, first you get the set union of the high rank sets of the two words, etc.

      Now page rank has several criteria. Here are some:
      well ranked site/domain, linked by well ranked page, document contains relevant words, search term is in the title or url, page rank not lowered by google employee (level 1), page rank increased, etc.

      It is not very difficult actually.

      (posting AC for a reason).
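The word-to-page-set structure described in the comment above is usually called an inverted index. Here is a toy sketch of it, under loud assumptions: the page texts, the function names `build_index` and `search`, and the precomputed `linux+kernel` entry are all made up for illustration, and nothing confirms that Google works this way. One caveat: the comment says "set union", but finding pages that contain every query word corresponds to set intersection, so the sketch uses intersection.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def search(index, words):
    """Return IDs of pages containing every query word (set intersection)."""
    sets = [index.get(w.lower(), set()) for w in words]
    if not sets:
        return set()
    return set.intersection(*sets)


pages = {
    1: "linux kernel ppp driver",
    2: "linux desktop themes",
    3: "kernel of truth",
}
index = build_index(pages)

# Precompute a combined "word" for a frequent pair, as the comment describes.
index["linux+kernel"] = index["linux"] & index["kernel"]

print(search(index, ["linux", "kernel"]))
print(search(index, ["linux+kernel", "ppp"]))
```

The precomputed pair saves one intersection at query time, which is the benefit the comment attributes to learning frequent combinations from user searches.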
      • Could it not simply be that they're not keeping it under wraps to avoid sneaky webmasters manipulating their sites, but to prevent competitors gaining an edge?
        • I would agree that's likely the reason that Google won't release their algorithm, but my question was why many people outside of Google insist that Google should keep their algorithm secret. If Google in a moment of financial insanity released their search algorithms to their competition it wouldn't decrease the quality of my search results, actually that might improve my results if someone takes Google's algorithm and improves on it.
  • by Timesprout (579035) on Sunday June 03 2007, @10:15AM (#19371401)

    Search over the last few years has moved from Give me what I typed to Give me what I want, says Mr. Singhal
    So this is why all my results are links to lesbian porn regardless of what I search for.
  • by Anonymous Coward on Sunday June 03 2007, @10:24AM (#19371463)
    One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages. A google search for "ruby <<" returns no results related to the ruby append operator. A simple search for "<<" by itself returns ZERO results.
    • One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages. A google search for "ruby <<" returns no results related to the ruby append operator. A simple search for "<<" by itself returns ZERO results.
      Yes, well you see that's a problem common to most search systems. Non-alphanumeric characters tend to be reserved for search logic. It would indeed be nice if there was a way to force literals into the search terms, but for now we just have to make do the way we always have: search for ruby append [google.com] instead, or (if you don't know what it's called) search for ruby string operators [google.com] and find out.
      • Yes. Try to find information on the web about the language "C+@". It's real, and it was developed at Bell Labs some years ago back in the Plan 9 era, but it's unsearchable.

          • So how does Google know to tailor its results for C, C++, and C#, which all return results specific to the requested language, but not for C+@?

            Manually implemented special cases, perhaps. Or Google may not consider the possibility that "@" can be part of a word, which is likely.

          • Re: (Score:2, Interesting)

            This is an interesting question that I've often wondered about. It's possible that Google programmers simply went in and special-cased C++ and C#, but I personally think that Google has an automated process which notices that "C++" and "C#" are commonly occurring both in web pages and queries, and then automatically adds them to the list of "strange" tokens to index.
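To make the "list of strange tokens" idea above concrete, here is a small sketch of a tokenizer that strips punctuation by default but keeps anything found on an allow-list. The list contents, the name `SPECIAL_TOKENS`, and the tokenizing rules are assumptions for illustration only; how Google actually tokenizes is not public.

```python
import re

# Hypothetical allow-list of "strange" tokens that should survive tokenization.
SPECIAL_TOKENS = {"c++", "c#"}


def tokenize(text, special=SPECIAL_TOKENS):
    """Split on whitespace, keep known special tokens, strip other punctuation."""
    tokens = []
    for raw in text.lower().split():
        # Trim sentence punctuation so "C++." still matches the allow-list.
        candidate = raw.strip(".,;:!?\"'()")
        if candidate in special:
            tokens.append(candidate)
        else:
            tokens.extend(re.findall(r"[a-z0-9]+", candidate))
    return tokens


print(tokenize("C++ vs C# vs C+@"))
```

Under these rules "C+@" collapses to a bare "c", which matches the behavior complained about in this thread: a token that is not on the list cannot be searched for literally.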
      • Non-alphanumeric characters tend to be reserved for search logic.

        True, but I'd hope that at least using quotation marks to search for phrases would also include special characters.

        I mean, there can't be any search logic inside quotes anyway; then that would be part of the phrase.
        Like "Apples or oranges" won't search for either apples or oranges, but the actual phrase.
    • Re: (Score:2, Insightful)

      One of the most annoying things about google for me is how it interprets queries with strange characters common to almost all programming languages.

      You should try google code search [google.com].

    • Re: (Score:3, Insightful)

      I have the same problem. But if you're searching for actual code, you're better off using a code search engine [koders.com]. Or as others have pointed out, search "ruby append operator" if you're interested in the concept.
  • One search feature (Score:5, Interesting)

    by Z00L00K (682162) on Sunday June 03 2007, @10:27AM (#19371493) Homepage
    that has been lost was the "NEAR" keyword that AltaVista used earlier. I found it rather useful.

    This could allow for a better search result when using for example "APPLE NEAR MACINTOSH" or "APPLE NEAR BEATLES"

    Ho hum... Times change and not always for the better...

    • Clusty [clusty.com] does something similar. Searching for "Apple" will show categories for OSX and fruit, for instance.
    • I think "NEAR" is implied with Google. That is to say, if you search for "apple macintosh", pages with those two terms in close proximity will rank higher than pages which simply contain the terms. Since Google's exact algorithms are proprietary, I cannot swear to this, but that seems to be the way it behaves in my own use.

      What I miss from Alta Vista is the ability to do grouping to set precedence, i.e., parentheses. I don't have to do this very often, but when I do, I really miss it. The need generally
      • This is definitely not always the case. I've had this problem a few times recently - the first page or two of results is a mix of a few useful sites and a lot of sites that happen to contain the two words, but on unrelated parts of the page. I have to dig through the results to find what I need. Especially if the unuseful sites are very popular ones and the ones I want are more obscure.
    • Wildcards in strings "apple * macintosh" will return pages with the word macintosh shortly following apple. Not reversible, but still quite useful for that kind of search.
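For readers who never used it, the NEAR operator discussed in this thread can be approximated as a check on word positions. This is a rough sketch only: the function name `near` and the default window of 5 words are assumptions, not AltaVista's documented behavior.

```python
def near(text, first, second, max_gap=5):
    """True if `first` and `second` occur within `max_gap` words of each other."""
    words = text.lower().split()
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in pos_first for j in pos_second)


close = "apple released the macintosh in 1984"
far = "apple pie recipe " + "filler " * 10 + "macintosh"

print(near(close, "apple", "macintosh"))
print(near(far, "apple", "macintosh"))
```

Unlike the ordered wildcard form, this check is symmetric: it matches whichever of the two words comes first.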
  • by rbarreira (836272) on Sunday June 03 2007, @10:38AM (#19371569) Homepage
    Does the algorithm account for the toilet seat's position?
  • by polarbeer (809243) on Sunday June 03 2007, @11:27AM (#19371905)
    One interesting thing about the article was the down-to-earth lack of abstraction in the problems described, such as the teak patio palo alto problem. Other search engines brag about their web-filtered-by-humans approach, as opposed to the "cold" algorithmic approach of Google. But it turns out Google is pretty human too, only with higher ambitions of creating generalizations from the human observations.
  • I'd like to know how they transform their queries before running them against the index. I.e. how they decide whether they should throw out the "stop" words (most prepositions, some verbs, some nouns) or keep them, whether they should throw in an alternative spelling or synonym, whether they should throw in a semantically related word or two to increase recall (this is evident when you search for something and get related words highlighted in the results), when to stem and when not to stem.

    Those are the thi
    • by martin-boundary (547041) on Sunday June 03 2007, @09:42PM (#19376743)
      Read the article, it gives a pretty clear picture of what's going on if you're a little familiar with classification ideas, eg bagging, boosting etc. Don't read further if you're familiar with those terms.

      A classifier is a black box which takes some data as input, and computes one or more scores. The simplest example is a binary classifier, say for spam. You feed some data (eg an email) and you get a score back. If it's a big score say, then the classifier thinks it's spam, and if it's a small score it's not spam. More generally, a classifier could give three scores to represent spam, work, home, and you could pick the best score to get the best choice.

      So you should really think of a classifier as a little program that does one thing really well, and only one thing. For example, you can build a small classifier that looks if the input text is english or russian. That's all it does.

      Now imagine you have 100 engineers, and each engineer has a specialty, and each builds a really small classifier to do one thing well. The logic of each classifier is black boxed, so from the outside it's just a component, kind of like a lego brick. What happens when you feed the output of one lego brick to the input of another lego brick?

      Say you have three classifiers: english spam recognizer, russian spam recognizer, english/russian identifier. You build a harness which uses the english/russian identifier first, and then depending on the output your program connects the english spam recognizer or the russian spam recognizer.
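The three-classifier harness just described can be sketched in a few lines. Everything here is a toy stand-in: the Cyrillic check, the tiny spam word lists, the 0.3 threshold, and all function names are assumptions chosen to keep the example self-contained, not anything from the article.

```python
def identify_language(text):
    """Toy language identifier: any Cyrillic letter means Russian."""
    is_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
    return "russian" if is_cyrillic else "english"


def english_spam_score(text):
    """Toy score: fraction of words found in a tiny English spam list."""
    spam_words = {"viagra", "lottery", "winner"}
    words = text.lower().split()
    return sum(w in spam_words for w in words) / max(len(words), 1)


def russian_spam_score(text):
    """Toy score: fraction of words found in a tiny Russian spam list."""
    spam_words = {"лотерея", "выигрыш"}
    words = text.lower().split()
    return sum(w in spam_words for w in words) / max(len(words), 1)


# The harness: the first classifier's output selects the second classifier.
RECOGNIZERS = {"english": english_spam_score, "russian": russian_spam_score}


def is_spam(text, threshold=0.3):
    language = identify_language(text)
    score = RECOGNIZERS[language](text)
    return score > threshold


print(is_spam("lottery winner"))
print(is_spam("meeting at noon tomorrow"))
```

Each recognizer stays a black box with the same interface (text in, score out), which is what lets the harness swap one for another.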

      Now imagine a huge network with some classifiers in parallel and some classifiers in series. At the top there's the query words, and they travel through the network. One of the classifiers might trigger word completion (ie bio -> biography as in the article), another might toggle the "fresh" flag, or the "wikipedia" flag etc. In the end, your output is a complicated query string which goes looking for the web pages.

      The key idea now is to tweak the choice thresholds. To do that, there's no theory. You have to have a set of standard queries with a list of the outputs the algorithm must show. Let's say you have 10,000 of these queries. You run each query through the machine, and you get a yes/no answer for each one, and you try to modify the weights so that you get a good number of correct queries.

      Of course you want to speed things up as much as possible, you can use mathematical tricks to find the best weights, you don't need to go get the actual pages if your output is a query string you just compare the query string with the expected query string etc, but that would depend on your classifiers, the scheme used to evaluate the test results, and how good your engineers are.
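The tuning loop described in the last two paragraphs can be illustrated with a single threshold and a tiny labeled test set. This sketch assumes a plain grid search over candidate values; the "mathematical tricks" mentioned above are not specified, so this is a stand-in, and the scores, labels, and function names are invented for the example.

```python
def accuracy(scores, labels, threshold):
    """Fraction of examples where (score > threshold) matches the expected label."""
    hits = sum((s > threshold) == label for s, label in zip(scores, labels))
    return hits / len(labels)


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest accuracy on the test set."""
    return max(candidates, key=lambda t: accuracy(scores, labels, t))


# Invented classifier scores and the yes/no answers the system "must" give.
scores = [0.1, 0.2, 0.4, 0.7, 0.9]
labels = [False, False, False, True, True]
candidates = [0.0, 0.25, 0.5, 0.75]

chosen = best_threshold(scores, labels, candidates)
print(chosen, accuracy(scores, labels, chosen))
```

With many classifiers the search space grows quickly, which is presumably where faster optimization methods and a large test set start to matter.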

      The point is that there's no magic ingredient, it's all ad-hoc. Edison tried hundreds of different materials for the filament in his lightbulb. Google is doing the same thing according to the article. What matters for this kind of approach is a huge dataset (ie bigger than any competitors') and a large number of engineers (not just to build enough components, but to deprive its competitors of manpower). The exact details of the classifier components aren't too important if you have a comprehensive way of combining them.

      • And the thing that I want to know is how they evaluate the results. I actually do research in this space right now, and by far the most painful thing is evaluation of results. We have a system that automates most of the work, but there's still a lot of human involvement, and this limits the input dataset size and speed with which we can iterate the improvements.
        • by martin-boundary (547041) on Sunday June 03 2007, @11:25PM (#19377367)
          Good question. I agree with you that the article doesn't say anything valuable in this respect :(

          When you say that your system is limited by human involvement, I presume you mean that implementing new features can have serious impact on the overall design (and therefore on testing procedures)? Feel free to not answer if you can't.

          One thing I found interesting in the article is that Google's system sounds like it scales well. It reminded me of antispam architectures like Brightmail's (if memory serves), which have large numbers of simple heuristics which are chosen by an evolutionary algorithm. The point is that new heuristics can be added trivially without changing the architecture. I think their system used 10,000 when they described it a few years ago at an MIT spam conference. Adjustments were done nightly by monitoring spam honeypots.

          I'd love to see better competition in the search engine space. I hope you succeed at improving your tech.

  • by aldheorte (162967) on Sunday June 03 2007, @02:34PM (#19373403)
    Not sure about this:

    "Google rarely allows outsiders to visit the unit, and it has been cautious about allowing Mr. Singhal to speak with the news media about the magical, mathematical brew inside the millions of black boxes that power its search engine."

    I could see tens of thousands, maybe hundreds of thousands, but millions?
    • Re: (Score:3, Informative)

      This [baselinemag.com] is from a year ago (July 2006):

      Google runs on hundreds of thousands of servers--by one estimate, in excess of 450,000--racked up in thousands of clusters in dozens of data centers around the world.

      If this figure is accurate, a million boxen nowadays doesn't seem out of reach.

    • Re: (Score:3, Insightful)

      by Anonymous Coward
      In Soviet Russia, they shoot idiots who don't realize this joke is dead.
    • by WrongSizeGlass (838941) on Sunday June 03 2007, @10:17AM (#19371413) Homepage

      Google Search is a primitive tool used by fanboys "Googling" for pictures of Natalie Portman.
      Ha! Shows what you know. The only pics I search for are of a tall drink of Texas water named Patricia Vonne and of Cowboy Neal in his homemade Hulk costume. Who knew the Hulk wore a tri-corner hat & rainbow wrestling boots?
    • You could add "site:.com" to the query. That might help.
      • Re: (Score:2, Informative)

        Actually, using -site:.co.uk would yield much better results. Since he will then get everything except .co.uk instead of just .com
    • If the UK sites in particular are the ones you want out of your search results, compare these searches on Google:

      digestives london

      digestives london -inurl:.uk
    • Blogs are read only by bloggers and the press, and present absolutely no interest to normal people (including me).

      Considering that you're reading a blog, I think it's pretty fair that you're only counting web pages that you think suck as blogs... so of course you don't like the results. Amazingly, no one is willing to tag their blog as "shohat will think this sucks, so please don't search me."
      • Re: (Score:2, Insightful)

        Slashdot is as much of a blog as I am an Egyptian gerbil. Slashdot links to stories that generate discussions. Slashdot is NOT about the people that create the posts, but about the people that comment here.