The Math Behind PageRank

Journal written by anaesthetica (596507) and posted by samzenpus on Wed Dec 06, 2006 06:45 PM
from the learn-to-be-number-one dept.
anaesthetica writes "The American Mathematical Society is featuring an article with an in-depth explanation of the type of mathematical operations that power PageRank. Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods. And because the links constituting the web are constantly changing and updating, the relevance of pages needs to be recalculated on a continuous basis."
  • by ambivalentduck (1004092) on Wednesday December 06 2006, @06:48PM (#17139072)
    But 9,000 of those words are slang for parts of the human anatomy.  Go figure.
          • On the other hand: explain Gallagher and Carrot Top. "Apparently" they are funny, because they have "careers". Yet everyone with an actual sense of humor knows they are just waiting to unhinge their jaws and swallow you whole.
  • I have sites with a PR of 6, and I can tell you that they got that way because of inbound links from other sites. In fact, when other sites dropped those links, my PR dropped (to 5, and even to 4). Getting more inbound links brought the PR back.

    Think about those links, too. How often do you use common words in an HREF? I don't think there's a lot of weeding out of common words since the link to a site is usually either its name, or a description containing some important keywords.

    I love seeing these technoscientists think they understand PageRank, but just like TimeCube, they're way, way off.
    • Re: (Score:3, Informative)

      There has been a PageRank paper out there since 2000 or so, so it's not exactly a secret how it works. Basically an initial set of relevant pages is pulled from the database and ranked by doing some computation on a connectivity matrix. The trick is to come up with a good initial set; and unless they managed to implement an all-knowing oracle they probably do it by doing a keyword search. Here's where the article summary makes sense; if most pages have the same keywords, a keyword search is going to come
    • If you're referring to the article, it focuses on the "links" aspect when describing the PageRank algorithm. The summary on here is pretty misleading in that way.

    • Think about those links, too. How often do you use common words in an HREF?

      interestingly, it appears that Adobe Acrobat leads the list of results [google.com] when you search for "here" on Google (you can download it here [adobe.com]).

      and who would have expected this [google.com]

  • Bad summary (Score:5, Interesting)

    by Knights who say 'INT (708612) on Wednesday December 06 2006, @07:06PM (#17139318) Journal
    The article specifically says the PageRank eigenvector is only recalculated once a month, approximately. Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.
    • Even though Google uses some clever numerics to calculate the eigenvectors to a 25 billion by 25 billion matrix by iteration, it still takes several hours to finish.

      Please. I can do that on paper in, like, five minutes.
    • Several hours for 25b x 25b? Jeez, it took Slashdot the better part of a day to update the comment id field type in their database... 16.7m by 1. OSTG, we demand that the servers running Slashdot be upgraded to something that could actually withstand a Slashdotting!
      • Re:Bad summary (Score:5, Insightful)

        by martin-boundary (547041) on Wednesday December 06 2006, @09:19PM (#17140646)
        It's nowhere near like that. A web matrix is very sparse, so if you did a true 25Bx25B matrix power iteration, you'd be multiplying zero by zero a gazillion times. Optimization is about not doing things you don't need to do, and optimizing PageRank is about figuring out clever ways to not do the full multiplication. Moreover, PageRank is calculated in parallel over a computer farm. Overall, you can expect a single iteration to take on the order of an hour, and you can expect around 50-80 iterations before Google gives up and says it's converged. You can also try and reuse the previous "converged" PageRank vector to cut down on the 50-80 iterations after you've crawled new pages.

        If google used a single computer to do all the work, and truly did 80*25B^2 operations, they'd be morons.
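The sparsity and warm-start points above can be illustrated with a minimal sketch. It assumes a toy four-page web stored as adjacency lists, a damping factor of 0.85 (the value from the original PageRank paper, not something stated in this thread), and one common convention for pages with no out-links. The function and variable names are hypothetical, and nothing here is claimed to match Google's actual implementation.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200, start=None):
    """Power iteration over adjacency lists; returns (ranks, iterations used).

    `start` is an optional earlier rank vector covering the same set of pages.
    """
    pages = list(out_links)
    n = len(pages)
    rank = dict(start) if start else {p: 1.0 / n for p in pages}
    for iteration in range(1, max_iter + 1):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        # Only links that exist are visited, so one pass costs O(pages + links)
        # instead of the O(pages^2) of a dense matrix-vector product.
        for p in pages:
            targets = out_links[p]
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            return rank, iteration
    return rank, max_iter


# Toy web: each key links to the pages in its list.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks, first_run = pagerank(web)

# A re-crawl finds one new link; reuse the old vector as the starting point.
web["d"] = ["c", "a"]
warm_ranks, warm_run = pagerank(web, start=ranks)
cold_ranks, cold_run = pagerank(web)
print("iterations, warm start:", warm_run, "cold start:", cold_run)
```

Both runs on the updated web converge to the same vector; the warm start only changes how many passes it takes to get there.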

  • Nouns maybe? (Score:4, Insightful)

    by Bryansix (761547) on Wednesday December 06 2006, @07:07PM (#17139344) Homepage
    It seems like it would be the nouns, pronouns, etc. that Google should be paying attention to. Who cares about all the verbs, adjectives, etc. that just muddy the indexing waters?
    • I believe that a race is on at the moment for semantic searching. Not only nouns, verbs etc, but whether the phrases are subjective or objective. I know a blog search company that is working on this. They wanted to borrow some of my code.
    • Re: (Score:2, Insightful)

      Searching for pill and the pill should yield very different results. Yes nouns are more important, but articles and other words cannot be disregarded.
      • I actually thought about that after I posted. I know all the words are important for indexing. I'm just saying that looking at keywords and placing more importance on those is a part of the mix too. Those keywords are almost always nouns.
  • I read about this some time ago ... I think the paper was entitled "The 10 billion dollar Eigenvector: The math behind google" or something to that effect. Sorry, but I've got a new laptop and cannot find the exact title. It was an excellent introduction for beginner computational scientists to an application of the eigenvector. I forget the American university responsible.
    • Here's the bibtex reference.

      @article{bryan:569,
      author = {Kurt Bryan and Tanya Leise},
      collaboration = {},
      title = {The $25,000,000,000 Eigenvector: The Linear Algebra behind Google},
      publisher = {SIAM},
      year = {2006},
      journal = {SIAM Review},
      volume = {48},
      number = {3},
      pages = {569-581},
      keywords = {linear algebra; PageRank; eigenvector; stochastic matrix},
      url = {http://link.aip.org/link/?SIR/48/569/1},
      doi = {10.1137/050623280}
      }
  • by CrazyJim1 (809850) on Wednesday December 06 2006, @07:10PM (#17139396) Journal
    I skimmed the article and didn't find what I wanted to find. If you make a webpage that you want ranked high, what do you do? Do you make 100 geocities accounts and provide links to your main website, or what? I'm just wondering this out of curiosity, not out of need.
    • At a very basic level a site's page rank is a reflection of how much other sites think it's relevant, and is based on how important the sites are that link to it. Get a link from the BBC, CNN, or somewhere like that and it's worth thousands or millions of links from Geocities sites.
    • That's kinda what I thought at first as well, but looking over the lower two-thirds of the article, I started to get a different impression. They talked about a 'strong web' idea, where if your webpage is disconnected from the 'main' web and set up in a sort of 'secondary web' with just your Geocities accounts, for instance, linking to it, then the actual websites that interconnected within your site matrix would rank a 0 overall.

      Not sure if this is correct or not, just the impression that I got from what
    • by Anonymous Brave Guy (457657) on Wednesday December 06 2006, @07:52PM (#17139894)

      The underlying idea behind page rank is pretty well-exposed at this point, and is described in TFA. Essentially, it's a big set of simultaneous equations: each incoming link to your page gets a score that is roughly the rank of the source page divided by the number of outgoing links on that page, and then the rank of your page is roughly the sum of the scores of all incoming links.
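In symbols, the description above can be sketched as follows. The notation is an assumption introduced here for illustration, not taken from the comment: r(p) is the rank of page p, B(p) is the set of pages linking to p, and L(q) is the number of outgoing links on page q.

```latex
r(p) \;=\; \sum_{q \in B(p)} \frac{r(q)}{L(q)}
\qquad\Longleftrightarrow\qquad
\mathbf{r} = A\,\mathbf{r},
\quad
A_{pq} =
\begin{cases}
  1/L(q) & \text{if } q \text{ links to } p,\\
  0      & \text{otherwise.}
\end{cases}
```

Written this way, the set of simultaneous equations says that the rank vector is an eigenvector of the link matrix A with eigenvalue 1, which is why other comments in this discussion talk about computing an eigenvector.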

      Various fudge factors are introduced along the way. For example, if you break Google's rules about displaying the same content to bots as to humans, you can get slapped right down. More subtly, newly registered domains take a modest hit for a while. More nobody-knows-ly, Google's handling of redirects is unclear: information about exactly what adjustments are made is pretty scarce, and there's a lot of conjecture around. One thing that's pretty certain is that they penalise for duplicate content, which is why some webmasters do apparently unnecessary things like redirecting http://www.theircompany.com/ [theircompany.com] to http://theircompany.com/ [theircompany.com] or vice versa.

      So, if you want to get a page with a high rank yourself, then ideally you would get many established, highly-ranked pages to link to your page and no others. In your example, all those Geocities sites wouldn't help a lot, because (a) they'd have negligible rank themselves, and (b) they'd be penalised for being new and lose some of that negligible rank before they even started. Many times negligible is still negligible, and so would be your target page's rank. OTOH, get a few links from university sites, big news organisations and the like, and your rank will suddenly be way up there. Alternatively, get a grass-roots movement going where a gazillion individuals with small personal sites link to you, and the cumulative effect will kick in.

      • Re: (Score:3, Interesting)

        "if you break Google's rules about displaying the same content to bots as to humans"

        I notice many sites that do that and don't get slapped down - esp subscription sites. And it seems Google doesn't cache those, so it's probably collusion.

        You see the keywords and paragraphs in the search, but click on it you get a login page.

        They should have to pay a special rate or be marked differently from the other search results. It's a waste of time otherwise.
        • by oni (41625) on Wednesday December 06 2006, @08:39PM (#17140342) Homepage
          I notice many sites that do that and don't get slapped down - esp subscription sites.

          I wonder, if I changed my useragent to be whatever the googlebot reports itself to be - would I get by the registration screen on websites like the NYTimes??
          • No, because they check the IP you're coming from as well now - they grew wise to user agent spoofing years ago.

            Google for the "bugmenot" Firefox extension.
            • Googlebot doesn't use the same IP address all the time (several servers running Googlebot I'd imagine), so filtering based on IP addresses would be infeasible (at least according to Google).
          • Re: (Score:3, Informative)

            As pointed out, the Times site isn't fooled, but there are a good many out there that are fooled. Sometimes if you ever do a Google search, one of the results will contain a keyword or two. However, when you click on the link, you'll find yourself redirected to a subscription page. Useragent spoofing can frequently show you the same page that Google indexed.

            If you're a FF user, grab the Useragent Switcher extension [mozilla.org] and add in a UA of "Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)
        • Re: (Score:3, Interesting)

          Here is an email with associated response I received from Google on roughly this topic.

          This is a very general question. I'm creating a website. It is going to be a blogging platform. Obviously, the content of the site(s) is the most important thing. I've already started making the content of my site dynamic in the sense that I tailor it to the requesting agent (via the user-agent header). My intention for doing this is to make sure that the content renders correctly for *any* browser that accesses the sit

      • I now have a nice basic understanding of Google's page ranking system. That's all I was asking for.
      • Re: (Score:2, Insightful)

        Thanks for the informative post. I have one question though. How does it help find the relevant information unless that information just happens to be on a popular page too? What I mean to say is that the idea behind grading/filtering systems like PageRank is to provide the most relevant information about the thing you are trying to search on the net. Now suppose Mr. A is looking for some obscure Indian text written in Sanskrit and Mr. B has (recently or not) put up a website with that text as one of the co
    • If those 100 geocities pages each have a PageRank of 0 (which they would if they aren't linked to from other high-ranking pages), their total contribution to your main page PageRank will be 0.
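The arithmetic behind that claim can be checked with a small sketch. It uses the simplified rule described elsewhere in this discussion (each link passes on the source page's rank divided by its number of outgoing links) with no damping term. The rank values and link counts below are hypothetical, chosen only for illustration.

```python
def incoming_score(sources):
    """Sum rank / out_links over (rank, out_links) pairs for pages linking to a target."""
    return sum(rank / out_links for rank, out_links in sources)


# 100 freshly made pages, each with rank 0 and a single link to the target.
throwaway_pages = [(0.0, 1)] * 100

# One established page (hypothetical rank 0.05) that splits its rank over 50 links.
established_page = [(0.05, 50)]

print(incoming_score(throwaway_pages))
print(incoming_score(established_page))
```

Under the damped formula from the original PageRank paper, every page keeps a small baseline rank, so the throwaway pages would contribute a tiny amount rather than exactly zero; the comparison with a single established link still comes out the same way.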
  • As a self proclaimed SEO expert - I honestly don't believe PageRank counts nearly as much as it did a few years ago! You'll find lots of PR5 sites ahead in the SERPS of PR9 sites!
    • by Trieuvan (789695) on Wednesday December 06 2006, @07:35PM (#17139726) Homepage
      The PageRank that's reported by the toolbar is really old. Google never wants to let you know the real number, or it would be easy to spam ...
      • Re: (Score:2, Funny)

        by Anonymous Coward
        Concentrate on SERPs, not PR, ASAP for SEO on the WWW

        I searched on Google but I cannot find what "on", "not", "for" and "the" mean...
  • I think we can get four or five tomorrow.
  • Great article.

    The character of online content is changing now rapidly. We used to be in an Internet where mostly only the site provider determined the content on the pages they served (/. being a notable, early exception). Now, with the rise of "2.0" systems, user-generated content, and empowerment of the individual - the content being served on many sites is coming into sites from wide groups, and being moderated and curated by those groups.

    So... a thought: as user-submitted and group-moderated content
    • I could not disagree more. Most of the sort of information people search for is not user generated: when did you last do a Google search for which a Slashdot comment was the appropriate answer?

      The only exception that I can think of (from my searches) is forums that have answers to software problems. Google seems to have no problem finding these for me.
      • Sometimes you want to search through your old posts. Not all sites let you do that (slashdot does if you pay up, I think), and often forums are even norobots space.
      • The meme that Google helps us find all the information is a huge marketing Spin.

        Compared to "exactly the information you want, when and how you want it" - Google sucks. It is better than anything else now, but it still is not anywhere close to really solving the information access problem generally.
  • For a different, somewhat more technical, but more succinct discussion, Cleve Moler [of Matlab fame] wrote another view [mathworks.com] of this topic, about 5 years ago.

    The math is the same, of course, but two points of view may provide a greater sense of perspective. So to speak. And Cleve is always worth listening to.

    • Actually, I'm not so sure it's the largest matrix computation. Weather and nuclear bomb simulations are done with matrix algebra, and it wouldn't surprise me to discover that they do some months-long calculations with even larger matrices.
  • I've seen links on google searches that don't exist anymore but were ranked highly when they DID exist and still exist in the top 10 of the query. What happens to those? Do they stay at their ranking till they get overtaken by other more popular pages on the same search? Get their ranking slowly reduced because they don't exist?
    • Pagerank (Score:5, Funny)

      by Skythe (921438) on Wednesday December 06 2006, @07:37PM (#17139752)
      Because about 95% of the text on the 25 billion pages indexed by Google consists of the same 10,000 words, determining relevance requires an extremely sophisticated set of methods.

      They use a set of nested if-else statements
      *ducks*
    • Re:Pagerank is cool (Score:5, Interesting)

      by silentounce (1004459) on Wednesday December 06 2006, @08:02PM (#17139982) Homepage
      Interestingly enough, google thinks so, too. [google.com]

      Of course, yahoo has its own opinion. [yahoo.com]
       
      Although, altavista seems to almost agree. [altavista.com] Check the second non-advertised result.
       
      I do find this [google.com] amusing though. Third place, how humble.
       
      I didn't expect such interesting results. The site with the search term in its url was tops for av and yahoo, but not google. Yahoo ranked the wiki entry above google, but av reversed that decision, google of course thought itself was more important than the wiki. Google's own reference site was number one in its own search and near the top in the other two, but pagerank.net wasn't even in the top 10 for google's search. I'm not sure what conclusions can be drawn from all that, but it is definitely food for thought.
      • I do find this amusing though. Third place, how humble.

        What I found interesting about that link was the description listed for google's entry:

        Google - 11:54pm
        Enables users to search the Web, Usenet, and images. Features include PageRank, caching and translation of results, and an option to find similar pages.
        www.google.com/ - 5k - Dec 5, 2006 - Cached - Similar pages

        Where did they get that text from? It's not anywhere to be found in the source [tinyurl.com]. Did they cheat? Or are they just tricky?

        • Where did they get that text from? It's not anywhere to be found in the source. Did they cheat? Or are they just tricky?

          They got it from the Google category [dmoz.org] at the Open Directory Project at dmoz.org [dmoz.org], mirrored at directory.google.com [google.com]. Google is a user of dmoz.org data but has completely de-emphasized that as of late.

          It's actually against the dmoz license agreement to use their data without a link back to the source, but nobody seems to care.

        • Whenever possible, Google uses the DMOZ description [dmoz.org] for the snippet shown in the results.
    • Why does that make PageRank broken? That's not the problem it tries to solve. Google might be broken for slavishly adhering to PageRank, but that's a different matter entirely...