Huge Google Search Document Leak Reveals Inner Workings of Ranking Algorithm (searchengineland.com) 64

Danny Goodwin reports via Search Engine Land: A trove of leaked Google documents has given us an unprecedented look inside Google Search and revealed some of the most important elements Google uses to rank content. Thousands of documents, which appear to come from Google's internal Content API Warehouse, were released March 13 on GitHub by an automated bot called yoshi-code-bot. These documents were shared with Rand Fishkin, SparkToro co-founder, earlier this month.

What's inside. Here's what we know about the internal documents, thanks to Fishkin and [Michael King, iPullRank CEO]:

Current: The documentation indicates this information is accurate as of March.
Ranking features: 2,596 modules are represented in the API documentation with 14,014 attributes.
Weighting: The documents did not specify how any of the ranking features are weighted -- just that they exist.
Twiddlers: These are re-ranking functions that "can adjust the information retrieval score of a document or change the ranking of a document," according to King. (A rough sketch of the idea follows this list.)
Demotions: Content can be demoted for a variety of reasons, such as: a link doesn't match the target site; SERP signals indicate user dissatisfaction; product reviews; location; exact-match domains; and/or porn.
Change history: Google apparently keeps a copy of every version of every page it has ever indexed. Meaning, Google can "remember" every change ever made to a page. However, Google only uses the last 20 changes of a URL when analyzing links.
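
To make the twiddler idea above concrete, here is a minimal, purely hypothetical Python sketch of what a re-ranking pass of that general shape could look like. Every name, signal, and multiplier below is invented for illustration; the leaked documents describe attributes, not implementation details.

    # Hypothetical "twiddler"-style re-ranking pass. All names and numbers
    # here are illustrative assumptions, not taken from the leaked documents.
    from dataclasses import dataclass

    @dataclass
    class ScoredDoc:
        url: str
        ir_score: float                      # initial information-retrieval score
        exact_match_domain: bool = False     # one of the demotion signals listed above
        serp_dissatisfaction: float = 0.0    # 0..1, higher means users bounced back

    def demotion_twiddler(docs: list[ScoredDoc]) -> list[ScoredDoc]:
        """Adjust retrieval scores, then re-rank by the adjusted score."""
        for doc in docs:
            if doc.exact_match_domain:
                doc.ir_score *= 0.8                          # illustrative demotion
            doc.ir_score *= 1.0 - 0.5 * doc.serp_dissatisfaction
        return sorted(docs, key=lambda d: d.ir_score, reverse=True)

    # Example: the spammy exact-match domain drops below the ordinary page.
    ranked = demotion_twiddler([
        ScoredDoc("https://example.org/guide", 2.1),
        ScoredDoc("https://best-cheap-widgets.example", 2.3,
                  exact_match_domain=True, serp_dissatisfaction=0.6),
    ])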

Other interesting findings. According to Google's internal documents:

Freshness matters -- Google looks at dates in the byline (bylineDate), URL (syntacticDate) and on-page content (semanticDate).
To determine whether a document is or isn't a core topic of the website, Google vectorizes pages and sites, then compares the page embeddings (siteRadius) to the site embeddings (siteFocusScore). (A rough sketch of this comparison follows the list.)
Google stores domain registration information (RegistrationInfo).
Page titles still matter. Google has a feature called titlematchScore that is believed to measure how well a page title matches a query.
Google measures the average weighted font size of terms in documents (avgTermWeight) and anchor text.
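
As a rough illustration of the page-versus-site embedding comparison mentioned above: the leak names siteFocusScore and siteRadius but does not describe the math, so the Python sketch below (a mean "site" vector plus cosine similarity) is only one plausible interpretation, not Google's actual method.

    # Hypothetical reading of siteFocusScore / siteRadius. The leak only names
    # these attributes; the cosine-similarity math here is an assumption.
    import numpy as np

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def site_topicality(page_embeddings: list[np.ndarray]):
        pages = np.stack(page_embeddings)
        site_embedding = pages.mean(axis=0)        # crude whole-site vector
        sims = [cosine(p, site_embedding) for p in pages]
        site_focus_score = float(np.mean(sims))    # how tightly the site sticks to one topic
        site_radius = [1.0 - s for s in sims]      # how far each page drifts from the site
        return site_focus_score, site_radius

    # Pages about one topic score near 1.0; an off-topic page gets a larger radius.
    focus, radius = site_topicality([np.array([0.9, 0.1]),
                                     np.array([0.8, 0.2]),
                                     np.array([0.1, 0.9])])
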
What does it all mean? According to King: "[Y]ou need to drive more successful clicks using a broader set of queries and earn more link diversity if you want to continue to rank. Conceptually, it makes sense because a very strong piece of content will do that. A focus on driving more qualified traffic to a better user experience will send signals to Google that your page deserves to rank." [...] Fishkin added: "If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: 'Build a notable, popular, well-recognized brand in your space, outside of Google search.'"

  • Bye bye (Score:5, Insightful)

    by phantomfive ( 622387 ) on Tuesday May 28, 2024 @09:50PM (#64506789) Journal
    The only ones that will benefit from this leak are SEO spammers. Which means for the rest of us, search results will get worse.
    • Re:Bye bye (Score:5, Insightful)

      by ArchieBunker ( 132337 ) on Tuesday May 28, 2024 @10:22PM (#64506831)

      Getting worse is a possibility?

      • Are you feeling lucky on which page number of results to hit to skip the ads?

        • by AmiMoJo ( 196126 )

          Doesn't everyone use uBlock these days? One of the reasons I stick with Google, despite trying rivals like DuckDuckGo and even Bing now and then (but I repeat myself), is that Google is better at filtering out the spammy ad sites.

      • Remember Alta Vista?
    • by rsilvergun ( 571051 ) on Tuesday May 28, 2024 @11:24PM (#64506867)
      I'm literally using Reddit to find useful information because Google let SEO spam take over its results.
      • by paul_engr ( 6280294 ) on Tuesday May 28, 2024 @11:25PM (#64506873)
        Bro just click this link about my side hustle
      • by TigerPlish ( 174064 ) on Wednesday May 29, 2024 @12:45AM (#64506949)

        For once, I agree with you.

        It's not just Google; Bing and DuckDuckGo are a similar SEO cesspool.

        I too find myself turning to Reddit.

        Perhaps it is search itself that has become obsolete due to the concentration of all these different interests into a very small handful of sites. That leaves commerce and digging for "how to" as the only overt applications for search.

        Oh, and ads, of course. But with a good ad blocker you don't get to see most of that.

        How convenient that a new monetization vector has arrived: AI. And, let's not forget that the real purpose behind all of this is selling your data.

        Yep, search is dead.

          What's fucked is that Reddit has been at least 50% bots in all popular subreddits for years now, and shills have been powermods for as many as eight years. I wouldn't be surprised if, with the release of LLMs in the past year, 90% of users on popular subs are bots. And now Reddit has signed a deal with OpenAI to sell user data, so it's literally AI training AI to spit out AI, and a search company will use AI to browse that AI to return AI-pruned results.

        • For once, I agree with you.

          It's not just Google; Bing and DuckDuckGo are a similar SEO cesspool.

          I too find myself turning to Reddit.

          Perhaps it is search itself that has become obsolete due to the concentration of all these different interests into a very small handful of sites. That leaves commerce and digging for "how to" as the only overt applications for search.

          Oh, and ads, of course. But with a good ad blocker you don't get to see most of that.

          How convenient that a new monetization vector has arrived: AI. And, let's not forget that the real purpose behind all of this is selling your data.

          Yep, search is dead.

          The recent "AI" fad is just Enshittification 2.0. Its implementation goals appear to be to one-up the 15-year distortion of search results by also distorting the actual UI itself. It's deeply insidious because it aims to finish destroying any human being's ability to find ways to construct their own effective path through the search tools and results. You don't even get to interact with a search interface at all; you "talk" to a "search assistant" that then creates an on-the-fly interaction based on the sam

          Perhaps it is search itself that has become obsolete due to the concentration of all these different interests into a very small handful of sites. That leaves commerce and digging for "how to" as the only overt applications for search.

          One reason for this is that Google has exerted its influence to kill any sort of vertical search except in social media sites, which it didn't really see as search competitors.

          Vertical-specific search (e.g. search for an object using physical dimensions or specs) would not only h

      • Once SEO scammers find they can use Reddit to influence Google search results, then Reddit will be filled with spam, too.
      • Google mostly links to Reddit posts now anyway.

      • by flink ( 18449 )

        Yup, half my Google searches include site:reddit.com at this point. Especially if I'm searching for non-StackOverflow technical info or HOWTOs. I'm sure there are still some awesome enthusiast forums and personal pages/blogs out there, but Google just doesn't surface them anymore.

    • by gweihir ( 88907 )

      I stopped using Google several years ago. I do not think it can get much worse and still be somewhat useful.

  • Cached (Score:5, Interesting)

    by Anonymous Coward on Tuesday May 28, 2024 @10:15PM (#64506825)

    Change history: Google apparently keeps a copy of every version of every page it has ever indexed. Meaning, Google can "remember" every change ever made to a page. However, Google only uses the last 20 changes of a URL when analyzing links.

    Yes, and they used to surface the latest version via the "Cached" button in the information attached to every search result. This was extremely useful for dead links to technical documentation that never got crawled by the Wayback Machine. Bring back the "Cached" button!

    • Re:Cached (Score:4, Funny)

      by Dru Nemeton ( 4964417 ) on Wednesday May 29, 2024 @11:29AM (#64507797)
      That was, by far, the most interesting tidbit released. I mean, if they have a copy of every page ever indexed, why the fuck do they not have an "Internet Archive" available? This would be such a boon to humanity, I just don't...oh, yeah...I get it now.
  • Nostalgia. (Score:5, Interesting)

    by msauve ( 701917 ) on Tuesday May 28, 2024 @10:39PM (#64506847)
    I miss Altavista.
    • by Anonymous Coward

      I miss when google was actually good

    • I miss The Northern Light from when it was still free.
      • Re: (Score:2, Funny)

        by Anonymous Coward

        I miss the rains down in Africa.

    • In some ways I miss the days of Altavista, but that search engine on today's Web would be like bringing a sword to a tank battle. It was defenseless even against simple keyword stuffing. That's how earnest the Web was.
  • "If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: 'Build a notable, popular, well-recognized brand in your space, outside of Google search.'"

    Genius.

    So build a notable, popular, well-recognized brand in your space to rank on Google. This is a font of forbidden lore.

    • That's why this leak isn't that big of a deal - spammers might get a little more detail, but they work day in and day out to reverse engineer how it works anyway. So there are no simple ways to game it any better than what's already being done.
  • Change history (Score:3, Interesting)

    by Ryanrule ( 1657199 ) on Tuesday May 28, 2024 @11:26PM (#64506875)

    Change history: Google apparently keeps a copy of every version of every page it has ever indexed. Meaning, Google can "remember" every change ever made to a page. However, Google only uses the last 20 changes of a URL when analyzing links.

    So for legal matters, Google has a record of everything. Interesting. Bad for some people.

  • Why has google removed access to the cache?
    • by silentbozo ( 542534 ) on Wednesday May 29, 2024 @12:34AM (#64506937) Journal

      Because if you can't find what you're looking for, you'll burn more time on their search page trying to track down other copies or sources, which means... more ad exposure time!

      This reminds me of when Intuit started cramming more and more shit into Quicken to advertise other services. Milking of captive audiences...

      This, of course, only works if the alternative is much worse. At a certain point, people will switch: even if the alternative isn't as good, at least the experience is not as shitty.

      Honestly, these are the only reasons I can think of to make a tool less useful: either you want to sell a "premium" version of the same tool for more money, or you want to cut the costs necessary to maintain the same level of quality while keeping the price the same to raise profit. Or both.

      • There is another reason which would be lawyers and regulations that mandate a change... but I kind of doubt that was the reason here.

          • Also, they still have to keep the cache; now they can persist it to slower storage for internal use only. No need to make it fast and highly available.
            • > Also, they still have to keep the cache; now they can persist it to slower storage for internal use only. No need to make it fast and highly available.

            And it'll also make it difficult to verify the historical record.
  • by Tony Isaac ( 1301187 ) on Tuesday May 28, 2024 @11:48PM (#64506895) Homepage

    Just ask anybody who has tried to place Google Ads on their website: the rules, and the resulting traffic, fluctuate wildly. A lot of this fluctuation results from Google's constant battle to thwart SEOs. So whatever this leak revealed won't be in effect for long.

  • Return what we think they want and have lots of results for, not what they've typed.
  • Crappy guidelines (Score:4, Interesting)

    by bradley13 ( 1118935 ) on Wednesday May 29, 2024 @03:00AM (#64507009) Homepage

    Lots of these guidelines are absolutely counterproductive to what many (most?) of us would like to see on the Internet.

    I used to be admin for a couple of small, independent websites. Among other things, they had some decent, independent reference material online. Freshness? The material rarely changed, because it didn't need to change. Embeddings? No, we didn't artificially make lots of internal links to the reference pages. Who has time for that?

    These sites were semi-important in their niche market. Back in the 2000s, the sites ranked in place 1 or 2 on Google. Over the course of the 2010s, rankings started to slip. First came commercial interests: shops that paid for SEO. Ok, that's not so bad, our sites were still always on the first page. As time went on, more and more crap sites ranked better. Sites containing semi-literate content, lots of ads, and links to other crap sites. Honestly, sites like that shouldn't be hard to detect, but I suppose they drive ad revenue, so...

    • Re:Crappy guidelines (Score:4, Informative)

      by AmiMoJo ( 196126 ) on Wednesday May 29, 2024 @04:41AM (#64507119) Homepage Journal

      Freshness only applies to information that needs to be fresh. If someone searches for their sports team, they probably want the latest score, for example. If the user specifies old data with, say, a date, or if the information they are asking for doesn't need to be fresh, then the freshness weight is largely discounted.

      Even for relatively static info it makes some sense, because it's not as static as you might think. Recently I was checking the pinout for PlayStation controllers; when I looked many years ago, most of the info said they used 5V. Some more recent sites say they are mostly 5V tolerant, but the actual PlayStation uses 3.6V. The older info is still up, so it makes sense for Google to put the newer info first: even though it's a ~30-year-old console, the data isn't as static as you might assume. FWIW, in case anyone finds this, I measured 3.6V on VCC and 3.3V signalling, suggesting that the standard is 3.3V signalling and the 3.6V is expected to be regulated down to that.
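
      A toy Python sketch of the query-dependent weighting described above might look like the following; everything in it is invented for illustration and none of it comes from the leaked documents.

          # Toy sketch of discounting freshness when the query does not need it.
          # Function name, decay rate, and floor are illustrative assumptions.
          from datetime import date

          def freshness_multiplier(query_needs_fresh: bool, page_date: date,
                                   today: date, user_pinned_a_date: bool = False) -> float:
              """Multiplier applied to a page's base score."""
              if user_pinned_a_date or not query_needs_fresh:
                  return 1.0                               # freshness largely discounted
              age_days = (today - page_date).days
              return max(0.5, 1.0 - age_days / 3650.0)     # decay over ~10 years, floored

          # A "latest score" query rewards recent pages; a pinout lookup does not care.
          freshness_multiplier(True, date(2024, 5, 1), date(2024, 5, 29))   # ~0.99
          freshness_multiplier(False, date(1998, 1, 1), date(2024, 5, 29))  # 1.0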

    • It's not a list of guidelines, it's a list of features that pages are scored on. The scores will be fed into a neural network to get a final combined score for a page, and the network will take into account the search query, page content and any other features when deciding how heavily (possibly negatively) to weight each feature. With the right training data, it would be able to tell that pages with certain content only make for good search results if their content is fresh, while other types of content mi
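
      A minimal sketch of that "per-feature scores in, combined score out" idea, with weights that differ by query type; the feature names and weights below are made up for illustration, and a real ranker would be a learned, non-linear model rather than this toy linear one.

          # Invented feature names and weights; only the shape of the computation
          # (feature scores combined by query-dependent weights) is the point.
          import numpy as np

          FEATURES = ["title_match", "freshness", "link_diversity", "click_satisfaction"]

          # In a real system these weights would be learned from training data;
          # here they are made up, one set per hypothetical query type.
          WEIGHTS = {
              "news":      np.array([0.2, 0.5, 0.1, 0.2]),
              "evergreen": np.array([0.4, 0.0, 0.3, 0.3]),
          }

          def combined_score(feature_values: dict[str, float], query_type: str) -> float:
              x = np.array([feature_values[f] for f in FEATURES])
              return float(WEIGHTS[query_type] @ x)

          combined_score({"title_match": 0.9, "freshness": 0.1,
                          "link_diversity": 0.7, "click_satisfaction": 0.8}, "evergreen")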

  • They have so many metrics involved in their ranking, and they've fallen victim to Goodhart's Law (relevant XKCD [xkcd.com]).
    This is natural, of course. Google has slowly trashed the end-user experience to presumably support their core business (advertising), so it's only fair that sites target the rankings with attempts at SEO that Google then has to spend inordinate amounts of effort to try to sift through.
  • by cascadingstylesheet ( 140919 ) on Wednesday May 29, 2024 @07:04AM (#64507247) Journal

    Fishkin added: "If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: 'Build a notable, popular, well-recognized brand in your space, outside of Google search.'"

    In other words, "have content worth finding".

    Same as it ever was ...

    • by mjwx ( 966435 )

      Fishkin added: "If there was one universal piece of advice I had for marketers seeking to broadly improve their organic search rankings and traffic, it would be: 'Build a notable, popular, well-recognized brand in your space, outside of Google search.'"

      In other words, "have content worth finding".

      Same as it ever was ...

      But that's difficult and often not profitable. It's much easier to trick people into coming to your site and then annoy them.

      Either that or get laws enacted that force Google to pay you for the privilege of indexing your content and bringing users to your site, A.K.A. Plan Murdoch.

  • by DigitalSorceress ( 156609 ) on Wednesday May 29, 2024 @07:48AM (#64507343)

    I use SearXNG, which is great at filtering out a lot of the google ad stuff and other ads from multiple engines, but it's still just a meta search engine.

    I know this is probably a "pipe dream," but it seems to me that web search is public infrastructure, and as such, I really wish there were a "wiki of search": some kind of FOSS-based search engine that could be curated by a community of volunteers with the overarching goal of trying to be fair and unbiased.

    Yes, I know that no matter what search engine you make, if it gets any traction, SEOs will find ways to try and game it - and when the algorithm is open source, I suppose that may well happen. But if a big enough community of actual humans were even 1/4 as vigilant about weeding out cheats and spam sites from the index as the current wiki editors are about keeping folks from defacing/abusing Wikipedia, it would likely be 100x better than any of the commercially driven pay-to-play engines we currently have.

    Would it be perfect? No.
    Would it likely end up with cabals and in-groups and drama? Yeah.
    But would it be oh so satisfying to have a non-commercial, non-ad-driven search engine that people could help make better by helping to jury results? I dunno - maybe certain groups would find ways to brigade results they disagreed with using false reports/takedowns, and it would just turn into endless edit wars... I'm probably way too optimistic, but again, I see that for all its faults, Wikipedia has managed to remain fairly useful and not completely collapse under the weight of science denial and active disinformation from various state-level actors ... sort of.

    Yeah, it occurs to me what a massive amount of work and effort this would be, and I don't know that you could will a community and culture of editing into existence - Wikipedia kind of grew organically over time, and again I acknowledge it's far from perfect. Still, it strikes me that the fundamental truth is that web search is vital public infrastructure, and the desire to maximize profit for the advertising industry that runs it is enshittifying it all so badly that we need to change something. /frustrated

    • ah yes, I look forward to my search engine begging for money on every page with a half-page banner before the search results... It'd be like NPR running my fridge.

      The problem with crowdfunding search is that, at this point, the engineering work is not the scale bottleneck that gives Google a moat - there's a fair argument that the state of the art for search peaked at least ten years ago and has since *degraded*. But any competitors would struggle without access to massive compute capacity and storage to constantly re

      • Re: (Score:3, Insightful)

        by flink ( 18449 )

        ah yes, I look forward to my search engine begging for money on every page with a half-page banner before the search results... It'd be like NPR running my fridge.

        You mean like Wikipedia, which most people seem pretty happy with? They have a fundraising drive once a year. You just dismiss it once and it goes away.

    • Re: (Score:3, Interesting)

      ... but would it be oh so satisfying to have a non-commercial, non-ad-driven search engine that people could help make better by helping to jury results? ... it strikes me that the fundamental truth is that web search is vital public infrastructure, and the desire to maximize profit for the advertising industry that runs it is enshittifying it all so badly that we need to change something. /frustrated

      I agree 100%, and if such a thing existed I would get involved and spend a substantial amount of time volunteering for the 'jury' and vetting search results.

      One important thing to realize is that your description of search enshittification applies to the commercial world at large. Consider a Microsoft operating system, or a fast food meal, or much of popular music, or an HP printer, home automation gear, IoT appliances, and on and on and on. All are being made or delivered with the smallest investment possi

    • by flink ( 18449 )

      I know this is probably a "pipe dream," but it seems to me that web search is public infrastructure, and as such, I really wish there were a "wiki of search": some kind of FOSS-based search engine that could be curated by a community of volunteers with the overarching goal of trying to be fair and unbiased.

      I'm surprised the Internet Archive hasn't taken a stab at a search engine. They have a huge corpus to train it on and obviously have experience doing things at a massive scale.

  • Does Amazon appear at the top of every Google product search because they have a “broader set of queries and earn more link diversity”? No, they just pay more to deceptively game the system.
