Leaked In-House Google Documents Spark SEO Frenzy

A document of more than 2,500 pages that appears to describe how Google ranks search results has surfaced online.

The leaked documentation describes a version of Google's Content Warehouse API and provides a glimpse of Google Search’s inner workings.

The material appears to have been inadvertently published to a publicly accessible, Google-owned repository on GitHub around 13th March 2024 by the web giant's own automated tooling. Google updated the repository on 7th May 2024 in an attempt to undo the leak.

What do these leaked Google documents contain?

These documents do not contain specific algorithmic code. Instead, they describe how to use Google's Content Warehouse API, which is likely intended for internal use only: the leaked documentation includes numerous references to internal systems and projects.

The files are noteworthy for what they reveal about the things Google considers important when ranking web pages for relevancy, a matter of enduring interest to anyone involved in the SEO business and/or anyone operating a website and hoping Google will help it to win traffic.

Among the 2,500+ pages of documentation, there are details on more than 14,000 attributes accessible through or associated with the API. However, the documents do not say whether all of these signals are actually used or how important each one is, which makes it impossible to calculate the weight Google applies to any attribute in its search result ranking algorithm.
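To illustrate why a list of attributes without weights says little about rankings, here is a minimal sketch. It assumes, purely for illustration, that a ranking is a simple weighted sum of attribute values. Nothing in the leak confirms that Google scores pages this way. Only "siteAuthority" is an attribute name taken from the leaked documents. "freshness" and "titleMatch" are hypothetical stand-ins, and every value and weight is invented.

```python
# Illustrative only: all attribute values and both weightings are invented.
# "siteAuthority" is named in the leak; "freshness" and "titleMatch" are
# hypothetical stand-ins, not real attribute names.
pages = {
    "page_a": {"siteAuthority": 0.9, "freshness": 0.2, "titleMatch": 0.5},
    "page_b": {"siteAuthority": 0.4, "freshness": 0.9, "titleMatch": 0.6},
}


def rank(pages, weights):
    """Order page names by a simple weighted sum, highest score first."""
    def score(attrs):
        return sum(weights.get(name, 0.0) * value for name, value in attrs.items())
    return sorted(pages, key=lambda name: score(pages[name]), reverse=True)


# Two made-up weightings over the same three attributes.
authority_heavy = {"siteAuthority": 0.8, "freshness": 0.1, "titleMatch": 0.1}
freshness_heavy = {"siteAuthority": 0.1, "freshness": 0.8, "titleMatch": 0.1}

print(rank(pages, authority_heavy))
print(rank(pages, freshness_heavy))
```

The same two pages swap places depending on which weighting is applied, so knowing that an attribute exists does not reveal how much it matters, or whether it is used at all.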

But many SEO consultants believe the documents contain noteworthy details because they differ from public statements made by Google representatives.

Why is this leaked documentation a big deal?

The leaked documentation appears to contradict public statements made by Google over the years, in particular the company’s repeated denials that click-centric user signals are employed, that subdomains are considered separately in rankings, that a sandbox exists for newer websites, and that a domain’s age is collected or considered, among others.

Google search advocate John Mueller once said "we don’t have anything like a website authority score", referring to a measure of whether Google considers a site authoritative and therefore worthy of higher rankings in search results. However, the leaked documents contain details of a "siteAuthority" attribute, which would suggest that Google does calculate a website's authority score as part of its Compressed Quality Signals.

Clicks also appear to be used in determining how a web page ranks, with the documents distinguishing between different types of clicks (good, bad, long, etc.).

One of the biggest revelations is that Google appears to use data collected by its Chrome web browser, which could include the length of time spent on a page and a user's general navigation of that page or website.

Additionally, the documents indicate that Google considers other factors like content freshness, authorship, whether a page is related to a site's central focus, alignment between page title and content, and the average weighted font size.

Where can I get a copy of the leaked documentation?

See, 

What did Google have to say following the leak?

Google has made a number of comments, notably:

  • "Be aware that the accidentally revealed files may be missing vital context"
  • "We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information"
  • "We've shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation"