The U.S. Department of Justice released several new trial exhibits as part of the ongoing remedies hearing. These exhibits include interviews with two key Google engineers – Pandu Nayak and HJ Kim – which offer insights into Google’s ranking signals and systems, search features, and the future of Google.
Nayak defined some key Google terminology and explained Google’s search structure:
Other signals discussed by the engineers included:
Google also uses Twiddlers to re-rank results (which we learned about from last year’s leak of Google’s internal Content API Warehouse documentation). An internal “debugging interface” lets engineers see query expansion/decomposition and the individual signal scores that determine final search result ranking.
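For illustration, here is a minimal sketch of how a Twiddler-style re-ranking pass could work, assuming each twiddler is a function that adjusts the scores of an already-ranked result list before a final re-sort. The twiddler names, boost factors, and result fields below are hypothetical, not from the exhibits or the leak:

```python
def boost_fresh_results(results):
    """Hypothetical twiddler: boost results flagged as fresh."""
    for r in results:
        if r.get("is_fresh"):
            r["score"] *= 1.2
    return results

def demote_duplicate_titles(results):
    """Hypothetical twiddler: demote results repeating an earlier title."""
    seen = set()
    for r in results:
        if r["title"] in seen:
            r["score"] *= 0.5
        seen.add(r["title"])
    return results

def apply_twiddlers(results, twiddlers):
    """Run each twiddler over the list, then re-sort by adjusted score."""
    for twiddler in twiddlers:
        results = twiddler(results)
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

The key idea is that twiddlers run after the core ranking has produced a scored list, nudging individual results up or down rather than computing scores from scratch.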
Google discontinues poorly performing or outdated signals.
Ex-Googler Eric Lehman was asked whether Navboost trains on 13 months of user data, and testified:
Google’s search evolved from the traditional “Okapi BM25” ranking function to incorporate machine learning, starting with RankBrain (announced in 2015), followed later by DeepRank and RankEmbed.
Google found that BERT-based DeepRank machine learning signals could be “decomposed into signals that resembled the traditional signals” and that combining both types improved results. This essentially created a hybrid approach of traditional information retrieval and machine learning.
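The Okapi BM25 function Google started from is a public, well-documented formula from classic information retrieval, so it can be sketched directly. Nothing here is Google-specific; `k1` and `b` are the standard free parameters:

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Classic Okapi BM25: for each query term, multiply an inverse
    document frequency weight by a saturated, length-normalized
    term frequency, and sum over the query terms."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)          # term frequency in this doc
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)         # docs containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score
```

BM25 scores depend only on term statistics, which is exactly why the hybrid approach described above (traditional signals plus learned signals like DeepRank) added value: the learned models capture semantics that term counting cannot.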
Google “avoids simply ‘predicting clicks,’” because clicks are easily manipulated and don’t reliably measure user experience.
A key signal, RankEmbed, is a “dual encoder model” that embeds queries and documents into an “embedding space.” This space considers semantic properties and other signals. Retrieval and ranking are based on a “dot product” or “distance measure in the embedding space.”
RankEmbed is “extremely fast” and excels at common queries, but struggles with less frequent or specific long-tail queries. Google trained it on one month of search data.
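A dual encoder scored by dot product can be shown with a toy sketch. The bag-of-words `embed` below is a deterministic stand-in for the trained neural encoders a system like RankEmbed actually uses, and the fixed vocabulary is purely illustrative; only the shape (encode both sides, score by dot product, rank by score) reflects the description in the exhibits:

```python
import math

VOCAB = ["pizza", "best", "town", "quantum", "computing", "search"]

def embed(text):
    """Toy encoder: unit-normalized bag-of-words vector over a fixed
    vocabulary. A real dual encoder would be a trained network
    producing dense semantic embeddings."""
    tokens = text.lower().split()
    vec = [float(tokens.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank(query, docs):
    """Embed the query once, embed each doc, and rank by dot product,
    i.e. by proximity in the shared embedding space."""
    q = embed(query)
    scored = [(dot(q, embed(d)), d) for d in docs]
    return sorted(scored, reverse=True)
```

Because documents can be embedded ahead of time and compared with a single dot product at query time, this style of model is cheap to serve, which matches the “extremely fast” characterization above.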
The documents detail how Google determines a document’s relevance to a query, or “topicality.” Key components include the ABC signals:
These combine into T* (Topicality), which Google uses to judge a document’s relevance to query terms.
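The exhibits don’t disclose how the component signals are actually combined into T*, but a weighted combination is one plausible shape. Everything in this sketch, including the weights and normalization, is a hypothetical illustration rather than Google’s formula:

```python
def topicality(signal_scores, weights):
    """Hypothetical combiner: weighted average of per-signal scores
    (each assumed to be in [0, 1]) into a single topicality score."""
    assert signal_scores.keys() == weights.keys()
    total_weight = sum(weights.values())
    return sum(signal_scores[name] * weights[name]
               for name in weights) / total_weight
```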
Beyond topicality, “Q*” (page quality), or “trustworthiness,” is “incredibly important,” especially in addressing “content farms.” HJ Kim notes, “Nowadays, people still complain about the quality and AI makes it worse.” PageRank feeds into the Quality score.
Other signals include:
Although machine learning is growing in importance, many Google signals are still “hand-crafted” by engineers. They analyze data, apply functions like sigmoids, and set thresholds to fine-tune signals.
“In the extreme,” this means manually selecting data mid-points. For most signals, Google uses regression analysis on webpage content, user clicks, and human rater labels.
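A hand-crafted signal of the shape described above can be sketched as a sigmoid curve plus a cutoff. The midpoint, steepness, and threshold values below are invented for illustration; the exhibits describe this style of tuning, not the actual functions or numbers:

```python
import math

def sigmoid(x, midpoint, steepness):
    """Logistic curve: maps a raw value into (0, 1), centered at
    `midpoint`, with slope controlled by `steepness`."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

def handcrafted_signal(raw_value, midpoint=50.0, steepness=0.1,
                       threshold=0.2):
    """Hypothetical hand-crafted signal: squash a raw measurement with
    a sigmoid, then zero out anything below a chosen threshold."""
    squashed = sigmoid(raw_value, midpoint, steepness)
    return squashed if squashed >= threshold else 0.0
```

Choosing the midpoint here corresponds to the “manually selecting data mid-points” engineers describe: it fixes where the curve transitions from low to high output.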
The hand-crafted signals are important for transparency and easy troubleshooting. As Kim explained:
Complex machine learning systems are harder to diagnose and repair, Kim explained.
This means Google can respond to challenges and modify signals, such as adjusting them for “various media/public attention challenges.”
However, engineers note that “finding the correct edges for these adjustments is difficult” and these adjustments “would be easy to reverse engineer and copy from looking at the data.”
Google’s search index is the crawled content: titles and bodies. Separate indexes exist for content like Twitter feeds and Macy’s data. Query-based signals are generally calculated at query time, not stored in the search index, though some may be for convenience.
“User-side data,” to Google search engineers, means user interaction data, not user-generated content like links. How heavily a signal draws on user-side data varies from signal to signal.
Google’s search features (e.g., knowledge panels) each have their own ranking algorithm. “Tangram” (formerly Tetris) aimed to apply a unified search principle to all these features.
The Knowledge Graph’s use extends beyond SERP panels to enhance traditional search. The documents also cite the “self-help suicide box,” highlighting the critical importance of accurate configuration and the extensive work behind determining the right “curves” and “thresholds.”
Google’s development, the documents emphasize, is driven by user needs. Google identifies and debugs issues, and incorporates new information to improve ranking. Examples include:
Google is “re-thinking their search stack from the ground-up,” with LLMs taking a bigger role. LLMs can enhance “query interpretation” and “summarized presentation of results.”
In a separate exhibit, we got a look at Google’s “combined search infrastructure” (although many parts of it were redacted):
Google is exploring how LLMs can reimagine ranking, retrieval, and SERP display. A key consideration is the computational cost of using LLMs.
While early machine learning models required large amounts of training data, Google now uses “less and less,” sometimes only 90 or 60 days’ worth. Google’s rule: use whatever data best serves users.
Dig deeper. This is not the first time we’ve gotten an inside look at how Google Search ranking works, thanks to the DOJ trial. See more in these articles:
The DOJ trial exhibits. U.S. and Plaintiff States v. Google LLC [2020] – Remedies Hearing Exhibits: