What Makes Google Search Work?

Zain Raza
9 min read · Jul 12, 2020


In this blog post, I want to share a model for understanding Google’s PageRank algorithm, based upon the concepts of Graph Theory.

PageRank is the algorithm Google uses to sift through the absurd amounts of web pages out there, so that you and I see only the most important and useful information in search results.

Credits to Danny Sullivan, on Search Engine Land for the above image (link to original site). I do not claim ownership of this image, it is simply used here for educational purposes, as is allowed under Fair Use.

How Graph Theory Relates to PageRank

If there are two things that the Internet does really well, they are:

  1. Storing Information, and
  2. Connecting people with that information

As it turns out, graphs are a natural data structure for modeling the Web. Why? Because graphs are also great at:

  • storing data in objects called vertices,
  • and connecting vertices together using edges

Real World Example

You and I are a part of social networks that we can use to answer questions. Similarly, PageRank uses networks — aka graphs — of web pages to answer our questions. Image credit: Search Engine Journal

Back in 2004, my parents decided to come to America. They didn’t have access to Google in Pakistan, so they began asking their most-trusted friends for advice on how to immigrate. Notice, I said their most-trusted friends.

As you will see, Internet search works the same way! Google's index is meant to cover all of the information found on the web, and when the time comes to show results, pages are displayed based on a PageRank (PR) rating from 1 to 10, where 10 represents the most trusted, most relevant pages and 1 the least relevant.

But what is a Graph Data Structure?

More specifically, here are further details to specify the graph data structure we will use in this post:

The graph represents all web pages on the Internet. It's like a web: a network of web pages all over the, well, the Web. They connect to each other using links, just like this one. This can be implemented in Python using an InternetGraph class.

The vertices in the graph are the web pages. Not all web pages instances in the graph are necessarily connected. As stated above, the edges between the pages represent hyperlinks on the Web.

In this case, the edges are also directed. Because hyperlinks are one-way connections, Page A may link to Page B, for example, while the reverse is not necessarily true.

Edges are also weighted here, in order to represent the probability that a user goes from one site to another. This means there's a numerical value associated with the link that goes from one page to another (we will delve further into this below so it makes more sense, I promise).

How the Model Works

This model is structured using Object-Oriented Programming, so that you can start by seeing what concepts strictly relate to a single page, and build up to seeing how the web works as a whole. In computer science terms, this means the properties and instance methods of the PageVertex and InternetGraph classes are below, and the implementation in Python code is at the end of this post.

The InternetGraph Class

A representation of the World Wide Web. This class is a composition of many PageVertex instances.

A composition is just a fancy way of saying the InternetGraph is built out of many smaller parts. And the Internet is HUGE: Google's index of the web contains references to trillions of pages. Image source: How Google Search Works (in 5 minutes)

Attributes of InternetGraph

  • dict: pages: this dictionary maps the page_id attribute of each PageVertex instance to the PageVertex instance itself.

The PageVertex Class

This is how we will represent a single web page in code.

Attributes of PageVertex

  • str: page_id: a unique name for the web page
  • dict: neighbors: this dictionary represents the other PageVertex instances that can be reached using hyperlinks from this web page. The dictionary maps the page_id attribute of each of those other PageVertex instances to the PageVertex object itself.
  • float: link_weight is the weight that each edge from this PageVertex instance carries. This attribute of the edge represents the probability that a site visitor goes to any one of the neighboring sites linked by this PageVertex instance. It is calculated by dividing 100% by the number of neighbors that the page has. For example, if a PageVertex has 2 neighbors that it links towards (aka its "outlinks"), then each of the weight values for those edges will be 0.50 (see the sketch after this list).
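To make these attributes concrete, here is a minimal sketch of the two classes. The attribute names match the descriptions above, but the bodies are simplified illustrations, not the exact implementation from the repository linked at the end of the post; the add_page, add_neighbor, and link_pages helpers are hypothetical, and link_weight is written as a computed property here for simplicity.

```python
class PageVertex:
    """Represents a single web page in the graph."""

    def __init__(self, page_id):
        self.page_id = page_id   # unique name for the web page
        self.neighbors = {}      # maps page_id -> PageVertex reached by an outlink

    def add_neighbor(self, page):
        """Record a hyperlink from this page to another page (hypothetical helper)."""
        self.neighbors[page.page_id] = page

    @property
    def link_weight(self):
        """Probability a visitor follows any single outlink: 100% split evenly."""
        return 1 / len(self.neighbors) if self.neighbors else 0.0


class InternetGraph:
    """Represents the World Wide Web as a composition of PageVertex instances."""

    def __init__(self):
        self.pages = {}          # maps page_id -> PageVertex

    def add_page(self, page_id):
        """Create and register a new page (hypothetical helper)."""
        page = PageVertex(page_id)
        self.pages[page_id] = page
        return page

    def link_pages(self, from_id, to_id):
        """Add a directed edge (hyperlink) from one page to another (hypothetical helper)."""
        self.pages[from_id].add_neighbor(self.pages[to_id])
```

With this sketch, a page that links to two other pages would report a link_weight of 0.50, matching the example in the list above.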

Problems to Investigate

So what’s PageRank even good for? In this project, we will take a look at several problems that Google uses the PageRank algorithm to solve, and model their solutions in code, based upon the graph data structure we’re using.

Calculating the PageRank Rating for Each Page

This problem takes a look at the backlinks that point to each Page in the Internet, with the idea being that the more links leading to a web page, the more useful that page is.

In this kind of algorithm, you can think of a link from one page to another as the former recommending that the user visit the latter!

At the same time, a page that links to a lot of other pages might not really be that valuable — PageRank accounts for this by assigning a lower value to a link from that page.

Image Source: Zach Star’s explanation of PageRank, “The algorithm that started google”

In this example, we have 4 web pages, called A, B, C, D. Each page is only allowed to give out 1 point in total, through all of its links.

For example, notice how A only has one link, pointing to B. Because it is A’s only link, it transfers 1 entire point to B.

However, the links that originate from B only weigh half a point each, since B links to two other pages in total. At the end of the day, B can still only give out 1 point in total to the other pages.

Image Source: Zach Star’s explanation of PageRank, “The algorithm that started google”

Head spinning yet? Don’t worry: using the table to the left, called an adjacency matrix, we can clearly see the links that each page both receives and gives.

In the table to the left, you can see the total value of links received by a page by adding up the values in its row; and for any page, you can see the pages it gives links to by looking down its column.

So then you may ask, how do we use the adjacency matrix to rank pages?

Well, one step is to simply add up the total weight of the backlinks (links that a page receives) for each page, then rank the pages from greatest to least.

However, we also need to account for the importance of the page giving these backlinks; that is, how credible was the page that the link came from?

The math for this gets complicated, but the high-level answer is that we compute the principal eigenvector of the matrix of link weights. This helps ensure that the PageRank ratings factor in both the quality and quantity of the links a page receives.
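To see how this plays out numerically, here is a small sketch using a hypothetical four-page graph. The link weights below are made up for illustration and do not exactly reproduce the diagram; only the A and B columns follow the example above (A gives its whole point to B, and B splits its point between two pages). Summing each row gives the simple "total backlink weight" ranking, while power iteration converges toward the principal eigenvector, folding in the credibility of the linking pages as well.

```python
import numpy as np

pages = ["A", "B", "C", "D"]

# Hypothetical link weights: entry [i][j] is the weight of the link FROM page j
# TO page i. Every column sums to 1, because each page gives out exactly 1 point
# in total across all of its outlinks.
link_matrix = np.array([
    #  A     B     C     D    <- page giving the link
    [0.0,  0.0,  0.5,  0.5],  # points received by A
    [1.0,  0.0,  0.5,  0.0],  # points received by B
    [0.0,  0.5,  0.0,  0.5],  # points received by C
    [0.0,  0.5,  0.0,  0.0],  # points received by D
])

# First pass: simply add up the backlink weight each page receives (its row sum).
backlink_totals = link_matrix.sum(axis=1)
print(dict(zip(pages, backlink_totals)))

# Better: power iteration. Repeatedly apply the link matrix until the scores
# stabilize; the stable vector is the principal eigenvector, so each page's
# score now also depends on how credible the pages linking to it are.
scores = np.full(len(pages), 1 / len(pages))
for _ in range(50):
    scores = link_matrix @ scores

for page, score in sorted(zip(pages, scores), key=lambda pair: -pair[1]):
    print(f"{page}: {score:.3f}")
```

Real PageRank also adds a "damping factor" to model users who jump to a random page, but the core idea is the same: keep passing points along the links until the scores settle.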

Determining Which Pages Can Be Reached, After Clicking N links Away from a Starting Page

This is another problem we can try, which applies Breadth-First Search. Breadth-first search is an algorithm that traverses the graph starting from one page, and progressively visits other pages by levels. This means that all the pages that can be reached within 1 link of the start are traversed in one pass.

In a Breadth-first search, we decide which vertices to search, based on the number of links they are away from the start. On the first pass, we visit all the vertices zero links away. Then, all the vertices 1 link away, together in one pass. Examine the red and green lines I have drawn to help you see these passes. This diagram was created by annotating the graph found on this page from CodeAbbey.

In the diagram to the left, I group all the vertices that are visited within one pass. Consider the red line I drew — all those pages are in the same pass, because all those pages are 1 link away from the start, which is the A page.

As we continue, all the pages are traversed that can be reached within 2 links away, 3 links away, 4 links away… and so on and so forth, until we either reach the pages that are n links away, or exhaust all possible pages in the graph!

Using the graph above as an example, the pages that can be found 2 links away from A would be in the green level, or the F, G, and E pages.
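Below is a hedged sketch of how this traversal might look in code, written as a free function over the InternetGraph sketched earlier (in the project it is the method InternetGraph.find_pages_n_away discussed later in this post, and the real method may define "n away" slightly differently). It uses the standard breadth-first search pattern with a queue and a visited set.

```python
from collections import deque

def find_pages_n_away(graph, start_id, n):
    """Return the page_ids of every page reachable in exactly n links from start_id."""
    visited = {start_id}
    queue = deque([(graph.pages[start_id], 0)])   # (page, links travelled so far)
    result = []

    while queue:
        page, level = queue.popleft()
        if level == n:
            result.append(page.page_id)
            continue                              # stop expanding past n links
        for neighbor_id, neighbor in page.neighbors.items():
            if neighbor_id not in visited:
                visited.add(neighbor_id)
                queue.append((neighbor, level + 1))

    return result
```

Because the queue is first-in, first-out, every page 1 link away is processed before any page 2 links away, which is exactly the level-by-level behavior described above.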

Finding the Shortest Path to Get From One Page to Another

This problem is an application of Dijkstra’s Shortest Path Algorithm, which can only be applied to weighted graphs.

In this algorithm, the weights of the links now represent distances, and we want to find the links that will minimize the total path between a starting page and a destination page (which we will get to by clicking through those links).

Constructing a table such as the one above, we can see the shortest path from our starting Page A to any of the other pages.

For example, if we have the graph to the left, we can compute the minimum distance required to get from the starting_page, in this case Page A, to any other page.

For instance, if we wanted to find the shortest possible distance between A and C, we would look up C in the table and return 25 as the distance. We wouldn’t even bother looking at the other paths from A to C, such as through D (A-D-C) — because those would be longer distances.

This is a simple example, since A directly links to all the other pages. Nevertheless, bear in mind that this algorithm scales to scenarios where this is not the case, as long as all the vertices in the graph are still connected — it just requires more iterations.
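Here is a hedged sketch of the path-finding logic, again as a free function over the InternetGraph sketched earlier (the project uses the method InternetGraph.find_shortest_path discussed in the next section). It builds the kind of distance table shown above: every page starts at infinity, and we repeatedly settle the closest unvisited page and relax its outlinks. For simplicity it reuses the link_weight attribute from earlier as the edge distance; in a more general graph, each link could carry its own distance. The next-page selection is a plain linear scan, the same design choice discussed under Scalability below.

```python
import math

def find_shortest_path(graph, start_id, target_id):
    """Return the minimum total link weight needed to get from start_id to target_id."""
    distances = {page_id: math.inf for page_id in graph.pages}
    distances[start_id] = 0
    unvisited = set(graph.pages)

    while unvisited:
        # Pick the unvisited page with the smallest known distance (linear scan).
        current_id = min(unvisited, key=lambda page_id: distances[page_id])
        if distances[current_id] == math.inf:
            break                             # the remaining pages are unreachable
        unvisited.remove(current_id)
        if current_id == target_id:
            return distances[current_id]

        current = graph.pages[current_id]
        for neighbor_id in current.neighbors:
            # Relax the edge: keep the neighbor's distance if this route is shorter.
            new_distance = distances[current_id] + current.link_weight
            if new_distance < distances[neighbor_id]:
                distances[neighbor_id] = new_distance

    return distances[target_id]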

Scalability

This section reports on how solvable the problems in this investigation remain, and how efficient the algorithms currently used to solve them stay, as the size of the graph grows asymptotically.

PageRank Implementation

The PageRank algorithm Google uses today is much more advanced than what we’ve covered here. It now takes into account hundreds of variables to rank pages, such as location, keywords, the date the page was created, etc. In its full complexity, it appears to still work as we add more and more web pages and hyperlinks to the Web. According to Search Engine Land, as of 2016 Google had indexed over 130 trillion web pages. Although Google does not publicly announce how PageRank’s implementation has changed over the years, this article on Search Engine Roundtable reports that John Mueller, a Webmaster Trends Analyst at Google, stated in February 2020 that the company still uses the algorithm internally to rank pages.

Currently the method used to perform PageRank, InternetGraph.rank_pages, runs in O(P^2 + L), where P = the number of PageVertex instances and L = the number of links in the InternetGraph. This is mainly due to the runtime used to find the inlinks leading to each PageVertex, which is performed by a helper function, InternetGraph.compute_inlink_values. Therefore my code below is probably not an efficient algorithm for inputs where P + L is much greater than 30, perhaps on the scale of 1000s of pages and links.

Finding Neighbors n Links Away

This problem does not remain practical as the size of the input scales asymptotically. The Internet as we know it today is simply too large to navigate just by clicking on links. Be grateful for search engines!

The function used to solve this problem is InternetGraph.find_pages_n_away. This implements the Breadth-first Search algorithm as a function of the pages and links in the graph, therefore it has a polynomial efficiency. In the worst case, the runtime of the function is O(P + L), where P = the number of PageVertex instances and L = the number of links in the overall graph. For small inputs, on the order of 30 pages and links, this is an efficient algorithm. However, on the true scale of the Internet, which is trillions of pages, it is not efficient.

Shortest Path-Finding

This problem also does not remain practical as the size of the InternetGraph increases asymptotically, for the same reason as above: when dealing with trillions of web pages, it is simply better to use search engines and applications in order to quickly get between different web pages of interest.

The function used to solve this problem is InternetGraph.find_shortest_path, which implements Dijkstra's Shortest Path algorithm. With a binary min-heap, this algorithm typically runs in O((P + L) log P), i.e. roughly linearithmic (loglinear) time. However, because an array is used instead of a binary min-heap in this implementation, the runtime is currently O(P^2). This would not be an efficient function to use on inputs of size 30 or more.
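For reference, here is a hedged sketch of what the heap-based variant could look like (the function name and structure below are illustrative, not the project's actual code). Swapping the linear scan for Python's heapq min-heap is what brings the runtime down toward the O((P + L) log P) range mentioned above.

```python
import heapq
import math

def find_shortest_path_heap(graph, start_id, target_id):
    """Dijkstra's algorithm using a binary min-heap instead of a linear scan."""
    distances = {page_id: math.inf for page_id in graph.pages}
    distances[start_id] = 0
    heap = [(0, start_id)]                 # (distance so far, page_id)

    while heap:
        distance, page_id = heapq.heappop(heap)
        if page_id == target_id:
            return distance
        if distance > distances[page_id]:
            continue                       # stale entry; a shorter path was already found

        page = graph.pages[page_id]
        for neighbor_id in page.neighbors:
            new_distance = distance + page.link_weight
            if new_distance < distances[neighbor_id]:
                distances[neighbor_id] = new_distance
                heapq.heappush(heap, (new_distance, neighbor_id))

    return distances[target_id]
```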

Resources

For more information explaining how PageRank works, and why you should care (as either a businessperson, web developer, or curious mind), please check out:

  1. This awesome article on Search Engine Land, explaining PageRank for SEOs.
  2. A 5 minute explanation of PageRank from Matt Cutts, the former Head of the Web Spam team at Google.
  3. Zach Star’s explanation of PageRank uses adjacency matrices to explain the algorithm. This explanation most closely resembles what is used in this project.
  4. Google’s updated 5 minute explanation of Search will show how the algorithm has evolved a lot over the years!

The code used to implement the three problem solutions is below. The full repository is linked here.
