Note
0.0(0)
MR

PageRank Notes

History of Link Analysis

  • Bibliometrics has been active since at least the 1960s.

  • Definition of Bibliometrics from Wikipedia:

    • "Bibliometrics is a set of methods to quantitatively analyze academic literature.

    • Citation analysis and content analysis are commonly used bibliometric methods.

    • While bibliometric methods are most often used in the field of library and information science, bibliometrics have wide applications in other areas."

  • Many research fields use bibliometric methods to explore:

    • The impact of their field.

    • The impact of a set of researchers.

    • The impact of a particular paper.

What is PageRank?

  • PageRank is a web link analysis algorithm introduced by Google.

  • Developed at Stanford University by Google founders Larry Page and Sergey Brin.

  • The paper describing PageRank was co-authored by Rajeev Motwani and Terry Winograd.

  • PageRank is a "vote" by all the other pages on the Web about how important a page is.

    • A link to a page counts as a vote of support.

    • If there's no link, there's no support (abstention rather than a vote against).

  • PageRank says nothing about the content or size of a page, the language, or the anchor text of a link.

  • PageRank is a probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page.

PageRank Patent

  • Patent for the PageRank algorithm.

  • Larry Page is credited as the inventor.

  • The patent was awarded to Stanford University.

  • The patent was filed in January 1998.

  • The PageRank patent expired in 2017.

  • Google holds a perpetual license to the patent.

  • Google has never pursued other search engine companies for using the PageRank algorithm.

Initial PageRank Formulation

  • PR(u) = \sum_{v \in in(u)} \frac{PR(v)}{|out(v)|}

    • The PageRank value for a page u is dependent on the PageRank values for each page v contained in the set in(u) (the set of all pages linking to page u), divided by the number out(v) of outgoing links from page v.

Steps for Simplified Algorithm

  1. Iteration 0: Initialize all ranks to be 1 / (number of total pages).

  2. Iteration 1: For each page u, update u's rank to be the sum of each incoming page v's rank from the previous iteration, divided by the total number of links from page v.

Simplified PageRank Algorithm - Example 1

  1. Iteration 0: Initialize all pages to have rank 1/5.

  2. Iteration 1:

  3. P1 has 1 link from P3, and P3 has 4 outbound links, so we take the rank of P3 from iteration 0 and divide it by 4, which results in rank (1/5)/4 = 1/20 for P_1.

    • PR(P_1) = (1/5) / 4 = 1/20

  4. P2: has 2 links from P1 and P3, P1 has 1 outbound link and P3 has 4 outbound links, so we take (the rank of P1 from iteration 0 and divide it by 1) and add that to (the rank of P3 from iteration 0 and divided that by 4) to get 1/5 + 1/20 = 5/20 for P2

    • PR(P_2) = 1/5 + (1/5)/4 = 5/20

Simplified PageRank Algorithm - Example 1: After 2 iterations

  • Final Ranking: P5, P4, P3, P2, P1

  • PR(P_5) = 1/5 + 1/5 * 1/2 + 1/5 * 1/4 = 7/20

Another Example

  • Say we have four pages: A, B, C and D.

  • Links from a page to itself are ignored.

  • Multiple links from one page to another are treated as a single link.

  • In this example, every page would start out with a rank of 0.25

Example 2

  • Since B, C, and D all have outbound links to A, the Pagerank of A will be 0.75 upon the first iteration: (B with rank of 0.25) + (C with rank of 0.25) + (D with rank of 0.25) would transfer all of those ranks to A

  • But wait! What about ranks of pages B,C, and D?

  • Because B, C, and D have no incoming edges and they give all their rank to A, they will all end up with a rank of 0. This doesn't add up to 1 …

  • So the simplified algorithm needs to be able to handle border cases, so we must generalize the PageRank algorithm!

Complete PageRank Algorithm

  • Quoting from the original Google paper, PageRank is defined like this:

    • "We assume page A has pages T1… Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A."*

  • The PageRank of a page A is given as follows:

    • PR(A) = (1-d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

  • Note:

    • That the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

    • *The Anatomy of a Large-Scale Hypertextual Web Search Engine by Brin and Page, http://infolab.stanford.edu/pub/papers/google.pdf

Explanation

  • PR(A) is PageRank of Page A (one we want to calculate)

  • PR(T_1) is the PageRank of Site T1 pointing to page A

  • C(T_1) is the number of outgoing links of page T1

  • PR(Tn)/C(Tn): If page A has a backlink from page "Tn" the share of the vote page A will get is "PR(Tn)/C(Tn)"

  • d(…): All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is “damped down” by multiplying it by the factor "d", (0.85)

  • (1-d): Since "sum of all web pages' PageRanks will be one". Even if the d(…) is 0 then the page will get a small PR of 0.15

How PageRank is Calculated

  • PR of each page depends on PR of other pages which are pointing to it. But we don't know PR of a given page until the PR of other pages is calculated and so on…

  • From the Google paper:

    • PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web

  • What this means is that we can calculate the PR without knowing the final PR of other pages

  • Recurrence Relation: an equation that recursively defines a sequence of values once one or more initial terms are given;

  • We calculate PR iteratively a number of times until we start converging to the same value.

How Fast Does the PageRank Algorithm Converge

  • Early experiments on Google used 322 million links

  • PR (322 Million Links): 52 iterations

  • PR (161 Million Links): 45 iterations

  • Number of iterations required for convergence is empirically (but not formally derived) O(log n) (where n is the number of links)

  • Hence the calculation is quite efficient

Computing PageRank by Iteration

  • Example

    • Consider 2 pages: Page A and Page B pointing to each other.

    • C(A) = 1, number of outgoing links of A

    • C(B) = 1, number of outgoing links of B

    • What should be the initial value of P(A) or P(B) ?

Guess 1:

  • Suppose the initial values are:

    • P(A)=1 and P(B) = 1 and d= 0.85; then

      • PR(A) = (1 - d) + d*(PR(B)/1)

      • PR(B) = (1 - d) + d*(PR(A)/1)

      • PR(A) = 0.15 + 0.85 * 1 = 1

      • PR(B) = 0.15 + 0.85 * 1 = 1

    • In one iteration we are done

  • Let's try another set of initial values.

Guess 2 With 3 Iterations

  • Initial Values : P(A) = 0, P(B) = 0 and d= 0.85

    • PR(A) = (1-d) + d(PR(B)/1)

    • PR(B) = (1 - d) + d(PR(A)/1)

  • PR(A) = 0.15 + 0.85 * 0 = 0.15

  • PR(B) = 0.15 + 0.85 * 0.15 = 0.2775

  • Iterating again we get:

    • PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875

    • PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375

  • And iterating again

    • PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875

    • PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.62285048437

Guess 2: Continued

  • After 20 iterations…

  • Both approaching to 1,

  • The last three slides might seem confusing - because they use the current (rather than previous) values of PR(A), to calculate PR(B). "Technically" we ARE supposed to be using PR(A) and PR(B) from the previous step, to update them for the current step. We can 'get away with' being careless/sloppy because… it doesn't matter (which we use - previous or current values)! After enough steps, either way will result in correct, converged PR values.

  • The following screengrabs illustrate the "proper" way of using prior values, but, with arbitrary starting values to show they don't matter.

PageRank Convergence?

  • Some observations about the damping factor

    • The damping factor value and its effect:

      • For certain graphs the simple update rule can cause pagerank to accumulate and get stuck in certain parts of the graph

        • E.g. if a page has no outgoing links to other pages, it is called a sink

      • The simplified pagerank algorithm can get stuck in sinks

      • This is fixed by having each node on every round

        • Give a d fraction of its pagerank to its neighbors

        • Give a (1-d) fraction of its pagerank to everyone in the graph

      • As a result, pages with no incoming links will get some pagerank

      • If too high, more iterations are required

      • If too low, you get repeated over-shoot,

        • Both above and below the average – the numbers just swing like pendulum and never settle down.

Examples of Relative PageRanks 1

  • Every page has at least a PR of 0.15 to start out.

  • Page D has no votes but still it has a PR of 0.15

  • It is believed that Google undergoes a post-spidering phase whereby any pages that have no incoming links at all are ignored wrt PageRank

  • Examples on the following pages are taken from http://www.sirgroane.net/google-page-rank/

Example 2: Simple hierarchy with Some Outgoing Links

  • Home has the most PR

  • But average PR is 0.378

  • "External site" pages are not voting for anyone.

  • Links within your own site can have a significant effect on PageRank.

Example 3: Linking External Sites Back into our Home Page

  • Comparing this example with the previous one, now the Pagerank of the Home Page has increased significantly.

  • Moreover, the average PageRank is better because of all the external sites linking back to the home page.

Example 4: Simple Hierarchy

  • Home Page has PageRank of 2.5 times the page rank of its child pages.

  • A hierarchy structure concentrates votes and PageRank into one page.

Example 5: Hierarchical - But with One Link In and One Out

  • The PageRank of Home page has increased from 1.92 (Previous Example) to 3.31

  • Site A contributed 0.85 PR to Home page and the raised PageRank in the "About", "Product” and “More” pages has had a lovely "feedback" effect, pushing up the home page's PageRank even further!

  • A well structured site will amplify the effect of any contributed PR.

Example 6: Looping

  • All the pages have the same number of incoming links, i.e. all pages are of equal importance to each other.

  • Each page has PR of 1.0

  • Average PR is 1.0

Example 7: Looping - But with a Link in and a Link Out

  • PageRank of our home page has gone up a little, but what's happened to the "More" page? Its PageRank has gone down

  • Now the Pagerank value of the external Site B is equal to the "More" page.

  • The vote of the “Product” page has been split evenly between "More" page and the external site B.

  • This is good for Site B for otherwise its PageRank would be 0.15

Example 8: Extensive Interlinking or Fully Meshed

  • All pages are of equal importance to each other so the resulting PageRank is no different than what was gotten in Example 6

Example 9: Fully Meshed – But with One Vote in and One Vote Out

  • "Product" page has kept three quarters of its vote within our site unlike example 9 where it was giving away fully half of it's vote to the external site!

  • Keeping just this small extra fraction of the vote within our site has had a very nice effect on the Home Page too.

Example 10: Previous … Next… Documentation Page Layout

  • The first page of the document has a higher PR than the Home Page! This is because page B is getting all the vote from page A, but page A is only getting fractions of pages B, C and D.

  • In order to give users of your site a good experience, you may have to take a hit against your PR

Example 11: Getting Higher PR the Wrong Way!

  • Average PR: 1.000

  • 1000 incoming links and only one outgoing link

  • It doesn't matter how many pages you have in your site, your average PR will always be 1.0 at best.

  • But a hierarchical layout can strongly concentrate votes, and therefore the PR, into the home page!

  • This is a technique used by some disreputable sites

  • A link farm is set of web pages created with the sole aim of linking to a target page, in an attempt to improve that page's search engine ranking.

Some Suggestions Based on What We Have Seen in Examples.

  • Suggestions for improving your page rank

    • Increasing the internal links in your site can minimize the damage to your PR when you give away votes by linking to external sites.

    • If a particular page is highly important - use a hierarchical structure with the important page at the "top".

    • Where a group of pages may contain outward links – increase the number of internal links to retain as much PR as possible.

    • Where a group of pages do not contain outward links - the number of internal links in the site has no effect on the site's average PR. You might as well use a link structure that gives the user the best navigational experience.

    • Use Site Maps: Linking to a site map on each page increases the number of internal links in the site, spreading the PR out and protecting you against your vote "donations" to other sites

Importance of PageRank

  • PageRank is just one factor Google uses to determine a page's relevance. It assumes that people will link to your page only if they think the page is good. But this is not always true.

  • Content is still the king!!!

    • Anchor, body, title tags etc. still are very important for search engines

  • From Chris Ridings' Paper, "PageRank Uncovered" (http://www.voelspriet2.nl/PageRank.pdf):

    • You must do enough "on the page" work and/or anchor text work to get into that subset of top pages for your chosen key phrase, otherwise your high PageRank will be completely in vain.

  • PageRank is a multiplier factor.

  • If Google wants to penalize any page they can set the PageRank equal to a small number, even 0. As a result it will be pushed down on the search results page.

Note
0.0(0)