Bibliometrics has been active since at least the 1960s.
Definition of Bibliometrics from Wikipedia:
"Bibliometrics is a set of methods to quantitatively analyze academic literature.
Citation analysis and content analysis are commonly used bibliometric methods.
While bibliometric methods are most often used in the field of library and information science, bibliometrics have wide applications in other areas."
Many research fields use bibliometric methods to explore:
The impact of their field.
The impact of a set of researchers.
The impact of a particular paper.
PageRank is a web link analysis algorithm used by Google.
It was developed at Stanford University by Google founders Larry Page and Sergey Brin.
The paper describing PageRank was co-authored by Rajeev Motwani and Terry Winograd.
PageRank is a "vote" by all the other pages on the Web about how important a page is.
A link to a page counts as a vote of support.
If there's no link, there's no support (abstention rather than a vote against).
PageRank says nothing about the content or size of a page, the language, or the anchor text of a link.
PageRank is a probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page.
Patent for the PageRank algorithm.
Larry Page is credited as the inventor.
The patent was awarded to Stanford University.
The patent was filed in January 1998.
The PageRank patent expired in 2017.
Google holds a perpetual license to the patent.
Google has never pursued other search engine companies for using the PageRank algorithm.
PR(u) = \sum_{v \in in(u)} \frac{PR(v)}{|out(v)|}
The PageRank value for a page u is dependent on the PageRank values for each page v contained in the set in(u) (the set of all pages linking to page u), divided by the number out(v) of outgoing links from page v.
Iteration 0: Initialize all ranks to be 1 / (number of total pages).
Iteration 1: For each page u, update u's rank to be the sum of each incoming page v's rank from the previous iteration, divided by the total number of links from page v.
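These two steps can be sketched directly in Python. The three-page link graph below is a made-up illustration (not one of the examples from these slides):

```python
# Simplified (undamped) iterative PageRank.
# links[p] is the list of pages that p links OUT to.
links = {
    "P1": ["P2"],
    "P2": ["P3"],
    "P3": ["P1", "P2"],
}
n = len(links)

# Iteration 0: every page starts with rank 1/n.
rank = {p: 1.0 / n for p in links}

for _ in range(50):
    new_rank = {p: 0.0 for p in links}
    # Each page v passes rank[v] / |out(v)| to every page it links to.
    for v, outs in links.items():
        share = rank[v] / len(outs)
        for u in outs:
            new_rank[u] += share
    rank = new_rank

# Because this graph has no sinks, the ranks always sum to 1.
print(rank)
```

Since every page here has at least one outgoing link, the total rank is conserved at 1 on every iteration; the next example shows what goes wrong when that is not the case.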
Iteration 0: Initialize all pages to have rank 1/5.
Iteration 1:
P1 has 1 link, from P3, and P3 has 4 outbound links, so we take the rank of P3 from iteration 0 and divide it by 4, giving P1 a rank of (1/5)/4 = 1/20.
PR(P_1) = (1/5) / 4 = 1/20
P2 has 2 links, from P1 and P3. P1 has 1 outbound link and P3 has 4 outbound links, so we take the rank of P1 from iteration 0 divided by 1, and add the rank of P3 from iteration 0 divided by 4, to get 1/5 + 1/20 = 5/20 for P2.
PR(P_2) = 1/5 + (1/5)/4 = 5/20
Final Ranking: P5, P4, P3, P2, P1
PR(P_5) = 1/5 + 1/5 * 1/2 + 1/5 * 1/4 = 7/20
Say we have four pages: A, B, C and D.
Links from a page to itself are ignored.
Multiple links from one page to another are treated as a single link.
In this example, every page would start out with a rank of 0.25
Since B, C, and D all have outbound links to A, the PageRank of A will be 0.75 after the first iteration: B, C, and D (each with rank 0.25) transfer all of their rank to A.
But wait! What about ranks of pages B,C, and D?
Because B, C, and D have no incoming edges and they give all their rank to A, they will all end up with a rank of 0. This doesn't add up to 1 …
The simplified algorithm cannot handle such border cases, so we must generalize the PageRank algorithm!
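A quick sketch of the simplified update on this four-page example makes the collapse concrete (page names as above):

```python
# B, C, and D each link only to A; A has no outgoing links (it is a sink).
links = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
rank = {p: 0.25 for p in links}  # iteration 0: every page starts at 1/4

for _ in range(2):
    new_rank = {p: 0.0 for p in links}
    for v, outs in links.items():
        for u in outs:
            new_rank[u] += rank[v] / len(outs)
    rank = new_rank

# After iteration 1, A held 0.75 and B, C, D held 0 (total 0.75, not 1);
# after iteration 2, A's rank has nowhere to go, so every page is at 0.
```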
Quoting from the original Google paper, PageRank is defined like this:
"We assume page A has pages T1… Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A."*
The PageRank of a page A is given as follows:
PR(A) = (1-d) + d * (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
Note:
That the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
*The Anatomy of a Large-Scale Hypertextual Web Search Engine by Brin and Page, http://infolab.stanford.edu/pub/papers/google.pdf
PR(A) is PageRank of Page A (one we want to calculate)
PR(T_1) is the PageRank of Site T1 pointing to page A
C(T_1) is the number of outgoing links of page T1
PR(Tn)/C(Tn): If page A has a backlink from page "Tn" the share of the vote page A will get is "PR(Tn)/C(Tn)"
d(…): All these fractions of votes are added together, but to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by the factor d (0.85).
(1-d): Since the "sum of all web pages' PageRanks will be one", even if the d(…) term is 0 the page still gets a small PR of 0.15.
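Putting these pieces together, here is the damped formula applied to the earlier four-page example (B, C, and D each linking to A), with d = 0.85. This is a sketch of the per-page form of the formula, not Google's actual implementation:

```python
d = 0.85
links = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"]}
# incoming[p] = the set T1..Tn of pages that point to p
incoming = {p: [v for v, outs in links.items() if p in outs] for p in links}
rank = {p: 1.0 for p in links}

for _ in range(50):
    # PR(p) = (1-d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
    rank = {
        p: (1 - d) + d * sum(rank[v] / len(links[v]) for v in incoming[p])
        for p in links
    }

# B, C, and D have no incoming links, so each settles at 1 - d = 0.15;
# A settles at 0.15 + 0.85 * (0.15 + 0.15 + 0.15) = 0.5325.
print(rank)
```

Unlike the simplified version, no page collapses to 0: the (1-d) term guarantees every page a floor of 0.15.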
PR of each page depends on PR of other pages which are pointing to it. But we don't know PR of a given page until the PR of other pages is calculated and so on…
From the Google paper:
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web
What this means is that we can calculate the PR without knowing the final PR of other pages
Recurrence Relation: an equation that recursively defines a sequence of values once one or more initial terms are given;
We calculate PR iteratively a number of times until we start converging to the same value.
Early experiments on Google used 322 million links
PR (322 Million Links): 52 iterations
PR (161 Million Links): 45 iterations
The number of iterations required for convergence is empirically O(log n), where n is the number of links (observed, not formally derived).
Hence the calculation is quite efficient
Example
Consider 2 pages: Page A and Page B pointing to each other.
C(A) = 1, number of outgoing links of A
C(B) = 1, number of outgoing links of B
What should be the initial value of PR(A) or PR(B)?
Suppose the initial values are:
PR(A) = 1, PR(B) = 1, and d = 0.85; then
PR(A) = (1 - d) + d*(PR(B)/1)
PR(B) = (1 - d) + d*(PR(A)/1)
PR(A) = 0.15 + 0.85 * 1 = 1
PR(B) = 0.15 + 0.85 * 1 = 1
In one iteration we are done
Let's try another set of initial values.
Initial Values: PR(A) = 0, PR(B) = 0, and d = 0.85
PR(A) = (1-d) + d(PR(B)/1)
PR(B) = (1 - d) + d(PR(A)/1)
PR(A) = 0.15 + 0.85 * 0 = 0.15
PR(B) = 0.15 + 0.85 * 0.15 = 0.2775
Iterating again we get:
PR(A) = 0.15 + 0.85 * 0.2775 = 0.385875
PR(B) = 0.15 + 0.85 * 0.385875 = 0.47799375
And iterating again
PR(A) = 0.15 + 0.85 * 0.47799375 = 0.5562946875
PR(B) = 0.15 + 0.85 * 0.5562946875 = 0.62285048437
After 20 iterations…
Both values approach 1.
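These iterations are easy to reproduce in Python. Note that, exactly as in the numbers above, the second update deliberately uses the freshly computed PR(A):

```python
d = 0.85
pr_a = pr_b = 0.0  # initial values, as above
history = []

for _ in range(20):
    pr_a = (1 - d) + d * pr_b  # PR(A) = (1-d) + d*PR(B)/C(B), with C(B) = 1
    pr_b = (1 - d) + d * pr_a  # uses the PR(A) just computed ("in-place")
    history.append((pr_a, pr_b))

# history[1] reproduces (0.385875, 0.47799375) up to float rounding,
# and after 20 iterations both values are within about 0.002 of 1.
```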
The last three slides might seem confusing because they use the current (rather than previous) value of PR(A) to calculate PR(B). Technically we ARE supposed to use PR(A) and PR(B) from the previous step to update them for the current step. We can 'get away with' being sloppy here because it doesn't matter which we use, previous or current values: after enough steps, either way converges to the correct PR values.
The following screengrabs illustrate the "proper" way of using prior values, but, with arbitrary starting values to show they don't matter.
Some observations about the damping factor
The damping factor value and its effect:
For certain graphs the simple update rule can cause PageRank to accumulate and get stuck in certain parts of the graph.
E.g. if a page has no outgoing links to other pages, it is called a sink.
The simplified PageRank algorithm can get stuck in sinks.
This is fixed by having each node, on every round:
Give a d fraction of its PageRank to its neighbors.
Give a (1-d) fraction of its PageRank to everyone in the graph.
As a result, pages with no incoming links will get some PageRank.
If d is too high, more iterations are required to converge.
If d is too low, you get repeated over-shoot, both above and below the average: the numbers swing like a pendulum and never settle down.
Every page has at least a PR of 0.15 to start out.
Page D has no votes, but it still has a PR of 0.15.
It is believed that Google undergoes a post-spidering phase in which any pages that have no incoming links at all are ignored with respect to PageRank.
Examples on the following pages are taken from http://www.sirgroane.net/google-page-rank/
Home has the most PR
But average PR is 0.378
"External site" pages are not voting for anyone.
Links within your own site can have a significant effect on PageRank.
Comparing this example with the previous one, the PageRank of the Home Page has now increased significantly.
Moreover, the average PageRank is better because of all the external sites linking back to the home page.
The Home Page has a PageRank 2.5 times that of its child pages.
A hierarchy structure concentrates votes and PageRank into one page.
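This concentration effect can be checked numerically. The exact link structure of the slide's figure isn't given here, so this sketch assumes a simple hierarchy: a home page linking to three child pages, each of which links back to the home page, with d = 0.85:

```python
d = 0.85
links = {
    "Home": ["About", "Product", "More"],
    "About": ["Home"],
    "Product": ["Home"],
    "More": ["Home"],
}
incoming = {p: [v for v, outs in links.items() if p in outs] for p in links}
rank = {p: 1.0 for p in links}

for _ in range(100):
    rank = {
        p: (1 - d) + d * sum(rank[v] / len(links[v]) for v in incoming[p])
        for p in links
    }

# Home converges to about 1.92 while each child page gets about 0.69:
# the hierarchy funnels all three children's votes into the top page.
print(rank)
```

Note that the ranks sum to 4 (the number of pages), so the average PR is still 1.0: the hierarchy doesn't create rank, it concentrates it.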
The PageRank of Home page has increased from 1.92 (Previous Example) to 3.31
Site A contributed 0.85 PR to the Home page, and the raised PageRank in the "About", "Product" and "More" pages has had a lovely "feedback" effect, pushing up the home page's PageRank even further!
A well structured site will amplify the effect of any contributed PR.
All the pages have the same number of incoming links, i.e. all pages are of equal importance to each other.
Each page has PR of 1.0
Average PR is 1.0
PageRank of our home page has gone up a little, but what's happened to the "More" page? Its PageRank has gone down
Now the PageRank value of the external Site B is equal to that of the "More" page.
The vote of the "Product" page has been split evenly between the "More" page and the external Site B.
This is good for Site B for otherwise its PageRank would be 0.15
All pages are of equal importance to each other, so the resulting PageRank is no different from what was obtained in Example 6.
The "Product" page has kept three quarters of its vote within our site, unlike Example 9 where it gave away fully half of its vote to the external site!
Keeping just this small extra fraction of the vote within our site has had a very nice effect on the Home Page too.
The first page of the document has a higher PR than the Home Page! This is because page B gets all of page A's vote, while page A only gets fractions of the votes of pages B, C, and D.
In order to give users of your site a good experience, you may have to take a hit against your PR
Average PR: 1.000
1000 incoming links and only one outgoing link
It doesn't matter how many pages you have in your site; your average PR will always be 1.0 at best.
But a hierarchical layout can strongly concentrate votes, and therefore the PR, into the home page!
This is a technique used by some disreputable sites
A link farm is a set of web pages created with the sole aim of linking to a target page, in an attempt to improve that page's search engine ranking.
Suggestions for improving your page rank
Increasing the internal links in your site can minimize the damage to your PR when you give away votes by linking to external sites.
If a particular page is highly important - use a hierarchical structure with the important page at the "top".
Where a group of pages may contain outward links – increase the number of internal links to retain as much PR as possible.
Where a group of pages do not contain outward links - the number of internal links in the site has no effect on the site's average PR. You might as well use a link structure that gives the user the best navigational experience.
Use Site Maps: Linking to a site map on each page increases the number of internal links in the site, spreading the PR out and protecting you against your vote "donations" to other sites
PageRank is just one factor Google uses to determine a page's relevance. It assumes that people will link to your page only if they think the page is good. But this is not always true.
Content is still the king!!!
Anchor, body, title tags etc. still are very important for search engines
From Chris Ridings' Paper, "PageRank Uncovered" (http://www.voelspriet2.nl/PageRank.pdf):
You must do enough "on the page" work and/or anchor text work to get into that subset of top pages for your chosen key phrase, otherwise your high PageRank will be completely in vain.
PageRank is a multiplier factor.
If Google wants to penalize any page they can set the PageRank equal to a small number, even 0. As a result it will be pushed down on the search results page.