12/14/15

Java Program to Implement Simple PageRank Algorithm

When you go and type some keywords in Google Search Engine a list of Web Pages will be displayed ,but how does the search engine know which page to be shown first to the user ? To solve this problem a  algorithm called PageRank was developed at Stanford university by Larry Page and Sergey Brin in 1996.The PageRank Algorithm uses probabilistic distribution to calculate rank of a Web page and using this rank display the search results to the user. The Pagerank is recalculated every time  the search engine crawls the web.

Java Program to Implement Simple PageRank Algorithm

The original Page Rank algorithm which was described by Larry Page and Sergey Brin is :

PR(A) = (1-d) + d (PR(W1)/C(W1) + ... + PR(Wn)/C(Wn))

Where :
PR(A) – Page Rank of page A
PR(Wi) – Page Rank of pages Wi which link to page A
C(Wi) - number of outbound links on page Wi
d - damping factor which can be set between 0 and 1

To calculate PageRank for the n Webpages ,First we initialise all Webpages with equal page rank of 1/n each.Then Step by Step we calculate Page Rank for each Webpage one after the other.


Let us take one example :

Java Program to Implement Simple PageRank Algorithm


There are 5 Web pages represented by Nodes A, B, C , D, E .The hyperlink from each webpage to the other is represented by the arrow head.
Java Program to Implement Simple PageRank Algorithm

At 0th Step we have all Webpages PageRank values 0.2 that is 1/5 (1/n) . To get PageRank of Webpage A ,consider all the incoming links to A .So we have 1/4th the Page Rank of C is pointed to A. So it will be (1/5)*(1/4) which is (1/20) or 0.05 the Page Rank of A. 

Similarly the Page Rank of B will be  (1/5)*(1/4)+(1/5)*(1/1) which is (5/20) or 0.25 because A's PageRank value is 1/5 or 0.2 from Step 0 . Even though we got 0.05 of A's PageRank in Step 1 we are considering 0.05 when we are Calculating Page Rank of B in Step 2.

The general rule is --> we consider (N-1)th step values when we are calculating the Page Rank values for Nth Step . Not Clear ? Please Comment it below .

In Similar way we calculate all the Page Rank Values and Sort them to Get the Most important Webpage to be displayed in the Search Results .

Java Program to Implement Simple PageRank Algorithm
Edith Law - lecture12


Java Code for Page Rank Algorithm :


import java.util.*;
import java.io.*;
public class PageRank {

 public int path[][] = new int[10][10];
 public double pagerank[] = new double[10];

 public void calc(double totalNodes) {

  double InitialPageRank;
  double OutgoingLinks = 0;
  double DampingFactor = 0.85;
  double TempPageRank[] = new double[10];
  int ExternalNodeNumber;
  int InternalNodeNumber;
  int k = 1; // For Traversing
  int ITERATION_STEP = 1;
  InitialPageRank = 1 / totalNodes;
  System.out.printf(" Total Number of Nodes :" + totalNodes + "\t Initial PageRank  of All Nodes :" + InitialPageRank + "\n");

  // 0th ITERATION  _ OR _ INITIALIZATION PHASE //
  
  for (k = 1; k <= totalNodes; k++) {
   this.pagerank[k] = InitialPageRank;
  }

  System.out.printf("\n Initial PageRank Values , 0th Step \n");
  for (k = 1; k <= totalNodes; k++) {
   System.out.printf(" Page Rank of " + k + " is :\t" + this.pagerank[k] + "\n");
  }

  while (ITERATION_STEP <= 2) // Iterations
  {
   // Store the PageRank for All Nodes in Temporary Array 
   for (k = 1; k <= totalNodes; k++) {
    TempPageRank[k] = this.pagerank[k];
    this.pagerank[k] = 0;
   }

   for (InternalNodeNumber = 1; InternalNodeNumber <= totalNodes; InternalNodeNumber++) {
    for (ExternalNodeNumber = 1; ExternalNodeNumber <= totalNodes; ExternalNodeNumber++) {
     if (this.path[ExternalNodeNumber][InternalNodeNumber] == 1) {
      k = 1;
      OutgoingLinks = 0; // Count the Number of Outgoing Links for each ExternalNodeNumber
      while (k <= totalNodes) {
       if (this.path[ExternalNodeNumber][k] == 1) {
        OutgoingLinks = OutgoingLinks + 1; // Counter for Outgoing Links
       }
       k = k + 1;
      }
      // Calculate PageRank     
      this.pagerank[InternalNodeNumber] += TempPageRank[ExternalNodeNumber] * (1 / OutgoingLinks);
     }
    }
   }

   System.out.printf("\n After " + ITERATION_STEP + "th Step \n");

   for (k = 1; k <= totalNodes; k++)
    System.out.printf(" Page Rank of " + k + " is :\t" + this.pagerank[k] + "\n");

   ITERATION_STEP = ITERATION_STEP + 1;
  }
  // Add the Damping Factor to PageRank
  for (k = 1; k <= totalNodes; k++) {
   this.pagerank[k] = (1 - DampingFactor) + DampingFactor * this.pagerank[k];
  }

  // Display PageRank
  System.out.printf("\n Final Page Rank : \n");
  for (k = 1; k <= totalNodes; k++) {
   System.out.printf(" Page Rank of " + k + " is :\t" + this.pagerank[k] + "\n");
  }

 }
 
 public static void main(String args[]) {
  int nodes, i, j, cost;
  Scanner in = new Scanner(System.in);
  System.out.println("Enter the Number of WebPages \n");
  nodes = in .nextInt();
  PageRank p = new PageRank();
  System.out.println("Enter the Adjacency Matrix with 1->PATH & 0->NO PATH Between two WebPages: \n");
  for (i = 1; i <= nodes; i++)
   for (j = 1; j <= nodes; j++) {
    p.path[i][j] = in .nextInt();
    if (j == i)
     p.path[i][j] = 0;
   }
  p.calc(nodes);

 }
}
Please Click and Drag from the beginning to End of the above Source code for Selection and Copying.

To Compile and Run for above Example:

javac PageRank.java
java PageRank

Enter the Number of WebPages : 5
Enter the Adjacency Matrix with 1->PATH & 0->NO PATH Between two WebPages:

0 1 0 0 0
0 0 0 0 1
1 1 0 1 1
0 0 1 0 1
0 0 0 1 0

 Total Number of Nodes :5.0      Initial PageRank  of All Nodes :0.2

 Initial PageRank Values , 0th Step
 Page Rank of 1 is :    0.2
 Page Rank of 2 is :    0.2
 Page Rank of 3 is :    0.2
 Page Rank of 4 is :    0.2
 Page Rank of 5 is :    0.2

 After 1th Step
 Page Rank of 1 is :    0.05
 Page Rank of 2 is :    0.25
 Page Rank of 3 is :    0.1
 Page Rank of 4 is :    0.25
 Page Rank of 5 is :    0.35

 After 2th Step
 Page Rank of 1 is :    0.025
 Page Rank of 2 is :    0.07500000000000001
 Page Rank of 3 is :    0.125
 Page Rank of 4 is :    0.375
 Page Rank of 5 is :    0.4

 Final Page Rank  :
 Page Rank of 1 is :    0.17125
 Page Rank of 2 is :    0.21375000000000002
 Page Rank of 3 is :    0.25625000000000003
 Page Rank of 4 is :    0.46875
 Page Rank of 5 is :    0.49000000000000005

Note:
  • Final Page Rank Includes Damping Factor of 0.85 which is usually set between 0 and 1.
  • InternalNodeNumber represents the Node which you are currently calculating its PageRank.
  • ExternalNodeNumber represents the Nodes Other than InternalNodeNumber.


For every InternalNodeNumber check if there is any Incoming Links from ExternalNodeNumber if No - Ignore and move to next ExternalNodeNumber,If Yes - Count all the OutgoingLinks for that ExternalNodeNumber.

Finally Calculate Pagerank :
PR(InternalNodeNumber) += PR(ExternalNodeNumber)/All OutgoingLinks for ExternalNodeNumber

So from the above values , We have Webpage A(1) is the most important Page , Webpage B(2) and C(3) have almost equal importance with B(2) slightly more importance ,Webpage D(4) has some importance and Webpage E(5) has least importance.This helps to Rank Webpages in the Search results.

Please Note: Actual google Page rank Algorithm for large network of webpages grows logarithmic and slightly different from the one above. This Page Rank algorithm is fully owned by google inc and I just illustrated with a help of a Java Program to implement this Algorithm .I hope you enjoyed this .Thanks Have Nice Day.

Update1: New Example has been Added and Images are Updated.
Update2: I have Considered Damping Factor in my Implementation which is set to 0.85.
Update3:while(u<=2) Changed to while(ITERATION_STEP<=2).


14 comments:

  1. why while condition is u<=2? Please explain

    ReplyDelete
    Replies
    1. Thanks Sunayana Salunkhe for your comment .

      Some of the pages on the web have no links to other pages. Some of the links in the database have been picked up from fetched web pages but the web pages those links are pointing to are not yet fetched. If a page has not been fetched it is considered to be a page that has no links to other pages. These pages, called dangling links, without any links are a problem because their PageRank cannot be redistributed to other pages. Their PageRank “leaks out” of the graph and therefore gets lost.

      To prevent this, all pages without links and links to these pages are removed before calculation of PageRank values. Removing pages from the graph can cause other pages to lose all their links and become dangling links themselves. This is why removing dangling links is an iterative process.

      On every iteration all dangling links are identified before they are removed, and the number of iterations needed is counted. It is not necessary to remove all dangling links. A couple of iterations will remove most of them.

      Ref: Google: data structures and algorithms

      Delete
  2. what is the purpose of scanner class in the code: Scanner in = new Scanner(System.in); ?

    ReplyDelete
    Replies
    1. There are various ways to read input from the keyboard, the java.util.Scanner class is one of them.The Java Scanner class breaks the input into tokens using a delimiter that is whitespace bydefault. It provides many methods to read and parse various primitive values Java Scanner class is widely used to parse text for string and primitive types using regular expression.Java Scanner class extends Object class and implements Iterator and Closeable interfaces.

      Example : Scanner in = new Scanner(System.in);
      int a = in.nextInt(); // Variable "a" has Integer input from Console

      Delete
  3. How do I use the damping factor?

    ReplyDelete
    Replies
    1. I have Updated my Post and Added DampingFactor of 0.85

      Delete
  4. can you explain,in step 2, how D become 0.1 without any incoming edges and how E become 0.2 ..and also how you have got Adjacency Matrix as you mentioned .(the way you creater the matrix)thank you

    ReplyDelete
    Replies
    1. I have Updated my Post. Adjacency Matrix is got from the Directed Graph. Check if there is a Direct Path between two web pages then take path has 1, if there is no direct path or path goes through other nodes then take path has 0. :)

      Delete
  5. good job, this explanation is clear

    ReplyDelete
  6. You need to apply the damping factor in each iteration.

    ReplyDelete
  7. Why did you used this keyword ?

    ReplyDelete
  8. the keyword 'this' is used to refer current object instance method

    ReplyDelete