PROJECT DETAILS

Due Date: April 7, 11:59p.

Late assignments will be accepted until April 9, 11:59p - not thereafter.

Last time a lot of people having problems did not come see me (or Dr. Ding). Please, if you need help let me know!


Your project is to build a simple search engine. Please note it has several interlinking parts that you will need to hand in separately. There are no restrictions with respect to using Java libraries this time, however if you use one from elsewhere you must provide a pointer to where it came from. You are also welcome to use C++, Perl or Python. If you want to use a different language than one of these four please check with me.

Here is what you must do:

You will be given a tarball which contains 500 files named page0 through page499. Each page contains a list of other pages which it points to (there may be anywhere from zero to three hundred of these, pages may refer back to themselves). Following this there will be a row of hyphens (always the same amount) and then a list of 20 words which the page contains (all pages will contain 20 words).

From these you need to create a program that produces two output files - a page rank file named PageRank and a reverse index file named ReverseIndex. Your program will then enter a loop which takes as inputs a list of one or more words and then returns the top ten hits (list of pages) for those words. If only one word is used, then the program should print the top ten pages for that word. If multiple words are used, then the program should print two sets of output - the top ten pages that contain all the words (AND semantics), and the top ten pages that contain any of those words (OR semantics).

Your program should loop until the input is *end* (including the asterisks).

Your program will take two command line argument - the page rank stopping difference and the "d" value, in that order, details below.

Please note that there is one special page -- it is page500. This is a page that is pointed to in the crawl, but for which there is no page file. Consider this page as if it was an image or other such document. Thus, it does participate in the computation, but it does not itself link out to anything. To handle this page you need to

  - save page500 in your structure
   - calculate its pagerank (like all other pages)
   - include it in the pagerank list
(note that this means there are 501 pages in the system -- i.e. that N=501.)

Some Other Details

Note: the search should be case sensitive, and punctuation counts (i.e. "don't" is not the same as "dont") -- in essence, use string equality.

When you produce the output, please use ASCII order (i.e. uppercase letters befre lower case).

1) Page Rank list

In class we briefly went over the page rank algorithm, we discussed the pseudocode:

PageRank(page[0..n-1], d,DIFF);
 1 INITIALIZE:
     PR(i) = 1/N for all i;
 2  diff = 0;
   For each P in Page
    PRbefore= PR(P);
    PR(P) = (1-d)/N + d * ((PR(t1)/CR(t1) + PR(t2)/CR(t2) ...)  
           for all ti that point to P 
	   but are not equal to P
      (CR(ti) is the outbranching of ti)
    diff=diff + AbsoluteValue(PRbefore - PR(P);
 3 if diff >= DIFF then goto 2
 4 done.

Your program will take two arguments, the stopping difference and the "d" value, and compute the page rank for all 500 pages, stopping when the cumulative diff for all 500 goes below that number (a real number greater than zero) - note, be careful when debugging - for some settings of these two parameters your program could loop forever (we'll provide some useful values for debugging)

Your output should be a file with a list of the pages in page rank order (highest to lowest), for each page listing its outbranching (how many pages it connects to) and the page rank for that page (to 8 significant digits).

Example
page500  137  .0023781
page503  196  .0022139
page1500   7  .0021101

2) Reverse index

Your program should also produce a file which shows its reverse index. That is, for every word (note: including commonly used ones) you should output the word followed by a list of the pages that include that word (please alphabetize the word list, the pages can be produced in any order you choose - but each page should only appear once!).

Example
a page500 page522 page916 page803
an page999 page921 page998 page763 page1227 page501 page937 page862 page500 page522 page916 page803 page919 page997
at page900 page2000 page1227 page501 page937 page862 
zuchini page1000

3) Search engine

Given the datastructures containing the above, your search engine is quite simple. You take as input a list of words and you output the top ten pages (sorted by page rank) that contain them (if less than 10, then print all of them). When a single word is entered, just output the pages for that word. When multiple words are entered, output two lists - the top ten pages with all the words and the top ten pages with any of the words (multiple words do not effect the ranking here).

Example
Enter Word: moose
page1000 page2001 page2000 

Enter Word: moose cow
AND (moose cow) page1000
OR (moose cow) page1000 page2006 page2001 page1983 page776 page842 page777 page2000 page963 page871

Enter Word: *end*

submission details

Submission details will be posted a fewe days before the assignment is due.