Late assignments will be accepted until April 9, 11:59p - not thereafter.
Last time a lot of people having problems did not come see me (or Dr. Ding). Please, if you need help let me know!
Your project is to build a simple search engine. Please note it has several interlinking parts that you will need to hand in separately. There are no restrictions with respect to using Java libraries this time, however if you use one from elsewhere you must provide a pointer to where it came from. You are also welcome to use C++, Perl or Python. If you want to use a different language than one of these four please check with me.
Here is what you must do:
You will be given a tarball which contains 500 files named page0 through page499. Each page contains a list of other pages which it points to (there may be anywhere from zero to three hundred of these, pages may refer back to themselves). Following this there will be a row of hyphens (always the same amount) and then a list of 20 words which the page contains (all pages will contain 20 words).
From these you need to create a program that produces two output files - a page rank file named PageRank and a reverse index file named ReverseIndex. Your program will then enter a loop which takes as inputs a list of one or more words and then returns the top ten hits (list of pages) for those words. If only one word is used, then the program should print the top ten pages for that word. If multiple words are used, then the program should print two sets of output - the top ten pages that contain all the words (AND semantics), and the top ten pages that contain any of those words (OR semantics).
Your program should loop until the input is *end* (including the asterisks).
Your program will take two command line argument - the page rank stopping difference and the "d" value, in that order, details below.
Please note that there is one special page -- it is page500. This is a page that is pointed to in the crawl, but for which there is no page file. Consider this page as if it was an image or other such document. Thus, it does participate in the computation, but it does not itself link out to anything. To handle this page you need to
- save page500 in your structure - calculate its pagerank (like all other pages) - include it in the pagerank list(note that this means there are 501 pages in the system -- i.e. that N=501.)
Note: the search should be case sensitive, and punctuation counts (i.e. "don't" is not the same as "dont") -- in essence, use string equality.
When you produce the output, please use ASCII order (i.e. uppercase letters befre lower case).
In class we briefly went over the page rank algorithm, we discussed the pseudocode:
PageRank(page[0..n-1], d,DIFF); 1 INITIALIZE: PR(i) = 1/N for all i; 2 diff = 0; For each P in Page PRbefore= PR(P); PR(P) = (1-d)/N + d * ((PR(t1)/CR(t1) + PR(t2)/CR(t2) ...) for all ti that point to P but are not equal to P (CR(ti) is the outbranching of ti) diff=diff + AbsoluteValue(PRbefore - PR(P); 3 if diff >= DIFF then goto 2 4 done.
Your program will take two arguments, the stopping difference and the "d" value, and compute the page rank for all 500 pages, stopping when the cumulative diff for all 500 goes below that number (a real number greater than zero) - note, be careful when debugging - for some settings of these two parameters your program could loop forever (we'll provide some useful values for debugging)
Your output should be a file with a list of the pages in page rank order (highest to lowest), for each page listing its outbranching (how many pages it connects to) and the page rank for that page (to 8 significant digits).
Example page500 137 .0023781 page503 196 .0022139 page1500 7 .0021101
Your program should also produce a file which shows its reverse index. That is, for every word (note: including commonly used ones) you should output the word followed by a list of the pages that include that word (please alphabetize the word list, the pages can be produced in any order you choose - but each page should only appear once!).
Example a page500 page522 page916 page803 an page999 page921 page998 page763 page1227 page501 page937 page862 page500 page522 page916 page803 page919 page997 at page900 page2000 page1227 page501 page937 page862 zuchini page1000
Given the datastructures containing the above, your search engine is quite simple. You take as input a list of words and you output the top ten pages (sorted by page rank) that contain them (if less than 10, then print all of them). When a single word is entered, just output the pages for that word. When multiple words are entered, output two lists - the top ten pages with all the words and the top ten pages with any of the words (multiple words do not effect the ranking here).
Example Enter Word: moose page1000 page2001 page2000 Enter Word: moose cow AND (moose cow) page1000 OR (moose cow) page1000 page2006 page2001 page1983 page776 page842 page777 page2000 page963 page871 Enter Word: *end*
Submission details will be posted a fewe days before the assignment is due.