CSCI4964/COMM4965 - Web Science

Project 1: Building the Web Graph


Due

This project is due at 11:59 PM (East Coast US) on Thursday March 6th. After that it is late.

Details of submitting the assignment will be posted the week before the assignment is due.

Late Policy

The project will be accepted up until 11:59 PM (East Coast US) on Monday, March 10 (yes, that is during break), at a considerable mark-down. After that it will not be accepted unless permission of the instructor is obtained prior to March 10.

Assignment

Objectives

The primary purpose of this assignment is to understand the portion of the Web that is, and is not, included in "Web Graph" analyses. Along the way, you will also learn:

Note: You may use any programming language you like, although Java is recommended (there will be a lecture in class on some of the JDK libraries that will be useful in building your crawler). If you use any other language, you must identify any packages, libraries, objects, APIs, etc. that you use, and must include a short textual description (in a separate file) describing what you did (i.e., how you modified or used the code).

Your mission

Your robot will be given one starting URL and a list of allowed hosts, and it will produce a Web graph representation (and some other information) by following the URLs found in the pages on those hosts. You may use any search order you choose (note that if you go after the extra credit, you may need a depth or node limit; you should not need one for the primary assignment - if you find even hundreds of pages, you're doing something wrong). A rough sketch of such a crawl loop is shown below; the details of the input and output formats follow.
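
As a rough illustration only, here is one way a breadth-first crawl loop might be organized in Java. The fetchPage and extractLinks helpers are hypothetical placeholders for HTTP and HTML-handling code you would write yourself; nothing about this sketch is required by the assignment.

import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Sketch of a breadth-first crawl. fetchPage and extractLinks are
// hypothetical helpers standing in for your own HTTP and HTML code.
public class CrawlSketch {

    public static void crawl(String startUrl, Set<String> allowedHosts) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(startUrl);

        while (!frontier.isEmpty()) {
            String current = frontier.poll();
            if (!visited.add(current)) {
                continue;                                 // already processed this URL
            }
            if (!allowedHosts.contains(new URL(current).getHost())) {
                continue;                                 // stay on the allowed hosts only
            }
            String html = fetchPage(current);             // hypothetical: HTTP GET the page
            for (String link : extractLinks(html, current)) {
                if (!visited.contains(link)) {
                    frontier.add(link);                   // queue newly discovered links
                }
            }
        }
    }

    // Placeholder stubs for the hypothetical helpers referenced above.
    static String fetchPage(String url) { return ""; }
    static List<String> extractLinks(String html, String baseUrl) { return List.of(); }
}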

The input file 

The input file (input.txt) consists of the starting point URL and the list of allowed hosts:
[STARTING POINT URL]
http://tw.rpi.edu/2008/CSCI4964/starting.html

[ALLOWED HOSTS]
tw.rpi.edu
inference-web.org
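
As a minimal sketch (assuming the file contains exactly the two bracketed sections shown above), the input could be read with the standard java.nio file APIs along these lines:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Sketch of reading input.txt, assuming only the two sections shown above.
public class InputReader {

    public static void main(String[] args) throws IOException {
        String startUrl = null;
        List<String> allowedHosts = new ArrayList<>();
        String section = "";

        for (String line : Files.readAllLines(Paths.get("input.txt"))) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;                                  // skip blank lines
            }
            if (line.startsWith("[")) {
                section = line;                            // track the current section header
            } else if (section.equals("[STARTING POINT URL]")) {
                startUrl = line;
            } else if (section.equals("[ALLOWED HOSTS]")) {
                allowedHosts.add(line);
            }
        }
        System.out.println("start: " + startUrl);
        System.out.println("hosts: " + allowedHosts);
    }
}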

Output Specification

Your robot must produce an output file with the following three sections (see the example output.txt). Each section starts with a section name surrounded by square brackets.

1. A list of all the different content types returned in the HTTP response headers for the URIs you have tested (a small Content-Type sketch follows this list). For example:
[CONTENT TYPES]
text/plain
text/html
application/rdf+xml

2. A section called [NODES]. Each node corresponds to a URL and its metadata. The description of a node should be printed on one line, with all values delimited by commas (a sketch of assigning node ids follows this list). For example:
[NODES]
1, 200, http://foo.example.org/, OK
2, 200, http://foo.example.org/page1, OK
3, 301, http://foo.example.org/page12, http://foo.example.org/page23
4, 900, mailto:hendler@cs.rpi.edu, mailto
5, 901, http://foo.example.org/disallow/page1, DISALLOW
6, 404, http://foo.example/page1, NG
7, 500, http://foo1.example.org/page1, UNKNOWN

3. A section called [ARCS], which is a list of the links between nodes. Each arc is a pair of node ids (from, to) delimited by a comma, and indicates a hyperlink from the first node to the second (a sketch of extracting href links follows this list). For example:
[ARCS]
1, 2
2, 6
2, 5
2, 2
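
For the [CONTENT TYPES] section, one plausible approach (not the only one) is to record the Content-Type header reported by HttpURLConnection for each URI you test; the snippet below is a sketch under that assumption and strips any charset suffix.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

// Sketch of recording distinct Content-Type values via HttpURLConnection.
public class ContentTypes {

    private static final Set<String> seen = new TreeSet<>();

    static void record(String address) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestMethod("HEAD");            // only the headers are needed here
        String type = conn.getContentType();      // e.g. "text/html; charset=UTF-8"
        if (type != null) {
            seen.add(type.split(";")[0].trim());  // drop any "; charset=..." suffix
        }
        conn.disconnect();
    }

    static void printSection() {
        System.out.println("[CONTENT TYPES]");
        seen.forEach(System.out::println);
    }
}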
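
For the [NODES] section you will need stable numeric ids. One simple way, shown here as a sketch rather than a required design, is to hand out ids in the order URLs are first seen, so the same numbers can be reused when writing the [ARCS] pairs.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of giving each URL a stable numeric id in first-seen order.
public class NodeIds {

    private static final Map<String, Integer> ids = new LinkedHashMap<>();

    static int idFor(String url) {
        Integer id = ids.get(url);
        if (id == null) {
            id = ids.size() + 1;   // next sequential id, starting at 1
            ids.put(url, id);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(idFor("http://foo.example.org/"));       // 1
        System.out.println(idFor("http://foo.example.org/page1"));  // 2
        System.out.println(idFor("http://foo.example.org/"));       // 1 again
    }
}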
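
For the [ARCS] section you need the hyperlinks found in each fetched page. A regular expression over href attributes, as sketched below, is one rough way to get them; it misses cases (unquoted or single-quoted attributes, for instance) that a real HTML parser would handle, so treat it as a starting point only.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough sketch of pulling href targets out of fetched HTML with a regex.
// Only double-quoted href attributes are matched here.
public class LinkExtractor {

    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    static List<String> hrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));               // the quoted value of each href
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://foo.example.org/page1\">p1</a>";
        System.out.println(hrefs(page));          // [http://foo.example.org/page1]
    }
}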

Extra Credit

If, and only if, you finish the above, you can get extra credit by participating in the following challenge. Using your crawler (suitably modified), starting from the node "http://www.cs.rpi.edu/" and staying only in the cs.rpi.edu domain (and observing the robots.txt), produce a list of email addresses you can scrape from the pages. Feel free to be as creative as you like in looking for them, as long as you observe the rules of being a polite robot (and if your robot is very fast, please include some sleep commands). Do NOT create the output above for the extra credit - just a list of email addresses and the web page each was found on.

To make this a bit more fun, the person finding the most (unique) email addresses will not only get bragging rights, but will also get 20 extra points on the midterm exam. Second place will get 10 extra points.
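
As a starting point only, email addresses can be matched with a loose regular expression, and politeness can be as simple as a Thread.sleep between requests. The pattern below is an assumption of what counts as an address and will not catch obfuscated ones (e.g. "name at cs dot rpi dot edu").

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch for the extra credit: a loose email regex plus a pause between requests.
public class EmailScraper {

    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    static Set<String> emailsIn(String pageText) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) {
            found.add(m.group());                 // collect each unique address once
        }
        return found;
    }

    static void politePause() throws InterruptedException {
        Thread.sleep(1000);                       // wait a second between requests
    }
}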

Some Hints