CSCI4964/COMM4965 - Web Science
Project1: Building the Web Graph
Due
This project is due at 11:59 PM (East Coast US) on Thursday March 6th. After that it is late.
Details of submitting the assignment will be posted the week before the assignment is due.
Late Policy
The project will be accepted up until 11:59 PM (East Coast US) on Monday March 10 (yes, that is during break), at a considerable mark-down. After that it will not be accepted unless permission of the instructor is obtained prior to March 10.
Assignment
Objectives
The primary purpose of this assignment is to understand the portion of the Web that is, and is not, included in "Web Graph" analyses. Along the way, you will also learn:
- The basics of how a web crawler works
- What happens "below the browser" in the exciting world of Web protocols (i.e. how HTTP requests and responses really work).
- How to build a polite Web robot (and how not to build an impolite one).
Note: You may use any programming language you like, although Java is recommended (and there will be a lecture in class on some of the JDK libraries that will be useful in building your crawler). If you use any other language, you must identify any other packages, libraries, objects, APIs, etc. that you use, and must include a short textual description (in a separate file) describing what you did (i.e. how you modified or used the code).
Your mission
Your robot will be given one starting URL and a list of allowed hosts, and it will produce a Web graph representation (and some other information) by following the URLs found in the pages on those hosts. You may use any search order you choose (note that if you go after the extra credit, you may need a depth or node limit; you should not need one for the primary assignment - if you find even hundreds of pages, you're doing something wrong). Details:
- On each page, your robot only needs to find URLs indicated by "href" (see the extraction sketch after this list):
  - the value of "href" could be surrounded by single quotes or double quotes
  - note that a URI cannot contain white-space characters
  - the URL obtained from the value of "href" could be either an absolute or a relative URL (fragment) - in the latter case you will need to resolve it against the URL of the page it was found on
- Your robot should not visit the same URL twice (i.e. you need to do Loop/Dup elimination)
- Your robot must only visit URLs on the allowed hosts
- Your robot should stop when it cannot find any more URLs (see note above on limits)
- You must make sure that your HTTP requests are properly formed (see the request sketch after this list) and include:
  - http-version: use HTTP 1.0 instead of HTTP 1.1 (this simplifies the headers and reduces some potential errors)
  - user-agent: your robot should identify itself by name. That name must be "csci4964-<your official RPI email account>", where <your official RPI email account> must be replaced by your rcs account name (e.g. my robot is called csci4964-hendlj2).
- Your robot must not visit any pages that are disallowed by a "robots.txt" file.
- Upon receiving an HTTP response, the robot needs to process its header (not just the content) to be able to generate the output information specified below (this is important because many packages hide this detail from the user).
- You should extract URIs ONLY from pages with the MIME types "text/plain" or "text/html".
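For the href-extraction and relative-URL bullets above, here is a minimal extraction sketch in Java (the recommended language). The class and method names are only illustrative, not part of the assignment; it assumes you already have the page text as a String.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pull href values (single- or double-quoted, no white space)
// out of a page and resolve them against the page's own URL.
public class LinkExtractor {

    // href = "value"  or  href = 'value'
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*([\"'])([^\\s\"']+)\\1",
                        Pattern.CASE_INSENSITIVE);

    public static Set<String> extract(String pageUrl, String html)
            throws MalformedURLException {
        URL base = new URL(pageUrl);
        // A Set does duplicate elimination within the page; the crawler
        // still needs its own "already visited" set across pages.
        Set<String> found = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String value = m.group(2);
            try {
                // The two-argument constructor resolves relative URLs
                // against the base and leaves absolute URLs alone.
                found.add(new URL(base, value).toExternalForm());
            } catch (MalformedURLException e) {
                // Schemes java.net.URL does not know (e.g. javascript:)
                // end up here; a real crawler might still record them.
                found.add(value);
            }
        }
        return found;
    }
}

The two-argument URL constructor does the relative-to-absolute resolution for you; everything else (visiting, status codes, output) still belongs in your own crawler loop.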
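For the request-format, robots.txt, and header-processing bullets, here is a hedged request sketch that issues an HTTP/1.0 GET over a raw socket so the status line, Content-Type, and Location headers stay visible instead of being hidden by a library. The user-agent value is a placeholder - substitute your own rcs account name - and the robots.txt helper only handles plain "Disallow:" prefixes (it ignores User-agent grouping, which over-blocks but is only a starting point).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.net.URL;

public class RawHttpFetcher {

    // Issue a bare HTTP/1.0 GET and print the pieces the assignment cares
    // about: the status code, the Content-Type, and (for 3xx) the Location.
    public static void fetch(String urlString) throws Exception {
        URL url = new URL(urlString);
        int port = (url.getPort() == -1) ? 80 : url.getPort();
        String file = url.getFile().isEmpty() ? "/" : url.getFile();

        try (Socket socket = new Socket(url.getHost(), port)) {
            Writer out = new OutputStreamWriter(socket.getOutputStream());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));

            // HTTP/1.0 request; the Host header is optional in 1.0 but
            // harmless, and the User-Agent is required by the assignment.
            out.write("GET " + file + " HTTP/1.0\r\n");
            out.write("Host: " + url.getHost() + "\r\n");
            out.write("User-Agent: csci4964-yourRcsIdHere\r\n"); // <-- replace
            out.write("\r\n");
            out.flush();

            // Status line looks like "HTTP/1.0 200 OK".
            String statusLine = in.readLine();
            int status = Integer.parseInt(statusLine.split("\\s+")[1]);

            // Headers end at the first blank line.
            String contentType = null, location = null, line;
            while ((line = in.readLine()) != null && line.length() > 0) {
                int colon = line.indexOf(':');
                if (colon < 0) continue;
                String name = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                if (name.equals("content-type")) {
                    contentType = value.split(";")[0].trim(); // drop "; charset=..."
                } else if (name.equals("location")) {
                    location = value;
                }
            }

            System.out.println(status + " " + contentType + " " + location);
            // The body follows the blank line; read it from 'in' only when
            // the content type is text/plain or text/html.
        }
    }

    // Very simple robots.txt check: true if 'path' begins with any
    // "Disallow:" prefix in a previously fetched robots.txt body.
    // Real robots.txt files can be more involved - this is only a sketch.
    public static boolean isDisallowed(String robotsTxt, String path) {
        for (String line : robotsTxt.split("\r?\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                if (!prefix.isEmpty() && path.startsWith(prefix)) return true;
            }
        }
        return false;
    }
}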
The input file
The input file consists of the starting point URL and the list of allowed websites (input.txt).
[STARTING POINT URL]
http://tw.rpi.edu/2008/CSCI4964/starting.html
[ALLOWED HOSTS]
tw.rpi.edu
inference-web.org
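One possible way to read this file, assuming the two section headers appear exactly as shown above (a sketch, not required code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

// Sketch of reading input.txt: one starting URL under [STARTING POINT URL]
// and one host per line under [ALLOWED HOSTS].
public class InputFile {
    public String startingUrl;
    public Set<String> allowedHosts = new HashSet<>();

    public static InputFile read(String path) throws Exception {
        InputFile result = new InputFile();
        String section = "", line;
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.startsWith("[")) { section = line; continue; }
                if (section.equals("[STARTING POINT URL]")) {
                    result.startingUrl = line;
                } else if (section.equals("[ALLOWED HOSTS]")) {
                    result.allowedHosts.add(line.toLowerCase());
                }
            }
        }
        return result;
    }
}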
Output Specification
Your robot must produce an output file with the following three sections (see the example output.txt). Each section starts with a section name surrounded by square brackets.
1. A list of all the different content-types returned from the HTTP response headers on the URIs you have tested:
[CONTENT TYPES]
text/plain
text/html
application/xml+rdf
2. A section called [NODES]. Each node corresponds to a URL and its metadata (see below):
  - ID - an automatically generated unique positive integer for the URL
  - CODE
    - 900 - if the URL's scheme is different from "http" (case-insensitive)
    - 901 - if the URL is "disallowed" by the corresponding robots.txt
    - 902 - if the URL is not hosted by any of the allowed web servers
    - otherwise, the status code returned in the HTTP response
  - URL - the URL found by the crawler
  - DESCRIPTION
    - the URL's scheme - for 900 (see above)
    - DISALLOW - for 901 (see above)
    - OK - for a 200s response code
    - the redirected URL - for a 300s response code
    - NG - for any 400s response code
    - UNKNOWN - for none of the above
The description of a node should be printed on one line, and all the values should be delimited by commas. For example:
[NODES]
1, 200, http://foo.example.org/, OK
2, 200, http://foo.example.org/page1, OK
3, 301, http://foo.example.org/page12, http://foo.example.org/page23
4, 900, mailto:hendler@cs.rpi.edu, mailto
5, 901, http://foo.example.org/disallow/page1, DISALLOW
6, 404, http://foo.example/page1, NG
7, 500, http://foo1.example.org/page1, UNKNOWN
3. A section called [ARCS], which is a list of the links between nodes. Each arc consists of a pair of node IDs (from, to), delimited by a comma; the pair indicates a hyperlink from the first node's page to the second. For example:
[ARCS]
1, 2
2, 6
2, 5
2, 2
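To tie the output specification together, here is one hedged sketch of the data structures and the write-out step. The Node fields, class names, and the describe helper are only an illustration of the CODE/DESCRIPTION mapping given above, not a required design.

import java.io.PrintWriter;
import java.util.List;
import java.util.Set;

// One possible shape for the crawler's results and the three-section
// output file.  Field and class names are illustrative only.
public class OutputWriter {

    public static class Node {
        int id;              // unique positive integer
        int code;            // 900/901/902 or the HTTP status code
        String url;          // the URL as found by the crawler
        String description;  // scheme, DISALLOW, OK, redirect target, NG, UNKNOWN
    }

    public static class Arc {
        int from, to;        // node ids: a hyperlink from 'from' to 'to'
    }

    // Maps a node's code (and redirect target, if any) to the DESCRIPTION
    // column, following the rules in the assignment.
    public static String describe(int code, String scheme, String redirectUrl) {
        if (code == 900) return scheme;          // non-http scheme
        if (code == 901) return "DISALLOW";      // blocked by robots.txt
        if (code >= 200 && code < 300) return "OK";
        if (code >= 300 && code < 400) return redirectUrl;
        if (code >= 400 && code < 500) return "NG";
        return "UNKNOWN";                        // 902, 5xx, anything else
    }

    public static void write(String path, Set<String> contentTypes,
                             List<Node> nodes, List<Arc> arcs) throws Exception {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("[CONTENT TYPES]");
            for (String ct : contentTypes) out.println(ct);

            out.println("[NODES]");
            for (Node n : nodes)
                out.println(n.id + ", " + n.code + ", " + n.url + ", " + n.description);

            out.println("[ARCS]");
            for (Arc a : arcs)
                out.println(a.from + ", " + a.to);
        }
    }
}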
Extra Credit
If, and only if, you finish the above, you can get extra credit by participating in the following challenge. Using your crawler (suitably modified), the starting node "http://www.cs.rpi.edu/" and staying only in the cs.rpi.edu domain (and observing the robots.txt), produce a list of email addresses you can scrape from the pages. Feel free to be as creative as you like in looking for them, as long as you observe the rules of being a polite robot (and if your robot is very fast, please include some sleep commands). Do NOT create the output above for the extra credit - just a list of email addresses and the web page they were found on.
To make this a bit more fun, the person finding the most (unique) email addresses will not only get bragging rights, but also will get 20 extra points on the midterm exam. Second place will get 10 extra points.
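If you attempt the challenge, a simple (and deliberately loose) pattern like the one below will catch plainly written addresses; pages that obfuscate addresses ("name at cs dot rpi dot edu") are where the creativity comes in. The sleep call is the politeness measure mentioned above, and the one-second value is just a suggestion.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch for the extra-credit challenge: collect plainly written email
// addresses from a page's text.  The pattern is intentionally simple and
// will miss obfuscated addresses.
public class EmailScraper {

    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static Set<String> scrape(String pageText) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) emails.add(m.group().toLowerCase());
        return emails;
    }

    // Politeness: pause between requests so a fast robot does not hammer
    // the server (the assignment asks for this explicitly).
    public static void politePause() throws InterruptedException {
        Thread.sleep(1000);   // one second between fetches; tune as needed
    }
}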
Some Hints
- It may be useful to use the online tool http://web-sniffer.net/ to see the HTTP request header, HTTP response header, and HTTP response content for any URL. (Better than the LiveHeaders we've been using in class.)
- There is a nice web site describing HTTP redirection which may help.
- Wikipedia has a good article on Web crawlers that includes a section on open-source Web crawlers, but be careful: if you use these, they are likely to hide some of the details that you need to see, so you may end up having more trouble modifying those programs than writing your own.