CSCI4964/COMM4965 - Web Science
Project1: Building the Web Graph
Due
This project is due at 11:59 PM (East Coast US) on Thursday March 6th. After that it is late.
Details of submitting the assignment will be posted the week before the assignment is due.
Late Policy
The project will be accepted up until 11:59 PM (East Coast US) on Monday March 10 (yes, that is during break), at a considerable mark-down. After that it will not be accepted unless permission of the instructor is obtained prior to March 10.
Assignment
Objectives
The primary purpose of this assignment is to understand the portion of the Web that is, and is not, included in "Web Graph" analyses. Along the way, you will also learn:
- The basics of how a web crawler works
- What happens "below the browser" in the exciting world of Web protocols (i.e. how HTTP requests and responses really work).
- How to build a polite Web robot (and how not to build an impolite one).
Note: You may use any programming language you like, although Java is recommended (and there will be a lecture in class on some of the JDK libraries that will be useful in building your crawler). If you use any other language, you must identify any other packages, libraries, objects, APIs, etc. that you use, and must include a short textual description (in a separate file) describing what you did (i.e. how you modified or used the code).
Your mission
Your robot will be given one starting URL and a list of allowed hosts, and it will produce a Web graph representation (and some other information) by following the URLs found in the pages on those hosts. You may use any search order you choose (note that if you go after the extra credit, you may need a depth or node limit; you should not need one for the primary assignment - if you find even hundreds of pages, you're doing something wrong). Details:
- On each page, your robot only needs to find URLs indicated by "href" (see the extraction sketch after this list):
  - the value of "href" could be surrounded by single quotes or double quotes
  - note that a URI cannot contain white-space characters
  - the URL obtained from the value of "href" could be either an absolute or a relative URL (fragment) - in the latter case you will need to resolve it against the URL of the page it was found on
- Your robot should not visit the same URL twice (i.e. you need to do Loop/Dup elimination)
- Your robot must only visit URLs on the allowed hosts
- Your robot should stop when it cannot find any more URLs (see note above on limits)
- You must make sure that your HTTP requests are properly formed (see the request sketch after this list) and include:
  - http-version: use HTTP 1.0 instead of HTTP 1.1 (this simplifies the headers and reduces some potential errors)
  - user-agent: your robot should identify itself by name. That name must be "csci4964-<your official RPI email account>", where <your official RPI email account> must be replaced by your rcs account name (e.g. my robot is called csci4964-hendlj2).
- Your robot must not visit any pages that are disallowed by a "robots.txt" file.
- Upon receiving an HTTP response, the robot needs to process its header (not just the content) to be able to generate the output information specified below (this is important because many packages hide this detail from the user).
- You should extract URIs ONLY from pages with the MIME types "text/plain" or "text/html".
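For the href-extraction and relative-URL bullets above, here is a minimal extraction sketch in Java (the recommended language). The class and method names are only illustrative, not part of the assignment; it assumes you already have the page text as a String.

import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pull href values (single- or double-quoted, no white space)
// out of a page and resolve them against the page's own URL.
public class LinkExtractor {

    // href = "value"  or  href = 'value'
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*([\"'])([^\\s\"']+)\\1",
                        Pattern.CASE_INSENSITIVE);

    public static Set<String> extract(String pageUrl, String html)
            throws MalformedURLException {
        URL base = new URL(pageUrl);
        // A Set does duplicate elimination within the page; the crawler
        // still needs its own "already visited" set across pages.
        Set<String> found = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String value = m.group(2);
            try {
                // The two-argument constructor resolves relative URLs
                // against the base and leaves absolute URLs alone.
                found.add(new URL(base, value).toExternalForm());
            } catch (MalformedURLException e) {
                // Schemes java.net.URL does not know (e.g. javascript:)
                // end up here; a real crawler might still record them.
                found.add(value);
            }
        }
        return found;
    }
}

The two-argument URL constructor does the relative-to-absolute resolution for you; everything else (visiting, status codes, output) still belongs in your own crawler loop.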
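For the request-format, robots.txt, and header-processing bullets, here is a hedged request sketch that issues an HTTP/1.0 GET over a raw socket so the status line, Content-Type, and Location headers stay visible instead of being hidden by a library. The user-agent value is a placeholder - substitute your own rcs account name - and the robots.txt helper only handles plain "Disallow:" prefixes (it ignores User-agent grouping, which over-blocks but is only a starting point).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.net.URL;

public class RawHttpFetcher {

    // Issue a bare HTTP/1.0 GET and print the pieces the assignment cares
    // about: the status code, the Content-Type, and (for 3xx) the Location.
    public static void fetch(String urlString) throws Exception {
        URL url = new URL(urlString);
        int port = (url.getPort() == -1) ? 80 : url.getPort();
        String file = url.getFile().isEmpty() ? "/" : url.getFile();

        try (Socket socket = new Socket(url.getHost(), port)) {
            Writer out = new OutputStreamWriter(socket.getOutputStream());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));

            // HTTP/1.0 request; the Host header is optional in 1.0 but
            // harmless, and the User-Agent is required by the assignment.
            out.write("GET " + file + " HTTP/1.0\r\n");
            out.write("Host: " + url.getHost() + "\r\n");
            out.write("User-Agent: csci4964-yourRcsIdHere\r\n"); // <-- replace
            out.write("\r\n");
            out.flush();

            // Status line looks like "HTTP/1.0 200 OK".
            String statusLine = in.readLine();
            int status = Integer.parseInt(statusLine.split("\\s+")[1]);

            // Headers end at the first blank line.
            String contentType = null, location = null, line;
            while ((line = in.readLine()) != null && line.length() > 0) {
                int colon = line.indexOf(':');
                if (colon < 0) continue;
                String name = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                if (name.equals("content-type")) {
                    contentType = value.split(";")[0].trim(); // drop "; charset=..."
                } else if (name.equals("location")) {
                    location = value;
                }
            }

            System.out.println(status + " " + contentType + " " + location);
            // The body follows the blank line; read it from 'in' only when
            // the content type is text/plain or text/html.
        }
    }

    // Very simple robots.txt check: true if 'path' begins with any
    // "Disallow:" prefix in a previously fetched robots.txt body.
    // Real robots.txt files can be more involved - this is only a sketch.
    public static boolean isDisallowed(String robotsTxt, String path) {
        for (String line : robotsTxt.split("\r?\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                if (!prefix.isEmpty() && path.startsWith(prefix)) return true;
            }
        }
        return false;
    }
}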
The input file
The input file consists of the starting point URL and the list of allowed websites (input.txt).
[STARTING POINT URL]
http://tw.rpi.edu/2008/CSCI4964/starting.html
[ALLOWED HOSTS]
tw.rpi.edu
inference-web.org
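One possible way to read this file, assuming the two section headers appear exactly as shown above (a sketch, not required code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
import java.util.Set;

// Sketch of reading input.txt: one starting URL under [STARTING POINT URL]
// and one host per line under [ALLOWED HOSTS].
public class InputFile {
    public String startingUrl;
    public Set<String> allowedHosts = new HashSet<>();

    public static InputFile read(String path) throws Exception {
        InputFile result = new InputFile();
        String section = "", line;
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                if (line.startsWith("[")) { section = line; continue; }
                if (section.equals("[STARTING POINT URL]")) {
                    result.startingUrl = line;
                } else if (section.equals("[ALLOWED HOSTS]")) {
                    result.allowedHosts.add(line.toLowerCase());
                }
            }
        }
        return result;
    }
}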
Output Specification
Your robot must produce an output file with the following three sections (see the example output.txt). Each section starts with a section name surrounded by square brackets.
1. A list of all the different content-types returned from the HTTP response headers on the URIs you have tested:
[CONTENT TYPES]
text/plain
text/html
application/xml+rdf
2. A section called [NODES]. Each node corresponds to a URL and its metadata (see below):
  - ID - an automatically generated unique positive integer for the URL
  - CODE
    - 900 - if the URL's scheme is different from "http" (case-insensitive)
    - 901 - if the URL is "disallowed" by the corresponding robots.txt
    - 902 - if the URL is not hosted by any of the allowed web servers
    - otherwise, the status code returned in the HTTP response
  - URL - the URL found by the crawler
  - DESCRIPTION
    - the URL's scheme - for 900 (see above)
    - DISALLOW - for 901 (see above)
    - OK - for a 200s response code
    - the redirected URL - for a 300s response code
    - NG - for any 400s response code
    - UNKNOWN - for none of the above
The description of a node should be printed on one line, and all the values should be delimited by commas. For example:
[NODES]
1, 200, http://foo.example.org/, OK
2, 200, http://foo.example.org/page1, OK
3, 301, http://foo.example.org/page12, http://foo.example.org/page23
4, 900, mailto:hendler@cs.rpi.edu, mailto
5, 901, http://foo.example.org/disallow/page1, DISALLOW
6, 404, http://foo.example/page1, NG
7, 500, http://foo1.example.org/page1, UNKNOWN
3. A section called [ARCS], which is a list of the links between nodes. Each arc consists of a pair of node IDs (from, to), delimited by a comma; the pair indicates a hyperlink from the first node's page to the second. For example:
[ARCS]
1, 2
2, 6
2, 5
2, 2
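To tie the output specification together, here is one hedged sketch of the data structures and the write-out step. The Node fields, class names, and the describe helper are only an illustration of the CODE/DESCRIPTION mapping given above, not a required design.

import java.io.PrintWriter;
import java.util.List;
import java.util.Set;

// One possible shape for the crawler's results and the three-section
// output file.  Field and class names are illustrative only.
public class OutputWriter {

    public static class Node {
        int id;              // unique positive integer
        int code;            // 900/901/902 or the HTTP status code
        String url;          // the URL as found by the crawler
        String description;  // scheme, DISALLOW, OK, redirect target, NG, UNKNOWN
    }

    public static class Arc {
        int from, to;        // node ids: a hyperlink from 'from' to 'to'
    }

    // Maps a node's code (and redirect target, if any) to the DESCRIPTION
    // column, following the rules in the assignment.
    public static String describe(int code, String scheme, String redirectUrl) {
        if (code == 900) return scheme;          // non-http scheme
        if (code == 901) return "DISALLOW";      // blocked by robots.txt
        if (code >= 200 && code < 300) return "OK";
        if (code >= 300 && code < 400) return redirectUrl;
        if (code >= 400 && code < 500) return "NG";
        return "UNKNOWN";                        // 902, 5xx, anything else
    }

    public static void write(String path, Set<String> contentTypes,
                             List<Node> nodes, List<Arc> arcs) throws Exception {
        try (PrintWriter out = new PrintWriter(path)) {
            out.println("[CONTENT TYPES]");
            for (String ct : contentTypes) out.println(ct);

            out.println("[NODES]");
            for (Node n : nodes)
                out.println(n.id + ", " + n.code + ", " + n.url + ", " + n.description);

            out.println("[ARCS]");
            for (Arc a : arcs)
                out.println(a.from + ", " + a.to);
        }
    }
}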
Extra Credit
If, and only if, you finish the above, you can get extra credit by participating in the following challenge. Using your crawler (suitably modified), the starting node "http://www.cs.rpi.edu/" and staying only in the cs.rpi.edu domain (and observing the robots.txt), produce a list of email addresses you can scrape from the pages. Feel free to be as creative as you like in looking for them, as long as you observe the rules of being a polite robot (and if your robot is very fast, please include some sleep commands). Do NOT create the output above for the extra credit - just a list of email addresses and the web page they were found on.
To make this a bit more fun, the person finding the most (unique) email addresses will not only get bragging rights, but also will get 20 extra points on the midterm exam. Second place will get 10 extra points.
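If you attempt the challenge, a simple (and deliberately loose) pattern like the one below will catch plainly written addresses; pages that obfuscate addresses ("name at cs dot rpi dot edu") are where the creativity comes in. The sleep call is the politeness measure mentioned above, and the one-second value is just a suggestion.

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch for the extra-credit challenge: collect plainly written email
// addresses from a page's text.  The pattern is intentionally simple and
// will miss obfuscated addresses.
public class EmailScraper {

    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    public static Set<String> scrape(String pageText) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(pageText);
        while (m.find()) emails.add(m.group().toLowerCase());
        return emails;
    }

    // Politeness: pause between requests so a fast robot does not hammer
    // the server (the assignment asks for this explicitly).
    public static void politePause() throws InterruptedException {
        Thread.sleep(1000);   // one second between fetches; tune as needed
    }
}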
Some Hints
- It may be useful to use the online tool http://web-sniffer.net/ to see the HTTP request header, HTTP response header, and HTTP response content for any URL. (Better than the LiveHeaders we've been using in class.)
- There is a nice web site describing HTTP redirection which may help.
- Wikipedia has a good article on Web crawlers that includes a section on open-source Web crawlers, but be careful: if you use these, they are likely to hide some of the details that you need to see, so you may end up having more trouble modifying those programs than writing your own.