CSCI4964: Crawling the Web
0. Prologue
Tim Berners-Lee's talk on Web Day 2007
1. From Web Surfing to Web Crawling
1.1 Surfing the web
- The first HTTP web page was created by Tim Berners-Lee in late 1990 (see Examples of early WWW hypertext).
- Our personal web surfing experience is mainly enabled by web browsers, hyperlinks, and the HTTP protocol.
- Web surfing always needs a seed URL as a starting point. In the early days, we used paperback Internet Yellow Pages.
- Then the Web became huge, and web search engines (such as Google and Yahoo!) put a huge number of web pages just a few surfing steps away from us! They collect URLs by crawling the web.
1.2 Crawling the web
- Web crawlers automatically collect web pages; crawling is one of the key technologies behind web search engines
2. Building a Simple Java Web Crawler
Here is a ten-year-old online tutorial: http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/
(by Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC, Jan 1998). Beyond the example code, we need to handle the following issues:
2.1 Java Issues
Java has evolved over the past 10 years, so we should use new
technologies and avoid deprecated code. You should consider using JDK 5
or a higher version.
Below are some useful classes from JDK:
- Vector => Set, List (see the sketch after this list)
- URL
- URLConnection
- HttpURLConnection
- String functions, e.g. regular expression matching
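For instance, the tutorial's Vector-based bookkeeping can be replaced with the Collections framework. Below is a minimal sketch (class and variable names are illustrative) that keeps the frontier of URLs to visit in a List and the already-seen URLs in a HashSet:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

public class FrontierExample {
    public static void main(String[] args) {
        // Frontier of URLs still to visit (the tutorial used Vector for this).
        List<String> frontier = new LinkedList<String>();
        // URLs already seen; HashSet membership tests are O(1), unlike Vector scans.
        Set<String> visited = new HashSet<String>();

        frontier.add("http://example.com/");
        while (!frontier.isEmpty()) {
            String url = frontier.remove(0);
            if (!visited.add(url)) {
                continue; // already seen, skip it
            }
            System.out.println("would fetch: " + url);
            // ... fetch the page, extract links, add unseen ones to the frontier ...
        }
    }
}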
2.2 HTTP content negotiation
Check out this example:
URLConnection con = url.openConnection();
con.setRequestProperty("User-Agent", HTTP_USER_AGENT);
We need to set properties in the HTTP request header (e.g. User-Agent).
We need to read properties from the HTTP response header:
- Content-Type
- HTTP response code
task.m_conn = (HttpURLConnection)connection;
// process http response information
task.m_nHttpResponseCode = task.m_conn.getResponseCode();
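Putting these pieces together, a self-contained version of this step might look like the following sketch (the URL and the User-Agent value are placeholders; pick a string that identifies your crawler):

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderExample {
    // Illustrative value; use whatever identifies your crawler.
    static final String HTTP_USER_AGENT = "CSCI4964-Crawler/0.1";

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://example.com/");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();

        // Request header: identify ourselves and state what we accept.
        con.setRequestProperty("User-Agent", HTTP_USER_AGENT);
        con.setRequestProperty("Accept", "text/html");

        // Response header: check status and content type before downloading.
        int responseCode = con.getResponseCode();   // e.g. 200, 301, 404
        String contentType = con.getContentType();  // e.g. "text/html; charset=UTF-8"
        System.out.println(responseCode + " " + contentType);

        con.disconnect();
    }
}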
2.3 Download content of web page
be careful with character encoding when converting a byte array into a String
(not necessary in this assignment, but important in real-world practice)
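One possible approach, sketched below under the assumption that the charset is declared in the Content-Type header: read the raw bytes first, then decode them with the declared charset, falling back to ISO-8859-1 (the HTTP/1.1 default for text types):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class DownloadExample {
    public static String download(URL url) throws IOException {
        URLConnection con = url.openConnection();
        // Naive charset extraction from e.g. "text/html; charset=UTF-8";
        // a real crawler would parse the header more carefully.
        String contentType = con.getContentType();
        String charset = "ISO-8859-1";
        if (contentType != null && contentType.contains("charset=")) {
            charset = contentType.substring(contentType.indexOf("charset=") + 8).trim();
        }

        // Read the raw bytes first, then decode with the right charset.
        InputStream in = con.getInputStream();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        in.close();
        return new String(buf.toByteArray(), charset);
    }
}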
2.4 Extract URLs from web page
typical pattern of hyperlinks
<a href="http://foo.com/"> example </a>
URL parsing approaches, e.g. simple string matching or regular expression matching (see the sketch after this list)
URL normalization
- /example.html -- relative URL
- http://foo.com/example.html -- absolute URL
- http://www.google.com/search?q=test -- URL with query part
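A rough sketch combining both steps (the regular expression is deliberately simplistic and would miss many real-world link forms): extract href attributes with a regex, then normalize relative URLs against the page's base URL using the two-argument java.net.URL constructor:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Simplistic pattern: matches href="..." inside <a> tags; real pages
    // also use single quotes, unquoted values, extra whitespace, etc.
    private static final Pattern HREF =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    public static void extract(URL base, String html) throws MalformedURLException {
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // The two-argument URL constructor resolves relative URLs
            // (e.g. "/example.html") against the base page's URL.
            URL absolute = new URL(base, m.group(1));
            System.out.println(absolute.toExternalForm());
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://foo.com/dir/page.html");
        extract(base, "<a href=\"/example.html\"> example </a>"
                    + "<a href=\"http://www.google.com/search?q=test\">search</a>");
    }
}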
2.5 When to skip a URL (politeness and other issues)
honor robots.txt
only visit URLs hosted by certain websites (see the sketch below)
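A minimal filter combining both checks might look like the following sketch (it assumes the disallowed path prefixes have already been parsed out of robots.txt, which is not shown):

import java.net.URL;
import java.util.Set;

public class PolitenessFilter {
    private final Set<String> allowedHosts;     // e.g. {"cs.rpi.edu"}
    private final Set<String> disallowedPaths;  // path prefixes taken from robots.txt

    public PolitenessFilter(Set<String> allowedHosts, Set<String> disallowedPaths) {
        this.allowedHosts = allowedHosts;
        this.disallowedPaths = disallowedPaths;
    }

    public boolean shouldVisit(URL url) {
        // Stay within the websites we are allowed to crawl.
        if (!allowedHosts.contains(url.getHost())) {
            return false;
        }
        // Skip paths that robots.txt disallows (parsing robots.txt itself
        // is omitted; this only checks the resulting prefixes).
        for (String prefix : disallowedPaths) {
            if (url.getPath().startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}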
2.6 Scheduling
avoid revisiting URLs (e.g. after content negotiation)
which URLs in the "frontier" pool should be visited first
termination conditions
avoid crawler traps (see the sketch below)
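A minimal scheduling loop illustrating these points (breadth-first order, a visited set, and a page-count limit as one crude defense against crawler traps) might look like:

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class Scheduler {
    public static void crawl(String seed, int maxPages) {
        Queue<String> frontier = new LinkedList<String>(); // FIFO => breadth-first
        Set<String> visited = new HashSet<String>();
        frontier.add(seed);

        int fetched = 0;
        // Termination condition: stop after maxPages pages, or when the
        // frontier runs dry. A bounded page count also limits the damage
        // from crawler traps (sites that generate endless unique URLs).
        while (!frontier.isEmpty() && fetched < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // do not revisit URLs we have already fetched
            }
            fetched++;
            System.out.println("fetch " + fetched + ": " + url);
            // ... download the page, extract links, offer them to the frontier ...
        }
    }
}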
3. Advanced Topics
3.1 Revisit Policy
A web page may change over time (see, for example, the CS department home page)
How frequently should a crawler revisit a discovered URL to update its status?
3.2 Web page duplication detection
avoid indexing the same document again and again
handle URLs containing query parameters; the order of the parameters may cause an exponential number of duplicate URLs (see the sketch below)
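One common mitigation, sketched below, is to canonicalize each URL before adding it to the frontier, e.g. by sorting its query parameters so that parameter order no longer matters:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;

public class UrlNormalizer {
    // Canonicalize a URL by sorting its query parameters, so that
    // ...?a=1&b=2 and ...?b=2&a=1 map to the same string.
    public static String canonicalize(String spec) throws MalformedURLException {
        URL url = new URL(spec);
        String query = url.getQuery();
        if (query == null) {
            return spec;
        }
        String[] params = query.split("&");
        Arrays.sort(params);
        StringBuilder sb = new StringBuilder();
        for (String p : params) {
            if (sb.length() > 0) sb.append('&');
            sb.append(p);
        }
        return spec.substring(0, spec.indexOf('?') + 1) + sb;
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(canonicalize("http://foo.com/search?q=test&lang=en"));
        System.out.println(canonicalize("http://foo.com/search?lang=en&q=test"));
        // Both print: http://foo.com/search?lang=en&q=test
    }
}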
3.3 Advanced crawling scheduling
- depth first
- breadth first
- focused crawling
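The difference between depth-first and breadth-first ordering comes down to whether the frontier behaves as a stack (LIFO) or a queue (FIFO). A small sketch using java.util.Deque (available since JDK 6):

import java.util.ArrayDeque;
import java.util.Deque;

public class FrontierOrder {
    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<String>();
        frontier.add("http://foo.com/a");
        frontier.add("http://foo.com/b");

        // Breadth-first: take from the head of the queue (FIFO).
        String bfsNext = frontier.pollFirst(); // "http://foo.com/a"

        // Depth-first: take from the tail, treating the deque as a stack (LIFO).
        String dfsNext = frontier.pollLast();  // "http://foo.com/b"

        System.out.println("BFS would fetch " + bfsNext + ", DFS would fetch " + dfsNext);
        // Focused crawling instead orders the frontier by a relevance score,
        // e.g. a PriorityQueue keyed on how on-topic the linking page was.
    }
}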