Paul Ngugi

Case Study: Web Crawler

This case study develops a program that traverses the Web by following hyperlinks. The World Wide Web, abbreviated as WWW, W3, or Web, is a system of interlinked hypertext documents on the Internet. With a Web browser, you can view a document and follow its hyperlinks to view other documents. In this case study, we will develop a program that automatically traverses the documents on the Web by following the hyperlinks. This type of program is commonly known as a Web crawler. For simplicity, our program follows only hyperlinks that start with http://. The figure below shows an example of traversing the Web. We start from a Web page that contains three URLs named URL1, URL2, and URL3. Following URL1 leads to a page that contains three URLs named URL11, URL12, and URL13. Following URL2 leads to a page that contains two URLs named URL21 and URL22. Following URL3 leads to a page that contains four URLs named URL31, URL32, URL33, and URL34. The crawler continues to traverse the Web by following the new hyperlinks. As you can see, this process may continue forever, but we will exit the program once we have traversed 100 pages.

Figure: Traversing the Web by following hyperlinks.

The program follows the URLs to traverse the Web. To ensure that each URL is traversed only once, the program maintains two lists of URLs. One list stores the URLs pending traversal, and the other stores the URLs that have already been traversed. The algorithm for this program can be described as follows:

Add the starting URL to a list named listOfPendingURLs;
while listOfPendingURLs is not empty and size of listOfTraversedURLs < 100 {
    Remove a URL from listOfPendingURLs;
    if this URL is not in listOfTraversedURLs {
        Add it to listOfTraversedURLs;
        Display this URL;
        Read the page from this URL and for each URL contained in the page {
            Add it to listOfPendingURLs if it is not in listOfTraversedURLs;
        }
    }
}

The program below implements this algorithm.

package demo;
import java.util.Scanner;
import java.util.ArrayList;

public class WebCrawler {

    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.print("Enter a URL: ");
        String url = input.nextLine();
        crawler(url); // Traverse the Web from a starting url
    }

    public static void crawler(String startingURL) {
        ArrayList<String> listOfPendingURLs = new ArrayList<>();
        ArrayList<String> listOfTraversedURLs = new ArrayList<>();

        listOfPendingURLs.add(startingURL);
        while(!listOfPendingURLs.isEmpty() && listOfTraversedURLs.size() < 100) {
            String urlString = listOfPendingURLs.remove(0);
            if(!listOfTraversedURLs.contains(urlString)) {
                listOfTraversedURLs.add(urlString);
                System.out.println("Craw " + urlString);

                for(String s: getSubURLs(urlString)) {
                    if(!listOfTraversedURLs.contains(s))
                        listOfPendingURLs.add(s);
                }
            }
        }
    }

    public static ArrayList<String> getSubURLs(String urlString){
        ArrayList<String> list = new ArrayList<>();

        try {
            java.net.URI uri = new java.net.URI(urlString);
            java.net.URL url = uri.toURL();
            Scanner input = new Scanner(url.openStream());
            int current = 0;
            while(input.hasNext()) {
                String line = input.nextLine();
                current = line.indexOf("http:", current);
                while(current >= 0) { // >= 0 so a URL at the start of a line is not skipped
                    int endIndex = line.indexOf("\"", current);
                    if(endIndex > 0) { // Ensure that a correct URL is found
                        list.add(line.substring(current, endIndex));
                        current = line.indexOf("http:", endIndex);
                    }
                    else
                        current = -1; // No more URLs in this line
                }
            }
            input.close(); // Release the page's input stream
        }
        catch(Exception ex) {
            System.out.println("Error: " + ex.getMessage());
        }

        return list;
    }

}


The main method prompts the user to enter a starting URL and invokes the crawler(url) method to traverse the Web from that URL.

The crawler(url) method adds the starting URL to listOfPendingURLs and then repeatedly processes the URLs in listOfPendingURLs in a while loop. Each iteration removes the first URL from the list and processes it only if it has not been processed before. To process a URL, the program first adds it to listOfTraversedURLs, the list that stores all the URLs that have already been processed, and displays it. The getSubURLs(url) method returns a list of the URLs found in the Web page at the specified URL, and a foreach loop adds each of these URLs to listOfPendingURLs if it is not already in listOfTraversedURLs.
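One detail worth noting is that contains on an ArrayList is a linear scan, so the duplicate check gets slower as listOfTraversedURLs grows. The sketch below is not part of the original program; it shows the same loop written with a Queue for the pending URLs and a HashSet for the traversed ones. The class name WebCrawlerVariant is made up for illustration, and it reuses the getSubURLs method from the listing above.

package demo;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

public class WebCrawlerVariant {

    // Same traversal as crawler(url), but with constant-time duplicate checks
    public static void crawler(String startingURL) {
        Queue<String> pendingURLs = new LinkedList<>();
        Set<String> traversedURLs = new HashSet<>();

        pendingURLs.offer(startingURL);
        while(!pendingURLs.isEmpty() && traversedURLs.size() < 100) {
            String urlString = pendingURLs.poll(); // Take the first pending URL
            if(traversedURLs.add(urlString)) { // add returns false if already traversed
                System.out.println("Crawl " + urlString);

                for(String s: WebCrawler.getSubURLs(urlString)) {
                    if(!traversedURLs.contains(s))
                        pendingURLs.offer(s);
                }
            }
        }
    }

}

Either version visits the pages in the same order; only the bookkeeping for detecting already-traversed URLs changes.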

The getSubURLs(url) method reads each line of text from the Web page and searches for URLs within that line. Note that a correct URL cannot contain line-break characters, so it is sufficient to limit the search for a URL to a single line of text in the page. For simplicity, we assume that a URL ends with a quotation mark ("). Whenever the method finds a URL, it adds it to a list; because a line may contain multiple URLs, the method keeps searching the rest of the line for the next occurrence of http:. When no further URL is found in the line, current is set to -1. The URLs contained in the page are returned in the form of a list.
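As a side note, the same extraction rule (take everything from http: up to, but not including, the next quotation mark) could also be expressed with a regular expression instead of the substring scanning used in getSubURLs. The class below is only an illustrative sketch; SubURLExtractorDemo and extractURLs are hypothetical names, not part of the crawler.

package demo;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubURLExtractorDemo {
    // Matches text starting with http: up to (but not including) the next quotation mark
    private static final Pattern URL_PATTERN = Pattern.compile("http:[^\"]*(?=\")");

    public static ArrayList<String> extractURLs(String line) {
        ArrayList<String> list = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(line);
        while(matcher.find())
            list.add(matcher.group());
        return list;
    }

    public static void main(String[] args) {
        String line = "<a href=\"http://example.com/a\">A</a> and <a href=\"http://example.com/b\">B</a>";
        System.out.println(extractURLs(line)); // [http://example.com/a, http://example.com/b]
    }
}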

The program terminates when the number of traversed URLs reaches 100. This is a simple program for traversing the Web.
