Java Web Crawler

In this tutorial, you will learn how to crawl a website using Java. Before we start writing a Java web crawler, we will look at how a simple web crawler is designed.

What is a crawler?

A web crawler is a program that systematically browses the World Wide Web. Web crawlers are also known as spiders, bots and automatic indexers, and the process itself is called web crawling or spidering. Web crawling is used to collect information about web pages.

Simple Web Crawler

Here are the modules of a simple web crawler; see the diagram below.

  • Get the URL of a web document from the processing queue.
  • Schedule the download of the document.
  • Download the document.
  • Parse the downloaded content and extract links and metadata; the extracted links are added back to the processing queue.
  • Store the parsed document, or process it further, for example by indexing its text (a minimal sketch of this loop follows the diagram).
Crawler Architecture
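
Before we bring in a library, here is a minimal sketch of this loop in plain Java. The class name, the regular-expression link extraction and the 10-page limit are simplifications for illustration only; a real crawler would use a proper HTML parser, honor robots.txt and handle failed requests.

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawlLoop {

    // Very rough href extractor, just enough to illustrate the loop
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws IOException, InterruptedException {
        Queue<String> queue = new ArrayDeque<>();   // processing queue
        Set<String> visited = new HashSet<>();      // avoid downloading the same page twice
        queue.add("http://www.ics.uci.edu/");       // seed URL

        HttpClient client = HttpClient.newHttpClient();

        while (!queue.isEmpty() && visited.size() < 10) {
            String url = queue.poll();              // 1. get a URL from the queue
            if (!visited.add(url)) {
                continue;                           // already processed
            }
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response =         // 2./3. download the document
                    client.send(request, HttpResponse.BodyHandlers.ofString());

            Matcher m = LINK.matcher(response.body());
            while (m.find()) {                      // 4. extract links and queue them
                queue.add(m.group(1));
            }
            // 5. store or further process the document; here we just print a summary
            System.out.println(url + " -> " + response.body().length() + " chars");
            Thread.sleep(300);                      // be polite between requests
        }
    }
}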

Frameworks for crawling in Java

There are many frameworks that can be used for crawling in Java; the most widely used ones are free and open source. These libraries can be used on their own, or they can complement each other, for example one used for fetching pages and another for parsing them, as shown in the short example below. It all depends on the needs of the crawler you want to create.
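
For instance, jsoup is a popular open-source HTML parser. Purely as an illustration (the rest of this tutorial uses crawler4j), fetching a page and extracting its outgoing links with jsoup looks roughly like this; the URL and user agent are the same ones used later in the tutorial:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinkExtractor {

    public static void main(String[] args) throws Exception {
        // Download and parse a single page
        Document doc = Jsoup.connect("http://www.ics.uci.edu/")
                            .userAgent("learnCrawler")
                            .get();

        // Print every outgoing link; "abs:href" resolves relative URLs against the page URL
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}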


Our First Java Crawler

We are going to write our first Java crawler: a simple program that counts the total number of pages downloaded. We will use crawler4j for crawling, as it makes writing a crawler very simple.

There are two things you should keep in mind when writing a crawler.

  • Never put too much load on a website. Crawl with a delay of at least a few hundred milliseconds between requests (we use 300 ms below). If you don’t honor this, your crawler may be blocked altogether.
  • Give your crawler a name by setting the user agent string (both settings are shown in the snippet right after this list).
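
With crawler4j, both of these are single configuration calls; you will see them again in the full Controller class below.

CrawlConfig config = new CrawlConfig();
config.setPolitenessDelay(300);            // wait at least 300 ms between successive requests
config.setUserAgentString("learnCrawler"); // identify your crawler to webmasters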

Note for newbies: keep in mind that your crawler’s activity can be monitored by the webmaster of the website you are crawling. Your IP may be blocked if the webmaster doesn’t see a reason to allow your spider to crawl the site. Spidering a site consumes resources on the website; every page the crawler downloads has to be processed by the web server.

After all the talk, let’s get down to the code. We are going to create two main classes, a Controller and a WebCrawler class, plus a small enum and an in-memory store used for counting pages.

Maven dependency

Add the crawler4j dependency shown below to your pom.xml. In case you don’t know how to create a Maven project, you can read the create maven project tutorial first.

<dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>4.2</version>
</dependency>

WebCrawler Class

This class decides what to do with a particular page. Here you can decide which pages should be downloaded; for example, you may not want to download .js files. We will create a class that extends edu.uci.ics.crawler4j.crawler.WebCrawler and overrides two methods:

  • shouldVisit: here you decide which pages should be crawled.
  • visit: this method is called when a page has been downloaded; here you can process the page data.
package com.programtalk.learn.webcrawler;

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
	
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
                                                           + "|png|mp3|mp4|zip|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page
     * in which we have discovered this new url and the second parameter is
     * the new url. You should implement this function to specify whether
     * the given url should be crawled or not (based on your crawling logic).
     * In this example, we are instructing the crawler to ignore URLs that
     * have css, js, gif, ... extensions and to only accept URLs that start
     * with "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
     @Override
     public boolean shouldVisit(Page referringPage, WebURL url) {
         String href = url.getURL().toLowerCase();
         return !FILTERS.matcher(href).matches()
                && href.startsWith("http://www.ics.uci.edu");
     }

     /**
      * This function is called when a page is fetched and ready
      * to be processed by your program.
      */
     @Override
     public void visit(Page page) {
         String url = page.getWebURL().getURL();
         System.out.println("URL: " + url);
         
         // Count every visited page in the shared in-memory store
         InMemoryDB inMemoryDB = InMemoryDB.getInstance();
         int totalPages = inMemoryDB.get(DataKeys.numPage) != null
                 ? Integer.valueOf(inMemoryDB.get(DataKeys.numPage)) + 1 : 1;
         inMemoryDB.put(DataKeys.numPage, String.valueOf(totalPages));
         
         if (page.getParseData() instanceof HtmlParseData) {
             HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
             String text = htmlParseData.getText();
             String html = htmlParseData.getHtml();
             Set<WebURL> links = htmlParseData.getOutgoingUrls();

             System.out.println("Text length: " + text.length());
             System.out.println("Html length: " + html.length());
             System.out.println("Number of outgoing links: " + links.size());
             System.out.println(htmlParseData.getMetaTags());
         }
    }
}

Controller Class

Here we are going to do the configuration.

  • We are going to add the seeds that define the entry points of our crawler. The crawler starts crawling a website from the seeds we define.
  • We will set the user agent.
  • We are also going to set the politeness delay, which defines the delay between visits to the website.
  • Since crawler4j is a multithreaded crawler, we are going to set the number of threads that can run in parallel.
package com.programtalk.learn.webcrawler;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
	
    public static final String pageUrl = "http://www.ics.uci.edu/";
    
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setUserAgentString("learnCrawler");
        config.setMaxPagesToFetch(100);
        config.setPolitenessDelay(300);
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setResumableCrawling(false);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed(pageUrl +"~lopes/");
        controller.addSeed(pageUrl +"~welling/");
        controller.addSeed(pageUrl);

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
        
        System.out.println("Total number of pages downloaded: " + InMemoryDB.getInstance().get(DataKeys.numPage));
    }
}

DataKeys

This is an enum that holds the keys for the various data attributes we may want to store.

package com.programtalk.learn.webcrawler;

public enum DataKeys {

	numPage;
}

In-Memory Database

A simple map that stores the data while the crawler is running. You may want to replace it with a real database such as MySQL or PostgreSQL.

package com.programtalk.learn.webcrawler;

import java.util.HashMap;
import java.util.Map;

public class InMemoryDB {

	private static final InMemoryDB IN_MEMORY_DB = new InMemoryDB();
	
	private Map<DataKeys,String> data = new HashMap<>();
	
	public static final InMemoryDB getInstance(){
		return IN_MEMORY_DB;
	}
	
	private InMemoryDB(){
		// private constructor: no outside instantiation possible
	}
	
	public synchronized String get(DataKeys key){
		return data.get(key);
	}
	public synchronized void put(DataKeys key, String value) {
		data.put(key, value);
	}
}

Output

When you run the Controller class, the crawler prints each visited URL together with its text length, HTML length, number of outgoing links and meta tags, and finally the total number of pages downloaded as counted in the in-memory store.

Like this post? Don’t forget to share it!
