Web Crawler using jsoup

In this tutorial we will build a simple web crawler using jsoup.

1. Declare Maven Dependency

In case you don’t know how to create a Maven project, you can read our tutorial on creating a new Maven project.

<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.10.2</version>
</dependency>

2. Choose a webpage to crawl

In this tutorial we will crawl the Hollinger NBA Player Statistics page and fetch the details of only one player. You can extend the example yourself to get more details.

This is what we are downloading.

3. Start downloading the page

jsoup provides a method to download a page:

Document document = Jsoup.connect(url).get();

While downloading, we make sure that a page is not crawled more than once, and we also set the user agent string. Another important consideration is enforcing a minimum time delay between two successive requests, so that we do not overload the site. We do all of this in our download method.

	private static Document download(String url) throws IOException, InterruptedException {
		// don't visit already visited pages
		if (url != null && !url.trim().isEmpty() && !urlsVisited.contains(url)) {
			// we should be nice to websites we visit
			if ((new Date().getTime() - lastVisitTime) < 300) {
				System.out.println("wait for time :" + (new Date().getTime() - lastVisitTime));
				Thread.sleep(300 - (new Date().getTime() - lastVisitTime));
			}

			urlsVisited.add(url);
			Document document = Jsoup.connect(url)
					.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 crawl-by-Jsoup")
					.get();
			lastVisitTime = new Date().getTime();
			return document;
		}
		return null;
	}
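The throttling check above can also be factored into a small, reusable helper. Below is a sketch of that idea; the Throttle class and its names are my own illustration, not part of the tutorial code, but it enforces the same 300 ms minimum gap between requests:

```java
// Sketch of the politeness delay as a standalone helper.
// The class and method names here are illustrative only.
public class Throttle {

	private final long minGapMillis;
	private long lastVisitTime = 0;

	public Throttle(long minGapMillis) {
		this.minGapMillis = minGapMillis;
	}

	// Blocks until at least minGapMillis have passed since the previous call.
	public synchronized void await() throws InterruptedException {
		long elapsed = System.currentTimeMillis() - lastVisitTime;
		if (elapsed < minGapMillis) {
			Thread.sleep(minGapMillis - elapsed);
		}
		lastVisitTime = System.currentTimeMillis();
	}

	public static void main(String[] args) throws InterruptedException {
		Throttle throttle = new Throttle(300);
		long start = System.currentTimeMillis();
		throttle.await(); // first call: no previous visit, returns immediately
		throttle.await(); // second call: sleeps out the remaining gap
		System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
	}
}
```

In download() the same effect is achieved inline; a helper like this just keeps the crawl loop readable if you later add more politeness rules (robots.txt, per-host delays, and so on).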

4. Parse the content

Once we have downloaded the page, we need to parse the HTML to find the data we are looking for. For that, we have to inspect the HTML of the page. Here is a screenshot of the HTML:

 

So we will look for the header of the table and then, for each player row, the value under each header. The most important method here is:

  •   doc.select(query) — finds elements that match the CSS selector query, with this element as the starting context. Matched elements may include this element or any of its children. See the jsoup cookbook for examples.

	private static void parseDocument(Document doc) throws IOException, InterruptedException {
		Elements headers = doc.select(".tablehead .colhead");
		Elements oddRows = doc.select(".tablehead .oddRow");

		for (Element each : oddRows) {
			System.out.println("#################### Player Details ####################### ");
			int i = 0;
			String detailsUrl = null;
			for (Node eachPlayer : each.childNodes()) {
				if (headers.get(0).childNode(i).childNode(0).hasAttr("title")) {
					System.out.print(headers.get(0).childNode(i++).childNode(0).attr("title") + " : ");
				} else {
					System.out.print(headers.get(0).childNode(i++).childNode(0).toString() + " : ");
				}
				for (Node insideScript : eachPlayer.childNodes()) {
					if (!insideScript.childNodes().isEmpty()) {
						detailsUrl = insideScript.attr("href");
						System.out.print(insideScript.childNode(0).toString());
					} else {
						System.out.print(insideScript.toString());
					}
				}
				System.out.println(" ,");
			}
			readEachPalyerDoc(download(detailsUrl));
			System.out.println("#################### For tutorial I am breaking it here ####################### ");
			break;
		}
	}
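If you want to experiment with doc.select() without fetching a live page, you can run the same kind of selectors against an in-memory document via Jsoup.parse(String). The HTML fragment below is made up; it only mimics the table classes used above:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectDemo {
	public static void main(String[] args) {
		// Made-up fragment mimicking the .tablehead / .colhead / .oddrow structure
		String html = "<table class=\"tablehead\">"
				+ "<tr class=\"colhead\"><td>RK</td><td>PLAYER</td></tr>"
				+ "<tr class=\"oddrow\"><td>1</td>"
				+ "<td><a href=\"/player/1\">Russell Westbrook, OKC</a></td></tr>"
				+ "</table>";
		Document doc = Jsoup.parse(html);

		// Descendant selectors, just like in parseDocument()
		Elements headers = doc.select(".tablehead .colhead");
		Elements rows = doc.select(".tablehead .oddrow");

		System.out.println(headers.first().text()); // prints: RK PLAYER
		for (Element row : rows) {
			// attr("href") on the selected link gives the player detail URL
			System.out.println(row.select("a").attr("href")); // prints: /player/1
		}
	}
}
```

This needs the jsoup dependency from step 1 on the classpath, but no network access, which makes it convenient for trying out selector queries before pointing them at a real site.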

5. Complete example

Here is the complete example for downloading the information. I have written a basic crawler to show how useful jsoup can be. You can do much more with jsoup; for more information, refer to the jsoup cookbook.

package com.programtalk.learn.webcrawler.jsoup;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class WebCrawler {


	private static ArrayList<String> urlsVisited = new ArrayList<String>();
	
	private static long lastVisitTime = 0;


	public static void main(String[] args) throws IOException, InterruptedException {
		System.out.println("Starting to crawl http://insider.espn.com/nba/hollinger/statistics");

		crawl("http://insider.espn.com/nba/hollinger/statistics");

		System.out.println("Urls visited : " + urlsVisited);
		System.out.println("Crawling END");
	}

	private static void crawl(String url) throws IOException, InterruptedException {

		Document doc = download(url);
		if (doc != null) {
			parseDocument(doc);
		}

	}

	private static Document download(String url) throws IOException, InterruptedException {
		// don't visit already visited pages
		if (url != null && !url.trim().isEmpty() && !urlsVisited.contains(url)) {
			// we should be nice to websites we visit
			if ((new Date().getTime() - lastVisitTime) < 300) {
				System.out.println("wait for time :" + (new Date().getTime() - lastVisitTime));
				Thread.sleep(300 - (new Date().getTime() - lastVisitTime));
			}

			urlsVisited.add(url);
			Document document = Jsoup.connect(url)
					.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6 crawl-by-Jsoup")
					.get();
			lastVisitTime = new Date().getTime();
			return document;
		}
		return null;
	}

	private static void parseDocument(Document doc) throws IOException, InterruptedException {
		Elements headers = doc.select(".tablehead .colhead");
		Elements oddRows = doc.select(".tablehead .oddRow");

		for (Element each : oddRows) {
			System.out.println("#################### Player Details ####################### ");
			int i = 0;
			String detailsUrl = null;
			for (Node eachPlayer : each.childNodes()) {
				if (headers.get(0).childNode(i).childNode(0).hasAttr("title")) {
					System.out.print(headers.get(0).childNode(i++).childNode(0).attr("title") + " : ");
				} else {
					System.out.print(headers.get(0).childNode(i++).childNode(0).toString() + " : ");
				}
				for (Node insideScript : eachPlayer.childNodes()) {
					if (!insideScript.childNodes().isEmpty()) {
						detailsUrl = insideScript.attr("href");
						System.out.print(insideScript.childNode(0).toString());
					} else {
						System.out.print(insideScript.toString());
					}
				}
				System.out.println(" ,");
			}
			readEachPalyerDoc(download(detailsUrl));
			System.out.println("#################### For tutorial I am breaking it here ####################### ");
			break;
		}
	}

	private static void readEachPalyerDoc(Document doc) throws IOException {
		Elements playerGeneralInfo = doc.select(".player-bio .general-info");
		Elements playerMetaData = doc.select(".player-bio .player-metadata");

		System.out.println("####### Player General Info ########");
		for (Element each : playerGeneralInfo) {
			for (Node eachLi : each.childNodes()) {
				for (Node eachLiT : eachLi.childNodes()) {
					if (!eachLiT.childNodes().isEmpty()) {
						// readEachPalyerDoc(download(eachLi.attr("href")));
						System.out.print(eachLiT.childNode(0).toString() + " ");
					} else {
						System.out.print(eachLiT.toString() + " ");
					}
				}
			}
		}
		System.out.println("\n ####### Player Meta Data ########");
		for (Element each : playerMetaData) {
			for (Node eachLi : each.childNodes()) {
				for (Node eachLiT : eachLi.childNodes()) {
					if (!eachLiT.childNodes().isEmpty()) {
						// readEachPalyerDoc(download(eachLi.attr("href")));
						System.out.print(eachLiT.childNode(0).toString() + " ");
					} else {
						System.out.println(eachLiT.toString() + " ");
					}
				}
			}
		}

	}

}


6. Output of our Crawler

Starting to crawl http://insider.espn.com/nba/hollinger/statistics
#################### Player Details #######################
RK : 1 ,
PLAYER : Russell Westbrook, OKC ,
GP : 41 ,
MPG : 34.7 ,
True Shooting Percentage : .541 ,
Assist Ratio : 23.7 ,
Turnover Ratio : 12.2 ,
Usage Rate : 42.4 ,
Offensive Rebound Rate : 6.0 ,
Defensive Rebound Rate : 27.7 ,
Rebound Rate : 17.1 ,
Player Efficiency Rating : 29.74 ,
Value Added : 398.1 ,
Estimated Wins Added : 13.3 ,
wait for time :16
####### Player General Info ########
#0 PG 6' 3", 200 lbs Oklahoma City Thunder
####### Player Meta Data ########
Born Nov 12, 1988 in Long Beach, CA (Age: 28)
Drafted 2008: 1st Rnd, 4th by SEA
College UCLA
Experience 8 years
#################### For tutorial I am breaking it here #######################
Urls visited : [http://insider.espn.com/nba/hollinger/statistics, http://insider.espn.go.com/nba/players/hollinger?playerId=3468]
Crawling END
