I was deep in the Oracle java.net docs, coffee in hand, when the idea hit me: build a minimalist web scraper in a single file, just to see if I could :)
✨ Introduction
Ever wondered how search engines crawl web pages?
Or how job sites gather listings from other websites?
That’s all thanks to web scraping. In this post, I’ll show you how I built a simple web scraper in pure Java, without any libraries, all in a single file.
Even if you’re new to Java’s networking or regex features, don’t worry—I’ll walk you through it all.
📌 What you'll learn

- How to take a website URL from the user
- How to fetch its HTML using HttpURLConnection
- How to extract the <title> and all <a href="..."> links using regex
- How to handle common errors gracefully
- Why this is a powerful starting point for automating the web
Let's dive in 🤿
Create a file WebScraper.java (or name it anything you like; just keep the class name in sync with the filename). The imports are listed with the full code at the end of the post.
public class WebScraper {
    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("❌ Please provide a URL as the first argument.");
            return;
        }
        String inputUrl = args[0];
This snippet reads the URL supplied as the first CLI argument; if no argument is given, it prints an error and exits.
Next part:
try {
    URI uri = URI.create(inputUrl);
    URL url = uri.toURL();
    String html = fetchHTML(url);

    if (html == null || html.isEmpty()) {
        System.err.println("Failed to fetch content or empty page.");
        return;
    }

    String title = extractTitle(html);
    System.out.println("Page Title: " + (title != null ? title : "N/A"));

    List<String> links = extractLinks(html);
    System.out.println("\nLinks found:");
    if (links.isEmpty()) {
        System.out.println("No <a> tags found.");
    } else {
        links.forEach(System.out::println);
    }
} catch (Exception e) {
    System.err.println("Error: " + e.getMessage());
}
} // closes main
This portion builds a URL from the string provided on the CLI by going through URI, since constructing a URL directly from a string (the URL(String) constructor) is deprecated as of JDK 20. Next we fetch the HTML through a helper function, then extract the links on the page and print everything gracefully.
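To make that two-step conversion concrete, here's a tiny standalone sketch (UrlConversionDemo is just a scratch class name, not part of the scraper) showing where each step can fail:

import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class UrlConversionDemo {
    public static void main(String[] args) {
        try {
            // "htp://typo.example" is syntactically a valid URI, so URI.create accepts it,
            // but toURL() throws MalformedURLException: no handler exists for "htp".
            URL url = URI.create("htp://typo.example").toURL();
            System.out.println(url);
        } catch (IllegalArgumentException e) {
            System.err.println("Bad URI syntax: " + e.getMessage());
        } catch (MalformedURLException e) {
            System.err.println("No protocol handler: " + e.getMessage());
        }
    }
}

This is exactly why the scraper's catch (Exception e) block matters: both failure modes surface there with a readable message.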
Here's the fetchHTML function:
private static String fetchHTML(URL url) {
    StringBuilder html = new StringBuilder();
    try {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append("\n");
            }
        }
        return html.toString();
    } catch (Exception e) {
        System.err.println("❌ Failed to fetch HTML: " + e.getMessage());
        return null;
    }
}
First of all, we open a connection with the openConnection() method on URL and cast the result to HttpURLConnection so we can dispatch a GET request to the site. Then we wrap the connection's InputStream in a BufferedReader and append the HTML to our StringBuilder one line at a time. Go ahead and print this StringBuilder to see what it looks like.
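One practical caveat: some servers respond slowly or reject Java's default User-Agent. If you hit that, a hedged variation of the connection setup (not part of the original code; the agent string below is an arbitrary example value) extends the first two lines of the try block to something like:

HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("GET");
conn.setConnectTimeout(5_000);  // abort if the server doesn't accept the connection within 5s
conn.setReadTimeout(5_000);     // abort if the server stalls mid-response
conn.setRequestProperty("User-Agent", "WebScraper-demo/1.0");  // example value, not a required string

Both timeout setters take milliseconds and make the call throw a SocketTimeoutException instead of hanging forever, which our existing catch block already handles.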
Now we have two more functions: one for extracting the title from the HTML and one for extracting the links.
- extractTitle(html):
private static String extractTitle(String html) {
    Pattern pattern = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher matcher = pattern.matcher(html);
    return matcher.find() ? matcher.group(1).trim() : null;
}
Why don't you tell me in the comments how this pattern matching works?
Let's see how closely you're following 😀
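If you want to experiment before answering, here's a throwaway harness (RegexPlayground is just a scratch name) that runs the same pattern against a hand-written snippet of HTML:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexPlayground {
    public static void main(String[] args) {
        // Uppercase tag plus embedded newlines, to exercise both regex flags
        String html = "<html><head><TITLE>\n  Hello, scraper!\n</TITLE></head></html>";
        Pattern p = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        System.out.println(m.find() ? m.group(1).trim() : "no match");  // prints: Hello, scraper!
    }
}

Try removing CASE_INSENSITIVE or DOTALL and see which change breaks the match.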
- extractLinks(html):
private static List<String> extractLinks(String html) {
    List<String> links = new ArrayList<>();
    Pattern pattern = Pattern.compile("<a\\s+(?:[^>]*?\\s+)?href=[\"'](.*?)[\"']", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(html);
    while (matcher.find()) {
        String link = matcher.group(1).trim();
        if (!link.isEmpty()) {
            links.add(link);
        }
    }
    return links;
}
This matches the <a> tag pattern and stores each non-empty href value in a list.
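One limitation worth knowing: many of the captured hrefs will be relative paths like /about. If you want absolute URLs, a small hypothetical helper (toAbsolute is my name for it; it's not in the post's code) can resolve each href against the page you fetched:

// Hypothetical helper, not part of the original scraper:
// resolves a possibly-relative href against the base page URL.
private static String toAbsolute(String baseUrl, String href) {
    try {
        return URI.create(baseUrl).resolve(href).toString();
    } catch (IllegalArgumentException e) {
        return href;  // leave hrefs that aren't valid URIs as-is
    }
}

For example, toAbsolute("https://github.com", "/features") returns "https://github.com/features".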
Here's the full code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebScraper {

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("❌ Please provide a URL as the first argument.");
            return;
        }

        String inputUrl = args[0];

        try {
            // Build the URL via URI to avoid the deprecated URL(String) constructor
            URI uri = URI.create(inputUrl);
            URL url = uri.toURL();
            String html = fetchHTML(url);

            if (html == null || html.isEmpty()) {
                System.err.println("Failed to fetch content or empty page.");
                return;
            }

            String title = extractTitle(html);
            System.out.println("Page Title: " + (title != null ? title : "N/A"));

            List<String> links = extractLinks(html);
            System.out.println("\nLinks found:");
            if (links.isEmpty()) {
                System.out.println("No <a> tags found.");
            } else {
                links.forEach(System.out::println);
            }
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
    }

    // Sends a GET request and reads the response body line by line.
    private static String fetchHTML(URL url) {
        StringBuilder html = new StringBuilder();
        try {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    html.append(line).append("\n");
                }
            }
            // Uncomment to inspect the raw HTML we just fetched:
            // System.out.println(html + "\n");
            return html.toString();
        } catch (Exception e) {
            System.err.println("❌ Failed to fetch HTML: " + e.getMessage());
            return null;
        }
    }

    // Grabs the contents of the first <title>...</title> pair, case-insensitively.
    private static String extractTitle(String html) {
        Pattern pattern = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    // Collects every non-empty href value from <a ...> tags.
    private static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Pattern pattern = Pattern.compile("<a\\s+(?:[^>]*?\\s+)?href=[\"'](.*?)[\"']", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            String link = matcher.group(1).trim();
            if (!link.isEmpty()) {
                links.add(link);
            }
        }
        return links;
    }
}
- Compile and run the file:
javac WebScraper.java
java WebScraper https://github.com
Output: the page title prints first, followed by the list of extracted links, one per line.
🙌 Final Thoughts:
Yes, I know this is a simple scraper. There's plenty of room to grow it; for instance, you could swap the regex parsing for a real HTML parser like JSoup (a quick sketch follows the list below), and much more.
But this small scraper gave me a deep understanding of:

- How Java connects to the web
- How to read and process HTML
- How regex can extract meaningful data
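For the curious, here's roughly what the same job looks like with jsoup (a sketch only; it assumes you've added the org.jsoup:jsoup dependency, which deliberately breaks this post's no-libraries constraint):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {
    public static void main(String[] args) throws Exception {
        // jsoup handles both the HTTP request and real HTML parsing
        Document doc = Jsoup.connect("https://github.com").get();
        System.out.println("Page Title: " + doc.title());
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.attr("abs:href"));  // "abs:" resolves relative hrefs for you
        }
    }
}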
If you found this useful, drop a 💬 or ❤️.
And if you want me to turn this into a downloadable CLI tool, let me know in the comments!
✨ Connect With Me
You can follow me here on Dev.to or connect with me on GitHub for more hands-on Java + tooling content!