Darshan Khandelwal

Posted on Jan 29, 2023 • Originally published at serpdog.io

Web Scraping Google With Java


Introduction

Java is one of the oldest and most popular programming languages. Its popularity is evident from the fact that it runs on more than a billion Android devices.

It is also a powerful language with built-in multithreading support, which makes it suitable for a wide variety of tasks. One of those tasks is web scraping.

Web Scraping is the process of extracting or collecting data from websites or other sources and storing it in the needed format. It is used for various tasks such as data mining, price monitoring, lead generation, SEO, etc. The scraped data can be used by businesses to make informed decisions and gain information about their target market.

In this blog, we will learn how to scrape Google search results using Java and its libraries.

Why Java for scraping Google?

Java is a beginner-friendly language to learn. Its community is also large, which helps when you run into errors while writing your scraper.

You can get help with those errors by asking questions in the large Java communities on Reddit and Discord.

Java is also a powerful, high-performance language, which makes it a good choice for scraping Google.

Scraping Google Search Results With Java

In this blog, we will write a Java program to scrape the first 10 Google Search Results. The output will consist of the link, title, description, and position of each result. This data can be used for various purposes like SEO, media monitoring, ad verification, etc.

Scraping Google search results is divided into two parts:

Extracting the HTML data by making an HTTP request on the target URL.
Parsing the HTML to get the required data.

Requirements:

Many libraries in Java can be used for scraping Google Search Results, but in this tutorial, we will be going with:

Jsoup — It is a Java library that can be used for both extracting and parsing HTML.

Set-Up:

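Create a folder and save a file in it with the .java extension. Since the class in this tutorial is named GoogleScraper, the file should be named GoogleScraper.java. If you have not installed Java yet, install a JDK before continuing, and make sure the jsoup library is on your project's classpath.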

Process:

So, we have set up our Java project to scrape Google. We will now use Jsoup to make a connection with the Google web page by passing this URL as a parameter:

https://www.google.com/search?q=Java&gl=us

You can choose any query and location in the URL.
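
If the query comes from user input, it needs to be URL-encoded before it goes into the address. Here is a minimal sketch that uses only the standard library; the class name GoogleUrlBuilder and the helper buildSearchUrl are illustrative names, not part of jsoup:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GoogleUrlBuilder {

    // Hypothetical helper: builds the search URL from a query and a country code (gl).
    static String buildSearchUrl(String query, String countryCode) {
        // URLEncoder escapes spaces and reserved characters so the URL stays valid.
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        String gl = URLEncoder.encode(countryCode, StandardCharsets.UTF_8);
        return "https://www.google.com/search?q=" + q + "&gl=" + gl;
    }

    public static void main(String[] args) {
        // Prints: https://www.google.com/search?q=java+web+scraping&gl=us
        System.out.println(buildSearchUrl("java web scraping", "us"));
    }
}
```

The returned string can be used as the googleUrl value in the scraper.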

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class GoogleScraper {
    public static void main(String[] args) throws IOException {

        String googleUrl = "https://www.google.com/search?q=java&gl=us";

        // Connect to the Google search page
        Document doc = Jsoup.connect(googleUrl).userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36").get();
        // "doc" is a Document object that represents the parsed HTML DOM of the page

Step-by-step explanation:

First, we imported all the required classes from the jsoup library and the java.io package.
After declaring the class and the main method, we initialized our Google URL.
After that, we opened a connection to the Google web page using the URL and the User-Agent with the help of jsoup’s connect method. Then we fetched the HTML with the get method and stored it in a Document object.

The User-Agent header identifies the application, operating system, vendor, and version of the requesting client. Sending a browser-like value makes the request look like a visit from a real user.

This will help us to extract the raw HTML code. Then we will parse this HTML with the help of the select method.

Let us first identify the tags we have to select from the HTML to fetch the required data.

If you inspect the HTML, you will see that every result is contained inside a div container with the class name g.

We will now select all the divs with the class name g.

Elements results = doc.select("div.g");

select() — It is used to select matching elements from the HTML or XML document.

And then, we will loop over these selected divs.

        int c = 0; // zero-based counter used to print the result position
        for (Element result : results) {
            // Extract the title, link, and snippet of the result
            String title = result.select("h3").text();
            String link = result.select(".yuRUbf > a").attr("href");
            String snippet = result.select(".VwiC3b").text();
            System.out.println("Title: " + title);
            System.out.println("Link: " + link);
            System.out.println("Snippet: " + snippet);
            System.out.println("Position: " + (c + 1));
            System.out.println("\n");
            c++;
        }
    } // end of main
} // end of GoogleScraper

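The elements for the title, link, and snippet all sit inside each div.g container. You can find them by inspecting the HTML of a single result.

Inspecting a result shows that the title is in an h3 tag, the link is matched by .yuRUbf > a, and the snippet by .VwiC3b. These class names are generated by Google and change from time to time, so check them against the live page if the selectors return nothing.

After running the code successfully, your results should look like this: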

Title: Java | Oracle
Link: https://www.java.com/
Snippet: Get Java for desktop applications. Download Java · What is Java? Uninstall help. Happy Java User. Are you a software developer looking for JDK downloads?
Position: 1


Title: Java Downloads | Oracle
Link: https://www.oracle.com/java/technologies/downloads/
Snippet: The JDK includes tools for developing and testing programs written in the Java programming language and running on the Java platform. Linux; macOS; Windows.
Position: 2


Title: Java - Wikipedia
Link: https://en.wikipedia.org/wiki/Java
Snippet: Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of ...
Position: 3
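
The introduction mentions storing scraped data in the needed format. As one possible approach, the four printed fields could be written as a CSV row instead of plain console lines. The sketch below uses only the standard library; ResultCsv, toCsvRow, and quote are hypothetical names, and the sample values are taken from the output above.

```java
public class ResultCsv {

    // Wrap a field in double quotes and double any embedded quotes,
    // which is the usual CSV escaping convention.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Hypothetical helper: formats one search result as a single CSV row.
    static String toCsvRow(int position, String title, String link, String snippet) {
        return position + "," + quote(title) + "," + quote(link) + "," + quote(snippet);
    }

    public static void main(String[] args) {
        System.out.println("position,title,link,snippet");
        System.out.println(toCsvRow(1, "Java | Oracle", "https://www.java.com/",
                "Get Java for desktop applications."));
    }
}
```

Inside the scraper loop, this would be called as toCsvRow(c + 1, title, link, snippet).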

However, a scraper that sends its requests this way can easily get its IP blocked by Google. You can avoid this to some extent by using a random User-Agent for each request. Here is how you can do this:

Initialize an array of User Agents:

String[] userAgents = {
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
};

Then choose a random index between 0 (inclusive) and the length of the array (exclusive).

int rnd = (int)(Math.random()*userAgents.length);

Then you can pass it when you scrape the HTML.

Document doc = Jsoup.connect(googleUrl).userAgent(userAgents[rnd]).get();
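
The selection step can also be wrapped in a small helper so that every request picks a fresh value. This is a sketch under a few assumptions: UserAgentPicker and pick are made-up names, only two of the strings from the array above are repeated for brevity, and ThreadLocalRandom stands in for Math.random().

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class UserAgentPicker {

    // Two of the User-Agent strings from the array above; extend the list as needed.
    static final List<String> USER_AGENTS = List.of(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    );

    // Hypothetical helper: returns a randomly chosen entry from the given list.
    static String pick(List<String> agents) {
        if (agents.isEmpty()) {
            throw new IllegalArgumentException("At least one User-Agent is required");
        }
        // nextInt(bound) returns a value from 0 (inclusive) to bound (exclusive).
        int index = ThreadLocalRandom.current().nextInt(agents.size());
        return agents.get(index);
    }

    public static void main(String[] args) {
        // Each call may print a different User-Agent string.
        System.out.println(pick(USER_AGENTS));
    }
}
```

In the scraper, the result of pick(USER_AGENTS) would be passed to userAgent(...) in place of userAgents[rnd].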

So, this is how you can prepare a basic script to scrape Google Search Results with Java.

If you are looking for a more streamlined and maintenance-free solution, then you might consider our Google Search API for scraping Google Search Results.

Advantages of scraping Google Search Results

Scraping Google Search Results can provide you with many benefits:

SERP Monitoring — It can be used to monitor website rankings on Google, which can help you increase your website's visibility in the market.
Scalable — Scraping Google Search Results allows you to collect a large amount of data, which can be used for purposes like lead generation and market trend analysis.
Price Monitoring — It can be used to gather the pricing of the products sold by your competitors or online retailers to remain competitive in the market.
Lead Generation — It can be used to gather the email addresses of your potential customers.
Access to real-time data — It provides you with access to the most up-to-date data as Google Search Results keep updating frequently.
Inexpensive — Scraping Google Search Results yourself is a cost-effective alternative to the official API, which is not affordable for most businesses.

Why is the official Google Search API not a better alternative?

There are a few reasons why businesses don’t use the official Google Search API:

Not Affordable — The API is priced at $5 per 1,000 requests, which is out of reach for businesses on a tight budget.
Limited Access — The API provides only a limited amount of data. That is why people turn to scrapers, which extract the HTML directly from the web page and give them full control over the results.
Complex Setup — The Google Search API is complex to set up for users who have no coding experience.

Conclusion

In this tutorial, we learned to scrape Google Search Results using Java. Feel free to message me anything you need clarification on. Follow me on Twitter. Thanks for reading!
