How to parse Google Search result in Java?

#java #google #search #api

Google is an amazing resource, but there is no straightforward API for parsing Google search results. This is the Java code I wrote to fetch a Google search results page and parse the result links out of it.

How does Google Search work?

For example, if you are searching for "How to parse Google Search result in Java" then this is the URL that you would want to hit: https://www.google.com/search?q=How+to+parse+Google+Search+result+in+Java&num=10

The "q" parameter carries the search query (URL-encoded, with spaces written as "+"), and the "num" parameter tells Google how many results to return.
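To make the URL format concrete, here is a minimal sketch that builds the search URL from a query and a result count. The class name GoogleQueryUrl and the method name build are only illustrative, they are not part of any library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class GoogleQueryUrl {

  // Builds "https://www.google.com/search?q=<encoded query>&num=<num>".
  // Hypothetical helper for illustration only.
  static String build(String query, int num) {
    try {
      // URLEncoder writes spaces as "+" and escapes characters such as "&" and "+"
      String encoded = URLEncoder.encode(query.trim(), "UTF-8");
      return "https://www.google.com/search?q=" + encoded + "&num=" + num;
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is required on every Java platform, so this should never happen
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(build("How to parse Google Search result in Java", 10));
  }
}
```

Encoding matters as soon as the query contains characters like "&" or "+", which would otherwise be read as part of the URL syntax.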

Getting HTML search results from Google

The following code searches Google and returns the HTML of the results page.

// Imports (at the top of the enclosing class file)
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/**
   * Returns the Google search results page for the given query as a {@link String}.
   *
   * @param googleSearchQuery the Google search query
   * @return the HTML content of the results page
   * @throws Exception if the query cannot be encoded or the request fails
   */
  public static String getSearchContent(String googleSearchQuery) throws Exception {
    System.out.println("Searching for: " + googleSearchQuery);
    // URL-encode the trimmed query so it is safe to place in the "q" parameter
    String encodedQuery = URLEncoder
        .encode(googleSearchQuery.trim(), StandardCharsets.UTF_8.toString());
    String queryUrl = "https://www.google.com/search?q=" + encodedQuery + "&num=10";
    System.out.println(queryUrl);
    final String agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
    URL url = new URL(queryUrl);
    final URLConnection connection = url.openConnection();
    // A User-Agent header is mandatory; without it Google returns HTTP response code 403
    connection.setRequestProperty("User-Agent", agent);
    // try-with-resources closes the stream even if reading fails
    try (InputStream stream = connection.getInputStream()) {
      // getString is a helper defined elsewhere that reads the stream into a String
      return getString(stream);
    }
  }

The above code URL-encodes the search term, requests the Google search URL using the URLConnection class, and returns the response HTML. You can also change the request headers to reduce the chance of Google blocking you.
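The getString helper called above is not shown in this post. A minimal sketch of what it might look like is below, assuming the response is UTF-8 text. The class name StreamText is just a placeholder, and the actual helper may well differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamText {

  // Reads the whole stream into a String, assuming UTF-8 content (an assumption,
  // the real charset should ideally come from the Content-Type response header).
  static String getString(InputStream stream) throws IOException {
    try (InputStream in = stream) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      int read;
      // read() returns -1 once the end of the stream is reached
      while ((read = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, read);
      }
      return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
  }
}
```

Collecting the bytes first and decoding once at the end avoids splitting a multi-byte UTF-8 character across two reads.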

Parsing results from Google search result HTML

We only want the real results from Google Search, and for this we can use this simple Jsoup-based HTML parser in Java:

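// Imports (at the top of the enclosing class file)
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
   * Parses the result links out of a Google search results page.
   *
   * @param html the HTML of the results page
   * @return the list with all result URLs
   */
  public static List<String> parseLinks(final String html) {
    List<String> result = new ArrayList<>();
    Document doc = Jsoup.parse(html);
    // Each real result has its title in an <h3> that is a direct child of an <a>
    Elements titles = doc.select("a > h3");
    for (Element title : titles) {
      // The parent <a> element carries the result link
      String relHref = title.parent().attr("href");
      // Google wraps result links as "/url?q=<target>&sa=..."; strip the prefix
      if (relHref.startsWith("/url?q=")) {
        relHref = relHref.substring("/url?q=".length());
      }
      // Drop the extra parameters that follow the target URL
      String[] parts = relHref.split("&sa=");
      if (parts.length > 1) {
        relHref = parts[0];
      }
      result.add(relHref);
    }
    return result;
  }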

The above code is a bit tricky. It first finds the "h3" elements that are direct children of an "a" element. For each one it then looks at the parent element, which is the link itself, and reads the URL from that element's "href" attribute.

Google search result links start with "/url?q=", so we remove this prefix from the string. The target URL is followed by "&sa=", so we split the string at that marker and keep the first part.
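The clean-up step can be tried on its own, without Jsoup or a network call. The sketch below uses a hypothetical RedirectHref.clean helper and made-up example.com links. It also percent-decodes the target, which the parser above does not do, on the assumption that Google encodes the target URL inside the "q" parameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class RedirectHref {

  private static final String PREFIX = "/url?q=";

  // Turns "/url?q=<target>&sa=..." into "<target>"; other hrefs are returned unchanged.
  // Hypothetical helper for illustration only.
  static String clean(String href) {
    if (!href.startsWith(PREFIX)) {
      return href;
    }
    String target = href.substring(PREFIX.length());
    // Everything from "&sa=" onwards is extra parameters, not part of the target
    int end = target.indexOf("&sa=");
    if (end >= 0) {
      target = target.substring(0, end);
    }
    try {
      // Assumption: the target is percent-encoded, so "%3F" becomes "?" and so on
      return URLDecoder.decode(target, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      // UTF-8 is required on every Java platform, so this should never happen
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(clean("/url?q=https://example.com/page&sa=U&ved=abc"));
  }
}
```

Unlike the parser above, this version only cuts at "&sa=" when the "/url?q=" prefix is present, so a direct link that happens to contain "&sa=" is left alone.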