LLMs learn from data. Every popular AI tool available today has spent months feeding on website data, and their intelligence is backed by the vast amount of information available on the internet.
While existing AI platforms have access to a huge amount of data, what about the ones that are in development or will be developed? Building a robust crawler that scrapes websites and transforms the raw pages into structured data is difficult. Why not use a website crawler that has already been developed instead of building a new one?
This is where the Website Crawler API comes in. The free-to-use Website Crawler API offers five endpoints, each serving a different purpose and each equally important. The endpoints let users get the crawling status, retrieve structured data in JSON format, submit a domain name for crawling, and clear the crawl job status.
If you're using an open-source LLM for your project and need JSON-format data to train it, Website Crawler is one of the best options available to you.
Here are the endpoints offered by the API:
/crawl/start serves two purposes: sending a domain/website to WebsiteCrawler for crawling and getting the latest status of the crawler.
/crawl/currentURL is another easy to use endpoint. When you pass a valid key and the URL to this endpoint, you'll see the page WebsiteCrawler is currently processing.
/crawl/clear removes the crawl job once a site has been crawled.
/crawl/cwdata is the endpoint that retrieves the JSON data. It returns a JSON array containing N JSON objects, where N is the number of URLs (the limit) you submitted through the /crawl/start endpoint.
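To make the flow across these endpoints concrete, here is a minimal sketch of calling them directly over HTTP with Java's built-in HTTP client. The base URL and the query parameter names (key, url, limit) are assumptions for illustration only; check the official GitHub readme for the exact request format.

    // Minimal sketch of calling the Website Crawler endpoints over HTTP.
    // The base URL and the query parameter names are assumptions, not the documented API.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class EndpointSketch {
        private static final String BASE = "https://www.websitecrawler.org"; // assumed base URL
        private static final HttpClient HTTP = HttpClient.newHttpClient();

        static String get(String pathAndQuery) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(BASE + pathAndQuery))
                    .GET()
                    .build();
            return HTTP.send(request, HttpResponse.BodyHandlers.ofString()).body();
        }

        public static void main(String[] args) throws Exception {
            String key = "MY_API_KEY";
            String url = "https://example.com";

            // Submit the domain for crawling and check the crawler status (/crawl/start).
            System.out.println(get("/crawl/start?key=" + key + "&url=" + url + "&limit=10"));

            // See which page WebsiteCrawler is processing right now (/crawl/currentURL).
            System.out.println(get("/crawl/currentURL?key=" + key + "&url=" + url));

            // Retrieve the structured JSON data once crawling is complete (/crawl/cwdata).
            System.out.println(get("/crawl/cwdata?key=" + key + "&url=" + url));

            // Clear the finished crawl job (/crawl/clear).
            System.out.println(get("/crawl/clear?key=" + key + "&url=" + url));
        }
    }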
The official GitHub repository covers the endpoints in detail. If you want to know more about them, go through the readme file of the WebsiteCrawler repository.
How do you use the Website Crawler API to scrape data from websites and retrieve the structured data once a site has been crawled?
Below, I've shared a working example of getting the meta description and URL of each page on a website using the Website Crawler Java library. To make API requests programmatically, you'll need an API key; you can get a free key from WebsiteCrawler.org. In the following Java code, replace MY_KEY, MY_URL, and MY_LIMIT with your API key, the domain you want to analyze, and the number of pages you want WebsiteCrawler to analyze.
    // Required imports (the method below lives inside a class of your choice):
    import java.util.HashMap;
    import java.util.Map;
    import org.json.JSONArray;
    import org.json.JSONObject;

    public Map<String, String> getScrapedData(String MY_KEY, String MY_URL, int MY_LIMIT) {
        Map<String, String> mp = null;
        try {
            String status;
            String data;
            WebsiteCrawlerConfig cfg = new WebsiteCrawlerConfig(MY_KEY);
            WebsiteCrawlerClient client = new WebsiteCrawlerClient(cfg);
            mp = new HashMap<>();
            // Submit the URL and the page limit to WebsiteCrawler.
            client.submitUrlToWebsiteCrawler(MY_URL, MY_LIMIT);
            boolean taskStatus;
            // Poll until the crawl job reports "Completed!" and data is available.
            while (true) {
                taskStatus = client.getTaskStatus();
                Thread.sleep(3500);
                if (taskStatus) {
                    status = client.getCrawlStatus();
                    data = client.getcwData();
                    if (status != null && status.equals("Completed!")) {
                        if (data != null) {
                            System.out.println("Json Data::" + data);
                            Thread.sleep(24000);
                            break;
                        }
                    }
                }
            }
            if (data != null) {
                JSONArray arr = new JSONArray(data);
                for (int i = 0; i < arr.length(); i++) {
                    JSONObject obj = arr.getJSONObject(i);
                    if (obj.has("md") && obj.has("url")) {
                        String mdes = obj.getString("md"); // extracting the meta description from the JSON array
                        String url = obj.getString("url"); // extracting the URL from the JSON array
                        mp.put(url, mdes);
                    }
                }
            }
            mp.forEach((key, value) -> System.out.println("URL::" + key + " Meta description::" + value));
        } catch (Exception x) {
            x.printStackTrace();
        }
        return mp;
    }
The above method returns a Map containing the URLs and their meta descriptions.
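Calling it is straightforward. Here is a short usage sketch; the key, domain, and limit shown are placeholders you would replace with your own values.

    // Hypothetical usage of getScrapedData; replace the placeholder arguments with real values.
    Map<String, String> scraped = getScrapedData("MY_API_KEY", "https://example.com", 10);
    if (scraped != null) {
        scraped.forEach((url, metaDescription) ->
                System.out.println(url + " -> " + metaDescription));
    }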