Ranjan Dailata

State of the Art Automated Web Scraper using Bright Data

This is a submission for the Bright Data Web Scraping Challenge, covering all 3 prompts: Scrape Data from Complex, Interactive Websites; Build a Web Scraper API to Solve Business Problems; and Most Creative Use of Web Data for AI Models.

Prerequisites:

  1. Read the blog post "Bright Data to Tackle AI Data Challenges" to get started with Bright Data for AI use cases.

  2. Install .NET 8 for the operating system you are using.

  3. Sign up for or log in to Google AI Studio. It is required for building the fine-tuned instruction set.

  4. Sign up for Bright Data and complete the initial setup for Proxies & Scraping Infrastructure and the Web Scraper API. Make a note of the WSS Browser Credential and the Web Scraper API token.

WSS Browser Credential

Bright Data Webscraper Token

What I Built

I built Automated Web Scraper, a .NET C# library with a console application. It addresses automated web scraping problems, including data aggregation and collection through the third-party providers that Bright Data supports, and the preparation of data for custom fine-tuned AI models.

Automated Webscraper Solution

Source Code - https://github.com/ranjancse26/AutomatedWebscraper

The Automated Web Scraper project focuses on the following aspects.

  1. Proxies and Scraping - Headless web scraping with the help of Bright Data.
  2. Web Scraper API - Data collection and management using the Bright Data Web Scraper APIs.
  3. Web Data - Bright Data's Web Data, with Glassdoor as the example.
  4. Google Gemini 2.0 (gemini-2.0-flash-exp) LLM integration to produce the structured information for fine-tuning AI models.

Headless Web Scraper Class Diagram

Web Scraper Class Diagram

Service Layer Class Diagram

Demo

Source Code - https://github.com/ranjancse26/AutomatedWebscraper

Follow the prerequisites and update appSettings.json before running the demo.

Getting started with the Console Demo.

Automated Web Scraper Console

The screenshot below shows the Booking.com response obtained through Bright Data's Proxies and Scraping infrastructure using headless web scraping. The library can be extended to other providers in the same way.

BookingCom Headless Browser Response

Booking.com web scraping using the Bright Data Web Scraper API.
Note - Depending on the data you request, the Web Scraper API call can take from a few seconds to several minutes to complete.
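
The wait described in the note above is usually handled with a polling loop. The following is a minimal sketch in Python, not code from the AutomatedWebscraper library (which is written in C#). The names `wait_for_result` and `fetch_status`, and the `"status"` values `"ready"` and `"failed"`, are hypothetical; the actual response shape should be taken from the Bright Data documentation.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout_s: float = 300.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status until it reports "ready" or the timeout elapses.

    fetch_status is a caller-supplied function (hypothetical here) that asks
    the scraping service for the state of a previously triggered job.
    """
    waited = 0.0
    delay = initial_delay_s
    while True:
        result = fetch_status()
        if result.get("status") == "ready":
            return result
        if result.get("status") == "failed":
            raise RuntimeError(f"scrape job failed: {result}")
        # Stop before sleeping past the overall time budget.
        if waited + delay > timeout_s:
            raise TimeoutError(f"job not ready after {waited:.0f}s")
        sleep(delay)
        waited += delay
        # Back off exponentially so long-running jobs are not polled too often.
        delay = min(delay * 2, max_delay_s)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays; in normal use the default `time.sleep` applies.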

BookingCom Web Scraping Response

Here's the Glassdoor Request/Response using the Bright Data Web Scraper API.

Glassdoor Request

Glassdoor Response

Glassdoor Debug Information

Glassdoor Debug Response

Glassdoor Finetune InstructionSet

The screenshot below shows the Glassdoor instruction set prepared for fine-tuning the Google Gemma model. Similar techniques can be used to fine-tune other LLMs.

Glassdoor Finetune Sample

Web Data Easy Scraper

Web Data Easy Webscraping Glassdoor

Web Data Easy Webscraping Glassdoor History

Bright Data - Web Data - Glassdoor company finetuned instruction set creation.

Bright Data - Web Data - Glassdoor

What problem does it solve?

This section describes the real-world problems that the Automated Webscraper Solution addresses.

  • With either the headless browser or the Bright Data Web Scraper API, automated web scraping becomes reliable and straightforward. Bright Data's solution supports scalable scraping that is scheduled or on demand, in real time or in offline batches, which covers a wide range of data collection and aggregation needs.

  • With Bright Data's Web Data solution, one can scrape a site with little effort and download the data to build a custom fine-tuned AI model.

How I Used Bright Data

The following section explains how Bright Data is used to build the Automated Web Scraping solution.

  1. Proxies and Scraping - The Booking.com implementation uses headless web scraping with the help of Bright Data. Booking.com serves as an example; the same headless-browser approach can be built for any provider that Bright Data supports.

  2. Web Scraper API - Bright Data's Web Scraper APIs handle data collection and management. Glassdoor serves as the example: the implementation fetches Glassdoor company information for a specific URL.

  3. Web Data - Bright Data's Web Data, with Glassdoor as the example. The Easy Web Scraping technique initiates the collection and downloads the Glassdoor company information as JSON. That JSON is then passed to Google's gemini-2.0-flash-exp model, which produces the structured information used to build the Google Gemma fine-tuning instruction set.
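
The instruction-set step in item 3 can be pictured as turning each downloaded company record into instruction/response pairs. The sketch below is in Python for brevity (the library itself is C#) and is not the author's implementation. The field names (`name`, `industry`, `rating`), the question wording, and the JSONL layout are all assumptions, and it uses fixed templates where the project calls gemini-2.0-flash-exp to produce the structured text.

```python
import json
from typing import Any, Dict, Iterable, List


def build_instruction_records(company: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn one company record into instruction/response pairs.

    Field names are hypothetical; missing or empty fields are skipped.
    """
    name = company.get("name")
    if not name:
        return []
    templates = {
        "industry": "Which industry does {name} operate in?",
        "rating": "What is the overall Glassdoor rating of {name}?",
    }
    records = []
    for field, question in templates.items():
        value = company.get(field)
        if value is None or value == "":
            continue
        records.append(
            {"instruction": question.format(name=name), "response": str(value)}
        )
    return records


def to_jsonl(companies: Iterable[Dict[str, Any]]) -> str:
    """Serialize all pairs as JSON Lines, one training example per line."""
    lines = []
    for company in companies:
        for record in build_instruction_records(company):
            lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

Records without a name, and fields that are missing or empty, produce no training examples, so incomplete scrape results do not end up in the instruction set.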

This submission qualifies for the following prompts:

  • Scrape Data from Complex, Interactive Websites
  • Build a Web Scraper API to Solve Business Problems
  • Most Creative Use of Web Data for AI Models
