Saarthak Saxena

Posted on Dec 30, 2024

Detoxify: Make your YouTube Feed 100x better

#brightdatachallenge

This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models

challenge : https://dev.to/challenges/brightdata

In the digital age, YouTube serves as a repository of knowledge, entertainment, and a variety of content. But with that comes a challenge: how do you filter through irrelevant videos to find the videos that ACTUALLY matter to you?

Detoxify is an intelligent Chrome extension designed to declutter your YouTube feed by filtering out the content you don’t need.

Overview of Detoxify

Detoxify isn’t just another productivity extension: it’s built to make YouTube the learning tool you always wanted. Using a fine-tuned version of Google’s BERT model, Detoxify aims to make every video on your feed match what you want to see, not what the platform wants to show you.

What Does It Do?

Detoxify fine-tunes a BERT model on data scraped with Bright Data’s scraping API and reaches a categorization accuracy of 87.8%. The trained classifier then automatically hides irrelevant videos from your feed, leaving you with only meaningful content. It supports predefined categories like Chess, Coding, and Mathematics.

Why Detoxify Stands Out

This is the first product of its kind, making YouTube more effective for people who want to focus on learning and educational content.

How Detoxify Works

Seamless User Interaction

Install the extension and select a category: Chess, Coding, or Mathematics.
Detoxify starts working instantly, fetching and filtering content as you browse.

Data Scraping with Bright Data API

Real-time metadata fetching powered by Bright Data ensures the latest content is always analyzed.

Backend with FastAPI

The backend routes data to Detoxify’s classification engine for immediate results.

BERT-Powered Categorization

Detoxify’s fine-tuned BERT model categorizes video metadata into predefined buckets.

Instant Display Updates

Videos irrelevant to your chosen category are dynamically hidden for a cleaner viewing experience.

Technical Architecture

Scraping Layer I: Bright Data API

The first layer of the scraping architecture uses Bright Data to extract a large dataset of YouTube video metadata, which is then used to fine-tune the BERT model.

1. Frontend: The Extension

The Chrome extension provides an intuitive interface for users to select a category (Chess, Coding, or Mathematics). Once a category is chosen, the extension:

Initiates a scraping request to fetch YouTube metadata in real time.
Sends the scraped data to a backend server for classification.
Dynamically hides videos that don’t match the selected category.
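
The last step in the list above comes down to a simple comparison between each predicted label and the category the user picked. The sketch below shows that decision in Python for consistency with the backend; the extension itself would run JavaScript, and the `Prediction` shape and the `video_id` field are assumptions made for illustration, not the project's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Prediction:
    video_id: str  # hypothetical identifier for one feed item
    label: str     # category predicted for that item


def videos_to_hide(predictions: List[Prediction], selected_category: str) -> List[str]:
    """Return the IDs of feed items whose predicted label differs from the user's choice."""
    wanted = selected_category.strip().lower()
    return [p.video_id for p in predictions if p.label.strip().lower() != wanted]


# Toy feed, not real data.
feed = [
    Prediction("a1", "Chess"),
    Prediction("b2", "Others"),
    Prediction("c3", "Coding"),
    Prediction("d4", "chess"),
]
hidden = videos_to_hide(feed, "Chess")
print(hidden)  # ['b2', 'c3']
```

Comparing labels case-insensitively keeps the filter robust if the model and the extension spell a category differently.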

2. Backend: Powered by FastAPI, Hugging Face, and Render

In this implementation, the FastAPI backend mainly handles GET and POST requests. The model itself is hosted on the Hugging Face Hub, with the inference pipeline deployed on Render. FastAPI acts as the intermediary, routing API calls to the Hugging Face-hosted model for predictions and returning the responses to the client in real time.
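
To make the routing role concrete, here is a minimal sketch of what such a backend could look like. The route names (`/health`, `/classify`), the request fields, and the `predict` callable are all assumptions; the post does not describe the real endpoints. The predictor is injected so that the call to the hosted model can be swapped for a stub, and FastAPI is imported lazily so the routing logic can be tried without the framework installed.

```python
from typing import Callable, Dict, List

# A predictor maps a batch of texts to a batch of category labels.
Predictor = Callable[[List[str]], List[str]]


def classify_videos(videos: List[Dict[str, str]], predict: Predictor) -> List[Dict[str, str]]:
    """Build one text per video, call the predictor once, and pair labels with IDs."""
    texts = [f"{v.get('title', '')} {v.get('description', '')}".strip() for v in videos]
    labels = predict(texts) if texts else []
    return [{"video_id": v["video_id"], "label": label} for v, label in zip(videos, labels)]


def create_app(predict: Predictor):
    """Expose classify_videos over HTTP (hypothetical routes)."""
    from fastapi import FastAPI
    from pydantic import BaseModel

    class Video(BaseModel):
        video_id: str
        title: str
        description: str = ""

    class ClassifyRequest(BaseModel):
        videos: List[Video]

    app = FastAPI()

    @app.get("/health")
    def health():
        return {"status": "ok"}

    @app.post("/classify")
    def classify(request: ClassifyRequest):
        payload = [
            {"video_id": v.video_id, "title": v.title, "description": v.description}
            for v in request.videos
        ]
        return {"predictions": classify_videos(payload, predict)}

    return app


def keyword_stub(texts: List[str]) -> List[str]:
    """Stand-in for the hosted model, used only to exercise the routing logic."""
    return ["Chess" if "chess" in t.lower() else "Others" for t in texts]


result = classify_videos(
    [
        {"video_id": "a1", "title": "Chess openings explained", "description": ""},
        {"video_id": "b2", "title": "Top 10 pranks", "description": "funny"},
    ],
    keyword_stub,
)
print(result)
```

In a deployment, `create_app` would receive a function that forwards the texts to the hosted model, and the returned app would be served by an ASGI server such as uvicorn.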

3. Model: Fine-tuned BERT

The backbone of Detoxify’s classification system is a fine-tuned BERT model trained specifically for categorizing YouTube content.

Training Dataset:

A curated collection of YouTube video metadata.
Categories include Chess, Coding, Mathematics, and Others.
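
A sequence classifier like this typically emits one raw score (logit) per category, which is turned into probabilities with a softmax before the top category is picked. The sketch below illustrates that step with the four categories listed above. The label order and the `min_confidence` fallback to "Others" are assumptions added for illustration; the post does not say whether Detoxify applies a threshold.

```python
import math
from typing import List, Tuple

LABELS = ["Chess", "Coding", "Mathematics", "Others"]  # order is an assumption


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_bucket(logits: List[float], min_confidence: float = 0.5) -> Tuple[str, float]:
    """Pick the most probable label; fall back to "Others" when the model is unsure."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "Others", probs[best]
    return LABELS[best], probs[best]


# Made-up logits for one video.
label, confidence = to_bucket([0.2, 3.1, 0.4, -1.0])
print(label, round(confidence, 3))
```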

Training Configuration:

Epochs: 10
Learning Rate: 2e-5
Batch Size: 16
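
As a rough illustration, these hyperparameters could be passed to the Hugging Face `Trainer` as shown below. Only the three values above come from this post. The `bert-base-uncased` checkpoint, the output directory, and the dataset arguments are assumptions, and the datasets are expected to be tokenized already. `transformers` is imported inside `build_trainer` so the configuration can be inspected without it installed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    epochs: int = 10                       # from the post
    learning_rate: float = 2e-5            # from the post
    batch_size: int = 16                   # from the post
    checkpoint: str = "bert-base-uncased"  # assumption: the post only says "BERT"
    labels: tuple = ("Chess", "Coding", "Mathematics", "Others")


def total_optimizer_steps(num_examples: int, cfg: FineTuneConfig) -> int:
    """ceil(examples / batch size) * epochs, assuming one device and no gradient accumulation."""
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


def build_trainer(cfg: FineTuneConfig, train_dataset, eval_dataset, output_dir: str = "detoxify-bert"):
    """Assemble a Trainer from the config; datasets must already be tokenized."""
    from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

    id2label = dict(enumerate(cfg.labels))
    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.checkpoint,
        num_labels=len(cfg.labels),
        id2label=id2label,
        label2id={name: i for i, name in id2label.items()},
    )
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=cfg.epochs,
        learning_rate=cfg.learning_rate,
        per_device_train_batch_size=cfg.batch_size,
        per_device_eval_batch_size=cfg.batch_size,
    )
    return Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)


cfg = FineTuneConfig()
steps = total_optimizer_steps(1000, cfg)  # 1000 is an illustrative dataset size
print(steps)  # 630
```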

4. Scraping Layer II: Selenium

Selenium, integrated directly within the Chrome extension, is utilized for real-time scraping of YouTube video metadata. This includes titles, descriptions, tags, and other attributes that serve as input features for the classification model.
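
A minimal sketch of the title-scraping part, using Selenium's Python bindings, might look like this. The CSS selector `a#video-title-link` and the `title` attribute are assumptions about YouTube's markup, which changes often. Descriptions and tags are not available on the feed page itself, so they are left out here. Selenium is imported inside `scrape_feed`, so the URL helper can be used on its own.

```python
from typing import Dict, List, Optional
from urllib.parse import parse_qs, urlparse


def video_id_from_url(url: str) -> Optional[str]:
    """Extract the `v` query parameter from a watch URL, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("v")
    return values[0] if values else None


def scrape_feed(driver, max_items: int = 20) -> List[Dict[str, str]]:
    """Collect video IDs and titles from the page currently loaded in `driver`."""
    from selenium.webdriver.common.by import By

    items = []
    # Assumed selector for title links on the home feed.
    links = driver.find_elements(By.CSS_SELECTOR, "a#video-title-link")
    for link in links[:max_items]:
        href = link.get_attribute("href") or ""
        title = (link.get_attribute("title") or link.text or "").strip()
        video_id = video_id_from_url(href)
        if video_id and title:
            items.append({"video_id": video_id, "title": title})
    return items


example_id = video_id_from_url("https://www.youtube.com/watch?v=abc123XYZ_-&t=10s")
print(example_id)  # abc123XYZ_-
```

With a real browser session, `scrape_feed(webdriver.Chrome())` would be called after `driver.get("https://www.youtube.com")` has loaded the feed.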

Overcoming Challenges

Dataset Creation

Sourcing a diverse and balanced dataset of YouTube video metadata was a critical step. By leveraging Bright Data’s scraping capabilities, metadata for various categories like Chess, Coding, and Mathematics was curated. This ensured the model was trained on relevant and high-quality data.
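
One simple way to check and enforce balance is to count examples per category and downsample every category to the size of the smallest one. The sketch below shows that idea on toy records. The `label` and `title` field names are assumptions, and downsampling is only one option, not necessarily what was done for Detoxify.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def class_counts(records: List[Dict[str, str]]) -> Counter:
    """Count how many records each category has."""
    return Counter(r["label"] for r in records)


def balance_by_downsampling(records: List[Dict[str, str]], seed: int = 0) -> List[Dict[str, str]]:
    """Randomly keep the same number of records per category (the smallest class size)."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)
    if not by_label:
        return []
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


# Toy records, not real scraped data.
sizes = {"Chess": 5, "Coding": 3, "Mathematics": 4, "Others": 3}
records = [
    {"title": f"{label} video {i}", "label": label}
    for label, n in sizes.items()
    for i in range(n)
]
balanced = balance_by_downsampling(records)
print(class_counts(records))
print(class_counts(balanced))
```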

Model Accuracy

Achieving high classification accuracy required extensive experimentation with hyperparameters. Fine-tuning the BERT model on domain-specific data improved the accuracy to 87.8%.

Performance Metrics

The BERT model’s classification performance was evaluated per category. Alongside the 87.8% overall accuracy, the ROC-AUC scores are:

Chess: 0.976
Coding: 0.971
Mathematics: 0.949
Other: 0.941
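
For readers unfamiliar with per-category ROC-AUC, the sketch below shows how a one-vs-rest score can be computed: for each category, it is the probability that a randomly chosen video from that category gets a higher score than a randomly chosen video from any other category. The labels and probabilities here are toy values chosen for illustration, not Detoxify's evaluation data, and this pairwise formula is a plain-Python stand-in for whatever library was actually used.

```python
from typing import Dict, List


def roc_auc(scores: List[float], positives: List[bool]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, is_pos in zip(scores, positives) if is_pos]
    neg = [s for s, is_pos in zip(scores, positives) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true: List[str], probabilities: List[List[float]], labels: List[str]) -> Dict[str, float]:
    """Compute one ROC-AUC per label, treating that label as positive and the rest as negative."""
    return {
        label: roc_auc([row[i] for row in probabilities], [y == label for y in y_true])
        for i, label in enumerate(labels)
    }


labels = ["Chess", "Coding", "Mathematics", "Others"]
y_true = ["Chess", "Coding", "Mathematics", "Others", "Chess", "Coding"]
probabilities = [  # toy model outputs, one row per video
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.20, 0.30, 0.40, 0.10],
    [0.30, 0.20, 0.10, 0.40],
    [0.25, 0.35, 0.20, 0.20],
    [0.10, 0.50, 0.30, 0.10],
]
aucs = one_vs_rest_auc(y_true, probabilities, labels)
print(aucs)
```

In this toy set the fifth video is a Chess video that the model scores higher for Coding, which is why the Chess score drops below 1.0 while the other three categories stay perfectly ranked.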

Future Directions

While Detoxify is currently limited to three categories, its architecture leaves plenty of room to grow. Possible future enhancements include:

Reducing Processing Time: Optimizing the extension to improve the speed of content analysis and delivery.
User-Centric Customization: Allowing users to input their preferences, enhancing flexibility and personalization in content selection.
Advanced Machine Learning Integration: Utilizing more advanced machine learning algorithms to provide better classifications, streamlining the user experience.

Code

Experience Detoxify in action:
GitHub Repository

Closing Thoughts

Detoxify redefines how users interact with YouTube, delivering a personalized, distraction-free experience. By combining AI, Bright Data’s robust scraping capabilities, and user-centric design, Detoxify stands as a powerful tool for content curation. Whether you're diving into coding tutorials, mastering chess strategies, or exploring mathematical concepts, Detoxify ensures every moment on YouTube adds value to your journey.

Let’s make digital content smarter, one category at a time.

Contact

Have questions or suggestions? Reach out to the team:

Saarthak Saxena: saarthaksaxena7@gmail.com, @curlydazai
Prakhar Agrawal: prakhar2085@gmail.com

GitHub Repository: Detoxify on GitHub
