This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models
Challenge: https://dev.to/challenges/brightdata
In the digital age, YouTube serves as a repository of knowledge, entertainment, and a variety of content. But with that comes a challenge: how do you filter through irrelevant videos to find the videos that ACTUALLY matter to you?
Detoxify is an intelligent Chrome extension designed to declutter your YouTube feed by filtering out the content you don’t need.
Overview of Detoxify
Detoxify isn’t just another productivity extension; it’s built to make YouTube a better place to learn. Using a fine-tuned version of Google’s BERT model, Detoxify keeps your feed aligned with what you want to see rather than what the platform wants to show you.
What Does It Do?
Detoxify fine-tunes a BERT model on data scraped with Bright Data’s scraping API and reaches a categorization accuracy of 87.8%. The resulting classifier automatically hides irrelevant videos from your feed, leaving you with only meaningful content. It supports predefined categories like Chess, Coding, and **Mathemat
Why Detoxify Stands Out
To our knowledge, this is the first product of its kind, making YouTube more effective for people who want to focus on learning and educational content.
How Detoxify Works
Seamless User Interaction
- Install the extension and select a category—Chess, Coding, or Mathematics.
- Detoxify starts working instantly, fetching and filtering content as you browse.
Data Scraping with Bright Data API
Real-time metadata fetching powered by Bright Data ensures the latest content is always analyzed.
Backend with FastAPI
The backend routes data to Detoxify’s classification engine for immediate results.
BERT-Powered Categorization
Detoxify’s fine-tuned BERT model categorizes video metadata into predefined buckets.
Instant Display Updates
Videos irrelevant to your chosen category are dynamically hidden for a cleaner viewing experience.
Technical Architecture
Scraping Layer I: Bright Data API
The first layer of the scraping architecture uses Bright Data to efficiently extract a large dataset of YouTube video metadata, which is then used to fine-tune the BERT model.
1. Frontend: The Extension
The Chrome extension provides an intuitive interface for users to select a category (Chess, Coding, or Mathematics). Once a category is chosen, the extension:
- Initiates a scraping request to fetch YouTube metadata in real time.
- Sends the scraped data to a backend server for classification.
- Dynamically hides videos that don’t match the selected category.
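The three steps above end in a simple decision: which videos to hide. The extension itself would run in the browser, so the Python sketch below only illustrates that decision logic. The `ClassifiedVideo` shape and the `videos_to_hide` helper are hypothetical names introduced for illustration, not part of Detoxify's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedVideo:
    """One feed item after classification (hypothetical shape)."""
    video_id: str
    title: str
    predicted_category: str


def videos_to_hide(videos, selected_category):
    """Return IDs of videos whose predicted category differs from the user's choice."""
    wanted = selected_category.strip().lower()
    return [
        v.video_id
        for v in videos
        if v.predicted_category.strip().lower() != wanted
    ]


# Made-up feed items, for illustration only.
feed = [
    ClassifiedVideo("a1", "Sicilian Defense explained", "Chess"),
    ClassifiedVideo("b2", "Celebrity gossip roundup", "Others"),
    ClassifiedVideo("c3", "Python decorators in 10 minutes", "Coding"),
]
hidden = videos_to_hide(feed, "Chess")
print(hidden)  # ['b2', 'c3']
```

Comparing categories case-insensitively is a design choice made here so that a label such as "chess" from the UI still matches "Chess" from the model.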
2. Backend: Powered by FastAPI, Hugging Face, and Render
The FastAPI backend handles the GET and POST requests coming from the extension. The model itself is hosted on the Hugging Face Hub, with the inference pipeline deployed on Render. FastAPI acts as the intermediary: it routes API calls to the Hugging Face-hosted model for predictions and returns the responses to the client in real time.
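The routing role described here can be sketched without any web framework. The snippet below is a minimal illustration using only the standard library: it assumes the model returns a list of `{"label", "score"}` dictionaries (the usual shape of a text-classification result) and that the request body carries `title` and `description` fields. The function names, the payload format, and the stub model are all assumptions; in the real backend this logic would sit inside a FastAPI route.

```python
import json


def build_inference_payload(title, description=""):
    """Serialize video metadata into a JSON body for the model (assumed format)."""
    text = f"{title}. {description}".strip()
    return json.dumps({"inputs": text})


def pick_category(model_response):
    """Pick the highest-scoring label from a list of {"label", "score"} dicts."""
    best = max(model_response, key=lambda item: item["score"])
    return {"category": best["label"], "confidence": round(best["score"], 3)}


def handle_classify(request_body, call_model):
    """Route one POST body to the model and shape the reply for the extension.

    `call_model` is injected so the hosted model can be replaced by a stub.
    """
    data = json.loads(request_body)
    payload = build_inference_payload(data["title"], data.get("description", ""))
    return pick_category(call_model(payload))


def stub_model(payload):
    """Stand-in for the hosted model; the scores are made up."""
    return [
        {"label": "Chess", "score": 0.91},
        {"label": "Coding", "score": 0.04},
        {"label": "Mathematics", "score": 0.03},
        {"label": "Others", "score": 0.02},
    ]


body = json.dumps({"title": "Queen's Gambit opening traps"})
reply = handle_classify(body, stub_model)
print(reply)  # {'category': 'Chess', 'confidence': 0.91}
```

Injecting `call_model` keeps the routing logic testable offline; swapping the stub for a real HTTP call to the hosted model would not change `handle_classify`.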
3. Model: Fine-tuned BERT
The backbone of Detoxify’s classification system is a fine-tuned BERT model trained specifically for categorizing YouTube content.
Training Dataset:
- A curated collection of YouTube video metadata.
- Categories include Chess, Coding, Mathematics, and Others.
Training Configuration:
- Epochs: 10
- Learning Rate: 2e-5
- Batch Size: 16
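To make these hyperparameters concrete, the sketch below stores them in a small config object and derives how many optimizer updates they imply. The dataset size of 4000 examples and the linear learning-rate decay are assumptions chosen for illustration; the post does not state either.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as listed in the post."""
    epochs: int = 10
    learning_rate: float = 2e-5
    batch_size: int = 16


def total_optimizer_steps(num_examples, config):
    """Number of parameter updates, assuming one update per batch."""
    steps_per_epoch = math.ceil(num_examples / config.batch_size)
    return steps_per_epoch * config.epochs


def linear_decay_lr(step, total_steps, config):
    """Learning rate under a linear decay to zero (an assumed schedule)."""
    remaining = max(0, total_steps - step)
    return config.learning_rate * remaining / total_steps


config = TrainingConfig()
steps = total_optimizer_steps(4000, config)  # 4000 is a made-up dataset size
print(steps)  # 2500
print(linear_decay_lr(steps // 2, steps, config))  # roughly half of 2e-5
```

With 16 examples per batch, 4000 examples give 250 batches per epoch, so 10 epochs amount to 2500 updates.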
4. Scraping Layer II: Selenium
Selenium, integrated directly within the Chrome extension, is used to scrape YouTube video metadata in real time. This includes titles, descriptions, tags, and other attributes that serve as input features for the classification model.
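A text classifier expects a single string, so the scraped fields have to be combined before they reach the model. The sketch below shows one possible way to do that. The `build_model_input` helper, the ` | ` separator, and the character cap are assumptions for illustration; the post does not describe how Detoxify joins the fields, and a BERT tokenizer would apply its own token-level truncation as well.

```python
def build_model_input(title, description="", tags=(), max_chars=1000):
    """Join metadata fields into one string for a text classifier."""
    parts = [title.strip()]

    # Collapse newlines and repeated spaces in the description.
    clean_description = " ".join(description.split())
    if clean_description:
        parts.append(clean_description)

    # Drop empty tags and surrounding whitespace.
    clean_tags = [t.strip() for t in tags if t.strip()]
    if clean_tags:
        parts.append("tags: " + ", ".join(clean_tags))

    return " | ".join(parts)[:max_chars]


# Made-up metadata, for illustration only.
text = build_model_input(
    "  Magnus Carlsen endgame  ",
    "Rook  endings\nexplained",
    ["chess", " endgame ", ""],
)
print(text)
# Magnus Carlsen endgame | Rook endings explained | tags: chess, endgame
```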
Overcoming Challenges
Dataset Creation
Sourcing a diverse and balanced dataset of YouTube video metadata was a critical step. Using Bright Data’s scraping capabilities, we curated metadata for categories such as Chess, Coding, and Mathematics, so that the model was trained on relevant, high-quality data.
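One common way to get a balanced dataset is to downsample every class to the size of the smallest one. The sketch below illustrates that idea on made-up data. It is not necessarily how the Detoxify dataset was balanced, and the helper names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def class_counts(examples):
    """Count examples per label; each example is a (text, label) pair."""
    return Counter(label for _, label in examples)


def downsample_to_balance(examples, seed=0):
    """Randomly downsample every class to the size of the smallest class."""
    if not examples:
        return []
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    rng.shuffle(balanced)
    return balanced


# Made-up, deliberately imbalanced data.
toy = (
    [(f"chess video {i}", "Chess") for i in range(5)]
    + [(f"coding video {i}", "Coding") for i in range(3)]
    + [(f"math video {i}", "Mathematics") for i in range(8)]
)
balanced = downsample_to_balance(toy)
print(class_counts(toy))
print(class_counts(balanced))
```

Downsampling discards data, so with a small scraped dataset it may be preferable to collect more examples for the under-represented categories instead.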
Model Accuracy
Achieving high classification accuracy required extensive experimentation with hyperparameters. Fine-tuning the BERT model on domain-specific data improved the accuracy to 87.8%.
Performance Metrics
The fine-tuned BERT model was evaluated on each category. Alongside the 87.8% overall accuracy, it achieves the following ROC-AUC scores:
- Chess: 0.976
- Coding: 0.971
- Mathematics: 0.949
- Others: 0.941
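For readers unfamiliar with the metric, a per-category ROC-AUC can be read as the probability that the model scores a randomly chosen video from that category higher than a randomly chosen video from any other category. The sketch below computes it one-vs-rest with that pairwise definition on made-up predictions. It illustrates the metric only; it is not the evaluation code behind the numbers above.

```python
def roc_auc(scores, is_positive):
    """ROC-AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the input size, which is fine
    for a small illustration.
    """
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(probabilities, labels, category):
    """Treat `category` as the positive class and every other label as negative."""
    scores = [row[category] for row in probabilities]
    is_positive = [label == category for label in labels]
    return roc_auc(scores, is_positive)


# Made-up predicted probabilities and true labels.
labels = ["Chess", "Coding", "Chess", "Others"]
probabilities = [
    {"Chess": 0.90, "Coding": 0.05, "Mathematics": 0.03, "Others": 0.02},
    {"Chess": 0.20, "Coding": 0.70, "Mathematics": 0.05, "Others": 0.05},
    {"Chess": 0.40, "Coding": 0.10, "Mathematics": 0.10, "Others": 0.40},
    {"Chess": 0.60, "Coding": 0.10, "Mathematics": 0.10, "Others": 0.20},
]
print(one_vs_rest_auc(probabilities, labels, "Chess"))   # 0.75
print(one_vs_rest_auc(probabilities, labels, "Coding"))  # 1.0
```

In the toy data, one Chess video (score 0.40) is ranked below a non-Chess video (score 0.60), so three of the four positive-negative pairs are ordered correctly and the Chess AUC is 0.75.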
Future Directions
While Detoxify currently supports only three categories, its architecture leaves plenty of room for improvement. Possible future enhancements include:
- Reducing Processing Time: Optimizing the extension to improve the speed of content analysis and delivery.
- User-Centric Customization: Allowing users to input their preferences, enhancing flexibility and personalization in content selection.
- Advanced Machine Learning Integration: Utilizing more advanced machine learning algorithms to provide better classifications, streamlining the user experience.
Code
Experience Detoxify in action:
GitHub Repository
Closing Thoughts
Detoxify redefines how users interact with YouTube, delivering a personalized, distraction-free experience. By combining AI, Bright Data’s robust scraping capabilities, and user-centric design, Detoxify stands as a powerful tool for content curation. Whether you're diving into coding tutorials, mastering chess strategies, or exploring mathematical concepts, Detoxify ensures every moment on YouTube adds value to your journey.
Let’s make digital content smarter, one category at a time.
Contact
Have questions or suggestions? Reach out to the team:
- Saarthak Saxena: saarthaksaxena7@gmail.com, @curlydazai
- Prakhar Agrawal: prakhar2085@gmail.com