(SOTA) AI agent to generate real-time datasets for AI/ML projects on demand - Perpendicular AI

Prasanjit dutta — Sun, 25 May 2025 20:00:15 +0000

This is a submission for the Bright Data AI Web Access Hackathon

I built this project for the Bright Data MCP Hackathon. I took part to experiment with MCP and because I like building things. I am currently open to work and put a lot of effort into this project, so I would be very thankful if you could react to this article and share it.

What I Built

Perpendicular AI is an AI agent that generates real-time datasets for AI/ML projects using advanced web scraping. It solves the challenge of acquiring up-to-date, trustworthy datasets by:

Interpreting user queries to identify specific data needs
Locating relevant sources with the search tools provided by Bright Data MCP
Extracting and structuring data from diverse web pages using Bright Data MCP
Creating tailored schemas for seamless data integration
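
The article does not describe how the schema step is implemented, so the following is only a minimal sketch of one way it could work, assuming the extraction step yields a list of dictionaries. The function names `infer_schema` and `normalize` are hypothetical and do not come from the Perpendicular repo.

```python
from typing import Any


def infer_schema(records: list[dict[str, Any]]) -> dict[str, str]:
    """Map each field seen in any record to a type name (assumed approach)."""
    schema: dict[str, str] = {}
    for record in records:
        for key, value in record.items():
            if value is None:
                # Remember the field, but let a later non-null value set its type.
                schema.setdefault(key, "null")
                continue
            type_name = type(value).__name__
            if schema.get(key) in (None, "null"):
                schema[key] = type_name
            elif schema[key] != type_name:
                # Conflicting types across records are flagged rather than guessed.
                schema[key] = "mixed"
    return schema


def normalize(records: list[dict[str, Any]], schema: dict[str, str]) -> list[dict[str, Any]]:
    """Give every record the same columns, filling missing fields with None."""
    return [{field: record.get(field) for field in schema} for record in records]


if __name__ == "__main__":
    scraped = [
        {"title": "A", "price": 9.5},
        {"title": "B", "rating": 4},
    ]
    schema = infer_schema(scraped)
    print(schema)
    print(normalize(scraped, schema))
```

Filling missing fields with `None` keeps every row the same shape, which makes the result easier to load into a table or DataFrame later.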

Capabilities

Perpendicular can create real-time datasets from:

Any specific site when provided with a URL
General web
Twitter posts
LinkedIn data
Instagram Posts
Booking.com
Zillow
Amazon data
YouTube
ZoomInfo

Demo

Demo of dataset generation using the Perpendicular AI agent.

Perpendicular GitHub Repo
Here is the GitHub repo link for the project. Clone it and follow the instructions in README.md to set it up and get it running.

You will need Gemini API keys and a Bright Data MCP setup.

Some screenshots of output

How I Used Bright Data's Infrastructure

The system leverages Bright Data's infrastructure through its MCP (Model Context Protocol):

Web Content Access: The agent uses Bright Data's tools to:
- Access websites protected by bot detection and CAPTCHAs
- Extract structured data from protected sites (Amazon, LinkedIn, etc.)
- Navigate complex web pages using remote browser capabilities
Real-time Search: Bright Data's search engine enables the agent to:
- Discover up-to-date sources for requested data
- Verify information freshness
- Expand search coverage beyond standard search engines
MCP Integration: The system uses the following Bright Data MCP tools:
- search_engine to perform comprehensive web searches
- scraping_browser_get_text to extract visible content from pages
- Platform-specific tools such as web_data_amazon_product_reviews and web_data_youtube_videos whenever a platform like Instagram, LinkedIn, Amazon, Facebook, X, Zillow, Booking.com, or YouTube is detected as a data source
- General-purpose Bright Data MCP tools to navigate other sites whenever a discovered source is not one of the platforms above
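
The routing between platform-specific and general-purpose tools can be illustrated with a small sketch. The tool names below are the ones mentioned in this article; the domain mapping, the `pick_tool` helper, and the choice of `scraping_browser_get_text` as the fallback are assumptions for illustration, not the agent's actual logic.

```python
from urllib.parse import urlparse

# Tool names come from the article; the domain-to-tool mapping is an assumption.
PLATFORM_TOOLS = {
    "amazon.com": "web_data_amazon_product_reviews",
    "youtube.com": "web_data_youtube_videos",
}

# Assumed fallback for sources that are not a recognized platform.
FALLBACK_TOOL = "scraping_browser_get_text"


def pick_tool(url: str) -> str:
    """Return the MCP tool name to try for a given source URL."""
    host = (urlparse(url).hostname or "").lower()
    for domain, tool in PLATFORM_TOOLS.items():
        # Match the bare domain and any subdomain such as www.amazon.com.
        if host == domain or host.endswith("." + domain):
            return tool
    return FALLBACK_TOOL


if __name__ == "__main__":
    for source in (
        "https://www.amazon.com/dp/EXAMPLE",
        "https://youtube.com/watch?v=EXAMPLE",
        "https://example.org/data",
    ):
        print(source, "->", pick_tool(source))
```

Matching on the hostname suffix rather than a substring avoids sending a URL like `https://notamazon.com.example.org` to the Amazon tool.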

Performance Improvements

By leveraging Bright Data's real-time web access, the system achieves significant improvements:

Data Accuracy: Reduces hallucinated and fake data by:
- Accessing primary sources directly
- Verifying information against multiple sources
- Using up-to-date web content
Data Collection Efficiency: Optimizes data collection through:
- Automated navigation of complex sites
- Structured data extraction from diverse formats
- Rapid adaptation to changing web structures
- Minimizing manual intervention in data gathering
- Fast gathering of web data
Reliability: Ensures consistent operation with:
- Automatic retry mechanisms
- Bot protection bypass
- CAPTCHA solving capabilities
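
The article does not say how the retry mechanism works internally. As a rough illustration of the general idea only, here is a client-side retry wrapper with exponential backoff; `with_retries` and its parameters are hypothetical and are not part of Bright Data's API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky_fetch() -> str:
        # Stand-in for a scraping call that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("temporary failure")
        return "page content"

    print(with_retries(flaky_fetch, base_delay=0.1))
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting.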

Conclusion

The Bright Data MCP server works well, and Bright Data's underlying capabilities are excellent. It scrapes and navigates web pages reliably, including pages protected by bot detection and CAPTCHAs. It is fast, and its retry mechanism is reliable.