<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prasanjit dutta</title>
    <description>The latest articles on DEV Community by Prasanjit dutta (@prasanjit101).</description>
    <link>https://dev.to/prasanjit101</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3169058%2F7ed0471f-3251-437d-877c-148a106c0de8.png</url>
      <title>DEV Community: Prasanjit dutta</title>
      <link>https://dev.to/prasanjit101</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prasanjit101"/>
    <language>en</language>
    <item>
      <title>(SOTA) AI agent to generate real-time dataset for AI ML projects on demand - Perpendicular AI</title>
      <dc:creator>Prasanjit dutta</dc:creator>
      <pubDate>Sun, 25 May 2025 20:00:15 +0000</pubDate>
      <link>https://dev.to/prasanjit101/sota-ai-agent-to-generate-real-time-dataset-for-ai-ml-projects-on-demand-perpendicular-ai-4n31</link>
      <guid>https://dev.to/prasanjit101/sota-ai-agent-to-generate-real-time-dataset-for-ai-ml-projects-on-demand-perpendicular-ai-4n31</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/brightdata-2025-05-07"&gt;Bright Data AI Web Access Hackathon&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I built this project for the Bright Data MCP Hackathon. I took part to experiment with MCP and because I like building. I am currently open to work and put a lot of effort into this project, so I would be very thankful if you could react to this article and share it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Perpendicular AI is an AI agent that generates real-time datasets for AI/ML projects through web scraping. It addresses the challenge of acquiring up-to-date, trustworthy datasets by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interpreting user queries to identify specific data needs&lt;/li&gt;
&lt;li&gt;Locating relevant sources with the search tools provided by Bright Data MCP&lt;/li&gt;
&lt;li&gt;Extracting and structuring data from diverse web pages using Bright Data MCP&lt;/li&gt;
&lt;li&gt;Creating tailored schemas for seamless data integration&lt;/li&gt;
&lt;/ul&gt;
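
&lt;p&gt;As a rough illustration of the last step, here is a minimal sketch of how a schema could be inferred from extracted records. It is not taken from the Perpendicular codebase: the function name &lt;code&gt;infer_schema&lt;/code&gt; and the sample records are hypothetical, and the agent itself may build schemas quite differently.&lt;/p&gt;

```python
def infer_schema(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = {}
    for record in records:
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)
    # Sort the type names so the result is deterministic.
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical records, as they might look after extraction.
sample = [
    {"title": "Wireless mouse", "rating": 4.5, "reviews": 120},
    {"title": "USB hub", "rating": None, "reviews": 8},
]
schema = infer_schema(sample)
print(schema)
```

&lt;p&gt;A field that maps to more than one type name, such as &lt;code&gt;rating&lt;/code&gt; here, points to missing or inconsistent values that are worth cleaning before the dataset is used.&lt;/p&gt;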




&lt;h2&gt;
  
  
  Capabilities
&lt;/h2&gt;

&lt;p&gt;Perpendicular can create real-time datasets from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Any specific site when provided with a URL&lt;/li&gt;
&lt;li&gt;The general web&lt;/li&gt;
&lt;li&gt;Twitter posts&lt;/li&gt;
&lt;li&gt;LinkedIn data&lt;/li&gt;
&lt;li&gt;Instagram posts&lt;/li&gt;
&lt;li&gt;Booking.com&lt;/li&gt;
&lt;li&gt;Zillow&lt;/li&gt;
&lt;li&gt;Amazon data&lt;/li&gt;
&lt;li&gt;YouTube&lt;/li&gt;
&lt;li&gt;ZoomInfo&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Demo of dataset generation using the Perpendicular AI agent.&lt;br&gt;
  &lt;iframe src="https://www.youtube.com/embed/Dvq2Jw3fOt0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/prasanjit101/perpendicular" rel="noopener noreferrer"&gt;Perpendicular GitHub Repo&lt;/a&gt;&lt;br&gt;
Here is the GitHub repo for the project. Clone it and follow the instructions in README.md to set it up and get it running.&lt;/p&gt;

&lt;p&gt;You will need Gemini API keys and a Bright Data MCP setup.&lt;/p&gt;

&lt;p&gt;Some screenshots of the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forf3p9dzb6unldph6vvf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forf3p9dzb6unldph6vvf.png" alt="Create dataset from amazon product review"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86qxxlqqp6k2w49qchae.png" alt="Output screenshot"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Bright Data's Infrastructure
&lt;/h2&gt;

&lt;p&gt;The system uses Bright Data's infrastructure through its MCP (Model Context Protocol) server:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Web Content Access&lt;/strong&gt;: The agent uses Bright Data's tools to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get past bot protection and CAPTCHAs on protected websites&lt;/li&gt;
&lt;li&gt;Extract structured data from protected sites (Amazon, LinkedIn, etc.)&lt;/li&gt;
&lt;li&gt;Navigate complex web pages using remote browser capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Real-time Search&lt;/strong&gt;: Bright Data's search engine enables the agent to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover up-to-date sources for requested data&lt;/li&gt;
&lt;li&gt;Verify information freshness&lt;/li&gt;
&lt;li&gt;Expand search coverage beyond standard search engines&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MCP Integration&lt;/strong&gt;: The system uses the following Bright Data MCP tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;search_engine&lt;/code&gt; tool, to perform comprehensive web searches&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;scraping_browser_get_text&lt;/code&gt; tool, to extract visible content from pages&lt;/li&gt;
&lt;li&gt;Platform-specific tools such as &lt;code&gt;web_data_amazon_product_reviews&lt;/code&gt; and &lt;code&gt;web_data_youtube_videos&lt;/code&gt;, whenever a platform such as Instagram, LinkedIn, Amazon, Facebook, X, Zillow, Booking.com, or YouTube is detected as a data source&lt;/li&gt;
&lt;li&gt;Bright Data MCP tools for navigating general sites, whenever a discovered source is not one of the platforms above&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
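
&lt;p&gt;The choice between a platform-specific tool and general page scraping can be sketched roughly as follows. This is an assumption about how such routing might look, not the project's actual code: the &lt;code&gt;pick_tool&lt;/code&gt; function and the domain mapping are hypothetical, and only the tool names already mentioned in this article are used.&lt;/p&gt;

```python
from urllib.parse import urlparse

# Only the two platform tools named in this article are listed here;
# the domain-to-tool mapping itself is an illustrative assumption.
PLATFORM_TOOLS = {
    "amazon.com": "web_data_amazon_product_reviews",
    "youtube.com": "web_data_youtube_videos",
}
FALLBACK_TOOL = "scraping_browser_get_text"


def pick_tool(url):
    """Return the MCP tool name to use for a source URL."""
    host = (urlparse(url).hostname or "").lower()
    for domain, tool in PLATFORM_TOOLS.items():
        # Match the domain itself or any subdomain such as www.amazon.com.
        if host == domain or host.endswith("." + domain):
            return tool
    return FALLBACK_TOOL


print(pick_tool("https://www.amazon.com/dp/EXAMPLE"))
print(pick_tool("https://example.org/blog"))
```

&lt;p&gt;Matching on the hostname rather than on the whole URL keeps a link that merely mentions a platform in its path or query string from being routed to the wrong tool.&lt;/p&gt;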




&lt;h2&gt;
  
  
  Performance Improvements
&lt;/h2&gt;

&lt;p&gt;By using Bright Data's real-time web access, the system improves in three areas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Accuracy&lt;/strong&gt;: Reduces hallucinated and fake data by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accessing primary sources directly&lt;/li&gt;
&lt;li&gt;Verifying information against multiple sources&lt;/li&gt;
&lt;li&gt;Using up-to-date web content&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Collection Efficiency&lt;/strong&gt;: Optimizes data collection through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated navigation of complex sites&lt;/li&gt;
&lt;li&gt;Structured data extraction from diverse formats&lt;/li&gt;
&lt;li&gt;Rapid adaptation to changing web structures&lt;/li&gt;
&lt;li&gt;Minimal manual intervention in data gathering&lt;/li&gt;
&lt;li&gt;Fast gathering of web data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: Ensures consistent operation with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic retry mechanisms&lt;/li&gt;
&lt;li&gt;Bot protection bypass&lt;/li&gt;
&lt;li&gt;CAPTCHA solving capabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Bright Data MCP server is good, and Bright Data's underlying capabilities are excellent. It scrapes and navigates web pages well, including pages protected by bot detection and CAPTCHAs. It is fast, and its retry mechanism is reliable.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>brightdatachallenge</category>
      <category>ai</category>
      <category>webdata</category>
    </item>
  </channel>
</rss>
