DEV Community

uratmangun

Make Cursor Composer Smarter with Bright Web Scraping Capabilities

This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models

What I Built

IDEs like Cursor now ship with an AI agent that helps us code faster and more easily, but the agent in this case, Composer, has limited capabilities. Cursor's chat can search the web, and if you paste a URL into it, it automatically scrapes that page and answers based on the scraped content. Composer, Cursor's AI agent, has neither capability. Here is the chat feature in Cursor; as you can see, it offers web search:

Image description

But if you go to Cursor Composer, it doesn't have that, especially in agent mode:

Image description

So my goal here is to build a web scraper and web searcher using Bright Data and Gemini's OpenAI-compatible model, making Cursor Composer smarter with web search and web scraping.

Demo

Web Scraper

A powerful web scraping utility built with Playwright that can extract content and links from websites.

Prerequisites

  • Node.js 18+
  • pnpm (recommended) or bun

Installation & Usage

You can use this tool in two ways:

1. Using npx (Recommended)

Run directly without installation using npx:

npx @uratmangun/scraper-tool show-content <html|text> <url>
# or
npx @uratmangun/scraper-tool search "<query>"

2. Local Installation

  1. Install dependencies:
pnpm install
# or
bun install
  2. Set up environment variables:
cp .env.example .env.local

Then edit .env.local and set your BRIGHT_PLAYWRIGHT_URL for the Playwright CDP connection.
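
As a rough illustration of what this variable is for, the sketch below checks that BRIGHT_PLAYWRIGHT_URL is a well-formed endpoint and then attaches Playwright to it over CDP. The helper names `readCdpUrl` and `connectBrowser` are made up for this example; only `chromium.connectOverCDP` is a real Playwright call, and the actual tool may load and validate the variable differently.

```javascript
// Hypothetical helper: read and sanity-check the CDP endpoint from the environment.
function readCdpUrl(env = process.env) {
  const raw = env.BRIGHT_PLAYWRIGHT_URL;
  if (!raw) {
    throw new Error('BRIGHT_PLAYWRIGHT_URL is not set');
  }
  const { protocol } = new URL(raw); // throws on a malformed URL
  const allowed = ['ws:', 'wss:', 'http:', 'https:'];
  if (!allowed.includes(protocol)) {
    throw new Error(`Unexpected protocol for a CDP endpoint: ${protocol}`);
  }
  return raw;
}

// Hypothetical helper: attach Playwright to the remote browser over CDP.
// Playwright is imported lazily so readCdpUrl works even without it installed.
async function connectBrowser(env = process.env) {
  const { chromium } = await import('playwright');
  return chromium.connectOverCDP(readCdpUrl(env));
}
```

Failing early on a missing or malformed URL gives a clearer error than a connection timeout later on.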

Commands

View Content

# Using npx
npx @uratmangun/scraper-tool show-content <html|text> <url>

# Using local installation
pnpm run scrape show-content <html|text> <url>

Example:

npx @uratmangun/scraper-tool show-content html https://example.com
# or
npx @uratmangun/scraper-tool show-content text https://example.com

This will display either the HTML or plain text content of the specified URL…

How I Used Bright Data

To do that, I first wrote a script that uses Bright Data's web scraper:

Bright Data - Web Data Platform

Let's create a script called scrape.mjs that loads a URL through Bright Data and prints the page's HTML.
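
A minimal sketch of such a script, assuming Playwright is installed and BRIGHT_PLAYWRIGHT_URL holds the CDP endpoint. It is an illustration rather than the exact script used here; `showContent` and `pickMode` are hypothetical names, and the optional text mode is borrowed from the README above.

```javascript
// Hypothetical helper: normalise the output mode, defaulting to raw HTML.
function pickMode(mode) {
  return mode === 'text' ? 'text' : 'html';
}

// Sketch: load a URL through the remote browser and return its content.
async function showContent(url, mode = 'html') {
  const { chromium } = await import('playwright'); // lazy import
  const browser = await chromium.connectOverCDP(process.env.BRIGHT_PLAYWRIGHT_URL);
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return pickMode(mode) === 'text'
      ? await page.innerText('body') // visible text only
      : await page.content();        // full serialized HTML
  } finally {
    await browser.close(); // always release the remote session
  }
}
```

Calling `showContent('https://example.com').then(console.log)` would print the page's HTML.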

You can use it like this:

pnpm run scrape show-content <url>

It will show the HTML content of a URL like this:

...
  <div id="js-global-screen-reader-notice" class="sr-only mt-n1" aria-live="polite" aria-atomic="true"></div>
    <div id="js-global-screen-reader-notice-assertive" class="sr-only mt-n1" aria-live="assertive" aria-atomic="true"></div>


This works, but the output is still unstructured. Before turning it into structured data that both machines and humans can understand, let's also build a search on top of Bright Data's web scraper, in a script called search.mjs.
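
One way such a search script could work is to build a results-page URL and load it through the same remote browser. The sketch below assumes Google as the search engine, which the sample output suggests but the text does not state; `buildSearchUrl` and `search` are hypothetical names.

```javascript
// Hypothetical helper: build a search-results URL for a query.
// Google is an assumption; any engine with a ?q= parameter works the same way.
function buildSearchUrl(query) {
  const url = new URL('https://www.google.com/search');
  url.searchParams.set('q', query); // handles URL encoding for us
  return url.toString();
}

// Sketch: fetch the raw results page through the remote browser.
async function search(query) {
  const { chromium } = await import('playwright'); // lazy import
  const browser = await chromium.connectOverCDP(process.env.BRIGHT_PLAYWRIGHT_URL);
  try {
    const page = await browser.newPage();
    await page.goto(buildSearchUrl(query), { waitUntil: 'domcontentloaded' });
    return await page.content(); // still unstructured HTML at this point
  } finally {
    await browser.close();
  }
}
```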

You can run it with pnpm run search <your query>, and it will print something like this:

...
AAAAAEAACgIQAAAAAACgAAAAAAAAAAAAAABIAAAAAAAAECAABEJCAAAEAAAAAMACAAAILAABAgAEAAAAAAAEAAgAIEAEYL__OAAAAAAAAAAAAAQCABEAAAAAAHABABAE0d4AAQAAAAgAAAAMAAAAQAAAAAAAAAUAAAAAAAAAAAQAAAAAAAAAAAAAAAABAPoBAAAAAAAAAAAAAAACAAAAAABggAIAAvgBAAAAAACAAwAAAAABAQAAOAIGIAAAAAAAAAD3AcDjAeGQwgIAAAAAAAAAAAAAAAABSBDMgfQXBCAAAAAAAAAAAAAAAAAAAJAiaOJyAwAC/d=0/dg=0/br=1/rs=ACT90oEbXjTDEsqDs2o3NzHTmzVZxjp5ng/m=sy27z,sy28k,sy27w,sy29c,sy28x,sy28v,sy288,sy280,M0O4le?xjs=s4" nonce=""></script></body></html>

Again, this is still unstructured, so we need to parse it with AI. I'm using Gemini to convert the content into a more parseable format, JSON. We'll start by extracting the links from the unstructured search output, using a script that feeds the data to the model and gets JSON back.
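
A sketch of that step, assuming Gemini's OpenAI-compatible chat completions endpoint and Node 18+ for the built-in `fetch`. The model name, the prompt wording, and the helper names `buildExtractionRequest` and `extractLinks` are assumptions for illustration, not the values used in this project.

```javascript
// Assumed endpoint and model name; check the current Gemini documentation.
const GEMINI_URL = 'https://generativelanguage.googleapis.com/v1beta/openai/chat/completions';
const MODEL = 'gemini-1.5-flash';

// Hypothetical helper: build the chat request that asks for links as JSON.
function buildExtractionRequest(html) {
  return {
    model: MODEL,
    messages: [
      {
        role: 'system',
        content:
          'Extract the search results from the HTML. Reply with only a JSON array ' +
          'of objects with "title", "url" and "description" keys.',
      },
      { role: 'user', content: html },
    ],
  };
}

// Sketch: send the request and return the assistant message ({ role, content }).
async function extractLinks(html, apiKey = process.env.GEMINI_API_KEY) {
  const res = await fetch(GEMINI_URL, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildExtractionRequest(html)),
  });
  if (!res.ok) throw new Error(`Gemini request failed: ${res.status}`);
  const data = await res.json();
  return data.choices[0].message;
}
```

Search result pages can be large, so in practice it may be worth trimming scripts and styles from the HTML before sending it to the model.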

Now when you run this you will get something like this:

{
  content: '[\n' +
    '    {\n' +
    '        "description": "Order Panda Express | A Fast Casual Chinese Restaurant ...",\n' +
    '        "title": "Order Panda Express | A Fast Casual Chinese Restaurant ...",\n' +
    '        "url": "https://www.pandaexpress.com/"\n' +
    '    },\n' +
    '    {\n' +
    `        "description": "The only natural habitat for giant pandas in the world is located in southwestern China. Combined with the requirement that all cubs must return to China this creates the sense that pandas belong in and to China, and a country can only receive them if they have good relations with the People's Republic.",\n` +
    `        "title": "The Giant Pandas Have Left the National Zoo. What's Next for U.S. ...",\n` +
    `         "url": "https://www.georgetown.edu/news/the-giant-pandas-have-left-the-national-zoo-whats-next-for-u-s-china-relations/#:~:text=The%20only%20natural%20habitat%20for,relations%20with%20the%20People's%20Republic."\n` +
    '    },\n' +
    '    {\n' +
    '        "description": "Red pandas are the only living members of their taxonomic family, Ailuridae, while giant pandas are in the bear family, Ursidae.",\n' +
    '        "title": "Is a Red Panda a Bear? And More Red Panda Facts ...",\n' +
    '        "url": "https://nationalzoo.si.edu/animals/news/red-panda-bear-and-more-red-panda-facts"\n' +
    '    },\n' +
    '  {\n' +
    '        "description": "Giant pandas live in a few mountain ranges in south central China, in Sichuan, Shaanxi and Gansu provinces. They once lived in lowland areas, but farming, forest clearing and other development now restrict giant pandas to the mountains.",\n' +
    '        "title": "Giant panda",\n' +
    '        "url": "https://nationalzoo.si.edu/animals/giant-panda#:~:text=Giant%20pandas%20live%20in%20a,giant%20pandas%20to%20the%20mountains."\n' +
    '    },\n' +
    '     {\n' +
    `        "description": "Pandas have excellent camouflage for their habitat. The giant panda's distinct black-and-white markings have two functions: camouflage and communication. Most of the panda - its face, neck, belly, rump - is white to help it hide in snowy habitats. The arms and legs are black, helping it to hide in shade.",\n` +
    '        "title": "Top 10 facts about Pandas - WWF",\n' +
    '        "url": "https://www.wwf.org.uk/learn/fascinating-facts/pandas#:~:text=Pandas%20have%20excellent%20camouflage%20for,it%20to%20hide%20in%20shade."\n' +
    '    },\n' +
    '     {\n' +
    '        "description": "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China. It is characterised by its white coat with black patches around the eyes, ears, legs and shoulders. Its body is rotund; adult individuals weigh 100 to 115 kg and are typically 1.2 to 1.9 m long.",\n' +
    '        "title": "Giant panda",\n' +
    '        "url": "https://en.wikipedia.org/wiki/Giant_panda"\n' +
    '    },\n' +
    '   {\n' +
    `        "description": "The giant panda is the rarest member of the bear family and among the world's most threatened animals. Learn about WWF's giant panda conservation efforts.",\n` +
    '        "title": "Giant Panda | Species | WWF",\n' +
    '        "url": "https://www.worldwildlife.org/species/giant-panda"\n' +
    '    },\n' +
    '      {\n' +
    '        "description": "Panda Security antivirus: tailor-made computer security solutions. All our expertise to protect and simplify your life online.",\n' +
    '          "title": "Panda Security | Official Website",\n' +
    '        "url": "https://www.pandasecurity.com/"\n' +
    '    }\n' +
    ']',
  role: 'assistant'
}
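
Note that `content` in this output is still a string, not an array. A small sketch of turning it into usable objects follows; `stripFence` and `parseResults` are hypothetical helpers, and the fence stripping is a precaution because models sometimes wrap JSON in Markdown fences.

```javascript
const FENCE = '`'.repeat(3); // Markdown code-fence marker

// Hypothetical helper: remove a surrounding Markdown fence if present.
function stripFence(text) {
  let t = text.trim();
  if (t.startsWith(FENCE)) {
    t = t.slice(t.indexOf('\n') + 1); // drop the opening fence line
  }
  if (t.endsWith(FENCE)) {
    t = t.slice(0, -FENCE.length); // drop the closing fence
  }
  return t.trim();
}

// Hypothetical helper: parse the assistant message into an array of results.
function parseResults(message) {
  const parsed = JSON.parse(stripFence(message.content));
  if (!Array.isArray(parsed)) {
    throw new Error('Expected a JSON array of results');
  }
  // Keep only entries that carry a usable URL.
  return parsed.filter(
    (item) => item && typeof item.url === 'string' && item.url.length > 0
  );
}
```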

That is the parsed HTML result. Now we need to put all of this into one file that can be run globally, so Composer can use it to gather knowledge from a search engine. I created scrape-or-search.mjs, which is a merged version of search.mjs and scrape.mjs.

We also create bin/cli.js so the tool can be run globally from the command line, like this: npx @uratmangun/scraper-tool search <query> or npx @uratmangun/scraper-tool scrape <text|html> <url>.
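
A sketch of how the argument handling in such a CLI entry point might look. The command names follow the README section above (`search` and `show-content`); `parseCli` is a hypothetical function, and the real bin/cli.js may be organised differently.

```javascript
// Hypothetical argument parser for the two commands the tool exposes.
function parseCli(argv) {
  const [command, ...rest] = argv;
  if (command === 'search' && rest.length > 0) {
    // Join the remaining words so unquoted queries also work.
    return { command, query: rest.join(' ') };
  }
  if (
    command === 'show-content' &&
    rest.length === 2 &&
    ['html', 'text'].includes(rest[0])
  ) {
    return { command, mode: rest[0], url: rest[1] };
  }
  // Anything else falls back to printing usage information.
  return { command: 'help' };
}
```

In the entry file this would be called as `parseCli(process.argv.slice(2))`. For npx to run the file, it needs a `#!/usr/bin/env node` line at the top and a matching `bin` entry in package.json.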

Before publishing, we need to update our package.json accordingly.

To publish this to npm, run these two commands in order:

npm login
npm publish --access public

The tool can now be run with npx. Next we need to set global environment variables. I use the fish shell, so I put them in config.fish (you can ask ChatGPT how to do this for any other shell). Open the file with vi ~/.config/fish/config.fish and add both of these:

set -gx BRIGHT_PLAYWRIGHT_URL <your url>
set -gx GEMINI_API_KEY <your api key>


Run source ~/.config/fish/config.fish to apply the changes immediately. The tool is now available globally, which means the Cursor Composer agent can use it to search the web by running a command such as npx @uratmangun/scraper-tool search "web scraping tutorials".

To test whether it works, let's create a project. We will use a file called .cursorrules, which works a bit like a system message: the AI reads it before doing anything else. Let's try without .cursorrules first. We will build an Ethereum project with scaffold-eth and Frog. Frog (https://frog.fm/) is a framework for creating Farcaster frames. Farcaster frames are relatively new, so the model may not know what they are. I will also add https://docs.airstack.xyz/airstack-docs-and-faqs/farcaster/farcaster-frames/frames-validator. Let's create a new folder called testing-composer:

Image description

The folder is empty. Let's ask the Cursor Composer agent whether it understands what scaffold-eth is and have it build a project from it. scaffold-eth (https://scaffoldeth.io/) is a framework for creating Ethereum projects easily. We will just ask the agent this:

Image description

After running, it hallucinates a lot, for example:

Image description

To install Frog, you just need:

pnpm add frog

That's it; the package is called frog, not frog.fm. Now let's try with web search by adding a .cursorrules file.

I'm trying to fix the scaffold-eth project, which can't run Frog, using this prompt. Let's see if it can search the web now:

Image description

It is starting to understand now:

Image description

One drawback of Composer is that it sometimes gets stuck and stops responding, like this:

Image description

So we need to close the IDE, reopen it, and prompt the agent again in a new tab:

Image description

It's looking good: it's using our tool to search the web.

Image description

It is also trying to view the page content using our tool.

Image description

It also uses the correct tool.

Image description

There are still some errors, but the fact that it uses our tool means it now has the superpower of searching the web. It's not perfect yet, but it helps the Composer agent see the world in a different way: instead of relying only on its hallucinated knowledge, it can get another perspective, which is great.
