This is a submission for the Bright Data Web Scraping Challenge: Most Creative Use of Web Data for AI Models
What I Built
IDEs like Cursor now come with an AI agent to help us code faster and more easily, but the agent, called Composer, has limited capabilities. For example, Cursor's Chat can search the web, and if you paste a URL into it, it will automatically scrape that page and answer based on the scraped content. Composer doesn't have those capabilities. Below is the Chat feature in Cursor; as you can see, it has web search.
But if you go to Cursor's Composer, especially the agent mode, it doesn't have that.
So my goal here is to create a web scraper and web searcher using Bright Data and Gemini's OpenAI-compatible model, to make Cursor Composer smarter with web search and web scraping functionality.
Demo
Web Scraper
A powerful web scraping utility built with Playwright that can extract content and links from websites.
Prerequisites
- Node.js 18+
- pnpm (recommended) or bun
Installation & Usage
You can use this tool in two ways:
1. Using npx (Recommended)
Run directly without installation using npx:
npx @uratmangun/scraper-tool show-content <url>
# or
npx @uratmangun/scraper-tool search "<query>"
2. Local Installation
- Install dependencies:
pnpm install
# or
bun install
- Set up environment variables:
cp .env.example .env.local
Then edit .env.local
and set your BRIGHT_PLAYWRIGHT_URL
for the Playwright CDP connection.
Commands
View Content
# Using npx
npx @uratmangun/scraper-tool show-content <html|text> <url>
# Using local installation
pnpm run scrape show-content <html|text> <url>
Example:
npx @uratmangun/scraper-tool show-content html https://example.com
# or
npx @uratmangun/scraper-tool show-content text https://example.com
This will display either the HTML or plain text content of the specified URL…
How I Used Bright Data
So, in order to do that, I'm first going to write a script that uses Bright Data's web scraper. Let's create a script called scrape.mjs:
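The embedded script isn't shown in this post, so here is a minimal sketch of what scrape.mjs could look like, assuming Playwright is installed and BRIGHT_PLAYWRIGHT_URL holds the CDP endpoint of a Bright Data Scraping Browser zone:

```javascript
// scrape.mjs — minimal sketch: connect to Bright Data's Scraping Browser
// over CDP and print a page's HTML or plain text.

// Normalize the requested output mode ("text" or "html"); pure and testable.
function normalizeMode(mode) {
  return mode === 'text' ? 'text' : 'html';
}

async function main() {
  const [mode, url] = process.argv.slice(2);
  if (!url) {
    console.error('Usage: node scrape.mjs <html|text> <url>');
    process.exit(1);
  }
  // playwright is imported lazily so the pure helper above stays dependency-free.
  const { chromium } = await import('playwright');
  // BRIGHT_PLAYWRIGHT_URL is the CDP endpoint of your Bright Data browser zone.
  const browser = await chromium.connectOverCDP(process.env.BRIGHT_PLAYWRIGHT_URL);
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'domcontentloaded' });
  console.log(normalizeMode(mode) === 'text'
    ? await page.innerText('body')   // visible text only
    : await page.content());         // full HTML
  await browser.close();
}

// Only run when the Bright Data endpoint is configured.
if (process.env.BRIGHT_PLAYWRIGHT_URL) {
  main().catch((err) => { console.error(err); process.exit(1); });
}
```

Connecting over CDP means all traffic goes through Bright Data's remote browser, so there is no local browser install to manage.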
You can use it like this:
pnpm run scrape show-content <url>
It will show the HTML content of a URL like this:
...
<div id="js-global-screen-reader-notice" class="sr-only mt-n1" aria-live="polite" aria-atomic="true"></div>
<div id="js-global-screen-reader-notice-assertive" class="sr-only mt-n1" aria-live="assertive" aria-atomic="true"></div>
So this is good, but it is still unstructured. Before we turn it into structured data that both machines and humans can understand, we're going to build search using Bright Data's web scraper as well, so create this script as search.mjs:
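Again the embed isn't shown; a sketch of what search.mjs might look like, under the same assumptions (Playwright plus the BRIGHT_PLAYWRIGHT_URL CDP endpoint), with Google as the assumed search engine:

```javascript
// search.mjs — minimal sketch: run a web search through Bright Data's
// Scraping Browser and dump the raw results-page HTML.

// Build the search-results URL for a query (pure, easy to test).
function buildSearchUrl(query) {
  return `https://www.google.com/search?q=${encodeURIComponent(query)}`;
}

async function main() {
  const query = process.argv.slice(2).join(' ');
  if (!query) {
    console.error('Usage: node search.mjs <query>');
    process.exit(1);
  }
  // Lazy import keeps the helper above usable without playwright installed.
  const { chromium } = await import('playwright');
  const browser = await chromium.connectOverCDP(process.env.BRIGHT_PLAYWRIGHT_URL);
  const page = await browser.newPage();
  await page.goto(buildSearchUrl(query), { waitUntil: 'domcontentloaded' });
  console.log(await page.content()); // raw, unstructured HTML for now
  await browser.close();
}

// Only run when the Bright Data endpoint is configured.
if (process.env.BRIGHT_PLAYWRIGHT_URL) {
  main().catch((err) => { console.error(err); process.exit(1); });
}
```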
You can run this using pnpm run search <your query>, and it will show something like this:
...
AAAAAEAACgIQAAAAAACgAAAAAAAAAAAAAABIAAAAAAAAECAABEJCAAAEAAAAAMACAAAILAABAgAEAAAAAAAEAAgAIEAEYL__OAAAAAAAAAAAAAQCABEAAAAAAHABABAE0d4AAQAAAAgAAAAMAAAAQAAAAAAAAAUAAAAAAAAAAAQAAAAAAAAAAAAAAAABAPoBAAAAAAAAAAAAAAACAAAAAABggAIAAvgBAAAAAACAAwAAAAABAQAAOAIGIAAAAAAAAAD3AcDjAeGQwgIAAAAAAAAAAAAAAAABSBDMgfQXBCAAAAAAAAAAAAAAAAAAAJAiaOJyAwAC/d=0/dg=0/br=1/rs=ACT90oEbXjTDEsqDs2o3NzHTmzVZxjp5ng/m=sy27z,sy28k,sy27w,sy29c,sy28x,sy28v,sy288,sy280,M0O4le?xjs=s4" nonce=""></script></body></html>
Again, this is still unstructured, so now we need to parse it with AI. I'm using Gemini to convert this content into a more parseable format like JSON. First we'll extract the links from the unstructured search output; to do that, let's write a script that consumes the unstructured data and converts it to JSON using AI.
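The parsing script itself isn't shown in the post; here is a hedged sketch of how it could work. The file name parse-search.mjs, the exact prompt, and the model name are my assumptions; what the post does confirm is that Gemini is called through its OpenAI-compatible API using GEMINI_API_KEY:

```javascript
// parse-search.mjs — sketch: ask Gemini, via its OpenAI-compatible endpoint,
// to turn raw search-result HTML into a JSON array of results.

// Build the extraction prompt (pure, testable without an API key).
function buildPrompt(rawHtml) {
  return [
    'Extract every search result from the HTML below as a JSON array of',
    'objects with "title", "url" and "description" keys. Return only JSON.',
    '',
    rawHtml,
  ].join('\n');
}

async function main() {
  const rawHtml = process.argv[2] ?? '';
  // The openai package is imported lazily so the helper stays dependency-free.
  const { default: OpenAI } = await import('openai');
  const client = new OpenAI({
    apiKey: process.env.GEMINI_API_KEY,
    // Gemini exposes an OpenAI-compatible API under this base URL.
    baseURL: 'https://generativelanguage.googleapis.com/v1beta/openai/',
  });
  const completion = await client.chat.completions.create({
    model: 'gemini-1.5-flash', // assumption: any Gemini chat model works here
    messages: [{ role: 'user', content: buildPrompt(rawHtml) }],
  });
  // Logs a { content: '[...]', role: 'assistant' }-shaped message,
  // matching the output shown below.
  console.log(completion.choices[0].message);
}

// Only run when a Gemini key is configured.
if (process.env.GEMINI_API_KEY) {
  main().catch((err) => { console.error(err); process.exit(1); });
}
```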
Now when you run this, you will get something like this:
{
content: '[\n' +
' {\n' +
' "description": "Order Panda Express | A Fast Casual Chinese Restaurant ...",\n' +
' "title": "Order Panda Express | A Fast Casual Chinese Restaurant ...",\n' +
' "url": "https://www.pandaexpress.com/"\n' +
' },\n' +
' {\n' +
` "description": "The only natural habitat for giant pandas in the world is located in southwestern China. Combined with the requirement that all cubs must return to China this creates the sense that pandas belong in and to China, and a country can only receive them if they have good relations with the People's Republic.",\n` +
` "title": "The Giant Pandas Have Left the National Zoo. What's Next for U.S. ...",\n` +
` "url": "https://www.georgetown.edu/news/the-giant-pandas-have-left-the-national-zoo-whats-next-for-u-s-china-relations/#:~:text=The%20only%20natural%20habitat%20for,relations%20with%20the%20People's%20Republic."\n` +
' },\n' +
' {\n' +
' "description": "Red pandas are the only living members of their taxonomic family, Ailuridae, while giant pandas are in the bear family, Ursidae.",\n' +
' "title": "Is a Red Panda a Bear? And More Red Panda Facts ...",\n' +
' "url": "https://nationalzoo.si.edu/animals/news/red-panda-bear-and-more-red-panda-facts"\n' +
' },\n' +
' {\n' +
' "description": "Giant pandas live in a few mountain ranges in south central China, in Sichuan, Shaanxi and Gansu provinces. They once lived in lowland areas, but farming, forest clearing and other development now restrict giant pandas to the mountains.",\n' +
' "title": "Giant panda",\n' +
' "url": "https://nationalzoo.si.edu/animals/giant-panda#:~:text=Giant%20pandas%20live%20in%20a,giant%20pandas%20to%20the%20mountains."\n' +
' },\n' +
' {\n' +
` "description": "Pandas have excellent camouflage for their habitat. The giant panda's distinct black-and-white markings have two functions: camouflage and communication. Most of the panda - its face, neck, belly, rump - is white to help it hide in snowy habitats. The arms and legs are black, helping it to hide in shade.",\n` +
' "title": "Top 10 facts about Pandas - WWF",\n' +
' "url": "https://www.wwf.org.uk/learn/fascinating-facts/pandas#:~:text=Pandas%20have%20excellent%20camouflage%20for,it%20to%20hide%20in%20shade."\n' +
' },\n' +
' {\n' +
' "description": "The giant panda (Ailuropoda melanoleuca), also known as the panda bear or simply panda, is a bear species endemic to China. It is characterised by its white coat with black patches around the eyes, ears, legs and shoulders. Its body is rotund; adult individuals weigh 100 to 115 kg and are typically 1.2 to 1.9 m long.",\n' +
' "title": "Giant panda",\n' +
' "url": "https://en.wikipedia.org/wiki/Giant_panda"\n' +
' },\n' +
' {\n' +
` "description": "The giant panda is the rarest member of the bear family and among the world's most threatened animals. Learn about WWF's giant panda conservation efforts.",\n` +
' "title": "Giant Panda | Species | WWF",\n' +
' "url": "https://www.worldwildlife.org/species/giant-panda"\n' +
' },\n' +
' {\n' +
' "description": "Panda Security antivirus: tailor-made computer security solutions. All our expertise to protect and simplify your life online.",\n' +
' "title": "Panda Security | Official Website",\n' +
' "url": "https://www.pandasecurity.com/"\n' +
' }\n' +
']',
role: 'assistant'
}
That's the parsed HTML result. Now we need to put all of this into one file so we can run it globally, which lets Composer use it to gain more knowledge via a search engine. I created a scrape-or-search.mjs file, which contains a merged version of search.mjs and scrape.mjs and looks like this:
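The merged file isn't shown either; a sketch of the shape it could take, combining the two flows behind a single dispatcher (same Playwright/Bright Data assumptions as before):

```javascript
// scrape-or-search.mjs — sketch of the merged tool: one entry point that
// dispatches to either the scrape flow or the search flow.

// Turn argv into a command descriptor (pure, testable).
function parseCommand(argv) {
  const [cmd, ...rest] = argv;
  if (cmd === 'scrape' && rest.length >= 2) {
    return { cmd: 'scrape', mode: rest[0], url: rest[1] };
  }
  if (cmd === 'search' && rest.length >= 1) {
    return { cmd: 'search', query: rest.join(' ') };
  }
  return null;
}

async function main() {
  const command = parseCommand(process.argv.slice(2));
  if (!command) {
    console.error('Usage: scrape <text|html> <url> | search <query>');
    process.exit(1);
  }
  const { chromium } = await import('playwright');
  const browser = await chromium.connectOverCDP(process.env.BRIGHT_PLAYWRIGHT_URL);
  const page = await browser.newPage();
  if (command.cmd === 'scrape') {
    await page.goto(command.url, { waitUntil: 'domcontentloaded' });
    console.log(command.mode === 'text'
      ? await page.innerText('body')
      : await page.content());
  } else {
    const url = `https://www.google.com/search?q=${encodeURIComponent(command.query)}`;
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // In the real tool, the Gemini JSON-extraction step runs on this HTML.
    console.log(await page.content());
  }
  await browser.close();
}

// Only run when the Bright Data endpoint is configured.
if (process.env.BRIGHT_PLAYWRIGHT_URL) {
  main().catch((err) => { console.error(err); process.exit(1); });
}
```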
We also create bin/cli.js so that we can later run it globally from the command line, like npx @uratmangun/scraper-tool search <query> or npx @uratmangun/scraper-tool scrape <text|html> <url>:
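The wrapper isn't shown in the post; since scrape-or-search.mjs already reads process.argv, a plausible sketch of bin/cli.js is just a thin launcher (the relative path is an assumption about the repo layout):

```javascript
#!/usr/bin/env node
// bin/cli.js — sketch: the entry point referenced by "bin" in package.json,
// so `npx @uratmangun/scraper-tool ...` works.

// Decide whether the subcommand is one we know (pure, testable).
function resolveEntry(argv) {
  const [cmd] = argv;
  return cmd === 'scrape' || cmd === 'search' ? '../scrape-or-search.mjs' : null;
}

const entry = resolveEntry(process.argv.slice(2));
if (entry) {
  // The merged script reads process.argv itself, so forwarding is implicit.
  import(entry).catch((err) => { console.error(err); process.exit(1); });
} else {
  console.error('Usage: scraper-tool <scrape|search> ...');
}
```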
We then need to update our package.json accordingly before publishing:
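The actual manifest isn't shown; the key part is the bin entry pointing at bin/cli.js. A sketch (version numbers and the files list are assumptions):

```json
{
  "name": "@uratmangun/scraper-tool",
  "version": "1.0.0",
  "type": "module",
  "bin": {
    "scraper-tool": "bin/cli.js"
  },
  "files": ["bin", "scrape-or-search.mjs"],
  "scripts": {
    "scrape": "node scrape-or-search.mjs scrape",
    "search": "node scrape-or-search.mjs search"
  },
  "dependencies": {
    "playwright": "^1.40.0",
    "openai": "^4.0.0"
  }
}
```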
To publish this to npm, you need to run both of these commands in order:
npm login
npm publish --access public
This way you can use it via npx. Now we need to set our global environment variables. I use the fish shell, so they go inside config.fish (you can ask ChatGPT for the equivalent in any other shell). Open it with vi ~/.config/fish/config.fish, then add both of these:
set -gx BRIGHT_PLAYWRIGHT_URL <your url>
set -gx GEMINI_API_KEY <your api key>
Run source ~/.config/fish/config.fish to apply the changes immediately. We can then use the tool globally, which means our Cursor Composer agent can also search the web by running a command like npx @uratmangun/scraper-tool search "web scraping tutorials".
Okay, now let's test whether it's working. First let's create a project. We'll use something called .cursorrules: a .cursorrules file is basically like a system message in AI, so before the AI does anything it reads those rules. Let's try without .cursorrules first. We're going to build an Ethereum project with Scaffold-ETH and Frog. Frog (https://frog.fm/) is a framework for creating Farcaster Frames, and since Farcaster Frames are relatively new, the model may not know what they are. I'll also add https://docs.airstack.xyz/airstack-docs-and-faqs/farcaster/farcaster-frames/frames-validator. Let's create a new folder called testing-composer.
There is nothing in this folder yet, so let's ask the Cursor Composer agent whether it understands Scaffold-ETH and can build a project with it. Scaffold-ETH (https://scaffoldeth.io/) is a framework for creating Ethereum projects easily. We'll just ask the agent this:
After running it, the agent hallucinates a lot. For example:
To install Frog, you basically just need:
pnpm add frog
That's it; there is no frog.fm package. Now let's try using web search by adding a .cursorrules file:
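The rules file itself isn't shown in the post; a plausible .cursorrules along these lines would tell the agent about the tool (the exact wording here is my guess, not the author's file):

```
When you need up-to-date information or documentation, run these shell commands:
- npx @uratmangun/scraper-tool search "<query>"    -> search the web
- npx @uratmangun/scraper-tool scrape text <url>   -> read a page as plain text
- npx @uratmangun/scraper-tool scrape html <url>   -> read a page as raw HTML
Always search the web before answering questions about unfamiliar frameworks,
and scrape any documentation URL the user provides.
```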
I'm trying to fix the Scaffold-ETH project that can't run Frog with this prompt; let's see if the agent can search the web now:
It started to understand now:
One downside of Composer is that it sometimes gets stuck and stops responding, like this:
So we need to close the IDE, open it again, and prompt the agent in a new tab:
It's looking good: it's using our tool to search the web.
It also tries to view the content of the page using our tool.
And it uses the correct tool.
There are still some errors, but the fact that the agent uses our tool means it now has the superpower of searching the web. It's still not perfect, but it helps the Composer agent see the world in a different way: instead of relying only on its hallucinated knowledge, it can check another perspective, which is great.