I built a web capture API in my spare time. Here is what happened when I added AI voice narration.
It started with a Puppeteer instance I was tired of babysitting.
Every project I built needed the same things: take a screenshot of a page, generate a PDF, maybe slap together an OG image for social sharing. And every time, I had the same three choices. Run my own Puppeteer instance. Pay for three separate APIs. Or just skip it and ship without those features.
I kept choosing option three, which felt wrong.
So I started building something for myself. A simple API: send a URL, get a PNG back. That is it. Except, once you have Chromium running in a managed way, you start asking "what else can I do here?"
How it grew
The screenshot API came first. Then someone asked about PDFs. Then I needed OG images for the product itself, so I added that. Then I wanted to record a quick demo for a product launch and realized I had everything I needed to support video recordings too.
Before I really noticed it, I had built seven things that all shared one authentication key and one billing meter.
The seven tools in PageBolt now:
- Screenshots (PNG, JPEG, WebP, 25+ device presets)
- PDF generation from a URL or raw HTML
- OG image generation (3 built-in templates, custom HTML on Growth tier)
- Video recording with cursor effects and browser chrome
- Audio Guide, which I will get to in a moment
- Multi-step browser sequences (login flows, form fills, then capture)
- Page inspection, which returns a structured map of elements with CSS selectors
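To make that last item concrete, here is roughly the shape of an inspection response. The field names here are illustrative guesses, not the documented schema:

```json
{
  "url": "https://example.com",
  "elements": [
    {
      "tag": "button",
      "text": "Sign up",
      "selector": "header > nav button.cta",
      "visible": true
    }
  ]
}
```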
Here is the simplest possible thing it does, a screenshot in cURL:
curl -X POST https://api.pagebolt.dev/v1/screenshot \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "blockBanners": true,
    "blockAds": true
  }' \
  --output screenshot.png
That blockBanners flag removes GDPR cookie popups automatically. I added that after the third time I saw a screenshot ruined by a CookieBot overlay.
The Audio Guide story
Video recording was already working. You could record a multi-step browser sequence, add step notes that appear as tooltip annotations, and get back an MP4 of the whole flow.
But a friend asked: "Can the video narrate itself?"
My first reaction was "that is scope creep." My second reaction was "I already have the step notes. Each step has text attached to it. All I need is to turn those into speech and sync them to the timeline."
I looked at the Azure and OpenAI voice APIs. Both have usable TTS at reasonable cost. I tested the voices on offer and picked the ones that sounded like something you would actually want to listen to on a demo video: ten from Azure (ava, andrew, emma, brian, aria, guy, jenny, davis, christopher, michelle) and six from OpenAI (alloy, echo, fable, nova, onyx, shimmer).
The integration took two weeks longer than I expected, mostly because synchronizing narration to browser actions is harder than it sounds. The narration needs to fit the pace of the recording, not the other way around.
There are two modes. You can add a narration field to individual steps, and the audio aligns to each action. Or you can write a single script string with {{1}}, {{2}} markers that sync to specific steps, like a guided tour.
{
  "audioGuide": {
    "enabled": true,
    "voice": "nova",
    "script": "Welcome to the dashboard. {{1}} Click Settings to manage your account. {{2}} Here you can update your profile."
  }
}
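For comparison, the per-step mode attaches narration directly to the actions in a sequence. This is a sketch based on the description above; the action and field names are assumptions, not the documented schema:

```json
{
  "steps": [
    { "action": "goto", "url": "https://example.com/dashboard" },
    {
      "action": "click",
      "selector": "#settings",
      "narration": "Click Settings to manage your account."
    }
  ],
  "audioGuide": { "enabled": true, "voice": "nova" }
}
```

In this mode each narration clip aligns to its own step, rather than flowing as one continuous script.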
The use case I keep seeing: narrated product demos for sales and onboarding. Record once, the video explains itself.
The MCP server
Model Context Protocol is the standard Anthropic published for letting AI assistants call external tools natively. If you use Claude Desktop, Cursor, or Windsurf, you can configure MCP servers and the AI assistant gets new capabilities without any custom integration work on your end.
I built a first-party MCP server for PageBolt because the alternative felt silly. If someone is using an AI coding assistant to build a web app, they might want to ask it "take a screenshot of my staging environment" or "generate an OG image for this page." Without an MCP server, the assistant can write the code to do that. With one, it can just do it.
It works. I have asked Claude to inspect a page's element structure, take a screenshot of a specific URL, and generate a PDF of documentation. It uses the API the same way any code would, except the "code" is just the assistant deciding to call a tool.
Setting it up is about 10 lines of JSON config. I cover this in more depth in a separate post.
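For a sense of scale, an MCP server entry in Claude Desktop's claude_desktop_config.json looks roughly like this. The package name is a guess for illustration; the setup post has the real one:

```json
{
  "mcpServers": {
    "pagebolt": {
      "command": "npx",
      "args": ["-y", "@pagebolt/mcp-server"],
      "env": { "PAGEBOLT_API_KEY": "YOUR_API_KEY" }
    }
  }
}
```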
Where it is now
PageBolt is live. There is a free tier with 100 requests per month, no credit card required. Paid plans start at $29/month for 5,000 requests.
I am not a marketing person. I built this because I kept needing it and I could not find one thing that did all of it under one key at a reasonable price. If you are in the same situation, the free tier is a real free tier, not a 3-day trial.
What I am measuring
I am watching which of the seven APIs gets used first by new signups. My hypothesis is that screenshots and PDFs will be the entry point for most people, and that video recording is a "discovery feature" people find once they are already using the rest.
I am also watching whether the MCP integration brings in a different kind of user: people who are not searching for a screenshot API but who discover PageBolt because their AI assistant suggested it.
If you build something with it, I would genuinely like to hear about it. The product is at pagebolt.dev. The free tier does not expire.