DEV Community

Custodia-Admin

Posted on • Originally published at pagebolt.dev

I added a first-party MCP server to my API. Here is what AI coding assistants can now do.


Most developers who use AI coding assistants have noticed the same gap: the assistant can write code to take a screenshot, but it cannot actually take one.

That sounds like a small thing. But the more you use AI assistants for real work, the more that limitation shows up. You are building a web app. You want to QA a UI change. You ask the assistant to check what a page looks like on mobile. It writes you code. You run the code. You look at the result. You describe what you see. The assistant responds.

That is a lot of steps for something that should take five seconds.


What MCP is

Model Context Protocol is an open standard published by Anthropic for connecting AI assistants to external tools. It gives any MCP-compatible assistant (Claude Desktop, Cursor, Windsurf, and others) the ability to call tools natively, as part of a conversation, without any custom integration on your end.

Think of it like USB-C for AI tools. Once an MCP server exists for a capability, any compatible assistant can use it.
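Under the hood, MCP is JSON-RPC over stdio or HTTP. When an assistant connects, it discovers what a server offers with a `tools/list` request, and each tool comes back with a name, description, and a JSON Schema for its inputs. A sketch of what such a response might look like (the tool name and schema here are illustrative, not PageBolt's actual definitions):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "tools": [
      {
        "name": "take_screenshot",
        "description": "Capture a screenshot of a URL or raw HTML",
        "inputSchema": {
          "type": "object",
          "properties": {
            "url": { "type": "string" },
            "viewportDevice": { "type": "string" }
          },
          "required": ["url"]
        }
      }
    ]
  }
}
```

This discovery step is why no custom integration is needed on your end: the schema tells the assistant what each tool takes, so it can construct valid calls on its own.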


The problem it solves

AI assistants are very good at reasoning about web pages. They can analyze HTML, suggest CSS fixes, and describe accessibility issues. But they are working blind unless you manually paste the content in.

A first-party MCP server changes that. The assistant can call the tool directly, get the result back, and reason about it in the same context. No copy-paste. No manual steps.

With the PageBolt MCP server, an AI coding assistant can:

  • Take a screenshot of any URL or rendered HTML
  • Generate a PDF
  • Create an OG image
  • Record a video of a browser flow
  • Run a multi-step browser sequence (login, navigate, click, capture)
  • Inspect a page's element structure and get CSS selectors back

Setting it up

For Cursor, add this to your MCP configuration:

{
  "mcpServers": {
    "pagebolt": {
      "command": "npx",
      "args": ["-y", "pagebolt-mcp"],
      "env": {
        "PAGEBOLT_API_KEY": "your_api_key_here"
      }
    }
  }
}

For Claude Desktop, the same snippet goes in claude_desktop_config.json.

That is it. Once configured, the assistant has access to all the PageBolt tools. No code to write, no endpoints to memorize, no auth flow to manage.


What this actually unlocks

Visual QA in conversation. You push a CSS change. You ask the assistant "what does the homepage look like now on an iPhone 14 Pro?" It calls the screenshot tool with viewportDevice: "iphone_14_pro", gets the image, and tells you what it sees. You can iterate without leaving the conversation.
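The tool call behind that exchange might look something like this. The `viewportDevice` value comes from the example above; the tool name and the other fields are a plausible sketch, not PageBolt's exact schema:

```json
{
  "tool": "take_screenshot",
  "arguments": {
    "url": "https://staging.yourapp.com/",
    "viewportDevice": "iphone_14_pro",
    "fullPage": true
  }
}
```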

Page inspection for debugging. You describe a layout issue. The assistant calls inspect_page on the URL, gets back a structured map of elements with CSS selectors, and uses that to diagnose the problem. No more "I think the selector is .header-nav > ul > li."
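For a sense of why a structured element map helps, here is a sketch of what an `inspect_page` response could contain. The exact shape is my own illustration, not PageBolt's documented format:

```json
{
  "url": "https://staging.yourapp.com/",
  "elements": [
    {
      "selector": "header.site-header",
      "tag": "header",
      "rect": { "x": 0, "y": 0, "width": 1280, "height": 72 }
    },
    {
      "selector": ".header-nav > ul > li:nth-child(2)",
      "tag": "li",
      "text": "Pricing"
    }
  ]
}
```

With real selectors in hand, the assistant can reference elements precisely instead of guessing at class names.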

Automated demo content. You generate a landing page. You ask the assistant to create the OG image for it. It calls the OG image tool with your title and description, and hands you the file. No Figma, no manual steps.
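A hedged sketch of that OG image call, with a hypothetical tool name and argument names (PageBolt's actual schema may differ):

```json
{
  "tool": "create_og_image",
  "arguments": {
    "title": "Launch week: the new dashboard",
    "description": "Faster, cleaner, dark mode by default",
    "template": "gradient"
  }
}
```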

Narrated video demos from a PR. This is the one I am most interested in. The flow is: push a PR, a CI step triggers, an AI agent runs a browser sequence against the staging environment, records a video with step notes and audio narration, and posts the result as a PR comment. The reviewer watches a 30-second narrated demo instead of reading a description of what changed.

The PageBolt MCP server supports all of this natively. Here is what that CI step might look like, written by the assistant itself:

// Ask the assistant: "record a demo of the checkout flow on staging"
// The assistant calls the MCP tool with something like this:

{
  "tool": "record_video",
  "arguments": {
    "steps": [
      { "action": "navigate", "url": "https://staging.yourapp.com", "note": "Landing page" },
      { "action": "click", "selector": "#get-started", "note": "Start the signup flow" },
      { "action": "fill", "selector": "#email", "value": "demo@example.com", "note": "Enter email" },
      { "action": "click", "selector": "#continue", "note": "Proceed to checkout" }
    ],
    "audioGuide": {
      "enabled": true,
      "voice": "nova",
      "script": "Welcome. {{1}} Click Get Started to begin. {{2}} Enter your email. {{3}} Click Continue."
    },
    "frame": { "enabled": true, "style": "macos" },
    "background": { "enabled": true, "type": "gradient", "gradient": "ocean" }
  }
}

The assistant writes that config, calls the tool, and gives you back an MP4. The video has browser chrome, a gradient background, an animated cursor, and AI voice narration.
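The CI wiring around that call is ordinary. A hedged GitHub Actions sketch, where the agent runner CLI is a placeholder (PageBolt does not ship this workflow; any MCP-capable agent that reads the `mcpServers` config shown earlier would slot in here):

```yaml
# Hypothetical workflow: run an MCP-capable agent that records a demo
# via the PageBolt MCP server on every pull request.
name: pr-demo-video
on:
  pull_request:

jobs:
  demo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Record narrated demo of staging
        # "your-agent-cli" is a placeholder for whatever agent runner
        # you use; it invokes the record_video tool and can post the
        # resulting MP4 link as a PR comment.
        run: npx your-agent-cli --prompt "Record a demo of the checkout flow on staging"
        env:
          PAGEBOLT_API_KEY: ${{ secrets.PAGEBOLT_API_KEY }}
```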


What I am watching

I built the MCP server because I kept using the PageBolt API myself inside Cursor and kept wishing I did not have to write the API call manually. Now I do not have to.

What I do not know yet is whether other people will find it as useful, or whether they will mostly use the HTTP API directly. The MCP path is lower friction for people already inside an AI coding assistant. The HTTP API is lower friction for people building automations and pipelines.

Both are valid. The MCP server just removes a category of friction that I kept bumping into.


Try it

PageBolt has a free tier with 100 requests per month. The MCP server works on any tier. Setup takes about two minutes if you already have Cursor or Claude Desktop configured.

The spec for MCP is at modelcontextprotocol.io if you want to understand what is happening under the hood. It is worth reading. The ecosystem is growing fast and the pattern is becoming a standard part of how AI tools connect to the world.

If you configure it and find something interesting it can do, I would like to know about it.

Top comments (6)

Matthew Hou

This is the right direction. APIs that ship their own MCP server are going to have a massive DX advantage over APIs that don't.

Think about it: instead of reading docs, writing boilerplate, and debugging auth — you just tell the AI "take a screenshot of this page" and it figures out the API calls. That's the promise of MCP done right.

The question is who maintains it. If it's a side project for one engineer, it'll fall behind the main API. If it's first-class, it changes how people interact with your product. Sounds like you're treating it as first-class, which is the way to go.

Custodia-Admin

Exactly this - and the "who maintains it" question is the one that keeps us honest. The MCP server ships from the same repo as the API, versioned together, and we dogfood it ourselves (it's how I take screenshots/videos during development now with the CI integration). If it falls behind the API, we'd catch it immediately. The DX gap you're describing is real and it's only going to widen as more tools add MCP support!

Renato Marinho

This is a great real-world example of where MCP adds immediate value. One challenge I've noticed as MCP servers get more complex is that the tool responses themselves can become ambiguous to the consuming agent.

When your MCP tool returns structured data — statuses, amounts, flags — the AI has to infer what's actionable and what the values mean without visual context. For a first-party server like yours, you have full control over this, which is a huge advantage.

I've been working on an architectural pattern for this exact problem: mcp-fusion (github.com/vinkius-labs/mcp-fusion). It adds a Presenter layer that makes tool responses agent-aware by including action affordances (what the agent can do next) and semantic clarity in the output itself. Might be interesting to consider as your MCP server evolves beyond the initial tooling.

Custodia-Admin

Really appreciate this framing! The ambiguity problem in tool responses is something I've been thinking about too, especially as we add more complex tools like multi-step sequences and narrated video (and as I bloat my own AI agents with more and more tools). Right now we're leaning on rich descriptions in the tool schema itself, but you're pointing at something deeper: the response structure needs to tell the agent what's actionable, not just what happened. Going to look at mcp-fusion properly; the Presenter layer idea sounds like it solves a real problem. Thank you for sharing!

Mahima From HeyDev

Really nice use of MCP as a first-party contract for your API. In practice the biggest win I have seen is keeping the tool surface area tiny and versioned, otherwise assistants start calling “almost right” endpoints and you get spooky failures that look like model bugs. Curious if you’re enforcing auth + rate limits per tool, or treating the MCP server as a trusted internal client? Also interested how you’re testing these tool calls (golden transcripts vs mocked API responses).

Custodia-Admin

Good callouts on both.

Auth + rate limits: Auth is per-request, not per-session; every tool call includes the API key in the header. Rate limits are enforced at the HTTP layer (the same quota the REST API uses), so each tool invocation counts against the monthly plan. No separate MCP quota. The MCP server is a thin wrapper; the real gate is the API.

Tool surface area: We deliberately expose a subset of each endpoint's parameters in the MCP tools: the ones that make sense for conversational/agentic use. Full parameter access is still available via the REST API. The tradeoff is discoverability vs. power; agentic users prefer fewer, well-named parameters.

Testing: Mostly integration tests against a real browser instance running locally. Unit-testing Puppeteer behaviour is painful. The MCP layer itself is thin enough that the main failure modes are auth errors and network issues rather than logic bugs.

The "spooky failures" framing is exactly what drove our /inspect design, return the element map first, let the agent decide what to interact with, rather than trying to make the agent guess selectors. Works much better in practice.