---
title: "UI = MCP: How we got voice control of our B2B email platform for free via Model Context Protocol"
published: false
tags: mcp, architecture, voice, nestjs
---

#ai

Context
Live Direct Marketing (LDM) is a multi-tenant B2B email platform: CRM (companies/contacts/leads), segments, creatives, mailings, dialogs (inbound/outbound), suppression/stop-lists, anti-spam marking, deliverability checks. Stack: NestJS + Prisma + PostgreSQL + BullMQ + Redis on the backend, React + Turborepo on the frontend. Tenant isolation via a separate database per user.
The product's main differentiator is per-message inbox verification — every outbound email is verified across 9 mailbox providers (Gmail, Outlook, Yandex, Mail.ru, iCloud, GMX, AOL, T-Online, Yahoo) via a network of seed mailboxes. Each outbound dialog row in the DB has a placement field: inbox | spam | unchecked. Billing is gated on confirmed inbox.
When MCP started becoming the de facto agent integration standard, we had a choice: build a separate agent API mirror on top of the existing web API, or unify them.
We chose to unify them.
Architecture: UI = MCP
All endpoints live under /api/. In front of them sits a HybridAuthGuard that resolves either a session cookie (web UI) or a Bearer key prefixed with ldm_pk_ (MCP / external agent / voice skill). Whichever auth succeeds, the request hits the same controller, the same scope check, the same business logic.
```typescript
import { CanActivate, ExecutionContext, Injectable, UnauthorizedException } from '@nestjs/common';

@Injectable()
export class HybridAuthGuard implements CanActivate {
  // this.apiKeys, this.sessions and checkScopeFor() are defined elsewhere and omitted here.

  async canActivate(ctx: ExecutionContext): Promise<boolean> {
    const req = ctx.switchToHttp().getRequest();

    // 1. Bearer (agent / MCP / voice skill)
    const m = req.headers.authorization?.match(/^Bearer (ldm_pk_\w+)$/);
    if (m) {
      const key = await this.apiKeys.verifyHash(m[1]);
      if (!key) throw new UnauthorizedException();
      req.user = key.user;
      req.scopes = key.scopes; // limited to the scopes granted to this key
      return this.checkScopeFor(req);
    }

    // 2. Session cookie (web UI)
    const session = await this.sessions.fromCookie(req);
    if (!session) throw new UnauthorizedException();
    req.user = session.user;
    req.scopes = ['*']; // UI = full scope
    return true;
  }
}
```
Every capability is described in /.well-known/agent-card.json (A2A discovery standard) with its scope: email:send, crm:read, dialogs:write, mailing:write, etc. A Bearer key is issued with an explicit scope set — a voice skill can be given a limited key that reads dialogs and sends mailings from a specific account, with no access to exports, billing, or suppression management.
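To make the scope model concrete, here is a minimal sketch of the kind of check a guard could run once `req.scopes` is set. The helper name `hasScope` and the `billing:read` scope are assumptions for illustration; the post does not show how `checkScopeFor` is actually implemented.

```typescript
// Hypothetical helper, not taken from the LDM codebase.
type Scope = string; // e.g. "dialogs:write", "mailing:write", or "*" for the web UI

/** Returns true when the granted set covers the scope a handler requires. */
function hasScope(granted: readonly Scope[], required: Scope): boolean {
  // "*" is what the session path assigns, so it matches every required scope.
  return granted.some((s) => s === "*" || s === required);
}

// A limited key built from scope names mentioned in the post.
const limitedKeyScopes: Scope[] = ["crm:read", "dialogs:write", "mailing:write"];

console.log(hasScope(limitedKeyScopes, "mailing:write")); // allowed
console.log(hasScope(limitedKeyScopes, "billing:read"));  // denied ("billing:read" is an assumed name)
console.log(hasScope(["*"], "billing:read"));             // allowed for the web UI session
```

Keeping the check a pure function makes it easy to unit-test separately from the guard itself.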
The MCP server
Published as the npm package ldm-crm-mcp. Under the hood it's a thin wrapper over /api/*: takes LDM_API_KEY from env, proxies MCP tool calls to HTTP. About 30 tools for the most common operations, ~120 endpoints total available via generic invocation.
Claude Desktop / Cursor config:
```json
{
  "mcpServers": {
    "ldm-crm": {
      "command": "npx",
      "args": ["-y", "ldm-crm-mcp"],
      "env": { "LDM_API_KEY": "ldm_pk_..." }
    }
  }
}
```
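The "thin wrapper" idea can be sketched as a lookup from tool name to HTTP request. Everything below is illustrative: the tool names, the route table, and the base URL are assumptions, not the actual ldm-crm-mcp source, and the MCP SDK wiring is left out on purpose.

```typescript
// Hypothetical sketch of mapping an MCP tool call onto an /api/* request.
interface Route { method: "GET" | "POST"; path: string; }
interface HttpRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body?: string;
}

// Assumed tool names; the real package exposes about 30 of them.
const ROUTES: Record<string, Route> = {
  list_contacts: { method: "GET", path: "/api/contacts" },
  create_task: { method: "POST", path: "/api/tasks" },
};

function toHttpRequest(
  baseUrl: string,
  apiKey: string, // the value of LDM_API_KEY
  tool: string,
  args: Record<string, string | number>,
): HttpRequest {
  const route = ROUTES[tool];
  if (!route) throw new Error(`Unknown tool: ${tool}`);

  const headers: Record<string, string> = { Authorization: `Bearer ${apiKey}` };

  if (route.method === "GET") {
    // GET tools pass their arguments as a query string.
    const query = Object.keys(args)
      .map((k) => `${encodeURIComponent(k)}=${encodeURIComponent(String(args[k]))}`)
      .join("&");
    return { method: "GET", url: baseUrl + route.path + (query ? `?${query}` : ""), headers };
  }

  // POST tools pass their arguments as a JSON body.
  headers["Content-Type"] = "application/json";
  return { method: "POST", url: baseUrl + route.path, headers, body: JSON.stringify(args) };
}
```

Because the wrapper only translates, every scope and business rule stays on the server side, which is the point of the "UI = MCP" design.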
Wiring a voice assistant on top
Voice integration pipeline:

Voice → STT → text
Text → skill backend → MCP tool selection + parameters
MCP tool call → /api/* → response
Response → reply phrasing → TTS

The "brain" inside the skill backend is Claude via API. It interprets free-form speech and picks the appropriate MCP tool from the server's tool list. No hand-written grammars per command.
Auth: OAuth account linking on the voice platform → user gets a Bearer key in their LDM account with a limited scope (dialogs:read, dialogs:write, mailing:write, contacts:read).
All state-changing actions (start a mailing, send a single email, update a deal) require voice confirmation: the skill reads back a summary and waits for "yes". Reads and briefings need no confirmation.
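A confirmation gate like this can be modeled as a tiny state machine in the skill backend. The sketch below is an assumption about how one might structure it; the tool names and the `ConfirmationGate` class are hypothetical. It fails closed: anything other than a clear "yes" cancels the pending action.

```typescript
type Step =
  | { kind: "execute"; tool: string }
  | { kind: "ask"; prompt: string }
  | { kind: "cancelled" };

// Assumed tool names for the actions the post says need confirmation.
const NEEDS_CONFIRMATION = ["start_mailing", "send_email", "update_deal"];

class ConfirmationGate {
  private pending: { tool: string; summary: string } | null = null;

  /** Called when the model picks a tool. Reads pass through, writes are parked. */
  request(tool: string, summary: string): Step {
    if (NEEDS_CONFIRMATION.indexOf(tool) === -1) return { kind: "execute", tool };
    this.pending = { tool, summary };
    return { kind: "ask", prompt: `${summary} Confirm?` };
  }

  /** Called with the user's next utterance. The pending action is always cleared. */
  reply(utterance: string): Step {
    const pending = this.pending;
    this.pending = null;
    const normalized = utterance.trim().toLowerCase().replace(/[.!]+$/, "");
    if (pending && normalized === "yes") return { kind: "execute", tool: pending.tool };
    return { kind: "cancelled" };
  }
}
```

Clearing the pending action on every reply means a misheard word can never confirm a stale request later in the conversation.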
Field case: a trade show floor
A real test on a real booth, May 22, 2026. One of our early users (commercial director at a composite-pool manufacturer) is working a booth at a water-infrastructure trade show in Moscow. A landscape designer walks up, asks for a catalog of pond/water-feature solutions, leaves a business card, moves on.
Normal flow: enter the contact into CRM, open laptop, pick segment "landscape designers — subscribed", attach current catalog version, check template, send. 10–15 minutes if you're at a computer.
He pulled out his phone instead:

User: Send the pond solutions catalog to all subscribed landscape designers.

A: Found 247 contacts in segment "landscape designers — subscribed".
Template: "Pond Solutions Catalog v3, May 2026".
Confirm send?

User: Yes.

A: Sending.
Under the hood, the agent in the skill backend ran:
```bash
# 1. Resolve segment
GET /api/contacts?tagId=landscape-designers&subscribed=1&pageSize=500
→ 247 contacts

# 2. Resolve creative
GET /api/creatives?search=pond+catalog&latest=1
→ creativeId

# 3. Create mailing task
POST /api/tasks
{
  "methodId": 2,
  "creativeId": "cmoue9...",
  "contactListId": "",
  "accountId": ""
}
→ taskId, status: DRAFT

# 4. Self-approve (scope: mailing:write)
POST /api/mailing/$TASK_ID/approve
{ "note": "Voice-approved at trade show booth" }

# 5. Start
POST /api/tasks/$TASK_ID/start
→ status: ACTIVE
```
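Steps 3 to 5 of the trace above could be chained like this. It is a sketch under stated assumptions: the transport is injected as a `call` function rather than tied to a specific HTTP client, and the response field names `taskId` and `status` are inferred from the arrows in the trace, not from published API docs.

```typescript
// Injected transport, so the sketch does not depend on any particular HTTP client.
type Call = (
  method: "GET" | "POST",
  path: string,
  body?: unknown,
) => Promise<{ taskId?: string; status?: string }>;

interface MailingInput {
  creativeId: string;
  contactListId: string;
  accountId: string;
  note: string;
}

async function createApproveStart(
  call: Call,
  input: MailingInput,
): Promise<{ taskId: string; status: string | undefined }> {
  // Step 3: create the mailing task (comes back as DRAFT in the trace).
  const task = await call("POST", "/api/tasks", {
    methodId: 2,
    creativeId: input.creativeId,
    contactListId: input.contactListId,
    accountId: input.accountId,
  });
  if (!task.taskId) throw new Error("Task creation returned no taskId");

  // Step 4: self-approve, which requires the mailing:write scope.
  await call("POST", `/api/mailing/${task.taskId}/approve`, { note: input.note });

  // Step 5: start the task (ACTIVE in the trace).
  const started = await call("POST", `/api/tasks/${task.taskId}/start`);
  return { taskId: task.taskId, status: started.status };
}
```

Each step awaits the previous one, so a failed approval stops the chain before anything is sent.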
Twenty seconds later, the mailing had gone out to the full segment. Then the interesting part started.
Per-message inbox verification
The standard model in cold/B2B email platforms: warm-up + inbox rotation + a pre-flight inbox placement test (send 20 emails to seed mailboxes before launch, compute the % landing in inbox). That's a statistical estimate from a pre-send sample.
Our model: every actual outbound is verified post-send against a network of seed mailboxes. We run 10–30 seed mailboxes per provider, and about 100 each for Gmail and Outlook. Roughly: for each SMTP send to a real recipient, we send a test twin in parallel to a seed mailbox on the same provider, using the same headers, body, and sending account. We then check via IMAP whether the twin landed in INBOX or in SPAM/JUNK/Quarantine, and write the result to the dialog's placement field.
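The last step, turning the IMAP folder where the twin was found into a placement value, might look roughly like this. The folder-name matching is an assumption: real providers name their spam folders differently, and the function `placementFromFolder` is hypothetical.

```typescript
type Placement = "inbox" | "spam" | "unchecked";

// Assumed substrings; the post names SPAM, JUNK and Quarantine as the non-inbox folders.
const SPAM_MARKERS = ["spam", "junk", "quarantine"];

/**
 * Maps the folder where the seed twin was found to a placement value.
 * `null` means the twin has not been located yet.
 */
function placementFromFolder(folder: string | null): Placement {
  if (folder === null) return "unchecked";
  const name = folder.trim().toLowerCase();
  if (name === "inbox") return "inbox";
  if (SPAM_MARKERS.some((marker) => name.indexOf(marker) !== -1)) return "spam";
  // Any other folder is treated as unknown rather than guessed.
  return "unchecked";
}

console.log(placementFromFolder("INBOX"));
console.log(placementFromFolder("[Gmail]/Spam"));
console.log(placementFromFolder(null));
```

Treating unrecognized folders as "unchecked" keeps billing conservative, since only confirmed inbox placements count.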
It's not a perfect proxy — a seed mailbox isn't the actual recipient, and providers may filter individually based on per-recipient signals. But it's substantially better than a pre-flight test because:

The check runs on every real send, not a sample.
It captures the reputation state at the moment of sending, not a day before campaign launch.
A drop into spam for a specific provider gets caught in real time and triggers an automatic pause if the spam rate within a window exceeds a threshold.
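The automatic pause could be driven by a per-provider sliding-window check along these lines. The window length, the 10% threshold, and the minimum sample size are assumed values for illustration; the post does not state the real ones.

```typescript
interface Check {
  at: number; // epoch milliseconds when the placement was recorded
  placement: "inbox" | "spam";
}

/** True when the spam share among recent checks for one provider exceeds the threshold. */
function shouldPause(
  checks: Check[],
  now: number,
  windowMs: number,
  maxSpamRate: number,
  minSamples: number,
): boolean {
  const recent = checks.filter((c) => c.at <= now && now - c.at <= windowMs);
  // Too few samples: do not pause on noise.
  if (recent.length < minSamples) return false;
  const spam = recent.filter((c) => c.placement === "spam").length;
  return spam / recent.length > maxSpamRate;
}

// Helper to build sample data for the usage lines below.
function sample(inbox: number, spam: number, at: number): Check[] {
  const out: Check[] = [];
  for (let i = 0; i < inbox; i++) out.push({ at, placement: "inbox" });
  for (let i = 0; i < spam; i++) out.push({ at, placement: "spam" });
  return out;
}

const nowMs = 1_000_000;
const tenMinutes = 10 * 60 * 1000;
// 18 inbox and 3 spam is a spam share of 3/21, above an assumed 10% threshold.
console.log(shouldPause(sample(18, 3, nowMs), nowMs, tenMinutes, 0.1, 20));
// 142 inbox and 1 spam is a spam share of 1/143, below it.
console.log(shouldPause(sample(142, 1, nowMs), nowMs, tenMinutes, 0.1, 20));
```

The minimum sample size guards against pausing a whole campaign because the first two checks on a small provider happened to land in spam.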

The endpoint the voice skill polls after a mailing:
```bash
GET /api/dialogs/stats?taskId=$TASK_ID
{
  "total": 247,
  "placement": {
    "inbox": 231,
    "spam": 4,
    "unchecked": 12
  },
  "byProvider": {
    "gmail": { "inbox": 142, "spam": 1 },
    "yandex": { "inbox": 47, "spam": 0 },
    "outlook": { "inbox": 18, "spam": 3 }
  }
}
```
Voice response: "247 sent. 231 in inbox, 4 in spam, 12 still verifying. Outlook dropped — 3 of 21 in spam."
Billing: 231 inbox-delivered. The 4 in spam and the 12 still unchecked are not billed.
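Turning the stats payload into a spoken summary and a billable count is mostly string assembly. The sketch below assumes the response shape shown in the stats example; the function names and the "worst provider" wording are illustrative, not the skill's actual phrasing logic.

```typescript
interface ProviderStats { inbox: number; spam: number; }
interface DialogStats {
  total: number;
  placement: { inbox: number; spam: number; unchecked: number };
  byProvider: Record<string, ProviderStats>;
}

/** Only confirmed inbox placements are billable. */
function billable(stats: DialogStats): number {
  return stats.placement.inbox;
}

/** Builds a short sentence for TTS, calling out the provider with the highest spam share. */
function summarize(stats: DialogStats): string {
  const p = stats.placement;
  let text = `${stats.total} sent. ${p.inbox} in inbox, ${p.spam} in spam, ${p.unchecked} still verifying.`;

  let worst: string | null = null;
  let worstRate = 0;
  for (const name of Object.keys(stats.byProvider)) {
    const s = stats.byProvider[name];
    const checked = s.inbox + s.spam;
    const rate = checked > 0 ? s.spam / checked : 0;
    if (rate > worstRate) {
      worst = name;
      worstRate = rate;
    }
  }
  if (worst !== null) {
    const s = stats.byProvider[worst];
    text += ` Worst provider: ${worst}, ${s.spam} of ${s.inbox + s.spam} in spam.`;
  }
  return text;
}

const exampleStats: DialogStats = {
  total: 247,
  placement: { inbox: 231, spam: 4, unchecked: 12 },
  byProvider: {
    gmail: { inbox: 142, spam: 1 },
    yandex: { inbox: 47, spam: 0 },
    outlook: { inbox: 18, spam: 3 },
  },
};

console.log(summarize(exampleStats));
console.log(billable(exampleStats)); // 231
```

With the sample payload, 231 + 4 + 12 adds up to the 247 total, and Outlook is singled out because 3 of its 21 checked messages landed in spam.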
Where voice control breaks
No marketing veneer. Voice covers maybe 20–30% of an operator's actual workflow. The other 70–80% really is more efficient via a screen.
Works well:

Morning briefings on incoming dialogs.
Triggering a pre-configured mailing to a known segment.
Replying to a specific incoming email (short).
Checking delivery status of a campaign or individual message.
Ad-hoc stat questions.

Doesn't work:

Complex multi-parameter filters. Dictating "companies in city X, 50–500 headcount, e-commerce, no activity in 30 days, tag Y" by voice is misery. This belongs in a UI.
HTML/creative editing.
Designing multi-step automation pipelines (best-time-sending, follow-up sequences, A/B branching).
Latin-alphabet brand / domain recognition. The STT layer systematically butchers "Apple" / "Acme" / domain names. This can be partly mitigated with a phonetic-normalization layer on top of CRM data, but matching accuracy is ~70–85%, not 100%.
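As a rough illustration of matching a misheard name against CRM data, the sketch below uses plain Levenshtein edit distance. This is a simpler stand-in, not phonetic normalization and not the approach the platform uses; the function names and the distance cutoff are assumptions.

```typescript
/** Classic Levenshtein edit distance using two rolling rows. */
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const cur: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      cur.push(Math.min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = cur;
  }
  return prev[b.length];
}

/** Returns the known name closest to what STT heard, or null if nothing is within maxDistance. */
function closestName(heard: string, known: string[], maxDistance: number): string | null {
  const h = heard.trim().toLowerCase();
  let best: string | null = null;
  let bestDistance = maxDistance + 1;
  for (const name of known) {
    const d = editDistance(h, name.toLowerCase());
    if (d < bestDistance) {
      best = name;
      bestDistance = d;
    }
  }
  return best;
}

console.log(closestName("akme", ["Acme", "Apple"], 2));   // "Acme"
console.log(closestName("zzzzzz", ["Acme", "Apple"], 2)); // null
```

Returning null when nothing is close enough lets the skill ask the user to repeat or spell the name instead of acting on a wrong company.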

Fragile points:

OAuth refresh token handling. When users change passwords on the voice platform's OAuth provider, the linked Bearer key gets orphaned and the user has to re-link.
Voice confirmation in noisy environments. On a trade show floor or in a car with windows open, a confirmation "yes" gets misheard about 1 in 3 times.
Latency. The STT → Claude (intent + tool selection) → MCP → /api/* → response → phrasing → TTS chain runs 4–8 seconds on a typical command. Acceptable for email ops, noticeable for conversational UX.

Architectural tradeoffs
MCP is just transport. The value depends on what you expose through it. Many CRM platforms expose read-only MCP, or a limited subset of objects. Ours exposes the full UI surface including starting mailings and self-approval — useful for agents, but requires a sane scope model and confirmation handling on the client side.
"UI = MCP" has a cost. Any new endpoint becomes agent-callable by default. That requires discipline — you can't dump anything into /api/* that should be UI-only for UX or safety reasons. We solve this with scopes and additional middleware on specific handlers, but it adds design overhead.
Voice as a UI is niche. It's not a replacement for the dashboard. It's an extension for specific scenarios: mobility, hands-busy contexts, fast briefings. Maybe 20–30% of an operator's workflow actually benefits, no more.
What's next

MCP server v2 with explicit JSON Schema per tool (currently many tools return loose JSON, so agents have to parse it manually).
Voice-friendly responses on /api/dialogs/stats — flat, terse, no nested objects, so TTS doesn't waste seconds reading structure.
ChatGPT MCP App directory submission.
Apple Intelligence MCP App Extensions support, once Apple opens that to third parties.

Docs are public: developers.live-direct-marketing.online. MCP package: ldm-crm-mcp on npm. Questions on architecture / implementation — happy to discuss in comments.