Matt Cretzman

Posted on • Originally published at blog.mattcretzman.com

Building In-Context Messaging Into MCP Tool Responses

MCP (Model Context Protocol) is pull-based. The client — Claude, ChatGPT via custom Actions, Copilot — initiates every request. There's no server-push mechanism.

That means if you want to deliver a message to a user inside their AI tool, you can't push it. You need to wait for them to show up.

At Skill Refinery, we built a message queue layer that does exactly this. When an expert broadcasts a message, it enters a queue. The next time a subscriber's AI tool makes any tool call to our MCP server, pending messages are appended to the response.

From the subscriber's side, it feels like getting a notification the moment they open their AI. Here's the architecture pattern and the design decisions that make it work.

The Problem

Knowledge platforms are passive. Subscribers ask, the platform answers. Experts have no way to reach their subscribers except through saturated external channels — email (121 messages per day for the average worker), push notifications (46 per day per smartphone user), or social media (algorithm-dependent reach).

If your MCP server is already handling tool calls from subscribers, you already have a delivery channel. You just need to build the plumbing.

The Architecture Pattern

Four components:

1. Message queue table. Stores pending broadcasts with sender info, message type, content, expiration timestamp, and a dismissable boolean.

2. Delivery tracking table. Records which subscribers received which messages, keyed by a SHA-256 hash of the subscriber's MCP key — never the raw key. Tracks delivery count per subscriber-message pair.

3. Middleware layer. Sits between the tool call handler and the response. On every tool call: query pending messages for this subscriber, filter by rate limits and delivery caps, append qualifying messages to the response payload, insert delivery records in batch.

4. Subscriber control tool. A dedicated MCP tool that lets subscribers manage their preferences through natural language — dismiss a message, opt out of a sender, opt out of a message type, or nuclear opt-out of all broadcasts.
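The two storage components can be sketched as a minimal schema. Table and column names here are illustrative assumptions, not the actual Skill Refinery schema:

```python
import sqlite3

# Hypothetical schema; names and columns are illustrative only.
SCHEMA = """
CREATE TABLE message_queue (
    id           INTEGER PRIMARY KEY,
    sender_id    TEXT NOT NULL,
    message_type TEXT NOT NULL,      -- announcement | content | compliance
    content      TEXT NOT NULL,
    expires_at   TEXT NOT NULL,      -- ISO-8601 expiration timestamp
    dismissable  INTEGER NOT NULL DEFAULT 1
);

CREATE TABLE message_deliveries (
    message_id      INTEGER NOT NULL REFERENCES message_queue(id),
    subscriber_hash TEXT NOT NULL,   -- SHA-256 of the MCP key, never the raw key
    delivery_count  INTEGER NOT NULL DEFAULT 0,
    dismissed       INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (message_id, subscriber_hash)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
```

The composite primary key on the delivery table makes the per-subscriber delivery count a single-row update rather than a count query.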

The Design Decisions

These aren't technical afterthoughts. They define the product experience.

Knowledge answer first, messages after. The subscriber asked a question. Answer it. Then append messages. Messages never hijack the primary interaction. This is the foundational principle — flip the order and you destroy the core product experience.

Pull-based delivery, not push. True server-push doesn't exist in MCP today. Rather than waiting for Streamable HTTP transport to mature and for client implementations to support server-initiated notifications, build on what works now. The infrastructure is the same either way — when push arrives, you flip the delivery channel without rebuilding the queue.

Max 2 messages per tool call. Regardless of how many senders are active for a given subscriber. A subscriber with 10 expert subscriptions doesn't get hit with 10 messages at once.

Daily cap per subscriber. Even across multiple sessions and multiple tool calls, a subscriber sees a maximum number of unique messages in a 24-hour window. The system protects the experience at the subscriber level, not the sender level.

Auto-dismiss after 3 deliveries. If a subscriber has seen the same message across three separate tool calls and hasn't engaged (clicked, responded, or explicitly dismissed), the message is automatically marked dismissed. This prevents stale-message pile-up.
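The per-call cap, daily cap, and auto-dismiss rule combine into one eligibility filter. A minimal sketch — the daily cap value and function shape are assumptions, and it treats the daily cap as a budget of unique messages:

```python
MAX_PER_CALL = 2     # hard cap per tool call
DAILY_CAP = 5        # assumed value; the right number is a product decision
AUTO_DISMISS_AT = 3  # deliveries without engagement before auto-dismiss

def select_messages(pending, seen_today, delivery_counts):
    """Pick which pending messages ride along on this tool call.

    pending         -- message ids for this subscriber, ordered by priority
    seen_today      -- set of unique message ids delivered in the last 24h
    delivery_counts -- {message_id: deliveries without engagement}
    """
    budget = max(0, DAILY_CAP - len(seen_today))
    picked = []
    for msg_id in pending:
        if len(picked) >= min(MAX_PER_CALL, budget):
            break
        # Auto-dismiss: stop resurfacing after AUTO_DISMISS_AT ignored deliveries.
        if delivery_counts.get(msg_id, 0) >= AUTO_DISMISS_AT:
            continue
        picked.append(msg_id)
    return picked
```

Note that both caps bind at the subscriber level: a sender with remaining quota still gets nothing through if the subscriber's budget is spent.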

SHA-256 hashed keys for delivery tracking. Delivery records need to associate messages with subscribers, but storing raw MCP keys in a delivery table is a security risk. Hash the key, use the hash for lookup, never persist the raw value.
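The hashing step itself is a one-liner. A sketch, assuming the MCP key arrives as a string:

```python
import hashlib

def subscriber_hash(mcp_key: str) -> str:
    """Hash the subscriber's MCP key for delivery-record lookups.

    Only this digest is persisted; the raw key never touches the
    delivery table.
    """
    return hashlib.sha256(mcp_key.encode("utf-8")).hexdigest()
```

If your keys are short or low-entropy, a keyed hash (HMAC with a server-side secret) hardens the table against offline brute-forcing; for long random API keys, plain SHA-256 is adequate.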

Cross-Platform Rendering

This is the biggest unknown. Claude, ChatGPT, and Copilot all handle appended text in tool responses differently. Some render it cleanly. Some truncate. Some prioritize the "answer" portion and collapse additional content.

The approach: keep messages plain text, self-contained, and short. No markdown formatting assumptions. No HTML. Each message should be legible as a standalone paragraph even if the AI wraps it differently than expected.

As a fallback, build a dedicated check_messages tool. If a platform consistently drops appended content, subscribers on that platform can explicitly ask their AI to check for pending messages. It's a less elegant experience but a reliable backup.
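The plain-text rule translates into a trivially simple formatter. A sketch, assuming messages arrive as dicts with a sender and body (field names are illustrative):

```python
def append_messages(answer: str, messages: list[dict]) -> str:
    """Append pending messages after the primary answer as plain,
    self-contained paragraphs -- no markdown or HTML assumptions."""
    if not messages:
        return answer
    tail = [f"[Message from {m['sender']}] {m['body']}" for m in messages]
    return "\n\n".join([answer, *tail])
```

Each message survives as a standalone paragraph even if the client collapses or reflows the appended content.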

Fan-Out at Scale

An org admin broadcasting to 16,000 members needs to insert 16,000 delivery records. Doing this synchronously on the broadcast call would time out.

The approach: batch inserts of 500 rows per transaction. Queue the fan-out asynchronously. Analytics updates (delivery counts, open tracking) happen in background jobs that never block the tool response.

The tool call response path must stay fast. Any work that doesn't directly affect the subscriber's current response goes async.
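The batched fan-out is straightforward to sketch with sqlite3 standing in for the real database (table name and batch mechanics are illustrative):

```python
import sqlite3

BATCH_SIZE = 500  # rows per transaction, per the fan-out approach above

def fan_out(conn, message_id, subscriber_hashes):
    """Insert delivery records in batches so a 16,000-member broadcast
    never runs as one giant synchronous transaction."""
    rows = [(message_id, h) for h in subscriber_hashes]
    for i in range(0, len(rows), BATCH_SIZE):
        with conn:  # one transaction per batch; commits on exit
            conn.executemany(
                "INSERT INTO deliveries (message_id, subscriber_hash) VALUES (?, ?)",
                rows[i : i + BATCH_SIZE],
            )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (message_id INTEGER, subscriber_hash TEXT)")
fan_out(conn, 1, [f"hash{i}" for i in range(1200)])
```

In production this loop would run in a background job queued at broadcast time, so the broadcasting call returns immediately.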

Rate Limiting as Product Design

Rate limits on this system aren't about server protection. They're product design decisions that define the subscriber experience.

Experts get a limited number of sends per day. Org admins get more, because internal operational messages have higher urgency. The per-subscriber caps (messages per tool call, messages per day) override sender limits — the subscriber experience is always the ceiling.

Getting these numbers wrong in either direction is a product failure, not a technical one. Too restrictive and senders can't reach their audience. Too permissive and subscribers opt out of everything.

Message Types and Expiration

Different message types have different lifespans. Announcements expire quickly — they're time-sensitive by nature. Compliance or action-required messages persist longer. Content notifications sit in between.

Org compliance messages support a dismissable: false flag. The message keeps delivering until its expiration date, regardless of how many times the subscriber has seen it. This directly supports use cases where a company needs confirmation that staff received a critical update.
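The dismissable flag changes only one branch of the deliverability check. A sketch, with assumed field names:

```python
from datetime import datetime, timezone

def still_deliverable(msg, delivery_count, auto_dismiss_at=3):
    """Decide whether a message keeps delivering to a subscriber.

    Non-dismissable compliance messages ignore the auto-dismiss cap and
    run until expiry; everything else stops after auto_dismiss_at views.
    """
    if datetime.now(timezone.utc) >= msg["expires_at"]:
        return False
    if not msg["dismissable"]:
        return True  # compliance: deliver until expiry regardless of views
    return delivery_count < auto_dismiss_at
```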

The Reusable Pattern

This pattern generalizes to any MCP server:

  1. Add a message queue table to your database
  2. Build middleware that checks for pending messages on every tool call
  3. Append qualifying messages to the response, after the primary content
  4. Track deliveries with hashed subscriber keys
  5. Expose a subscriber-control tool for preference management
  6. Rate limit at the subscriber level, not just the sender level

The infrastructure is lightweight — four database tables, a middleware function, and one additional MCP tool. No external services required.
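The middleware step reduces to a wrapper around your existing tool handlers. A minimal sketch — the handler signature and helper names are assumptions, not a specific MCP framework's API:

```python
def with_messages(handler, get_pending):
    """Wrap a tool-call handler: answer first, qualifying messages after.

    handler     -- the normal tool handler, returns the answer string
    get_pending -- returns already-filtered message strings for this subscriber
    """
    def wrapped(subscriber_hash, *args, **kwargs):
        answer = handler(*args, **kwargs)        # 1. answer the question
        pending = get_pending(subscriber_hash)   # 2. rate-limited pending messages
        if not pending:
            return answer
        return answer + "\n\n" + "\n\n".join(pending)  # 3. append, never prepend
    return wrapped
```

Because the wrapper runs after the handler, a failure in the message path can be caught and swallowed without ever breaking the primary answer.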

The harder part isn't the code. It's the product decisions: what are the right rate limits? When does a message become stale? How do you balance sender value against subscriber trust? Those decisions are the product.

If you're building MCP servers and delivering knowledge to subscribers, you already have the delivery channel. The message queue just makes it a communication channel.

We're building this at Skill Refinery right now. If you're an expert, coach, or consultant who wants to deliver knowledge — and soon, messages — through your subscribers' AI tools, that's where you set up. If you're building in the MCP space and want to talk integration or partnership, grab time on my calendar.


I'm Matt Cretzman. I build AI agent systems through Stormbreaker Digital. Ventures include Skill Refinery, TextEvidence, LeadStorm AI. Writing at mattcretzman.com.

Originally published at blog.mattcretzman.com.