Why I wouldn't pick a single LLM — and the platform layer (Claude + GPT + Gemini + Grok, with approval gates and audit hooks) that turns four APIs into one product a CFO can sign off on.
Introduction
What is a virtual digital employee service?
It's a software service that provisions AI "employees" — agents scoped to a specific role (HR Analyst, Finance Controller, Product Designer) rather than generic assistants — and rents them to businesses as a subscription. Each digital employee has a written job description, a defined toolbelt (the HRIS, the payroll system, a Slack channel, a ticketing system), and a remit to operate across those systems continuously, 24/7, without a human having to prompt every step. Unlike a chatbot, it takes durable action on the customer's behalf — filing invoices, drafting contracts, reconciling books, exporting design specs — which means it also has to ask for a human's approval before doing anything irreversible, and keep a full audit trail of what it did. From the business's perspective it's a virtual hire: lower cost, always on, narrow in scope but deep within that scope, and accountable through a log book rather than a performance review.
What defines a digital employee — three dimensions
Three things separate "a hire" from "a chatbot." They're also the axes we'll keep coming back to when we compare SDKs and architectures below.
1. What they can do — actions and tasks
A digital employee acts, not just answers. That means both read operations (look up an employee's salary, pull last quarter's sales numbers, summarize a contract) and write operations (submit a payroll invoice, send an approved contract, file a Jira ticket, post to Slack on the customer's behalf). Reads run freely; writes pause for a human to click Approve before they execute. The set of available actions is bounded by role: the HR Payroll Analyst can draft and (with approval) submit a payroll run, but it cannot open a Figma file or create a Stripe charge — those belong to other employees on the roster. Tasks are typically multi-step, not single-turn Q&A: "prepare March payroll" fans out into list-employees → get-salary-for-each → compute-gross-to-net → draft-invoice → request-approval → submit → notify-finance.
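The read/write split in that fan-out drives everything that follows. As a minimal sketch (the plan structure and the `steps_needing_approval` helper are illustrative, not platform API; step names mirror tools used later in this doc):

```python
# "Prepare March payroll" as a flat task plan, each step tagged read or write.
PAYROLL_PLAN = [
    {"step": "list_employees",       "writes": False},
    {"step": "get_salary",           "writes": False},  # repeated per employee
    {"step": "compute_gross_to_net", "writes": False},
    {"step": "draft_invoice",        "writes": False},  # draft only; nothing leaves yet
    {"step": "submit_invoice",       "writes": True},   # real money moves: needs approval
    {"step": "notify_finance",       "writes": True},   # sends a message on the tenant's behalf
]

def steps_needing_approval(plan: list[dict]) -> list[str]:
    """Reads run freely; every write pauses for a human click."""
    return [s["step"] for s in plan if s["writes"]]
```

The same two-way classification reappears in the approval-gate code later: the gate only needs to know whether a step writes, not what it does.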
A digital employee's limits are enforced by the framework, not by asking the model nicely — we'll cover the exact harness mechanisms (tool allowlists, approval gates, tenant scoping, budget caps, audit inevitability) in the implementation section below.
2. What knowledge do they have — and what data do they reach?
Two kinds, layered. First, a job description: a written system prompt specifying what the role does, what it never does, the policies it follows ("never send money without explicit approval", "use the company's approved legal templates", "always cc finance@ on payroll confirmations"), and enough domain vocabulary to sound like a practitioner rather than a generalist chatbot. Second, scoped access to the customer's systems: for Acme's HR Analyst that's Acme's HRIS, Acme's payroll provider, Acme's own database, and the Slack/email channels Acme has authorized — and only those. The boundary is both tenant-scoped and role-scoped at the same time: Acme's HR Analyst cannot see Contoso's data (tenant isolation) and cannot see Acme's Figma files either (role isolation). Session memory on top of that lets the employee remember prior conversations so Jane doesn't re-explain context every Monday morning.
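The two layers can be assembled server-side in a few lines. A sketch, assuming the job description lives in a prompt file and policies are plain strings; `build_system_prompt` and its section headings are illustrative names, not a prescribed format:

```python
def build_system_prompt(job_description: str, policies: list[str], tenant_name: str) -> str:
    """Assemble the role's system prompt on the server. Tenant users never
    see or edit this string, which is what makes the job description immutable."""
    policy_block = "\n".join(f"- {p}" for p in policies)
    return (
        f"{job_description}\n\n"
        f"Policies (non-negotiable):\n{policy_block}\n\n"
        f"You are acting for the tenant: {tenant_name}. "
        f"Never reference any other tenant's data."
    )
```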
3. How do they communicate with the business?
They have to meet the business where the business already works, which means multi-channel by default. Inbound: Slack DMs and channel mentions, Teams, email, SMS, webhooks from the customer's own SaaS apps, and a web console for longer-form work. Outbound: replies go back on the same channel the request arrived on, streamed as they're generated. Sitting on top of the conversational surface are two other streams that turn this from "a chat toy" into a real product: an approval inbox where humans click Approve/Deny on proposed write operations (Slack interactive buttons, web app, mobile push), and an activity log that tenant admins can inspect for compliance and confidence ("what did the Finance Controller do last week, and was every write approved?"). A digital employee can also initiate conversation, not just respond to it — proactive reminders ("Q1 payroll is due in 5 days; shall I draft it?"), scheduled runs on cron, and escalations to a human when it's genuinely stuck. Chat alone is table stakes; chat + approval inbox + activity log is the product.
The competitive reality — and why build our own anyway?
Before we spend pages arguing about technical details, we have to answer a prior question: why would an organization build its own virtual digital employee service when the three hyperscalers just shipped versions of it? As of April 22, 2026 — the day this doc was last revised — OpenAI, Google, and Microsoft all have enterprise-agent products in market targeting the exact workflows described above. For many organizations, buying one of those is the right call. This section is for the ones where it isn't — specifically, organizations that need full control over their data, their models, and their agent behavior, and have the engineering capability to build and operate their own.
What shipped in April 2026
| Dimension | OpenAI Workspace Agents | Google Gemini Enterprise / Agentspace | Microsoft Copilot Studio |
|---|---|---|---|
| Launch | Research preview, Apr 22, 2026 (today) | Apr 22, 2026 (today) | Multi-agent orchestration GA, Apr 2026 |
| Where it runs | Codex in OpenAI's cloud | Gemini on Google Cloud | Azure / Power Platform |
| How you build an agent | UI wizard inside ChatGPT ("describe a workflow, ChatGPT turns it into an agent"), or templates for finance/sales/marketing | Agent Designer (low/no-code) + Agent Garden prebuilts | Copilot Studio maker canvas; code-first path via M365 Agents SDK |
| Distribution channel | ChatGPT Business / Enterprise / Edu / Teachers seats + Slack | Gemini Enterprise seats (Business/Standard/Plus/Frontline) + M365/Workspace connectors | M365 seats |
| Pricing | Free until May 6 2026, then credit-based | Per-edition seat pricing (not public) | $30/user/month (paid yearly) |
| HITL approvals | Built in — "require the agent to ask for permission before moving forward" for sensitive steps (edit spreadsheet, send email, add calendar event) | Human approval checkpoints in Agent Designer workflows; governance via Agent Identity + Agent Gateway | Governance + approvals via Power Platform |
| Enterprise governance | Compliance API, admin console, prompt-injection safeguards, analytics | VPC-SC, CMEK, HIPAA/FedRAMP (Standard/Plus), Model Armor | Managed security + governance as Microsoft platform service |
| Named example agents (overlap with our roles) | Lead Outreach, Weekly Metrics Reporter, Third-Party Risk Manager, Software Reviewer, Product Feedback Router; OpenAI's internal accounting agent does month-end close with workpapers | Prebuilts include NotebookLM Enterprise, Deep Research; low-code Agent Designer for custom | Multi-agent orchestration across teams, Fabric-backed data agents |
| Lock-in posture | Tenant must live inside ChatGPT | Tenant must live inside Gemini Enterprise / GCP | Tenant must live inside M365 |
Where a hyperscaler wins the head-to-head sale
- Buyer is already on ChatGPT Enterprise, Google Workspace / Gemini Enterprise, or Microsoft 365.
- Single-org deployment where the organization itself is the tenant — one workspace, one admin console, one bill.
- Budget tolerates per-seat enterprise pricing (OpenAI credit model, Google Gemini Enterprise editions, or Copilot Studio at $30/user/month).
- Buyer trusts OpenAI / Google / Microsoft with their data and is happy for the agent to be "a ChatGPT feature" or "a Copilot" rather than a branded product.
For those buyers, there's no reason to build their own. Acknowledging that honestly is the point of this section.
When and why an organization should build its own
The section above defines who doesn't need to build. By elimination, the organizations that should build their own are the ones that fail one or more of those criteria — and the common thread is control. Specifically:
- Data sovereignty and residency. When your employee records, financial data, patient information, or legal documents flow through an agent, the hyperscaler product decides where that data lives and who can access it. Workspace Agents runs on OpenAI's cloud. Gemini Enterprise runs on GCP. Copilot Studio runs on Azure. If your compliance posture (GDPR, HIPAA, SOC2, sector-specific regulation) requires data to stay within a specific geography, within your own infrastructure, or never touch a third-party LLM provider's servers at all — you need to own the stack. Building your own means you choose the deployment environment, the credential vault, the data residency, and the retention policy.
- Model control and cost optimization. The hyperscaler products lock you to their model families and their pricing. You can't run a cheaper model for low-stakes queries, swap to a competitor's model when it performs better on a specific task, or run inference on-prem. Building your own lets you route per-tenant or per-task to different models (the `tier_model` pattern in §1 below), negotiate your own API contracts, or self-host open-weight models when the economics demand it.
- Full behavioral control and auditability. With a hyperscaler product, the agent loop is a managed service — you configure it, but you don't own it. You can't inject arbitrary logic between every tool call, you can't guarantee that every action is logged in your audit system before it executes, and you can't enforce organization-specific approval workflows that go beyond the vendor's built-in options. Building your own means the loop runs in your code: every `PreToolUse` and `PostToolUse` hook is yours, every approval gate follows your workflow, and every log line lands in your SIEM, not the vendor's dashboard.
- White-label and multi-tenant architecture. If you're a SaaS vendor, a managed service provider, or a platform builder serving many downstream customers, the hyperscaler products don't fit — their tenant model is "one organization, one workspace." Yours is "one platform, hundreds of isolated customers." Building your own lets you serve that multi-tenant model with per-customer branding, per-customer tool configurations, per-customer billing, and per-customer data isolation — none of which the hyperscaler products are designed to support.
Honest costs of building your own
- Engineering investment. The hyperscaler gives you an agent in minutes via a UI wizard. Building your own means standing up a platform: session management, approval inbox, channel adapters, credential vault, billing meter, audit pipeline, connector catalog. That's a team and a roadmap, not a weekend project.
- Velocity gap. OpenAI, Google, and Microsoft will ship new prebuilt agent templates, new integrations, and new governance features faster than any single org's engineering team. You're trading their velocity for your control.
- Ongoing operational burden. You own uptime, security patching, model version migrations, and compliance certification. A managed service handles that; a self-built service means you handle it.
The decision to build should only be made when the control benefits (data sovereignty, model flexibility, behavioral auditability, multi-tenant architecture) outweigh these costs. For most organizations, they won't. For organizations where data control is non-negotiable or the multi-tenant use case doesn't fit a hyperscaler workspace — they will.
Bottom line
The hyperscaler launches mean the default answer is now "buy, don't build." The build path is justified only when your organization needs full control over data residency, model routing, agent behavior, and audit trails — or when your business model requires multi-tenant white-label architecture the hyperscalers can't provide. For those organizations, the rest of this document explains how to build it, starting with which SDK to use as the foundation.
Combine all LLMs — each one's best part, orchestrated by your platform
No single LLM family is best at everything. The right architecture doesn't pick a winner — it assigns each model to the job it does best. This pattern is already proven in production multi-LLM platforms that use 5+ providers (OpenAI, Anthropic, Gemini, Grok, and specialty APIs) via direct API calls, with a central LLM registry that maps each task type to the right model, a triager that classifies inbound requests and routes them, parallel dispatch to multiple LLMs with timeout deadlines, a combinator that merges responses, and an arbiter that scores quality. No agent SDK required — just your platform code orchestrating the providers directly.
A virtual digital employee service follows the same pattern. Your platform layer — tenant management, approval inbox, audit pipeline, channel adapters, billing — is your code. It doesn't belong to any vendor's SDK. Below it, each digital employee role calls whichever LLM API fits that role's job.
What each LLM family is best at (April 2026 snapshot)
| LLM Family | Flagship (Apr 2026) | Where it leads | Best-fit digital employee roles |
|---|---|---|---|
| Anthropic Claude | Opus 4.7, Sonnet 4.6, Haiku 4.5 | Agentic multi-step tool chains: HLE-with-tools 53.1% (highest), SWE-bench Pro 64.3%. Extended Thinking with adaptive effort. File artifact generation via Managed Agents sandbox (PDF/xlsx/CSV). | HR Payroll Analyst (12-step tool chains), Finance Controller (reconciliation + file deliverables), any role that must reliably finish a real multi-step job using tools. |
| OpenAI GPT | GPT-5.4, o3/o3-pro | Pure reasoning and analytical review: GPQA Diamond 92.8%, ARC-AGI 87.5% (o3). Cleanest handoff model for triage → specialist → escalation. Broadest ecosystem: Realtime API (voice), Codex (code), Code Interpreter (file gen). | Customer Support Lead (triage/routing), Review & Approval Agent (structured validation, quality scoring), any role needing voice interaction or analytical judgment calls. |
| Google Gemini | Gemini 3.1 Pro, 2.5 Flash | Multimodal reasoning (image/video/audio), fastest TTFT (Flash: 250–730ms), cheapest tokens (Flash: $2.50/1M). Best speed-vs-reasoning balance. Deep Think baked into the main model line. | Product Designer (vision over mockups/images), Data Analyst (high-volume cost-sensitive queries), any role where multimodal input or low per-token cost is the binding constraint. |
| xAI Grok | Grok-4-1-fast | Speed-optimized inference, OpenAI-compatible API surface (drop-in replacement). Strong for real-time conversational tasks where latency trumps depth. | Fast-response roles, chat-first interactions, or as a fast fallback when flagship models are slow or over-budget. |
| Specialty (Palantir, domain-specific) | Varies | Domain-locked data and workflows (Foundry ontology, AIP actions). Not general-purpose — useful when the digital employee needs to operate inside a customer's Palantir deployment or other domain-specific platform. | Roles tied to a specific enterprise platform (Foundry-based data analysis, regulated-industry workflows). |
The combination architecture
The platform doesn't care which LLM a role uses; its job is to orchestrate them.
Key architectural patterns (proven in production multi-LLM platforms):
- Central LLM registry. A single configuration maps each task type or role to its model(s): `triager → gemini`, `structured_analysis → claude-opus`, `comparison_arbiter → gpt`, `fast_chat → grok`. Adding a new LLM provider is adding one entry to the registry and one API adapter — not a re-architecture.
- Triager-first routing. A fast, cheap model (e.g. Gemini Flash) classifies every inbound request: task type, required capabilities, include/exclude specific LLMs. The triager decides which models to dispatch to — the user doesn't have to pick.
- Parallel dispatch with deadlines. Fire requests to multiple LLMs simultaneously with a two-phase timeout: wait for the first response, then give stragglers a grace period. This gives you the best response from whichever model finishes first with quality, not just speed.
- Combinator + arbiter. A combinator merges parallel responses into a unified answer. An arbiter (a different LLM, often GPT for its analytical scoring strength) evaluates quality and picks the best output. The digital employee's response is the best of N, not a single model's attempt.
- Platform-level guardrails wrap everything. Approval gates, audit logging, tenant scoping, and budget caps are enforced by your platform layer around whichever LLM(s) ran inside. The LLM provides the intelligence; the platform provides the control.
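The parallel-dispatch pattern needs no SDK; plain asyncio is enough. A sketch of the two-phase timeout (function name and timeout values are illustrative):

```python
import asyncio

async def dispatch_with_deadline(calls, first_timeout=10.0, grace=3.0):
    """Fire all provider calls at once; wait for the first finisher, then
    give the stragglers a short grace period before abandoning them."""
    tasks = [asyncio.create_task(c) for c in calls]
    done, pending = await asyncio.wait(
        tasks, timeout=first_timeout, return_when=asyncio.FIRST_COMPLETED
    )
    if done and pending:
        # Phase two: a leader finished, so stragglers get only `grace` more seconds.
        late, pending = await asyncio.wait(pending, timeout=grace)
        done |= late
    for t in pending:  # anything past the deadline (or all, if nothing finished) is cancelled
        t.cancel()
    return [t.result() for t in done if t.exception() is None]
```

The combinator/arbiter stage then runs over whatever this returns, so a slow or failing provider degrades the answer pool instead of blocking the turn.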
Why this is better than picking one SDK
- No model lock-in. Claude is best at tool chains today; GPT-6 might be best next quarter. Swapping a role's model is a registry change, not a rewrite.
- Best-of-breed per role. The HR Payroll Analyst gets Claude's tool-chaining strength. The Product Designer gets Gemini's vision. The Review Agent gets GPT's analytical scoring. No role is stuck with a model that's wrong for its job.
- Cost optimization. Route cheap queries to Gemini Flash ($2.50/1M) or Grok-fast, reserve Opus ($75/1M) for genuinely hard analysis. Per-tenant `tier_model` routing still works — just at a finer grain.
- Resilience. If one provider has an outage or rate-limits you, the triager routes to alternatives. No single point of model failure.
- You don't need vendor agent SDKs at all. You can call each provider's API directly (`openai`, `anthropic`, `google-genai`, xAI via OpenAI-compatible endpoint) using custom asyncio dispatch. The "agent loop" is your code, not a vendor's framework. If you do want an SDK's conveniences (Claude's `PreToolUse` hooks, OpenAI's handoff model, ADK's A2A), you can adopt them selectively per-role — but the platform architecture doesn't depend on any of them.
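The registry seam can be this small. A sketch, with stand-in adapter callables where a real deployment would wrap each vendor's API client (all names here are illustrative):

```python
# Task type -> provider/model. Swapping a role's model is editing this dict.
LLM_REGISTRY = {
    "triager":             {"provider": "gemini", "model": "gemini-2.5-flash"},
    "structured_analysis": {"provider": "claude", "model": "claude-opus-4-7"},
    "comparison_arbiter":  {"provider": "openai", "model": "gpt-5.4"},
    "fast_chat":           {"provider": "grok",   "model": "grok-4-1-fast"},
}

ADAPTERS: dict = {}  # provider name -> callable(model, messages) -> response

def register_adapter(provider: str, call) -> None:
    """Adding a provider = one registry entry + one register_adapter call."""
    ADAPTERS[provider] = call

def resolve(task_type: str):
    """Look up which adapter and model should handle this task type."""
    cfg = LLM_REGISTRY[task_type]
    return ADAPTERS[cfg["provider"]], cfg["model"]
```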
Plain-English takeaway: Don't pick one LLM — combine all of them. Use Claude for multi-step tool chains and file generation. GPT for analytical review, triage routing, and voice. Gemini for multimodal reasoning and cheap high-volume work. Grok for speed-first interactions. Your platform orchestrates all of them with a triager, parallel dispatch, and a combinator/arbiter. Swapping or adding an LLM provider is a registry entry, not a re-architecture.
Sources (snapshot: April 2026, GA flagships Opus 4.7 / GPT-5.4 / Gemini 3.1 Pro): OpenAI API docs, Google Gemini API docs, Anthropic Claude API docs, xAI Grok API docs, plus 2026 reasoning benchmark roundups (HLE, GPQA Diamond, ARC-AGI, SWE-bench Pro) and TTFT benchmarks from BenchLM/TokenMix. April 22 2026 enterprise-agent launches: OpenAI Workspace Agents, Google Gemini Enterprise / Agentspace, and Microsoft Copilot Studio multi-agent orchestration GA.
Velocity caveat. All three providers shipped a new flagship in the 60 days before this snapshot (Gemini 3.1 Pro Feb, GPT-5.4 Mar, Opus 4.7 Apr 16). Latency, pricing, and benchmark numbers should be re-verified before any commitment is made on the strength of this table alone — model-layer claims age in weeks, not quarters.
Recommendations
Combine all LLMs — don't pick one. Assign each digital employee role to the model that fits its job best: Claude for multi-step tool chains and file deliverables, GPT for triage routing and analytical review, Gemini for multimodal work and cheap high-volume inference, Grok for speed-first interactions, specialty APIs for domain-locked workflows. Your platform layer (triager, parallel dispatch, combinator, arbiter, approval inbox, audit pipeline, billing) orchestrates all of them and doesn't depend on any single vendor's SDK. Anthropic's Managed Agents is a useful sandboxed compute tool within this architecture, not a foundation.
A note on openclaw
openclaw sometimes comes up in conversations about AI agent frameworks. It's a personal AI assistant daemon: local-first, single-host, Markdown-on-disk memory, designed to run as "my AI on my laptop." That's a different problem shape from a multi-tenant platform that orchestrates multiple LLM providers with per-tenant isolation, approval gates, and audit trails. It's a fine tool for what it's designed for — it's just not a candidate for this architecture.
A note on Anthropic's Managed Agents
Anthropic offers Managed Agents, a hosted runtime where Anthropic runs the agent loop for you. In a multi-LLM platform architecture, it's not a foundation — for the same reasons no single vendor's hosted runtime should be:
- You lose loop transparency. The platform-level guardrails this doc describes (approval gates, audit hooks, tenant scoping) require inserting custom logic between every tool call. A hosted runtime controls the loop on the vendor's side — you configure it, but you don't own it.
- You lose model routing. The hosted runtime decides which model runs your turn. A multi-LLM platform needs to route each role to a different provider — that routing must live in your code, not in Anthropic's infrastructure.
- You lose portability. The point of the multi-LLM architecture is that swapping a role's provider is a registry change. A dependency on any single vendor's hosted runtime undermines that.
Where Managed Agents does earn its keep: as a sandboxed compute tool called from inside a role's session. When a digital employee needs to execute arbitrary code — the Finance Controller reconciling CSVs, the HR Analyst computing gross-to-net payroll — Managed Agents' sandbox is a solid "code interpreter as a service" primitive. It runs Python in isolation, no network, no access to tenant data except what you hand in. Similarly, OpenAI's Code Interpreter serves the same function for GPT-powered roles. Use these as tools (see §6); don't use either as the platform foundation.
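The "sandbox as a tool" shape can be sketched like this. `sandbox_execute` is a placeholder for whichever client call fronts the hosted sandbox (Managed Agents or Code Interpreter); its signature here is an assumption, not either vendor's actual API:

```python
import json

def make_sandbox_tool(sandbox_execute):
    """Wrap a hosted sandbox as an in-session tool. `sandbox_execute` is a
    stand-in for the real client call; we inject it so roles stay provider-agnostic."""
    async def run_python(args: dict) -> dict:
        # Hand in only the code and its explicit inputs, never tenant credentials.
        result = await sandbox_execute(code=args["code"], inputs=args.get("inputs", {}))
        return {"content": [{"type": "text", "text": json.dumps(result)}]}
    return run_python
```

Because the sandbox is just another tool, it flows through the same allowlist, approval-gate, and audit hooks as everything else in §4 and §5.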
How to build the platform
The platform is the layer that turns raw LLM APIs into a digital employee service. It handles the things no LLM ships on its own: which customer is this, which role should answer, which model to use for that role, what tools it's allowed to touch, who has to approve before it takes action, and where the audit trail lands. The LLMs provide the intelligence; the platform provides the control.
The running example below uses Claude Agent SDK for the HR Payroll Analyst role (because Claude leads agentic tool-chaining). Other roles in the roster would use different providers — GPT for the Review Agent, Gemini for the Product Designer — but the platform patterns (session management, role packs, approval gates, audit logging, channel adapters) are the same regardless of which LLM runs inside.
A running example
We'll follow a single, concrete request through the whole system:
Jane, the HR manager at Acme Widgets, DMs our HR Payroll Analyst in Slack:
"Prepare the payroll invoice for all employees at Acme Widgets for March 2026."
By the end of this section you'll see every moving part that turns that one sentence into an approved, filed, auditable payroll invoice — and the exact few lines of code that make each part happen.
The code below is Python; the patterns translate to any language. The example uses Claude Agent SDK for the HR Payroll Analyst role — other roles would swap in the appropriate provider's client.
1. Know who's asking, who should answer, and which LLM to use
When Jane's Slack message arrives at our server, the first thing we do is figure out three things:
- Which customer is this? → Acme Widgets (we call this the tenant).
- Which digital employee should handle it? → the HR Payroll Analyst.
- Which LLM provider and model should power this role? → looked up from the `llm_registry` (for HR Payroll Analyst: Claude Opus 4.7, because it leads agentic tool-chaining).
We then spin up a dedicated conversation for that pair. We give it a memorable ID (acme-widgets:hr) so the next time Jane messages — whether from Slack, email, or text — the digital employee picks up exactly where it left off. The model selection comes from two sources: the role's default in the registry (Claude for HR, Gemini for Design, GPT for Support) and the tenant's pricing tier (a $99 plan might get Sonnet instead of Opus; a $49 plan might get Haiku).
```python
from llm_registry import get_role_config

def build_session(tenant: Tenant, role: str) -> dict:
    role_config = get_role_config(role)  # e.g. {"provider": "claude", "model": "claude-opus-4-7", ...}
    model = tenant.tier_override or role_config["model"]  # tenant tier can downgrade
    return {
        "session_id": f"{tenant.id}:{role}",       # "acme-widgets:hr"
        "provider": role_config["provider"],       # "claude" | "openai" | "gemini" | "grok"
        "model": model,                            # "claude-opus-4-7"
        "max_turns": 20,                           # safety cap on back-and-forth
        "max_budget_usd": tenant.per_turn_budget,  # safety cap on spend
        "env": {
            "TENANT_ID": tenant.id,  # tell every tool which customer
            "ROLE": role,
        },
    }
```
2. Give it a job description, a toolbelt, and an LLM
A digital employee isn't just an LLM — it's an LLM plus a written job description plus a specific set of systems it's allowed to touch plus the model that's best at its job. We keep a catalog called ROLE_PACKS that describes each role. Adding a new digital employee is adding one entry to this dictionary — including which LLM provider powers it.
For our example, the HR Payroll Analyst gets:
- a job description that says things like "you prepare payroll invoices, you answer benefits questions, you never send money without explicit approval"
- access to the HRIS (where employees and salaries live), payroll software, Slack and email for communicating, and the tenant's own database
- no access to, say, Figma or Salesforce — those belong to other digital employees
- Claude Opus 4.7 as its LLM — because multi-step tool chains are Claude's strength
The Product Designer, by contrast, gets Gemini 3.1 Pro (multimodal vision), and the Customer Support Lead gets GPT-5.4 (triage/handoff patterns).
```python
ROLE_PACKS = {
    "hr_payroll_analyst": {
        "provider": "claude",        # which LLM family
        "model": "claude-opus-4-7",  # default model for this role
        "job_description": open("prompts/hr_payroll.md").read(),
        "can_use": ["hris", "payroll", "tenant_db", "slack", "email"],
        "allowed_tools": [
            "mcp__hris__list_employees",
            "mcp__hris__get_salary",
            "mcp__payroll__draft_invoice",
            "mcp__payroll__submit_invoice",  # this one needs approval!
            "mcp__slack__send_message",
            "mcp__email__send",
        ],
    },
    "finance_controller": {
        "provider": "claude",
        "model": "claude-sonnet-4-6",
        "job_description": open("prompts/finance.md").read(),
        "can_use": ["quickbooks", "stripe", "tenant_db", "sandbox"],
        "allowed_tools": ["mcp__quickbooks__*", "mcp__stripe__read_*", ...],
    },
    "product_designer": {
        "provider": "gemini",  # Gemini for multimodal vision
        "model": "gemini-3.1-pro",
        "job_description": open("prompts/design.md").read(),
        "can_use": ["figma", "linear", "slack"],
        "allowed_tools": ["mcp__figma__*", "mcp__linear__*", ...],
    },
    "customer_support_lead": {
        "provider": "openai",  # GPT for triage/handoff
        "model": "gpt-5.4",
        "job_description": open("prompts/support.md").read(),
        "can_use": ["zendesk", "slack", "tenant_db"],
        "allowed_tools": ["mcp__zendesk__*", "mcp__slack__*", ...],
    },
}
```
3. Connect the digital employee to the real world
The AI can't "just look up Acme's employees" — it has to call a real system. The industry-standard plug for doing that is called MCP (Model Context Protocol). You can picture each MCP server as a little adapter box: "this one plugs into Slack", "this one plugs into QuickBooks", "this one plugs into Acme's HRIS". Some of these adapters are off-the-shelf; others we write ourselves for things specific to our SaaS — like a safe way to read Acme's own database without ever letting a query leak across tenants.
For Jane's payroll request, the HR Payroll Analyst will:
- call `mcp__hris__list_employees` → "who worked at Acme Widgets in March?"
- call `mcp__hris__get_salary` for each one
- call `mcp__payroll__draft_invoice` → builds an unsigned draft
- (pause here — see step 4)
- call `mcp__payroll__submit_invoice` → files the invoice (only after human approval)
```python
from claude_agent_sdk import tool, create_sdk_mcp_server
import os, json

@tool("list_employees",
      "List all employees at the caller's company for a given month",
      {"month": str})  # e.g. "2026-03"
async def list_employees(args: dict) -> dict:
    tenant_id = os.environ["TENANT_ID"]  # "acme-widgets"
    rows = await hris.list_active(tenant_id, month=args["month"])
    return {"content": [{"type": "text", "text": json.dumps(rows)}]}

hris_server = create_sdk_mcp_server(
    name="hris", version="1.0.0", tools=[list_employees, ...],
)

CONNECTORS = {
    "hris": hris_server,                                          # our own code
    "payroll": {"type": "http", "url": "https://mcp.gusto/ddr"},  # vendor
    "slack": {"type": "stdio", "command": "mcp-slack"},           # vendor
    "email": {"type": "stdio", "command": "mcp-sendgrid"},
    "tenant_db": tenant_db_server,
    # ...Teams, SMS, QuickBooks, Stripe, Figma, Linear, Salesforce
}
```
4. Stop before it does anything irreversible — ask a human
This is the single most important part of the platform — and it works the same way regardless of which LLM provider powers the role. The approval gate, audit logging, and tenant scoping are enforced by your platform layer, not by any vendor's SDK. It's also where the promise from the Introduction — "'cannot' is enforced by the harness, not by asking the model nicely" — becomes concrete code. The six harness-level mechanisms that make a digital employee's limits real:
- Tool allowlist — only tools in the role's `allowed_tools` list can be called at all. No Figma tool wired into the HR session means no Figma call, period.
- Write-operation approval gate — every tool matching a write pattern is paused by a `PreToolUse` hook that returns allow/deny/ask based on a human's click, not the model's judgment (see the code block below).
- Tenant scoping — tools read `TENANT_ID` from the session environment, not from the model's arguments. The model cannot ask to see Contoso's data from inside Acme's session.
- Budget and turn caps — `max_budget_usd` and `max_turns` in the session options halt the loop before a misbehaving role can bankrupt a tenant.
- Immutable job description — the system prompt is owned by the platform, not by tenant users or the model itself. It's assembled server-side at session-start and isn't exposed to the tenant's input channel. Prompt-injection attempts in inbound messages can't rewrite it.
- Audit inevitability — every tool call flows through `PreToolUse` and `PostToolUse` hooks. The employee literally cannot take an action that isn't logged; the log happens before the tool runs, not after (see §5).
Taken together, these guardrails are the difference between "an LLM we've asked to behave" and "a digital employee we can defend in front of a SOC2 auditor." The approval gate is the most visible of the six, so let's walk through it in detail.
Reading is safe: the AI can list Acme's employees all day and no harm is done. Writing is dangerous: actually submitting a payroll invoice means real money leaves a real bank account. So we install a little gatekeeper that runs every time the AI wants to do something. If the action is read-only (look something up), the gatekeeper waves it through. If the action writes, creates, sends, or pays, the gatekeeper pauses the AI in mid-thought, pops a card into Jane's manager's approval inbox, and waits.
In our example:
- HR Payroll Analyst builds the invoice — everything up to `draft_invoice` is read-only and runs freely.
- The AI now wants to call `submit_invoice` ($184,372.55 to Gusto for Acme Widgets, March 2026).
- The gatekeeper sees `submit_invoice` is a write operation. It pushes a card to Acme's CFO: "HR Payroll Analyst wants to submit a $184,372.55 payroll run for March. Approve / Deny."
- The AI's next move is frozen until the CFO clicks something.
- CFO clicks Approve → gatekeeper returns "allow" → invoice is filed.
- CFO clicks Deny → gatekeeper returns "deny" with a reason → the AI reads the reason ("duplicate of last week's run") and tells Jane so.
```python
import os
from fnmatch import fnmatch

from claude_agent_sdk import HookMatcher

# Anything matching these patterns writes, sends, or pays.
WRITE_PATTERNS = [
    "mcp__payroll__submit_*", "mcp__payroll__pay_*",
    "mcp__email__send", "mcp__slack__send_message",
    "mcp__quickbooks__create_*", "mcp__tenant_db__write_*",
]

def is_write(tool_name):
    return any(fnmatch(tool_name, p) for p in WRITE_PATTERNS)

# approval_inbox is the platform's approval service (web app, Slack
# buttons, mobile push) — it blocks until a human decides or times out.
async def approval_gate(input_data, tool_use_id, ctx):
    if not is_write(input_data["tool_name"]):
        return {"hookSpecificOutput": {
            "hookEventName": "PreToolUse", "permissionDecision": "allow",
        }}
    # This is a write. Freeze the AI and ask a human.
    decision = await approval_inbox.request_and_wait(
        tenant_id=os.environ["TENANT_ID"],   # "acme-widgets"
        role=os.environ["ROLE"],             # "hr_payroll_analyst"
        action=input_data["tool_name"],      # "mcp__payroll__submit_invoice"
        details=input_data["tool_input"],    # amount, recipients, period...
        timeout_s=3600,                      # give the CFO an hour
    )
    return {"hookSpecificOutput": {
        "hookEventName": "PreToolUse",
        "permissionDecision": "allow" if decision.approved else "deny",
        "permissionDecisionReason": decision.reason,
    }}

APPROVAL_HOOKS = {"PreToolUse": [HookMatcher(matcher="*", hooks=[approval_gate])]}
```
Plain-English takeaway: the AI cannot spend Acme's money without a human click. That promise is worth everything in an HR / Finance SaaS.
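The whole read/write split hinges on those `fnmatch` patterns, so it's worth seeing how they classify typical MCP tool names. This is a standalone check of the same `is_write` helper (pattern list copied from the gate above):

```python
from fnmatch import fnmatch

WRITE_PATTERNS = [
    "mcp__payroll__submit_*", "mcp__payroll__pay_*",
    "mcp__email__send", "mcp__slack__send_message",
    "mcp__quickbooks__create_*", "mcp__tenant_db__write_*",
]

def is_write(tool_name: str) -> bool:
    # A tool is a write if any pattern matches its full MCP name.
    return any(fnmatch(tool_name, p) for p in WRITE_PATTERNS)

print(is_write("mcp__payroll__list_employees"))  # False -> runs freely
print(is_write("mcp__payroll__submit_invoice"))  # True  -> needs a human click
print(is_write("mcp__email__send"))              # True  -> needs a human click
```

Note the default is "allow": any tool name that matches no pattern runs freely, so a new write-capable connector must be added to the pattern list before it ships.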
5. Write everything down — the log book
Every SMB that buys this eventually needs SOC2, and every SOC2 auditor asks the same question: "show me who did what, when, and whether it was approved." We get that for free by recording both sides of every tool call — what the AI tried to do, and what happened.
For Jane's payroll run, the log book will end up with a tidy paper trail like:
```text
10:31:02 acme-widgets / hr_payroll_analyst read  list_employees(month=2026-03) → 47 rows
10:31:05 acme-widgets / hr_payroll_analyst read  get_salary(employee=E-0012) → $84,200/yr
...
10:31:44 acme-widgets / hr_payroll_analyst WRITE submit_invoice($184,372.55) APPROVED by cfo@acme.com
10:31:47 acme-widgets / hr_payroll_analyst write submit_invoice → invoice_id=INV-99423
```
```python
import os

from claude_agent_sdk import HookMatcher

# audit_log and now() are platform helpers: an append-only store
# and a UTC timestamp function.

async def audit_before(input_data, tool_use_id, ctx):
    await audit_log.write({
        "when": now(), "phase": "before",
        "tenant": os.environ["TENANT_ID"], "role": os.environ["ROLE"],
        "action": input_data["tool_name"], "details": input_data["tool_input"],
    })

async def audit_after(input_data, tool_use_id, ctx):
    await audit_log.write({
        "when": now(), "phase": "after",
        "tenant": os.environ["TENANT_ID"],
        "action": input_data["tool_name"],
        "result": input_data.get("tool_response"),
    })

AUDIT_HOOKS = {
    "PreToolUse":  [HookMatcher(matcher="*", hooks=[audit_before])],
    "PostToolUse": [HookMatcher(matcher="*", hooks=[audit_after])],
}
```
6. Let it do the math in a safe sandbox
Preparing a payroll invoice isn't just database reads — there's real arithmetic: prorating mid-month hires, computing overtime, applying state-specific tax rates, reconciling against last month's run. Rather than teach the AI to do this by hand (risky), we give it a sealed calculator: a disposable Python environment where it can run real numeric code. The code runs inside Anthropic's Managed Agents sandbox — isolated, no network, no access to Acme's data except what we hand in.
```python
from anthropic import AsyncAnthropic  # async client, since the tool handler awaits
from claude_agent_sdk import tool

anthropic = AsyncAnthropic()

@tool("run_in_sandbox",
      "Run trusted Python to do payroll math. Returns stdout.",
      {"code": str, "timeout_s": int})
async def run_in_sandbox(args: dict) -> dict:
    result = await anthropic.beta.agents.runs.create(
        agent_id="code_interpreter",
        input=args["code"],
        timeout=args.get("timeout_s", 60),
    )
    return {"content": [{"type": "text", "text": result.output.text}]}
```
The HR Payroll Analyst uses this when it needs to say things like "compute gross-to-net for these 47 employees, apply the March bonus schedule, group by department, and give me a total."
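The code the model submits to the sandbox is ordinary Python. A hypothetical gross-to-net snippet — flat illustrative tax rate and made-up employees, since real withholding is per-state and per-bracket — might look like:

```python
from datetime import date

# Hypothetical inputs the platform would hand into the sandbox.
employees = [
    {"id": "E-0012", "annual_salary": 84_200, "start": date(2024, 6, 1)},
    {"id": "E-0031", "annual_salary": 60_000, "start": date(2026, 3, 16)},  # mid-month hire
]
PERIOD_START, PERIOD_END = date(2026, 3, 1), date(2026, 3, 31)
FLAT_TAX = 0.28  # illustrative only — not a real withholding schedule

def gross_for(e: dict) -> float:
    monthly = e["annual_salary"] / 12
    if e["start"] <= PERIOD_START:
        return monthly
    # Prorate a mid-month hire by days worked in the period.
    days_in_month = (PERIOD_END - PERIOD_START).days + 1
    days_worked = (PERIOD_END - e["start"]).days + 1
    return monthly * days_worked / days_in_month

gross = sum(gross_for(e) for e in employees)
net = gross * (1 - FLAT_TAX)
print(f"gross={gross:.2f} net={net:.2f}")
```

Because this runs in the sealed interpreter, a bug here produces a wrong number the human reviewer can catch at the approval gate — it never touches Acme's systems directly.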
7. Stitch it together — one function answers Jane
Here's the whole payroll request, end to end. Every inbound message — Slack DM, Teams mention, email, SMS — funnels through this same function. The platform resolves the tenant, picks the role, looks up which LLM provider that role uses, dispatches to the right client, and wraps everything in the approval gate and audit hooks. The reply goes back on whichever channel Jane used.
```python
from llm_clients import get_client  # returns Claude/OpenAI/Gemini/Grok client by provider

async def handle_inbound(msg: InboundMessage) -> None:
    # 1. Which customer? Which digital employee? Which LLM?
    tenant  = await tenants.resolve(msg.workspace_id)    # Acme Widgets
    role    = await routing.pick_role(tenant, msg.text)  # "hr_payroll_analyst"
    session = build_session(tenant, role)                # includes provider + model

    # 2. Get the right LLM client for this role's provider.
    pack = ROLE_PACKS[role]
    client = get_client(
        provider=session["provider"],  # "claude" | "openai" | "gemini" | "grok"
        model=session["model"],        # e.g. "claude-opus-4-7"
        system_prompt=pack["job_description"],
        tools=pack["allowed_tools"],
        env=session["env"],
    )

    # 3. Wrap with platform guardrails (same for every provider).
    client = apply_approval_gate(client, session)  # pre-tool write check
    client = apply_audit_hooks(client, session)    # pre/post-tool logging

    # 4. Run the turn. Stream the reply back to the same channel.
    async for chunk in client.run(msg.text):
        await channels.reply(msg, chunk)
```
What Jane actually sees in Slack:
HR Payroll Analyst · 10:31
Drafting March 2026 payroll for Acme Widgets... I found 47 active employees. Total gross is $184,372.55. I've sent a request to Michael (CFO) to approve submission to Gusto.

HR Payroll Analyst · 10:42
Michael approved. Invoice `INV-99423` filed with Gusto. I emailed the payroll summary to finance@acme-widgets.com. Anything else?
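The `get_client` helper imported in the handler is platform code, not a vendor SDK. A minimal sketch of such a registry — hypothetical `LLMClient` protocol, with real adapters assumed to live elsewhere — might be:

```python
from typing import AsyncIterator, Protocol

class LLMClient(Protocol):
    """What the platform expects every provider adapter to expose."""
    def run(self, text: str) -> AsyncIterator[str]: ...

# Registry: provider name -> adapter class. Each adapter wraps one
# vendor SDK behind the shared LLMClient shape.
_REGISTRY: dict[str, type] = {}

def register(provider: str):
    def deco(cls):
        _REGISTRY[provider] = cls
        return cls
    return deco

def get_client(provider: str, **kwargs) -> LLMClient:
    try:
        return _REGISTRY[provider](**kwargs)
    except KeyError:
        raise ValueError(f"no adapter registered for provider {provider!r}")

@register("claude")
class ClaudeAdapter:
    def __init__(self, **kwargs):
        self.kwargs = kwargs            # model, system_prompt, tools, env...
    async def run(self, text):
        yield f"[claude stub] {text}"   # a real adapter would call the Anthropic SDK
```

Swapping a role from Claude to Gemini is then a one-line registry lookup change in the role's session config — exactly the "registry change, not a rewrite" property the conclusion leans on.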
What we still have to build ourselves
The LLM APIs give us intelligence. The platform patterns above (session management, role packs, approval gates, audit hooks) give us structure. The parts below are what turn it into a product — and they're the reason time-to-MVP is "medium" instead of "fast":
Read this list through the competitive lens. Every item below is something OpenAI Workspace Agents and Google Gemini Enterprise ship as a built-in for their tenants. The LLMs give us the brains; everything on this list is our competitive moat against the hyperscalers (data control, multi-tenant isolation, per-tenant economics, white-label) — or our gap, if we don't build it well.
- The LLM registry and dispatch layer — the triager that classifies tasks, the parallel dispatch that fires to the right provider(s), the combinator/arbiter that merges and scores responses.
- The approval inbox that Michael the CFO actually clicks in (web app, Slack buttons, mobile push).
- The customer registry — tenants, users, roles, what plan they're on, which integrations they've connected, which LLM tier they're paying for.
- The credential vault — Acme's HRIS token must never leak into a session serving a different customer. Each provider's API key is managed per-tenant or per-platform, never exposed to the model.
- The channel adapters — Slack, Teams, email, SMS, both inbound (webhooks) and outbound (replies).
- The billing meter — we read each turn's token usage across all providers and bill Acme's subscription accordingly. Different providers have different pricing; the meter normalizes.
- The connector catalog — adding a new MCP integration (say, Workday) should be a one-day task, not a rewrite. Because MCP is shared across all providers, a connector works with every role regardless of its LLM.
- The SOC2 plumbing around the log book: retention, tamper-evidence, export for auditors.
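One piece of that SOC2 plumbing — tamper-evidence — is commonly done with a hash chain: each log record stores the hash of the record before it, so any retroactive edit breaks every subsequent hash. A minimal sketch (illustrative, not a production implementation):

```python
import hashlib
import json

def chain_hash(prev_hash: str, record: dict) -> str:
    # Hash the previous hash together with a canonical encoding of the record.
    payload = prev_hash + json.dumps(record, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def append(log: list, record: dict) -> None:
    prev = log[-1]["hash"] if log else "genesis"
    log.append({"record": record, "hash": chain_hash(prev, record)})

def verify(log: list) -> bool:
    prev = "genesis"
    for entry in log:
        if entry["hash"] != chain_hash(prev, entry["record"]):
            return False  # someone edited history
        prev = entry["hash"]
    return True

log: list = []
append(log, {"action": "submit_invoice", "approved_by": "cfo@acme.com"})
append(log, {"action": "notify_finance"})
print(verify(log))   # True
log[0]["record"]["approved_by"] = "attacker@evil.com"
print(verify(log))   # False — the chain detects the edit
```

An auditor only needs the final hash (anchored somewhere the platform can't rewrite, such as a WORM bucket) to confirm the exported log is the one that was actually written.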
That list is the actual product. The multi-LLM architecture is what makes each role best-in-class; the platform layer is what makes it a service.
Conclusion
This document started with a question: when should an organization build its own virtual digital employee service, and how?
The answer to "when" is narrower than it was a year ago. As of April 2026, OpenAI, Google, and Microsoft all ship enterprise-agent products that cover the majority of buyers — organizations already on their platforms, comfortable with their data policies, and happy to use a vendor-branded agent. For those buyers, building from scratch is the wrong answer. The build path is justified only when your organization needs full control over data residency, model routing, agent behavior, and audit trails — or when your business model requires multi-tenant white-label architecture the hyperscalers can't provide.
The answer to "how" is: don't pick one LLM — combine all of them. Claude for multi-step tool chains and file generation. GPT for analytical review, triage routing, and voice. Gemini for multimodal reasoning and cheap high-volume work. Grok for speed-first interactions. Each digital employee role gets the model that's best at its job, selected from a central LLM registry and dispatched by your platform layer. The platform — not any vendor's SDK — owns the harness: tenant routing, approval gates, audit logging, billing, and channel adapters.
Three things make this architecture work:
- The platform layer is LLM-agnostic. Approval gates, audit hooks, tenant scoping, and budget caps wrap around whichever model runs inside. Swapping a role's LLM is a registry change, not a rewrite.
- MCP is the shared integration protocol. A connector you build for your HRIS works with Claude, GPT, Gemini, and Grok without modification. The connector catalog grows once and serves every role.
- The harness enforces "cannot" at the framework level. Tool allowlists, write-operation approval gates, immutable job descriptions, and audit inevitability are architectural constraints, not polite requests in a system prompt. That's the difference between "an LLM we've asked to behave" and "a digital employee we can defend in front of a SOC2 auditor."
The LLMs will keep getting better, cheaper, and faster — model-layer claims age in weeks, not quarters. What won't change is the need for a platform that controls who the customer is, which job the AI is doing, what it's allowed to touch, and who has to approve before it acts. Build that platform well, and the models underneath become interchangeable parts. Build it poorly, and no model — however capable — will earn the trust of a CFO who's about to let an AI submit a payroll invoice.

