Ahmad Ra'fat

Posted on Jun 23

Multi-Tenancy Is the Real Agent Platform Problem

#ai #agents #multiplatform #architecture

In my experience building agent platforms, most agent demos work because there is only one tenant. One user, one memory store, one tool set, one trace, one notebook, and one happy path. Nothing to keep apart.

Then teams turn the demo into a platform, and prompts stop being the hard part. The hard question becomes a boring one: can every database query, cache key, stream, tool call, trace, and memory lookup prove which tenant it belongs to? If even one of them cannot, you have a leak waiting to happen.

I keep seeing agent-platform conversations start in the wrong place. They open on model choice, orchestration style, or memory quality, and only later get around to the question of whether one tenant's data, tools, costs, and traces stay away from another's.

That boundary is not a hardening task you bolt on at the end. It is the shape of the platform, and you pay for it later if you pretend otherwise.

So here is the mechanism I would look for in a real agent platform: a typed request context carried into the graph, scoped access at every boundary, and tests that make a tenant leak boring to catch before it becomes an incident.

The Demo Version Hides the Problem

A single-user agent can hold state in memory and still look impressive. It calls a search tool with no tenant filter. It stores chat history under a bare thread_id. It writes traces with raw prompts in them, because nobody else is looking.

But this seems fine, right?

Yes, but only for a demo. Not for a platform.

What changes is the unit of correctness. The agent is no longer just answering "what should I do next?" It is also carrying a boundary through every step it takes. Drop that boundary, and the model can still hand back a perfectly good answer, and you have still failed, because the answer went to the wrong tenant.

If you have worked on multi-tenant platforms before, the rule will sound familiar: every operation that touches data, tools, memory, cost, traces, or events must be scoped by tenant before the model gets a chance to improvise.

Multi-Tenant Design Rules for Agent Platforms

These are not agent-specific ideas. This is normal backend security applied to an agent runtime. The table below translates the multi-tenant and authorization guidelines we already know into the places an agent platform leaks: memory, tools, retrieval, streams, and traces.

Concept	In a multi-tenant platform	In an agent platform
Tenant isolation (source)	Tenant context, cross-tenant access, cache/session isolation, database isolation, file isolation, logging, and audit all need explicit controls.	Agent memory, tools, streams, and traces are tenant-isolation surfaces too.
Retrieval authorization (source)	RAG systems need access-control metadata, retrieval-time authorization, deletion propagation, and tests for cross-tenant retrieval.	Vector filtering is not optional platform glue. It is part of authorization.
Least-privilege tools (source)	Agents need least-privilege tools, explicit authorization for sensitive actions, memory isolation, and safe tool design.	Tool catalogs should be filtered before the model can call them.
Object-level authorization (source)	Every endpoint that acts on an object ID needs an authorization check for that object.	Every tool call, memory lookup, vector search, file read, and background job needs the same object-level check.
Function-level authorization (source)	Function access should be denied by default and granted explicitly by role or policy.	Do not expose every tool to every agent. Scope tools by tenant, role, domain, and action.
No implicit trust (source)	Access decisions should be explicit, per request, and policy-driven.	The graph runtime should not inherit trust from "inside the service." Nodes and tools still need context-aware checks.
Application-boundary enforcement (source)	Cloud-native applications need granular application access controls, service identity, and policy enforcement near application boundaries.	Agent gateways, graph runtimes, and tool services should each enforce policy at their own boundary.
Attribute-based policy (source)	ABAC decisions use subject, object, action, and environment attributes.	Tenant, user, role, tool, target resource, and runtime context belong in the policy decision.
Structural guardrails (source)	Use the simplest agentic shape that works, design tools carefully, and add guardrails where the agent can compound errors.	Applied to tenancy, this means enforcement belongs around tools and data access, not inside a prompt asking the model to behave.
Trace correlation (source)	Context travels across service and process boundaries so telemetry can be correlated.	Trace IDs are not enough. Tenant-safe attributes must be attached deliberately and kept out of untrusted downstream calls.

The agent view says the same thing from another angle, though let me be precise about it: Anthropic is not publishing a multi-tenancy standard. I am taking its tool design and guardrail guidance and pointing it at a tenancy problem.

The principle that carries over is to push enforcement to the most structural layer you have. A prompt is the softest layer there is. A scoped repository, a policy check, or a tenant-aware tool wrapper is harder to bypass.

Where Tenant Context Has to Travel

The first and most common mistake I have seen is treating tenant_id as an HTTP-layer concern. It starts there, but it cannot stay there. In an agent platform, the agent graph is a second execution environment, the tools are a third, and the context has to survive every hop between them.

Read this diagram as three things moving together. None of them moves on its own:

Context: identity, tenant, role, locale, and trace metadata travel from the API layer down into the graph and the tools.
Boundary: every hop is a chance to drop the scope, infer it from the wrong place, or let a caller quietly override it.
Invariant: nothing touches storage, retrieval, cache, a stream, or a trace write until the tenant context is already resolved.

Look at the diagram for a moment. The important path is RequestContext through Agent service, Graph state, Node execution, and Tool wrapper. That is the tenant boundary. Storage, vector search, cache, stream events, and trace attributes should only see a request after that boundary is already present. I would design that path first, before I write a single prompt.

The Mechanism I Would Build

I would start with what not to do: I would not thread tenant_id, user_id, role, locale, and trace_id through every function as five loose parameters. That turns isolation into a copy-paste exercise, and copy-paste is exactly where we forget one.

Instead, I would use a single explicit request context object, and then make every boundary either accept it or fail closed. No third option.

Start With a Typed Context

This code snippet is illustrative. It shows the shape I mean.

from dataclasses import dataclass
from typing import Literal

Role = Literal["owner", "admin", "member", "viewer"]

@dataclass(frozen=True)
class RequestContext:
    # These values should come from trusted auth and membership lookup,
    # not from user-supplied request fields.
    tenant_id: str
    user_id: str
    role: Role
    locale: str
    trace_id: str

    def require_tenant(self) -> str:
        # Missing tenant context is a platform bug. Fail before any model,
        # storage, or tool boundary can run unscoped.
        if not self.tenant_id:
            raise PermissionError("missing tenant context")
        return self.tenant_id

I know it is boring, but that boring little require_tenant function is the whole point. If a background job, a tool, or a graph node tries to run without tenant context, the platform stops right there. Missing scope is not something the model can recover from by trying harder. It is a bug in the platform, and it should fail like one.
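
The comments in the dataclass say the values should come from trusted auth and a membership lookup. Here is a minimal sketch of what that construction step could look like. The `MEMBERSHIPS` table, the `build_request_context` helper, and the claim names `sub` and `locale` are assumptions made for illustration, and the claims are assumed to come from a token that has already been verified.

```python
from dataclasses import dataclass
from typing import Literal, Mapping

Role = Literal["owner", "admin", "member", "viewer"]


@dataclass(frozen=True)
class RequestContext:
    tenant_id: str
    user_id: str
    role: Role
    locale: str
    trace_id: str


# Hypothetical membership table: (user_id, tenant_id) -> role.
# A real platform would back this with a database or identity provider.
MEMBERSHIPS: dict[tuple[str, str], Role] = {
    ("user-a", "tenant-a"): "admin",
}


def build_request_context(
    claims: Mapping[str, str],
    requested_tenant: str,
    trace_id: str,
) -> RequestContext:
    # `claims` is assumed to come from an already-verified token.
    user_id = claims.get("sub", "")
    if not user_id:
        raise PermissionError("unauthenticated request")

    # The requested tenant is only a hint from the caller.
    # The membership lookup is what decides tenant and role.
    role = MEMBERSHIPS.get((user_id, requested_tenant))
    if role is None:
        raise PermissionError("user is not a member of this tenant")

    return RequestContext(
        tenant_id=requested_tenant,
        user_id=user_id,
        role=role,
        locale=claims.get("locale", "en"),
        trace_id=trace_id,
    )
```

The point of the shape is that a caller can ask for any tenant it likes, but the role never comes from the request, and a tenant without a membership row never produces a context at all.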

Put Context Into the Graph Boundary

If you are building an agent platform, you will have flows or graphs. A useful rule: the graph should pull context from runtime metadata or state, not from the user prompt.

You should never ask the model to remember the tenant for you. The model can reason over data it is allowed to see, but it should never be the decision maker on who owns that data.

Check this illustrative snippet.

async def run_agent(message: str, ctx: RequestContext) -> AgentResult:
    # Keep tenant context outside the user prompt. The model can use allowed
    # data, but it should not enforce ownership.
    state = {
        "messages": [{"role": "user", "content": message}],
        "request_context": ctx,
    }

    return await graph.ainvoke(
        state,
        config={
            "configurable": {
                # Scope the thread so memory/checkpoint lookup cannot
                # collide across tenants.
                "thread_id": f"{ctx.tenant_id}:{ctx.user_id}",
            },
            "metadata": {
                # Metadata is for observability, not authorization.
                "tenant_id": ctx.tenant_id,
                "trace_id": ctx.trace_id,
            },
        },
    )
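
On the node side, the same rule can be sketched as a small helper that every node calls first. This is a framework-agnostic sketch that treats graph state as a plain mapping. The `context_from_state` and `load_memory_node` names and the returned `memory_namespace` key are hypothetical, and the `RequestContext` here is a trimmed copy so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RequestContext:
    # Trimmed copy of the context above, kept small for the example.
    tenant_id: str
    user_id: str

    def require_tenant(self) -> str:
        if not self.tenant_id:
            raise PermissionError("missing tenant context")
        return self.tenant_id


def context_from_state(state: Mapping[str, Any]) -> RequestContext:
    # Read the context from trusted state only, never from message content.
    ctx = state.get("request_context")
    if not isinstance(ctx, RequestContext):
        raise PermissionError("graph node ran without request context")
    ctx.require_tenant()
    return ctx


def load_memory_node(state: Mapping[str, Any]) -> dict[str, str]:
    # Hypothetical node: resolve the scope first, then do the work.
    ctx = context_from_state(state)
    return {"memory_namespace": f"tenant:{ctx.tenant_id}:user:{ctx.user_id}"}
```

A user message that says "I am tenant-b" changes nothing here, because the node never reads the messages to decide scope.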

Be careful when it comes to observability context.

OpenTelemetry baggage can carry account or user identifiers downstream, but its own docs warn that baggage can cross service boundaries and should not contain credentials, API keys, or PII. I would keep tenant IDs opaque and scrub propagation before calling external services.

So you should not let baggage, trace attributes, user headers, or model-visible state be the source of authorization. Treat any incoming propagation header as untrusted until your own auth and membership lookup rebuild the request context from scratch. W3C Baggage has no integrity protection either, making it a carrier of correlation data rather than proof of anything.

The split I keep coming back to is simple. Authorization context comes from trusted auth claims and a membership lookup. Observability context is there to help you debug a request after the decision has already been made. Attaching a safe, opaque tenant tag to a trace so an operator can find a broken flow is fine. Letting that same trace tag decide which documents a tool can read is not, and the gap between those two is where engineers get burned.

Watch the propagation scope too. Internal service-to-service calls may genuinely need tenant-safe correlation metadata. A third-party model call rarely needs your tenant identifier riding along in baggage. If you do need correlation out there, use a request ID that cannot be traced back to a customer or a workspace.
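
One way to sketch both halves of that advice: derive an opaque tenant tag for internal telemetry, and strip propagation headers before a third-party call. The function names and the `x-request-id` header are assumptions for illustration. `traceparent`, `tracestate`, and `baggage` are the W3C header names, and whether you drop all three or only `baggage` is a judgment call for your own stack.

```python
import hashlib
import hmac
import uuid


def opaque_tenant_tag(tenant_id: str, secret: bytes) -> str:
    # A keyed hash gives operators a stable tag to search for without
    # putting the real tenant identifier into shared telemetry.
    digest = hmac.new(secret, tenant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


# W3C Trace Context and Baggage header names, compared case-insensitively.
_PROPAGATION_HEADERS = frozenset({"traceparent", "tracestate", "baggage"})


def headers_for_third_party(headers: dict[str, str]) -> dict[str, str]:
    # Drop inbound propagation headers so tenant-linked context does not
    # ride along to an external service.
    outbound = {
        name: value
        for name, value in headers.items()
        if name.lower() not in _PROPAGATION_HEADERS
    }
    # Attach a fresh random ID that cannot be mapped back to a customer.
    outbound["x-request-id"] = uuid.uuid4().hex
    return outbound
```

The tag is for finding a broken flow in your own traces. It is still not an input to any authorization decision.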

Make Storage APIs Tenant-Scoped by Construction

This is where I want the structural layer to earn its keep.

The query API should make the safe path the short one and the unsafe path the awkward one.

People usually follow the path of least resistance, so make the path of least resistance correct.

class MemoryStore:
    async def list_facts(self, ctx: RequestContext, subject_id: str) -> list[Fact]:
        # Force callers through the tenant-aware API before building a query.
        tenant_id = ctx.require_tenant()
        rows = await db.fetch_all(
            """
            select id, subject_id, fact, source
            from memory_facts
            where tenant_id = :tenant_id
              and subject_id = :subject_id
            order by created_at desc
            """,
            {"tenant_id": tenant_id, "subject_id": subject_id},
        )
        return [Fact.from_row(row) for row in rows]

I do not like APIs like list_facts(subject_id) in a multi-tenant platform. They ask every caller to remember the boundary. That is what fails once the same rule has to be repeated across too many call sites.

The better move is to make the scoped path the only normal path. Pass a RequestContext, a tenant-scoped repository, or a tenant-scoped DB session into the storage layer, and hide the raw unscoped query behind a small internal API that only migrations, repair jobs, and reviewed admin tooling ever touch.

Check this illustrative snippet showing the API I am pointing at:

class TenantMemoryStore:
    def __init__(self, db: Database, tenant_id: str):
        # The tenant is captured once, when the scoped repository is created.
        # Callers cannot forget it on each method call.
        self._db = db
        self._tenant_id = tenant_id

    @classmethod
    def from_context(cls, db: Database, ctx: RequestContext) -> "TenantMemoryStore":
        return cls(db=db, tenant_id=ctx.require_tenant())

    async def list_facts(self, subject_id: str) -> list[Fact]:
        # No tenant argument here. The repository is already scoped.
        rows = await self._db.fetch_all(
            """
            select id, subject_id, fact, source
            from memory_facts
            where tenant_id = :tenant_id
              and subject_id = :subject_id
            order by created_at desc
            """,
            {"tenant_id": self._tenant_id, "subject_id": subject_id},
        )
        return [Fact.from_row(row) for row in rows]

Check this flow showing what I mean:

You may say the payoff here is small. I would agree. But it matters: this shape makes sure the graph node gets handed a repository that is already scoped. It can ask for facts by subject all day long, and there is no spelling of that request that turns into a cross-tenant read.

This is not just a security win. It makes tests cleaner too. A cross-tenant test can call the same public repository method the graph calls, instead of poking around trying to confirm that every caller remembered to tack on tenant_id by hand.

Wrap Tools With Policy Before the Model Sees Them

Another layer we should treat with care is the tools given to the model.

Tools are the place where an agent platform quietly turns into an authorization system, whether you planned for that or not. The model asks for an action. The platform decides whether that action even exists for this tenant and this role.

I would model that as an ABAC decision rather than a role check sprinkled through the codebase wherever someone happened to think of it. The policy input wants the subject, the tenant, the tool or action, the target resource, and the environment.

Implement it with OPA, Cedar (for example through Amazon Verified Permissions), OpenFGA, or a plain in-process policy module, whatever fits your stack. None of those choices makes the decision point itself optional.

Check this illustrative example:

class ToolGate:
    def __init__(self, policy: PolicyEngine, registry: ToolRegistry):
        self.policy = policy
        self.registry = registry

    async def call(self, ctx: RequestContext, tool_name: str, args: dict) -> ToolResult:
        tool = self.registry.get(tool_name)

        # Decide before invocation, so the model cannot bypass tenancy by
        # selecting a more powerful tool.
        decision = await self.policy.can_call_tool(
            tenant_id=ctx.tenant_id,
            user_id=ctx.user_id,
            role=ctx.role,
            tool_name=tool_name,
            args=args,
        )

        if not decision.allowed:
            # Return a recoverable denial instead of leaking tool internals.
            return ToolResult.permission_denied(reason=decision.reason)

        return await tool.invoke(ctx=ctx, args=args)

This is the cleanest match to the point of the whole article. The prompt is allowed to describe the behavior you want. The tool gate is what enforces it. And when the two disagree, which they eventually will, code wins.
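
The gate above leaves the policy engine abstract. For the "plain in-process policy module" option, a minimal sketch could look like the following. The `TOOL_ROLES` table, the tool names, and the `resource_tenant_id` argument are hypothetical, the function is synchronous for brevity, and environment attributes are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


# Hypothetical rule table: tool name -> roles allowed to call it.
# Tools that are not listed are denied by default.
TOOL_ROLES: dict[str, frozenset[str]] = {
    "search_documents": frozenset({"owner", "admin", "member", "viewer"}),
    "delete_document": frozenset({"owner", "admin"}),
}


def can_call_tool(
    *,
    tenant_id: str,
    role: str,
    tool_name: str,
    resource_tenant_id: str,
) -> Decision:
    # Subject, action, and resource attributes all feed the decision.
    if not tenant_id:
        return Decision(False, "missing tenant context")

    allowed_roles = TOOL_ROLES.get(tool_name)
    if allowed_roles is None:
        return Decision(False, "unknown tool")
    if role not in allowed_roles:
        return Decision(False, "role not permitted for this tool")

    # Object-level check: the target must belong to the caller's tenant.
    if resource_tenant_id != tenant_id:
        return Decision(False, "resource belongs to another tenant")

    return Decision(True, "ok")
```

The deny-by-default lookup matters more than the rule format. A tool nobody registered a rule for should not become callable just because the model guessed its name.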

What Engineers Forget

Some bugs are obvious, like a DB query missing where tenant_id = .... That one is at least visible in code review.

The bugs that actually leak in production are the quiet ones, hiding in the side channels around the agent that nobody thinks of as "the data layer."

Surface	Common mistake	Safer invariant	Risk shape
Redis keys	`chat:{thread_id}`	`tenant:{tenant_id}:chat:{thread_id}`, with opaque IDs where logs may escape.	Easy to miss
Vector search	Embedding search across one shared collection with no metadata filter.	Tenant and permission metadata travel with every chunk, and retrieval-time checks run before context enters the prompt. If the store cannot enforce this reliably, use separate collections or indexes.	High impact
RAG deletion	Deleting the source document but leaving chunks, embeddings, cached answers, or summaries behind.	Deletion and retention rules cover derived artifacts, not only the original blob.	Stale access
File/blob storage	Object keys or signed URLs that are not bound to the tenant and authorization checks.	File reads go through a tenant-aware access layer, even when the blob store is shared.	Direct leak
Streams	Stream names are built from conversation IDs only.	Stream names include tenant scope, and reconnect tokens cannot subscribe across tenants.	Streaming trap
Background jobs	A queued job carries an object ID but not the tenant and actor context that authorized it.	Jobs carry scoped context and recheck permissions before side effects.	Async drift
Rate limits & quotas	Global limits hide noisy-neighbor behavior and per-tenant abuse.	Budget, quota, and rate-limit keys are tenant-scoped before they are user-scoped.	Shared resource
Tool credentials	One service credential is reused for every tenant-specific integration.	Credential lookup is tenant-scoped and action-scoped before tool invocation.	Policy boundary
Evaluation datasets	Regression cases mixed with tenant-specific prompts or documents.	Use synthetic or approved datasets. Never let tenant data leak into shared eval fixtures.	Data hygiene
Traces & logs	Raw prompts, tool payloads, or document snippets in shared observability sinks.	Attach safe tenant tags for observability and incident response only, never as the source of authorization. Redact prompt and tool payloads before shared sinks.	Exposure risk
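
For the Redis and stream rows, the invariant is easier to keep when there is exactly one place that builds keys. A minimal sketch, assuming a hypothetical `tenant_key` helper and an ID alphabet of letters, digits, `_`, and `-`:

```python
import re

# Allowed characters for any key segment. Separators such as ":" are
# rejected so one ID cannot be crafted to look like another scope.
_SAFE_SEGMENT = re.compile(r"[A-Za-z0-9_-]+")


def tenant_key(tenant_id: str, kind: str, *parts: str) -> str:
    segments = (tenant_id, kind, *parts)
    for segment in segments:
        # Empty or unsafe segments fail closed instead of producing a
        # key that silently falls outside the tenant prefix.
        if not _SAFE_SEGMENT.fullmatch(segment):
            raise ValueError(f"unsafe key segment: {segment!r}")
    return "tenant:" + ":".join(segments)
```

With this in place, `tenant_key("tenant-a", "chat", "thread-1")` yields `tenant:tenant-a:chat:thread-1`, and the same helper can name streams and rate-limit buckets so the prefix is never assembled by hand.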

Tenant Isolation Checklist

If you hand me an agent platform to review, this is the checklist I would open with. It is deliberately practical. No architecture poetry, just the questions that tend to surface a leak.

Layer	Question to ask	What good looks like
HTTP/API	Where does tenant context come from?	Derived from trusted auth claims or membership lookup, not from a user-controlled parameter alone.
Agent graph	Can a node run without context?	No. Missing context fails before model or tool execution.
Tools	Can every agent call every tool?	No. Tool availability is filtered by tenant, role, domain, and action.
Storage	Can a repository method run unscoped?	Repository methods accept `RequestContext` or a tenant-scoped session.
Vector/RAG	Is tenant filtering required by the API, and do chunks carry access-control attributes?	The retrieval function cannot be called without a tenant filter, and permissions are rechecked at retrieval time.
Cache/Streams	Can one tenant guess another tenant's key?	Keys include tenant scope and use opaque IDs where key names might be logged.
Observability	What crosses into logs, traces, and metrics?	Safe identifiers only. PII, prompt payloads, tool payloads, secrets, and sensitive business data are removed before shared sinks.
Tests	Do tests try cross-tenant reads and writes?	Yes. Cover cross-tenant retrieval, cache leakage, stale permissions, background jobs, and unauthorized tool invocation.
Policy	What attributes feed authorization?	Subject, tenant, action, resource, and environment come from trusted auth and membership data, not telemetry or model-visible state.
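
The Vector/RAG row asks for a retrieval function that cannot be called without a tenant filter. Here is a minimal sketch of that signature, using an in-memory list as a stand-in for a real vector store. The `Chunk` fields and the `retrieve` function are hypothetical, and a real store would apply the same filter inside its own query rather than in Python.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Chunk:
    # Access-control metadata travels with every chunk.
    tenant_id: str
    allowed_roles: frozenset[str]
    text: str
    embedding: tuple[float, ...]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    chunks: Sequence[Chunk],
    *,
    tenant_id: str,
    role: str,
    query_embedding: Sequence[float],
    top_k: int = 3,
) -> list[Chunk]:
    # tenant_id and role are required keyword arguments, so there is no
    # way to call this function without stating a scope.
    if not tenant_id:
        raise PermissionError("missing tenant context")

    # Filter first, rank second. Similarity never sees out-of-scope chunks.
    visible = [
        chunk
        for chunk in chunks
        if chunk.tenant_id == tenant_id and role in chunk.allowed_roles
    ]
    ranked = sorted(
        visible,
        key=lambda chunk: _dot(chunk.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:top_k]
```

Filtering before ranking is the design choice worth copying. If the filter runs after a top-k search, another tenant's chunks can crowd out the results even when they are never returned.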

A Test That Should Exist

A checklist is useful, but I would still want one boring regression test sitting in CI.

This test has tenant A write a fact. Tenant B reads through the normal scoped API and must get nothing back.

Here is the tell: if you can only make that test pass by reaching for a special bypass, your isolation boundary is living in the wrong layer.

Check this illustrative snippet showing the test shape:

async def test_memory_store_does_not_cross_tenants(memory_store):
    # Tenant A and tenant B share the same subject_id on purpose.
    # The tenant boundary, not the subject ID, must isolate the data.
    tenant_a = RequestContext(
        tenant_id="tenant-a",
        user_id="user-a",
        role="member",
        locale="en",
        trace_id="trace-a",
    )
    tenant_b = RequestContext(
        tenant_id="tenant-b",
        user_id="user-b",
        role="member",
        locale="en",
        trace_id="trace-b",
    )

    # Write through the normal tenant-scoped API.
    await memory_store.save_fact(
        tenant_a,
        subject_id="case-123",
        fact="User approved the draft contract.",
    )

    # A second tenant asks for the same subject. The expected result is empty,
    # or a permission error if your API chooses to fail closed more loudly.
    visible_to_b = await memory_store.list_facts(
        tenant_b,
        subject_id="case-123",
    )

    assert visible_to_b == []

I would push for the same test shape for vector retrieval, cache keys, stream subscriptions, background jobs, and tool calls. One small test per surface is a lot cheaper than writing the incident report for a cross-tenant leak after the fact.
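
As one example of the same shape on another surface, here is a sketch for background jobs. The `Job` fields, the `run_delete_job` function, and the in-memory `memberships` and `documents` structures are hypothetical stand-ins. The idea is only that the job carries its scope and rechecks it right before the side effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # The job carries the scope that authorized it, not just an object ID.
    tenant_id: str
    user_id: str
    document_id: str


def run_delete_job(
    job: Job,
    memberships: set[tuple[str, str]],
    documents: dict[str, str],
) -> bool:
    # `documents` maps document_id -> owning tenant_id.
    # Recheck membership at execution time, since it may have been
    # revoked between enqueue and run.
    if (job.user_id, job.tenant_id) not in memberships:
        return False
    # Object-level check: the document must belong to the job's tenant.
    if documents.get(job.document_id) != job.tenant_id:
        return False
    del documents[job.document_id]
    return True


def test_job_cannot_delete_across_tenants() -> None:
    memberships = {("user-a", "tenant-a"), ("user-b", "tenant-b")}
    documents = {"doc-1": "tenant-a"}

    # Tenant B queues a job that points at tenant A's document.
    cross_tenant = Job(tenant_id="tenant-b", user_id="user-b", document_id="doc-1")
    assert run_delete_job(cross_tenant, memberships, documents) is False
    assert "doc-1" in documents

    # A revoked member's queued job must also do nothing.
    memberships.discard(("user-a", "tenant-a"))
    stale = Job(tenant_id="tenant-a", user_id="user-a", document_id="doc-1")
    assert run_delete_job(stale, memberships, documents) is False
    assert "doc-1" in documents
```

The second half of the test covers the stale-permission case, which is the one that tends to slip through when jobs only carry an object ID.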

An Honest Tradeoff for an MVP

There is one shortcut I can live with early on: org_id = user_id. For a personal-workspace MVP, that can be enough, as long as the interface already treats "organization" as a real concept and membership still comes from trusted auth data rather than a field the user can set.

The shortcut that actually hurts is pretending the concept does not exist at all. Build every table, cache key, graph thread, and tool policy around a user-only model, and the day your first real organization shows up, moving to tenants is a rewrite across the entire platform instead of a swap behind one abstraction.

Here is my take on both shortcuts:

Acceptable early shortcut:

org_id = user_id behind an OrgContext or RequestContext abstraction, with a trusted membership lookup ready to replace it.
Tables already include tenant_id or org_id.
Tests use two tenants, even if each tenant has one user.

Shortcut that ages badly:

User IDs are threaded everywhere by hand, tools read global configuration, and memory keys are plain thread_id.
The first real organization forces a schema and API redesign at the same time.
There is no clean place to add membership.
There is no reliable way to run cross-tenant leak tests, because the platform is fully centered on a single-user model.

What I would recommend here:

If you are going to defer real organizations, keep the tenant-shaped interface anyway. You can always swap the implementation underneath later. What you cannot do cheaply is add the boundary back after every call site has spent six months learning to ignore it.
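
A minimal sketch of that swap, assuming a hypothetical `MembershipResolver` interface with two implementations. The class names, the `memory_key` helper, and the one-tenant-per-user simplification in the second resolver are all assumptions for illustration.

```python
from typing import Protocol


class MembershipResolver(Protocol):
    def tenant_for(self, user_id: str) -> str: ...


class PersonalWorkspaceResolver:
    """MVP shortcut: every user is their own tenant (org_id = user_id)."""

    def tenant_for(self, user_id: str) -> str:
        if not user_id:
            raise PermissionError("unauthenticated request")
        return user_id


class OrgMembershipResolver:
    """Later replacement: tenant comes from a trusted membership lookup."""

    def __init__(self, memberships: dict[str, str]) -> None:
        # Hypothetical mapping of user_id -> tenant_id, one tenant per user.
        self._memberships = dict(memberships)

    def tenant_for(self, user_id: str) -> str:
        try:
            return self._memberships[user_id]
        except KeyError:
            raise PermissionError("user has no tenant membership") from None


def memory_key(resolver: MembershipResolver, user_id: str, thread_id: str) -> str:
    # Call sites depend on the interface, so they never learn which
    # implementation is underneath.
    return f"tenant:{resolver.tenant_for(user_id)}:chat:{thread_id}"
```

Call sites such as `memory_key` already produce tenant-shaped keys under the shortcut, so the day real organizations arrive, the change is the resolver passed in rather than every place a key gets built.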

Takeaway

Multi-tenancy is not a feature you add on top of your orchestration. It is the boundary your orchestration runs inside, and the sooner you treat it that way, the less it costs you later.

If you have to do one thing after reading this article, do this: pick a single agent flow and trace the tenant context from the HTTP request all the way to the final tool call. Write down every spot where the context gets copied, inferred, dropped, logged, or turned into a key. That little map is usually where the real work, and the real risk, has been hiding.

Next, I will probably go a level deeper into request-context propagation for agents: how to carry tenant, locale, role, and trace metadata through the graph without polluting every tool signature to do it.
