DEV Community

authagonal

Originally published at authagonal.io

One hyphen, two tenants, one signing key

Two of our tenants were the same tenant. They had different names, different signups, and different billing rows. They also shared a database and a token-signing key, and none of the three parties involved, the two tenants and us, had any idea. This is the story of the one-line function that merged them, why every layer of the system looked perfectly correct while it happened, and why a pre-launch audit is the cheapest insurance you will ever buy.

Multi-tenant auth has exactly one job it cannot get wrong: keep tenants apart. Acme's users, Acme's sessions, and above all Acme's signing key must never be reachable by anyone else. The signing key is the crown jewel. Whoever can sign with Acme's key can mint a token that Acme's own auth server will accept as genuine, for any user, with any role, no password required. So every per-tenant resource we create, every storage table and every key, is namespaced by the tenant's slug. Get that namespacing right and tenants are islands. Get it subtly wrong and they quietly become the same place.

The one-line bug

Here is the function that turns a tenant slug into the prefix we name its storage and keys with. Read the doc comment. It documents the bug as if it were a feature.

/// Derive the table name prefix from a tenant slug by stripping hyphens.
/// E.g. "acme-corp" → "acmecorp".
public static string GetTablePrefix(string tenantSlug)
{
    return tenantSlug.Replace("-", "");
}

Replace("-", ""). It strips hyphens. The intent was housekeeping: slugs flow into Azure Table names and Vault key names, which have their own character rules, so we sanitized them. The trouble is that stripping characters is a lossy transform, and a lossy transform on an identifier is not injective. acme-corp and acmecorp both come out as acmecorp. So do ac-me-corp and acme--corp. Distinct slugs, one namespace.
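To make the collapse concrete, here is a minimal sketch that runs slugs through the same hyphen-stripping logic. The class name `PrefixCollisionDemo` and the `DistinctPrefixes` helper are illustrative and not taken from the real codebase.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PrefixCollisionDemo
{
    // Same body as the original helper: strip every hyphen from the slug.
    public static string GetTablePrefix(string tenantSlug)
    {
        return tenantSlug.Replace("-", "");
    }

    // Hypothetical helper: the set of distinct prefixes a list of slugs maps to.
    public static ISet<string> DistinctPrefixes(IEnumerable<string> slugs)
    {
        return new HashSet<string>(slugs.Select(s => GetTablePrefix(s)));
    }
}
```

Calling `PrefixCollisionDemo.DistinctPrefixes(new[] { "acme-corp", "acmecorp", "ac-me-corp", "acme--corp" })` yields a set with a single element, `acmecorp`: four identities in, one namespace out.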

That namespace is everything downstream. The user table is {prefix}-Users. And the per-tenant signing key is, verbatim, this:

private string GetKeyName() => $"signing-{ShardRouter.GetTablePrefix(_tenantContext.Slug)}";

So acme-corp and acmecorp do not merely share a user list. They sign their tokens with the same signing-acmecorp key in Vault. A token minted for one is, byte for byte, a validly signed token for the other. If you can register a slug that collapses to the same prefix as an existing tenant, you can issue yourself tokens their auth server trusts completely. Cross-tenant account takeover, and the exploit is "sign up with a hyphen."

Why nothing caught it

The unsettling part is how ordinary every layer looked. Signup validated the slug and saw a fresh, unused string. Provisioning created a tenant record keyed by the full slug, acme-corp, which is genuinely distinct from acmecorp in the control plane. Token validation derived the key from the slug and verified happily. Every component did its job correctly on its own input.

Nothing in the system ever compared two slugs' prefixes, because no single component owned the invariant "a slug maps to exactly one namespace." The collision lived in the gap between a slug, which the control plane keys on, and a prefix, which storage and Vault key on. Nobody was standing in that gap. This is the signature of the most dangerous class of bug: not a check anyone forgot, but an assumption nobody knew they were making.
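The missing comparison is cheap to write once you know to look for it. The sketch below groups a list of slugs by derived prefix and reports every prefix claimed by more than one slug. `NamespaceAudit` and `FindCollisions` are hypothetical names, and the local `GetTablePrefix` is a stand-in for the pre-fix helper, not the production code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NamespaceAudit
{
    // Stand-in for the pre-fix prefix derivation (assumption: hyphen stripping only).
    public static string GetTablePrefix(string tenantSlug)
    {
        return tenantSlug.Replace("-", "");
    }

    // Returns prefix -> slugs, but only for prefixes shared by two or more distinct slugs.
    public static Dictionary<string, List<string>> FindCollisions(IEnumerable<string> slugs)
    {
        return slugs
            .Distinct(StringComparer.Ordinal)
            .GroupBy(s => GetTablePrefix(s), StringComparer.Ordinal)
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.ToList(), StringComparer.Ordinal);
    }
}
```

An empty result means the mapping is one-to-one over the slugs you checked. It says nothing about slugs that have not been registered yet, which is why a check like this is an audit tool and not a fix.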

The second collision, for free

The same lossy transform had a second victim. Our internal system tenants are named by suffix: {slug}-admin holds a tenant's portal team, {slug}-sandbox holds its test environment. Run those through the same de-hyphenator and acme-admin becomes acmeadmin. Which means a customer who registered the slug acmeadmin would collapse onto the prefix of Acme's admin tenant, the one place that is supposed to be more privileged than the customer, not shared with one.

One stripping function, two different isolation boundaries at risk: tenant to tenant, and tenant to its own control plane. When a single line threatens two unrelated boundaries, that is the tell of a root-cause bug rather than a surface one. The fix has to land on the transform, not on either symptom.
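The system-tenant case can be sketched the same way. Everything here is illustrative: `AdminTenantSlug` assumes the `{slug}-admin` convention described above, and `CollidesWithAdminTenant` is a hypothetical helper that compares derived prefixes under the old hyphen-stripping rule.

```csharp
public static class SystemTenantCollisionDemo
{
    // Stand-in for the pre-fix prefix derivation (assumption: hyphen stripping only).
    public static string GetTablePrefix(string tenantSlug)
    {
        return tenantSlug.Replace("-", "");
    }

    // Builds the system tenant slug using the {slug}-admin naming convention.
    public static string AdminTenantSlug(string customerSlug)
    {
        return customerSlug + "-admin";
    }

    // True when a candidate customer slug lands on the same prefix
    // as an existing customer's admin tenant.
    public static bool CollidesWithAdminTenant(string candidateSlug, string existingCustomerSlug)
    {
        return GetTablePrefix(candidateSlug) == GetTablePrefix(AdminTenantSlug(existingCustomerSlug));
    }
}
```

With these definitions, `CollidesWithAdminTenant("acmeadmin", "acme")` returns `true`, while an unrelated slug such as `globex` does not collide.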

The fix: injective by construction

The instinct is to make the prefix function smarter. Escape the hyphens, hash the slug, base32-encode it. Every one of those is still a transform you have to prove injective forever, against every future change, by someone who may not know why it matters. The cheaper and far more durable fix is to remove the transform's freedom entirely: constrain the input so the transform is the identity.

// Lowercase alphanumeric ONLY, no hyphens. Forbidding hyphens makes prefix == slug,
// so distinct slugs can never share a data store or signing key.
if (!slug.All(c => (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')))
    return false;

Slugs are now lowercase letters and digits, nothing else. With no hyphens to strip, GetTablePrefix has nothing to do, prefix == slug holds by construction, and two distinct slugs can never share a namespace again. We also reject the reserved names and any slug ending in admin or sandbox, which closes the system-tenant collision in the same line. The validity check is the narrow point where a slug first becomes a namespace, so that is exactly where the one-to-one guarantee belongs.
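Put together, the rules in this section might look roughly like the sketch below. The character check mirrors the snippet above. The reserved-name list is a placeholder, since the real list is not shown here, and `SlugRules` and `IsValidSlug` are hypothetical names. Length limits and any other storage-specific naming rules are deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SlugRules
{
    // Assumption: placeholder reserved names, not the real list.
    private static readonly HashSet<string> ReservedNames =
        new HashSet<string>(StringComparer.Ordinal) { "system", "internal" };

    // Suffixes used by system tenants ({slug}-admin, {slug}-sandbox).
    private static readonly string[] ReservedSuffixes = { "admin", "sandbox" };

    public static bool IsValidSlug(string slug)
    {
        if (string.IsNullOrEmpty(slug))
            return false;

        // Lowercase alphanumeric only, so prefix == slug.
        if (!slug.All(c => (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')))
            return false;

        if (ReservedNames.Contains(slug))
            return false;

        // Blocks customer slugs that would land on a system tenant's prefix.
        if (ReservedSuffixes.Any(suffix => slug.EndsWith(suffix, StringComparison.Ordinal)))
            return false;

        return true;
    }
}
```

Under these assumptions `acmecorp` passes, while `acme-corp`, `Acme`, `acmeadmin`, and `acmesandbox` are all rejected.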

We then re-assert the same rule one more time, at the provisioning chokepoint. There are two doors a slug can enter through, self-service signup and admin-driven provisioning, and a security invariant defended at only one of two doors is defended at neither.
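One way to picture the two doors is to route both through a single private method that re-checks the rule itself. This is a sketch under stated assumptions: `SignUp`, `ProvisionAsAdmin`, and `Provision` are hypothetical names, and the validity check is reduced to the character rule, with reserved-name and suffix checks omitted for brevity.

```csharp
using System;
using System.Linq;

public static class TenantProvisioning
{
    // Simplified stand-in for the slug rule: lowercase letters and digits only.
    private static bool IsValidSlug(string slug)
    {
        return !string.IsNullOrEmpty(slug)
            && slug.All(c => (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'));
    }

    // Door 1 (hypothetical): self-service signup.
    public static string SignUp(string slug)
    {
        return Provision(slug);
    }

    // Door 2 (hypothetical): admin-driven provisioning.
    public static string ProvisionAsAdmin(string slug)
    {
        return Provision(slug);
    }

    // The chokepoint: re-checks the rule whichever door the slug came through,
    // then returns the namespace prefix, which is now simply the slug itself.
    private static string Provision(string slug)
    {
        if (!IsValidSlug(slug))
            throw new ArgumentException("Slug must be lowercase alphanumeric.", nameof(slug));
        return slug;
    }
}
```

Because the check lives inside `Provision`, a caller that skips its own validation still cannot create a namespace from a hyphenated slug.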

Why we are telling you

We found this in a pre-launch security audit of our own product, before a single paying customer existed, which is the only acceptable time to find it. No tenant was ever merged in production. But it is a humbling bug, because it is not a missing check or a weak algorithm. It is a helper that does something entirely reasonable, sanitize a string, in a place where "reasonable" and "injective" turn out to be different words.

The lesson outlived the fix. Any function that turns user-controlled input into the name of a security boundary (a table, a key, a namespace, a path) must be injective, and you should enforce that at the narrowest point where the mapping is created, not hope it survives every layer downstream. A normalization step (lowercase, trim, strip, collapse) is precisely where two identities quietly become one.

It is the same principle the whole product is built on. Security is not a tier you graduate into; it is the floor. Every isolation guarantee, every signing key, and every security feature we ship, SSO and SAML, SCIM, MFA, audit export, is on at every plan, because the alternative is the kind of thing you find at 2am instead of in an audit. See what's included.
