DEV Community

Diven Rastdus

OAuth Token Vault Patterns for AI Agents

AI agents that call external APIs have a problem most tutorials skip over: tokens expire, and when they do mid-pipeline, your agent crashes in the worst possible way.

I learned this the hard way. I had an agent handling Stripe checkout flows -- it would authenticate at the start of a session, store the token in memory, and happily process orders. One night it lost Stripe access mid-checkout because the OAuth token had a 1-hour TTL and the session ran long. The order went through on the customer side, but the webhook couldn't be verified because the token was gone. Refund chaos. Support tickets. A very bad morning.

The fix wasn't just "refresh tokens more aggressively." It required thinking carefully about where tokens live, who can access them, and what happens when things go wrong.

Here are the four patterns I now use in every production AI agent system.


The Core Problem

AI agents need persistent API access. Unlike a web app where a user authenticates once per session, agents run continuously -- they wake up on cron, get triggered by webhooks, or run multi-hour pipelines. OAuth tokens expire. Refresh tokens rotate. And if you're handling multiple tenants, one compromised token shouldn't expose others.

The naive approaches fail in predictable ways:

  • Environment variables: static, no rotation, shared across all tenants
  • Plain database columns: readable by anyone with DB access, no encryption at rest
  • In-memory storage: wiped on restart, no persistence across agent runs

You need a vault.


Pattern 1: Encrypted-at-Rest Vault

Every token at rest should be encrypted with AES-256-GCM. The key insight is to use separate encryption keys per tenant -- not one master key for everything.

import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const ALGORITHM = "aes-256-gcm";
const KEY_LENGTH = 32; // 256 bits
const IV_LENGTH = 12; // NIST-recommended IV size for GCM
const TAG_LENGTH = 16;

interface EncryptedToken {
  ciphertext: string;
  iv: string;
  tag: string;
  keyId: string; // reference to which key was used
}

interface TokenRecord {
  tenantId: string;
  provider: string; // "stripe", "github", "slack"
  accessToken: EncryptedToken;
  refreshToken: EncryptedToken | null;
  expiresAt: Date;
  scope: string[];
}

// Key derivation: derive per-tenant key from master key + tenant ID
// This means you only store one master key, but each tenant gets a unique encryption key
async function deriveKey(masterKey: Buffer, tenantId: string): Promise<Buffer> {
  const { hkdf } = await import("crypto");
  return new Promise((resolve, reject) => {
    hkdf(
      "sha256",
      masterKey,
      Buffer.from(tenantId),
      Buffer.from("oauth-token-vault-v1"),
      KEY_LENGTH,
      (err, key) => {
        if (err) reject(err);
        else resolve(Buffer.from(key));
      }
    );
  });
}

function encryptToken(plaintext: string, key: Buffer): EncryptedToken {
  const iv = randomBytes(IV_LENGTH);
  const cipher = createCipheriv(ALGORITHM, key, iv);

  const ciphertext = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);

  const tag = cipher.getAuthTag();

  return {
    ciphertext: ciphertext.toString("base64"),
    iv: iv.toString("base64"),
    tag: tag.toString("base64"),
    keyId: "v1", // version your keys so you can rotate them
  };
}

function decryptToken(encrypted: EncryptedToken, key: Buffer): string {
  const decipher = createDecipheriv(
    ALGORITHM,
    key,
    Buffer.from(encrypted.iv, "base64")
  );

  decipher.setAuthTag(Buffer.from(encrypted.tag, "base64"));

  const plaintext = Buffer.concat([
    decipher.update(Buffer.from(encrypted.ciphertext, "base64")),
    decipher.final(),
  ]);

  return plaintext.toString("utf8");
}

// Vault class that wraps storage with transparent encryption/decryption
class TokenVault {
  private masterKey: Buffer;

  constructor(masterKey: string) {
    this.masterKey = Buffer.from(masterKey, "base64");
  }

  async store(
    tenantId: string,
    provider: string,
    accessToken: string,
    refreshToken: string | null,
    expiresAt: Date,
    scope: string[]
  ): Promise<void> {
    const key = await deriveKey(this.masterKey, tenantId);

    const record: TokenRecord = {
      tenantId,
      provider,
      accessToken: encryptToken(accessToken, key),
      refreshToken: refreshToken ? encryptToken(refreshToken, key) : null,
      expiresAt,
      scope,
    };

    // Store in your DB -- only encrypted values hit disk
    await db.upsert("oauth_tokens", record, ["tenantId", "provider"]);
  }

  async retrieve(
    tenantId: string,
    provider: string
  ): Promise<{ accessToken: string; refreshToken: string | null; expiresAt: Date }> {
    const record = await db.findOne("oauth_tokens", { tenantId, provider });
    if (!record) throw new Error(`No token for ${tenantId}/${provider}`);

    const key = await deriveKey(this.masterKey, tenantId);

    return {
      accessToken: decryptToken(record.accessToken, key),
      refreshToken: record.refreshToken
        ? decryptToken(record.refreshToken, key)
        : null,
      expiresAt: record.expiresAt,
    };
  }
}

The keyId field on EncryptedToken is load-bearing. When you rotate your master key, you need to know which version encrypted each record so you can re-encrypt them. Without versioning, key rotation requires decrypting everything at once -- a dangerous all-or-nothing operation.
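To make that concrete, here is a minimal sketch of keyId-driven lazy rotation. It simplifies Pattern 1 by assuming a flat registry mapping each key version to a 32-byte key (in production these would come from your KMS, with per-tenant derivation layered on top); the registry, CURRENT_KEY_ID, and rotateIfStale are all hypothetical names introduced for illustration.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

// Hypothetical key registry: keyId -> 32-byte key. "v1" is the retiring key,
// "v2" the current one.
const keys: Record<string, Buffer> = {
  v1: randomBytes(32),
  v2: randomBytes(32),
};
const CURRENT_KEY_ID = "v2";

interface EncryptedToken {
  ciphertext: string;
  iv: string;
  tag: string;
  keyId: string;
}

function encrypt(plaintext: string, keyId: string): EncryptedToken {
  const iv = randomBytes(12); // GCM-recommended IV size
  const cipher = createCipheriv("aes-256-gcm", keys[keyId], iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    ciphertext: ciphertext.toString("base64"),
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"),
    keyId, // record which key version encrypted this token
  };
}

function decrypt(token: EncryptedToken): string {
  // keyId tells us exactly which key to use -- no trial decryption
  const decipher = createDecipheriv(
    "aes-256-gcm",
    keys[token.keyId],
    Buffer.from(token.iv, "base64")
  );
  decipher.setAuthTag(Buffer.from(token.tag, "base64"));
  return Buffer.concat([
    decipher.update(Buffer.from(token.ciphertext, "base64")),
    decipher.final(),
  ]).toString("utf8");
}

// Lazy rotation: re-encrypt a record under the current key only when touched,
// so rotation spreads over time instead of being all-or-nothing
function rotateIfStale(token: EncryptedToken): EncryptedToken {
  if (token.keyId === CURRENT_KEY_ID) return token;
  return encrypt(decrypt(token), CURRENT_KEY_ID);
}
```

You can call rotateIfStale on every read; records encrypted under the old key migrate incrementally, and once no record carries the old keyId you can retire that key.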


Pattern 2: Token Refresh Daemon

Don't refresh tokens on-demand inside your agent pipeline. By the time the agent discovers a token is expired, it's already mid-task. Instead, run a background process that refreshes tokens before they expire.

import { TokenVault } from "./vault";

const REFRESH_BUFFER_MINUTES = 10; // refresh 10 minutes before expiry
const DAEMON_INTERVAL_MS = 60_000; // check every minute

async function refreshIfNeeded(
  vault: TokenVault,
  tenantId: string,
  provider: string,
  refreshFn: (refreshToken: string) => Promise<{
    accessToken: string;
    refreshToken: string;
    expiresIn: number; // seconds
  }>
): Promise<void> {
  const token = await vault.retrieve(tenantId, provider);

  const now = new Date();
  const bufferMs = REFRESH_BUFFER_MINUTES * 60 * 1000;
  const shouldRefresh = token.expiresAt.getTime() - now.getTime() < bufferMs;

  if (!shouldRefresh) return;
  if (!token.refreshToken) {
    console.warn(`[token-daemon] ${tenantId}/${provider}: no refresh token, cannot refresh`);
    return;
  }

  try {
    const newTokens = await refreshFn(token.refreshToken);
    const newExpiry = new Date(Date.now() + newTokens.expiresIn * 1000);

    await vault.store(
      tenantId,
      provider,
      newTokens.accessToken,
      newTokens.refreshToken,
      newExpiry,
      [] // caveat: this overwrites the stored scope; have retrieve() also return scope if you need to preserve it
    );

    console.log(`[token-daemon] refreshed ${tenantId}/${provider}, expires ${newExpiry.toISOString()}`);
  } catch (err) {
    console.error(`[token-daemon] failed to refresh ${tenantId}/${provider}:`, err);
    // Don't throw -- other tenants should still refresh
    // But you should alert here (PagerDuty, Slack, whatever)
    await alertOnRefreshFailure(tenantId, provider, err);
  }
}

// The daemon loop -- runs continuously in a separate process
async function runRefreshDaemon(vault: TokenVault): Promise<never> {
  console.log("[token-daemon] starting");

  while (true) {
    const allTokens = await db.findAll("oauth_tokens");

    // Process all tenants concurrently, but don't let one failure kill the rest
    await Promise.allSettled(
      allTokens.map((record) =>
        refreshIfNeeded(vault, record.tenantId, record.provider, (rt) =>
          callProviderRefreshEndpoint(record.provider, rt)
        )
      )
    );

    await sleep(DAEMON_INTERVAL_MS);
  }
}

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

Promise.allSettled is the right choice here, not Promise.all. If one tenant's refresh fails, you still want to attempt the rest. Promise.all would bail out on the first rejection.

Run this daemon as a separate Node.js process, not as part of your main agent worker. This way a restart of the agent doesn't interrupt in-flight refreshes.
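A minimal entrypoint for that standalone process might look like the sketch below. The stop flag is the important part: on SIGTERM the daemon finishes its current sweep instead of dying mid-refresh. The daemonLoop signature here is a simplified, hypothetical stand-in for runRefreshDaemon above.

```typescript
// Graceful-shutdown sketch for the daemon process. A SIGTERM (e.g. from a
// deploy) flips the flag, and the loop exits after the in-flight sweep.
let stopping = false;
process.on("SIGTERM", () => {
  console.log("[token-daemon] SIGTERM received, finishing current sweep");
  stopping = true;
});

async function daemonLoop(
  sweep: () => Promise<void>, // one pass over all tokens, e.g. the allSettled body above
  intervalMs: number
): Promise<number> {
  let sweeps = 0;
  while (!stopping) {
    await sweep(); // never interrupted mid-refresh
    sweeps++;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return sweeps;
}
```

Wire sweep up to the Promise.allSettled pass from runRefreshDaemon and run this file under your process supervisor (systemd, a Kubernetes Deployment, pm2) separately from the agent workers.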


Pattern 3: Circuit Breaker for Expired Tokens

Even with a refresh daemon, tokens can expire between refresh cycles -- the provider might revoke them, the user might disconnect the integration, or the refresh daemon itself might have been down. Your agent needs to handle this gracefully instead of throwing unhandled errors into the middle of a pipeline.

type TokenStatus = "valid" | "expired" | "revoked" | "missing";

interface TokenCheckResult {
  status: TokenStatus;
  token: string | null;
  error: string | null;
}

// A wrapper your agent calls before any API operation
async function getValidToken(
  vault: TokenVault,
  tenantId: string,
  provider: string
): Promise<TokenCheckResult> {
  let tokenData: Awaited<ReturnType<typeof vault.retrieve>>;

  try {
    tokenData = await vault.retrieve(tenantId, provider);
  } catch {
    return { status: "missing", token: null, error: `No token stored for ${provider}` };
  }

  // Check expiry with a small buffer
  const now = Date.now();
  const expiryMs = tokenData.expiresAt.getTime();
  const GRACE_MS = 30_000; // 30-second grace period

  if (expiryMs - now < GRACE_MS) {
    return {
      status: "expired",
      token: null,
      error: `Token for ${provider} expired at ${tokenData.expiresAt.toISOString()}`,
    };
  }

  return { status: "valid", token: tokenData.accessToken, error: null };
}

// Circuit breaker pattern for agent tasks
async function withValidToken<T>(
  vault: TokenVault,
  tenantId: string,
  provider: string,
  task: (token: string) => Promise<T>,
  fallback: (reason: string) => T
): Promise<T> {
  const check = await getValidToken(vault, tenantId, provider);

  if (check.status !== "valid" || !check.token) {
    console.warn(
      `[circuit-breaker] skipping ${provider} task for ${tenantId}: ${check.error}`
    );
    // Log this for alerting/monitoring
    await logTokenFailure(tenantId, provider, check.status);
    return fallback(check.error ?? "unknown token error");
  }

  try {
    return await task(check.token);
  } catch (err: unknown) {
    // Catch 401s and mark token as needing re-auth
    if (isUnauthorizedError(err)) {
      await markTokenRevoked(tenantId, provider);
      return fallback("token was revoked during execution");
    }
    throw err;
  }
}

// Example usage in an agent pipeline
async function processCheckoutForTenant(tenantId: string): Promise<void> {
  await withValidToken(
    vault,
    tenantId,
    "stripe",
    async (token) => {
      // Normal path: token is valid, proceed
      await stripe.charges.create({ ... }, { apiKey: token });
    },
    (reason) => {
      // Degraded path: queue for retry when token is restored
      console.log(`Queuing checkout for ${tenantId}, will retry: ${reason}`);
      return jobQueue.enqueue("retry-checkout", { tenantId, reason });
    }
  );
}

function isUnauthorizedError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    "status" in err &&
    (err as { status: number }).status === 401
  );
}

The critical thing here is the fallback behavior. The agent doesn't crash -- it degrades gracefully and queues the work for retry. That Stripe checkout scenario I mentioned earlier? With this pattern, the order gets queued for retry with a flag "needs token re-auth" instead of silently failing.
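The retry side of that queue can be sketched as a small in-memory structure: jobs blocked on a bad token are parked per tenant/provider and drained once re-auth restores access. TokenRetryQueue and its shape are hypothetical names for illustration; in production you would back this with a durable queue, not process memory.

```typescript
// In-memory sketch of the "queue for retry when token is restored" path.
type RetryJob = {
  tenantId: string;
  provider: string;
  run: () => Promise<void>; // the deferred work, e.g. the checkout
};

class TokenRetryQueue {
  private parked = new Map<string, RetryJob[]>();

  private key(tenantId: string, provider: string): string {
    return `${tenantId}:${provider}`;
  }

  // Called from the circuit breaker's fallback path
  enqueue(job: RetryJob): void {
    const k = this.key(job.tenantId, job.provider);
    const list = this.parked.get(k) ?? [];
    list.push(job);
    this.parked.set(k, list);
  }

  // Called from your OAuth callback handler after the user re-authorizes:
  // runs every parked job for that tenant/provider and returns how many ran
  async drain(tenantId: string, provider: string): Promise<number> {
    const k = this.key(tenantId, provider);
    const jobs = this.parked.get(k) ?? [];
    this.parked.delete(k);
    for (const job of jobs) await job.run();
    return jobs.length;
  }
}
```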


Pattern 4: Multi-Tenant Token Isolation

If you're building an agent that serves multiple customers, their tokens must be completely isolated. One compromised tenant should not be able to access another's tokens.

The encryption approach in Pattern 1 handles the at-rest isolation (separate derived keys per tenant). But you also need runtime isolation.

// Tenant-scoped vault -- agent instances can only access their own tenant
class TenantScopedVault {
  private vault: TokenVault;
  private tenantId: string;

  constructor(vault: TokenVault, tenantId: string) {
    this.vault = vault;
    this.tenantId = tenantId;
  }

  // No tenantId parameter -- it's baked in at construction time
  async store(
    provider: string,
    accessToken: string,
    refreshToken: string | null,
    expiresAt: Date,
    scope: string[]
  ): Promise<void> {
    return this.vault.store(
      this.tenantId,
      provider,
      accessToken,
      refreshToken,
      expiresAt,
      scope
    );
  }

  async retrieve(provider: string) {
    return this.vault.retrieve(this.tenantId, provider);
  }
}

// Agent factory -- each agent gets a scoped vault at construction
function createAgentForTenant(
  tenantId: string,
  globalVault: TokenVault
): Agent {
  // The agent can never accidentally access another tenant's tokens
  // because its vault doesn't expose the tenantId parameter
  const scopedVault = new TenantScopedVault(globalVault, tenantId);
  return new Agent(scopedVault);
}

// Row-level security in your database
// This is the belt-and-suspenders version -- TypeScript isolation + DB isolation
const CREATE_TENANT_POLICY = `
  -- Policies are ignored unless RLS is enabled on the table
  ALTER TABLE oauth_tokens ENABLE ROW LEVEL SECURITY;

  CREATE POLICY tenant_isolation ON oauth_tokens
    USING (tenant_id = current_setting('app.tenant_id'));
`;

The row-level security SQL is not optional -- it's the safety net when the TypeScript layer has a bug. Defense in depth: TypeScript isolation prevents accidental cross-tenant access, RLS prevents it at the database level even if the application layer is compromised.
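For the policy to bite, the application has to set app.tenant_id before querying. One sketch, assuming a pg-style client (the withTenantContext helper is a hypothetical name): wrap each tenant's work in a transaction and use Postgres's set_config with is_local = true, so the setting is scoped to that transaction and can't leak to other requests on a pooled connection.

```typescript
// Pairs the RLS policy with the application layer: every query for a tenant
// runs inside a transaction that first sets app.tenant_id, so the policy
// filters rows even if application code forgets a WHERE clause.
async function withTenantContext<T>(
  client: { query: (sql: string, params?: unknown[]) => Promise<unknown> },
  tenantId: string,
  fn: () => Promise<T>
): Promise<T> {
  await client.query("BEGIN");
  try {
    // is_local = true scopes the setting to this transaction only
    await client.query("SELECT set_config('app.tenant_id', $1, true)", [tenantId]);
    const result = await fn(); // all queries in here see only this tenant's rows
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Transaction-local settings matter because connection pools reuse connections across tenants; a session-level SET would bleed one tenant's context into the next request.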


Putting It Together

The four patterns compose cleanly:

  1. Vault stores all tokens encrypted, with per-tenant derived keys
  2. Refresh daemon wakes every minute and refreshes anything expiring soon
  3. Circuit breaker wraps every agent API call -- graceful degradation on token failure
  4. Tenant-scoped vault instances enforce isolation at the TypeScript and DB levels

The thing that trips people up is thinking encryption alone is enough. It protects data at rest, but if your application can decrypt it, an attacker who compromises your application can too. That's why you need the isolation layer (Pattern 4) and the circuit breaker (Pattern 3) -- they limit blast radius when something goes wrong.

The refresh daemon (Pattern 2) is what prevents the "wrong time to fail" problem. Tokens shouldn't expire during a pipeline; they should be refreshed proactively before that can happen.


If you're building production AI systems that interact with external APIs, these patterns will save you from a lot of 2am incidents. I've implemented this stack for clients across fintech, e-commerce, and dev tooling.

I build production AI systems for companies. If you're tackling agent infrastructure, multi-tenant architecture, or just need a senior engineer to review what you've got -- astraedus.dev
