GrimLabs

Posted on Apr 26

Multi-Tenant Audit Logging: The Architecture Mistakes We Made

#multitenant #architecture #auditlogging #saas

We shipped our SaaS product with a single audit_logs table that had a tenant_id column. Seemed fine. Every query filtered by tenant_id. We had an index on it. Done.

Then a customer's admin found another customer's audit events in their activity feed.

That was the worst Slack message I've ever received on a Friday afternoon. And it was entirely our fault.

How the Data Leak Happened

The bug was embarrassingly simple. We had a dashboard endpoint that loaded recent audit events for the "current organization." But one of our API routes was pulling the tenant_id from the request's query string instead of the authenticated session.

// THE BUG: tenant_id came from the request, not the session
app.get('/api/audit-events', async (req, res) => {
  const events = await db.auditEvents.findMany({
    where: {
      tenantId: req.query.tenantId, // WRONG: user-supplied
    },
    orderBy: { timestamp: 'desc' },
    take: 50,
  });

  res.json(events);
});

// THE FIX: tenant_id comes from the authenticated session
app.get('/api/audit-events', async (req, res) => {
  const events = await db.auditEvents.findMany({
    where: {
      tenantId: req.auth.tenantId, // RIGHT: from verified session
    },
    orderBy: { timestamp: 'desc' },
    take: 50,
  });

  res.json(events);
});

One line of difference. But the implications were massive. A customer who guessed or obtained another tenant's ID could have read that organization's audit logs. The very system designed to prove accountability and security had a data leak.

Not gonna lie, that was a very bad week.
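One way to make the session the only source of tenant identity is to route every handler through a small resolver. The sketch below is illustrative only: AuthSession, TenantMismatchError, and resolveTenantId are hypothetical names, not part of our codebase, and it assumes the auth layer has already verified the session.

```typescript
// Hypothetical session shape; the post itself only shows req.auth.tenantId.
interface AuthSession {
  userId: string;
  tenantId: string;
}

// Raised when a client asks for a tenant other than the one it is logged into.
class TenantMismatchError extends Error {
  constructor(requested: string) {
    super(`Requested tenant ${requested} does not match the authenticated session`);
    this.name = 'TenantMismatchError';
  }
}

// Returns the tenant ID to query with. The session is the only source of truth;
// a client-supplied value is tolerated only when it matches the session exactly.
function resolveTenantId(session: AuthSession | undefined, requested?: string): string {
  if (!session || !session.tenantId) {
    throw new Error('No authenticated tenant context');
  }
  if (requested !== undefined && requested !== session.tenantId) {
    throw new TenantMismatchError(requested);
  }
  return session.tenantId;
}
```

Rejecting a mismatched value, rather than silently ignoring it, also gives you a signal worth logging: a request for someone else's tenant is either a client bug or a probe.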

The Three Approaches to Multi-Tenant Audit Storage

After the incident, we did a deep dive into multi-tenant isolation patterns. There are basically three approaches, each with real tradeoffs.

Approach 1: Shared Table with tenant_id Column

This is what we started with. All tenants share a single table, with a tenant_id column to separate data.

CREATE TABLE audit_events (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id UUID NOT NULL,
  event_type VARCHAR(100) NOT NULL,
  actor_id UUID NOT NULL,
  actor_email VARCHAR(255),
  target_type VARCHAR(100),
  target_id UUID,
  changes JSONB,
  metadata JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_audit_tenant_time ON audit_events (tenant_id, created_at DESC);

Pros: Simple. One table, one schema, one set of indexes. Easy to query across tenants for internal analytics.

Cons: One bug and you have a cross-tenant data leak. Query performance degrades as the table grows across all tenants. Row-level security is hard to enforce consistently.

Approach 2: Schema-per-Tenant

Each tenant gets their own database schema (or namespace). The audit_events table exists in each schema but with identical structure.

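// Schema-per-tenant routing
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function getAuditTable(tenantId: string) {
  // The schema name is interpolated into SQL, so accept only a well-formed UUID
  if (!UUID_RE.test(tenantId)) {
    throw new Error('Invalid tenant id');
  }
  // Hyphens are not valid in unquoted identifiers, so map them to underscores
  const schema = `tenant_${tenantId.toLowerCase().replace(/-/g, '_')}`;
  return `${schema}.audit_events`;
}

async function queryAuditEvents(tenantId: string, filter: AuditFilter) {
  const table = getAuditTable(tenantId);

  // Even if there's a bug in the filter logic, this query can only
  // read from the one schema that tenantId resolved to
  return db.raw(`
    SELECT * FROM ${table}
    WHERE created_at BETWEEN ? AND ?
    ORDER BY created_at DESC
    LIMIT ?
  `, [filter.startDate, filter.endDate, filter.limit]);
}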

Pros: Namespace-level isolation. A bug in your filter logic can't leak data across schemas, as long as the schema itself is chosen from the verified tenant ID. Each tenant's data can be managed independently (backup, delete, export).

Cons: Schema management becomes complex. Migrations must be applied to every schema. Connection pooling gets complicated. You can't easily query across tenants.
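To make the migration cost concrete, here is a rough sketch of expanding one DDL change into a statement per tenant schema. planSchemaMigration and MigrationStep are hypothetical names, the tenant_ naming scheme is an assumption carried over from the routing example, and the request_id column in the usage note is invented for illustration.

```typescript
// One planned statement for one tenant schema.
interface MigrationStep {
  schema: string;
  sql: string;
}

// Expands a single DDL template into one statement per tenant schema.
// Schema names end up inside SQL text, so anything that does not look like
// a name we generated ourselves is rejected before any statement is built.
function planSchemaMigration(
  schemas: string[],
  ddl: (qualifiedTable: string) => string,
): MigrationStep[] {
  return schemas.map((schema) => {
    if (!/^tenant_[0-9a-f_]+$/.test(schema)) {
      throw new Error(`Refusing to migrate unexpected schema name: ${schema}`);
    }
    return { schema, sql: ddl(`${schema}.audit_events`) };
  });
}
```

Calling planSchemaMigration(['tenant_ab12', 'tenant_cd34'], (t) => `ALTER TABLE ${t} ADD COLUMN request_id UUID`) yields two steps, one per schema. With thousands of tenants, that list is what your migration runner has to execute, retry, and track.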

Approach 3: Database-per-Tenant

The nuclear option. Each tenant gets their own database entirely.

Pros: Complete isolation. You can put high-value tenants on dedicated infrastructure. Compliance teams love this.

Cons: Operational nightmare at scale. Connection management, migrations, monitoring, backups, all multiplied by tenant count. Cost scales linearly with tenants.
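The connection management problem can be sketched as a registry that keeps a bounded number of per-tenant pools open. This is an assumption-heavy illustration, not something we ran in production: TenantPoolRegistry is a hypothetical class, and the create and close callbacks stand in for whatever your database driver provides.

```typescript
// Caches one pool-like handle per tenant and closes the least recently used
// handle once the cap is reached, so open connections stay bounded.
class TenantPoolRegistry<P> {
  private pools = new Map<string, P>();
  private create: (tenantId: string) => P;
  private close: (pool: P) => void;
  private maxPools: number;

  constructor(create: (tenantId: string) => P, close: (pool: P) => void, maxPools: number) {
    if (maxPools < 1) throw new Error('maxPools must be at least 1');
    this.create = create;
    this.close = close;
    this.maxPools = maxPools;
  }

  get(tenantId: string): P {
    const existing = this.pools.get(tenantId);
    if (existing !== undefined) {
      // Map preserves insertion order, so re-inserting marks this tenant as most recent.
      this.pools.delete(tenantId);
      this.pools.set(tenantId, existing);
      return existing;
    }
    if (this.pools.size >= this.maxPools) {
      // The first key in the Map is the least recently used tenant.
      const oldest = this.pools.keys().next().value;
      if (oldest !== undefined) {
        this.close(this.pools.get(oldest) as P);
        this.pools.delete(oldest);
      }
    }
    const pool = this.create(tenantId);
    this.pools.set(tenantId, pool);
    return pool;
  }

  get size(): number {
    return this.pools.size;
  }
}
```

Even with a cap like this, every evicted tenant pays a reconnect cost on its next request, which is part of why this approach gets expensive as tenant count grows.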

What We Actually Built (Second Time Around)

After the data leak, we rebuilt with a hybrid approach. Shared table with tenant_id, but with multiple layers of protection to prevent cross-tenant access.

Layer 1: Row Level Security in PostgreSQL

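-- PostgreSQL Row Level Security
ALTER TABLE audit_events ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS unless it is forced
ALTER TABLE audit_events FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON audit_events
  USING (tenant_id = current_setting('app.current_tenant_id')::uuid);

-- Set tenant context at the start of every request, inside a transaction.
-- SET LOCAL is reset at commit or rollback, so the value cannot leak to the
-- next request that reuses the same pooled connection.
BEGIN;
SET LOCAL app.current_tenant_id = 'tenant_uuid_here';
-- ... tenant queries ...
COMMIT;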

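With RLS enabled, even if your application code forgets to filter by tenant_id, PostgreSQL will enforce it at the database level. This is your safety net, with two caveats: superusers and roles with BYPASSRLS skip policies entirely, and the table owner skips them too unless you use FORCE ROW LEVEL SECURITY. Connect as a role that is none of those.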

Layer 2: Middleware Enforcement

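// Every request gets a verified tenant context
function tenantMiddleware(req: Request, res: Response, next: NextFunction) {
  const tenantId = req.auth?.tenantId;

  if (!tenantId) {
    return res.status(401).json({ error: 'No tenant context' });
  }

  // Attach to request for application-level filtering
  req.tenantId = tenantId;

  next();
}

// Run tenant queries inside a transaction so the RLS setting is applied on the
// same pooled connection as the queries and is cleared on commit or rollback.
// (Assumes db is a Knex instance; adapt to your client's transaction API.)
function withTenant<T>(tenantId: string, fn: (trx: Knex.Transaction) => Promise<T>) {
  return db.transaction(async (trx) => {
    // set_config(..., true) is the parameterized equivalent of SET LOCAL,
    // so the tenant ID is never interpolated into the SQL string
    await trx.raw(`SELECT set_config('app.current_tenant_id', ?, true)`, [tenantId]);
    return fn(trx);
  });
}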

Layer 3: Query Wrapper

// All audit queries go through this wrapper
// It physically cannot query without a tenant_id
class TenantAuditStore {
  constructor(private tenantId: string) {
    if (!tenantId) throw new Error('TenantAuditStore requires tenantId');
  }

  async query(filter: AuditFilter): Promise<AuditEvent[]> {
    return db.auditEvents.findMany({
      where: {
        ...this.buildFilterWhere(filter),
        tenantId: this.tenantId, // Always enforced; set last so a filter can never override it
      },
    });
  }

  async record(event: Omit<AuditEvent, 'tenantId'>): Promise<void> {
    await db.auditEvents.create({
      data: {
        ...event,
        tenantId: this.tenantId, // Always set
      },
    });
  }

  // No method to query across tenants. By design.
}

Three layers. If one fails, the others catch it. Defense in depth, applied to tenant isolation.

The Retention Problem

Multi-tenant audit logging creates a unique retention challenge. Different customers might have different retention requirements. Customer A needs 12 months to satisfy their SOC 2 auditors. Customer B needs 7 years (financial regulation). Customer C wants logs deleted after 90 days (GDPR storage limitation).

// Per-tenant retention policies
interface TenantRetentionPolicy {
  tenantId: string;
  retentionDays: number;
  archiveAfterDays: number;
  deleteAfterDays: number;
}

async function enforceRetention(policy: TenantRetentionPolicy) {
  const archiveDate = new Date();
  archiveDate.setDate(archiveDate.getDate() - policy.archiveAfterDays);

  const deleteDate = new Date();
  deleteDate.setDate(deleteDate.getDate() - policy.deleteAfterDays);

  // Move old events to cold storage
  await archiveEvents(policy.tenantId, archiveDate);

  // Hard delete expired events
  await deleteEvents(policy.tenantId, deleteDate);
}

This gets complicated fast, especially if you're using a shared table. You need background jobs running per-tenant retention policies. And you need to make sure the deletion of Tenant A's old data doesn't impact query performance for Tenant B.

That's exactly why I built AuditKit. Building audit logging is one thing. Building it correctly for 200+ tenants with different retention requirements and guaranteed isolation is a completely different scale of problem. It handles the multi-tenant isolation so your team can stop burning engineering hours on it.
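If you do run retention yourself on a shared table, two details help: compute both cutoffs from a single fixed clock reading, and delete in bounded batches so one tenant's cleanup does not hold locks for long. The following is a minimal sketch under those assumptions. retentionCutoffs, deleteInBatches, and the DeleteBatch hook are hypothetical names, and the hook stands in for a query such as a tenant-scoped DELETE with a row limit.

```typescript
// Mirrors the per-tenant policy above, limited to the fields this sketch uses.
interface RetentionPolicy {
  tenantId: string;
  archiveAfterDays: number;
  deleteAfterDays: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Computes both cutoffs from one "now" so the archive and delete steps agree.
function retentionCutoffs(
  policy: RetentionPolicy,
  now: Date,
): { archiveBefore: Date; deleteBefore: Date } {
  // Deleting before archiving would destroy data that was meant to be kept cold.
  if (policy.archiveAfterDays < 0 || policy.deleteAfterDays < policy.archiveAfterDays) {
    throw new Error(`Invalid retention policy for tenant ${policy.tenantId}`);
  }
  return {
    archiveBefore: new Date(now.getTime() - policy.archiveAfterDays * MS_PER_DAY),
    deleteBefore: new Date(now.getTime() - policy.deleteAfterDays * MS_PER_DAY),
  };
}

// Hypothetical data-access hook: deletes up to `limit` of the tenant's events
// older than `before` and resolves to the number of rows removed.
type DeleteBatch = (tenantId: string, before: Date, limit: number) => Promise<number>;

// Repeats small deletes until a batch comes back short, which means nothing is left.
async function deleteInBatches(
  tenantId: string,
  before: Date,
  deleteBatch: DeleteBatch,
  batchSize = 1000,
): Promise<number> {
  if (batchSize < 1) throw new Error('batchSize must be at least 1');
  let total = 0;
  for (;;) {
    const deleted = await deleteBatch(tenantId, before, batchSize);
    total += deleted;
    if (deleted < batchSize) return total;
  }
}
```

The batch size is a tuning knob, not a recommendation: smaller batches are gentler on other tenants' queries but take longer to finish.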

The Testing Challenge

How do you test that tenant isolation actually works? Unit tests can verify that your queries include tenant_id. But they can't catch the subtle bugs that cause cross-tenant leaks.

// Integration test for tenant isolation
import { randomUUID } from 'node:crypto';

describe('Audit Event Isolation', () => {
  it('tenant A cannot see tenant B events', async () => {
    // Fresh UUIDs match the tenant_id column type and keep test runs independent
    const tenantA = randomUUID();
    const tenantB = randomUUID();

    // Create events for both tenants
    await createAuditEvent({ tenantId: tenantA, eventType: 'user.login' });
    await createAuditEvent({ tenantId: tenantB, eventType: 'user.login' });

    // Query as tenant A
    const store = new TenantAuditStore(tenantA);
    const events = await store.query({});

    // Should only see tenant A's events
    expect(events).toHaveLength(1);
    expect(events.every(e => e.tenantId === tenantA)).toBe(true);
  });

  it('cannot query without tenant context', async () => {
    expect(() => new TenantAuditStore('')).toThrow();
  });
});

We added these integration tests after the incident. They should have been there from the start. OWASP's Web Security Testing Guide treats authorization checks, including insecure direct object references, as standard test cases, and cross-tenant access belongs in that same security test suite.

Lessons Learned

Never trust client-supplied tenant identifiers. Always derive tenant context from the authenticated session.
Use database-level isolation (RLS) as a safety net, not just application-level filtering.
Test tenant isolation explicitly. Don't assume your query patterns are correct.
Plan for per-tenant retention policies from the start.
Audit the audit system. If your audit logs have a security bug, the irony will not be lost on your customers.

The data leak we had was caught quickly and affected a small number of events. We disclosed it to the affected customers, did a full incident review, and rebuilt the system properly. But it could have been much worse.

Multi-tenant isolation in audit logging isn't optional. It's the whole point. And it deserves more engineering attention than a tenant_id column and a WHERE clause.

DEV Community