How I Built a Shadow AI Governance Platform on a DynamoDB Single-Table Design

#aws #database #nextjs #security

Unsanctioned, locally executed AI models ("Shadow AI") running on corporate workstations via Ollama, llama.cpp, or LM Studio have created a large blind spot for enterprise security teams.

Because local inference runs entirely on the endpoint and sends no prompts over the network, network-layer proxies (like CASBs) have nothing to inspect. And because the tools are legitimately signed binaries, kernel-level EDRs (like CrowdStrike Falcon) typically don't raise alerts when those tools read sensitive source code or spreadsheets offline.

To solve this, I built LifecycleZero, a B2B SaaS platform that monitors local AI engines on endpoints, streams telemetry, and isolates compromised hosts.

Here is the database and system architecture blueprint showing how to design this platform for performance and scale.

1. The Database Blueprint: Single-Table DynamoDB Design

We consolidated all B2B data entities (Tenant Settings, Employees, Hardware Assets, Telemetry Event Streams, and Audit Custody Logs) into a single physical table (LifecycleZero_Assets).

Entity Relationship and Index Mapping

| Entity | PK (Partition Key) | SK (Sort Key) | GSI1PK (Index 1) | GSI1SK (Index 1) | GSI2PK (Index 2, Sparse) |
| --- | --- | --- | --- | --- | --- |
| Tenant | `TENANT#<TenantId>` | `METADATA` | - | - | - |
| Asset | `TENANT#<TenantId>` | `ASSET#<AssetId>` | `EMP#<EmployeeId>` | `STATE#<Status>` | `TENANT#<TenantId>#ACTION_REQ` |
| Telemetry | `TENANT#<TenantId>#TELEMETRY#SHARD#<0-9>` | `TELEMETRY#<AssetId>#<TS>` | `ASSET#<AssetId>` | `DATE#<TS>` | `TENANT#<TenantId>#ALERT#<Risk>` |
| Audit Log | `TENANT#<TenantId>` | `AUDIT#<AssetId>#<TS>` | - | - | - |
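
To make the key layout concrete, here is a minimal sketch of key builders that follow the table. The function and type names (`tenantKey`, `assetKey`, `auditKey`, `telemetryKey`, `Keys`) are hypothetical helpers for illustration, not part of the LifecycleZero codebase.

```typescript
// Hypothetical key builders that mirror the PK/SK patterns in the table.
type Keys = { PK: string; SK: string };

const tenantKey = (tenantId: string): Keys => ({
  PK: `TENANT#${tenantId}`,
  SK: "METADATA",
});

const assetKey = (tenantId: string, assetId: string): Keys => ({
  PK: `TENANT#${tenantId}`,
  SK: `ASSET#${assetId}`,
});

const auditKey = (tenantId: string, assetId: string, ts: string): Keys => ({
  PK: `TENANT#${tenantId}`,
  SK: `AUDIT#${assetId}#${ts}`,
});

// Telemetry lives under a sharded partition key (shard is 0-9 in the table).
const telemetryKey = (
  tenantId: string,
  shard: number,
  assetId: string,
  ts: string
): Keys => ({
  PK: `TENANT#${tenantId}#TELEMETRY#SHARD#${shard}`,
  SK: `TELEMETRY#${assetId}#${ts}`,
});
```

Because tenant metadata, assets, and audit logs share the same partition key, a single Query on `PK = TENANT#<TenantId>` with `begins_with(SK, "ASSET#")` returns only that tenant's assets, and `begins_with(SK, "AUDIT#<AssetId>#")` returns one asset's custody trail in timestamp order.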

Key Database Highlights

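Tenant Isolation by Key Prefix: Every partition key is prefixed with TENANT#<TenantId>, where the tenant ID is mapped directly from the organization in the Clerk B2B authentication session. This is logical isolation enforced by key design rather than by cryptography, so it holds only as long as the prefix is always derived from the session and never from client input.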
Sparse Indexing (GSI2) for Dashboards: 99.9% of telemetry events are benign, and indexing every heartbeat would bloat storage and query costs. DynamoDB only copies an item into a GSI when the item carries that index's key attribute, so we set GSI2PK only when an event is flagged as CRITICAL or WARNING. The dashboard runs a Query against GSI2 directly and retrieves active incidents in milliseconds, because the index contains only the alert items rather than the full telemetry stream.
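Telemetry Write Sharding: A fleet of 10,000+ endpoints sending a heartbeat every 5 seconds produces roughly 2,000 writes per second. Aimed at a single partition key, that exceeds DynamoDB's per-partition limit of 1,000 WCU per second (assuming items of 1 KB or less, at 1 WCU each) and causes throttling. We spread raw telemetry across 10 logical shards (PK = TENANT#<TenantId>#TELEMETRY#SHARD#<0-9>) by choosing the shard suffix with a hash or random function, which brings each shard down to roughly 200 writes per second. DynamoDB itself decides how those keys map to physical partitions.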
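
The sharding and sparse-index highlights can be sketched together in one item builder. This is a minimal sketch under stated assumptions: the article does not say which hash is used, so `shardFor` uses a 32-bit FNV-1a hash of the asset ID as a stand-in (a random suffix per write would spread load equally well), and the `BENIGN` label and the `risk` attribute name are hypothetical.

```typescript
const SHARD_COUNT = 10;

// Assumption: FNV-1a over the asset ID. Any uniform hash or random pick works.
function shardFor(assetId: string): number {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < assetId.length; i++) {
    h ^= assetId.charCodeAt(i);
    h = Math.imul(h, 0x01000193); // multiply by the FNV prime, 32-bit wraparound
  }
  return (h >>> 0) % SHARD_COUNT; // force unsigned before taking the modulus
}

// "BENIGN" is a hypothetical label for events that raise no alert.
type Risk = "BENIGN" | "WARNING" | "CRITICAL";

function buildTelemetryItem(
  tenantId: string,
  assetId: string,
  ts: string,
  risk: Risk
): Record<string, string> {
  const item: Record<string, string> = {
    PK: `TENANT#${tenantId}#TELEMETRY#SHARD#${shardFor(assetId)}`,
    SK: `TELEMETRY#${assetId}#${ts}`,
    GSI1PK: `ASSET#${assetId}`,
    GSI1SK: `DATE#${ts}`,
    risk,
  };
  // Sparse index: only alert items carry GSI2PK, so only they appear in GSI2.
  if (risk !== "BENIGN") {
    item["GSI2PK"] = `TENANT#${tenantId}#ALERT#${risk}`;
  }
  return item;
}
```

Hashing the asset ID pins each endpoint to one shard, so a read for a single asset touches one partition key. The trade-off is that a tenant-wide read of raw telemetry has to query all 10 shards and merge the results.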

2. Transaction Integrity & ACID Containment

When a security administrator quarantines a host, consistency is critical. If a host status is updated to ISOLATED but the audit custody log fails to write, we breach compliance standards.

We solved this using DynamoDB’s TransactWriteItems to execute the isolation command:

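Conditional Asset Update: A single Update operation carries a ConditionExpression that verifies the asset exists in the partition and that its current status is not already ISOLATED, then sets Status to ISOLATED. The condition rides on the Update itself because DynamoDB does not allow a separate ConditionCheck and an Update on the same item within one transaction.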
Immutable Audit Log: Appends a chronological custody log (SK = AUDIT#<AssetId>#<Timestamp>) detailing the operator ID, action type, and compliance justification.

If either operation fails, DynamoDB cancels the whole transaction and writes nothing, so the asset status and the custody log can never diverge.
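
As a rough illustration, the isolation command could be assembled as below. This is a sketch, not the platform's actual code: the function name `buildIsolationTransaction`, the `IsolationRequest` shape, and the attribute names (`Status`, `OperatorId`, `Action`, `Justification`) are assumptions, and refreshing `GSI1SK` alongside the status is inferred from the `STATE#<Status>` pattern in the table. The returned object uses plain JavaScript values in the shape of a TransactWriteItems request.

```typescript
interface IsolationRequest {
  tenantId: string;
  assetId: string;
  operatorId: string;
  justification: string;
  timestamp: string; // e.g. an ISO 8601 string, so audit entries sort by time
}

function buildIsolationTransaction(tableName: string, req: IsolationRequest) {
  const pk = `TENANT#${req.tenantId}`;

  // Update with an inline condition: asset must exist and not be ISOLATED yet.
  const update = {
    TableName: tableName,
    Key: { PK: pk, SK: `ASSET#${req.assetId}` },
    UpdateExpression: "SET #status = :isolated, GSI1SK = :state",
    ConditionExpression: "attribute_exists(PK) AND #status <> :isolated",
    // "Status" is a DynamoDB reserved word, hence the #status alias.
    ExpressionAttributeNames: { "#status": "Status" },
    ExpressionAttributeValues: {
      ":isolated": "ISOLATED",
      ":state": "STATE#ISOLATED",
    },
  };

  // Append-only audit entry: refuse to overwrite an existing log item.
  const put = {
    TableName: tableName,
    Item: {
      PK: pk,
      SK: `AUDIT#${req.assetId}#${req.timestamp}`,
      OperatorId: req.operatorId,
      Action: "ISOLATE",
      Justification: req.justification,
    },
    ConditionExpression: "attribute_not_exists(SK)",
  };

  return { TransactItems: [{ Update: update }, { Put: put }] };
}
```

With the AWS SDK for JavaScript v3, an object like this can be passed to `TransactWriteCommand` from `@aws-sdk/lib-dynamodb`. If the asset is missing or already isolated, the call rejects with a `TransactionCanceledException` and neither item is written.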

3. Decoupling the Ingest Pipeline with SQS

Direct database writes from thousands of concurrent endpoint agents can exceed provisioned write capacity and get throttled.

Our Next.js API route acts as the ingest gateway: it accepts telemetry payloads, checks the device's isolation status, and immediately pushes each payload to an Amazon SQS queue before returning 202 Accepted in under 50 ms. A TypeScript worker daemon pulls events asynchronously using SQS long polling and runs AI-powered risk evaluations.
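
The ingest path can be sketched as a small handler with its dependencies injected. Everything named here (`ingestTelemetry`, `TelemetryPayload`, `IngestDeps`) is hypothetical, and the article does not say what happens to telemetry from an already isolated device, so the 403 response is only a placeholder assumption.

```typescript
interface TelemetryPayload {
  tenantId: string;
  assetId: string;
  ts: string;
  body: unknown;
}

// Dependencies are injected so the handler stays independent of any SDK.
interface IngestDeps {
  isIsolated(tenantId: string, assetId: string): Promise<boolean>;
  enqueue(messageBody: string): Promise<void>;
}

async function ingestTelemetry(
  payload: TelemetryPayload,
  deps: IngestDeps
): Promise<{ status: number }> {
  // Assumption: isolated devices are rejected rather than queued.
  if (await deps.isIsolated(payload.tenantId, payload.assetId)) {
    return { status: 403 };
  }
  // Hand off to the queue and acknowledge. No database write on this path.
  await deps.enqueue(JSON.stringify(payload));
  return { status: 202 };
}
```

In a real deployment, `enqueue` would wrap an SQS `SendMessage` call and the worker would call `ReceiveMessage` with `WaitTimeSeconds` set (up to 20 seconds) to enable long polling. Keeping the handler free of database writes is what lets it acknowledge quickly while the queue absorbs bursts.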