What I Learned Building a Multi-Tenant Integration Layer (The Hard Way)

You’ve been asked to build integrations for your platform. Seems straightforward: call some APIs, normalize the data, display it in the UI. A few weeks of work, tops.

Except it’s not a few weeks. And it’s not straightforward.

I’ve spent a few years building integration infrastructure for security platforms. Here’s everything I wish someone had told me before I started.

The Gap Between POC and Production

A proof-of-concept integration is easy. Read the docs, make some calls, parse the response. Done in a day.

Production is a different beast. Here’s the actual checklist:

Authentication Hell

Every vendor does auth differently:

Vendor A: OAuth 2.0 with refresh

headers = {"Authorization": f"Bearer {access_token}"}

Vendor B: API key in header

headers = {"X-API-Key": api_key}

Vendor C: API key as query param (yes, really)

url = f"{base_url}/endpoint?api_key={api_key}"

Vendor D: Custom signature with timestamp

signature = hmac.new(secret, f"{timestamp}{method}{path}".encode(), 'sha256')
headers = {"X-Signature": signature.hexdigest(), "X-Timestamp": timestamp}

And you need to handle token refresh without interrupting syncs. Plus store credentials securely for hundreds of customer connections. Plus handle IP allowlisting for vendors that require it.
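
Here is a minimal sketch of the token-refresh piece, assuming a hypothetical fetch_new_token() callable that hits the vendor's token endpoint and returns (access_token, expires_in_seconds): refresh proactively, behind a lock, so concurrent syncs never see an expired token.

import threading
import time

class TokenManager:
    """Refreshes an OAuth access token before it expires (sketch, not a vendor SDK)."""

    def __init__(self, fetch_new_token, refresh_margin=60):
        self._fetch = fetch_new_token   # hypothetical: returns (access_token, expires_in_seconds)
        self._margin = refresh_margin   # refresh this many seconds before expiry
        self._lock = threading.Lock()
        self._token = None
        self._expires_at = 0.0

    def get_token(self):
        # Fast path: token is still comfortably valid
        if self._token and time.time() < self._expires_at - self._margin:
            return self._token
        # Slow path: refresh under a lock so concurrent syncs don't stampede the token endpoint
        with self._lock:
            if not self._token or time.time() >= self._expires_at - self._margin:
                token, expires_in = self._fetch()
                self._token = token
                self._expires_at = time.time() + expires_in
        return self._token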

Rate Limiting is Harder Than You Think

Every API has rate limits. The fun part is they’re all different:

The nice vendor: returns 429 with retry-after

if response.status_code == 429:
    retry_after = int(response.headers.get('Retry-After', 60))
    time.sleep(retry_after)

The less nice vendor: just returns 500 when you hit the limit

Good luck figuring out why

The enterprise vendor: different limits per endpoint

/users: 100 req/min

/alerts: 10 req/min

/export: 1 req/hour

When you’re pulling data for hundreds of tenants, rate limits become a constant constraint. You need:

- Per-tenant rate limit tracking

- Intelligent request queuing

- Exponential backoff with jitter

- Circuit breakers so one failing integration doesn’t cascade
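
Here is a minimal sketch of the backoff piece, assuming a zero-argument send() callable that performs one HTTP request and returns a requests-style response (with .status_code and .headers):

import random
import time

def request_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry 429s and transient 5xxs with exponential backoff plus full jitter."""
    for attempt in range(max_attempts):
        response = send()
        if response.status_code not in (429, 500, 502, 503, 504):
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None:
            # The nice vendors tell you how long to wait; honor it
            delay = float(retry_after)
        else:
            # Everyone else: exponential backoff with full jitter, so hundreds of
            # tenants don't all retry at the same instant
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
        time.sleep(delay)
    return response  # retries exhausted; the caller decides what to do next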

Pagination Nightmares

Offset pagination (simple but inefficient)

for offset in range(0, total, page_size):
    response = client.get(f"/items?offset={offset}&limit={page_size}")

Cursor pagination (better, but cursors expire)

cursor = None
while True:
    url = f"/items?limit={page_size}"
    if cursor:
        url += f"&cursor={cursor}"
    response = client.get(url)
    cursor = response.json().get('next_cursor')
    if not cursor:
        break

Link header pagination (RFC 5988)

while url:
    response = client.get(url)
    url = response.links.get('next', {}).get('url')

The vendor that changes pagination between API versions

and doesn't document it

The Normalization Problem

This is where it gets really fun. Here’s the same concept, a security alert, across three vendors:

// Illustrative shapes (actual field names vary by vendor and API version)

// CrowdStrike: calls it a "detection" (via /detects/ endpoints)
{
  "detection_id": "ldt:abc123...",
  "max_severity": 4,
  "created_timestamp": "2024-01-15T10:30:00Z",
  "device": { "hostname": "..." }
}

// SentinelOne: calls it a "threat" (via /threats endpoint)
{
  "id": "123456789",
  "threatInfo": {
    "classification": "Malware",
    "confidenceLevel": "high"
  },
  "agentRealtimeInfo": { "agentComputerName": "..." },
  "createdAt": "2024-01-15T10:30:00.000Z"
}

// Microsoft Defender: calls it an "alert" (incidents are collections of alerts)
{
  "alertId": "da637292082891366787_1234567890",
  "severity": "high",
  "createdDateTime": "2024-01-15T10:30:00.0000000Z",
  "evidence": [{ "deviceDnsName": "..." }]
}

Three different structures for the same concept. Different field names, different severity formats (number vs string), different timestamp formats, different nesting structures.

You need to map all of these to a single normalized schema:

Your normalized alert schema

from dataclasses import dataclass
from datetime import datetime

@dataclass
class NormalizedAlert:
    id: str
    severity: str        # "critical", "high", "medium", "low"
    timestamp: datetime
    hostname: str
    source: str
    raw_data: dict

Multiply this by 40 vendors across 8 security categories. That’s a lot of mapping logic.
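
For a sense of what that mapping logic looks like, here is a hedged sketch of one vendor-specific mapper against the illustrative CrowdStrike shape above; the severity table is my assumption, not the vendor's documented scale, and it relies on the NormalizedAlert dataclass defined earlier.

from datetime import datetime

# Illustrative: map a CrowdStrike-style numeric severity onto the normalized scale
_CS_SEVERITY = {5: "critical", 4: "high", 3: "medium", 2: "low", 1: "low"}

def normalize_crowdstrike(detection: dict) -> NormalizedAlert:
    return NormalizedAlert(
        id=detection["detection_id"],
        severity=_CS_SEVERITY.get(detection.get("max_severity"), "low"),
        timestamp=datetime.fromisoformat(
            detection["created_timestamp"].replace("Z", "+00:00")
        ),
        hostname=detection.get("device", {}).get("hostname", ""),
        source="crowdstrike",
        raw_data=detection,
    )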

The Multi-Tenant Complexity

All of the above gets exponentially harder with multiple tenants.

Tenant Isolation

This is the one that keeps me up at night:

WRONG: Shared cache without tenant scoping

cache.set("crowdstrike_detections", detections)

RIGHT: Tenant-scoped everything

cache.set(f"tenant:{tenant_id}:crowdstrike:detections", detections)

WRONG: Logging raw data

logger.error(f"API failed: {response.json()}")

RIGHT: Scrubbed logging

logger.error(f"API failed for tenant {tenant_id}: {response.status_code}")

Shared rate limit pools, shared caches, shared logs: any of these can leak data between tenants if you’re not careful.

Credential Storage

You’re storing API credentials for hundreds of connections. This is a high-value target:

Minimum requirements:

- Encrypted at rest (AES-256 or better)

- Encrypted in transit (TLS 1.2+)

- Access controls (which service can access which creds)

- Audit logging (who accessed what, when)

- Key rotation support

- HSM/KMS integration for key management

If you’re building a security product (like a GRC platform), your credential storage needs to pass auditor scrutiny.
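
Here is a minimal sketch of the encryption-at-rest piece, using Python’s cryptography library as a stand-in (Fernet is AES-128-CBC with HMAC-SHA256, so treat it as a placeholder for whatever your KMS or HSM actually provides; CredentialStore and its methods are hypothetical names):

from cryptography.fernet import Fernet

class CredentialStore:
    """Tenant-scoped credential storage sketch.

    In production the data-encryption key comes from a KMS/HSM and is rotated;
    it never lives in application config.
    """

    def __init__(self, key: bytes, backend=None):
        self._fernet = Fernet(key)
        self._backend = backend if backend is not None else {}  # stand-in for a real database

    def store(self, tenant_id: str, integration: str, secret: str) -> None:
        # Tenant-scoped key, encrypted value: the raw credential is never persisted
        self._backend[f"{tenant_id}:{integration}"] = self._fernet.encrypt(secret.encode())

    def fetch(self, tenant_id: str, integration: str) -> str:
        # A real implementation would emit an audit log entry here (who, what, when)
        return self._fernet.decrypt(self._backend[f"{tenant_id}:{integration}"]).decode()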

Scaling

One customer with one integration is manageable. A hundred customers with ten integrations each is a thousand concurrent connections:

Tenants: 100
Integrations per tenant: 10
Sync frequency: every 15 minutes
API calls per sync: ~50

= 1,000 integrations
= 4,000 sync jobs per hour
= 200,000 API calls per hour

Your architecture needs to handle this without falling over. Queue-based processing, worker pools, connection pooling, database optimization. It just adds up.
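
Here is a minimal sketch of the queue-plus-worker-pool shape, assuming a hypothetical run_sync(tenant_id, integration) that pulls and normalizes one tenant’s data; in practice this would sit on Celery, SQS, or similar rather than an in-process queue.

import queue
import threading

def start_sync_workers(run_sync, num_workers=20):
    """Bounded worker pool; run_sync(tenant_id, integration) is a hypothetical callable."""
    jobs = queue.Queue()

    def worker():
        while True:
            tenant_id, integration = jobs.get()
            try:
                run_sync(tenant_id, integration)
            except Exception:
                pass  # isolate failures per job; real code logs, alerts, and re-queues
            finally:
                jobs.task_done()

    # The pool size caps concurrent vendor calls no matter how many tenants you add
    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()
    return jobs

# Scheduler side: every 15 minutes, enqueue one job per (tenant, integration) pair
# jobs.put((tenant_id, integration))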

The Maintenance Burden

Here’s the part nobody warns you about: building is maybe 30% of the work. Maintenance is 70%.

API Versioning

Your code, working fine

response = client.get("/v1/detections")

Vendor announcement: "v1 deprecated, migrate to v2 by March"

v2 changes:

- Different auth flow

- Different pagination

- Different response schema

- Some fields renamed

- Some fields removed

- New required parameters

Your weekend: gone

Multiply by 40 integrations. You’re dealing with API changes constantly.

Silent Breaking Changes

The worst kind:

What your code expects

device = detection.get("device", {})
hostname = device.get("hostname") # Returns: "workstation-1"

What the API started returning (no announcement)

hostname = device.get("hostname") # Returns: ["workstation-1"]

Your normalization: quietly broken

Customer data: silently wrong

Time to discover: days or weeks
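
One cheap defense is to validate shapes at the normalization boundary and fail loudly instead of writing bad data downstream; a minimal sketch against the hostname example above:

def extract_hostname(detection: dict) -> str:
    hostname = detection.get("device", {}).get("hostname")
    if isinstance(hostname, list):  # the vendor quietly switched to a list
        hostname = hostname[0] if hostname else None
    if not isinstance(hostname, str):
        # Surface the drift immediately instead of silently corrupting customer data
        raise ValueError(f"unexpected hostname shape: {type(hostname).__name__}")
    return hostname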

The Real Cost

I’ve seen teams underestimate this consistently:

Initial build:

- 20 integrations × 2-3 weeks each = 40-60 weeks of engineering

- Plus common infrastructure (auth, rate limiting, queuing) = 8-12 weeks

- Plus testing, deployment, monitoring = 4-8 weeks

- Total: 12-18 months for a small team

Ongoing maintenance:

- 2+ FTEs just to keep integrations running

- Every API change = regression testing across affected tenants

- Every new integration request = another month of work

Opportunity cost:

- Every hour on integrations = an hour not spent on your actual product

- I’ve seen teams lose 40-50% of engineering capacity to integration work

The Alternative

At some point, you have to ask: is building integration infrastructure actually your core competency?

If you’re building a GRC platform, your value is in compliance logic, risk analysis, and control mapping, not in parsing CrowdStrike’s pagination quirks.

The “buy” option today isn’t just Zapier-style workflow tools. There are now category-level unified APIs that handle all the complexity above. Here’s what using one looks like:

// Using Unizo's SDK (from docs.unizo.ai/docs/sdks/overview)
// npm install @unizo/sdk

import { Unizo } from '@unizo/sdk';

const client = new Unizo({
  apiKey: process.env.UNIZO_API_KEY
});

// One call - normalized vulnerabilities from ALL connected scanners
// (Qualys, Tenable, Snyk, etc. - doesn't matter which your customer uses)
const vulnerabilities = await client.security.vulnerabilities.list({
  severity: 'high',
  status: 'open'
});

// Iterate over normalized results
vulnerabilities.forEach(vuln => {
  console.log(`${vuln.id}: ${vuln.title} (${vuln.severity})`);
});

Or if you prefer raw REST:

// Direct REST call to the same endpoint
const response = await fetch(
  'https://api.unizo.ai/v1/security/vulnerabilities?severity=high&status=open',
  {
    headers: {
      'Authorization': `Bearer ${process.env.UNIZO_API_KEY}`,
      'Content-Type': 'application/json'
    }
  }
);

const vulnerabilities = await response.json();

The SDK handles auth, retries, rate limits, and pagination. You get:

One API call to get normalized data across all EDR/VMS/Identity vendors

One webhook endpoint for real-time events from all sources

One auth flow (Connect UI) for your customers to connect any tool

Vendor API changes handled upstream, not in your codebase

The Decision Framework

Before you decide to build:

Count your integrations: How many do you need now? In a year?

Calculate the cost: Fully-loaded engineer cost × months of work

Factor in maintenance: 2+ FTEs ongoing, forever

Consider opportunity cost: What else could those engineers build?

Then compare to embedding existing infrastructure. The math usually favors buying unless integrations are literally your core product.
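
A back-of-envelope version of that math; every number below is an assumption you should replace with your own:

# Every figure here is an illustrative assumption; substitute your own numbers
engineer_monthly_cost = 20_000   # fully loaded, per engineer
build_months = 15                # midpoint of the 12-18 month estimate above
build_team_size = 3
maintenance_ftes = 2

initial_build = engineer_monthly_cost * build_months * build_team_size
annual_maintenance = engineer_monthly_cost * 12 * maintenance_ftes

print(f"Initial build:      ${initial_build:,}")       # $900,000
print(f"Maintenance / year: ${annual_maintenance:,}")  # $480,000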

TL;DR

POC integrations are easy. Production integrations are 10x harder.

Multi-tenancy adds another 5x complexity.

Maintenance is 70% of the work, and it never ends.

The real cost isn’t just engineering time. It’s opportunity cost.

Unless integrations are your core product, you probably shouldn’t build from scratch.
