Prince Raj

Building a Multi-Tenant Observability Platform with SigNoz + OneUptime

Modern SaaS teams need deep observability without sacrificing tenant isolation or compliance. This post explains how we built a multi-tenant monitoring platform that routes logs, metrics, and traces to isolated SigNoz and OneUptime stacks, enforces strong security controls, and aligns with SOC 2 and ISO 27001 practices. The result: each customer gets a dedicated monitoring experience while we keep the operational footprint lean and repeatable.

1) Architecture Overview

We designed a hub-and-spoke model:

  • A central monitoring VM hosts the observability stack.
  • Each tenant has either:
    • a fully isolated SigNoz stack (frontend, query, collector, ClickHouse), or
    • a shared stack with strict routing based on a tenant identifier (for lightweight tenants).
  • Each application VM runs an OpenTelemetry (OTEL) Collector that tails PM2 logs, receives OTLP traces/metrics, and forwards to the monitoring VM.

This gives us a consistent ingestion pipeline across all tenants while keeping isolation the default wherever a tenant needs it.

(Diagram: the ingestion pipeline, from tenant application VMs through their OTEL Collectors to the per-tenant SigNoz and OneUptime stacks on the monitoring VM.)
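
Since the app-VM collectors do the heavy lifting here, a minimal sketch of their wiring helps make the flow concrete (the endpoint, PM2 log path, and names below are illustrative, not our exact config):

receivers:
  otlp:
    protocols:
      grpc:
      http:
  filelog/pm2:
    include: [ /home/app/.pm2/logs/*.log ]

exporters:
  otlphttp/monitoring:
    endpoint: https://signoz.tenant-a.example:4318

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/monitoring]
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/monitoring]
    logs:
      receivers: [otlp, filelog/pm2]
      exporters: [otlphttp/monitoring]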

2) Tenant Segregation Strategy

We support two isolation modes:

1) Full isolation per tenant

  • Dedicated SigNoz stack per tenant
  • Separate ClickHouse instance
  • Separate OTEL collector upstream
  • Strongest data isolation

2) Logical isolation on a shared stack

  • Single SigNoz + ClickHouse
  • Routing by business_id (header + resource attribute)
  • Good for smaller tenants

We default to full isolation for regulated or high-traffic customers.

Key routing headers:

  • x-business-id for SigNoz
  • x-oneuptime-token for OneUptime
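
One way to attach these headers is on each tenant's OTEL exporters rather than in application code. A rough sketch (the endpoints and the ONEUPTIME_TOKEN variable are assumptions for illustration):

exporters:
  otlphttp/signoz:
    endpoint: https://signoz.tenant-a.example:4318
    headers:
      x-business-id: ${env:BUSINESS_ID}
  otlphttp/oneuptime:
    endpoint: https://status.tenant-a.example:4318
    headers:
      x-oneuptime-token: ${env:ONEUPTIME_TOKEN}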

3) Provisioning and Hardening the Monitoring VM

We treat the monitoring VM as a controlled production system:

  • SSH keys only, no password auth
  • Minimal inbound ports (22, 80, 443, 4317/4318)
  • Nginx as a single TLS ingress
  • Docker Compose for immutable service layout

Example provisioning steps (high-level):

# SSH key-based access only
az vm user update --resource-group <rg> --name <vm> --username <user> --ssh-key-value "<pubkey>"

# Open required ports (restrict SSH to trusted IPs)
az network nsg rule create ... --destination-port-ranges 22 80 443 4317 4318
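
For fully isolated tenants, the Compose layout on the monitoring VM looks roughly like this (service and image names are illustrative; the official SigNoz compose files remain the source of truth):

services:
  clickhouse-tenant-a:
    image: clickhouse/clickhouse-server:latest
    volumes:
      - clickhouse-tenant-a:/var/lib/clickhouse
  signoz-query-tenant-a:
    image: signoz/query-service:latest
    depends_on: [clickhouse-tenant-a]
  signoz-frontend-tenant-a:
    image: signoz/frontend:latest
    depends_on: [signoz-query-tenant-a]
  signoz-otel-collector-tenant-a:
    image: signoz/signoz-otel-collector:latest
    depends_on: [clickhouse-tenant-a]

volumes:
  clickhouse-tenant-a: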

4) Multi-Tenant Routing at the Edge

We use Nginx maps to route traffic by hostname for both UI and OTLP ingestion:

map $host $signoz_collector_upstream {
  signoz.tenant-a.example  signoz-otel-collector-tenant-a;
  signoz.tenant-b.example  signoz-otel-collector-tenant-b;
  default                  signoz-otel-collector-default;
}

server {
  listen 4318;

  # proxy_pass with a variable resolves the upstream at request time,
  # so Nginx needs a resolver (127.0.0.11 is Docker's embedded DNS)
  resolver 127.0.0.11 valid=10s;

  location / {
    # forward OTLP/HTTP to the tenant's collector on its OTLP port
    proxy_pass http://$signoz_collector_upstream:4318;
    proxy_set_header Host $host;
  }
}

This gives us clean DNS-based tenant routing while keeping a single IP.
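
A quick way to verify the routing is to hit the shared IP with different Host headers (the IP and hostnames below are placeholders):

# An empty-but-valid OTLP payload should come back 200 from tenant A's collector
curl -s -o /dev/null -w '%{http_code}\n' \
  -X POST http://203.0.113.10:4318/v1/traces \
  -H 'Host: signoz.tenant-a.example' \
  -H 'Content-Type: application/json' \
  -d '{"resourceSpans":[]}'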

5) Collector Configuration: Logs, Traces, Metrics

Each tenant VM runs OTEL Collector with filelog + OTLP. We parse PM2 logs (JSON wrapper), normalize severity, and attach resource fields for fast filtering in SigNoz.

Core fields we enforce:

  • severity_text (info/warn/error)
  • service.name
  • deployment.environment
  • host.name
  • business_id

Minimal config excerpt:

processors:
  resourcedetection:
    detectors: [system]
  resource:
    attributes:
      - key: business_id
        value: ${env:BUSINESS_ID}
        action: upsert

  transform/logs:
    log_statements:
      - context: log
        statements:
          - set(severity_text, attributes["severity"]) where attributes["severity"] != nil

This makes severity_text, service.name, and host.name searchable immediately in SigNoz.
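
For completeness, the PM2 side of the same config: the filelog receiver tails the files and operators unwrap the JSON wrapper (paths and field names are illustrative, since the exact JSON shape depends on how PM2 logging is set up):

receivers:
  filelog/pm2:
    include:
      - /home/app/.pm2/logs/*-out.log
      - /home/app/.pm2/logs/*-error.log
    operators:
      # unwrap the JSON line into attributes (message, severity, timestamp, ...)
      - type: json_parser
        parse_from: body
        parse_to: attributes
      # promote the wrapped message back to the log body
      - type: move
        from: attributes.message
        to: body
        if: attributes.message != nil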

6) Client-Side Integration (Apps)

We use a consistent OTEL pattern across backend, web, and agent services:

  • Backend: OTLP exporter for traces
  • Web: browser traces forwarded to backend (which re-exports)
  • Agents: OTEL SDK configured with OTEL_EXPORTER_OTLP_ENDPOINT

Typical environment variables:

BUSINESS_ID=tenant-a
SIGNOZ_ENDPOINT=http://signoz.tenant-a.example:4318
ONEUPTIME_ENDPOINT=http://status.tenant-a.example:4318
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://127.0.0.1:4318/v1/traces
DEPLOY_ENV=production
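
For the backend piece, a minimal bootstrap sketch (assuming a Node.js service, which PM2 implies; the service name and attribute values are illustrative):

// tracing.ts -- load before the rest of the app (e.g. via node -r after compiling)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'backend-api',
    'deployment.environment': process.env.DEPLOY_ENV ?? 'production',
    'business_id': process.env.BUSINESS_ID ?? 'unknown',
  }),
  // exports to the local collector, which forwards to the tenant's SigNoz stack
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, // http://127.0.0.1:4318/v1/traces
  }),
});

sdk.start();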

7) DNS and TLS (Public UX)

Each tenant gets:

  • signoz.<tenant-domain> for the SigNoz UI
  • status.<tenant-domain> for OneUptime

We terminate TLS at Nginx with real certificates (ACME/Let's Encrypt):

sudo certbot --nginx -d signoz.tenant-a.example -d status.tenant-a.example

We keep per-tenant TLS policies aligned with strong ciphers and HSTS.
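
The resulting server blocks look roughly like this (certificate paths are the certbot defaults; the frontend upstream name and port are illustrative):

server {
  listen 443 ssl;
  server_name signoz.tenant-a.example;

  ssl_certificate     /etc/letsencrypt/live/signoz.tenant-a.example/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/signoz.tenant-a.example/privkey.pem;
  ssl_protocols       TLSv1.2 TLSv1.3;
  ssl_prefer_server_ciphers on;

  # HSTS for the tenant-facing UI
  add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;

  location / {
    proxy_pass http://signoz-frontend-tenant-a:3301;
    proxy_set_header Host $host;
  }
}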

8) Verification and Observability QA

We validate the pipeline with:

  • OTEL health endpoint (/health on collector)
  • Test traffic from backend
  • ClickHouse queries to confirm log attributes
  • SigNoz filters for severity_text, service.name, host.name
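
The first two checks are easy to script (the health port assumes the health_check extension's defaults; the payload is a throwaway test record):

# collector health (health_check extension; adjust port/path to your config)
curl -sf http://127.0.0.1:13133/ && echo "collector healthy"

# push a single test log record over OTLP/HTTP
curl -s -X POST http://127.0.0.1:4318/v1/logs \
  -H 'Content-Type: application/json' \
  -d '{"resourceLogs":[{"resource":{"attributes":[
        {"key":"service.name","value":{"stringValue":"pipeline-check"}},
        {"key":"business_id","value":{"stringValue":"tenant-a"}}]},
      "scopeLogs":[{"logRecords":[{"severityText":"INFO",
        "body":{"stringValue":"observability pipeline test"}}]}]}]}'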

Example ClickHouse check (internal):

SELECT severity_text, count()
FROM signoz_logs.logs_v2
WHERE resources_string['business_id'] = 'tenant-a'
AND timestamp >= now() - INTERVAL 15 MINUTE
GROUP BY severity_text;

9) Security and Compliance (SOC 2 + ISO 27001)

We implemented controls aligned with SOC 2 and ISO 27001:

  • Access control: SSH keys only, least privilege, MFA on cloud console.
  • Network segmentation: minimal open ports; SSH restricted by source IP.
  • Secrets management: runtime secrets stored in a vault, never in code.
  • Encryption in transit: TLS everywhere, no plaintext endpoints exposed.
  • Encryption at rest: disk encryption enabled on VMs and DB volumes.
  • Audit trails: system logs retained; infra changes tracked in code.
  • Change management: all config in repos; change reviews before deployment.
  • Monitoring and alerting: OneUptime for SLOs and uptime checks.
  • Incident response: documented procedures, retention and escalation.
  • Backup strategy: ClickHouse backup policies per tenant.

10) Repeatability: Infra + Tenant Config as Code

We split configuration by responsibility:

  • Monitoring services repo: all infra and Nginx routing
  • Tenant repos: OTEL collector config and deploy hooks

That means a new VM can be rebuilt with:
1) Pull monitoring repo and run docker compose up -d
2) Update DNS + TLS
3) Run tenant deployment scripts to install collector and env
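
As a concrete sketch of that flow (the repository and script names here are hypothetical):

# 1) observability stack
git clone git@github.com:example/monitoring-services.git && cd monitoring-services
docker compose up -d

# 2) DNS + TLS, after pointing the tenant hostnames at the new VM
sudo certbot --nginx -d signoz.tenant-a.example -d status.tenant-a.example

# 3) tenant collector + environment
./scripts/deploy-tenant.sh tenant-a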

Final Takeaways

This architecture gives us the best of both worlds:

  • Strong tenant isolation for compliance-focused clients
  • Shared ops processes and standard config
  • Fast log filtering (severity/service/env/host) for high signal-to-noise debugging
  • A repeatable, audited deployment flow suitable for SOC 2 and ISO 27001 requirements
