Rizwan Saleem

Designing an Event-Sourced, Multi-Tenant Data Platform for Real-Time Analytics

Building a data platform that serves real-time analytics across multiple tenants is a challenging but increasingly common requirement. The design must handle high ingestion rates, per-tenant data isolation, strict access controls, fault tolerance, and scalable storage and compute. This guide walks through a practical, end-to-end approach: from high-level architecture to concrete data models, event sourcing, multi-tenant isolation, operational concerns, and a sample implementation you can adapt.

Table of contents

  • Architectural overview
  • Core components and data model
  • Event sourcing and streaming pipeline
  • Multi-tenant isolation and security
  • Real-time query layer and materialized views
  • Storage and scalability considerations
  • Operational patterns and observability
  • A practical implementation example (Go + Kafka + ClickHouse)
  • Security and deployment considerations
  • Trade-offs and common pitfalls

Architectural overview

  • Goals
    • Real-time analytics per tenant with strict data isolation
    • High ingest throughput, low-latency queries, and resilience to outages
    • Evolvable schema via event definitions, with replayable history
  • Core ideas
    • Event-sourced data plane: all changes are represented as immutable events
    • Append-only event log per tenant, as the source of truth
    • Streaming pipeline that materializes views and streams to analytical stores
    • Separate write path (producers) and read path (consumers) to decouple workloads
    • Centralized metadata service for tenancy and access control
  • High-level architecture
    • Ingestion services produce events to a distributed log (e.g., Kafka) with tenant-scoped topics
    • The event bus streams into a materialization layer (projection updaters) that builds read models
    • Analytical store (columnar DB) for fast analytics
    • Optional online store for operational dashboards and alerting
    • Identity and access management layer enforcing per-tenant permissions
  • Key patterns
    • Event sourcing + CQRS (command-query responsibility segregation)
    • Backfillable event streams for new projections
    • Time-based partitioning and snapshotting for performance
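
To make the core idea concrete, here is a minimal in-memory sketch of an append-only, per-tenant event log with one read model rebuilt by replay. It is only an illustration: the Log type, the Amount field, and the Revenue projection are hypothetical names chosen for this example, and a real system would use a durable log such as Kafka rather than a map.

```go
package main

import "fmt"

// Event is a hypothetical minimal envelope for this sketch.
type Event struct {
	TenantID string
	Type     string
	Amount   int // purchase value in minor units; illustrative only
}

// Log is an append-only event log, kept separately per tenant.
type Log struct {
	byTenant map[string][]Event
}

func NewLog() *Log { return &Log{byTenant: make(map[string][]Event)} }

// Append adds an event to its tenant's log; existing entries are never modified.
func (l *Log) Append(e Event) {
	l.byTenant[e.TenantID] = append(l.byTenant[e.TenantID], e)
}

// Revenue is a read model rebuilt by replaying one tenant's events in order.
func (l *Log) Revenue(tenantID string) int {
	total := 0
	for _, e := range l.byTenant[tenantID] {
		if e.Type == "PurchaseMade" {
			total += e.Amount
		}
	}
	return total
}

func main() {
	l := NewLog()
	l.Append(Event{TenantID: "tenant-a", Type: "PurchaseMade", Amount: 1200})
	l.Append(Event{TenantID: "tenant-a", Type: "PageViewed"})
	l.Append(Event{TenantID: "tenant-b", Type: "PurchaseMade", Amount: 500})
	l.Append(Event{TenantID: "tenant-a", Type: "PurchaseMade", Amount: 300})
	fmt.Println(l.Revenue("tenant-a"), l.Revenue("tenant-b")) // prints "1500 500"
}
```

Because the read model is derived purely from the log, a new projection can be added later and filled in by replaying the same events.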

Core components and data model

  • Tenancy and identity
    • Tenant ID: a stable, opaque identifier for each customer
    • Access control: per-tenant roles (viewer, analyst, admin)
    • Authn/Authz: use an external identity provider; issue short-lived tokens with tenant scope
  • Events
    • Event types: e.g., PageViewed, PurchaseMade, SensorReading, AlertTriggered
    • Event schema: always include tenant_id, event_type, timestamp, and payload
    • Versioning: include event_version to support evolving payload schemas
  • Write model
    • Producers emit events to Kafka topics named per-tenant or per-event-type with a tenant prefix
    • Example topic naming: tenant-{tenant_id}.events or events.{tenant}.{type}
  • Read model
    • Projections consume events and produce denormalized views optimized for queries
    • Store projections in a columnar store (e.g., ClickHouse, or Apache Iceberg tables on S3)
  • Snapshotting
    • Periodic snapshots of aggregate state to speed up state reconstruction
    • Snapshot schema mirrors the current read model state
  • Metadata catalog
    • Central registry for tenants, schemas, and available projections
    • Stores permissions, quotas, and retention policies
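
The envelope rules above (required fields, tenant-prefixed topics, versioned payloads) can be sketched in a few functions. This is a rough illustration under assumptions: TopicFor follows the tenant-{tenant_id}.events pattern with a bare tenant ID, and the "ref" to "referrer" rename in Upcast is an invented schema change used only to show how an older event version might be lifted to a newer one.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is the common envelope; every event carries these fields.
type Event struct {
	TenantID  string
	Type      string
	Timestamp int64 // Unix seconds
	Version   int
	Payload   map[string]any
}

// Validate rejects envelopes that are missing a required field.
func Validate(e Event) error {
	switch {
	case e.TenantID == "":
		return errors.New("missing tenant_id")
	case e.Type == "":
		return errors.New("missing event_type")
	case e.Timestamp <= 0:
		return errors.New("missing timestamp")
	case e.Version < 1:
		return errors.New("missing event_version")
	}
	return nil
}

// TopicFor builds a tenant-scoped topic name.
func TopicFor(tenantID string) string {
	return "tenant-" + tenantID + ".events"
}

// Upcast lifts a version 1 PageViewed event to version 2.
// The field rename is a hypothetical schema change for illustration.
func Upcast(e Event) Event {
	if e.Type != "PageViewed" || e.Version != 1 {
		return e
	}
	p := make(map[string]any, len(e.Payload))
	for k, v := range e.Payload {
		p[k] = v
	}
	if ref, ok := p["ref"]; ok {
		p["referrer"] = ref
		delete(p, "ref")
	}
	e.Payload = p
	e.Version = 2
	return e
}

func main() {
	e := Event{TenantID: "42", Type: "PageViewed", Timestamp: 1767225600, Version: 1,
		Payload: map[string]any{"page": "/home", "ref": "newsletter"}}
	fmt.Println(Validate(e), TopicFor(e.TenantID))
	fmt.Println(Upcast(e).Version, Upcast(e).Payload["referrer"])
}
```

Upcasting at read time keeps the stored events immutable while letting projections handle a single, current payload shape.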

Event sourcing and streaming pipeline

  • Ingest layer
    • Lightweight HTTP/gRPC endpoints or SDKs to emit events
    • Validate event structure, enrich with metadata, attach tenant_id
    • Emit to a durable log (Kafka) with idempotent producers and exactly-once semantics where feasible
  • Streaming layer
    • Consumers subscribe to tenant-scoped partitions or topics
    • Process events to update projections and materialized views
    • Handle out-of-order events with timestamps and watermarking
  • Materialization
    • Projections implement business logic (e.g., sum-by-tenant, time-series aggregates)
    • Use stream processing frameworks (Kafka Streams, ksqlDB, Apache Flink) for robust state handling
  • Data retention
    • Retain raw events for a configurable period, then delete or archive them according to policy
    • Snapshots and read models kept for longer-term analytics
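
Out-of-order handling is easier to reason about with a small example. The sketch below is one simple way to apply a watermark, assuming the watermark trails the newest event time by a fixed allowed lateness. The Reorderer type and its fields are hypothetical, and frameworks such as Flink or Kafka Streams provide their own, more complete mechanisms.

```go
package main

import (
	"fmt"
	"sort"
)

// Event carries only what this sketch needs; ID is a hypothetical field.
type Event struct {
	ID        string
	Timestamp int64 // event time, Unix seconds
}

// Reorderer buffers events and releases them once the watermark passes them.
type Reorderer struct {
	lateness int64   // how far behind the newest event data is still accepted
	maxSeen  int64   // largest event time observed so far
	buffer   []Event // pending events, kept sorted by Timestamp
	Late     int     // events dropped because they arrived behind the watermark
}

func NewReorderer(allowedLateness int64) *Reorderer {
	return &Reorderer{lateness: allowedLateness}
}

// Push accepts one event and returns, in timestamp order, every buffered
// event whose timestamp is at or below the current watermark.
func (r *Reorderer) Push(e Event) []Event {
	if e.Timestamp < r.maxSeen-r.lateness {
		r.Late++ // too late; a real pipeline might route this to a correction path
		return nil
	}
	if e.Timestamp > r.maxSeen {
		r.maxSeen = e.Timestamp
	}
	r.buffer = append(r.buffer, e)
	sort.SliceStable(r.buffer, func(i, j int) bool {
		return r.buffer[i].Timestamp < r.buffer[j].Timestamp
	})
	watermark := r.maxSeen - r.lateness
	n := sort.Search(len(r.buffer), func(i int) bool {
		return r.buffer[i].Timestamp > watermark
	})
	ready := append([]Event(nil), r.buffer[:n]...)
	r.buffer = r.buffer[n:]
	return ready
}

func main() {
	r := NewReorderer(5)
	for _, e := range []Event{{"a", 10}, {"b", 8}, {"c", 16}, {"d", 9}} {
		for _, out := range r.Push(e) {
			fmt.Println(out.ID, out.Timestamp)
		}
	}
	fmt.Println("late:", r.Late)
}
```

With an allowed lateness of 5 seconds, event b (time 8) arrives after a (time 10) but is still emitted first; event d (time 9) arrives after the watermark has moved to 11 and is counted as late.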

Multi-tenant isolation and security

  • Data separation
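    • Use tenant-scoped topics or tenant-keyed partitions to keep each tenant's events separable; partitioning alone does not restrict access, so pair it with broker-level ACLs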
    • Implement per-tenant access control at the API layer and within the read store
  • Access controls
    • Tokens include tenant_id and roles; services enforce restrictions
    • Separate service accounts per tenant for integration; rotate credentials regularly
  • Observability and governance
    • Audit logs for access and data changes
    • Quotas per tenant to prevent abuse (ingest rate, storage, query costs)
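
As a rough sketch of what "services enforce restrictions" can look like, the check below compares the tenant and role carried in a token against what a request needs. It assumes the token signature has already been verified elsewhere, and it assumes the three roles form a simple hierarchy (admin includes analyst, analyst includes viewer), which the list above does not state. Claims and Authorize are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Role values are ordered so that a higher role includes the lower ones.
// That ordering is an assumption of this sketch.
type Role int

const (
	Viewer Role = iota
	Analyst
	Admin
)

// Claims holds what a verified, short-lived token says about the caller.
type Claims struct {
	TenantID string
	Role     Role
}

var ErrForbidden = errors.New("forbidden")

// Authorize allows a request only when the token's tenant matches the
// requested tenant and the caller's role is at least the required one.
func Authorize(c Claims, tenantID string, required Role) error {
	if c.TenantID == "" || c.TenantID != tenantID {
		return ErrForbidden
	}
	if c.Role < required {
		return ErrForbidden
	}
	return nil
}

func main() {
	analyst := Claims{TenantID: "tenant-42", Role: Analyst}
	fmt.Println(Authorize(analyst, "tenant-42", Viewer)) // allowed: <nil>
	fmt.Println(Authorize(analyst, "tenant-42", Admin))  // forbidden
	fmt.Println(Authorize(analyst, "tenant-7", Viewer))  // forbidden
}
```

Running the same check in both the API layer and the query layer means a bug in one place does not automatically expose another tenant's data.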

Real-time query layer and materialized views

  • Query options
    • Time-series dashboards (per-tenant)
    • Ad-hoc analytics and cohort analysis
    • Fast lookups for recent events with per-tenant filters
  • Projections
    • Build per-tenant materialized views such as:
      • daily_active_users(tenant)
      • revenue_by_day(tenant)
      • top_pages_by_hits(tenant)
  • Storage choices
    • Operational projections: append-only stores (e.g., ClickHouse or Druid)
    • Long-term analytics: data lake with columnar formats (Parquet) and a query engine (Presto/Trino)
  • Query layer design
    • Provide a single API surface for dashboards; enforce tenant scoping
    • Implement read replicas and query caching for frequently accessed dashboards and queries
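
A projection such as daily_active_users(tenant) can be sketched as a set of user IDs per tenant and day. The example below is a minimal in-memory version; it assumes each event carries a user ID (here a hypothetical UserID field, which in practice would come from the payload) and buckets days in UTC.

```go
package main

import (
	"fmt"
	"time"
)

// Event is trimmed to the fields this projection reads.
type Event struct {
	TenantID  string
	UserID    string // assumed to be extracted from the event payload
	Timestamp int64  // Unix seconds
}

// dayOf returns the UTC calendar day for a Unix timestamp.
func dayOf(ts int64) string {
	return time.Unix(ts, 0).UTC().Format("2006-01-02")
}

type key struct{ tenant, day string }

// DailyActiveUsers keeps one set of user IDs per (tenant, day).
type DailyActiveUsers struct {
	users map[key]map[string]struct{}
}

func NewDailyActiveUsers() *DailyActiveUsers {
	return &DailyActiveUsers{users: make(map[key]map[string]struct{})}
}

// Apply folds one event into the projection; repeated users are counted once.
func (p *DailyActiveUsers) Apply(e Event) {
	k := key{e.TenantID, dayOf(e.Timestamp)}
	if p.users[k] == nil {
		p.users[k] = make(map[string]struct{})
	}
	p.users[k][e.UserID] = struct{}{}
}

// Count returns the number of distinct users for one tenant on one day.
func (p *DailyActiveUsers) Count(tenantID, day string) int {
	return len(p.users[key{tenantID, day}])
}

func main() {
	morning := time.Date(2026, 1, 1, 9, 0, 0, 0, time.UTC).Unix()
	p := NewDailyActiveUsers()
	p.Apply(Event{TenantID: "tenant-a", UserID: "u1", Timestamp: morning})
	p.Apply(Event{TenantID: "tenant-a", UserID: "u1", Timestamp: morning + 60})
	p.Apply(Event{TenantID: "tenant-a", UserID: "u2", Timestamp: morning + 120})
	p.Apply(Event{TenantID: "tenant-b", UserID: "u1", Timestamp: morning})
	fmt.Println(p.Count("tenant-a", "2026-01-01"), p.Count("tenant-b", "2026-01-01"))
}
```

Keying the state by tenant first keeps every lookup tenant-scoped by construction, which matches the goal of enforcing tenant scoping in the query layer.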

Storage and scalability considerations

  • Data partitioning
    • Time-based partitions (e.g., daily) plus tenant-based partitioning for load isolation
  • Compression and encoding
    • Columnar formats with compression (e.g., ZSTD) to minimize storage and I/O
  • Backpressure and backfills
    • Backpressure-aware consumers; replayable streams to reprocess events after schema changes
  • Data retention policies
    • Configure per-tenant retention for events and read models; implement archival pipelines
  • Scaling strategy
    • Independent scaling paths for ingestion, streaming processing, and query layer
    • Stateless producers; stateful stream processors with checkpoints and durable state stores
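
The two partitioning dimensions above can be illustrated with a pair of helper functions: one maps a tenant to a stream partition by hashing, the other derives a tenant-plus-day storage partition key. Both are hypothetical helpers for this sketch. The FNV-1a hash is just one reasonable choice, and Kafka clients ship their own partitioners, so treat this as a picture of the idea rather than a drop-in replacement.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// PartitionFor maps a tenant to one of numPartitions stream partitions.
// The same tenant always lands on the same partition, which preserves
// per-tenant ordering.
func PartitionFor(tenantID string, numPartitions int) int {
	if numPartitions <= 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return int(h.Sum32() % uint32(numPartitions))
}

// StoragePartition builds a tenant-plus-day key for the analytical store,
// using the UTC day of the event timestamp (Unix seconds).
func StoragePartition(tenantID string, ts int64) string {
	return tenantID + "/" + time.Unix(ts, 0).UTC().Format("20060102")
}

func main() {
	ts := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC).Unix()
	fmt.Println(PartitionFor("tenant-42", 12))
	fmt.Println(StoragePartition("tenant-42", ts)) // prints "tenant-42/20260101"
}
```

Hashing tenants onto a fixed number of partitions trades some isolation for a bounded partition count, which matters once the number of tenants grows large.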

Operational patterns and observability

  • Telemetry
    • Centralized metrics for ingest rate, latency, error rates, per-tenant quotas
    • Distributed tracing across services to diagnose slow tenants or failed projections
  • Resilience
    • Durable queues, idempotent processing, and retry/backoff policies
    • Circuit breakers around external dependencies
  • Backup and recovery
    • Backups for topics, CDC-like snapshots for read models
    • Disaster-recovery runbooks that are exercised regularly, not just written down
  • Observability tooling
    • Dashboards for tenants, system health, and SLA attainment
    • Alerting on anomalies (e.g., sudden drop in events, projection lag)
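
The retry/backoff policy mentioned under Resilience can be sketched as a small helper. This is a minimal version with exponential backoff and no jitter; Retry and ErrTransient are hypothetical names, and the sleep function is passed in so the waiting behavior can be swapped out. A production version would likely add jitter and a maximum delay.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient stands in for a retryable failure such as a dropped connection.
var ErrTransient = errors.New("transient failure")

// Retry runs op up to attempts times, doubling the wait between tries.
// The sleep function is injected so callers control how waiting happens.
func Retry(attempts, baseMillis int, sleep func(ms int), op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	delay := baseMillis
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := Retry(4, 10,
		func(ms int) { time.Sleep(time.Duration(ms) * time.Millisecond) },
		func() error {
			calls++
			if calls < 3 {
				return ErrTransient // fail twice, then succeed
			}
			return nil
		})
	fmt.Println(calls, err) // prints "3 <nil>"
}
```

Retries only make sense when the operation is idempotent, which is why the two appear together in the list above.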

A practical implementation example (Go + Kafka + ClickHouse)
Note: This is a minimal, illustrative setup you can expand. Adapt to your preferred language and tech stack.

1) Data model example (Go structs)
// Event is the envelope shared by every event type.
type Event struct {
	TenantID  string                 `json:"tenant_id"`
	Type      string                 `json:"type"`
	Timestamp int64                  `json:"timestamp"` // Unix seconds
	Version   int                    `json:"version"`
	Payload   map[string]interface{} `json:"payload"`
}

2) Producer: emit events to Kafka
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// Event mirrors the envelope from step 1 so this file compiles on its own.
type Event struct {
	TenantID  string                 `json:"tenant_id"`
	Type      string                 `json:"type"`
	Timestamp int64                  `json:"timestamp"`
	Version   int                    `json:"version"`
	Payload   map[string]interface{} `json:"payload"`
}

func main() {
	topic := "tenant-42.events"

	// A producer needs only a writer.
	w := kafka.NewWriter(kafka.WriterConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   topic,
	})
	defer w.Close()

	// Example event
	e := Event{
		TenantID:  "tenant-42",
		Type:      "PageViewed",
		Timestamp: time.Now().Unix(),
		Version:   1,
		Payload: map[string]interface{}{
			"page": "/home",
			"ref":  "newsletter",
		},
	}
	b, err := json.Marshal(e)
	if err != nil {
		log.Fatal(err)
	}

	// Key by tenant so the message carries its tenant alongside the payload.
	err = w.WriteMessages(context.Background(),
		kafka.Message{Key: []byte(e.TenantID), Value: b},
	)
	if err != nil {
		log.Fatal(err)
	}
}

3) Consumer/Projection: update ClickHouse read model
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/ClickHouse/clickhouse-go/v2"
	"github.com/segmentio/kafka-go"
)

// Event mirrors the envelope from step 1.
type Event struct {
	TenantID  string                 `json:"tenant_id"`
	Type      string                 `json:"type"`
	Timestamp int64                  `json:"timestamp"`
	Version   int                    `json:"version"`
	Payload   map[string]interface{} `json:"payload"`
}

func main() {
	ctx := context.Background()

	// Connect to ClickHouse over the native protocol.
	conn, err := clickhouse.Open(&clickhouse.Options{
		Addr: []string{"127.0.0.1:9000"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Ensure the database and table exist.
	if err := conn.Exec(ctx, `CREATE DATABASE IF NOT EXISTS analytics`); err != nil {
		log.Fatal(err)
	}
	if err := conn.Exec(ctx, `CREATE TABLE IF NOT EXISTS analytics.events
		(tenant_id String, type String, timestamp DateTime, payload String)
		ENGINE = MergeTree() PARTITION BY toYYYYMMDD(timestamp) ORDER BY (tenant_id, timestamp)`); err != nil {
		log.Fatal(err)
	}

	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "tenant-42.events",
		GroupID: "proj-42",
	})
	defer r.Close()

	for {
		// With a GroupID set, ReadMessage commits offsets automatically,
		// so a failed insert below is logged but not retried.
		m, err := r.ReadMessage(ctx)
		if err != nil {
			log.Printf("read: %v", err)
			return
		}
		var e Event
		if err := json.Unmarshal(m.Value, &e); err != nil {
			log.Printf("decode: %v", err)
			continue
		}
		// Serialize payload for storage
		payloadJSON, err := json.Marshal(e.Payload)
		if err != nil {
			log.Printf("encode payload: %v", err)
			continue
		}
		// Append to the read model (a plain INSERT; MergeTree does not upsert).
		if err := conn.Exec(ctx,
			"INSERT INTO analytics.events (tenant_id, type, timestamp, payload) VALUES (?, ?, toDateTime(?), ?)",
			e.TenantID, e.Type, e.Timestamp, string(payloadJSON)); err != nil {
			log.Printf("insert: %v", err)
		}
	}
}

4) Query example: per-tenant time-series
SELECT toStartOfDay(timestamp) AS day, count(*) AS events
FROM analytics.events
WHERE tenant_id = 'tenant-42'
AND timestamp >= '2026-01-01 00:00:00'
GROUP BY day
ORDER BY day ASC

5) Backfill and replay

  • To backfill a projection, replay events from a historical window into the projection pipeline.
  • Maintain a versioned projection state; enable schema evolution by applying migration steps as needed.
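
A backfill can be pictured as a pure function over the stored events: pick a time window, replay it, and build a fresh projection tagged with a new version. The sketch below assumes events are available as an in-memory slice and that the projection is a simple count per event type. Backfill and Projection are hypothetical names, and a real pipeline would read the window from the log or an archive instead.

```go
package main

import "fmt"

// Event is trimmed to the fields the replay needs.
type Event struct {
	Type      string
	Timestamp int64 // Unix seconds
}

// Projection is a versioned read model: event counts per type.
type Projection struct {
	Version int
	Counts  map[string]int
}

// Backfill replays events with from <= Timestamp < to into a new projection.
// The existing projection is left untouched so it can keep serving reads
// until the new version is ready.
func Backfill(events []Event, from, to int64, version int) Projection {
	p := Projection{Version: version, Counts: make(map[string]int)}
	for _, e := range events {
		if e.Timestamp >= from && e.Timestamp < to {
			p.Counts[e.Type]++
		}
	}
	return p
}

func main() {
	history := []Event{
		{"PageViewed", 100}, {"PurchaseMade", 150},
		{"PageViewed", 200}, {"PageViewed", 300},
	}
	v2 := Backfill(history, 100, 300, 2)
	fmt.Println(v2.Version, v2.Counts["PageViewed"], v2.Counts["PurchaseMade"]) // prints "2 2 1"
}
```

Building the new version beside the old one means a bad backfill can be discarded without affecting what dashboards currently read.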

Security and deployment considerations

  • Encrypt data in transit and at rest; use TLS for Kafka and ClickHouse connections.
  • Use role-based access control (RBAC) on the API surface; enforce tenant scoping in every query.
  • Implement quotas and fair-use controls to protect tenants and the platform.
  • Consider managed services for Kafka (e.g., Confluent Cloud, AWS MSK) and ClickHouse (cloud offerings) to reduce ops load.
  • CI/CD for event schemas: include schema validation tests and event compatibility checks.
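
One common way to implement per-tenant ingest quotas is a token bucket per tenant. The sketch below is a simplified version under stated assumptions: time is passed in as Unix seconds so the behavior is deterministic, the Limiter type is a hypothetical name, and it is not safe for concurrent use without adding a mutex.

```go
package main

import "fmt"

type bucket struct {
	tokens float64
	last   int64 // Unix seconds of the last refill
}

// Limiter holds one token bucket per tenant.
type Limiter struct {
	rate    float64 // tokens added per second
	burst   float64 // maximum tokens a bucket can hold
	buckets map[string]*bucket
}

func NewLimiter(ratePerSecond, burst float64) *Limiter {
	return &Limiter{rate: ratePerSecond, burst: burst, buckets: make(map[string]*bucket)}
}

// Allow reports whether the tenant may ingest one more event at time now.
func (l *Limiter) Allow(tenantID string, now int64) bool {
	b, ok := l.buckets[tenantID]
	if !ok {
		b = &bucket{tokens: l.burst, last: now}
		l.buckets[tenantID] = b
	}
	if elapsed := float64(now - b.last); elapsed > 0 {
		b.tokens += elapsed * l.rate
		if b.tokens > l.burst {
			b.tokens = l.burst
		}
		b.last = now
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	l := NewLimiter(1, 2) // 1 event per second, bursts of 2
	fmt.Println(l.Allow("tenant-a", 0), l.Allow("tenant-a", 0), l.Allow("tenant-a", 0))
	fmt.Println(l.Allow("tenant-b", 0)) // other tenants are unaffected
	fmt.Println(l.Allow("tenant-a", 1)) // one token has refilled
}
```

Because each tenant has its own bucket, a noisy tenant exhausts only its own allowance.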

Trade-offs and common pitfalls

  • Event schema evolution can become complex; plan for forward/backward compatibility and versioning.
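  • Over-partitioning (for example, a topic per tenant per event type, or very fine-grained table partitions) multiplies broker metadata and produces many small files or parts; balance tenant isolation against a manageable partition count.
  • Real-time latency vs. durability: choose a delivery guarantee that matches your business needs. At-least-once is common; exactly-once costs more in throughput and complexity.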
  • Read-side lag: projections may lag ingestion under bursty loads; implement buffering and backpressure strategies.
  • Compliance: ensure data retention policies meet regulatory requirements and provide easy data deletion per tenant where needed.
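
If you settle on at-least-once delivery, projections need to tolerate duplicates. A minimal sketch of idempotent processing is shown below; it assumes every event carries a unique ID (a hypothetical ID field that is not part of the envelope shown earlier) and keeps the set of seen IDs in memory, which a real system would bound or persist.

```go
package main

import "fmt"

// Event carries a unique ID so redeliveries can be recognized.
// The ID field is an assumption of this sketch.
type Event struct {
	ID     string
	Amount int
}

// Projection sums amounts while ignoring events it has already applied.
type Projection struct {
	seen  map[string]struct{}
	Total int
}

func NewProjection() *Projection {
	return &Projection{seen: make(map[string]struct{})}
}

// Apply folds the event in once; it returns false for a duplicate delivery.
func (p *Projection) Apply(e Event) bool {
	if _, dup := p.seen[e.ID]; dup {
		return false
	}
	p.seen[e.ID] = struct{}{}
	p.Total += e.Amount
	return true
}

func main() {
	p := NewProjection()
	deliveries := []Event{{"e1", 100}, {"e2", 50}, {"e1", 100}} // e1 is redelivered
	for _, e := range deliveries {
		p.Apply(e)
	}
	fmt.Println(p.Total) // prints "150"
}
```

With deduplication on the read side, redelivered events leave the projection unchanged, so at-least-once delivery produces the same totals as a single delivery would.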

Next steps and how to tailor this to Carlisle/UK context

  • If you’re building for UK-based tenants, consider data residency requirements and GDPR implications. Plan data stores with region-specific compliance in mind.
  • Start small: implement a single-tenant, real-time pipeline to validate the end-to-end flow, then scale to multi-tenant architecture.
  • Prototype with a minimal event set (e.g., PageViewed, PurchaseMade) and a simple per-tenant dashboard to validate the experience before expanding event types and projections.

Rizwan Saleem | https://rizwansaleem.co
