Designing an Event-Sourced, Multi-Tenant Data Platform for Real-Time Analytics
Designing an Event-Sourced, Multi-Tenant Data Platform for Real-Time Analytics
Building a data platform that serves real-time analytics across multiple tenants is a challenging but increasingly common requirement. The design must handle high ingestion rates, per-tenant data isolation, strict access controls, fault tolerance, and scalable storage and compute. This guide walks through a practical, end-to-end approach: from high-level architecture to concrete data models, event sourcing, multi-tenant isolation, operational concerns, and a sample implementation you can adapt.
Note: This topic is distinct from the topics you listed and focuses on a coherent system design with concrete patterns and code snippets.
Table of contents
- Architectural overview
- Core components and data model
- Event sourcing and streaming pipeline
- Multi-tenant isolation and security
- Real-time query layer and materialized views
- Storage and scalability considerations
- Operational patterns and observability
- A practical implementation example (Go + Kafka + ClickHouse)
- Deployment considerations
- Trade-offs and common pitfalls
Architectural overview
- Goals
- Real-time analytics per tenant with strict data isolation
- High ingest throughput, low-latency queries, and resilience to outages
- Evolvable schema via event definitions, with replayable history
- Core ideas
- Event-sourced data plane: all changes are represented as immutable events
- Append-only event log per tenant, as the source of truth
- Streaming pipeline that materializes views and streams to analytical stores
- Separate write path (producers) and read path (consumers) to decouple workloads
- Centralized metadata service for tenancy and access control
- High-level architecture
- Ingestion services produce events to a distributed log (e.g., Kafka) with tenant-scoped topics
- Event bus streams into a materialization layer (updaters) that build read models
- Analytical store (columnar DB) for fast analytics
- Optional online store for operational dashboards and alerting
- Identity and access management layer enforcing per-tenant permissions
- Key patterns
- Event sourcing + CQRS (command-query responsibility segregation)
- Backfillable event streams for new projections
- Time-based partitioning and snapshotting for performance
Core components and data model
- Tenancy and identity
- Tenant ID: a stable, opaque identifier for each customer
- Access control: per-tenant roles (viewer, analyst, admin)
- Authn/Authz: use an external identity provider; issue short-lived tokens with tenant scope
- Events
- Event types: e.g., PageViewed, PurchaseMade, SensorReading, AlertTriggered
- Event schema: always include tenant_id, event_type, timestamp, and payload
- Versioning: include event_version to support evolving payload schemas
- Write model
- Producers emit events to Kafka topics named per-tenant or per-event-type with a tenant prefix
- Example topic naming: tenant-{tenant_id}.events or events.{tenant}.{type}
- Read model
- Projections consume events and produce denormalized views optimized for queries
- Store projections in a columnar database (e.g., ClickHouse, Apache Iceberg tables in S3)
- Snapshotting
- Periodic snapshots of aggregate state to speed up state reconstruction
- Snapshot schema mirrors the current read model state
- Metadata catalog
- Central registry for tenants, schemas, and available projections
- Stores permissions, quotas, and retention policies
Event sourcing and streaming pipeline
- Ingest layer
- Lightweight HTTP/gRPC endpoints or SDKs to emit events
- Validate event structure, enrich with metadata, attach tenant_id
- Emit to a durable log (Kafka) with idempotent producers and exactly-once semantics where feasible
- Streaming layer
- Consumers subscribe to tenant-scoped partitions or topics
- Process events to update projections and materialized views
- Handle out-of-order events with timestamps and watermarking
- Materialization
- Projections implement business logic (e.g., sum-by-tenant, time-series aggregates)
- Use stream processing frameworks (Kafka Streams, ksqlDB, Apache Flink) for robust state handling
- Data retention
- Retain raw events for a configurable period; delete or archive after policy
- Snapshots and read models kept for longer-term analytics
Multi-tenant isolation and security
- Data separation
- Use topic partitioning by tenant to minimize cross-tenant access
- Implement per-tenant access control at the API layer and within the read store
- Access controls
- Tokens include tenant_id and roles; services enforce restrictions
- Separate service accounts per tenant for integration; rotate credentials regularly
- Observability and governance
- Audit logs for access and data changes
- Quotas per tenant to prevent abuse (ingest rate, storage, query costs)
Real-time query layer and materialized views
- Query options
- Time-series dashboards (per-tenant)
- Ad-hoc analytics and cohort analysis
- Fast lookups for recent events with per-tenant filters
- Projections
- Build per-tenant materialized views such as:
- daily_active_users(tenant)
- revenue_by_day(tenant)
- top_pages_by_hits(tenant)
- Storage choices
- Operational projections: append-only stores (e.g., ClickHouse or Druid)
- Long-term analytics: data lake with columnar formats (Parquet) and a query engine (Presto/Trino)
- Query layer design
- Provide a single API surface for dashboards; enforce tenant scoping
- Implement read replicas and query caching for frequently accessed pipelines
Storage and scalability considerations
- Data partitioning
- Time-based partitions (e.g., daily) plus tenant-based partitioning for load isolation
- Compression and encoding
- Columnar formats with compression (e.g., ZSTD) to minimize storage and I/O
- Backpressure and backfills
- Backpressure-aware consumers; replayable streams to reprocess events after schema changes
- Data retention policies
- Configure per-tenant retention for events and read models; implement archival pipelines
- Scaling strategy
- Independent scaling paths for ingestion, streaming processing, and query layer
- Stateless producers; stateful stream processors with checkpoints and durable state stores
Operational patterns and observability
- Telemetry
- Centralized metrics for ingest rate, latency, error rates, per-tenant quotas
- Distributed tracing across services to diagnose slow tenants or failed projections
- Resilience
- Durable queues, idempotent processing, and retry/backoff policies
- Circuit breakers around external dependencies
- Backup and recovery
- Backups for topics, CDC-like snapshots for read models
- Playbooks for disaster recovery with tested runbooks
- Observability tooling
- Dashboards for tenants, system health, and SLA attainment
- Alerting on anomalies (e.g., sudden drop in events, projection lag)
A practical implementation example (Go + Kafka + ClickHouse)
Note: This is a minimal, illustrative setup you can expand. Adapt to your preferred language and tech stack.
1) Data model example (Go structs)
type Event struct {
TenantID string json:"tenant_id"
Type string json:"type"
Timestamp int64 json:"timestamp"
Version int json:"version"
Payload map[string]interface{} json:"payload"
}
2) Producer: emit events to Kafka
package main
import (
"encoding/json"
"log"
"time"
"github.com/segmentio/kafka-go"
)
func main() {
topic := "tenant-42.events"
r := kafka.NewReader(kafka.ReaderConfig{
Brokers: []string{"localhost:9092"},
Topic: topic,
GroupID: "producer-42",
})
defer r.Close()
// Example event
e := Event{
TenantID: "tenant-42",
Type: "PageViewed",
Timestamp: time.Now().Unix(),
Version: 1,
Payload: map[string]interface{}{
"page": "/home",
"ref": "newsletter",
},
}
b, _ := json.Marshal(e)
w := kafka.NewWriter(kafka.WriterConfig{
Brokers: []string{"localhost:9092"},
Topic: topic,
})
err := w.WriteMessages(context.Background(),
kafka.Message{Key: []byte(e.TenantID), Value: b},
)
if err != nil {
log.Fatal(err)
}
}
3) Consumer/Projection: update ClickHouse read model
package main
import (
"context"
"encoding/json"
"log"
"github.com/ClickHouse/clickhouse-go/v2"
"github.com/segmentio/kafka-go"
)
type Event struct {
TenantID string json:"tenant_id"
Type string json:"type"
Timestamp int64 json:"timestamp"
Version int json:"version"
Payload map[string]interface{} json:"payload"
}
func main() {
// Connect to ClickHouse
conn, err := clickhouse.Open(clickhouse.Options{
addr: []string{"127.0.0.1:9000"},
})
if err != nil {
log.Fatal(err)
}
// Ensure table exists
_, _ = conn.Exec(context.Background(), CREATE TABLE IF NOT EXISTS analytics.events)
(tenant_id String, type String, timestamp DateTime, payload String)
ENGINE = MergeTree() PARTITION BY toYYYYMMDD(timestamp) ORDER BY (tenant_id, timestamp)
r := kafka.NewReader(kafka.ReaderConfig{
Brokers: []string{"localhost:9092"},
Topic: "tenant-42.events",
GroupID: "proj-42",
})
for {
m, err := r.ReadMessage(context.Background())
if err != nil {
log.Println(err)
continue
}
var e Event
if err := json.Unmarshal(m.Value, &e); err != nil {
log.Println(err)
continue
}
// Serialize payload for storage
payloadJSON, _ := json.Marshal(e.Payload)
// Upsert into read model
_, _ = conn.Exec(context.Background(), "INSERT INTO analytics.events (tenant_id, type, timestamp, payload) VALUES (?, ?, toDateTime(?), ?)", e.TenantID, e.Type, e.Timestamp, string(payloadJSON))
}
}
4) Query example: per-tenant time-series
SELECT toStartOfDay(timestamp) AS day, count(*) AS events
FROM analytics.events
WHERE tenant_id = 'tenant-42'
AND timestamp >= '2026-01-01 00:00:00'
GROUP BY day
ORDER BY day ASC
5) Backfill and replay
- To backfill a projection, replay events from a historical window into the projection pipeline.
- Maintain a versioned projection state; enable schema evolution by applying migration steps as needed.
Security and deployment considerations
- Encrypt data in transit and at rest; use TLS for Kafka and ClickHouse connections.
- Use role-based access control (RBAC) on the API surface; enforce tenant scoping in every query.
- Implement quotas and fair-use controls to protect tenants and the platform.
- Consider managed services for Kafka (e.g., Confluent Cloud, AWS MSK) and ClickHouse (cloud offerings) to reduce ops load.
- CI/CD for event schemas: include schema validation tests and event compatibility checks.
Trade-offs and common pitfalls
- Event schema evolution can become complex; plan for forward/backward compatibility and versioning.
- Over-partitioning can cause small-partition throttling; balance between tenancy isolation and manageable partitions.
- Real-time latency vs. durability: choose a delivery guarantee that matches your business - at-least-once is common; exactly-once is more expensive.
- Read-side lag: projections may lag ingestion under bursty loads; implement buffering and backpressure strategies.
- Compliance: ensure data retention policies meet regulatory requirements and provide easy data deletion per tenant where needed.
Next steps and how to tailor this to Carlisle/UK context
- If youβre building for UK-based tenants, consider data residency requirements and GDPR implications. Plan data stores with region-specific compliance in mind.
- Start small: implement a single-tenant, real-time pipeline to validate the end-to-end flow, then scale to multi-tenant architecture.
- Prototype with a minimal event set (e.g., PageViewed, PurchaseMade) and a simple per-tenant dashboard to validate the experience before expanding event types and projections.
Would you like a richer, end-to-end code skeleton in a specific language (e.g., Python with Faust, Java/Kafka Streams, or Rust) and a docker-compose setup to run a local multi-tenant demo?
-
Rizwan Saleem | https://rizwansaleem.co
Top comments (0)