Rizwan Saleem

Designing a Durable Event-Sourced Analytics Platform for Real-Time Dashboards

Event-sourced analytics is a powerful approach for building real-time dashboards that remain accurate, auditable, and scalable as data volume grows. This guide walks through a practical architecture, patterns, and implementation steps to design a robust system that ingests high-velocity events, stores them in an append-only log, materializes views for dashboards, and supports evolving analytics needs without breaking downstream consumers.

Overview and goals

  • Real-time, reliable dashboards: low, predictable latency from event ingestion to fresh metrics.
  • Scalable storage: append-only event log plus compacted, query-friendly read models.
  • Correctness guarantees where they matter: idempotent producers, at-least-once delivery with deduplication, and deterministic event handling.
  • Evolvable analytics: schema-less events with optional schemas; versioned read models and backward-compatible evolutions.
  • Observability and operability: tracing, metrics, schema migrations, and rollback strategies.

High-level architecture

  • Event producers: services emitting domain events (e.g., user_sign_up, item_purchased, page_view).
  • Event bus / log: a durable, immutable append-only log (e.g., Apache Kafka, AWS Kinesis, or a similar system).
  • Stream processing: processors that read from the log, perform enrichment, deduplication, and materialize read models.
  • Read models: denormalized, query-optimized projections stored in a database suitable for dashboards (e.g., PostgreSQL for ad-hoc queries, Redis for fast dashboards, or specialized OLAP stores for large-scale analytics).
  • Query layer: API or GraphQL/REST endpoints that expose dashboards and metrics.
  • Observability: centralized logging, metrics, tracing, and dashboards for system health.

Core design decisions

  • Event model
    • Each event should have: event_id, event_type, aggregate_id, timestamp, payload, and optional metadata.
    • Order events by the log's per-partition offset; treat timestamps as event-time metadata, since clocks on different producers are not reliably monotonic. Include causality metadata when possible.
    • Make events immutable and append-only; avoid updating events in place.
  • Idempotency and deduplication
    • Assign a global idempotency key (e.g., event_id or a combination of source_id and sequence number).
    • Processors should be idempotent, able to handle duplicate events without side effects.
  • Schema evolution
    • Favor schemaless/loosely structured payloads (e.g., JSON with optional fields).
    • Maintain a central registry of event types and payload schemas; versions can be added, with backward-compatible reader behavior.
    • Read models should be versioned; new projections may leverage new fields while old projections remain intact.
  • Read-models and projections
    • Projections are deterministic and incremental: replaying the same events in the same order produces the same read model.
    • Materialize common aggregates (counts, funnels, time-series) and more complex metrics (RFM, cohort analyses) as separate read models.
    • Use windowing (tumbling/hopping windows) to produce time-bounded dashboards.
  • Consistency and latency
    • Accept at-least-once delivery guarantees; design processors to handle retries gracefully.
    • For dashboards requiring strong accuracy, implement reconciliation passes to correct drift between log and read models.
  • Operational concerns
    • Back-pressure handling: scale producers and processors independently; when processing lag grows, throttle or buffer at the producers rather than dropping events.
    • Exactly-once semantics in read models can be expensive; prefer idempotent processing and at-least-once semantics with deduplication.
    • Monitoring: track event lag, processing lag, error rates, storage growth, and query latency.
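
To make the windowing decision above concrete, here is a minimal sketch of tumbling-window counting. It assumes events are plain dicts carrying the `timestamp` (epoch milliseconds) and `event_type` fields from the event model; the function names `window_start` and `tumbling_counts` are made up for illustration.

```python
from collections import Counter


def window_start(ts_ms: int, size_ms: int) -> int:
    """Return the start (epoch ms) of the tumbling window containing ts_ms."""
    return ts_ms - (ts_ms % size_ms)


def tumbling_counts(events, size_ms=60_000):
    """Count events per (window_start, event_type).

    Windows are aligned to the epoch and never overlap, so each event
    lands in exactly one bucket.
    """
    counts = Counter()
    for event in events:
        key = (window_start(event["timestamp"], size_ms), event["event_type"])
        counts[key] += 1
    return dict(counts)
```

Note that this counter is not idempotent by itself: duplicates would be counted twice, so it should sit behind the deduplication step.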

A concrete architecture blueprint

  • Event bus/log: Kafka (or Kinesis)
    • Topics per event source or per domain (e.g., user_events, commerce_events).
    • Each producer writes to its topic with a stable schema.
  • Event processing layer: stream processors
    • Consumer groups for each read-model projection.
    • Enrichment: join with static data (e.g., user profile, product catalog) if needed; cache dimension data for fast enrichment.
    • Deduplication: use a cache-backed deduper keyed by event_id within a retention window.
  • Read models storage
    • PostgreSQL for relational read models and dashboards requiring SQL.
    • Redis for ultra-fast, time-series dashboards and counters.
    • Optional: ClickHouse or Druid for large-scale analytics and fast aggregations if data volume is huge.
  • Query API
    • REST or GraphQL endpoints that fetch the latest read models and time-series data.
    • Support for slicing by time window, user segments, and event types.
  • Observability
    • OpenTelemetry instrumentation; trace end-to-end latency from event production to dashboard render.
    • Metrics: event lag per topic, processor throughput, read-model update lag, error rates.
    • Centralized dashboards (Grafana, Kibana) with alerting on critical thresholds.
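
The cache-backed deduper from the processing layer can be sketched as follows. This is an in-memory, single-process version with a hypothetical `TTLDeduper` class; a multi-instance deployment would more likely keep the seen keys in a shared store such as Redis, which this sketch does not attempt.

```python
import time
from collections import OrderedDict


class TTLDeduper:
    """Remembers event_ids for ttl_seconds; older ids are forgotten."""

    def __init__(self, ttl_seconds=86_400, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._seen = OrderedDict()  # event_id -> first-seen time, oldest first

    def _evict_expired(self, now):
        # Entries are in insertion order, so stop at the first fresh one.
        while self._seen:
            _, seen_at = next(iter(self._seen.items()))
            if now - seen_at < self.ttl:
                break
            self._seen.popitem(last=False)

    def is_duplicate(self, event_id) -> bool:
        """Return True if event_id was seen within the window; else record it."""
        now = self.clock()
        self._evict_expired(now)
        if event_id in self._seen:
            return True
        self._seen[event_id] = now
        return False
```

The retention window bounds memory, at the cost that a duplicate arriving after the window has passed will be treated as new.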

Step-by-step guide: building a minimal but extensible system
Phase 1: define events and topics

  • Identify core events that drive dashboards: user_signup, login, product_view, add_to_cart, order_placed.
  • For each event, define:
    • event_type, event_id, aggregate_id (e.g., user_id, session_id), timestamp, payload (fields relevant to the event).
  • Example payloads (simplified):
    • user_signup: { user_id, signup_source, plan }
    • product_view: { user_id, product_id, view_duration }
    • order_placed: { user_id, order_id, total_amount, currency, items: [{product_id, quantity}] }
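
One way to pin down the envelope described above is a small immutable dataclass. This is a sketch, not a prescribed format: the `Event` class and the `order_placed` helper are illustrative names, and the field set simply mirrors the list in this phase.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Event:
    event_type: str
    aggregate_id: str
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: int = field(default_factory=lambda: int(time.time() * 1000))
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the whole envelope for the event log."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(raw: str) -> "Event":
        """Rebuild an Event from its serialized form."""
        return Event(**json.loads(raw))


def order_placed(user_id, order_id, total_amount, currency, items) -> Event:
    """Build an order_placed event using the payload shape listed above."""
    return Event(
        event_type="order_placed",
        aggregate_id=user_id,
        payload={
            "user_id": user_id,
            "order_id": order_id,
            "total_amount": total_amount,
            "currency": currency,
            "items": items,
        },
    )
```

Freezing the dataclass prevents reassigning fields after creation, which matches the rule that events are never updated in place.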

Phase 2: choose storage and streaming components

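  • Event log: Kafka with producer retries and idempotence enabled. Use time-based retention for event topics; reserve log compaction for changelog-style topics (e.g., dimension data), since compaction discards older records for each key.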
  • Read-model store: PostgreSQL for rich queries; Redis for hot dashboards.
  • Processing framework: use a lightweight stream processor (e.g., Kafka Streams, ksqlDB, or a microservice consuming Kafka and producing to read models) depending on team familiarity.

Phase 3: build a small set of initial projections

  • Daily active users (DAU) and monthly active users (MAU)
    • Projection reads user_signup and login events; updates a daily count table and a rolling MAU view.
  • Revenue by day and by product
    • Projection processes order_placed events; aggregates totals by day and product_id.
  • Session length per user
    • Projection computes session duration from login and logout or from successive events within a session_id.
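
As an illustration of the first projection, the sketch below keeps DAU and a rolling MAU in memory. It assumes `aggregate_id` holds the user id and that `user_signup` and `login` count as activity; `ActiveUsersProjection` is a hypothetical name, and a real read model would live in PostgreSQL or Redis rather than a dict.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta, timezone

ACTIVITY_EVENTS = {"user_signup", "login"}


class ActiveUsersProjection:
    def __init__(self):
        self.users_by_day = defaultdict(set)  # UTC date -> {user_id}

    def apply(self, event: dict) -> None:
        """Fold one event into the projection; other event types are ignored."""
        if event["event_type"] not in ACTIVITY_EVENTS:
            return
        ts = datetime.fromtimestamp(event["timestamp"] / 1000, tz=timezone.utc)
        self.users_by_day[ts.date()].add(event["aggregate_id"])

    def dau(self, day: date) -> int:
        """Distinct active users on the given UTC day."""
        return len(self.users_by_day.get(day, ()))

    def mau(self, day: date, window_days: int = 30) -> int:
        """Distinct active users in the window_days ending on day (inclusive)."""
        users = set()
        for offset in range(window_days):
            users |= self.users_by_day.get(day - timedelta(days=offset), set())
        return len(users)
```

Because the state is a set of user ids per day, applying the same event twice changes nothing, so this projection is idempotent by construction.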

Phase 4: implement idempotent processors

  • Deduplicate events by event_id within a processing window (e.g., 24 hours).
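  • Use upserts keyed on natural identifiers (e.g., day and user_id) so that reapplying an event leaves the read model unchanged; plain counter increments are not idempotent on their own and must be guarded by the deduplication check.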
  • Log duplicates for visibility but do not alter state.
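
The steps above can be sketched as a single transactional handler. To keep the example runnable without external services, SQLite from the standard library stands in for PostgreSQL; the table names `processed_events` and `signups_by_day` and the function names are assumptions for illustration.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the dedup table and the counter read model if missing."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY);
        CREATE TABLE IF NOT EXISTS signups_by_day (
            day TEXT PRIMARY KEY,
            signups INTEGER NOT NULL
        );
    """)
    return conn


def apply_signup(conn, event_id, day):
    """Apply one signup event; return False if it was a duplicate."""
    with conn:  # one transaction: both writes commit or neither does
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # already processed: leave the counter untouched
        conn.execute(
            """
            INSERT INTO signups_by_day (day, signups) VALUES (?, 1)
            ON CONFLICT (day) DO UPDATE SET signups = signups + 1
            """,
            (day,),
        )
    return True
```

Recording the event_id and bumping the counter in the same transaction is what makes the increment safe under redelivery: a retry either sees the id and skips, or sees neither write.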

Phase 5: schema evolution strategy

  • Maintain an event_type registry with a version per event payload shape.
  • If a payload changes, emit a new version of the event_type and support both versions in readers for a period.
  • Write read-model migrations as separate steps; add new read-model tables as needed without removing old ones immediately.
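
A common way to support several payload versions in readers is to upgrade old events to the latest shape before they reach the projection. The sketch below assumes the version is carried in `metadata["schema_version"]` (defaulting to 1) and invents a v2 of `user_signup` that adds a `country` field; both are hypothetical and only serve to show the mechanism.

```python
def _signup_v1_to_v2(payload: dict) -> dict:
    # Hypothetical change: v2 added `country`; old events get a default.
    return {**payload, "country": payload.get("country", "unknown")}


# (event_type, from_version) -> function returning the next version's payload
UPCASTERS = {("user_signup", 1): _signup_v1_to_v2}
LATEST_VERSION = {"user_signup": 2}


def upcast(event: dict) -> dict:
    """Return a copy of event with its payload upgraded to the latest version."""
    event_type = event["event_type"]
    metadata = dict(event.get("metadata", {}))
    version = metadata.get("schema_version", 1)
    payload = dict(event["payload"])
    latest = LATEST_VERSION.get(event_type, version)
    while version < latest:
        payload = UPCASTERS[(event_type, version)](payload)
        version += 1
    metadata["schema_version"] = version
    return {**event, "payload": payload, "metadata": metadata}
```

With this in place, projections only ever see the newest payload shape, and the stored events themselves stay untouched.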

Phase 6: observability and resilience

  • Instrument producers and processors with request/latency metrics and error counters.
  • Track lag: difference between latest event_offset and read-model update offset.
  • Add alert rules for lag thresholds and processing failures.
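
The lag check can be expressed in a few lines once the two offsets are available. How they are fetched is left out here (with confluent-kafka, something like `Consumer.get_watermark_offsets` could supply the log end); the `LagReading` class and the default threshold are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class LagReading:
    partition: int
    log_end_offset: int    # next offset the broker will assign
    projected_offset: int  # next offset the projection will read

    @property
    def lag(self) -> int:
        """Number of events written to the log but not yet projected."""
        return max(self.log_end_offset - self.projected_offset, 0)


def lag_alerts(readings, threshold=10_000):
    """Return one message per partition whose lag exceeds the threshold."""
    return [
        f"partition {r.partition}: lag {r.lag} exceeds {threshold}"
        for r in readings
        if r.lag > threshold
    ]
```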

Example implementation: a minimal Python-based event producer and a simple processor

  • Tech choices: Python, Kafka, PostgreSQL, Redis.

Event producer snippet (Python, using confluent-kafka)

  • This code emits a user_signup event when a new user registers.

from confluent_kafka import Producer
import json
import time
import uuid

p = Producer({'bootstrap.servers': 'kafka-broker:9092'})

def delivery_report(err, msg):
    # Invoked once per message (from flush/poll) with the delivery result.
    if err is not None:
        print(f"Delivery failed for record {msg.key()}: {err}")

def publish_user_signup(user_id, signup_source, plan):
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": "user_signup",
        "aggregate_id": user_id,
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "payload": {"user_id": user_id, "signup_source": signup_source, "plan": plan}
    }
    p.produce('user_events', key=event["event_id"], value=json.dumps(event), callback=delivery_report)
    # flush() blocks until delivery completes; fine for a demo, but
    # high-throughput producers should batch and flush less often.
    p.flush()

# Example usage
publish_user_signup("user-123", "email_campaign", "premium")

Event processor snippet (Python, consuming Kafka and updating PostgreSQL)

  • A simple deduplicating projection for daily active users.

import json
import time

import psycopg2
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka-broker:9092',
    'group.id': 'ua_projection',
    'auto.offset.reset': 'earliest',
    # Commit offsets manually, only after the database write succeeds,
    # so a crash causes a replay (at-least-once) rather than a lost event.
    'enable.auto.commit': False
})

consumer.subscribe(['user_events'])

conn = psycopg2.connect(
    dbname='analytics', user='analytics', password='secret', host='db-host'
)
cur = conn.cursor()

def main():
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("Consumer error: {}".format(msg.error()))
            continue

        event = json.loads(msg.value().decode('utf-8'))
        if event.get('event_type') != 'user_signup':
            # Skipped events are covered by the next offset commit.
            continue

        user_id = event['payload']['user_id']
        event_id = event['event_id']
        event_ts = event['timestamp']

        # Simple deduplication using a log table
        cur.execute("SELECT 1 FROM event_log WHERE event_id = %s", (event_id,))
        if cur.fetchone():
            conn.commit()
            consumer.commit(message=msg, asynchronous=False)
            continue

        # Insert event into log
        cur.execute(
            "INSERT INTO event_log (event_id, event_type, aggregate_id, ts, payload) VALUES (%s, %s, %s, to_timestamp(%s/1000.0), %s)",
            (event_id, event['event_type'], user_id, event_ts, json.dumps(event['payload']))
        )

        # Update DAU projection (the upsert makes replays harmless)
        day = time.strftime('%Y-%m-%d', time.gmtime(event_ts/1000.0))
        cur.execute("""
            INSERT INTO dau_by_day (day, user_id)
            VALUES (%s, %s)
            ON CONFLICT (day, user_id) DO NOTHING
        """, (day, user_id))

        conn.commit()
        # The event is durable in PostgreSQL; now advance the Kafka offset.
        consumer.commit(message=msg, asynchronous=False)

if __name__ == "__main__":
    main()

Notes on the code

  • The event producer emits events to a topic with a unique event_id; consumers deduplicate using event_id.
  • The processor maintains an event_log to enable reconciliation or rollback if needed.
  • The DAU projection uses per-day granularity; you can refine it to hourly or finer buckets as needed.

Testing and validation

  • End-to-end tests: simulate a sequence of events, ensure read-models reflect expected aggregates.
  • Idempotency tests: replay the same events; ensure read models do not double-count.
  • Back-pressure tests: simulate spikes and observe lag; ensure processors queue and scale.

Operational considerations

  • Scaling
    • Producers scale by topic and partition count; ensure event_id generation remains unique across partitions.
    • Processors scale by consumer groups; each projection can run in its own service or container, enabling independent scaling.
  • Storage management
    • Retain event logs long enough for reconciliation (e.g., 90 days or as required by compliance).
    • Periodically roll up older data into summarized tables to save space.
  • Security
    • Encrypt sensitive payload fields and enforce strict access control to read-model stores.
    • Validate inputs at producers and enforce schema checks at readers.

Extending the system

  • Add more read models
    • Funnel analytics: steps from page_view to purchase with time-to-conversion.
    • Cohorts: track user behavior by signup date and observe retention.
    • Product performance: revenue and units sold by category and region.
  • Real-time dashboards
    • Use Redis/TimescaleDB or ClickHouse for fast rolling-window aggregations.
    • Build a frontend that subscribes to a live API or WebSocket stream for near-instant updates.
  • Data quality and governance
    • Implement schema validation for event payloads.
    • Build a lightweight lineage system: which dashboards depend on which events.
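
As a taste of what a funnel read model computes, here is a minimal in-memory sketch. It assumes `aggregate_id` identifies the user and that the funnel steps are `page_view`, `add_to_cart`, and `order_placed`; the `funnel` function and its return shape are illustrative choices, not a fixed design.

```python
def funnel(events, steps=("page_view", "add_to_cart", "order_placed")):
    """Return (users reaching each step, time-to-conversion in ms per user).

    A user reaches step N only after reaching step N-1 at an earlier
    (or equal) timestamp; out-of-order steps are ignored.
    """
    progress = {}  # user_id -> timestamps of the steps reached so far
    for event in sorted(events, key=lambda e: e["timestamp"]):
        reached = progress.setdefault(event["aggregate_id"], [])
        if len(reached) < len(steps) and event["event_type"] == steps[len(reached)]:
            reached.append(event["timestamp"])
    counts = [
        sum(1 for reached in progress.values() if len(reached) > i)
        for i in range(len(steps))
    ]
    durations = {
        user: reached[-1] - reached[0]
        for user, reached in progress.items()
        if len(reached) == len(steps)
    }
    return counts, durations
```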

Illustrative example: a nightly reconciliation cycle

  • Every night, a reconciliation job scans the event_log against read-models to detect discrepancies.
  • If a discrepancy is found (e.g., a late event arrived), the job rebuilds the affected read-model segment by replaying the relevant events, bringing the projection back in sync.
  • This approach balances the performance of real-time processing with the reliability of batched consistency checks.
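
The reconciliation cycle can be sketched with plain dictionaries. The example assumes each log entry carries an `event_id` and a precomputed `day` string, and that the read model is a simple `{day: count}` mapping; the function names are hypothetical.

```python
from collections import Counter


def rebuild_daily_counts(event_log):
    """Recompute counts per day from the log, counting each event_id once."""
    seen, counts = set(), Counter()
    for event in event_log:
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        counts[event["day"]] += 1
    return dict(counts)


def reconcile(event_log, read_model):
    """Return {day: (stored, expected)} for every day where the two differ."""
    expected = rebuild_daily_counts(event_log)
    drift = {}
    for day in sorted(set(expected) | set(read_model)):
        stored, wanted = read_model.get(day, 0), expected.get(day, 0)
        if stored != wanted:
            drift[day] = (stored, wanted)
    return drift


def repair(read_model, drift):
    """Return a corrected copy of the read model; zero-count days are dropped."""
    fixed = dict(read_model)
    for day, (_, wanted) in drift.items():
        if wanted:
            fixed[day] = wanted
        else:
            fixed.pop(day, None)
    return fixed
```

Reporting drift separately from repairing it lets the nightly job log what changed before it overwrites anything.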

Final thoughts

  • An event-sourced analytics architecture provides a strong foundation for scalable, auditable dashboards. By starting with a compact set of events and simple projections, you can ship quickly while keeping room to grow. Prioritize idempotent processing, backward-compatible schema evolution, and robust observability to maintain a healthy system as your data and user needs expand.

Rizwan Saleem | https://rizwansaleem.co
