Rizwan Saleem

Posted on Jun 2

Designing a scalable event-sourced analytics platform with CQRS and a data lakehouse pattern

#frontend #ai #typescript #webdev

Designing a scalable event-sourced analytics platform with CQRS and a data lakehouse pattern

In this tutorial, you’ll learn how to design a scalable analytics system that handles high-cardinality event streams, supports flexible ad-hoc analytics, and remains maintainable as data volumes grow. We’ll walk through architectural decisions, data modeling, data flow, and concrete code examples to implement a lean, production-ready system.

Overview and goals

Handle large, rapid streams of user and system events with low latency ingestion.
Support both real-time dashboards and long-running analytics jobs.
Provide flexible, ad-hoc query capabilities without sacrificing write throughput.
Ensure data consistency for critical counts while enabling eventual consistency for exploratory analytics.
Enable fault tolerance, observability, and operational simplicity.

Architectural outline

Core ideas: event sourcing for immutable records, Command-Query Responsibility Segregation (CQRS) to separate write paths from read paths, and a data lakehouse for flexible analytics.
Ingest layer: a high-throughput event bus, a compacted log, and backpressure-aware producers.
Write model: write side stores events; maintain materialized views (projections) for common queries.
Read model: fast, query-optimized stores for dashboards, plus a data lakehouse for large-scale analytics.
Storage tiers: hot (streams and recent projections), warm (aggregations and precomputed summaries), cold (immutable event history and raw logs).
Observability: structured tracing, metrics, and lineage for data quality.

Detailed design

1) Event schema and streamlining

Event envelope: each event has an id, type, timestamp, correlation id, and metadata.
Event types should be additive and idempotent where possible.
Use a stable, evolving schema with a schema registry (e.g., Confluent Schema Registry or an in-house JSONSchema registry) to manage versions without breaking producers/consumers.

Example event envelope (typescript-like shape):
{
"event_id": "evt_12345",
"type": "page_view",
"timestamp": "2026-06-02T18:10:00.000Z",
"correlation_id": "req_98765",
"user_id": "u_abc",
"properties": {
"page": "/product/123",
"referrer": "https://search.example.com",
"device": "mobile",
"ip": "203.0.113.42"
},
"metadata": {
"source": "frontend",
"version": 2
}
}

Cardinality considerations: store user_id, session_id, and event type distinctly to enable efficient grouping and filtering.
Idempotency: deduplicate using a global event_id channel; design producers to be idempotent where possible.

2) Ingestion layer

Use a robust message bus (e.g., Apache Kafka, or a managed alternative like Kinesis or Pulsar).
Backpressure: producers should be able to slow down or buffer; use partitioning to scale ingestion and ensure ordered processing per entity (e.g., user_id partitioning).
Exactly-once delivery is hard; aim for at-least-once with idempotent processors and deduplication at the projection layer.

Code sketch (pseudo-typed with Node.js) for a Kafka producer:
const { Kafka } = require('kafkajs');
const kafka = new Kafka({ clientId: 'analytics-ingest', brokers: ['kafka:9092'] });
const producer = kafka.producer();

async function publishEvent(event) {
await producer.connect();
await producer.send({
topic: 'events',
messages: [{ key: event.event_id, value: JSON.stringify(event) }]
});
await producer.disconnect();
}

Partition strategy: partition by user_id and/or event_type to maintain order for a given key, while enabling parallelism across keys.

3) Write model (CQRS write side)

Commands and events: domain logic emits events rather than updating a database directly.
Event store: an append-only log that stores all events. You can store raw events plus a compacted “state transition” event for common aggregates.
Projections: materialized views that consume the event stream to produce query-optimized read models.

Example domain: product view analytics

Events: page_view, product_view, add_to_cart, purchase
Projections: per-user session aggregates, product popularity, cohort metrics

Projection design tips:

Use a time-series oriented store for rolling aggregations (e.g., on a per-hour basis).
Maintain exactly-countable metrics for dashboards (e.g., unique visitors) using a combination of user_id and a windowed approach to approximate distinct counts (HyperLogLog or streaming approximate distinct counts).
Persist projections to a fast read store (e.g., a columnar store or a NoSQL database optimized for reads).

Code sketch: Node.js projection consuming events
// pseudo-code
const readModelStore = connectToDatabase('read-model');

async function handleEvent(event) {
switch (event.type) {
case 'page_view':
await updatePageViewProjection(event);
break;
case 'purchase':
await updatePurchaseProjection(event);
break;
// more event types...
}
}

async function updatePageViewProjection(e) {
const { user_id, timestamp } = e;
// Example: hourly page views per user
const bucket = new Date(timestamp);
bucket.setMinutes(0, 0, 0);
await readModelStore.increment('user_page_views', { user_id, bucket }, 1);
}

4) Read model (queries and dashboards)

Hot path: optimized read stores for dashboards and ad-hoc queries. Use indexed SQL data warehouse or a fast NoSQL store with rich query capabilities.
Cold path: store event logs in a data lake (e.g., Parquet on S3) for long-term analytics and machine learning.
Data lakehouse approach: keep raw events in the data lake, and maintain curated tables in a data warehouse (e.g., Delta Lake, Apache Iceberg) for consistent analytics across teams.

Example reads

Daily active users by region
Revenue by product category
Funnel analysis for a conversion flow

5) Data lakehouse integration

Ingest raw events into a data lake (S3/Blob) with partitioning by date.
Build a metadata catalog and use a data warehouse layer (e.g., Delta Lake) to enable ACID transactions on table updates.
Use ETL jobs to transform events into analytics-ready tables, keeping both grain-level and aggregated views.

6) Consistency trade-offs

Write-side consistency: prefer immutable events; maintain idempotent projections to avoid double-counting.
Read-side consistency: for dashboards, consider lag windows (e.g., dashboards reflect data up to the previous hour) to allow projections to settle.
Exactly-once is tough in distributed systems; implement deduplication and reconciliation jobs overnight to fix anomalies.

7) Operational considerations

Observability: trace event lifecycles across ingestion, projection, and read layers; collect metrics on throughput, lag, error rates, and backlog depth.
Schema evolution: versioned events; implement backward-compatible changes and a migration plan for projections.
Security and governance: enforce least privilege on data stores; mask PII where not needed; audit access to sensitive metrics.
Testing: use replay-based tests with mocked event streams to verify projections against a known event history.

8) Example end-to-end flow

Frontend or backend service publishes a page_view event to the events topic.
Ingestion layer consumes, ensures idempotency, and stores the event in the immutable event store.
Projection consumer updates user_hourly_page_views and product_popularity read models.
Dashboards query the read models for real-time metrics; data analysts query the data warehouse for ad-hoc analytics, while raw events are stored in the data lake for ML workloads.

9) Implementation blueprint (stack-neutral)

Ingestion: a durable message bus (Kafka/Kinesis/Pulsar) with backpressure and partitioning.
Event store: immutable append-only log (local or cloud storage with an offset/index).
Projections: streaming processors (e.g., KStreams, Flink, Spark Structured Streaming) that read from the event stream and write to read models.
Read models: fast query stores (columnar DB, Redis for hot aggregates, or document stores with rich indexing).
Data lakehouse: object store with Parquet/ORC files; a catalog layer (Hive/Glue/Metastore) and a warehouse layer (Delta/Iceberg) for ACID transactions.

10) A compact, practical code example: a minimal projection with Node.js and a streaming library
Note: this is a simplified illustration. In production, you'd use a robust streaming framework.

// projection.ts (simplified)
type Event = {
event_id: string;
type: string;
timestamp: string;
user_id?: string;
properties?: any;
};

async function processEvent(event: Event) {
switch (event.type) {
case 'page_view':
await upsertUserHourlyViews(event);
break;
case 'purchase':
await upsertRevenue(event);
break;
default:
// ignore other events for this projection
}
}

async function upsertUserHourlyViews(e: Event) {
const userId = e.user_id;
const hourBucket = new Date(e.timestamp);
hourBucket.setMinutes(0, 0, 0);
// pseudo DB call
await db.increment('user_hourly_views', { user_id: userId, bucket: hourBucket }, 1);
}

async function upsertRevenue(e: Event) {
const amount = e.properties?.amount || 0;
const product = e.properties?.product_id || 'unknown';
const hourBucket = new Date(e.timestamp);
hourBucket.setMinutes(0, 0, 0);
await db.increment('hourly_revenue', { bucket: hourBucket, product }, amount);
}

// simulate consuming
async function run() {
const events = await fetchFromEventStream(); // fetch events from topic
for (const ev of events) {
await processEvent(ev);
}
}

run().catch(console.error);

11) Deployment and evolution notes

Start small: implement one critical projection (e.g., daily active users) and a single dashboard to validate the flow.
Add more projections incrementally, ensuring backward compatibility when introducing new event types.
Use feature flags to enable/disable new projections during testing.

12) Practical tips and pitfalls

Backpressure management: design producers to buffer or throttle; monitor queue depth and lag.
Data drift: regularly reconcile read models with the event store to catch missed events or dedup issues.
Privacy: avoid storing sensitive fields in events; redact or hash identifiers when possible.

Illustration: the data flow at a glance

Event sources (frontend, services) -> Ingestion (Kafka) -> Event Store (append-only log) -> Projections (read models) -> Dashboards/BI (read stores) -> Data Lakehouse (raw + curated tables) -> ML/Analytics.

If you’d like, I can tailor this blueprint to a specific domain (e-commerce, SaaS analytics, or gaming) and provide a concrete code scaffold with a chosen tech stack (Kafka + Flink + PostgreSQL for reads, Delta Lake for lakehouse, etc.). Would you prefer a stack aligned to cloud-native services (e.g., AWS or Azure) or an open-source on-prem approach?

Rizwan Saleem | https://rizwansaleem.co