DEV Community

Rizwan Saleem
Rizwan Saleem

Posted on

Designing a scalable event-sourced analytics platform with CQRS and a data lakehouse pattern

Designing a scalable event-sourced analytics platform with CQRS and a data lakehouse pattern

Designing a scalable event-sourced analytics platform with CQRS and a data lakehouse pattern

In this tutorial, you’ll learn how to design a scalable analytics system that handles high-cardinality event streams, supports flexible ad-hoc analytics, and remains maintainable as data volumes grow. We’ll walk through architectural decisions, data modeling, data flow, and concrete code examples to implement a lean, production-ready system.

Overview and goals

  • Handle large, rapid streams of user and system events with low latency ingestion.
  • Support both real-time dashboards and long-running analytics jobs.
  • Provide flexible, ad-hoc query capabilities without sacrificing write throughput.
  • Ensure data consistency for critical counts while enabling eventual consistency for exploratory analytics.
  • Enable fault tolerance, observability, and operational simplicity.

Architectural outline

  • Core ideas: event sourcing for immutable records, Command-Query Responsibility Segregation (CQRS) to separate write paths from read paths, and a data lakehouse for flexible analytics.
  • Ingest layer: a high-throughput event bus, a compacted log, and backpressure-aware producers.
  • Write model: write side stores events; maintain materialized views (projections) for common queries.
  • Read model: fast, query-optimized stores for dashboards, plus a data lakehouse for large-scale analytics.
  • Storage tiers: hot (streams and recent projections), warm (aggregations and precomputed summaries), cold (immutable event history and raw logs).
  • Observability: structured tracing, metrics, and lineage for data quality.

Detailed design

1) Event schema and streamlining

  • Event envelope: each event has an id, type, timestamp, correlation id, and metadata.
  • Event types should be additive and idempotent where possible.
  • Use a stable, evolving schema with a schema registry (e.g., Confluent Schema Registry or an in-house JSONSchema registry) to manage versions without breaking producers/consumers.

Example event envelope (typescript-like shape):
{
"event_id": "evt_12345",
"type": "page_view",
"timestamp": "2026-06-02T18:10:00.000Z",
"correlation_id": "req_98765",
"user_id": "u_abc",
"properties": {
"page": "/product/123",
"referrer": "https://search.example.com",
"device": "mobile",
"ip": "203.0.113.42"
},
"metadata": {
"source": "frontend",
"version": 2
}
}

  • Cardinality considerations: store user_id, session_id, and event type distinctly to enable efficient grouping and filtering.
  • Idempotency: deduplicate using a global event_id channel; design producers to be idempotent where possible.

2) Ingestion layer

  • Use a robust message bus (e.g., Apache Kafka, or a managed alternative like Kinesis or Pulsar).
  • Backpressure: producers should be able to slow down or buffer; use partitioning to scale ingestion and ensure ordered processing per entity (e.g., user_id partitioning).
  • Exactly-once delivery is hard; aim for at-least-once with idempotent processors and deduplication at the projection layer.

Code sketch (pseudo-typed with Node.js) for a Kafka producer:
const { Kafka } = require('kafkajs');
const kafka = new Kafka({ clientId: 'analytics-ingest', brokers: ['kafka:9092'] });
const producer = kafka.producer();

async function publishEvent(event) {
await producer.connect();
await producer.send({
topic: 'events',
messages: [{ key: event.event_id, value: JSON.stringify(event) }]
});
await producer.disconnect();
}

  • Partition strategy: partition by user_id and/or event_type to maintain order for a given key, while enabling parallelism across keys.

3) Write model (CQRS write side)

  • Commands and events: domain logic emits events rather than updating a database directly.
  • Event store: an append-only log that stores all events. You can store raw events plus a compacted “state transition” event for common aggregates.
  • Projections: materialized views that consume the event stream to produce query-optimized read models.

Example domain: product view analytics

  • Events: page_view, product_view, add_to_cart, purchase
  • Projections: per-user session aggregates, product popularity, cohort metrics

Projection design tips:

  • Use a time-series oriented store for rolling aggregations (e.g., on a per-hour basis).
  • Maintain exactly-countable metrics for dashboards (e.g., unique visitors) using a combination of user_id and a windowed approach to approximate distinct counts (HyperLogLog or streaming approximate distinct counts).
  • Persist projections to a fast read store (e.g., a columnar store or a NoSQL database optimized for reads).

Code sketch: Node.js projection consuming events
// pseudo-code
const readModelStore = connectToDatabase('read-model');

async function handleEvent(event) {
switch (event.type) {
case 'page_view':
await updatePageViewProjection(event);
break;
case 'purchase':
await updatePurchaseProjection(event);
break;
// more event types...
}
}

async function updatePageViewProjection(e) {
const { user_id, timestamp } = e;
// Example: hourly page views per user
const bucket = new Date(timestamp);
bucket.setMinutes(0, 0, 0);
await readModelStore.increment('user_page_views', { user_id, bucket }, 1);
}

4) Read model (queries and dashboards)

  • Hot path: optimized read stores for dashboards and ad-hoc queries. Use indexed SQL data warehouse or a fast NoSQL store with rich query capabilities.
  • Cold path: store event logs in a data lake (e.g., Parquet on S3) for long-term analytics and machine learning.
  • Data lakehouse approach: keep raw events in the data lake, and maintain curated tables in a data warehouse (e.g., Delta Lake, Apache Iceberg) for consistent analytics across teams.

Example reads

  • Daily active users by region
  • Revenue by product category
  • Funnel analysis for a conversion flow

5) Data lakehouse integration

  • Ingest raw events into a data lake (S3/Blob) with partitioning by date.
  • Build a metadata catalog and use a data warehouse layer (e.g., Delta Lake) to enable ACID transactions on table updates.
  • Use ETL jobs to transform events into analytics-ready tables, keeping both grain-level and aggregated views.

6) Consistency trade-offs

  • Write-side consistency: prefer immutable events; maintain idempotent projections to avoid double-counting.
  • Read-side consistency: for dashboards, consider lag windows (e.g., dashboards reflect data up to the previous hour) to allow projections to settle.
  • Exactly-once is tough in distributed systems; implement deduplication and reconciliation jobs overnight to fix anomalies.

7) Operational considerations

  • Observability: trace event lifecycles across ingestion, projection, and read layers; collect metrics on throughput, lag, error rates, and backlog depth.
  • Schema evolution: versioned events; implement backward-compatible changes and a migration plan for projections.
  • Security and governance: enforce least privilege on data stores; mask PII where not needed; audit access to sensitive metrics.
  • Testing: use replay-based tests with mocked event streams to verify projections against a known event history.

8) Example end-to-end flow

  • Frontend or backend service publishes a page_view event to the events topic.
  • Ingestion layer consumes, ensures idempotency, and stores the event in the immutable event store.
  • Projection consumer updates user_hourly_page_views and product_popularity read models.
  • Dashboards query the read models for real-time metrics; data analysts query the data warehouse for ad-hoc analytics, while raw events are stored in the data lake for ML workloads.

9) Implementation blueprint (stack-neutral)

  • Ingestion: a durable message bus (Kafka/Kinesis/Pulsar) with backpressure and partitioning.
  • Event store: immutable append-only log (local or cloud storage with an offset/index).
  • Projections: streaming processors (e.g., KStreams, Flink, Spark Structured Streaming) that read from the event stream and write to read models.
  • Read models: fast query stores (columnar DB, Redis for hot aggregates, or document stores with rich indexing).
  • Data lakehouse: object store with Parquet/ORC files; a catalog layer (Hive/Glue/Metastore) and a warehouse layer (Delta/Iceberg) for ACID transactions.

10) A compact, practical code example: a minimal projection with Node.js and a streaming library
Note: this is a simplified illustration. In production, you'd use a robust streaming framework.

// projection.ts (simplified)
type Event = {
event_id: string;
type: string;
timestamp: string;
user_id?: string;
properties?: any;
};

async function processEvent(event: Event) {
switch (event.type) {
case 'page_view':
await upsertUserHourlyViews(event);
break;
case 'purchase':
await upsertRevenue(event);
break;
default:
// ignore other events for this projection
}
}

async function upsertUserHourlyViews(e: Event) {
const userId = e.user_id;
const hourBucket = new Date(e.timestamp);
hourBucket.setMinutes(0, 0, 0);
// pseudo DB call
await db.increment('user_hourly_views', { user_id: userId, bucket: hourBucket }, 1);
}

async function upsertRevenue(e: Event) {
const amount = e.properties?.amount || 0;
const product = e.properties?.product_id || 'unknown';
const hourBucket = new Date(e.timestamp);
hourBucket.setMinutes(0, 0, 0);
await db.increment('hourly_revenue', { bucket: hourBucket, product }, amount);
}

// simulate consuming
async function run() {
const events = await fetchFromEventStream(); // fetch events from topic
for (const ev of events) {
await processEvent(ev);
}
}

run().catch(console.error);

11) Deployment and evolution notes

  • Start small: implement one critical projection (e.g., daily active users) and a single dashboard to validate the flow.
  • Add more projections incrementally, ensuring backward compatibility when introducing new event types.
  • Use feature flags to enable/disable new projections during testing.

12) Practical tips and pitfalls

  • Backpressure management: design producers to buffer or throttle; monitor queue depth and lag.
  • Data drift: regularly reconcile read models with the event store to catch missed events or dedup issues.
  • Privacy: avoid storing sensitive fields in events; redact or hash identifiers when possible.

Illustration: the data flow at a glance

  • Event sources (frontend, services) -> Ingestion (Kafka) -> Event Store (append-only log) -> Projections (read models) -> Dashboards/BI (read stores) -> Data Lakehouse (raw + curated tables) -> ML/Analytics.

If you’d like, I can tailor this blueprint to a specific domain (e-commerce, SaaS analytics, or gaming) and provide a concrete code scaffold with a chosen tech stack (Kafka + Flink + PostgreSQL for reads, Delta Lake for lakehouse, etc.). Would you prefer a stack aligned to cloud-native services (e.g., AWS or Azure) or an open-source on-prem approach?

-

Rizwan Saleem | https://rizwansaleem.co

Sources

Top comments (0)