Designing a scalable event-sourced analytics platform with CQRS and a data lakehouse pattern
Designing a scalable event-sourced analytics platform with CQRS and a data lakehouse pattern
In this tutorial, you’ll learn how to design a scalable analytics system that handles high-cardinality event streams, supports flexible ad-hoc analytics, and remains maintainable as data volumes grow. We’ll walk through architectural decisions, data modeling, data flow, and concrete code examples to implement a lean, production-ready system.
Overview and goals
- Handle large, rapid streams of user and system events with low latency ingestion.
- Support both real-time dashboards and long-running analytics jobs.
- Provide flexible, ad-hoc query capabilities without sacrificing write throughput.
- Ensure data consistency for critical counts while enabling eventual consistency for exploratory analytics.
- Enable fault tolerance, observability, and operational simplicity.
Architectural outline
- Core ideas: event sourcing for immutable records, Command-Query Responsibility Segregation (CQRS) to separate write paths from read paths, and a data lakehouse for flexible analytics.
- Ingest layer: a high-throughput event bus, a compacted log, and backpressure-aware producers.
- Write model: write side stores events; maintain materialized views (projections) for common queries.
- Read model: fast, query-optimized stores for dashboards, plus a data lakehouse for large-scale analytics.
- Storage tiers: hot (streams and recent projections), warm (aggregations and precomputed summaries), cold (immutable event history and raw logs).
- Observability: structured tracing, metrics, and lineage for data quality.
Detailed design
1) Event schema and streamlining
- Event envelope: each event has an id, type, timestamp, correlation id, and metadata.
- Event types should be additive and idempotent where possible.
- Use a stable, evolving schema with a schema registry (e.g., Confluent Schema Registry or an in-house JSONSchema registry) to manage versions without breaking producers/consumers.
Example event envelope (typescript-like shape):
{
"event_id": "evt_12345",
"type": "page_view",
"timestamp": "2026-06-02T18:10:00.000Z",
"correlation_id": "req_98765",
"user_id": "u_abc",
"properties": {
"page": "/product/123",
"referrer": "https://search.example.com",
"device": "mobile",
"ip": "203.0.113.42"
},
"metadata": {
"source": "frontend",
"version": 2
}
}
- Cardinality considerations: store user_id, session_id, and event type distinctly to enable efficient grouping and filtering.
- Idempotency: deduplicate using a global event_id channel; design producers to be idempotent where possible.
2) Ingestion layer
- Use a robust message bus (e.g., Apache Kafka, or a managed alternative like Kinesis or Pulsar).
- Backpressure: producers should be able to slow down or buffer; use partitioning to scale ingestion and ensure ordered processing per entity (e.g., user_id partitioning).
- Exactly-once delivery is hard; aim for at-least-once with idempotent processors and deduplication at the projection layer.
Code sketch (pseudo-typed with Node.js) for a Kafka producer:
const { Kafka } = require('kafkajs');
const kafka = new Kafka({ clientId: 'analytics-ingest', brokers: ['kafka:9092'] });
const producer = kafka.producer();
async function publishEvent(event) {
await producer.connect();
await producer.send({
topic: 'events',
messages: [{ key: event.event_id, value: JSON.stringify(event) }]
});
await producer.disconnect();
}
- Partition strategy: partition by user_id and/or event_type to maintain order for a given key, while enabling parallelism across keys.
3) Write model (CQRS write side)
- Commands and events: domain logic emits events rather than updating a database directly.
- Event store: an append-only log that stores all events. You can store raw events plus a compacted “state transition” event for common aggregates.
- Projections: materialized views that consume the event stream to produce query-optimized read models.
Example domain: product view analytics
- Events: page_view, product_view, add_to_cart, purchase
- Projections: per-user session aggregates, product popularity, cohort metrics
Projection design tips:
- Use a time-series oriented store for rolling aggregations (e.g., on a per-hour basis).
- Maintain exactly-countable metrics for dashboards (e.g., unique visitors) using a combination of user_id and a windowed approach to approximate distinct counts (HyperLogLog or streaming approximate distinct counts).
- Persist projections to a fast read store (e.g., a columnar store or a NoSQL database optimized for reads).
Code sketch: Node.js projection consuming events
// pseudo-code
const readModelStore = connectToDatabase('read-model');
async function handleEvent(event) {
switch (event.type) {
case 'page_view':
await updatePageViewProjection(event);
break;
case 'purchase':
await updatePurchaseProjection(event);
break;
// more event types...
}
}
async function updatePageViewProjection(e) {
const { user_id, timestamp } = e;
// Example: hourly page views per user
const bucket = new Date(timestamp);
bucket.setMinutes(0, 0, 0);
await readModelStore.increment('user_page_views', { user_id, bucket }, 1);
}
4) Read model (queries and dashboards)
- Hot path: optimized read stores for dashboards and ad-hoc queries. Use indexed SQL data warehouse or a fast NoSQL store with rich query capabilities.
- Cold path: store event logs in a data lake (e.g., Parquet on S3) for long-term analytics and machine learning.
- Data lakehouse approach: keep raw events in the data lake, and maintain curated tables in a data warehouse (e.g., Delta Lake, Apache Iceberg) for consistent analytics across teams.
Example reads
- Daily active users by region
- Revenue by product category
- Funnel analysis for a conversion flow
5) Data lakehouse integration
- Ingest raw events into a data lake (S3/Blob) with partitioning by date.
- Build a metadata catalog and use a data warehouse layer (e.g., Delta Lake) to enable ACID transactions on table updates.
- Use ETL jobs to transform events into analytics-ready tables, keeping both grain-level and aggregated views.
6) Consistency trade-offs
- Write-side consistency: prefer immutable events; maintain idempotent projections to avoid double-counting.
- Read-side consistency: for dashboards, consider lag windows (e.g., dashboards reflect data up to the previous hour) to allow projections to settle.
- Exactly-once is tough in distributed systems; implement deduplication and reconciliation jobs overnight to fix anomalies.
7) Operational considerations
- Observability: trace event lifecycles across ingestion, projection, and read layers; collect metrics on throughput, lag, error rates, and backlog depth.
- Schema evolution: versioned events; implement backward-compatible changes and a migration plan for projections.
- Security and governance: enforce least privilege on data stores; mask PII where not needed; audit access to sensitive metrics.
- Testing: use replay-based tests with mocked event streams to verify projections against a known event history.
8) Example end-to-end flow
- Frontend or backend service publishes a page_view event to the events topic.
- Ingestion layer consumes, ensures idempotency, and stores the event in the immutable event store.
- Projection consumer updates user_hourly_page_views and product_popularity read models.
- Dashboards query the read models for real-time metrics; data analysts query the data warehouse for ad-hoc analytics, while raw events are stored in the data lake for ML workloads.
9) Implementation blueprint (stack-neutral)
- Ingestion: a durable message bus (Kafka/Kinesis/Pulsar) with backpressure and partitioning.
- Event store: immutable append-only log (local or cloud storage with an offset/index).
- Projections: streaming processors (e.g., KStreams, Flink, Spark Structured Streaming) that read from the event stream and write to read models.
- Read models: fast query stores (columnar DB, Redis for hot aggregates, or document stores with rich indexing).
- Data lakehouse: object store with Parquet/ORC files; a catalog layer (Hive/Glue/Metastore) and a warehouse layer (Delta/Iceberg) for ACID transactions.
10) A compact, practical code example: a minimal projection with Node.js and a streaming library
Note: this is a simplified illustration. In production, you'd use a robust streaming framework.
// projection.ts (simplified)
type Event = {
event_id: string;
type: string;
timestamp: string;
user_id?: string;
properties?: any;
};
async function processEvent(event: Event) {
switch (event.type) {
case 'page_view':
await upsertUserHourlyViews(event);
break;
case 'purchase':
await upsertRevenue(event);
break;
default:
// ignore other events for this projection
}
}
async function upsertUserHourlyViews(e: Event) {
const userId = e.user_id;
const hourBucket = new Date(e.timestamp);
hourBucket.setMinutes(0, 0, 0);
// pseudo DB call
await db.increment('user_hourly_views', { user_id: userId, bucket: hourBucket }, 1);
}
async function upsertRevenue(e: Event) {
const amount = e.properties?.amount || 0;
const product = e.properties?.product_id || 'unknown';
const hourBucket = new Date(e.timestamp);
hourBucket.setMinutes(0, 0, 0);
await db.increment('hourly_revenue', { bucket: hourBucket, product }, amount);
}
// simulate consuming
async function run() {
const events = await fetchFromEventStream(); // fetch events from topic
for (const ev of events) {
await processEvent(ev);
}
}
run().catch(console.error);
11) Deployment and evolution notes
- Start small: implement one critical projection (e.g., daily active users) and a single dashboard to validate the flow.
- Add more projections incrementally, ensuring backward compatibility when introducing new event types.
- Use feature flags to enable/disable new projections during testing.
12) Practical tips and pitfalls
- Backpressure management: design producers to buffer or throttle; monitor queue depth and lag.
- Data drift: regularly reconcile read models with the event store to catch missed events or dedup issues.
- Privacy: avoid storing sensitive fields in events; redact or hash identifiers when possible.
Illustration: the data flow at a glance
- Event sources (frontend, services) -> Ingestion (Kafka) -> Event Store (append-only log) -> Projections (read models) -> Dashboards/BI (read stores) -> Data Lakehouse (raw + curated tables) -> ML/Analytics.
If you’d like, I can tailor this blueprint to a specific domain (e-commerce, SaaS analytics, or gaming) and provide a concrete code scaffold with a chosen tech stack (Kafka + Flink + PostgreSQL for reads, Delta Lake for lakehouse, etc.). Would you prefer a stack aligned to cloud-native services (e.g., AWS or Azure) or an open-source on-prem approach?
-
Rizwan Saleem | https://rizwansaleem.co
Top comments (0)