Product analytics shouldn't require managing servers, containers, or complex infrastructure. Yet most self-hosted alternatives to tools like Plausible or Umami assume you'll spin up Docker containers, manage databases, and deal with scaling headaches.
I built this open source solution to change that. It's a fully serverless, self-hostable analytics platform that deploys into your own AWS account with a single CDK command.
No servers. No Docker builds. Minimal, predictable baseline cost.
You get privacy-focused analytics infrastructure that scales from zero to millions of events without operational overhead.
Important note
This repository provides a production-grade analytics ingestion pipeline, not a polished analytics SaaS. Event collection, buffering, replay, and storage are solid and designed for real workloads.
What is still evolving:
- authorization and multi-tenant access control
- the query / insights layer (dashboards, funnels, cohorts)
If you’re comfortable building on top of a strong foundation—or want to contribute—this project is for you.
In this post, I’ll walk through how analytics platforms work under the hood, explore two serverless architectures on AWS, and explain the trade-offs behind the approach I chose.
How Analytics Platforms Work
Every analytics system—whether it's Google Analytics, Plausible, or a custom solution—follows the same fundamental pattern:
Browser → Ingestion API → Buffer → Processor → Storage
This separation of concerns is what allows analytics systems to scale reliably without impacting application performance.
Collection (Browser)
A lightweight JavaScript snippet runs on your site and captures events: page views, clicks, web vitals. It batches these events and sends them to your backend using sendBeacon for reliability or fetch for flexibility.
The script should be tiny (ideally under 1 KB gzipped) so it doesn't affect page performance or Core Web Vitals.
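As an illustration, here is a minimal sketch of that collection logic in TypeScript. The endpoint path, event shape, and function names are assumptions for the example, not this project's actual API:

```typescript
// Minimal collection sketch: queue events, batch them, flush in one request.
// The event shape and the /collect endpoint are illustrative.
type AnalyticsEvent = {
  name: string;                    // e.g. "page_view", "click"
  ts: number;                      // client timestamp (ms)
  props?: Record<string, string>;  // optional dimensions
};

const queue: AnalyticsEvent[] = [];

function track(name: string, props?: Record<string, string>): void {
  queue.push({ name, ts: Date.now(), props });
}

// Serialize and empty the queue so one request carries many events.
function drain(): string | null {
  if (queue.length === 0) return null;
  return JSON.stringify(queue.splice(0, queue.length));
}

// Prefer sendBeacon (it survives page unload); fall back to keepalive fetch.
function flush(endpoint: string): void {
  const body = drain();
  if (body === null) return;
  const g = globalThis as any; // browser globals aren't typed outside the DOM
  if (g.navigator?.sendBeacon) {
    g.navigator.sendBeacon(endpoint, body);
  } else if (g.fetch) {
    g.fetch(endpoint, { method: "POST", body, keepalive: true }).catch(() => {});
  }
}
```

Wiring flush to visibilitychange or pagehide is what makes batching safe in practice: events queued right before navigation still get delivered.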
Ingestion API
An HTTP endpoint receives events from the browser. Its responsibilities should be minimal:
- validate the payload
- enrich it with metadata (e.g. geolocation from request headers)
- push the event downstream
The API should return immediately and never block on heavy processing or database writes.
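In code, those three responsibilities might look like this. The geo header and field names are assumptions for the sketch; the repository's actual handler is written in Rust:

```typescript
// Sketch of an ingestion handler's fast path: validate, enrich, forward.
// Field names and the geo header are illustrative, not the project's schema.
type RawEvent = { name?: unknown; ts?: unknown };
type EnrichedEvent = { name: string; ts: number; country?: string; receivedAt: number };

function validateAndEnrich(
  raw: RawEvent,
  headers: Record<string, string>,
  now: number = Date.now()
): EnrichedEvent | null {
  // Reject malformed payloads cheaply, before doing any I/O.
  if (typeof raw.name !== "string" || typeof raw.ts !== "number") return null;
  return {
    name: raw.name,
    ts: raw.ts,
    // CloudFront-style viewer-country header; exact header is deployment-specific.
    country: headers["cloudfront-viewer-country"],
    receivedAt: now,
  };
}
```

The handler would then push the enriched event to the buffer and return immediately; everything slow happens downstream.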
Buffer
The buffer decouples ingestion from processing.
Events are written to a queue or stream so your ingestion API remains fast even during traffic spikes. This layer absorbs bursts, smooths load, and allows downstream consumers to process events at their own pace.
Processor
A worker reads events from the buffer, transforms them into the shape your storage expects, and writes them out.
This is also where batching happens to reduce write amplification and keep storage costs under control.
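A micro-batcher of the kind described, sketched with made-up thresholds: flush when either the batch size or the maximum age is reached, trading a little latency for far fewer storage writes.

```typescript
// Accumulate items and flush on size or age, whichever comes first.
// Thresholds and the sink callback are illustrative.
class Batcher<T> {
  private buf: T[] = [];
  private firstAt = 0;

  constructor(
    private maxSize: number,
    private maxAgeMs: number,
    private sink: (batch: T[]) => void // e.g. a bulk INSERT into storage
  ) {}

  add(item: T, now: number = Date.now()): void {
    if (this.buf.length === 0) this.firstAt = now;
    this.buf.push(item);
    if (this.buf.length >= this.maxSize || now - this.firstAt >= this.maxAgeMs) {
      this.flush();
    }
  }

  flush(): void {
    if (this.buf.length === 0) return;
    this.sink(this.buf);
    this.buf = [];
  }
}
```

With Kinesis, the Lambda event source already delivers records in batches, so in practice this logic mostly matters on the write side, when grouping rows for the database.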
Storage
This is the query layer. It must handle analytical workloads efficiently:
- aggregations over time ranges
- grouping by dimensions (referrer, country, device, etc.)
Row-based databases work at small scale, but columnar stores like ClickHouse are dramatically more efficient as data volume grows.
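Here is the workload in miniature: a time-range filter plus a group-by over one dimension. This is exactly the access pattern a columnar store optimizes, because it scans only the columns involved instead of whole rows. Field names are illustrative:

```typescript
// Time-range filter + group-by count: the shape of most analytics queries.
type PageView = { ts: number; country: string };

function viewsByCountry(views: PageView[], from: number, to: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const v of views) {
    if (v.ts < from || v.ts >= to) continue; // half-open time range [from, to)
    counts.set(v.country, (counts.get(v.country) ?? 0) + 1);
  }
  return counts;
}
```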
Two Serverless Approaches on AWS
When designing this for AWS, I evaluated two architectures. Both are fully serverless, but they differ in cost characteristics, operational complexity, and replay capabilities.
Approach 1: EventBridge + SQS (Near-Zero Cost at Rest)
Browser → Lambda Function URL → EventBridge → SQS → Processor Lambda → Storage
                                           ↘ Lambda → S3 (raw archive)
This is the purest pay-per-request model.
EventBridge acts as the routing layer: one rule forwards events to SQS for processing, while another rule triggers a Lambda that archives raw events to S3.
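For illustration, the fan-out can be thought of as two rules matching one pattern. The bus, source, and detail-type names here are made up for the example, not taken from the repository:

```typescript
// Two EventBridge rules matching the same event pattern, each with its own
// target. Adding a consumer later means adding one more rule; nothing else
// in the pipeline changes.
const analyticsPattern = {
  source: ["analytics.ingest"],
  "detail-type": ["AnalyticsEvent"],
};

const rules = [
  { name: "to-processing-queue", pattern: analyticsPattern, target: "sqs:events-queue" },
  { name: "to-raw-archive", pattern: analyticsPattern, target: "lambda:s3-archiver" },
];
```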
Advantages:
- Near-zero cost when there’s no traffic
- Simple mental model with declarative routing rules
- Easy extensibility—new consumers are just new EventBridge rules
Trade-offs:
- No built-in replay mechanism
- Reprocessing requires manual replay from S3
- Limited control over batching semantics
This approach works well for side projects, low-traffic sites, or scenarios where minimizing idle cost is the top priority.
Approach 2: Kinesis Data Streams + Firehose (What I Built)
Browser → Lambda Function URL → Kinesis Data Stream → Firehose → S3
                                                   ↘ Lambda → Storage
This is the architecture I chose.
Kinesis Data Streams acts as the central event log. Firehose handles archival to S3 automatically, while a Lambda consumer processes events and writes them to the analytics database.
Advantages:
- Built-in replay via configurable retention (24 hours by default, extendable up to 365 days)
- Strict ordering guarantees within partitions (important for session reconstruction)
- Seamless Firehose integration for batching, compression, and delivery to S3
- Predictable throughput and backpressure via shards
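The ordering guarantee hinges on the partition key: records sharing a key land on the same shard, in order. A sketch of the keying strategy, with illustrative field names:

```typescript
// Keying by session keeps each visitor's event stream ordered on one shard,
// which is what session reconstruction relies on. Field names are illustrative.
type IncomingEvent = { sessionId?: string; anonymousId?: string };

function partitionKeyFor(e: IncomingEvent): string {
  // Prefer the most specific stable identifier available.
  return e.sessionId ?? e.anonymousId ?? "unattributed";
}
```

The flip side: one very hot key maps to one shard, so extremely skewed traffic may need a composite key at the cost of per-session ordering.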
Trade-offs:
- Not zero-cost at rest (a single provisioned shard runs about $11/month)
- Requires basic capacity planning
I chose this approach because replayability and operational simplicity matter more than absolute zero idle cost for a production analytics system. The baseline cost is predictable, and the architecture scales cleanly as traffic grows.
Storage Layer
By default, the project uses Amazon Aurora DSQL. I chose it to experiment with a fully serverless SQL database.
It works—but for analytical workloads, ClickHouse is the better choice.
Columnar storage, compression, and built-in aggregation functions make a significant difference for time-series analytics.
The storage layer is abstracted behind an interface, so swapping backends is a configuration change. For real-world traffic, I recommend pointing the system at ClickHouse (ClickHouse Cloud or self-hosted) instead of DSQL.
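Conceptually, the abstraction looks something like this. The real interface lives in the repository's Rust code; the names below are illustrative:

```typescript
// A storage backend only needs to accept batches of events; anything that
// satisfies the interface (DSQL, ClickHouse, in-memory for tests) plugs in.
type StoredEvent = { name: string; ts: number };

interface AnalyticsStore {
  writeBatch(events: StoredEvent[]): Promise<void>;
}

// Trivial implementation, handy for tests and local development.
class InMemoryStore implements AnalyticsStore {
  events: StoredEvent[] = [];
  async writeBatch(batch: StoredEvent[]): Promise<void> {
    this.events.push(...batch);
  }
}
```

Swapping DSQL for ClickHouse then becomes a wiring change, not a rewrite.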
Getting Started
The entire stack deploys with a single command:
make deploy
This provisions:
- the ingestion API (built in Rust)
- buffering infrastructure (Kinesis + Firehose)
- raw event archival to S3
- a query API defined with OpenAPI, with sane defaults
Everything is defined in CDK and can be customized via configuration without touching the core architecture.
The repository is available on GitHub:
👉 https://github.com/boringContributor/aws-serverless-product-analytics
Contributing & Roadmap
This project is intentionally modular and open for contributions.
Areas where help is especially valuable:
- authorization & multi-tenant access control
- query API design (funnels, breakdowns, cohorts)
- ClickHouse schemas and query optimizations
- dashboard and visualization experiments
If you’re interested, issues are labeled and the architecture is documented to make onboarding easier.
Wrapping Up
Serverless analytics on AWS is not only possible—it’s practical.
You get the privacy and control benefits of self-hosting without managing servers, containers, or always-on infrastructure. Whether you choose a near-zero-cost EventBridge pipeline or a replay-friendly Kinesis-based architecture depends on your traffic patterns and tolerance for baseline cost.
The code is open source. Deploy it, fork it, or use it as a reference for building your own event ingestion pipelines.
If you have questions or want to adapt this setup to your needs, feel free to book a quick call for a one-time collaboration:
👉 https://cal.com/someone