Product analytics shouldn't require managing servers, containers, or complex infrastructure. Yet most self-hosted alternatives to tools like Plausible or Umami assume you'll spin up Docker containers, manage databases, and deal with scaling headaches.
I built this open source solution to change that. It's a fully serverless, self-hostable analytics platform that deploys into your own AWS account with a single CDK command.
No servers. No Docker builds. Minimal, predictable baseline cost.
You get privacy-focused analytics infrastructure that scales from zero to millions of events without operational overhead.
Important note
This repository provides a production-grade analytics ingestion pipeline, not a polished analytics SaaS. Event collection, buffering, replay, and storage are solid and designed for real workloads.
What is still evolving:
- authorization and multi-tenant access control
- the query / insights layer (dashboards, funnels, cohorts)
If you’re comfortable building on top of a strong foundation—or want to contribute—this project is for you.
In this post, I’ll walk through how analytics platforms work under the hood, explore two serverless architectures on AWS, and explain the trade-offs behind the approach I chose.
How Analytics Platforms Work
Every analytics system—whether it's Google Analytics, Plausible, or a custom solution—follows the same fundamental pattern:
Browser → Ingestion API → Buffer → Processor → Storage
This separation of concerns is what allows analytics systems to scale reliably without impacting application performance.
Collection (Browser)
A lightweight JavaScript snippet runs on your site and captures events: page views, clicks, web vitals. It batches these events and sends them to your backend using sendBeacon for reliability or fetch for flexibility.
The script should be tiny (ideally under 1 KB gzipped) so it doesn't affect page performance or Core Web Vitals.
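As an illustration, here is a minimal sketch of that collection logic in TypeScript. The endpoint path, event shape, and function names are assumptions for the example, not this project's actual API:

```typescript
// Minimal collection sketch: queue events, batch them, flush in one request.
// The event shape and the /collect endpoint are illustrative.
type AnalyticsEvent = {
  name: string;                    // e.g. "page_view", "click"
  ts: number;                      // client timestamp (ms)
  props?: Record<string, string>;  // optional dimensions
};

const queue: AnalyticsEvent[] = [];

function track(name: string, props?: Record<string, string>): void {
  queue.push({ name, ts: Date.now(), props });
}

// Serialize and empty the queue so one request carries many events.
function drain(): string | null {
  if (queue.length === 0) return null;
  return JSON.stringify(queue.splice(0, queue.length));
}

// Prefer sendBeacon (it survives page unload); fall back to keepalive fetch.
function flush(endpoint: string): void {
  const body = drain();
  if (body === null) return;
  const g = globalThis as any; // browser globals aren't typed outside the DOM
  if (g.navigator?.sendBeacon) {
    g.navigator.sendBeacon(endpoint, body);
  } else if (g.fetch) {
    g.fetch(endpoint, { method: "POST", body, keepalive: true }).catch(() => {});
  }
}
```

Wiring flush to visibilitychange or pagehide is what makes batching safe in practice: events queued right before navigation still get delivered.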
Ingestion API
An HTTP endpoint receives events from the browser. Its responsibilities should be minimal:
- validate the payload
- enrich it with metadata (e.g. geolocation from request headers)
- push the event downstream
The API should return immediately and never block on heavy processing or database writes.
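In code, those three responsibilities might look like this. The geo header and field names are assumptions for the sketch; the repository's actual handler is written in Rust:

```typescript
// Sketch of an ingestion handler's fast path: validate, enrich, forward.
// Field names and the geo header are illustrative, not the project's schema.
type RawEvent = { name?: unknown; ts?: unknown };
type EnrichedEvent = { name: string; ts: number; country?: string; receivedAt: number };

function validateAndEnrich(
  raw: RawEvent,
  headers: Record<string, string>,
  now: number = Date.now()
): EnrichedEvent | null {
  // Reject malformed payloads cheaply, before doing any I/O.
  if (typeof raw.name !== "string" || typeof raw.ts !== "number") return null;
  return {
    name: raw.name,
    ts: raw.ts,
    // CloudFront-style viewer-country header; exact header is deployment-specific.
    country: headers["cloudfront-viewer-country"],
    receivedAt: now,
  };
}
```

The handler would then push the enriched event to the buffer and return immediately; everything slow happens downstream.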
Buffer
The buffer decouples ingestion from processing.
Events are written to a queue or stream so your ingestion API remains fast even during traffic spikes. This layer absorbs bursts, smooths load, and allows downstream consumers to process events at their own pace.
Processor
A worker reads events from the buffer, transforms them into the shape your storage expects, and writes them out.
This is also where batching happens to reduce write amplification and keep storage costs under control.
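A micro-batcher of the kind described, sketched with made-up thresholds: flush when either the batch size or the maximum age is reached, trading a little latency for far fewer storage writes.

```typescript
// Accumulate items and flush on size or age, whichever comes first.
// Thresholds and the sink callback are illustrative.
class Batcher<T> {
  private buf: T[] = [];
  private firstAt = 0;

  constructor(
    private maxSize: number,
    private maxAgeMs: number,
    private sink: (batch: T[]) => void // e.g. a bulk INSERT into storage
  ) {}

  add(item: T, now: number = Date.now()): void {
    if (this.buf.length === 0) this.firstAt = now;
    this.buf.push(item);
    if (this.buf.length >= this.maxSize || now - this.firstAt >= this.maxAgeMs) {
      this.flush();
    }
  }

  flush(): void {
    if (this.buf.length === 0) return;
    this.sink(this.buf);
    this.buf = [];
  }
}
```

With Kinesis, the Lambda event source already delivers records in batches, so in practice this logic mostly matters on the write side, when grouping rows for the database.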
Storage
This is the query layer. It must handle analytical workloads efficiently:
- aggregations over time ranges
- grouping by dimensions (referrer, country, device, etc.)
Row-based databases work at small scale, but columnar stores like ClickHouse are dramatically more efficient as data volume grows.
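Here is the workload in miniature: a time-range filter plus a group-by over one dimension. This is exactly the access pattern a columnar store optimizes, because it scans only the columns involved instead of whole rows. Field names are illustrative:

```typescript
// Time-range filter + group-by count: the shape of most analytics queries.
type PageView = { ts: number; country: string };

function viewsByCountry(views: PageView[], from: number, to: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const v of views) {
    if (v.ts < from || v.ts >= to) continue; // half-open time range [from, to)
    counts.set(v.country, (counts.get(v.country) ?? 0) + 1);
  }
  return counts;
}
```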
Two Serverless Approaches on AWS
When designing this for AWS, I evaluated two architectures. Both are fully serverless, but they differ in cost characteristics, operational complexity, and replay capabilities.
Approach 1: EventBridge + SQS (Near-Zero Cost at Rest)
Browser → Lambda Function URL → EventBridge → SQS → Processor Lambda → Storage
                                           ↘ Lambda → S3 (raw archive)
This is the purest pay-per-request model.
EventBridge acts as the routing layer: one rule forwards events to SQS for processing, while another rule triggers a Lambda that archives raw events to S3.
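For illustration, the fan-out can be thought of as two rules matching one pattern. The bus, source, and detail-type names here are made up for the example, not taken from the repository:

```typescript
// Two EventBridge rules matching the same event pattern, each with its own
// target. Adding a consumer later means adding one more rule; nothing else
// in the pipeline changes.
const analyticsPattern = {
  source: ["analytics.ingest"],
  "detail-type": ["AnalyticsEvent"],
};

const rules = [
  { name: "to-processing-queue", pattern: analyticsPattern, target: "sqs:events-queue" },
  { name: "to-raw-archive", pattern: analyticsPattern, target: "lambda:s3-archiver" },
];
```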
Advantages:
- Near-zero cost when there’s no traffic
- Simple mental model with declarative routing rules
- Easy extensibility—new consumers are just new EventBridge rules
Trade-offs:
- No built-in replay mechanism
- Reprocessing requires manual replay from S3
- Limited control over batching semantics
This approach works well for side projects, low-traffic sites, or scenarios where minimizing idle cost is the top priority.
Approach 2: Kinesis Data Streams + Firehose (What I Built)
Browser → Lambda Function URL → Kinesis Data Stream → Firehose → S3
                                                   ↘ Lambda → Storage
This is the architecture I chose.
Kinesis Data Streams acts as the central event log. Firehose handles archival to S3 automatically, while a Lambda consumer processes events and writes them to the analytics database.
Advantages:
- Built-in replay via configurable retention (24 hours by default, extendable up to 365 days)
- Strict ordering guarantees within partitions (important for session reconstruction)
- Seamless Firehose integration for batching, compression, and delivery to S3
- Predictable throughput and backpressure via shards
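The ordering guarantee hinges on the partition key: records sharing a key land on the same shard, in order. A sketch of the keying strategy, with illustrative field names:

```typescript
// Keying by session keeps each visitor's event stream ordered on one shard,
// which is what session reconstruction relies on. Field names are illustrative.
type IncomingEvent = { sessionId?: string; anonymousId?: string };

function partitionKeyFor(e: IncomingEvent): string {
  // Prefer the most specific stable identifier available.
  return e.sessionId ?? e.anonymousId ?? "unattributed";
}
```

The flip side: one very hot key maps to one shard, so extremely skewed traffic may need a composite key at the cost of per-session ordering.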
Trade-offs:
- Not zero-cost at rest (a single provisioned shard runs about $11/month)
- Requires basic capacity planning
I chose this approach because replayability and operational simplicity matter more than absolute zero idle cost for a production analytics system. The baseline cost is predictable, and the architecture scales cleanly as traffic grows.
Storage Layer
By default, the project uses Amazon Aurora DSQL. I chose it to experiment with a fully serverless SQL database.
It works—but for analytical workloads, ClickHouse is the better choice.
Columnar storage, compression, and built-in aggregation functions make a significant difference for time-series analytics.
The storage layer is abstracted behind an interface, so swapping backends is a configuration change. For real-world traffic, I recommend pointing the system at ClickHouse (ClickHouse Cloud or self-hosted) instead of DSQL.
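Conceptually, the abstraction looks something like this. The real interface lives in the repository's Rust code; the names below are illustrative:

```typescript
// A storage backend only needs to accept batches of events; anything that
// satisfies the interface (DSQL, ClickHouse, in-memory for tests) plugs in.
type StoredEvent = { name: string; ts: number };

interface AnalyticsStore {
  writeBatch(events: StoredEvent[]): Promise<void>;
}

// Trivial implementation, handy for tests and local development.
class InMemoryStore implements AnalyticsStore {
  events: StoredEvent[] = [];
  async writeBatch(batch: StoredEvent[]): Promise<void> {
    this.events.push(...batch);
  }
}
```

Swapping DSQL for ClickHouse then becomes a wiring change, not a rewrite.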
Getting Started
The entire stack deploys with a single command:
make deploy
This provisions:
- the ingestion API (built in Rust)
- buffering infrastructure (Kinesis + Firehose)
- raw event archival to S3
- a query API defined with OpenAPI, with sane defaults
Everything is defined in CDK and can be customized via configuration without touching the core architecture.
The repository is available on GitHub:
👉 https://github.com/boringContributor/aws-serverless-product-analytics
Contributing & Roadmap
This project is intentionally modular and open for contributions.
Areas where help is especially valuable:
- authorization & multi-tenant access control
- query API design (funnels, breakdowns, cohorts)
- ClickHouse schemas and query optimizations
- dashboard and visualization experiments
If you’re interested, issues are labeled and the architecture is documented to make onboarding easier.
Wrapping Up
Serverless analytics on AWS is not only possible—it’s practical.
You get the privacy and control benefits of self-hosting without managing servers, containers, or always-on infrastructure. Whether you choose a near-zero-cost EventBridge pipeline or a replay-friendly Kinesis-based architecture depends on your traffic patterns and tolerance for baseline cost.
The code is open source. Deploy it, fork it, or use it as a reference for building your own event ingestion pipelines.
If you have questions or want to adapt this setup to your needs, feel free to book a quick call for a one-time collaboration:
👉 https://cal.com/someone