Archibald Duskbottle

How We Built a GA4-Compatible Analytics Pipeline to Escape US Tech Lock-in


Google Analytics is everywhere. It's also a deal-breaker for a growing number of teams.

Under GDPR and the post-Schrems II landscape, sending EU visitor data to Google's US infrastructure is legally murky at best. For healthcare organizations under HIPAA or government sites under FedRAMP, it's a non-starter.

The usual answer is to switch to a privacy-friendly alternative. The problem: most of them require you to throw away your existing tracking plan and start over. If you've invested in a GA4 setup - event taxonomy, GTM configuration, custom dimensions - that's a real switching cost.

One hard requirement

We built d8a around a single constraint: it had to speak GA4's protocol natively. Same /g/collect endpoint, same parameters. If you're already sending data to Google, you're already sending it in the right format for d8a. No rewrites, no migration weekend.
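To make "same endpoint, same parameters" concrete, here is a sketch of what a GA4-style hit looks like on the wire. The core query parameters (v=2, tid, cid, en) are the ones gtag.js sends; a real hit carries many more, and the host and IDs below are placeholders - the point is that only the host changes when you redirect traffic to a self-hosted collector.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildCollectURL assembles a minimal GA4-style /g/collect hit.
// Swapping the host is the whole "migration": the parameter format stays.
func buildCollectURL(host, measurementID, clientID, eventName string) string {
	q := url.Values{}
	q.Set("v", "2")             // protocol version
	q.Set("tid", measurementID) // measurement ID (G-XXXXXXX)
	q.Set("cid", clientID)      // pseudonymous client ID
	q.Set("en", eventName)      // event name, e.g. page_view
	return fmt.Sprintf("https://%s/g/collect?%s", host, q.Encode())
}

func main() {
	fmt.Println(buildCollectURL("analytics.example.com", "G-XXXXXXX", "555.123", "page_view"))
}
```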

How it's put together

The pipeline has three moving parts: a tracking component that turns HTTP requests into hits, a queue that buffers them, and a processing component that closes sessions and writes to your warehouse.

The transport is pluggable. By default the two components communicate through the filesystem - cheap, simple, works on a single VPS. For HA deployments (how our cloud runs it), you swap in object storage and the processes scale independently. A RabbitMQ driver would fit the same interface if you need low latency at high throughput. The tradeoffs are real: the filesystem is fine for a single node but rules out HA; object storage costs pennies and unlocks horizontal scaling; a message broker gets you the lowest latency but brings either maintenance overhead or a bigger cloud bill.
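The transport contract can be sketched as a small producer/consumer interface - the tracker pushes raw hits, the processor pops them, and filesystem, object storage, or a broker each satisfy the same shape. The interface and method names here are illustrative, not d8a's actual API; the in-memory implementation stands in for a real driver.

```go
package main

import (
	"fmt"
	"sync"
)

// Transport is the hypothetical queue contract between the two components.
type Transport interface {
	Push(hit []byte) error
	Pop() ([]byte, bool) // false when the queue is empty
}

// memTransport is an in-memory stand-in, useful for tests; a filesystem or
// object-storage driver would implement the same two methods.
type memTransport struct {
	mu   sync.Mutex
	hits [][]byte
}

func (t *memTransport) Push(hit []byte) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.hits = append(t.hits, hit)
	return nil
}

func (t *memTransport) Pop() ([]byte, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if len(t.hits) == 0 {
		return nil, false
	}
	h := t.hits[0]
	t.hits = t.hits[1:]
	return h, true
}

func main() {
	var q Transport = &memTransport{}
	q.Push([]byte("en=page_view"))
	if hit, ok := q.Pop(); ok {
		fmt.Printf("processing hit: %s\n", hit)
	}
}
```

Because the processor only sees this interface, swapping filesystem for object storage is a deployment decision, not a code change.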

The session engine has a pluggable batched KV interface. Session state - grouping hits by visitor, tracking inactivity windows - lives behind an interface with a complete blackbox test suite. The default implementation uses BoltDB: embedded, no external process, runs anywhere. Swapping it for something distributed (Redis, Cassandra, whatever fits your infra) is a matter of passing a different implementation that satisfies the same contract.
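The batched KV contract might look like the sketch below: reads and writes move in batches so a distributed backend can amortize round trips. Names are hypothetical, not d8a's real interface; the map-backed store plays the role BoltDB fills by default.

```go
package main

import "fmt"

// SessionStore is a hypothetical batched KV contract for session state.
// A BoltDB, Redis, or Cassandra implementation would satisfy the same
// methods and pass the same blackbox test suite.
type SessionStore interface {
	GetBatch(keys []string) (map[string][]byte, error)
	SetBatch(entries map[string][]byte) error
}

// memStore is a trivial in-memory implementation of the contract.
type memStore struct{ data map[string][]byte }

func newMemStore() *memStore { return &memStore{data: map[string][]byte{}} }

func (s *memStore) GetBatch(keys []string) (map[string][]byte, error) {
	out := make(map[string][]byte, len(keys))
	for _, k := range keys {
		if v, ok := s.data[k]; ok {
			out[k] = v // missing keys are simply absent from the result
		}
	}
	return out, nil
}

func (s *memStore) SetBatch(entries map[string][]byte) error {
	for k, v := range entries {
		s.data[k] = v
	}
	return nil
}

func main() {
	var store SessionStore = newMemStore()
	store.SetBatch(map[string][]byte{"visitor:42": []byte("last_hit=1700000000")})
	state, _ := store.GetBatch([]string{"visitor:42"})
	fmt.Printf("session state: %s\n", state["visitor:42"])
}
```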

The tracking component is protocol-agnostic. GA4 is the default, but the HTTP path-to-protocol mapping is an abstraction - adding a new ingest protocol is a matter of implementing an interface. Matomo, Amplitude, or anything else with a defined HTTP tracking format could be a drop-in. This also means it can act as a self-hosted mirror for teams already on those platforms, intercepting existing tracking calls without touching the client.

Warehouse destinations: ClickHouse (fully self-hosted), BigQuery, or CSV files written to S3/MinIO, GCS, or local disk. The file path works with Snowflake Snowpipe, Redshift Spectrum, Databricks Auto Loader, DuckDB - if you already have a warehouse, you can pipe into it.

Deployment-wise: a single VPS is enough to get started. Our own cloud runs it on Kubernetes with the object storage transport between the two components.

Get started

d8a is open source, MIT licensed.
