
Taras Shynkarenko

So I Decided to Build My Own Analytics, This Is How It Went

Hey all, this is not AI written so you can keep on reading :)

So I needed analytics for my side projects. My first instinct was to connect PostHog, and it was great (I still use it to this day), but it's just too complicated for the simple analytics I wanted: country, origin, some UTMs, per-user attribution, entry page, pages, and revenue. Later I discovered that PostHog events are immutable, so I couldn't remove my fake test data from the analytics; to do that I'd need to write manual SQL filters all over the place. So I started looking for an alternative.

The first one I found was Plausible. I installed it, and all was great, but it didn't have the per-user attribution I really wanted. My next pick was DataFast; I'd seen it on Twitter, and it looked like it had exactly what I needed.

So I installed DataFast, added a proxy to catch all the visitors, and it turned out I actually collected much more. I'm not sure whether Plausible supported a proxy setup, but I remember not being able to set one up, so I kept DataFast.

Fast forward a couple of months. Traffic on my websites increased, and now I needed to pay $40 a month, while my whole infra cost is $150 including front-end, back-end, and emails. The greedy developer in me said: nah, I'm not paying ~$500 a year for analytics. For a moment I thought about moving to an alternative, but I'd lose all the data I'd already collected, the revenue attribution, the referrers, etc. So I decided to build it myself!

And so this is how it started.

I opened Claude Code, wrote one prompt, and it was done… jk, I’m not an 18yo from Twitter, so I’m not skilled enough to make Claude one-shot a website for me.

I got to work:

The first challenge was getting the data out of DataFast. They don't have a data export option (RED FLAG), so I had to write a very long script that paginates through all the exposed endpoints, collects the data, transforms it, and produces SQL I can run against my DB.
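The core of such a script is just a pagination loop. Here's a minimal sketch of the idea; the actual DataFast endpoints and response shapes aren't documented here, so the fetcher is injected and everything below is my own illustration:

```typescript
// Generic paginator: keep requesting pages until one comes back short.
// The fetcher signature (offset/limit) is an assumption; a real vendor
// API might use cursors instead.
type Page<T> = { rows: T[] };

async function paginateAll<T>(
  fetchPage: (offset: number, limit: number) => Promise<Page<T>>,
  limit = 500,
): Promise<T[]> {
  const all: T[] = [];
  for (let offset = 0; ; offset += limit) {
    const { rows } = await fetchPage(offset, limit);
    all.push(...rows);
    if (rows.length < limit) break; // short page means last page
  }
  return all;
}
```

The collected rows then get transformed into parameterized INSERT statements for the target Postgres schema (not shown here).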

For context I have a microservices architecture, so queues, Kafka, Redis, sockets, gateway, authentication and so on — all already done, along with the established patterns. On the front-end I have a monorepo with shared components, features, setups for forms, services etc. So all I really needed was to build the “core” analytics feature.

In a weekend I had a semi-working front-end with some data returned from the backend: a very ugly dashboard, a bunch of services, a new database, and no actual tracking.

Simple, a couple of days and I’m done…

Turned out the data returned from DataFast is quite broken and missing a lot of values. Connecting goals, revenue, and visitors became a nightmare. I connected my read-only DB via MCP, got a read-only key from my payment processor, and started the tedious process of re-attributing the data to match what was in DataFast. It took multiple days and still wasn't 100% right, since DataFast didn't expose all the data needed for proper attribution, but it was 95% right, so I moved on.

Now I started reviewing the boilerplate Claude wrote for the backend, and had to completely refactor the system, since Claude did the attribution with direct calls to Postgres (nice work): every visitor was a round trip to the database, every single one…

So I had to create a caching layer with custom flushes. Basically, all events go to Redis first and get flushed to the DB every ~30 seconds. So instead of bombarding the DB on every visit, it writes one modestly sized batched query per flush, even at scale. The flush itself uses a distributed Redis lock, so when I have multiple instances running, only one machine flushes at a time: no duplicate writes, no race conditions. On top of that, each flush processes the data in chunks of 5,000 records per SQL statement (Postgres has parameter limits), and if a chunk fails, it gets re-buffered back to Redis with a retry counter, up to 5 retries before it's dropped. So even if the DB hiccups mid-flush, no data is silently lost.
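The chunk-and-retry part of that flush can be sketched as pure logic, with the Redis and Postgres calls stubbed out as injected functions. The constants mirror the numbers above; everything else (names, shapes) is my own assumption, not the actual implementation:

```typescript
// CHUNK_SIZE keeps each multi-row INSERT under Postgres' bind-parameter
// cap (65535 parameters per statement); MAX_RETRIES bounds re-buffering.
const CHUNK_SIZE = 5000;
const MAX_RETRIES = 5;

interface BufferedEvent {
  payload: unknown;
  retries: number;
}

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function flush(
  events: BufferedEvent[],
  writeChunk: (c: BufferedEvent[]) => Promise<void>, // one multi-row INSERT
  rebuffer: (c: BufferedEvent[]) => Promise<void>,   // push back to Redis
): Promise<void> {
  for (const c of chunk(events, CHUNK_SIZE)) {
    try {
      await writeChunk(c);
    } catch {
      // Re-buffer with an incremented retry counter; events that have
      // exhausted MAX_RETRIES are dropped instead of looping forever.
      const retryable = c
        .map((e) => ({ ...e, retries: e.retries + 1 }))
        .filter((e) => e.retries <= MAX_RETRIES);
      if (retryable.length) await rebuffer(retryable);
    }
  }
}
```

The distributed lock itself is typically a Redis `SET key value NX PX ttl` held for the duration of the flush, so only one instance enters this loop at a time.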

ClickHouse would have solved that out of the box, but I didn't want to swap one vendor for another; the Redis setup is quite scalable on its own.

Next, extracting the data. LLMs seem to have absolutely no idea about the concept of the heap, because everything was loaded into memory and then iterated. With 100k+ events the heap would spike and my server would die, so I had to rewrite the thing with optimized queries, pagination, and batched requests. I also added a pre-aggregated daily rollup table: for historical queries where no filters are applied, the system reads from a compact summary table instead of scanning millions of raw sessions and pageviews. A simple optimization, but it made the dashboard feel instant for date ranges that don't include today.
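The routing rule for that rollup is small enough to show. This is a sketch under my own naming assumptions (`daily_rollup` / `raw_events` are illustrative); the decision logic follows the text: read from the summary table only when no filters are applied and the range ends before today (today's rollup row isn't final yet):

```typescript
interface QueryPlan {
  source: "daily_rollup" | "raw_events";
}

function planQuery(opts: {
  filters: Record<string, string>;
  rangeEnd: Date; // inclusive end of the requested date range
  now?: Date;
}): QueryPlan {
  const now = opts.now ?? new Date();
  // Compare calendar days (UTC), not timestamps.
  const endsBeforeToday =
    opts.rangeEnd.toISOString().slice(0, 10) < now.toISOString().slice(0, 10);
  const unfiltered = Object.keys(opts.filters).length === 0;
  return { source: unfiltered && endsBeforeToday ? "daily_rollup" : "raw_events" };
}
```

Any filter, or a range touching today, falls back to scanning the raw sessions and pageviews.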

Back to the front-end. Working with charts is quite underwhelming, so I had to spend quite a bit of time perfecting them. I'm a sucker for nice UI, so I couldn't leave it in a raw, non-animated state. Another thing that bugged me about DataFast was its absolutely terrible filter system; it was just terrible, unusable. The pristine example of filters is what PostHog has, so I ported that to my website. And another thing: rate limits.

When I used DataFast and moved back 3 days, I'd get a rate limit?! So I checked the network tab, and oh boy: 20 concurrent requests PER DAY of data (Red Flag). Moving to yesterday, do you think the previous requests are aborted? Nope, another 20. One more day and you have 60 concurrent requests to the DB, and you're rate limited. Wow, I haven't seen a missing signal abort in a prod application in ages (Red Flag). I kept that in mind as a hint for how bad their attribution actually is (spoiler alert: it's bad, but more on that later).

I optimized the requests from the FE down to just 5, all batched to fetch everything the dashboard needs, plus aborts when moving too fast between filters/views. My app was flying; I was impressed by how fast it worked now, and going back to the DataFast dashboard felt nightmarish.
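The abort part, the piece missing from the vendor's dashboard, is a few lines with the standard `AbortController` API. A minimal sketch: one controller per "latest query", so switching filters cancels the whole previous batch (the class name and shape are mine):

```typescript
// Keeps at most one dashboard query batch in flight: starting a new one
// aborts the previous one via its AbortSignal.
class LatestQuery {
  private controller: AbortController | null = null;

  run<T>(fetchBatch: (signal: AbortSignal) => Promise<T>): Promise<T> {
    this.controller?.abort(); // cancel the previous batch, if any
    this.controller = new AbortController();
    return fetchBatch(this.controller.signal);
  }
}
```

In practice every request in the batch passes the same signal, e.g. `fetch(url, { signal })`, so one filter change aborts all five at once instead of piling up 60 concurrent DB queries.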

Time to test the attributions!

My seed scripts were running fine, the payment attribution fixes were also running great, so I had fresh data to play with every day. UI good, UX good; time to create a simple tracking script, add it to the websites, compare, and… yeah, nothing worked. I had to fix CORS, fix the endpoints, and make plenty of adjustments to the queries (I probably forgot to ask Claude to make no mistakes in the prompt). After playing around with it, everything worked!

So I started comparing the attributions, and… I had ~30–50% fewer visitors. I was fuming, checking logs, checking the DB, trying to find where the visitors were disappearing. The answer was simple: I had added Arcjet to the public endpoint, and it got to work. 100k requests in a couple of days, oops; I had to turn it off, since that would have bankrupted me, and started digging deeper.

Turned out DataFast has ABSOLUTELY ZERO BOT PROTECTION (Red Flag). Datacenter IPs? Passed. Null user-agent? Passed. A screen resolution of 10x10000? Welcome aboard. So I read a couple of blog posts from Arcjet, implemented what they suggested, and managed to block 96% of the bots compared to them. How?

The main one is checking the userAgent and filtering out obvious bots and non-existent displays. The trickier one was analyzing the IP and blocking datacenter IPs, which turned out to be much more difficult. I spent a couple of days on that; the best I managed was to use the MaxMind IP database and block the datacenter ranges (except my own infra: at one point I did block my own infra and had 0 attributions). Then I needed to proxy the user's real IP through Cloudflare to my backend on Fly, compare it, and finally filter it out or keep it.
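The cheap, pre-DB hard filters can be sketched as a pure function. The regex and screen-size bounds below are illustrative thresholds of mine, not the actual rule set; the signals (null UA, self-declared bots, impossible displays like 10x10000) are the ones described above:

```typescript
interface Visit {
  userAgent: string | null;
  screenW: number;
  screenH: number;
}

// Common self-identifying bot markers; a real list would be longer.
const BOT_UA = /bot|crawler|spider|headless|curl|python-requests/i;

function isObviousBot(v: Visit): boolean {
  if (!v.userAgent) return true;             // null user-agent: drop
  if (BOT_UA.test(v.userAgent)) return true; // self-declared bots
  // Non-existing displays: tiny or absurdly large dimensions.
  if (v.screenW < 240 || v.screenH < 240) return true;
  if (v.screenW > 8192 || v.screenH > 8192) return true;
  return false;
}
```

Visits that fail these checks never reach the DB; the subtler signals feed the bot score instead.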

While doing that I wondered how DataFast actually handles this, and… they don't (RED FLAG). I'll give them the benefit of the doubt here: it might have been my mess-up and I had to proxy the real IP myself, but it's not well documented in their docs. Essentially ALL the users I had tracked were attributed to the closest Cloudflare CDN node… I double-checked, and it turned out I regularly "travel" to Germany (I'm located in Poland), because sometimes my traffic was routed through Germany. At that point I understood that most of the tracking done via DataFast was actually useless garbage, so I had to do better.

I added some non-obvious bot signals as well: bounces, no engagement plus weird screen sizes, weird browser versions, etc., dozens of params. I attach a bot score to every session I store, so now I have a toggle that filters out the "probably bots". The most obvious ones are hard-filtered without even reaching the DB.

One thing I’m quite happy about — the bot scorer is import-aware. Since all my DataFast imported sessions have zero values for behavioral metrics (DataFast never tracked scroll depth, engagement time, or interactions), the scorer detects these and uses a separate algorithm that only looks at fingerprint anomalies like screen dimensions, instead of penalizing them for missing data they never had.
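The import-aware branch boils down to skipping the behavioral signals for sessions that never had them. A sketch under my own assumptions; the weights and field names are made up for illustration, only the branching rule comes from the text:

```typescript
interface Session {
  imported: boolean;     // true for sessions migrated from DataFast
  engagementMs: number;  // zeroed on imported sessions
  scrollDepth: number;   // 0..1, also zeroed on imported sessions
  screenW: number;
  screenH: number;
}

function botScore(s: Session): number {
  let score = 0;
  // Fingerprint anomalies apply to every session, imported or not.
  if (s.screenW < 240 || s.screenH > 8192) score += 0.5;
  // Imported sessions never had behavioral metrics tracked, so scoring
  // them on zero engagement would flag the whole import as bots.
  if (s.imported) return score;
  if (s.engagementMs === 0) score += 0.3; // instant bounce
  if (s.scrollDepth === 0) score += 0.2;  // no scrolling at all
  return Math.min(score, 1);
}
```

So an imported session with zeroed metrics and a sane fingerprint scores 0, while a freshly tracked session with the same zeros picks up the behavioral penalties.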

And that's pretty much it. The backend was ready, optimized, and stress-tested (it died; I had to bump up the RAM on the microservice to handle the load).

The front-end was looking nice, with UX I was happy with. So, what were my savings, you ask?

Cost of the new microservice: $25/m

So $39 - $25 = $14/m…

It took me around a month to get everything right (not full-time, getting to it on and off).

Yeah absolutely genius idea on my part, replace every SaaS and never look back.

In case anyone’s interested how it turned out, here is the link:

https://flowsery.com/
