Odumosu Matthew
Why I Built a Centralised Platform Admin in a Microservices Architecture - And What I Learned

It Started With a Slack Message

Our CTO posted in #engineering at 9:47 AM on a Tuesday:

"Can someone tell me how many active users we have across all products right now?"

Silence. For forty-five minutes.

Not because nobody cared - because nobody knew. The auth service had login counts. The user service had registration numbers. The billing service knew who was paying. But no single place could answer: "How many active users do we have?"

That Slack message cost us a morning of three engineers querying different databases and reconciling numbers in a Google Sheet. That's when we knew we had a problem.

The Microservices Hangover

Here's the thing nobody warns you about in the "Why Microservices" blog posts: decentralisation is fantastic for business logic and terrible for everything else.

We had twelve services. Each one managed its own roles. Its own permissions. Its own metrics. Its own version of the truth. And when anyone outside engineering asked a straightforward question - "Who has admin access?" "What's our uptime?" "When did this permission change?" - the answer was always the same: "Give us a few hours."

We had three specific nightmares:

The Permission Nightmare. Service A thought a user was an admin. Service B disagreed. Both were technically correct according to their own databases. A customer escalation later, we discovered the role had been revoked in one service but not the other. Fun times.

The Dashboard Nightmare. Leadership wanted a single executive dashboard. We had twelve Grafana dashboards, each telling a different story. Our "platform uptime" was a vibes-based estimate that someone calculated manually every Monday morning.

The Audit Nightmare. A compliance review asked: "Show us every admin role change in the last 90 days, who approved it, and why." We spent a week grepping logs across services and reconstructing a timeline in a spreadsheet. The auditor was not impressed.

So we did what any reasonable team would do: we built another microservice. But this one had a different job.

The Platform Admin: One Service to Rule the Cross-Cutting Concerns

The mandate was simple: own the things that get worse when distributed.

  • Authorization - One source of truth for roles and permissions. Every service asks us: "Can this user do this thing?" We answer in under 10ms.
  • Observability - One place where metrics from every service converge into a single dashboard.
  • Audit - One immutable, append-only log of every permission change, role assignment, and admin action across the entire platform.

Not a God service. Not a monolith in disguise. A focused service with a clear boundary: cross-cutting platform governance.

Here's what we learned building it.

Bet #1: Let Services Push Metrics (Don't Pull)

Our first instinct was to have the Platform Admin poll every service. Hit their health endpoints, scrape their stats, aggregate.

Then someone on the team asked the obvious question: "What happens when the auth service is down and we're trying to pull its uptime metric?"

Right. The monitoring system can't depend on the availability of the things it's monitoring. That's not architecture — that's a comedy sketch.

So we flipped it. Every service pushes its own metrics to us via POST endpoints. Auth service reports its call counts. Payment service reports its API volumes. User service reports active users. We receive, validate, store, and serve.

Auth Service ──→ POST /ingest/subsystem-metrics ──→ Platform Admin DB
Payment Service ──→ POST /ingest/api-calls ──→ Platform Admin DB
User Service ──→ POST /ingest/active-users ──→ Platform Admin DB

Frontend ←── GET /dashboard/executive-summary ←── Platform Admin DB

The beauty of this: we don't need to know about new services in advance. A team spins up a new microservice, starts posting to our ingestion endpoint, and their metrics appear on the dashboard. Zero configuration on our side.

And when a service goes silent? Its metrics expire via TTL. The absence of data is the signal. No heartbeat checks needed.
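The push-and-expire loop is easy to sketch. Below is a stack-agnostic Python toy (the production service is .NET; `MetricStore` and the payloads are invented for illustration). The key property: a service that stops pushing simply disappears from reads, with no heartbeat logic anywhere.

```python
import time

class MetricStore:
    """Receives pushed metrics; expired entries vanish from reads."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._rows = {}  # service name -> (payload, expires_at)

    def ingest(self, service: str, payload: dict, now: float = None) -> None:
        now = time.time() if now is None else now
        self._rows[service] = (payload, now + self.ttl)

    def live_metrics(self, now: float = None) -> dict:
        """Only unexpired rows: a silent service is absent, and that absence is the signal."""
        now = time.time() if now is None else now
        return {svc: data for svc, (data, exp) in self._rows.items() if exp > now}

store = MetricStore(ttl_seconds=300)
store.ingest("auth-service", {"active_users": 1200}, now=0)
store.ingest("payment-service", {"api_calls": 9800}, now=0)
# Ten minutes later, only services that kept pushing are still visible
store.ingest("auth-service", {"active_users": 1250}, now=600)
print(sorted(store.live_metrics(now=601)))  # ['auth-service']
```

Note that the store never needed to be told payment-service exists, and never needs to be told it died: ingestion registers it, silence removes it.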

Bet #2: One Table for All Metrics (Yes, Really)

We needed to serve seven different metric types on the dashboard. The obvious move was seven database tables with seven migrations and seven sets of CRUD operations. Then someone on the team muttered: "And next quarter, when product wants three more metric types...?"

Instead, we went with a single table. One row per metric. A composite string key, a JSONB value column, and a TTL timestamp. That's it.

Want to add a new metric type? Create a DTO, pick a key prefix, and start writing. No migration. No PR to update the schema. No waiting for the DBA.

"But you can't query individual fields inside the JSON!"

Correct. And we don't need to. Dashboard data is always read as a complete object and handed to the frontend. We're not running analytical queries against it. We're serialising, storing, and deserialising. The table is a smart cache with persistence, not a relational data model.

Months later, we've added four more metric types. Zero migrations. The table doesn't care.
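In miniature, the single-table pattern looks like this. A Python dict stands in for the PostgreSQL table, and the JSONB column is just serialised JSON keyed by a composite string; all names here are invented. Adding a metric type is one new key prefix, not a migration.

```python
import json

class OneTableMetrics:
    """One logical table: composite key -> (JSON blob, expiry).
    New metric types are new key prefixes; no schema change needed."""

    def __init__(self):
        self._table = {}  # key -> (json_text, expires_at)

    def write(self, metric_type, tenant, payload, ttl, now):
        key = f"{metric_type}:{tenant}"            # composite string key
        self._table[key] = (json.dumps(payload), now + ttl)

    def read(self, metric_type, tenant, now):
        row = self._table.get(f"{metric_type}:{tenant}")
        if row is None or row[1] <= now:
            return None
        return json.loads(row[0])  # whole object, handed straight to the frontend

m = OneTableMetrics()
m.write("active-users", "acme", {"count": 412}, ttl=300, now=0)
# A brand-new metric type: same table, zero migrations
m.write("api-calls", "acme", {"total": 98051, "p95_ms": 140}, ttl=300, now=0)
print(m.read("active-users", "acme", now=10))
```

The read path never inspects individual JSON fields, which is exactly why the lack of per-field queryability doesn't hurt.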

Bet #3: Validate Tenants at the Gate, Not in Business Logic

Multi-tenant platforms have a sneaky bug pattern: someone passes a tenant ID that doesn't exist, and the system happily returns empty results instead of saying "that's not a real tenant."

We caught this one during testing when someone called our endpoint with /dashboard/metrics/banana/30 and got a 200 OK with zeros. Technically correct — there's no data for the "banana" tenant. Practically useless - it should have been a 400 error.

So we built tenant validation into the request pipeline. Every request with a tenant identifier gets checked against the database before it reaches any business logic. The special keyword all (meaning "aggregate across all tenants") short-circuits and never touches the database.

Since tenant codes change roughly never, we cache the valid set in a HashSet for 10 minutes. The hot path is an O(1) lookup. The cold path hits the database once every 10 minutes. Thirteen endpoints, one validation rule, enforced consistently through FluentValidation in the MediatR pipeline. No endpoint can skip it, even by accident.
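Here's the shape of that gate as a Python sketch, with an in-memory set standing in for the tenants table (the real pipeline is FluentValidation inside MediatR; these names are illustrative):

```python
import time

VALID_TENANTS_DB = {"acme", "globex", "initech"}  # stand-in for the tenants table

class TenantValidator:
    """Cached tenant check applied before any business logic runs."""

    def __init__(self, refresh_seconds=600):
        self.refresh = refresh_seconds
        self._cache = set()
        self._loaded_at = float("-inf")

    def _valid_set(self, now):
        if now - self._loaded_at >= self.refresh:  # cold path: one DB hit per window
            self._cache = set(VALID_TENANTS_DB)
            self._loaded_at = now
        return self._cache

    def check(self, tenant: str, now=None) -> bool:
        if tenant == "all":                        # aggregate keyword: never touches the DB
            return True
        now = time.time() if now is None else now
        return tenant in self._valid_set(now)      # hot path: O(1) set lookup

v = TenantValidator()
print(v.check("acme", now=0))    # True
print(v.check("banana", now=1))  # False -> the caller returns 400, not an empty 200
print(v.check("all", now=2))     # True, short-circuits
```

The "banana" request from earlier now dies at the gate instead of producing plausible-looking zeros.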

Bet #4: CQRS, Because Reads and Writes Live Different Lives

Here's the access pattern for a platform admin:

  • Writes: Rare, high-stakes. An admin changes a role, ingests metrics, creates an audit entry. Maybe a few hundred writes per hour.
  • Reads: Constant, latency-sensitive. Permission checks from other services, dashboard queries from the frontend, audit log lookups. Thousands per minute.

Treating these the same way is like designing a library and a nightclub with the same floor plan. They have fundamentally different requirements.

CQRS let us optimise each side independently. The write side does thorough validation, persists carefully, and emits domain events. The read side is aggressively cached and optimised for speed. A permission check from another service hits the cache and returns in single-digit milliseconds. The write path doesn't even notice.
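Stripped to its bones, the split looks like this. A hedged Python sketch, not the real MediatR handlers; the class names and the two-role model are invented:

```python
class AssignRoleCommandHandler:
    """Write side: validate thoroughly, persist, record an event. Rare and careful."""

    def __init__(self, db, events):
        self.db, self.events = db, events

    def handle(self, user, role):
        if role not in {"admin", "viewer"}:
            raise ValueError(f"unknown role: {role}")
        self.db.setdefault(user, set()).add(role)
        self.events.append(("RoleAssigned", user, role))

class GetPermissionsQueryHandler:
    """Read side: serve from cache when possible. Constant and latency-sensitive."""

    def __init__(self, db, cache):
        self.db, self.cache = db, cache

    def handle(self, user):
        if user not in self.cache:
            self.cache[user] = frozenset(self.db.get(user, set()))  # one DB hit, then cached
        return self.cache[user]

db, cache, events = {}, {}, []
AssignRoleCommandHandler(db, events).handle("john", "admin")
reads = GetPermissionsQueryHandler(db, cache)
print(reads.handle("john"))  # cache miss -> database
print(reads.handle("john"))  # cache hit
```

Because the two handlers share nothing but the store, you can harden the write side and speed up the read side without either change touching the other.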

Bet #5: Side Effects Belong in Event Handlers, Not Command Handlers

When an admin assigns a role, here's everything that needs to happen:

  1. Validate the request
  2. Persist the role assignment
  3. Invalidate the permission cache for that user
  4. Send an email notification to the affected user
  5. Notify all admins for visibility
  6. Write an immutable audit log entry
  7. Refresh platform statistics cache

Our first version had all seven steps in the command handler. It was 200+ lines, touched four different infrastructure services, and was a nightmare to test. Mocking the email service to test the cache invalidation? Really?

We refactored to domain events. The command handler does steps 1 and 2 - validate and persist. Then it publishes an event: "A role was assigned." Separate, independent handlers pick up the event and handle their own concern. Cache handler invalidates keys. Notification handler fires emails (fire-and-forget, never blocking the response). Audit handler writes the log.

Now each handler is 20 lines, testable in isolation, and completely unaware of the others. Adding a new side effect - say, posting to a Slack channel - means adding a new event handler. No existing code changes.
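The pattern fits in a screenful. Here's a toy in-process event bus in Python standing in for MediatR notifications (handler bodies just log, and the persist step is elided):

```python
# Tiny in-process event bus; the real service uses MediatR notifications.
handlers = {}

def subscribe(event_type):
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

def publish(event_type, **payload):
    for fn in handlers.get(event_type, []):
        fn(**payload)  # each handler owns exactly one side effect

log = []

@subscribe("RoleAssigned")
def invalidate_cache(user, role):
    log.append(f"cache: invalidated permissions for {user}")

@subscribe("RoleAssigned")
def notify_user(user, role):
    log.append(f"email: told {user} they are now {role}")

@subscribe("RoleAssigned")
def write_audit(user, role):
    log.append(f"audit: {user} assigned {role}")

def assign_role(user, role):
    # Command handler: validate and persist (elided), then announce what happened.
    publish("RoleAssigned", user=user, role=role)

assign_role("john@company.com", "Admin")
print(len(log))  # three independent side effects, none known to the command handler
```

A Slack notification would be one more `@subscribe("RoleAssigned")` function; `assign_role` never changes.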

Bet #6: Don't Make Reads Pay for Expired Data

Metrics have a TTL. The lazy approach: add WHERE ExpiresAt > NOW() to every query. It works... until you have 50,000 expired rows cluttering the table and your index scans are doing unnecessary work.

We built a background service that wakes up every 30 minutes, deletes expired metrics in batches, and goes back to sleep. Batch size is capped so a single cleanup cycle can't lock the table. If it hits the limit, it logs a warning so ops can tune the configuration.

The metrics table stays lean. Queries stay fast. And instead of every read request paying a tiny tax to filter garbage, a background job handles it once in bulk. It's not clever — it's just good housekeeping.
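One cleanup cycle, sketched in Python (the real job is a .NET background service deleting rows in SQL batches; the cap value and row shape here are made up):

```python
def cleanup_expired(rows, now, batch_cap=500):
    """Delete expired metric rows in one bounded pass; flag it if the cap was hit.
    rows: dict of key -> expires_at, standing in for the metrics table."""
    expired = [k for k, exp in rows.items() if exp <= now][:batch_cap]
    for k in expired:
        del rows[k]
    capped = len(expired) == batch_cap
    if capped:
        print("warning: cleanup hit batch cap; consider tuning batch size or interval")
    return len(expired), capped

# Seven expired rows, three live ones, and a deliberately small cap
rows = {f"metric:{i}": (0 if i < 7 else 1000) for i in range(10)}
deleted, capped = cleanup_expired(rows, now=500, batch_cap=5)
print(deleted, capped, len(rows))  # 5 True 5 -> the next cycle finishes the job
```

Because the cap bounds each pass, a backlog degrades gracefully into a warning and a slightly longer tail of cycles, never a table lock.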

Bet #7: Design the Audit Log for the Auditor, Not the Developer

Our first audit log was a message text column with entries like: "User admin@company.com assigned role Admin to user john@company.com". Human-readable. Unsearchable. Useless for compliance.

The redesigned audit log captures:

  • Who initiated the action (with IP address)
  • Who was the target
  • What changed (previous state → new state, as structured JSON)
  • When it happened
  • Why (a mandatory reason field)
  • Whether it succeeded or failed

It's append-only. No updates. No deletes. Not even soft deletes. Once a record exists, it exists forever. That's not a technical constraint — it's a compliance requirement. When an auditor asks "has anyone tampered with these logs?" the answer needs to be architecturally impossible, not "we promise we didn't."

We also added CSV export, because auditors don't want to learn your API. They want to open a file in Excel, apply some filters, and go home. Meeting your users where they are applies to internal users too.
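A minimal sketch of the record shape and the append-only constraint, in Python (the ticket reference and field names are invented; the real store is a PostgreSQL table with no UPDATE or DELETE path):

```python
import time

class AuditLog:
    """Append-only: the API offers no update or delete, by construction."""

    def __init__(self):
        self._entries = []

    def append(self, actor, actor_ip, target, previous, new, reason, succeeded):
        if not reason:
            raise ValueError("a reason is mandatory for every audited action")
        self._entries.append({
            "actor": actor, "actor_ip": actor_ip, "target": target,
            "change": {"previous": previous, "new": new},  # structured, searchable
            "at": time.time(), "reason": reason, "succeeded": succeeded,
        })

    def entries(self):
        return tuple(self._entries)  # read-only view; no mutation path exposed

log = AuditLog()
log.append("admin@company.com", "10.0.0.5", "john@company.com",
           {"role": "Viewer"}, {"role": "Admin"},
           reason="Promotion approved in ticket OPS-142", succeeded=True)
print(len(log.entries()))
```

Because every field is structured, "show us every admin role change in 90 days" becomes a filter over `change`, and the CSV export is a straight dump of these rows.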

The Caching Philosophy That Changed Everything

Here's what I wish someone had told me earlier: caching isn't a performance optimisation. It's an architectural decision.

When fifteen microservices are calling your permission endpoint on every request, the difference between 200ms and 5ms is the difference between a platform that feels snappy and one that feels sluggish. And the bottleneck isn't your code — it's the database round trip.

Our caching strategy is tiered by how often data actually changes:

Data                          Cache TTL   Invalidation
User permissions              5 min       Immediate on role change (domain event)
Reference data (role list)    30 min      On any role modification
Dashboard metrics             5 min       On ingestion writes
Tenant codes                  10 min      Almost never changes

The critical insight: TTLs are safety nets, not invalidation strategies. When we know data changed (because we just processed the command), we invalidate immediately via domain events. The TTL only exists for the edge case where an event gets lost or a cache entry somehow becomes stale.

This gives us sub-10ms permission checks in production. Other services call us on every request and barely notice the network hop. That's not optimisation - that's making centralised auth viable.
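The "TTL as safety net" idea in sketch form, using Python and a lambda standing in for the database loader (the real cache is in-memory .NET, invalidated from the RoleAssigned event handler; names are illustrative):

```python
import time

class PermissionCache:
    """Event-driven invalidation does the real work; the TTL is only a safety net."""

    def __init__(self, ttl=300):
        self.ttl = ttl
        self._entries = {}  # user -> (permissions, expires_at)

    def get(self, user, loader, now=None):
        now = time.time() if now is None else now
        hit = self._entries.get(user)
        if hit and hit[1] > now:
            return hit[0]                     # hot path: no database round trip
        perms = loader(user)                  # cold or stale: reload
        self._entries[user] = (perms, now + self.ttl)
        return perms

    def invalidate(self, user):
        self._entries.pop(user, None)         # called from the role-change event handler

db = {"john": {"viewer"}}
cache = PermissionCache(ttl=300)
print(sorted(cache.get("john", lambda u: set(db[u]), now=0)))  # ['viewer'] - loaded
db["john"].add("admin")
print(sorted(cache.get("john", lambda u: set(db[u]), now=1)))  # ['viewer'] - cached, briefly stale
cache.invalidate("john")                                       # we know it changed: act now
print(sorted(cache.get("john", lambda u: set(db[u]), now=2)))  # ['admin', 'viewer'] - fresh
```

Without the `invalidate` call, the stale window would last up to the full TTL; with it, the TTL only matters if an event is ever lost.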

Honest Mistakes We Made

We bolted on audit logging after building RBAC. Retrofitting event publishing into existing command handlers was a week of tedious work. If we'd designed the audit schema first, every handler would have been built with events from day one. Lesson: compliance requirements should shape your architecture, not be stapled on later.

We let each team define their own ingestion payload. One team used count, another used total, a third used value. Same concept, three different field names. A shared contract or schema registry would have saved us a normalisation sprint.

We underinvested in cache invalidation testing. Our unit tests mocked the cache and passed beautifully. The bugs lived in timing — a role change followed by an immediate permission check, where the cache hadn't been invalidated yet. These only surfaced under concurrent load testing. Test your cache invalidation under race conditions, not just in isolation.

What It Looks Like Today

Thirteen endpoints. Six ingestion (microservices pushing data in), seven dashboard (frontend pulling data out). Every single one has:

  • Tenant validation against the database
  • Input validation in the request pipeline
  • JWT authentication
  • In-memory caching with event-driven invalidation
  • Structured logging
  • Automated TTL cleanup

Our CTO's question - "How many active users do we have?" — now takes about 3 seconds to answer. Open the dashboard. It's right there.

TL;DR for the Architects

  1. Centralise cross-cutting concerns, decentralise business logic. Auth, audit, and observability get worse when spread across services. Give them a home.

  2. Push over pull for metrics. The monitoring system shouldn't depend on the availability of what it's monitoring.

  3. Schema flexibility beats relational purity for metrics. A key-value store with JSONB means zero migrations when product wants new dashboard widgets.

  4. Event-driven cache invalidation, not just TTLs. You know when data changes. Act on it immediately.

  5. Design for the auditor. Immutable logs, CSV exports, structured change records. It feels like overhead until the compliance review. Then it's the best thing you built.

  6. Caching is architecture. When your service is in the critical path of every other service, response time isn't a nice-to-have — it's existential.


Building platform infrastructure isn't the kind of work that gets you on stage at a conference. Nobody's tweeting about your audit log schema. But the next time an auditor asks "who had admin access on March 3rd?" and you answer in three seconds flat — or when the CEO opens one dashboard instead of pinging five engineering leads — you'll know it was worth it.


Built with .NET 9, Clean Architecture, CQRS, Entity Framework Core, and PostgreSQL.
