Tomislav Nekic

Posted on May 31

Building an API-first self-hosted monitoring platform in Rust/Tokio and Postgres

#rust #opensource #postgres #devops

I build and maintain client projects, and one thing that comes up again and again is monitoring.

Not only “is the homepage up?”, but also:

is this API endpoint responding correctly?
did this webhook start failing?
is the SSL certificate still valid?
is DNS returning the expected record?
is this cron job still running?
did this background task stop calling home?

There are many good monitoring tools already, but I kept wanting something that matched the way I usually think about services.

That is why I started building Alon Sentinel.

It is an open-source, self-hosted monitoring platform built with Rust/Tokio, PostgreSQL, and React.

Repo: https://github.com/tomnekic/alon-sentinel
Demo: https://demo.alon.systems
Docs/API: https://docs.alon.systems
Benchmarks: https://github.com/tomnekic/alon-sentinel/tree/main/benchmark

The main idea: site-centric monitoring

A lot of monitoring tools treat every check as a standalone item.

That works fine at small scale, but in real projects I usually do not think that way. I think in terms of services.

For example, I may have one client API. That API is not just one HTTP check. It may need:

HTTP checks
SSL certificate checks
DNS checks
TCP checks
heartbeat checks from background jobs

Those checks all belong to the same service.

So in Alon Sentinel, the main model is site-centric. A site/service can have multiple monitors attached to it, and those monitors can share incidents, notification routing, history, and public status pages.
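To make the idea concrete, here is a minimal sketch of what a site-centric data model could look like. The type and field names (`Site`, `Monitor`, `MonitorKind`, `notification_channels`) are hypothetical and chosen for illustration; they are not taken from the Alon Sentinel schema.

```rust
// Hypothetical types for illustration only; the real schema may differ.

/// The monitor types mentioned in this post.
#[derive(Debug, Clone, PartialEq)]
enum MonitorKind {
    Http { url: String },
    Ssl { host: String },
    Dns { name: String, expected: String },
    Tcp { host: String, port: u16 },
    Heartbeat { expected_every_secs: u64 },
}

#[derive(Debug, Clone)]
struct Monitor {
    name: String,
    kind: MonitorKind,
}

/// A site (service) owns its monitors, so notification routing and similar
/// settings are looked up on the site instead of on each individual check.
#[derive(Debug, Clone)]
struct Site {
    name: String,
    notification_channels: Vec<String>,
    monitors: Vec<Monitor>,
}

impl Site {
    fn new(name: &str, channels: &[&str]) -> Self {
        Site {
            name: name.to_string(),
            notification_channels: channels.iter().map(|c| c.to_string()).collect(),
            monitors: Vec::new(),
        }
    }

    fn add_monitor(&mut self, name: &str, kind: MonitorKind) {
        self.monitors.push(Monitor { name: name.to_string(), kind });
    }

    /// Every monitor on the site shares the site's notification routing.
    fn channels_for(&self, _monitor: &Monitor) -> &[String] {
        &self.notification_channels
    }
}

/// Builds one example service with three related checks.
fn example_site() -> Site {
    let mut site = Site::new("client-api", &["slack:#client-alerts"]);
    site.add_monitor("health", MonitorKind::Http { url: "https://api.example.com/health".into() });
    site.add_monitor("cert", MonitorKind::Ssl { host: "api.example.com".into() });
    site.add_monitor("nightly-job", MonitorKind::Heartbeat { expected_every_secs: 86_400 });
    site
}
```

The point of the shape is that the relationship between checks lives in the data: asking "where do alerts for this certificate check go?" is answered by the owning site, not by per-check configuration.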

The goal is to avoid ending up with a long flat list of unrelated checks where the relationship between them only exists in your head.

API-first from the beginning

Another important decision was making the API a first-class part of the system.

I did not want the UI to be the only real interface.

For my use cases, I often need to create or manage monitors from scripts, deployment flows, or client project tooling. That means the API cannot be an afterthought.

Alon Sentinel has a versioned /v1 API, OpenAPI docs, and machine-to-machine API clients. The admin UI is useful, but the system should also be usable by automation.

That was one of the core reasons for building it.

Splitting the API and worker

The backend is split into two Rust services:

an API service
a worker service

The API service handles users, sites, monitors, incidents, status pages, API clients, and configuration.

The worker service is responsible for actually running checks and opening or resolving incidents.

This split matters because monitoring checks can be slow or unreliable by nature. Targets time out. DNS can hang. Remote servers can behave badly. Some checks will fail in messy ways.

I do not want that to affect the HTTP/API path.

Keeping check execution isolated from the API layer makes the architecture easier to reason about, and it should also make it possible to scale workers separately later.
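As a rough illustration of why isolation helps, the sketch below runs a check on its own thread and gives up after a deadline, so a hanging target cannot block the caller. It uses only the standard library to stay self-contained; it is not the project's worker code, and `CheckOutcome` and `run_with_deadline` are made-up names. In a Tokio-based worker the closest equivalent would probably be wrapping the check future in `tokio::time::timeout`.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum CheckOutcome {
    Up,
    Down(String),
    TimedOut,
}

/// Runs `check` on a separate thread and waits at most `deadline` for it.
/// A slow or hanging check only ties up its own thread, not the caller.
fn run_with_deadline<F>(check: F, deadline: Duration) -> CheckOutcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Sending fails only if the caller already gave up; that is fine.
        let _ = tx.send(check());
    });

    match rx.recv_timeout(deadline) {
        Ok(Ok(())) => CheckOutcome::Up,
        Ok(Err(reason)) => CheckOutcome::Down(reason),
        Err(RecvTimeoutError::Timeout) => CheckOutcome::TimedOut,
        // The sender was dropped without a result, e.g. the check panicked.
        Err(RecvTimeoutError::Disconnected) => {
            CheckOutcome::Down("check terminated unexpectedly".to_string())
        }
    }
}
```

Note that in this thread-based version the abandoned check keeps running in the background until it returns on its own; an async runtime can cancel the task instead, which is one reason the async approach suits this kind of worker.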

Rich HTTP checks

The HTTP monitor is more than a simple “status code is 200” check.

It supports:

expected status codes
response time thresholds
body contains / not-contains assertions
header assertions
JSON path assertions
redirect handling
timeout and interval configuration
TLS-related options

This matters because many real checks are not just “did I get a response?”

Sometimes you need to verify that the response contains a specific value, that a JSON field exists, that an API returns the expected structure, or that a bad response is detected even if the status code still looks successful.
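A small sketch of how such assertions could be evaluated against a response. The types here (`HttpResponse`, `Assertion`, `failures`) are assumptions made for the example, not the actual Alon Sentinel implementation, and JSON path assertions are left out because they would need a JSON parser.

```rust
/// Simplified response view; field names are illustrative assumptions.
struct HttpResponse {
    status: u16,
    headers: Vec<(String, String)>,
    body: String,
    elapsed_ms: u64,
}

enum Assertion {
    StatusIn(Vec<u16>),
    MaxResponseMs(u64),
    BodyContains(String),
    BodyNotContains(String),
    HeaderEquals { name: String, value: String },
}

impl Assertion {
    /// Returns `Err` with a human-readable reason when the assertion fails.
    fn check(&self, resp: &HttpResponse) -> Result<(), String> {
        match self {
            Assertion::StatusIn(codes) if codes.contains(&resp.status) => Ok(()),
            Assertion::StatusIn(_) => Err(format!("unexpected status {}", resp.status)),
            Assertion::MaxResponseMs(limit) if resp.elapsed_ms <= *limit => Ok(()),
            Assertion::MaxResponseMs(limit) => {
                Err(format!("took {} ms, limit is {} ms", resp.elapsed_ms, limit))
            }
            Assertion::BodyContains(needle) if resp.body.contains(needle.as_str()) => Ok(()),
            Assertion::BodyContains(needle) => Err(format!("body is missing {:?}", needle)),
            Assertion::BodyNotContains(needle) if !resp.body.contains(needle.as_str()) => Ok(()),
            Assertion::BodyNotContains(needle) => {
                Err(format!("body contains forbidden {:?}", needle))
            }
            Assertion::HeaderEquals { name, value } => {
                // HTTP header names are case-insensitive.
                let found = resp.headers.iter().find(|(n, _)| n.eq_ignore_ascii_case(name));
                match found {
                    Some((_, v)) if v == value => Ok(()),
                    Some((_, v)) => {
                        Err(format!("header {} was {:?}, expected {:?}", name, v, value))
                    }
                    None => Err(format!("header {} is missing", name)),
                }
            }
        }
    }
}

/// Collects every failed assertion so a single incident can list all reasons.
fn failures(assertions: &[Assertion], resp: &HttpResponse) -> Vec<String> {
    assertions.iter().filter_map(|a| a.check(resp).err()).collect()
}
```

With this shape, a response that returns status 200 but carries an error payload still fails the check, because a `BodyNotContains` assertion catches it even though the `StatusIn` assertion passes.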

Other monitor types

Current monitor types include:

HTTP monitors
SSL certificate checks
DNS record checks
TCP checks
heartbeat monitors for cron jobs and background tasks

Heartbeat monitors are especially useful for jobs that do not expose an HTTP endpoint. The job calls Alon Sentinel when it runs. If it does not call within the expected window, the monitor fails.
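The window logic can be sketched roughly as follows. This is a simplified illustration, not the project's code: the `Heartbeat` struct, the extra `grace` period, and the use of plain Unix timestamps in seconds are all assumptions made for the example.

```rust
use std::time::Duration;

/// Hypothetical heartbeat state; timestamps are seconds since the Unix epoch.
struct Heartbeat {
    /// How often the job is expected to call in.
    interval: Duration,
    /// Extra slack before a late job counts as failed (an assumption here).
    grace: Duration,
    created_at: u64,
    last_ping_at: Option<u64>,
}

impl Heartbeat {
    /// Called when the job reports in.
    fn record_ping(&mut self, now: u64) {
        self.last_ping_at = Some(now);
    }

    /// The monitor fails once no ping has arrived within interval + grace.
    /// Before the first ping, the window is measured from creation time.
    fn is_overdue(&self, now: u64) -> bool {
        let reference = self.last_ping_at.unwrap_or(self.created_at);
        let window = self.interval.as_secs() + self.grace.as_secs();
        now.saturating_sub(reference) > window
    }
}
```

On the job side, reporting in is typically just one HTTP request to the monitor's ping URL at the end of a successful run.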

Incidents, notifications, and status pages

The system also includes:

incident lifecycle
public status pages
Slack notifications
Discord notifications
webhook notifications
email notifications
per-site notification overrides
RBAC
Prometheus metrics
Docker Compose deployment

The goal is not to replace Prometheus or Grafana. Alon Sentinel exposes Prometheus metrics, so it can fit alongside that stack.

The focus is different: service availability, incidents, notifications, and public status communication.
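As an illustration of the incident lifecycle, here is a minimal sketch of the open/resolve transitions, written so that repeated failures never create duplicate incidents. It is a simplified assumption of how this could work: it tracks one open incident per monitor ID, and `IncidentTracker` and `Transition` are made-up names.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Transition {
    Opened,
    Resolved,
    NoChange,
}

/// Tracks at most one open incident per monitor ID.
#[derive(Default)]
struct IncidentTracker {
    /// Monitor ID mapped to the reason the incident was opened.
    open: HashMap<u64, String>,
}

impl IncidentTracker {
    /// Feeds one check result into the tracker and reports what changed.
    fn record(&mut self, monitor_id: u64, result: Result<(), String>) -> Transition {
        match result {
            Err(reason) => {
                if self.open.contains_key(&monitor_id) {
                    // Already open: a repeated failure must not open a duplicate.
                    Transition::NoChange
                } else {
                    self.open.insert(monitor_id, reason);
                    Transition::Opened
                }
            }
            Ok(()) => {
                if self.open.remove(&monitor_id).is_some() {
                    Transition::Resolved
                } else {
                    Transition::NoChange
                }
            }
        }
    }
}
```

Only `Opened` and `Resolved` would trigger notifications in a design like this, which keeps a flapping or long-failing target from flooding Slack or email.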

Why Rust?

I worked on monitoring agents for Windows and Ubuntu in 2019, and I really liked that type of work.

For this project, Rust felt like the right fit again.

The worker needs to run many checks, deal with timeouts, avoid unnecessary resource usage, and handle failure paths carefully. Rust and Tokio are a good match for that kind of backend.

I also wanted predictable resource usage. That was one of the reasons I was interested in benchmarking it early.

Benchmarking against Uptime Kuma

I also benchmarked Alon Sentinel against Uptime Kuma, running both on the same Hetzner CPX32 server.

One result that stood out was at 4,000 monitors with a 30s interval:

Alon Sentinel / PostgreSQL: ~390 MiB average RAM
Uptime Kuma / SQLite: ~738 MiB average RAM
Uptime Kuma / MariaDB: ~1.96 GiB average RAM

This is only one setup, so I do not want to over-claim from it. Benchmarks depend heavily on workload, configuration, hardware, database setup, and methodology.

But the difference was large enough that I thought it was worth sharing.
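To put those figures in perspective, the arithmetic behind them is simple enough to write down. The helper names below are made up for this example, and the conversion assumes 1 GiB = 1024 MiB.

```rust
const MIB_PER_GIB: f64 = 1024.0;

/// Steady-state check rate when `monitors` checks each run every `interval_secs`.
fn checks_per_second(monitors: u32, interval_secs: u32) -> f64 {
    f64::from(monitors) / f64::from(interval_secs)
}

/// How many times larger `other_mib` is than `baseline_mib`.
fn memory_ratio(other_mib: f64, baseline_mib: f64) -> f64 {
    other_mib / baseline_mib
}
```

With the numbers above, `checks_per_second(4000, 30)` comes out at about 133 checks per second, `memory_ratio(738.0, 390.0)` at about 1.9, and `memory_ratio(1.96 * MIB_PER_GIB, 390.0)` at about 5.1.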

The benchmark script also tracks:

CPU average/max
RAM average/max
expected checks
completed checks
missed checks
incidents
duplicate incidents
worker errors

There are still things I want to add, especially better database-level metrics:

pool wait time
active/idle connections
query timing
check lateness

CPU and RAM are useful, but they do not fully explain where the system is spending time under load.

What I am thinking about next

Some of the areas I am currently thinking about:

better scheduling for large numbers of async checks
jitter to avoid burst behavior
bounded concurrency
PostgreSQL pool sizing
better benchmark reporting
more notification integrations
multi-region checks
custom status page domains

The scheduling side is especially interesting. At higher monitor counts and shorter intervals, it is not enough to just spawn more tasks. You need to think about bursts, timeouts, database writes, and how the system behaves when targets are slow or failing.
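Two of those ideas, jitter and bounded concurrency, can be sketched with the standard library alone. This is only an illustration under assumptions, not how Alon Sentinel schedules checks: `start_offset_ms` and `run_bounded` are hypothetical names, and a fixed pool of threads stands in for what an async worker would more likely do with a semaphore such as `tokio::sync::Semaphore`.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// FNV-1a hash: small, dependency-free, and stable across runs.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Deterministic jitter: each monitor gets a fixed offset inside its interval,
/// so monitors with the same interval do not all fire at the same instant.
fn start_offset_ms(monitor_id: u64, interval_ms: u64) -> u64 {
    if interval_ms == 0 {
        return 0;
    }
    fnv1a(&monitor_id.to_le_bytes()) % interval_ms
}

/// Bounded concurrency: at most `workers` jobs run at once, however many are queued.
/// Results are returned in the same order as `jobs`.
fn run_bounded<F>(jobs: Vec<F>, workers: usize) -> Vec<bool>
where
    F: FnOnce() -> bool + Send + 'static,
{
    let total = jobs.len();
    let (job_tx, job_rx) = mpsc::channel::<(usize, F)>();
    let (res_tx, res_rx) = mpsc::channel::<(usize, bool)>();
    for job in jobs.into_iter().enumerate() {
        let _ = job_tx.send(job);
    }
    drop(job_tx); // Closing the queue lets workers exit once it drains.

    let job_rx = Arc::new(Mutex::new(job_rx));
    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // The lock is held only while taking a job, not while running it.
                let next = job_rx.lock().unwrap().recv();
                match next {
                    Ok((idx, job)) => {
                        let _ = res_tx.send((idx, job()));
                    }
                    Err(_) => break, // Queue is empty and closed.
                }
            })
        })
        .collect();
    drop(res_tx);

    let mut results = vec![false; total];
    for (idx, ok) in res_rx {
        results[idx] = ok;
    }
    for handle in handles {
        let _ = handle.join();
    }
    results
}
```

A scheduler built on these two pieces would start monitor `id` at `start_offset_ms(id, interval)` into each cycle and hand due checks to the bounded pool, so a burst of slow targets queues up instead of spawning an unbounded number of tasks.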

Current status

This is the first public release.

It is already usable, but still early. I am mostly looking for feedback from people who run real self-hosted monitoring stacks or have built similar Rust/Tokio worker systems.

The questions I care most about right now are:

Does the site-centric model make sense?
Would an API-first monitoring tool be useful in your setup?
What would stop you from trying it?
What benchmark metrics would you want to see?
How would you approach scheduling many async checks in Rust?

Project links:

Repo: https://github.com/tomnekic/alon-sentinel
Demo: https://demo.alon.systems
Docs/API: https://docs.alon.systems
Benchmarks: https://github.com/tomnekic/alon-sentinel/tree/main/benchmark
