<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tomislav Nekic</title>
    <description>The latest articles on DEV Community by Tomislav Nekic (@tomislav_nekic_d296932f02).</description>
    <link>https://dev.to/tomislav_nekic_d296932f02</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3961465%2F5d3b414e-486c-4901-8bba-ae5dd8ea6dad.jpg</url>
      <title>DEV Community: Tomislav Nekic</title>
      <link>https://dev.to/tomislav_nekic_d296932f02</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tomislav_nekic_d296932f02"/>
    <language>en</language>
    <item>
      <title>Building an API-first self-hosted monitoring platform in Rust/Tokio and Postgres</title>
      <dc:creator>Tomislav Nekic</dc:creator>
      <pubDate>Sun, 31 May 2026 17:10:09 +0000</pubDate>
      <link>https://dev.to/tomislav_nekic_d296932f02/building-an-api-first-self-hosted-monitoring-platform-in-rusttokio-and-postgres-18eb</link>
      <guid>https://dev.to/tomislav_nekic_d296932f02/building-an-api-first-self-hosted-monitoring-platform-in-rusttokio-and-postgres-18eb</guid>
      <description>&lt;p&gt;I build and maintain client projects, and one thing that comes up again and again is monitoring.&lt;/p&gt;

&lt;p&gt;Not only “is the homepage up?”, but also:&lt;/p&gt;

&lt;p&gt;is this API endpoint responding correctly?&lt;br&gt;
did this webhook start failing?&lt;br&gt;
is the SSL certificate still valid?&lt;br&gt;
is DNS returning the expected record?&lt;br&gt;
is this cron job still running?&lt;br&gt;
did this background task stop calling home?&lt;/p&gt;

&lt;p&gt;There are many good monitoring tools already, but I kept wanting something that matched the way I usually think about services.&lt;/p&gt;

&lt;p&gt;That is why I started building Alon Sentinel.&lt;/p&gt;

&lt;p&gt;It is an open-source, self-hosted monitoring platform built with Rust/Tokio, PostgreSQL, and React.&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/tomnekic/alon-sentinel" rel="noopener noreferrer"&gt;https://github.com/tomnekic/alon-sentinel&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://demo.alon.systems" rel="noopener noreferrer"&gt;https://demo.alon.systems&lt;/a&gt;&lt;br&gt;
Docs/API: &lt;a href="https://docs.alon.systems" rel="noopener noreferrer"&gt;https://docs.alon.systems&lt;/a&gt;&lt;br&gt;
Benchmarks: &lt;a href="https://github.com/tomnekic/alon-sentinel/tree/main/benchmark" rel="noopener noreferrer"&gt;https://github.com/tomnekic/alon-sentinel/tree/main/benchmark&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main idea: site-centric monitoring&lt;/p&gt;

&lt;p&gt;A lot of monitoring tools treat every check as its own thing.&lt;/p&gt;

&lt;p&gt;That works fine at small scale, but in real projects I usually do not think that way. I think in terms of services.&lt;/p&gt;

&lt;p&gt;For example, I may have one client API. That API is not just one HTTP check. It may need:&lt;/p&gt;

&lt;p&gt;HTTP checks&lt;br&gt;
SSL certificate checks&lt;br&gt;
DNS checks&lt;br&gt;
TCP checks&lt;br&gt;
heartbeat checks from background jobs&lt;/p&gt;

&lt;p&gt;Those checks all belong to the same service.&lt;/p&gt;

&lt;p&gt;So in Alon Sentinel, the main model is site-centric. A site/service can have multiple monitors attached to it, and those monitors can share incidents, notification routing, history, and public status pages.&lt;/p&gt;

&lt;p&gt;The goal is to avoid ending up with a long flat list of unrelated checks where the relationship between them only exists in your head.&lt;/p&gt;

&lt;p&gt;API-first from the beginning&lt;/p&gt;

&lt;p&gt;Another important decision was making the API a first-class part of the system.&lt;/p&gt;

&lt;p&gt;I did not want the UI to be the only real interface.&lt;/p&gt;

&lt;p&gt;For my use cases, I often need to create or manage monitors from scripts, deployment flows, or client project tooling. That means the API cannot be an afterthought.&lt;/p&gt;

&lt;p&gt;Alon Sentinel has a versioned /v1 API, OpenAPI docs, and machine-to-machine API clients. The admin UI is useful, but the system should also be usable by automation.&lt;/p&gt;

&lt;p&gt;That was one of the core reasons for building it.&lt;/p&gt;

&lt;p&gt;Splitting the API and worker&lt;/p&gt;

&lt;p&gt;The backend is split into two Rust services:&lt;/p&gt;

&lt;p&gt;an API service&lt;br&gt;
a worker service&lt;/p&gt;

&lt;p&gt;The API service handles users, sites, monitors, incidents, status pages, API clients, and configuration.&lt;/p&gt;

&lt;p&gt;The worker service is responsible for actually running checks and opening or resolving incidents.&lt;/p&gt;

&lt;p&gt;This split matters because monitoring checks can be slow or unreliable by nature. Targets time out. DNS can hang. Remote servers can behave badly. Some checks will fail in messy ways.&lt;/p&gt;

&lt;p&gt;I do not want that to affect the HTTP/API path.&lt;/p&gt;

&lt;p&gt;Keeping check execution isolated from the API layer makes the architecture easier to reason about, and it should also make it possible to scale workers separately later.&lt;/p&gt;

&lt;p&gt;Rich HTTP checks&lt;/p&gt;

&lt;p&gt;The HTTP monitor is more than a simple “status code is 200” check.&lt;/p&gt;

&lt;p&gt;It supports:&lt;/p&gt;

&lt;p&gt;expected status codes&lt;br&gt;
response time thresholds&lt;br&gt;
body contains / not-contains assertions&lt;br&gt;
header assertions&lt;br&gt;
JSON path assertions&lt;br&gt;
redirect handling&lt;br&gt;
timeout and interval configuration&lt;br&gt;
TLS-related options&lt;/p&gt;

&lt;p&gt;This matters because many real checks are not just “did I get a response?”&lt;/p&gt;

&lt;p&gt;Sometimes you need to verify that the response contains a specific value, that a JSON field exists, that an API returns the expected structure, or that a bad response is detected even if the status code still looks successful.&lt;/p&gt;
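
&lt;p&gt;To make that concrete, here is a minimal sketch of how such assertions could be evaluated against one response. It is not the actual Alon Sentinel code: the Observed, Expectation, and Verdict types and the evaluate function are hypothetical names for this example, and it covers only status, response time, and body text.&lt;/p&gt;

```rust
/// Hypothetical snapshot of one HTTP response (illustrative, not the project's types).
struct Observed {
    status: u16,
    elapsed_ms: u64,
    body: String,
}

/// Hypothetical expectations for a monitor.
/// Assumption: both text fields are non-empty in this sketch.
#[derive(Clone)]
struct Expectation {
    allowed_status: [u16; 2],
    max_elapsed_ms: u64,
    must_contain: String,
    must_not_contain: String,
}

/// Outcome of a single check; the first failing assertion wins.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    UnexpectedStatus,
    TooSlow,
    MissingText,
    ForbiddenText,
}

/// Applies the assertions in order: status, response time, body contains, body not-contains.
fn evaluate(obs: Observed, exp: Expectation) -> Verdict {
    if !exp.allowed_status.iter().any(|s| *s == obs.status) {
        return Verdict::UnexpectedStatus;
    }
    if obs.elapsed_ms > exp.max_elapsed_ms {
        return Verdict::TooSlow;
    }
    if !obs.body.contains(exp.must_contain.as_str()) {
        return Verdict::MissingText;
    }
    if obs.body.contains(exp.must_not_contain.as_str()) {
        return Verdict::ForbiddenText;
    }
    Verdict::Pass
}
```

&lt;p&gt;With a rule like this, a 200 response whose body contains a forbidden word such as "error" is reported as a failure even though the status code looks successful.&lt;/p&gt;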

&lt;p&gt;Other monitor types&lt;/p&gt;

&lt;p&gt;Current monitor types include:&lt;/p&gt;

&lt;p&gt;HTTP monitors&lt;br&gt;
SSL certificate checks&lt;br&gt;
DNS record checks&lt;br&gt;
TCP checks&lt;br&gt;
heartbeat monitors for cron jobs and background tasks&lt;/p&gt;

&lt;p&gt;Heartbeat monitors are especially useful for jobs that do not expose an HTTP endpoint. The job calls Alon Sentinel when it runs. If it does not call within the expected window, the monitor fails.&lt;/p&gt;
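
&lt;p&gt;The rule itself is small. The sketch below is only an illustration of the idea, not the project's implementation: the Heartbeat struct, its field names, and the grace period are assumptions made for this example.&lt;/p&gt;

```rust
/// Hypothetical heartbeat state; field names are illustrative only.
#[derive(Clone, Copy)]
struct Heartbeat {
    /// Unix time (seconds) of the most recent ping from the job.
    last_ping_secs: u64,
    /// How often the job is expected to call in.
    interval_secs: u64,
    /// Extra slack before the monitor is considered failed (an assumption of this sketch).
    grace_secs: u64,
}

/// Returns true once the job has missed its expected window.
fn is_overdue(hb: Heartbeat, now_secs: u64) -> bool {
    let deadline = hb.last_ping_secs + hb.interval_secs + hb.grace_secs;
    now_secs > deadline
}
```

&lt;p&gt;A worker would run a comparison like this on a timer, which is why a job that silently stops still produces a failure.&lt;/p&gt;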

&lt;p&gt;Incidents, notifications, and status pages&lt;/p&gt;

&lt;p&gt;The system also includes:&lt;/p&gt;

&lt;p&gt;incident lifecycle&lt;br&gt;
public status pages&lt;br&gt;
Slack notifications&lt;br&gt;
Discord notifications&lt;br&gt;
webhook notifications&lt;br&gt;
email notifications&lt;br&gt;
per-site notification overrides&lt;br&gt;
RBAC&lt;br&gt;
Prometheus metrics&lt;br&gt;
Docker Compose deployment&lt;/p&gt;

&lt;p&gt;The goal is not to replace Prometheus or Grafana. Alon Sentinel exposes Prometheus metrics, so it can fit alongside that stack.&lt;/p&gt;

&lt;p&gt;The focus is different: service availability, incidents, notifications, and public status communication.&lt;/p&gt;

&lt;p&gt;Why Rust?&lt;/p&gt;

&lt;p&gt;I previously worked on monitoring agents for Windows and Ubuntu back in 2019, and I really liked that type of work.&lt;/p&gt;

&lt;p&gt;For this project, Rust felt like the right fit again.&lt;/p&gt;

&lt;p&gt;The worker needs to run many checks, deal with timeouts, avoid unnecessary resource usage, and handle failure paths carefully. Rust and Tokio are a good match for that kind of backend.&lt;/p&gt;

&lt;p&gt;I also wanted predictable resource usage. That was one of the reasons I was interested in benchmarking it early.&lt;/p&gt;

&lt;p&gt;Benchmarking against Uptime Kuma&lt;/p&gt;

&lt;p&gt;I ran benchmarks against Uptime Kuma, with both tools on the same Hetzner CPX32 server.&lt;/p&gt;

&lt;p&gt;One result that stood out was at 4,000 monitors with a 30s interval:&lt;/p&gt;

&lt;p&gt;Alon Sentinel / PostgreSQL: ~390 MiB average RAM&lt;br&gt;
Uptime Kuma / SQLite: ~738 MiB average RAM&lt;br&gt;
Uptime Kuma / MariaDB: ~1.96 GiB average RAM&lt;/p&gt;

&lt;p&gt;This is only one setup, so I do not want to over-claim from it. Benchmarks depend heavily on workload, configuration, hardware, database setup, and methodology.&lt;/p&gt;

&lt;p&gt;But the difference was large enough that I thought it was worth sharing.&lt;/p&gt;

&lt;p&gt;The benchmark script also tracks:&lt;/p&gt;

&lt;p&gt;CPU average/max&lt;br&gt;
RAM average/max&lt;br&gt;
expected checks&lt;br&gt;
completed checks&lt;br&gt;
missed checks&lt;br&gt;
incidents&lt;br&gt;
duplicate incidents&lt;br&gt;
worker errors&lt;/p&gt;

&lt;p&gt;There are still things I want to add, especially better database-level metrics:&lt;/p&gt;

&lt;p&gt;pool wait time&lt;br&gt;
active/idle connections&lt;br&gt;
query timing&lt;br&gt;
check lateness&lt;/p&gt;

&lt;p&gt;CPU and RAM are useful, but they do not fully explain where the system is spending time under load.&lt;/p&gt;

&lt;p&gt;What I am thinking about next&lt;/p&gt;

&lt;p&gt;Some of the areas I am currently thinking about:&lt;/p&gt;

&lt;p&gt;better scheduling for large numbers of async checks&lt;br&gt;
jitter to avoid burst behavior&lt;br&gt;
bounded concurrency&lt;br&gt;
PostgreSQL pool sizing&lt;br&gt;
better benchmark reporting&lt;br&gt;
more notification integrations&lt;br&gt;
multi-region checks&lt;br&gt;
custom status page domains&lt;/p&gt;

&lt;p&gt;The scheduling side is especially interesting. At higher monitor counts and shorter intervals, it is not enough to just spawn more tasks. You need to think about bursts, timeouts, database writes, and how the system behaves when targets are slow or failing.&lt;/p&gt;
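
&lt;p&gt;As one way to picture the burst problem, here is a small sketch that spreads first-run times evenly across the interval. It is a deterministic stand-in for random jitter, not how Alon Sentinel schedules checks today, and it does not cover bounded concurrency or database writes. The function names are hypothetical, and the sketch assumes total is greater than zero.&lt;/p&gt;

```rust
/// Start offset in milliseconds for the monitor at position `index` out of `total`,
/// spaced evenly over one interval. Assumption: `total` is greater than zero.
fn start_offset_ms(index: u64, total: u64, interval_ms: u64) -> u64 {
    index * interval_ms / total
}

/// Counts how many monitors start inside the half-open window [from_ms, to_ms).
fn starts_in_window(total: u64, interval_ms: u64, from_ms: u64, to_ms: u64) -> usize {
    (0..total)
        .map(|i| start_offset_ms(i, total, interval_ms))
        .filter(|off| *off >= from_ms)
        .filter(|off| to_ms > *off)
        .count()
}
```

&lt;p&gt;With 4,000 monitors on a 30s interval, this spacing starts roughly 133 checks per second instead of 4,000 at the same instant.&lt;/p&gt;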

&lt;p&gt;Current status&lt;/p&gt;

&lt;p&gt;This is the first public release.&lt;/p&gt;

&lt;p&gt;It is already usable, but still early. I am mostly looking for feedback from people who run real self-hosted monitoring stacks or have built similar Rust/Tokio worker systems.&lt;/p&gt;

&lt;p&gt;The questions I care most about right now are:&lt;/p&gt;

&lt;p&gt;Does the site-centric model make sense?&lt;br&gt;
Would an API-first monitoring tool be useful in your setup?&lt;br&gt;
What would stop you from trying it?&lt;br&gt;
What benchmark metrics would you want to see?&lt;br&gt;
How would you approach scheduling many async checks in Rust?&lt;/p&gt;

&lt;p&gt;Project links:&lt;/p&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/tomnekic/alon-sentinel" rel="noopener noreferrer"&gt;https://github.com/tomnekic/alon-sentinel&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://demo.alon.systems" rel="noopener noreferrer"&gt;https://demo.alon.systems&lt;/a&gt;&lt;br&gt;
Docs/API: &lt;a href="https://docs.alon.systems" rel="noopener noreferrer"&gt;https://docs.alon.systems&lt;/a&gt;&lt;br&gt;
Benchmarks: &lt;a href="https://github.com/tomnekic/alon-sentinel/tree/main/benchmark" rel="noopener noreferrer"&gt;https://github.com/tomnekic/alon-sentinel/tree/main/benchmark&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>opensource</category>
      <category>postgres</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
