Boris Kl

Posted on Jun 26

Approval-driven server ops: how I let contractors restart nginx without ever giving them SSH

#linux #security #automation #devops

I run a small WordPress + Cloudflare agency. Two recurring pains finally pushed me to build something instead of complaining:

Pain 1: contractor access drift. I'd hand a contractor root SSH for "just this one task". Three jobs later, I'd realize four ex-contractors still had access I never rotated. The "I'll rotate keys later" lie compounds quickly.

Pain 2: 3am Telegram from a client. "Site is down" while I'm on the subway with no laptop. Find Wi-Fi, SSH in, systemctl restart nginx. Five minutes of actual work, ninety minutes of friction.

The constraint I wanted: a contractor should be able to restart nginx on a specific server, and nothing else. Not "a contractor with a limited shell". Not "a contractor with sudo restricted via sudoers". A contractor with no shell at all, and a chat command that does exactly one thing.

This article walks through the architecture I ended up with. The components and the grammar are real, with names taken from the actual source tree. If it's useful, fork the idea; if you'd rather not build it, there's a link at the end.

The shape of the solution

Three components:

Operator (Telegram)
  APPROVE: action + nonce + ts (+ TOTP)
        ↓
Python control plane
  - aiogram bot
  - audit hash-chain (SHA-256)
  - policy engine + runtime state (SQLite WAL)
        ↓ mTLS              ↓ HTTPS
   Go agent           Cloudflare API
   (customer's        (under-attack mode,
    server)            challenges, blocks)
   diagnostics +
   allowlisted ops

The Go agent sits on the managed server. It speaks one protocol, HTTP over mutual TLS (mTLS), to the Python control plane. It never accepts shell commands. It accepts only a fixed list of operations declared by a capability manifest loaded at startup.

The control plane talks to operators via a Telegram bot (aiogram) and to Cloudflare via their REST API. It runs the policy engine, persists state in SQLite (WAL mode), and writes a hash-chained audit log.

An operator's only interface is a Telegram chat. They never touch the server.

The APPROVE grammar

Every mutating action goes through this grammar. There is no other path.

APPROVE: <action> <client_id> nonce=<16-hex> ts=<unix_epoch>

With TOTP enabled (env var SECMON_TELEGRAM_TOTP_SECRET):

APPROVE: <action> <client_id> otp=<6digits> nonce=<16-hex> ts=<unix_epoch>
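A minimal sketch of parsing that grammar, assuming fields are separated by single spaces, the nonce is lowercase hex, and client_id is limited to letters, digits, underscores and hyphens. The names Approval and parse_approve are hypothetical, not taken from the source tree.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed lexical rules: single spaces, lowercase hex nonce,
# client_id restricted to [A-Za-z0-9_-]. The optional otp field sits
# between client_id and nonce, as in the second grammar form.
APPROVE_RE = re.compile(
    r"APPROVE: (?P<action>[a-z-]+) (?P<client_id>[A-Za-z0-9_-]+)"
    r"(?: otp=(?P<otp>[0-9]{6}))?"
    r" nonce=(?P<nonce>[0-9a-f]{16}) ts=(?P<ts>[0-9]{1,12})"
)


@dataclass(frozen=True)
class Approval:
    action: str
    client_id: str
    nonce: str
    ts: int
    otp: Optional[str] = None


def parse_approve(text: str) -> Optional[Approval]:
    """Return the parsed approval, or None if the message is not well formed."""
    match = APPROVE_RE.fullmatch(text.strip())
    if match is None:
        return None
    return Approval(
        action=match["action"],
        client_id=match["client_id"],
        nonce=match["nonce"],
        ts=int(match["ts"]),
        otp=match["otp"],
    )
```

Using fullmatch rather than search means trailing text after ts=... makes the whole message invalid, which keeps "there is no other path" true at the parser level.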

Rules the control plane enforces against the message:

Action allowlist. <action> must be in the canonical ActionID enum. Today that's restart-web, restart-php, restart-db, cf-under-attack, cf-managed-challenge, cf-targeted-block, cf-pattern-block, cf-rate-limit, enable-cf-auto. New actions require code + manifest + tests, not configuration. That's deliberate friction.
TTL. |now - ts| <= 300 seconds. Stale approvals get rejected, not silently applied.
One-time nonce. The 16-hex nonce is claimed atomically against a SQLite-backed nonce store. Replay = explicit rejection, not silent dedup.
TOTP second factor (when configured). Six-digit OTP from a pyotp-compatible secret. The control plane refuses the approval if the TOTP doesn't match.
Telegram sender allowlist. SECMON_TELEGRAM_ALLOWED_USER_IDS is checked before parsing. Unknown sender → message logged, action not even parsed.

If any of those checks fails, the agent never receives a request: the rejection happens in the control plane, is written to the audit log with a reason, and the operator gets a curt Telegram reply explaining what was wrong.
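Three of those rules (action allowlist, TTL, one-time nonce) can be sketched as follows. The table layout, the function names and the reason strings are assumptions for illustration. The atomic claim relies on a PRIMARY KEY constraint, so a replayed nonce fails the INSERT instead of being silently deduplicated.

```python
import sqlite3
from typing import Optional

# Action names as listed in the allowlist above.
ALLOWED_ACTIONS = frozenset({
    "restart-web", "restart-php", "restart-db",
    "cf-under-attack", "cf-managed-challenge", "cf-targeted-block",
    "cf-pattern-block", "cf-rate-limit", "enable-cf-auto",
})
TTL_SECONDS = 300


def open_nonce_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a nonce store. The schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS nonces ("
        "nonce TEXT PRIMARY KEY, claimed_at INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def claim_nonce(conn: sqlite3.Connection, nonce: str, now: int) -> bool:
    """Claim a nonce exactly once; False means it was already used."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO nonces (nonce, claimed_at) VALUES (?, ?)",
                (nonce, now),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def rejection_reason(conn: sqlite3.Connection, action: str, nonce: str,
                     ts: int, now: int) -> Optional[str]:
    """Return None if the approval passes, else a reason for the audit log."""
    if action not in ALLOWED_ACTIONS:
        return "unknown_action"
    if abs(now - ts) > TTL_SECONDS:
        return "stale_timestamp"
    if not claim_nonce(conn, nonce, now):
        return "nonce_replayed"
    return None
```

The nonce is claimed last on purpose: a message that fails the cheaper checks never consumes a row in the store.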

The capability manifest

The agent advertises its capabilities via a JSON document, signed at deploy time with HMAC-SHA256.

canonical = JSON of manifest body (keys sorted, no "signature" field)
signature = HMAC-SHA256(key=SECMON_MANIFEST_SECRET, msg=canonical).hexdigest()

When the agent comes up, it loads capabilities.manifest.json. When the control plane dispatches an action, it pulls the agent's manifest, verifies the HMAC, and refuses anything not in the signed list. The agent cannot lie about what it supports — and the control plane cannot accidentally route a restart-db to a host whose manifest only declared restart-web.
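A sketch of the sign-and-verify round trip using only the standard library. The "actions" field name and the compact JSON separators are assumptions; the real manifest schema isn't shown in this article.

```python
import hashlib
import hmac
import json


def canonical_body(manifest: dict) -> bytes:
    """Serialize the manifest without its signature, keys sorted."""
    body = {k: v for k, v in manifest.items() if k != "signature"}
    # Assumption: compact separators. Any fixed choice works as long as
    # the signer and the verifier agree on it.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, secret: bytes) -> dict:
    """Return a copy of the manifest with a hex HMAC-SHA256 signature."""
    sig = hmac.new(secret, canonical_body(manifest), hashlib.sha256).hexdigest()
    return {**manifest, "signature": sig}


def verify_manifest(manifest: dict, secret: bytes) -> bool:
    """Recompute the HMAC and compare it in constant time."""
    expected = hmac.new(secret, canonical_body(manifest), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(manifest.get("signature", "")))


def action_permitted(manifest: dict, secret: bytes, action: str) -> bool:
    """Dispatch only if the signature holds AND the action is declared."""
    # "actions" is a hypothetical field name for the signed capability list.
    return verify_manifest(manifest, secret) and action in manifest.get("actions", [])
```

For example, a manifest signed with only restart-web declared returns False from action_permitted for restart-db, and adding restart-db to the list after signing invalidates the signature.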

This is the cheap version of the right idea. The expensive version is signed-everything-everywhere. HMAC-SHA256 with a shared secret is enough for "operator wants X, agent is allowed to do X". For multi-tenant trust I'd reach for asymmetric signatures, because anyone holding the shared secret can forge a manifest as well as verify one, but for a self-hosted agent paired with a single control plane, shared HMAC is good enough.

The audit log — hash-chained, verifiable offline

Every approval, every dispatch, every Cloudflare auto-mitigation gets a JSON record in an append-only JSONL file. Each record carries prev_hash and record_hash:

canonical   = JSON of record with prev_hash + record_hash stripped, keys sorted
record_hash = sha256(prev_hash || "|" || canonical).hexdigest()
prev_hash   = record_hash of the preceding line

The first record has prev_hash = "". Every subsequent record locks in everything before it. Tamper with any line, every line after fails verification.

Verification has a CLI: secmon-audit-verify. It walks the chain and returns the line numbers of any breaks.
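The chaining rule and the verifier's walk can be sketched like this. It is an illustration of the scheme, not the secmon-audit-verify source; the function names and the in-memory list of JSONL lines are assumptions.

```python
import hashlib
import json


def canonical(record: dict) -> str:
    """JSON of the record with both hash fields stripped, keys sorted."""
    body = {k: v for k, v in record.items() if k not in ("prev_hash", "record_hash")}
    return json.dumps(body, sort_keys=True, separators=(",", ":"))


def compute_hash(prev_hash: str, record: dict) -> str:
    """sha256(prev_hash || "|" || canonical) as a hex digest."""
    return hashlib.sha256((prev_hash + "|" + canonical(record)).encode("utf-8")).hexdigest()


def append_record(lines: list, event: dict) -> None:
    """Append one event to the log (a list of JSONL lines)."""
    prev_hash = json.loads(lines[-1])["record_hash"] if lines else ""
    record = {**event, "prev_hash": prev_hash}
    record["record_hash"] = compute_hash(prev_hash, record)
    lines.append(json.dumps(record, sort_keys=True))


def verify_chain(lines: list) -> list:
    """Return the 1-based line numbers that fail verification."""
    broken = []
    prev_hash = ""
    for lineno, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
            expected = compute_hash(prev_hash, record)
            ok = record["prev_hash"] == prev_hash and record["record_hash"] == expected
        except (ValueError, KeyError, TypeError, AttributeError):
            broken.append(lineno)  # unparseable or malformed line
            continue
        if not ok:
            broken.append(lineno)
        # Chain from the recomputed hash, so a tampered line also
        # invalidates every line that follows it.
        prev_hash = expected
    return broken
```

Chaining from the recomputed hash rather than the stored one is what makes a single edited line show up as a break on that line and on every line after it.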

The threat model is "compromised SaaS cannot lie about what your servers did, after the fact". I think that's the audit threat model that actually matters for an agency that has to show a client what happened during an incident.

(The audit log is separate from the licensing JWT, which uses Ed25519. Different concern. Don't conflate them.)

What it deliberately is NOT

This is just as important as the feature list:

Not a monitoring stack. Use Prometheus, UptimeRobot, Grafana. This tool reacts to incidents — it doesn't detect them. The diagnostics module on the agent is read-only triage (/diag), not continuous monitoring.
Not a remote desktop. No raw shell exec. If you want SSH, you want SSH; this isn't a worse SSH.
Not DDoS protection. Cloudflare is your shield. The control plane can auto-dispatch Cloudflare's under-attack mode in response to certain patterns, but auto-dispatch has a hard kill-switch (SECMON_AUTO_CF_DISABLED=1) checked on every dispatch path. Manual APPROVE: cf-under-attack still works as an explicit operator override, with auto_dispatch_overridden_by=approve written to audit.
Not a ticketing system. No "open ticket for this incident". The audit log is the record, your team's normal tools do the workflow.
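The kill-switch behaviour described above for auto-dispatch might look roughly like this. The function names and the "source" field are hypothetical, and I'm assuming here that auto_dispatch_overridden_by is only recorded when the kill-switch is actually set; the article doesn't pin that down.

```python
import os
from typing import Mapping, Optional

KILL_SWITCH = "SECMON_AUTO_CF_DISABLED"


def auto_dispatch_allowed(env: Mapping[str, str]) -> bool:
    """Auto-dispatch is allowed unless the kill-switch is exactly "1"."""
    return env.get(KILL_SWITCH) != "1"


def plan_under_attack(source: str,
                      env: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Return the audit record for a dispatch, or None if it is suppressed.

    source is "auto" (pattern-triggered) or "approve" (operator message).
    """
    env = os.environ if env is None else env
    if source == "auto":
        if not auto_dispatch_allowed(env):
            return None  # kill-switch wins on the automatic path
        return {"action": "cf-under-attack", "source": "auto"}
    if source == "approve":
        record = {"action": "cf-under-attack", "source": "approve"}
        if not auto_dispatch_allowed(env):
            # Assumed semantics: note that an operator overrode the switch.
            record["auto_dispatch_overridden_by"] = "approve"
        return record
    raise ValueError(f"unknown dispatch source: {source!r}")
```

Reading the environment on every call, instead of caching it at startup, is what "checked on every dispatch path" implies.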

Every feature I didn't add is a feature I don't have to maintain. At my scale, that math dominates.

Threat model: assume the bot token leaks

The Telegram bot token gets passed around. Plan for the leak.

If the bot token leaks alone:

Attacker can spam the bot's Telegram channel.
Attacker cannot invoke an action, because:
- Their Telegram user ID isn't in SECMON_TELEGRAM_ALLOWED_USER_IDS. Message gets logged, parsing never starts.
- Even if they were on the allowlist, they don't have the TOTP secret.
- Even if they had the TOTP, they can't reuse a stolen nonce (one-time, atomic claim).
- Even with a fresh nonce, a captured approval is rejected once |now - ts| > 300s.
Attacker cannot add themselves to the allowlist or read the TOTP secret: both live in environment variables on the control plane host. Compromising the host changes the threat model entirely.

To actually run an operation against a server, the attacker would need all of:

Bot token (Telegram side)
TOTP secret (env var on control plane)
Telegram user ID on the allowlist (env var on control plane)
A fresh, unused nonce (or write access to the nonce store)
A timestamp within ±300s of dispatch

That's the design point. Each layer is cheap to add and forces an attacker to compromise distinct surfaces.
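To show how cheap one of those layers is, here is the TOTP check written against the standard library instead of pyotp. It follows RFC 6238 with SHA-1, a 30-second step and six digits, which are pyotp's defaults. The one-step drift window and the function names are my assumptions.

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32: str, at: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP (HMAC-SHA1) for the Unix time `at`."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = struct.pack(">Q", max(at, 0) // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, RFC 4226
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


def otp_matches(secret_b32: str, otp: str, now: int,
                step: int = 30, window: int = 1) -> bool:
    """Accept the current step plus `window` steps either side (clock drift)."""
    # Assumes otp is ASCII digits, as the APPROVE grammar requires.
    return any(
        hmac.compare_digest(totp(secret_b32, now + i * step, step), otp)
        for i in range(-window, window + 1)
    )
```

A wider window is friendlier to operators with drifting phone clocks but lengthens the time a shoulder-surfed code stays valid, so I'd keep it at one step unless there's a reason not to.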

What I'd do differently

A few honest reflections:

TOTP belongs in v1, not v3. I shipped the bot + APPROVE grammar first, added nonce/TTL in a "security hardening" sprint, added TOTP in another security hardening sprint. Should have been one combined hardening pass before the first real customer touched it. I got away with it because the first customer was me, but I wouldn't ship a v2 without it.
The hash-chain is great until it isn't. A corrupt last line used to be swallowed silently — new events would be written with prev_hash="" and produce a "verifiable but forked" log. Fixed by explicitly propagating AuditChainError on read failure or JSON parse failure, with an opt-in manual recovery path. Lesson: tamper-evidence is a property of the whole chain, including the read path, not just the write path.
Per-tenant Cloudflare credentials need their own design. I had a single CF token at the start. That works for one client. For multi-client it became per-(client_id, zone_id) rows encrypted at rest with Fernet (SECMON_DATA_KEY). Earlier this had a one-key-per-deployment shape and I had to migrate — should have started multi-tenant from the first commit.
Diagnostics is bigger than ops. The agent's read-only diagnostics module (HTTP, web, PHP, DB, disk, SSL, flood patterns) is about 1.5K lines. The mutating-ops surface is smaller. That ratio is honest: you spend much more time looking than fixing, and read-only checks have to be deeply specific or they're noise.
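The hash-chain fix described above comes down to refusing to guess when the tail of the log is unreadable. A sketch, assuming a hypothetical last_record_hash helper and treating a missing or empty file as a brand-new log (AuditChainError is the name from the real code; everything else is illustrative):

```python
import json
import os


class AuditChainError(Exception):
    """Raised when the tail of the audit log cannot be trusted."""


def last_record_hash(path: str) -> str:
    """Return the prev_hash to use for the next record.

    Only a missing or empty log yields the genesis value "". Any read
    or parse failure is propagated instead of silently restarting the chain.
    """
    if not os.path.exists(path):
        return ""  # assumption: no file yet means a fresh log
    try:
        with open(path, "r", encoding="utf-8") as fh:
            lines = fh.read().splitlines()
        if not lines:
            return ""
        return str(json.loads(lines[-1])["record_hash"])
    except (OSError, ValueError, KeyError, TypeError) as exc:
        raise AuditChainError(f"unreadable last record in {path}") from exc
```

The buggy version is the same function with the except branch returning "" instead of raising, which is exactly how a "verifiable but forked" log gets written.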

Closing

If you run client infrastructure and the "shared root password" thing makes you wince, the architecture above is buildable. The Go agent is a single binary with no runtime deps. The control plane is Python with SQLite. Nothing exotic in the stack.

If you'd rather not build it: I packaged this up as secmon.io. Pricing is $49/mo Starter (1 server, 4h response), $199 Pro (5 servers, 1h response), $599 Agency (20 servers, monthly incident review). 7-day pilot on all tiers. The agent source is currently closed; open-sourcing it under AGPL is on the roadmap if there's demand, and I'm happy to discuss the trade-offs in comments.

If you'd like to be one of the first ten pilots, the contact form is on the homepage. Honest feedback is more useful to me right now than your money.

Critique very welcome — what's broken in the threat model? What's the obvious bigger competitor I missed?

DEV Community