DEV Community

Filippo Crotti

Why I built a self-hosted centralized backup manager

There’s no shortage of backup tools.

But none of them gave me a simple way to manage backups across multiple machines from one place, so I built my own.

This is the architecture behind it.

The problem

I was managing backups for a small setup: two servers, a handful of Docker containers, and a few directories that needed regular backups.

What I wanted was simple:

  • one place to see all backups
  • clear visibility on what ran, when, and whether it succeeded
  • metrics like transferred data and snapshot size
  • SSO support via OIDC, since my stack already runs behind Zitadel

There was also another constraint in the background: compliance.

I had started evaluating what would be needed to align with ISO 27001, and backup visibility, traceability, and centralized control quickly became non-negotiable.

Backrest was the closest match I found. It’s well built, and Restic is rock solid underneath.

The problem is that Backrest is still single-machine. You deploy one instance per server, and each one has its own UI. That is not centralized management. It is just multiple dashboards.

Other solutions exist, but none of them matched the combination I needed.

I could not find an open source tool that combined:

  • centralized management
  • a modern UI
  • proper OIDC support
  • a structure that could fit into a compliance-oriented setup

So I built Arkeep.

Why not just use scripts?

Could I have glued everything together with cron, Restic, and a few shell scripts? Yes.

But that still leaves scheduling, visibility, multi-machine coordination, authentication, and auditability as separate problems.

I wanted one system, not a pile of scripts.

Why not SSH into machines?

Another option would be to have a central server SSH into each machine and trigger backups.

I avoided that on purpose.

Having agents maintain an outbound connection makes the system:

  • easier to deploy
  • easier to secure
  • more reliable across real-world setups

Also, my infrastructure runs behind Netbird and everything is behind SSO. That was a hard requirement from day one.

And to be honest, both SSO and VPN constraints were a bit of a pain in the ass to design around, so I leaned into a model that works with them instead of fighting them.

The architecture

The system is built around a server/agent model.

The server is the control plane. It runs:

  • REST API
  • web UI
  • scheduler
  • gRPC server
  • database

It stores configuration, schedules jobs, and receives results.

Agents run on each machine you want to back up.

They do not expose any ports. Each agent connects outbound to the server using a persistent gRPC stream and waits for jobs.

When a backup needs to run, the server pushes the job through that stream.

┌─────────────────────────────────┐
│           Arkeep Server         │
│  REST API  │  gRPC Server       │
│  Scheduler │  WebSocket Hub     │
│  SQLite/PG │  Notification Svc  │
└─────────────────────────────────┘
        ▲            ▲
        │ gRPC       │ gRPC
        │ (persistent stream)
┌───────┴────┐   ┌───┴────────┐
│  Agent A   │   │  Agent B   │
│  Server 1  │   │  Server 2  │
└────────────┘   └────────────┘

This model solves a few practical problems:

  • no inbound ports required
  • works cleanly behind NAT and VPN overlays like Netbird
  • instant disconnect detection when the stream drops
  • centralized visibility, which is important for audit and compliance

At a high level, dispatching a job looks like this:

func (s *Server) DispatchJob(agentID string, job *pb.JobRequest) error {
    stream, ok := s.agentStreams.Get(agentID)
    if !ok {
        return fmt.Errorf("agent %s offline", agentID)
    }

    return stream.Send(job)
}

No polling. No SSH. Just a persistent stream.

What the agent actually does

When a job arrives, the agent:

  1. Resolves sources (directories or Docker volumes)
  2. Runs pre-backup hooks
  3. Executes the backup using Restic
  4. Runs post-backup hooks
  5. Streams logs and metrics back to the server in real time
  6. Reports the final result

This is roughly how the agent wraps Restic:

cmd := exec.Command("restic", "backup", "--json", sourcePath)

stdout, err := cmd.StdoutPipe()
if err != nil {
    return err
}

if err := cmd.Start(); err != nil {
    return err
}

// Restic emits one JSON event per line; forward each to the job stream.
scanner := bufio.NewScanner(stdout)
for scanner.Scan() {
    var event ResticEvent
    if err := json.Unmarshal(scanner.Bytes(), &event); err == nil {
        jobStream.Send(event)
    }
}
scanErr := scanner.Err()

if err := cmd.Wait(); err != nil {
    return err
}
return scanErr

Restic stays the backup engine. The agent adds structure, observability, and control.

Docker volume discovery

For Docker workloads, the agent can automatically resolve volumes at runtime.

// mount.TypeVolume is the constant from github.com/docker/docker/api/types/mount;
// the loop variable is named m so it does not shadow that package.
func resolveVolumeMounts(cli *client.Client, containerID string) ([]string, error) {
    ctx := context.Background()

    info, err := cli.ContainerInspect(ctx, containerID)
    if err != nil {
        return nil, err
    }

    var paths []string
    for _, m := range info.Mounts {
        if m.Type == mount.TypeVolume {
            paths = append(paths, m.Source)
        }
    }

    return paths, nil
}

From Restic’s perspective, it is just backing up directories.

The server side

The server is written in Go.

  • HTTP layer: Chi
  • Agent communication: gRPC
  • Scheduler: gocron
  • Migrations: golang-migrate
  • Database: SQLite by default, PostgreSQL for larger setups

When a job triggers, the scheduler:

  • resolves the assigned agent
  • checks if the agent is connected
  • pushes the job over the active gRPC stream

If the agent is offline, the job fails immediately.
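The dispatch decision above can be sketched with a hypothetical in-memory registry (`streamRegistry` and its methods are illustrative names, not Arkeep's API); a bool stands in for the live gRPC stream so the flow stays self-contained:

```go
package main

import (
	"fmt"
	"sync"
)

// streamRegistry tracks which agents currently hold an open stream.
// In a real server this would map agent IDs to live gRPC streams.
type streamRegistry struct {
	mu     sync.RWMutex
	online map[string]bool
}

func newRegistry() *streamRegistry {
	return &streamRegistry{online: make(map[string]bool)}
}

// set records an agent coming online (stream opened) or offline (stream dropped).
func (r *streamRegistry) set(agentID string, up bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.online[agentID] = up
}

// dispatch mirrors the scheduler's decision: resolve the agent,
// check connectivity, and fail immediately if it is offline.
func (r *streamRegistry) dispatch(agentID, jobID string) error {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if !r.online[agentID] {
		return fmt.Errorf("job %s: agent %s offline", jobID, agentID)
	}
	// With a live stream, this is where stream.Send(job) would go.
	return nil
}

func main() {
	reg := newRegistry()
	reg.set("agent-a", true)
	fmt.Println(reg.dispatch("agent-a", "nightly")) // nil: job dispatched
	fmt.Println(reg.dispatch("agent-b", "nightly")) // error: agent offline
}
```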

Real-time updates are handled via WebSockets.

As agents stream logs back, the server broadcasts them to connected clients watching that job.
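The fan-out itself is independent of the WebSocket library. A minimal sketch of the hub pattern, using plain channels as stand-ins for WebSocket connections (`logHub` is a hypothetical name, not Arkeep's type):

```go
package main

import (
	"fmt"
	"sync"
)

// logHub fans job log lines out to every subscriber watching that job.
type logHub struct {
	mu   sync.Mutex
	subs map[string][]chan string // jobID -> subscriber channels
}

func newLogHub() *logHub {
	return &logHub{subs: make(map[string][]chan string)}
}

// subscribe registers a watcher for one job's log stream.
func (h *logHub) subscribe(jobID string) chan string {
	ch := make(chan string, 16) // buffered so a slow reader doesn't block the agent stream
	h.mu.Lock()
	h.subs[jobID] = append(h.subs[jobID], ch)
	h.mu.Unlock()
	return ch
}

// broadcast delivers one log line to all watchers of jobID,
// dropping the line for subscribers whose buffer is full.
func (h *logHub) broadcast(jobID, line string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[jobID] {
		select {
		case ch <- line:
		default: // slow consumer: drop rather than stall the hub
		}
	}
}

func main() {
	hub := newLogHub()
	watcher := hub.subscribe("job-42")
	hub.broadcast("job-42", "restic: 1.2 GiB transferred")
	fmt.Println(<-watcher) // → restic: 1.2 GiB transferred
}
```

In a real server, the per-channel goroutine would write each received line to its WebSocket connection; the drop-on-full choice trades completeness of the live view for never blocking the agent's stream.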

Authentication

Authentication is handled with JWT (RS256).

Passwords are hashed with Argon2id.

OIDC is implemented using Authorization Code + PKCE via coreos/go-oidc.

It plugs directly into providers like Zitadel, Keycloak, or Authentik.
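Libraries like coreos/go-oidc and golang.org/x/oauth2 handle the full flow, but the PKCE core is small enough to show with the standard library alone. A sketch of the RFC 7636 S256 scheme (function names are mine, not the library's):

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// newCodeVerifier returns a random PKCE code verifier. RFC 7636 wants
// 43-128 characters of unreserved characters; base64url-encoding
// 32 random bytes yields exactly 43.
func newCodeVerifier() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// s256Challenge derives the code_challenge sent in the authorization
// request: BASE64URL(SHA256(verifier)), without padding.
func s256Challenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	v, err := newCodeVerifier()
	if err != nil {
		panic(err)
	}
	fmt.Println("code_verifier: ", v)
	fmt.Println("code_challenge:", s256Challenge(v))
}
```

The client keeps the verifier and sends only the challenge; at the token exchange it reveals the verifier, so an intercepted authorization code is useless on its own.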

SSO support was not optional. It shaped a good part of the system design early on, especially in how agents and users authenticate and interact with the server.

The frontend

The frontend is a Vue 3 PWA served directly from the server binary as embedded static assets.

No separate web server needed.

It uses shadcn-vue and Tailwind CSS v4, with WebSocket subscriptions for real-time updates.

The UI is intentionally minimal for now:

  • agents
  • destinations
  • policies
  • jobs
  • snapshots

A proper dashboard is on the roadmap.

Current state and what’s next

Arkeep is currently at v0.1.0-beta.1.

I have been running it as a secondary backup system in a real environment for about a month.

So far it has been stable enough to validate the architecture, but it is still early and bugs are expected.

Planned improvements:

  • better dashboards and reporting
  • support for VMs
  • more destination types

Closing thoughts

Arkeep is open source under Apache 2.0.

Repo: https://github.com/arkeep-io/arkeep

If you are dealing with backups across multiple machines, how are you handling it today?

Are you using scripts, existing tools, or something custom?

Top comments (9)

Mykola Kondratiuk

multi-machine backup visibility is one of those things every existing tool almost solves. per-machine view is fine until you hit 4+ boxes - then the aggregated failure state becomes the actual work.

Filippo Crotti

Thank you Mykola for the comment,

This is exactly the problem Arkeep is built around. The per-machine view breaks down fast: by the time you're managing 4+ hosts you don't want to SSH into each one to check last night's run (I had 2 hosts and it was already a pain), you want a single place that tells you what failed and why.
The aggregated failure state is the actual work, as you said. That's what the dashboard and the job log view are designed for. One place, all hosts, full log output without hunting through individual machines.

Mykola Kondratiuk

yeah 2 hosts was my breaking point too - the moment you start scheduling "SSH into box B to verify" as an actual task, something has gone wrong. does Arkeep push alerts when a job fails or is it more of a dashboard-check workflow?

Filippo Crotti • Edited

I have added the possibility to connect an SMTP server and push email notifications to multiple recipients, so you don't have to look at a dashboard every morning. If something is wrong, Arkeep tells you what, even if you are not looking at it.
The SMTP server options are configurable in the settings page of the GUI, so you don't have to mess around with too much configuration via the terminal or docker compose.
Right now the available notifications are:

  • Agent disconnected or gone offline
  • Backup job failed
  • Backup job succeeded

I'm also thinking about push notifications, since the GUI is a PWA and it could be handy to have them on mobile or desktop in general.

Mykola Kondratiuk

smart - dashboards require intent, email doesn’t. nice that it’s already in the settings.

Jonathan Murray

The decision to build your own is often the right call when existing tools solve a different problem than the one you actually have. "One place to see all backups across multiple machines" sounds like table-stakes functionality but it genuinely isn't in most backup tools — they're designed around backing up one thing comprehensively, not monitoring many things consistently.

The architecture questions that usually matter most for a centralized backup manager: how do you handle the failure-reporting problem (when a backup job silently stops running vs when it runs and fails vs when it succeeds but produces a corrupt archive)? And how does the visibility layer handle the case where the central manager itself is down? There's an irony in building observability for backups on infrastructure that also needs to be backed up.

What's the transport mechanism between the remote machines and the central store? Push from agents or pull from the coordinator?

Filippo Crotti

Thank you Jonathan for the comment.

Arkeep can attach an SMTP host to send notifications when backup jobs fail or succeed.
The notifications also cover when an agent switches from online to offline.

Visibility into the central manager itself is something I'm working on; it is planned but not yet implemented. Observability is a must in a system like this, and it is on the roadmap for version 1.0.

The connection between agents and the central server is a persistent gRPC stream, initiated outbound by the agents only; the server then pushes jobs over that established stream.

Kai Alder

The outbound-only agent model is really smart. Been there with SSH-based backup setups and they're a nightmare once you have more than a few machines - key management, port forwarding through NAT, timeout issues when connections drop mid-backup...

One question about the gRPC stream - how are you handling reconnection? Like if the server restarts or there's a brief network blip, does the agent just backoff-and-retry automatically, or does it require manual intervention?

Also curious about the Docker volume discovery. Are you doing anything special for containers with multiple named volumes, or relying on users to specify which ones matter? I've seen setups where people accidentally backup massive cache volumes alongside their actual data.

Really nice that you went Apache 2.0 - makes integration easier than AGPL for teams that want to embed this in their stack. Might check it out for a few homelab servers that are still running janky borgmatic scripts.

Filippo Crotti • Edited

Thank you Kai for the comment.

When an agent starts, it automatically searches for the server connection until it's established. If the server goes down or a network interruption breaks the gRPC stream, the agent enters a "searching for server" state and keeps retrying until the connection is re-established; no manual intervention is needed unless the agent itself is stopped.

Nothing too fancy for Docker volumes yet. The agent automatically discovers whether Docker is available on each session, so if you add Docker after the agent is already deployed, it picks it up on the next connection without any manual changes. When creating a backup policy, the system shows the list of available volumes and lets you choose which ones to include, so you don't accidentally back up cache volumes you don't care about. If new volumes are added later, you can update the policy by editing it; there is no need to recreate anything, since discovery via the Docker socket handles changes automatically.