DEV Community

Filippo Crotti

Why I built a self-hosted centralized backup manager

There’s no shortage of backup tools.

But none of them gave me a simple way to manage backups across multiple machines from one place, so I built my own.

This is the architecture behind it.

The problem

I was managing backups for a small setup: two servers, a handful of Docker containers, and a few directories that needed regular backups.

What I wanted was simple:

  • one place to see all backups
  • clear visibility on what ran, when, and whether it succeeded
  • metrics like transferred data and snapshot size
  • SSO support via OIDC, since my stack already runs behind Zitadel

There was also another constraint in the background: compliance.

I had started evaluating what would be needed to align with ISO 27001, and backup visibility, traceability, and centralized control quickly became non-negotiable.

Backrest was the closest match I found. It’s well built, and Restic is rock solid underneath.

The problem is that Backrest is still single-machine. You deploy one instance per server, and each one has its own UI. That is not centralized management. It is just multiple dashboards.

Other solutions exist, but none of them matched the combination I needed.

I could not find an open source tool that combined:

  • centralized management
  • a modern UI
  • proper OIDC support
  • a structure that could fit into a compliance-oriented setup

So I built Arkeep.

Why not just use scripts?

Could I have glued everything together with cron, Restic, and a few shell scripts? Yes.

But that still leaves scheduling, visibility, multi-machine coordination, authentication, and auditability as separate problems.

I wanted one system, not a pile of scripts.

Why not SSH into machines?

Another option would be to have a central server SSH into each machine and trigger backups.

I avoided that on purpose.

Having agents maintain an outbound connection makes the system:

  • easier to deploy
  • easier to secure
  • more reliable across real-world setups

Also, my infrastructure runs behind Netbird and everything is behind SSO. That was a hard requirement from day one.

And to be honest, both SSO and VPN constraints were a bit of a pain in the ass to design around, so I leaned into a model that works with them instead of fighting them.

The architecture

The system is built around a server/agent model.

The server is the control plane. It runs:

  • REST API
  • web UI
  • scheduler
  • gRPC server
  • database

It stores configuration, schedules jobs, and receives results.

Agents run on each machine you want to back up.

They do not expose any ports. Each agent connects outbound to the server using a persistent gRPC stream and waits for jobs.

When a backup needs to run, the server pushes the job through that stream.

┌─────────────────────────────────┐
│           Arkeep Server         │
│  REST API  │  gRPC Server       │
│  Scheduler │  WebSocket Hub     │
│  SQLite/PG │  Notification Svc  │
└─────────────────────────────────┘
        ▲            ▲
        │ gRPC       │ gRPC
        │ (persistent stream)
┌───────┴────┐   ┌───┴────────┐
│  Agent A   │   │  Agent B   │
│  Server 1  │   │  Server 2  │
└────────────┘   └────────────┘

This model solves a few practical problems:

  • no inbound ports required
  • works cleanly behind NAT and VPN overlays like Netbird
  • instant disconnect detection when the stream drops
  • centralized visibility, which is important for audit and compliance

At a high level, dispatching a job looks like this:

func (s *Server) DispatchJob(agentID string, job *pb.JobRequest) error {
    stream, ok := s.agentStreams.Get(agentID)
    if !ok {
        return fmt.Errorf("agent %s offline", agentID)
    }

    return stream.Send(job)
}

No polling. No SSH. Just a persistent stream.

What the agent actually does

When a job arrives, the agent:

  1. Resolves sources (directories or Docker volumes)
  2. Runs pre-backup hooks
  3. Executes the backup using Restic
  4. Runs post-backup hooks
  5. Streams logs and metrics back to the server in real time
  6. Reports the final result

This is roughly how the agent wraps Restic:

cmd := exec.Command("restic", "backup", "--json", sourcePath)

stdout, err := cmd.StdoutPipe()
if err != nil {
    return err
}

if err := cmd.Start(); err != nil {
    return err
}

// Restic emits one JSON event per line; forward each to the job stream.
scanner := bufio.NewScanner(stdout)
for scanner.Scan() {
    var event ResticEvent
    if err := json.Unmarshal(scanner.Bytes(), &event); err == nil {
        jobStream.Send(event)
    }
}
scanErr := scanner.Err()

if err := cmd.Wait(); err != nil {
    return err
}
return scanErr

Restic stays the backup engine. The agent adds structure, observability, and control.

Docker volume discovery

For Docker workloads, the agent can automatically resolve volumes at runtime.

// mount.TypeVolume is the constant from github.com/docker/docker/api/types/mount;
// the loop variable is named m so it does not shadow that package.
func resolveVolumeMounts(cli *client.Client, containerID string) ([]string, error) {
    ctx := context.Background()

    info, err := cli.ContainerInspect(ctx, containerID)
    if err != nil {
        return nil, err
    }

    var paths []string
    for _, m := range info.Mounts {
        if m.Type == mount.TypeVolume {
            paths = append(paths, m.Source)
        }
    }

    return paths, nil
}

From Restic’s perspective, it is just backing up directories.

The server side

The server is written in Go.

  • HTTP layer: Chi
  • Agent communication: gRPC
  • Scheduler: gocron
  • Migrations: golang-migrate
  • Database: SQLite by default, PostgreSQL for larger setups

When a job triggers, the scheduler:

  • resolves the assigned agent
  • checks if the agent is connected
  • pushes the job over the active gRPC stream

If the agent is offline, the job fails immediately.
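The dispatch decision above can be sketched with a hypothetical in-memory registry (`streamRegistry` and its methods are illustrative names, not Arkeep's API); a bool stands in for the live gRPC stream so the flow stays self-contained:

```go
package main

import (
	"fmt"
	"sync"
)

// streamRegistry tracks which agents currently hold an open stream.
// In a real server this would map agent IDs to live gRPC streams.
type streamRegistry struct {
	mu     sync.RWMutex
	online map[string]bool
}

func newRegistry() *streamRegistry {
	return &streamRegistry{online: make(map[string]bool)}
}

// set records an agent coming online (stream opened) or offline (stream dropped).
func (r *streamRegistry) set(agentID string, up bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.online[agentID] = up
}

// dispatch mirrors the scheduler's decision: resolve the agent,
// check connectivity, and fail immediately if it is offline.
func (r *streamRegistry) dispatch(agentID, jobID string) error {
	r.mu.RLock()
	defer r.mu.RUnlock()
	if !r.online[agentID] {
		return fmt.Errorf("job %s: agent %s offline", jobID, agentID)
	}
	// With a live stream, this is where stream.Send(job) would go.
	return nil
}

func main() {
	reg := newRegistry()
	reg.set("agent-a", true)
	fmt.Println(reg.dispatch("agent-a", "nightly")) // nil: job dispatched
	fmt.Println(reg.dispatch("agent-b", "nightly")) // error: agent offline
}
```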

Real-time updates are handled via WebSockets.

As agents stream logs back, the server broadcasts them to connected clients watching that job.
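The fan-out itself is independent of the WebSocket library. A minimal sketch of the hub pattern, using plain channels as stand-ins for WebSocket connections (`logHub` is a hypothetical name, not Arkeep's type):

```go
package main

import (
	"fmt"
	"sync"
)

// logHub fans job log lines out to every subscriber watching that job.
type logHub struct {
	mu   sync.Mutex
	subs map[string][]chan string // jobID -> subscriber channels
}

func newLogHub() *logHub {
	return &logHub{subs: make(map[string][]chan string)}
}

// subscribe registers a watcher for one job's log stream.
func (h *logHub) subscribe(jobID string) chan string {
	ch := make(chan string, 16) // buffered so a slow reader doesn't block the agent stream
	h.mu.Lock()
	h.subs[jobID] = append(h.subs[jobID], ch)
	h.mu.Unlock()
	return ch
}

// broadcast delivers one log line to all watchers of jobID,
// dropping the line for subscribers whose buffer is full.
func (h *logHub) broadcast(jobID, line string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[jobID] {
		select {
		case ch <- line:
		default: // slow consumer: drop rather than stall the hub
		}
	}
}

func main() {
	hub := newLogHub()
	watcher := hub.subscribe("job-42")
	hub.broadcast("job-42", "restic: 1.2 GiB transferred")
	fmt.Println(<-watcher) // → restic: 1.2 GiB transferred
}
```

In a real server, the per-channel goroutine would write each received line to its WebSocket connection; the drop-on-full choice trades completeness of the live view for never blocking the agent's stream.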

Authentication

Authentication is handled with JWT (RS256).

Passwords are hashed with Argon2id.

OIDC is implemented using Authorization Code + PKCE via coreos/go-oidc.

It plugs directly into providers like Zitadel, Keycloak, or Authentik.
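Libraries like coreos/go-oidc and golang.org/x/oauth2 handle the full flow, but the PKCE core is small enough to show with the standard library alone. A sketch of the RFC 7636 S256 scheme (function names are mine, not the library's):

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// newCodeVerifier returns a random PKCE code verifier. RFC 7636 wants
// 43-128 characters of unreserved characters; base64url-encoding
// 32 random bytes yields exactly 43.
func newCodeVerifier() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

// s256Challenge derives the code_challenge sent in the authorization
// request: BASE64URL(SHA256(verifier)), without padding.
func s256Challenge(verifier string) string {
	sum := sha256.Sum256([]byte(verifier))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	v, err := newCodeVerifier()
	if err != nil {
		panic(err)
	}
	fmt.Println("code_verifier: ", v)
	fmt.Println("code_challenge:", s256Challenge(v))
}
```

The client keeps the verifier and sends only the challenge; at the token exchange it reveals the verifier, so an intercepted authorization code is useless on its own.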

SSO support was not optional. It shaped a good part of the system design early on, especially in how agents and users authenticate and interact with the server.

The frontend

The frontend is a Vue 3 PWA served directly from the server binary as embedded static assets.

No separate web server needed.

It uses shadcn-vue and Tailwind CSS v4, with WebSocket subscriptions for real-time updates.

The UI is intentionally minimal for now:

  • agents
  • destinations
  • policies
  • jobs
  • snapshots

A proper dashboard is on the roadmap.

Current state and what’s next

Arkeep is currently at v0.1.0-beta.1.

I have been running it as a secondary backup system in a real environment for about a month.

So far it has been stable enough to validate the architecture, but it is still early and bugs are expected.

Planned improvements:

  • better dashboards and reporting
  • support for VMs
  • more destination types

Closing thoughts

Arkeep is open source under Apache 2.0.

Repo: https://github.com/arkeep-io/arkeep

If you are dealing with backups across multiple machines, how are you handling it today?

Are you using scripts, existing tools, or something custom?

Top comments (9)

Mykola Kondratiuk

multi-machine backup visibility is one of those things every existing tool almost solves. per-machine view is fine until you hit 4+ boxes - then the aggregated failure state becomes the actual work.

Filippo Crotti

Thank you Mykola for the comment,

This is exactly the problem Arkeep is built around. The per-machine view breaks down fast: by the time you're managing 4+ hosts you don't want to SSH into each one to check last night's run (I had 2 hosts and it was already a pain), you want a single place that tells you what failed and why.
The aggregated failure state is the actual work, as you said. That's what the dashboard and the job log view are designed for. One place, all hosts, full log output without hunting through individual machines.

Mykola Kondratiuk

yeah 2 hosts was my breaking point too - the moment you start scheduling "SSH into box B to verify" as an actual task, something has gone wrong. does Arkeep push alerts when a job fails or is it more of a dashboard-check workflow?

Filippo Crotti • Edited

I have added the possibility to connect an SMTP server and push email notifications to multiple recipients, so you don't have to look at a dashboard every morning. If something is wrong, Arkeep tells you what, even if you are not looking at it.
The SMTP server options are configurable in the settings page of the GUI, so you don't have to mess around with too much configuration via the terminal or docker compose.
Right now the available notifications are:

  • Agent disconnected or gone offline
  • Backup job failed
  • Backup job succeeded

I'm also thinking about push notifications, since the GUI is a PWA and it could be handy to have them on mobile or desktop in general.

Mykola Kondratiuk

smart - dashboards require intent, email doesn’t. nice that it’s already in the settings.

Jonathan Murray

The decision to build your own is often the right call when existing tools solve a different problem than the one you actually have. "One place to see all backups across multiple machines" sounds like table-stakes functionality but it genuinely isn't in most backup tools — they're designed around backing up one thing comprehensively, not monitoring many things consistently.

The architecture questions that usually matter most for a centralized backup manager: how do you handle the failure-reporting problem (when a backup job silently stops running vs when it runs and fails vs when it succeeds but produces a corrupt archive)? And how does the visibility layer handle the case where the central manager itself is down? There's an irony in building observability for backups on infrastructure that also needs to be backed up.

What's the transport mechanism between the remote machines and the central store? Push from agents or pull from the coordinator?

Filippo Crotti

Thank you Jonathan for the comment.

Arkeep can attach an SMTP host to send notifications when backup jobs fail or succeed.
The notifications also cover when an agent switches from online to offline.

Visibility into the central manager itself is something I'm working on; it is planned but not yet implemented. Observability is a must in a system like this, and it is on the roadmap for version 1.0.

The connection between agents and the central server is a persistent gRPC stream, initiated outbound by the agents only; the server then pushes jobs over that established stream.

Kai Alder

The outbound-only agent model is really smart. Been there with SSH-based backup setups and they're a nightmare once you have more than a few machines - key management, port forwarding through NAT, timeout issues when connections drop mid-backup...

One question about the gRPC stream - how are you handling reconnection? Like if the server restarts or there's a brief network blip, does the agent just backoff-and-retry automatically, or does it require manual intervention?

Also curious about the Docker volume discovery. Are you doing anything special for containers with multiple named volumes, or relying on users to specify which ones matter? I've seen setups where people accidentally backup massive cache volumes alongside their actual data.

Really nice that you went Apache 2.0 - makes integration easier than AGPL for teams that want to embed this in their stack. Might check it out for a few homelab servers that are still running janky borgmatic scripts.

Filippo Crotti • Edited

Thank you Kai for the comment.

When an agent starts, it automatically searches for the server connection until it's established. If the server goes down or a network interruption breaks the gRPC stream, the agent enters a "searching for server" state and keeps retrying until the connection is re-established; no manual intervention is needed unless the agent itself is stopped.

Nothing too fancy for Docker volumes yet. The agent automatically discovers whether Docker is available on each session, so if you add Docker after the agent is already deployed, it picks it up on the next connection without any manual changes. When creating a backup policy, the system shows the list of available volumes and lets you choose which ones to include, so you don't accidentally back up cache volumes you don't care about. If new volumes are added later, you can update the policy by editing it; there is no need to recreate anything, since discovery via the Docker socket handles changes automatically.