There’s no shortage of backup tools.
But none of them gave me a simple way to manage backups across multiple machines from one place, so I built my own.
This is the architecture behind it.
The problem
I was managing backups for a small setup: two servers, a handful of Docker containers, and a few directories that needed regular backups.
What I wanted was simple:
- one place to see all backups
- clear visibility on what ran, when, and whether it succeeded
- metrics like transferred data and snapshot size
- SSO support via OIDC, since my stack already runs behind Zitadel
There was also another constraint in the background: compliance.
I had started evaluating what would be needed to align with ISO 27001, and backup visibility, traceability, and centralized control quickly became non-negotiable.
Backrest was the closest match I found. It’s well built, and Restic is rock solid underneath.
The problem is that Backrest is still single-machine. You deploy one instance per server, and each one has its own UI. That is not centralized management. It is just multiple dashboards.
Other solutions exist, but I could not find an open source tool that combined:
- centralized management
- a modern UI
- proper OIDC support
- a structure that could fit into a compliance-oriented setup
So I built Arkeep.
Why not just use scripts?
Could I have glued everything together with cron, Restic, and a few shell scripts? Yes.
But that still leaves scheduling, visibility, multi-machine coordination, authentication, and auditability as separate problems.
I wanted one system, not a pile of scripts.
Why not SSH into machines?
Another option would be to have a central server SSH into each machine and trigger backups.
I avoided that on purpose.
Compared to that, having agents maintain an outbound connection is:
- easier to deploy
- easier to secure
- more reliable across real-world setups
Also, my infrastructure runs behind Netbird and everything is behind SSO. That was a hard requirement from day one.
And to be honest, both SSO and VPN constraints were a bit of a pain in the ass to design around, so I leaned into a model that works with them instead of fighting them.
The architecture
The system is built around a server/agent model.
The server is the control plane. It runs:
- REST API
- web UI
- scheduler
- gRPC server
- database
It stores configuration, schedules jobs, and receives results.
Agents run on each machine you want to back up.
They do not expose any ports. Each agent connects outbound to the server using a persistent gRPC stream and waits for jobs.
When a backup needs to run, the server pushes the job through that stream.
```
┌─────────────────────────────────┐
│          Arkeep Server          │
│  REST API   │  gRPC Server      │
│  Scheduler  │  WebSocket Hub    │
│  SQLite/PG  │  Notification Svc │
└─────────────────────────────────┘
       ▲                  ▲
       │ gRPC             │ gRPC
       │  (persistent stream)
┌──────┴─────┐      ┌─────┴──────┐
│  Agent A   │      │  Agent B   │
│  Server 1  │      │  Server 2  │
└────────────┘      └────────────┘
```
This model solves a few practical problems:
- no inbound ports required
- works cleanly behind NAT and VPN overlays like Netbird
- instant disconnect detection when the stream drops
- centralized visibility, which is important for audit and compliance
At a high level, dispatching a job looks like this:
```go
func (s *Server) DispatchJob(agentID string, job *pb.JobRequest) error {
	stream, ok := s.agentStreams.Get(agentID)
	if !ok {
		return fmt.Errorf("agent %s offline", agentID)
	}
	return stream.Send(job)
}
```
No polling. No SSH. Just a persistent stream.
What the agent actually does
When a job arrives, the agent:
- Resolves sources (directories or Docker volumes)
- Runs pre-backup hooks
- Executes the backup using Restic
- Runs post-backup hooks
- Streams logs and metrics back to the server in real time
- Reports the final result
This is roughly how the agent wraps Restic:
```go
// Run restic in JSON mode so progress events can be parsed line by line.
cmd := exec.Command("restic", "backup", "--json", sourcePath)
stdout, err := cmd.StdoutPipe()
if err != nil {
	return err
}
if err := cmd.Start(); err != nil {
	return err
}

// Forward each JSON event to the server as it arrives.
scanner := bufio.NewScanner(stdout)
for scanner.Scan() {
	var event ResticEvent
	if err := json.Unmarshal(scanner.Bytes(), &event); err == nil {
		jobStream.Send(event)
	}
}
return cmd.Wait()
```
Restic stays the backup engine. The agent adds structure, observability, and control.
Docker volume discovery
For Docker workloads, the agent can automatically resolve volumes at runtime.
```go
import (
	"context"

	"github.com/docker/docker/api/types/mount"
	"github.com/docker/docker/client"
)

func resolveVolumeMounts(cli *client.Client, containerID string) ([]string, error) {
	ctx := context.Background()
	container, err := cli.ContainerInspect(ctx, containerID)
	if err != nil {
		return nil, err
	}
	var paths []string
	for _, m := range container.Mounts {
		// Only named volumes; bind mounts are configured explicitly.
		if m.Type == mount.TypeVolume {
			paths = append(paths, m.Source)
		}
	}
	return paths, nil
}
```
From Restic’s perspective, it is just backing up directories.
The server side
The server is written in Go.
- HTTP layer: Chi
- Agent communication: gRPC
- Scheduler: gocron
- Migrations: golang-migrate
- Database: SQLite by default, PostgreSQL for larger setups
When a job triggers, the scheduler:
- resolves the assigned agent
- checks if the agent is connected
- pushes the job over the active gRPC stream
If the agent is offline, the job fails immediately.
Real-time updates are handled via WebSockets.
As agents stream logs back, the server broadcasts them to connected clients watching that job.
Authentication
Authentication is handled with JWT (RS256).
Passwords are hashed with Argon2id.
OIDC is implemented using Authorization Code + PKCE via coreos/go-oidc.
It plugs directly into providers like Zitadel, Keycloak, or Authentik.
SSO support was not optional. It shaped a good part of the system design early on, especially in how agents and users authenticate and interact with the server.
The frontend
The frontend is a Vue 3 PWA served directly from the server binary as embedded static assets.
No separate web server needed.
It uses shadcn-vue and Tailwind CSS v4, with WebSocket subscriptions for real-time updates.
The UI is intentionally minimal for now:
- agents
- destinations
- policies
- jobs
- snapshots
A proper dashboard is on the roadmap.
Current state and what’s next
Arkeep is currently at v0.1.0-beta.1.
I have been running it as a secondary backup system in a real environment for about a month.
So far it has been stable enough to validate the architecture, but it is still early and bugs are expected.
Planned improvements:
- better dashboards and reporting
- support for VMs
- more destination types
Closing thoughts
Arkeep is open source under Apache 2.0.
Repo: https://github.com/arkeep-io/arkeep
If you are dealing with backups across multiple machines, how are you handling it today?
Are you using scripts, existing tools, or something custom?