I'm building OpenClaw Cloud, a managed platform where each user gets their own personal AI assistant running 24/7 in the cloud. When I started, I had a real decision to make about the core technology.
Go and Rust were serious contenders. Both are fast, well-supported, and have massive ecosystems. I ended up choosing Elixir — not because it's "better" in some absolute sense, but because it was the right fit for this specific problem. Here's the full reasoning, with honest tradeoffs.
## The Problem: Managing Hundreds of Long-Lived Processes
OpenClaw Cloud manages one dedicated bot instance per user. Each instance is a long-running process that:
- Maintains persistent WebSocket connections to chat platforms (Discord, Telegram, WhatsApp, Slack)
- Holds conversation state and context in memory
- Handles concurrent messages from multiple channels simultaneously
- Needs to be started, stopped, restarted, and monitored independently
- Must recover gracefully from crashes without affecting other users
This isn't a typical request/response web app. It's a process orchestration problem — hundreds of stateful, concurrent, long-lived workers that need supervision and lifecycle management.
That framing is what drove the decision.
## Concurrency Models: Three Very Different Approaches

### Go: Goroutines and Channels
Go's concurrency model is elegant. Goroutines are cheap (a few KB of stack), and channels provide a clean way to communicate between them.
```go
// Spinning up a worker per user in Go
for _, user := range users {
    go func(u User) {
        bot := NewBotInstance(u)
        bot.Run() // blocks, handles reconnection internally
    }(user)
}
```
This is simple and works. Go would have been a perfectly fine choice for the raw concurrency part. The goroutine scheduler handles thousands of concurrent workers without breaking a sweat.
Where it gets complicated: Go doesn't have a built-in answer for goroutine failure. An unrecovered panic in any goroutine takes down the entire process, so you need `recover` in each worker plus your own supervisor logic: retry loops, health checks, graceful restarts. It's doable, but it's DIY, and every team ends up writing a slightly different version of process supervision.
### Rust: Async with Tokio
Rust with Tokio gives you async/await over a multi-threaded runtime. The performance is outstanding — near-zero overhead async I/O.
```rust
// Spawning tasks with Tokio
for user in users {
    tokio::spawn(async move {
        let bot = BotInstance::new(user);
        bot.run().await; // handles connections
    });
}
```
Rust's async model is powerful, and you get memory safety guarantees at compile time. But the ownership model adds real friction when you're managing shared state across many concurrent tasks: `Arc<Mutex<T>>` everywhere, lifetime annotations, and the borrow checker fighting you when you're passing context between tasks.
The honest truth: For a solo developer iterating fast on a product, Rust's compile-time overhead (both in build times and cognitive load) is significant. I love Rust for systems programming, but for a SaaS product where features change weekly, it slowed me down.
### Elixir: Processes and OTP
Elixir runs on the BEAM virtual machine, which was designed from the ground up for this exact problem — massive concurrency with isolated, lightweight processes.
```elixir
# Each bot is a GenServer — a managed, supervised process
defmodule Openclaw.InstanceWorker do
  use GenServer

  def start_link(%{user_id: user_id} = args) do
    GenServer.start_link(__MODULE__, args, name: via_tuple(user_id))
  end

  def init(args) do
    # Connect to chat platforms, set up state
    {:ok, %{user_id: args.user_id, connections: [], status: :starting}}
  end

  def handle_info(:health_check, state) do
    # Periodic self-check — reconnect if needed
    {:noreply, maybe_reconnect(state)}
  end

  # Name each worker in a registry so it can be looked up by user_id
  defp via_tuple(user_id), do: {:via, Registry, {Openclaw.InstanceRegistry, user_id}}

  # Stubbed for this snippet; the real version re-establishes dropped connections
  defp maybe_reconnect(state), do: state
end
```
BEAM processes are extremely lightweight (~2KB each), fully isolated (no shared memory), and communicate via message passing. But the key differentiator isn't just the process model — it's everything built on top of it.
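To make "lightweight and isolated" concrete, here's a small stdlib-only sketch (illustrative, not OpenClaw code) that spawns thousands of processes and round-trips a message through each one. Each process has its own heap and mailbox; the only way they interact is `send`/`receive`:

```elixir
defmodule ProcessDemo do
  def run(n) do
    parent = self()

    # Spawn n independent processes; each waits for a single ping.
    pids =
      for i <- 1..n do
        spawn(fn ->
          receive do
            {:ping, from} -> send(from, {:pong, i})
          end
        end)
      end

    # Message each one and collect every reply.
    Enum.each(pids, fn pid -> send(pid, {:ping, parent}) end)

    for _ <- 1..n do
      receive do
        {:pong, _i} -> :ok
      end
    end
    |> length()
  end
end
```

`ProcessDemo.run(10_000)` spawns ten thousand processes and completes in well under a second on ordinary hardware, which is the property the BEAM was built around.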
## Supervision Trees: The Killer Feature
This is where Elixir pulled decisively ahead for my use case.
In OTP, every process lives inside a supervision tree. Supervisors are processes that watch child processes and apply a restart strategy when things go wrong.
```elixir
defmodule Openclaw.InstanceSupervisor do
  use Horde.DynamicSupervisor

  def start_instance(user_id, config) do
    child_spec = {Openclaw.InstanceWorker, %{user_id: user_id, config: config}}
    Horde.DynamicSupervisor.start_child(__MODULE__, child_spec)
  end

  def stop_instance(user_id) do
    case Registry.lookup(Openclaw.InstanceRegistry, user_id) do
      [{pid, _}] -> Horde.DynamicSupervisor.terminate_child(__MODULE__, pid)
      [] -> {:error, :not_found}
    end
  end
end
```
If a bot instance crashes — maybe Discord's API returns an unexpected response, or a chat message triggers an unhandled edge case — the supervisor restarts it automatically. The other 200 bot instances running on the same node are completely unaffected because processes share no memory.
In Go, I'd have to build all of this manually: a registry of running goroutines, health check loops, restart logic, graceful shutdown coordination. It's probably 1,000+ lines of infrastructure code that Elixir gives me for free.
In Rust, the situation is similar. Tokio has JoinHandle for tracking spawned tasks, but building a full supervision tree with restart strategies, escalation policies, and distributed process registries is a major engineering effort.
The OTP supervision model isn't just convenient — it changes how you think about failure. Instead of defensive programming ("catch every possible error"), you write the happy path and let the supervisor handle the rest. Let it crash is a real philosophy, and it works remarkably well for managing many independent, failure-prone processes.
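To see "let it crash" in action, here's a minimal, self-contained demonstration (module names are invented for illustration). The worker raises on bad input with no defensive code anywhere; a `:one_for_one` supervisor notices the exit and starts a fresh replacement:

```elixir
defmodule Demo.Worker do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  def init(:ok), do: {:ok, %{}}

  # Deliberately crash on a bad message; no rescue, no retry loop.
  def handle_cast(:boom, _state), do: raise("unexpected input")
end

# Start the worker under a one_for_one supervisor.
{:ok, _sup} = Supervisor.start_link([Demo.Worker], strategy: :one_for_one)

pid_before = Process.whereis(Demo.Worker)
GenServer.cast(Demo.Worker, :boom) # worker crashes...
Process.sleep(100)                 # ...and the supervisor restarts it

# Same registered name, fresh process: the crash was contained and recovered.
pid_after = Process.whereis(Demo.Worker)
```

After the crash, `pid_after` is a live pid different from `pid_before`, and nothing else in the system noticed.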
## Hot Code Reloading: Zero-Downtime Deployments
BEAM supports hot code swapping — you can deploy new code to a running system without restarting processes or dropping connections.
For a platform where users have 24/7 always-on bot instances, this is huge. When I push an update to the platform code, I don't have to restart everyone's bot. The running processes can be updated in place, maintaining their state and connections.
```elixir
# In production, Fly.io rolling deploys + BEAM hot code loading
# means existing connections stay alive during deployments
```
In practice, I use Fly.io's rolling deployments which handle most of this, but the BEAM's ability to maintain state across code changes is an additional safety net that neither Go nor Rust can match at the VM level.
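For completeness, the hook the BEAM invokes during a hot upgrade is the GenServer `code_change/3` callback, which migrates a process's in-memory state to the shape the new code expects, without the process restarting. A hypothetical sketch (module name and fields invented; v2 of the code adds a `:reconnect_count` field that v1 state maps lack):

```elixir
defmodule UpgradeDemo do
  use GenServer

  def init(state), do: {:ok, state}

  # Invoked by the release handler during a hot code upgrade.
  # Old state maps gain the new field; existing fields are untouched.
  def code_change(_old_vsn, state, _extra) do
    {:ok, Map.put_new(state, :reconnect_count, 0)}
  end
end
```

In a real release this is driven by appup/relup files rather than called by hand, which is part of why rolling deploys are the pragmatic default even on the BEAM.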
Go requires a full process restart for any code change. You can do rolling restarts behind a load balancer, but every goroutine's state is lost.
Rust requires recompilation and restart. The compile step alone takes minutes for a non-trivial project.
## Real-Time UI: Phoenix LiveView
This isn't strictly a language comparison, but the web framework was part of the decision. Phoenix LiveView lets me build real-time, interactive UIs without writing JavaScript.
The OpenClaw Cloud dashboard shows each user their bot's status, logs, and controls — all updating in real-time via WebSockets. When a bot instance starts, crashes, or reconnects, the UI reflects it instantly.
```elixir
# LiveView receives real-time updates via PubSub
def handle_info({:instance_status, %{status: status}}, socket) do
  {:noreply, assign(socket, :instance_status, status)}
end
```
Building this in Go would mean a separate frontend (React, Vue, etc.) plus a WebSocket layer plus state synchronization logic. In Rust, same story — probably even more boilerplate with something like Axum + a JS frontend.
LiveView collapses the frontend and backend into one coherent model. For a solo developer, that's a 2-3x productivity multiplier.
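The publishing side is just a broadcast from the bot process to whoever subscribed to that user's topic. In the real app that bus is Phoenix.PubSub; since that's a dependency, this stdlib-only sketch reproduces the same subscribe/broadcast pattern with a duplicate-key `Registry` standing in as the bus (all names invented):

```elixir
# A tiny pub/sub bus: many processes may register under the same topic key.
{:ok, _} = Registry.start_link(keys: :duplicate, name: StatusBus)

# Subscribe the current process to one user's topic, as a LiveView would.
{:ok, _} = Registry.register(StatusBus, "instance:42", nil)

# Publish side: a bot instance broadcasts its new status to all subscribers.
Registry.dispatch(StatusBus, "instance:42", fn subscribers ->
  for {pid, _} <- subscribers do
    send(pid, {:instance_status, %{status: :running}})
  end
end)

# Subscriber side: the handle_info-equivalent receives the update.
status =
  receive do
    {:instance_status, %{status: s}} -> s
  after
    1_000 -> :timeout
  end
```

With Phoenix.PubSub the subscribe and broadcast calls change, but the shape is identical: a message lands in the LiveView process and `handle_info/2` re-renders.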
## Where Go and Rust Win (Honestly)
I'd be doing a disservice if I didn't acknowledge where Go and Rust genuinely outperform Elixir:
### Go Wins

- Raw throughput for CPU-bound work: Go compiles to native code. If I were building a platform that needed heavy computation (video processing, ML inference), Go would be faster out of the box.
- Simplicity of deployment: Single static binary, no runtime dependency. `go build && scp`. It doesn't get simpler than that.
- Ecosystem breadth: Go has libraries for everything. Cloud SDKs, Kubernetes tooling, CLI tools — the ecosystem is massive.
- Hiring: If I were building a team, finding Go developers is much easier than finding Elixir developers.
### Rust Wins

- Performance ceiling: Rust is as fast as C/C++ with memory safety. For systems-level work, nothing else comes close.
- Memory efficiency: Zero-cost abstractions and no garbage collector mean predictable, minimal memory usage. Critical for embedded systems or extremely resource-constrained environments.
- Type system: Rust's type system catches entire categories of bugs at compile time. The `Result` and `Option` types make error handling explicit and exhaustive.
- WebAssembly: Rust has the best WASM story. If I needed client-side compiled code, Rust would be my first choice.
## Elixir's Weaknesses
Let me be upfront about the tradeoffs:
- Raw CPU performance: The BEAM is not fast for computation. It's optimized for I/O-bound, concurrent workloads. If I had heavy number-crunching, I'd need to reach for NIFs (native functions) or offload to a separate service.
- Smaller ecosystem: Hex (Elixir's package manager) has ~15,000 packages vs npm's 2M+ or Go's massive standard library. Sometimes you write something from scratch that would be a `go get` away in Go.
- Smaller talent pool: Finding Elixir developers is harder. This matters less for a solo founder but would matter if I were scaling a team.
- Learning curve: OTP concepts (GenServer, Supervisor, Application) are powerful but take time to internalize. The functional programming paradigm is a shift for developers coming from OOP.
## Why Elixir Won for This Specific Problem
The decision came down to matching the technology to the problem domain:
| Requirement | Best Fit |
|---|---|
| Hundreds of concurrent, long-lived processes | BEAM (Elixir) |
| Automatic crash recovery per process | OTP Supervisors (Elixir) |
| Real-time UI without separate frontend | Phoenix LiveView (Elixir) |
| Zero-downtime deployments | BEAM hot code reloading (Elixir) |
| Distributed process registry | Horde (Elixir) |
| Solo developer productivity | LiveView + OTP = less code (Elixir) |
| Raw computation speed | Go or Rust |
| Maximum ecosystem breadth | Go |
| Memory-constrained environments | Rust |
For a managed cloud platform orchestrating hundreds of stateful, long-running AI bot instances with real-time monitoring — Elixir wasn't just a good fit, it was almost purpose-built for the job.
Erlang, and later its virtual machine the BEAM, was created at Ericsson in the 1980s to manage enormous numbers of concurrent telephone calls (the oft-cited figure for the AXD301 switch is 99.9999999% availability). Managing a few hundred AI bots is a much simpler version of the same problem.
## The Stack in Practice
For the curious, here's what the full OpenClaw Cloud stack looks like today:
- Elixir 1.17 / OTP 27 — core language and runtime
- Phoenix 1.8 with LiveView 1.1 — web framework and real-time UI
- Horde — distributed supervisor and registry for bot instances
- PostgreSQL via Ecto — data persistence
- Fly.io — hosting (both platform and user instances)
- Stripe — subscription billing
- Tailwind CSS + DaisyUI — styling
- Sentry — error monitoring
- Bandit — HTTP server
Total monthly infrastructure cost for running the platform: under $50/month on Fly.io.
## Final Thoughts
Technology choices are always contextual. If I were building a CLI tool, I'd pick Go. If I were building a game engine, I'd pick Rust. But for a managed platform that supervises hundreds of concurrent, stateful, long-lived processes with real-time monitoring — Elixir and the BEAM are in a class of their own.
The "let it crash" philosophy, supervision trees, lightweight processes, and LiveView for real-time UI made me more productive as a solo developer than I would have been in either Go or Rust. And when your competitive advantage is shipping fast with zero budget, productivity is everything.
If you're evaluating languages for a similar problem — high concurrency, stateful processes, real-time features — give Elixir a serious look. The ecosystem is smaller but the core primitives are extraordinary.
I'm João, a solo developer from Portugal building OpenClaw Cloud and other SaaS products with Elixir. Follow me here or on X @joaosetas for more build-in-public content.