Alex

Posted on Apr 5

Building a Discord Caller (Voice Relay) Bot in Go

#go #discord #opensource #programming

Intro

If you've ever led a raid in an MMORPG, you know the drill: the boss pulls, everything goes sideways, and you have maybe two seconds to bark precise commands at five different party groups — each with their own role, their own channel, their own job to do. Tanks here, healers there, DPS rotate — all at once, all clearly, all right now.

Our guild lived in Discord. Every strategy call, every pre-pull reminder, every panicked "STACK STACK STACK" — all of it happened there. But the moment a big fight started, we had to context-switch to an external app just to broadcast voice across multiple channels at once. One tool for coordination, another for execution. Every raid.

That friction is what built go-discord-caller. One bot listens to the raid leader. A pool of speaker bots instantly re-broadcast that voice into every party channel simultaneously — no external app, no tab switching, no lag between the call and the action. Just one voice reaching everyone who needs to hear it, right inside Discord, the moment it matters.

How Discord Voice Works

Let's start with a quick overview of how Discord voice works and the libraries involved, since that shapes the entire architecture of the bot.

Discord gateways and voice flow:

Discord uses two separate connections: a WebSocket gateway for events and control (presence, voice state updates, slash commands) and a separate UDP socket for actual voice data
To join a voice channel, the client sends a Voice State Update over the gateway; Discord responds with a Voice Server Update containing the voice server endpoint and a token, which the client uses to set up the voice connection
Audio is encoded with the Opus codec and sent as Opus frames inside RTP packets over UDP: low latency, compact, designed for real-time speech
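
To make the join step concrete, here is a minimal sketch of the gateway payload a client sends to join a voice channel (opcode 4, Voice State Update). In practice disgo builds and sends this for you; the struct and function names below are illustrative, not disgo's types, and the IDs are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// voiceStateUpdate mirrors the "d" field of gateway opcode 4.
// The type names are illustrative; disgo has its own equivalents.
type voiceStateUpdate struct {
	GuildID   string  `json:"guild_id"`
	ChannelID *string `json:"channel_id"` // nil means "leave the voice channel"
	SelfMute  bool    `json:"self_mute"`
	SelfDeaf  bool    `json:"self_deaf"`
}

type gatewayPayload struct {
	Op   int              `json:"op"`
	Data voiceStateUpdate `json:"d"`
}

// joinPayload builds the JSON a client would send to join channelID.
func joinPayload(guildID, channelID string) ([]byte, error) {
	return json.Marshal(gatewayPayload{
		Op:   4,
		Data: voiceStateUpdate{GuildID: guildID, ChannelID: &channelID},
	})
}

func main() {
	b, err := joinPayload("100000000000000001", "100000000000000003")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Snowflake IDs travel as JSON strings, which is why the sketch uses string rather than a numeric type.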

DAVE — Discord's Audio and Video Encryption:

Before DAVE, Discord voice was encrypted in transit (TLS/SRTP) but Discord's servers could theoretically decrypt it — no true end-to-end encryption
DAVE (Discord's Audio and Video Encryption) brings real E2EE to voice and video using MLS (Messaging Layer Security), an IETF standard built for group communication, which provides forward secrecy and post-compromise security
For a relay bot this matters directly: audio frames arrive already DAVE-encrypted, so the bot must participate in the MLS session to receive and re-transmit them — it can't just forward raw bytes

disgo — the Go Discord library:

disgo is the Discord library powering this project, and it's one I've been using for a few years now in various projects. It's a full-featured, actively maintained library with good documentation and a clean API design.

godave — CGO bridge to libdave:

Discord's reference DAVE implementation is a native C++ library (libdave)
godave is a CGO wrapper that exposes libdave to Go. It is interop with the original, not a reimplementation, so protocol behavior comes from Discord's own code rather than from a port that could drift from it
The trade-off: CGO adds a C toolchain requirement and complicates static builds and containerization — this shapes the entire build and deployment strategy covered later

Architecture Overview

The main idea is to support voice pipelines like this:

User (with caller role) speaks
        ↓
Owner bot VoiceReceiver (role-filtered Opus frames)
        ↓
Go channel ([]byte, buffered)
        ↓ fan-out
Speaker bot 1 VoiceProvider → Voice channel A
Speaker bot 2 VoiceProvider → Voice channel B
Speaker bot N VoiceProvider → Voice channel N

Getting there meant committing to a few key design principles:

One caller bot receives audio and re-transmits it to N speaker bots
Per-guild isolation: each guild has its own state, speaker pool, and session
Simple persisted state to survive restarts
An in-Discord setup UI for channel and role bindings, with no external web dashboard
Minimal slash command surface — just enough to start, stop, and configure the relay, nothing more
Role-based access control, so any user with the manager role can start a voice session and control who can speak, without being granted full admin
Easy configuration and delivery via containerization (Docker) with CGO dependencies

OK, let's look at how some of these pieces are implemented in go-discord-caller.

Inside the Relay — Audio Pipeline, Speaker Pool, and Orchestrator

The relay is built on three cooperating pieces. Here's how they fit together.

Audio pipeline (internal/opus/) — two small types implement disgo's voice interfaces. VoiceReceiver sits on the owner bot: it filters frames by role via an allowUser closure, drops frames non-blocking if the downstream channel is full, and shuts down via a done channel. VoiceProvider sits on each speaker: it blocks on channel receive, which naturally drives backpressure. Go channels are the audio bus — Opus frames flow from receiver through a fanout goroutine into N provider channels with no shared memory on the hot path.

func (v *VoiceReceiver) ReceiveOpusFrame(userID snowflake.ID, packet *voice.Packet) error {
    if packet == nil {
        return nil
    }

    // Non-blocking check: if already closed, discard silently.
    select {
    case <-v.done:
        return nil
    default:
    }

    // Ignore frames from our own bot to avoid re-echoing what we send.
    if v.botID != 0 && userID == v.botID {
        return nil
    }

    // Apply optional role/user filter.
    if v.allowUser != nil && !v.allowUser(userID) {
        return nil
    }

    // Copy the opus bytes before sending because the backing array may be reused
    // by the voice library.
    data := make([]byte, len(packet.Opus))
    copy(data, packet.Opus)

    // Try to forward the frame. Selecting on done prevents a send to a
    // channel that the relay goroutine has already stopped draining.
    select {
    case v.ch <- data:
    case <-v.done:
        // receiver was closed between the check above and here; discard safely
    default:
        slog.Info("dropping opus frame: channel full")
    }
    return nil
}

func (v *VoiceProvider) ProvideOpusFrame() ([]byte, error) {
    select {
    case <-v.done:
        return nil, fmt.Errorf("voice provider is closed")
    case data, ok := <-v.ch:
        if !ok {
            return nil, fmt.Errorf("voice provider channel closed")
        }
        return data, nil
    }
}
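
The fanout goroutine that sits between these two types isn't shown above. A minimal sketch of the idea, assuming one input channel and one buffered channel per speaker (the function name and the drop policy are my simplification, not necessarily what the repository does):

```go
package main

import "fmt"

// fanOut copies every frame from in to each output channel. A slow or stuck
// output only loses its own frames; it never blocks the others. When in is
// closed, all outputs are closed so downstream providers can stop.
func fanOut(in <-chan []byte, outs []chan []byte) {
	defer func() {
		for _, out := range outs {
			close(out)
		}
	}()
	for frame := range in {
		for _, out := range outs {
			select {
			case out <- frame: // frames are treated as read-only from here on
			default: // this speaker's buffer is full: drop the frame for it only
			}
		}
	}
}

func main() {
	in := make(chan []byte, 4)
	outs := []chan []byte{make(chan []byte, 4), make(chan []byte, 4)}

	in <- []byte{0x01}
	in <- []byte{0x02}
	close(in)
	fanOut(in, outs) // synchronous here; a bot would run `go fanOut(...)`

	for i, out := range outs {
		n := 0
		for range out {
			n++
		}
		fmt.Printf("speaker %d received %d frames\n", i, n)
	}
}
```

Sharing the same byte slice across outputs is only safe because nothing mutates a frame after it has been copied out of the voice library's buffer.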

Speaker pool (internal/pool/service.go) — all speaker gateways connect concurrently on startup, so the bot is ready to relay as soon as all bots are online. When a voice raid starts, each enabled speaker joins its bound voice channel in parallel and the orchestrator waits for all of them to be connected before audio starts flowing.

Orchestrator (internal/manager/service.go) — the central coordinator that owns all per-guild state and drives the session lifecycle. StartVoiceRaid sequences the whole thing: resolve the guild's isolated state (each guild has its own speaker pool, bindings, and session) → snapshot enabled speakers → owner joins its channel → attach VoiceReceiver with role filter → all speakers join concurrently → fanout goroutine starts routing frames → session committed with a guard against concurrent starts.

// Excerpt from StartVoiceRaid: join every candidate speaker concurrently and
// collect the ones that are connected and ready to receive audio.

resultCh := make(chan joinResult, len(candidates))

var wg sync.WaitGroup
wg.Add(len(candidates))
for _, c := range candidates {
    go func(sp *domain.Speaker, channelID snowflake.ID) {
        defer wg.Done()
        speakerID := sp.ID
        if err := m.speaker.JoinChannel(ctx, speakerID, guildID, channelID); err != nil {
            slog.Warn("speaker failed to join channel on raid start",
                slog.String("speakerID", speakerID.String()),
                slog.Any("err", err),
            )
            return
        }
        chOut := make(chan []byte, 10)
        if err := m.speaker.Consume(ctx, speakerID, guildID, chOut); err != nil {
            slog.Error("failed to consume voice data", slog.String("speakerID", speakerID.String()), slog.Any("err", err))
            m.speaker.LeaveChannel(ctx, guildID, speakerID)
            return
        }
        resultCh <- joinResult{sp, chOut}
    }(c.sp, c.channelID)
}
wg.Wait()
close(resultCh)

The session object tracks everything needed to tear down cleanly: the owner's voice connection, each speaker's connection, and the audio channels linking them. When StopVoiceRaid is called, the session closes the fan-out channel. That signals every VoiceProvider to stop, and each speaker then leaves its voice channel in turn. If a speaker disconnects mid-raid, the remaining speakers keep running: the failure is isolated to that one bot and affects neither the other speakers nor the owner.

Persistence — YAML Store

Configuration shouldn't disappear every time the bot restarts. The store.Store interface (internal/store/) handles that with a deliberately simple design.

The YAML-backed implementation persists channel bindings per guild and user, plus role bindings for the capture and manager roles. It loads from a single file at startup and writes on every change — no database, no migrations. An in-memory implementation exists for testing, but in production one file is all you need.

The intentional simplicity here is the point: bindings rarely change, the data is small, and a YAML file is easy to inspect, back up, or edit directly if something goes wrong.

guilds:
    - guild_id: 100000000000000001
      channels:
        - user_id: 100000000000000002
          channel_id: 100000000000000003
        - user_id: 100000000000000004
          channel_id: 100000000000000005
      roles:
        - role_type: caller
          role_id: 100000000000000006
        - role_type: manager
          role_id: 100000000000000007
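
For a feel of what sits behind that file, here is a rough sketch of a store interface with an in-memory implementation, the kind you would use in tests. The method names and the uint64 IDs are my assumptions for illustration; the real store.Store interface in the repository may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical, trimmed-down bindings store.
// The real store.Store interface may have other methods and types.
type Store interface {
	BindChannel(guildID, userID, channelID uint64)
	Channel(guildID, userID uint64) (channelID uint64, ok bool)
}

// memoryStore keeps bindings in a nested map: guild -> user -> channel.
type memoryStore struct {
	mu       sync.RWMutex
	channels map[uint64]map[uint64]uint64
}

func newMemoryStore() *memoryStore {
	return &memoryStore{channels: make(map[uint64]map[uint64]uint64)}
}

func (s *memoryStore) BindChannel(guildID, userID, channelID uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.channels[guildID] == nil {
		s.channels[guildID] = make(map[uint64]uint64)
	}
	s.channels[guildID][userID] = channelID
}

func (s *memoryStore) Channel(guildID, userID uint64) (uint64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	id, ok := s.channels[guildID][userID] // reading a nil inner map is safe
	return id, ok
}

func main() {
	var st Store = newMemoryStore()
	st.BindChannel(100000000000000001, 100000000000000002, 100000000000000003)
	id, ok := st.Channel(100000000000000001, 100000000000000002)
	fmt.Println(id, ok)
}
```

A YAML-backed version would satisfy the same interface and rewrite the file inside BindChannel, which is what keeps the rest of the bot unaware of where bindings live.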

Discord Slash Commands and RBAC

The bot is controlled entirely through slash commands — no external dashboard, no config file edits after initial setup. Four commands cover everything:

/status — public, shows current bindings and session state for the guild
/start — requires manager role, calls StartVoiceRaid to bring all speakers online and begin relaying
/stop — requires manager role, calls StopVoiceRaid to tear down the session and disconnect all speakers
/setup — requires Discord admin, opens the interactive setup panel
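
The manager-role gate on /start and /stop boils down to a membership check. A minimal sketch, assuming role IDs are plain uint64 values (disgo uses snowflake.ID) and that an unbound manager role denies everyone; both function names are hypothetical:

```go
package main

import "fmt"

// hasRole reports whether the member holds the required role.
func hasRole(memberRoles []uint64, required uint64) bool {
	for _, r := range memberRoles {
		if r == required {
			return true
		}
	}
	return false
}

// canControlRaid sketches the /start and /stop gate: a manager role must be
// bound for the guild, and the invoking member must hold it.
func canControlRaid(memberRoles []uint64, managerRole uint64) bool {
	if managerRole == 0 { // assumption: zero means "no manager role bound yet"
		return false
	}
	return hasRole(memberRoles, managerRole)
}

func main() {
	const managerRole = 100000000000000007
	fmt.Println(canControlRaid([]uint64{100000000000000007}, managerRole))
	fmt.Println(canControlRaid([]uint64{100000000000000006}, managerRole))
}
```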

Setup UI — setup happens entirely inside Discord via ephemeral messages that act as a multi-page UI. The main menu branches into a Bind Roles page (role selectors for capture and manager roles) or a paginated Bind Speakers page where each speaker can be toggled and assigned a channel. Adding a new speaker opens a sub-page with an OAuth invite link. All navigation uses Discord component interactions — buttons and selects — so there's nothing to host or maintain externally.

Docker Build with CGO and Distroless

CGO makes containerization harder. You can't just COPY a statically linked binary into an empty image: the build needs a C toolchain and an installed libdave, and the resulting binary depends on shared libraries at runtime.

The Dockerfile solves this in three steps, spread across two build stages:

Build step (builder stage): CGO_ENABLED=1, installs libdave, compiles the binary
Deps step (still in the builder stage): runs ldd against the compiled binary and copies out every shared library it needs
Runtime stage: starts from a distroless base image, copies in the binary and only the libraries ldd identified

The distroless base gives a minimal attack surface: no shell, no package manager, nothing that isn't needed to run the process. The ldd extraction is the key trick — it avoids having to know upfront which libraries libdave pulls in, and it keeps the runtime image lean regardless of what changes in the dependency tree.

ARG GO_VERSION=latest
ARG LIBDAVE_VERSION=v1.1.1

FROM golang:${GO_VERSION} AS builder

ARG LIBDAVE_VERSION

WORKDIR /src
COPY . .

WORKDIR /src/cmd/bot

RUN apt-get update \
 && apt-get install -y --no-install-recommends clang git ca-certificates bash pkg-config build-essential libusb-1.0-0-dev unzip cmake nasm zip \
 && git clone https://github.com/disgoorg/godave /tmp/godave \
 && chmod +x /tmp/godave/scripts/libdave_install.sh \
 && /bin/bash /tmp/godave/scripts/libdave_install.sh $LIBDAVE_VERSION

ENV PKG_CONFIG_PATH="/root/.local/lib/pkgconfig"

RUN CGO_ENABLED=1 go build \
    -o /bin/runner

# Collect all shared library dependencies of the binary
RUN mkdir -p /runtime-libs && \
    ldd /bin/runner \
        | grep "=> /" \
        | awk '{print $3}' \
        | xargs -I{} cp --dereference {} /runtime-libs/

FROM gcr.io/distroless/base AS runtime

COPY --from=builder /bin/runner /
COPY --from=builder /runtime-libs/ /usr/local/lib/

ENV LD_LIBRARY_PATH=/usr/local/lib

CMD ["/runner"]

Running It

Getting the bot running is straightforward. The only upfront work is creating the Discord bots and dropping all their tokens into a .env file — one owner token and as many speaker tokens as needed:

DISCORD_OWNER_BOT_TOKEN=...
DISCORD_SPEAKER_BOT_TOKEN_1=...
DISCORD_SPEAKER_BOT_TOKEN_2=...
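
On the Go side, numbered variables like these can be picked up with a simple loop. This is a sketch of one way to do it, assuming the numbering starts at 1 and has no gaps; the function name is hypothetical and the bot's actual config loading may differ.

```go
package main

import (
	"fmt"
	"os"
)

// speakerTokens reads DISCORD_SPEAKER_BOT_TOKEN_1, _2, ... and stops at the
// first missing or empty variable. lookup is injected so the function can be
// tested without touching the real environment; pass os.LookupEnv in real use.
func speakerTokens(lookup func(string) (string, bool)) []string {
	var tokens []string
	for i := 1; ; i++ {
		v, ok := lookup(fmt.Sprintf("DISCORD_SPEAKER_BOT_TOKEN_%d", i))
		if !ok || v == "" {
			return tokens
		}
		tokens = append(tokens, v)
	}
}

func main() {
	tokens := speakerTokens(os.LookupEnv)
	fmt.Printf("found %d speaker token(s)\n", len(tokens))
}
```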

Then pull and run:

docker run -d \
  --env-file .env \
  -e STORE_PATH=/data/store.yaml \
  -v "$(pwd)/data":/data \
  sealbro/go-discord-caller

Everything else is configured from inside Discord via /setup — no need to touch the config again. On startup the bot prints invite URLs for all bots to the log, so there's no hunting through the Discord developer portal; just copy the link, invite the bot to the server, and it's ready to bind. From the /setup menu: assign the capture role, the manager role, bind the owner to a voice channel, add each speaker bot and assign it a channel. Then /start — and from that point on, bindings survive restarts automatically.

Full step-by-step setup instructions — including how to create the Discord application, configure bot permissions, and invite speaker bots — are available in the repository README.

Conclusion

go-discord-caller is a self-hosted solution — no subscription, no third-party service, no paying for a feature that Discord should arguably have built in. The whole thing lives on whatever cheap VPS or home server you already have.

What's next: two features are on the roadmap. Inter-server communication — relaying audio across guilds, not just across channels within one. And a caller/speaker audio mixer — letting audio flow in both directions, so speakers can respond to the raid leader without switching channels. If either of those sounds useful to you, star ⭐ the go-discord-caller repository and leave a thumbs-up 👍 on the relevant issue — that's the clearest signal for what gets built next.

Finally, if you want to try this for your guild raids without running your own instance, you can request access to an already-deployed bot — free, but spots are limited. Drop a message in the repository discussions. Keep in mind it runs the latest version, so it may occasionally be unstable — but you'll always get new features first.

DEV Community