Artemii Amelin

Originally published at pilotprotocol.network

I Stopped Restarting HTTP Connections Between AI Models. Here Is What I Use Instead.

A 5-stage AI pipeline where each model takes 200ms of compute time should take about 1 second. In practice it often takes 1.75 seconds or more. The extra 750ms is not your models. It is your transport.

This post is about what happens when you replace per-request HTTP connections between model services with persistent tunnels, why the difference matters more than most people think, and how to implement it with a Go orchestrator that looks like normal HTTP but routes over encrypted P2P connections.

The problem with per-request HTTP between models

The standard setup for a multi-model pipeline is simple: each model exposes a REST endpoint, the orchestrator calls them in sequence, HTTP keep-alive reuses connections where possible. This works fine in development and falls apart in production in a few specific ways.

Every HTTPS request that can't reuse a keep-alive connection pays the full setup cost: DNS lookup (~5ms, cached after the first), TCP handshake (~10ms, 1 RTT), TLS negotiation (~30ms, 2 RTTs). That is roughly 45ms of overhead before a single token is processed, and that is the best case on a fast network; the benchmarks later in this post measured closer to 150ms per request in practice. Across a 5-stage pipeline, even the best case wastes 225ms of pure networking per request.
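
You can measure these phases directly with Go's net/http/httptrace before deciding how much this matters for your pipeline. A minimal sketch against a placeholder endpoint:

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "net/http/httptrace"
    "time"
)

func main() {
    var dnsStart, connStart, tlsStart time.Time

    // Hooks fire as the transport walks through connection setup.
    trace := &httptrace.ClientTrace{
        DNSStart:          func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
        DNSDone:           func(httptrace.DNSDoneInfo) { fmt.Println("DNS:", time.Since(dnsStart)) },
        ConnectStart:      func(_, _ string) { connStart = time.Now() },
        ConnectDone:       func(_, _ string, _ error) { fmt.Println("TCP:", time.Since(connStart)) },
        TLSHandshakeStart: func() { tlsStart = time.Now() },
        TLSHandshakeDone: func(_ tls.ConnectionState, _ error) {
            fmt.Println("TLS:", time.Since(tlsStart))
        },
    }

    // Placeholder URL - point this at one of your model endpoints.
    req, _ := http.NewRequest("GET", "https://example.com/", nil)
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

    resp, err := http.DefaultTransport.RoundTrip(req)
    if err != nil {
        panic(err)
    }
    resp.Body.Close()
}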

HTTP keep-alive helps but is fragile. Keep-alive connections expire after idle timeouts (typically 60 seconds on most servers). Load balancers reshuffle them. If any of your model services sit behind NAT or change IP addresses, you cannot use keep-alive reliably at all because each service needs a stable public endpoint.
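
Go exposes the client-side keep-alive knobs on http.Transport, and tuning them is worth doing even though it only softens the problem. A sketch with illustrative values:

// Client-side keep-alive tuning (values illustrative). This helps but
// cannot fix the underlying fragility: the server's idle timeout and
// any load balancer in between still win.
transport := &http.Transport{
    MaxIdleConns:        100,              // total idle connections kept in the pool
    MaxIdleConnsPerHost: 10,               // idle connections kept per model service
    IdleConnTimeout:     30 * time.Second, // below the server's typical 60s, so the client closes first
}
client := &http.Client{Transport: transport, Timeout: 60 * time.Second}

Keeping the client's idle timeout below the server's avoids the race where the client reuses a connection the server has already closed.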

There is also the VRAM problem that forces pipelines to be distributed in the first place. A 7B model requires roughly 14 GB of VRAM in FP16. Two such models exhaust a consumer GPU; a third crashes the process with an out-of-memory error. Spreading models across machines is the reliable answer, but it introduces a network dependency you now have to manage carefully.
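
The arithmetic is worth sanity-checking before you buy hardware. A back-of-envelope helper that ignores KV cache and activation memory, which add more on top:

// Rough VRAM estimate: parameter count x bytes per parameter.
// Ignores KV cache, activations, and framework overhead.
func estimateVRAMGB(params, bytesPerParam float64) float64 {
    return params * bytesPerParam / 1e9
}

// estimateVRAMGB(7e9, 2) -> 14 GB for a 7B model in FP16 (2 bytes/param)
// estimateVRAMGB(7e9, 1) -> 7 GB for the same model quantized to INT8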

What a persistent tunnel changes

Pilot Protocol gives each agent a permanent 48-bit virtual address and establishes encrypted UDP tunnels between them. The tunnels stay open with keepalive probes every 30 seconds and a 120-second idle timeout. The tunnel survives network changes, NAT rebinding, and transient packet loss without reconnecting at the application layer.
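
The mechanism is simple to picture. A simplified sketch of the self-maintenance loop, with a hypothetical Tunnel interface standing in for the real type; this is not Pilot's actual code, just the shape of it:

// Hypothetical interface standing in for the daemon's tunnel type.
type Tunnel interface {
    LastSeen() time.Time // time of the last inbound packet
    SendProbe()          // lightweight keepalive probe
    Reconnect()          // re-establish without surfacing an error to the app
}

// Probe every 30 seconds; treat 120 seconds of silence as a dead
// tunnel and redial transparently.
func maintain(t Tunnel) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for range ticker.C {
        if time.Since(t.LastSeen()) > 120*time.Second {
            t.Reconnect()
            continue
        }
        t.SendProbe()
    }
}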

The result is that you pay the connection setup cost exactly once per agent pair, not once per request. From the benchmarks in the Pilot documentation, a 3-stage model chain processing 1,000 sequential inference requests sees:

Transport                  Per-request network overhead   Total for 1,000 requests
Per-request HTTPS          ~150 ms/req (20%)              ~750 s
HTTP keep-alive            ~20 ms/req (3%)                ~625 s
Pilot persistent tunnel    ~5 ms/req (<1%)                ~605 s

The tunnel reduces per-request overhead to under 5ms. Over 1,000 requests that is roughly 145 seconds saved compared to per-request HTTPS. For latency-sensitive pipelines this also eliminates tail latency spikes from sporadic TLS handshakes and DNS timeouts.

The more significant advantage is resilience. Keep-alive connections die after idle timeouts; Pilot tunnels actively maintain themselves. If a probe fails, the tunnel reconnects automatically. Long-running inference pipelines that handle traffic for hours at a stretch stop seeing transient connection failures from expired keep-alive connections.

The architecture

Each machine in the pipeline runs a Pilot daemon and a model server. The model server listens on port 80 over the Pilot overlay. The orchestrator connects to each model agent once at startup, then reuses those tunnels for every inference call.

Machine A (A100 80GB):  LLM agent       address 1:0001.0001.0001
Machine B (T4 16GB):    Whisper agent   address 1:0001.0002.0001
Machine C (A10G 24GB):  Image agent     address 1:0001.0003.0001
Machine D (CPU):        Orchestrator    address 1:0001.0004.0001

Each model agent registers capability tags at startup:

# On Machine A (LLM)
pilotctl set-tags model-service llm reasoning

# On Machine B (Whisper)
pilotctl set-tags model-service whisper audio

# On Machine C (Image gen)
pilotctl set-tags model-service diffusion image

The orchestrator discovers available models by tag at startup, not by hardcoded address:

pilotctl find-by-tag model-service --json

This means you can add, remove, or replace model agents without changing orchestrator configuration.
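
At startup the orchestrator can shell out to pilotctl and decode the result, using os/exec and encoding/json. A sketch; the JSON field names (address, tags) are assumptions about the output schema, so adjust them to the actual format:

// Discover model agents by tag instead of hardcoding addresses.
// The Agent fields are assumptions about pilotctl's JSON output.
type Agent struct {
    Address string   `json:"address"`
    Tags    []string `json:"tags"`
}

func discoverAgents(tag string) ([]Agent, error) {
    out, err := exec.Command("pilotctl", "find-by-tag", tag, "--json").Output()
    if err != nil {
        return nil, err
    }
    var agents []Agent
    if err := json.Unmarshal(out, &agents); err != nil {
        return nil, err
    }
    return agents, nil
}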

The Go orchestrator

Here is the full orchestrator. The key detail is d.HTTPTransport(), which returns an http.RoundTripper that routes requests through Pilot tunnels. The http.Client uses it transparently, so the code reads like normal HTTP: no DNS lookups, no TCP handshakes, no TLS negotiations per request.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "time"

    "github.com/TeoSlayer/pilotprotocol/pkg/driver"
)

var (
    llmAddr     = "1:0001.0001.0001"
    whisperAddr = "1:0001.0002.0001"
    imageAddr   = "1:0001.0003.0001"
)

type ChainResponse struct {
    Transcript string `json:"transcript"`
    Analysis   string `json:"analysis"`
    ImageURL   string `json:"image_url"`
    TotalMs    int64  `json:"total_ms"`
}

func main() {
    d, err := driver.Connect()
    if err != nil {
        panic(err)
    }

    // Listen on port 80 over the Pilot overlay
    ln, err := d.Listen(80)
    if err != nil {
        panic(err)
    }

    // HTTP client that routes through persistent Pilot tunnels
    client := &http.Client{
        Transport: d.HTTPTransport(),
        Timeout:   60 * time.Second,
    }

    mux := http.NewServeMux()
    mux.HandleFunc("/chain", func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()

        var req struct {
            AudioURL string `json:"audio_url"`
        }
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, "invalid request body", http.StatusBadRequest)
            return
        }

        // Stage 1: transcribe audio
        transcript, err := callModel(client, whisperAddr, "/v1/transcribe",
            map[string]string{"audio_url": req.AudioURL})
        if err != nil {
            http.Error(w, err.Error(), 500)
            return
        }

        // Stage 2: analyze transcript
        analysis, err := callModel(client, llmAddr, "/v1/completions",
            map[string]string{"prompt": "Summarize key points: " + transcript})
        if err != nil {
            http.Error(w, err.Error(), 500)
            return
        }

        // Stage 3: generate visualization
        imageURL, err := callModel(client, imageAddr, "/v1/generate",
            map[string]string{"prompt": "Infographic for: " + analysis})
        if err != nil {
            http.Error(w, err.Error(), 500)
            return
        }

        json.NewEncoder(w).Encode(ChainResponse{
            Transcript: transcript,
            Analysis:   analysis,
            ImageURL:   imageURL,
            TotalMs:    time.Since(start).Milliseconds(),
        })
    })

    fmt.Println("Orchestrator listening on port 80")
    if err := http.Serve(ln, mux); err != nil {
        panic(err)
    }
}

func callModel(client *http.Client, addr, path string, payload any) (string, error) {
    body, err := json.Marshal(payload)
    if err != nil {
        return "", err
    }
    // Routes through the existing persistent tunnel - no connection overhead
    resp, err := client.Post(
        fmt.Sprintf("http://%s%s", addr, path),
        "application/json",
        bytes.NewReader(body),
    )
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("%s%s returned status %d", addr, path, resp.StatusCode)
    }
    result, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    var parsed struct {
        Result string `json:"result"`
    }
    if err := json.Unmarshal(result, &parsed); err != nil {
        return "", err
    }
    return parsed.Result, nil
}

The tunnel between orchestrator and each model agent is established when the http.Client first makes a request to that address. After that, every subsequent call to the same address reuses the existing tunnel. No reconnect, no negotiation. Just the packet hitting the other end.
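
You can see this from the orchestrator itself by timing repeated calls to the same address. A quick check reusing the client from the code above; the /v1/health path is a hypothetical endpoint:

// First request to an address triggers tunnel establishment;
// every request after that reuses the open tunnel.
for i := 0; i < 3; i++ {
    start := time.Now()
    resp, err := client.Get("http://" + llmAddr + "/v1/health") // hypothetical endpoint
    if err != nil {
        panic(err)
    }
    resp.Body.Close()
    fmt.Printf("request %d: %v\n", i, time.Since(start))
}
// Expect request 0 to pay the one-time setup and 1 and 2 not to.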

What is happening under the hood

The transport layer in Pilot Protocol is a userspace reliable stream built over UDP. It implements a sliding window for flow control, AIMD congestion control (the same algorithm family TCP uses), and Nagle's algorithm to coalesce small writes. The reason this is built on UDP rather than TCP is NAT traversal: STUN hole-punching works at the UDP layer, which means two agents behind different NATs can establish a direct connection without either needing a public IP.
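
AIMD itself is a few lines of logic: grow the congestion window by one packet per acknowledged round trip, halve it on loss. A miniature sketch of the algorithm family, not Pilot's actual tuning:

// Additive-increase / multiplicative-decrease, the same family
// TCP Reno uses. Constants are illustrative.
type aimdWindow struct {
    cwnd float64 // congestion window, in packets
}

func (w *aimdWindow) onAck() {
    w.cwnd++ // additive increase: one packet per acknowledged RTT
}

func (w *aimdWindow) onLoss() {
    w.cwnd = w.cwnd / 2 // multiplicative decrease on packet loss
    if w.cwnd < 1 {
        w.cwnd = 1
    }
}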

Encryption uses X25519 key exchange and AES-256-GCM per tunnel, implemented entirely with Go's standard library: crypto/ecdh, crypto/aes, crypto/cipher, crypto/ed25519. No CGO, no OpenSSL bindings, no external dependencies in the binary.
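
The whole exchange can be sketched with those standard-library primitives. A minimal version; Pilot's real key derivation and nonce scheme may differ, and hashing the raw shared secret is illustrative only:

// Sketch: X25519 agreement + AES-256-GCM, standard library only.
// Hashing the shared secret stands in for a proper KDF.
func sealDemo(peerPub *ecdh.PublicKey, plaintext []byte) ([]byte, error) {
    priv, err := ecdh.X25519().GenerateKey(rand.Reader)
    if err != nil {
        return nil, err
    }
    shared, err := priv.ECDH(peerPub) // 32-byte X25519 shared secret
    if err != nil {
        return nil, err
    }
    key := sha256.Sum256(shared) // derive an AES-256 key (illustrative KDF)

    block, err := aes.NewCipher(key[:])
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := rand.Read(nonce); err != nil {
        return nil, err
    }
    return gcm.Seal(nonce, nonce, plaintext, nil), nil // nonce prepended
}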

Since the Conn type implements Read, Write, SetDeadline, and Close, it satisfies Go's standard net.Conn interface. This is what makes the HTTP server work over it without modification. For more detail on the full transport implementation, see Building a Userspace TCP-over-UDP Stack in Pure Go.
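
That compatibility is checkable at compile time. The type names below stand in for the driver's exported types, whatever they are actually called:

// If these assertions compile, the standard HTTP stack works over the
// overlay unchanged. Conn and Listener are stand-ins for the driver's
// actual exported types.
var _ net.Conn = (*Conn)(nil)
var _ net.Listener = (*Listener)(nil)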

When this is worth the complexity

Distributed model chaining with persistent tunnels is the right choice when:

Models exceed single-machine VRAM. If your pipeline needs three models that each require 20+ GB of VRAM, you physically cannot fit them on one GPU. Distribute them.

Models need different hardware. LLMs benefit from A100s. Whisper runs well on T4s. Image generation needs different batch sizes. Matching model to hardware saves cost. Pilot's tag-based discovery makes this transparent to the orchestrator.

Models have different scaling profiles. Your LLM handles 10 requests/second. Your embedding model handles 1,000. Running them on the same machine wastes capacity. Scale them independently.

Pipeline processes sustained traffic. The tunnel amortizes its setup cost across every request. For pipelines processing hundreds to thousands of requests per hour, the savings compound. For a pipeline that runs twice a day, per-request HTTP is fine.

Single-machine serving is better for simple pipelines, sub-200ms hard latency requirements (no network can beat local function calls), or cases where all models fit comfortably in VRAM together.

Getting started

Install Pilot on each model machine, start the daemon, tag each agent with its capabilities, and use driver.Listen(80) in your model server. The orchestrator discovers agents by tag and routes through persistent tunnels from that point.

# Each machine
curl -fsSL https://pilotprotocol.network/install.sh | sh
pilotctl daemon start --email model-service@example.com
pilotctl join --network 1
pilotctl set-tags model-service llm   # or whisper, diffusion, etc.
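
The model-server side in Go mirrors the orchestrator: connect to the daemon, listen on the overlay, hand the listener to the standard HTTP server. A sketch; handleTranscribe is a hypothetical inference handler:

// Minimal model server over the overlay, using the same driver API
// as the orchestrator above.
d, err := driver.Connect()
if err != nil {
    panic(err)
}
ln, err := d.Listen(80) // port 80 on this agent's Pilot address
if err != nil {
    panic(err)
}
mux := http.NewServeMux()
mux.HandleFunc("/v1/transcribe", handleTranscribe) // hypothetical handler
panic(http.Serve(ln, mux))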

The only shared infrastructure is the rendezvous server for initial discovery. The public one at pilotprotocol.network works out of the box, or you can self-host. After the initial handshake, all model traffic is direct P2P between the machines with no intermediary.

For the encryption details, see Encrypting Agent Traffic with Zero External Dependencies. For how the trust model controls which agents can connect to which, see How Pilot Protocol Works.

