Nithin Bharadwaj
How to Build Service Discovery and Load Balancing for Distributed Systems in Go

Imagine you have a team of workers in a large factory. Each worker performs a specific task, like painting or assembling. Now, imagine workers can appear, disappear, or get sick at any moment. Your job is to make sure that every request for a task—like "paint this car blue"—always finds a healthy, available painter. If one painter is busy, you send the request to another. If a painter gets sick, you stop sending them work until they recover. This is the core challenge of distributed computing, and it's what we solve with service discovery and load balancing.

I want to show you how to build the system that manages this. We'll create a central registry where services announce themselves. We'll build a health checker that constantly pings them. We'll design a router that picks the best instance for each new request. The goal is to make a network of services that feels as reliable as a single, solid machine.

Let's start with the foundation: the service registry. Think of it as a dynamic phone book. When a new service instance starts, it calls the registry to say, "I'm here." It provides its address, port, and some details about itself.

package main

import (
    "fmt"
    "sync"
    "time"
)

// ServiceStatus tracks the health of an instance.
type ServiceStatus int

const (
    StatusPending ServiceStatus = iota
    StatusHealthy
    StatusUnhealthy
)

// Service holds the data for one running instance.
type Service struct {
    ID         string
    Name       string
    Address    string
    Port       int
    Tags       []string
    Status     ServiceStatus
    LastHealth time.Time
}

// ServiceRegistry is our dynamic phone book.
type ServiceRegistry struct {
    services map[string]*Service // Map from Service ID to the Service object.
    mu       sync.RWMutex        // A lock to protect the map from concurrent access.
}

// NewServiceRegistry creates a new, empty registry.
func NewServiceRegistry() *ServiceRegistry {
    return &ServiceRegistry{
        services: make(map[string]*Service),
    }
}

// Register is called by a service when it starts.
func (sr *ServiceRegistry) Register(name, address string, port int, tags []string) error {
    sr.mu.Lock()
    defer sr.mu.Unlock() // This ensures the lock is released when the function finishes.

    // Create a unique ID for this instance.
    serviceID := fmt.Sprintf("%s-%s:%d", name, address, port)

    // Check if it's already registered.
    if _, exists := sr.services[serviceID]; exists {
        return fmt.Errorf("service %s already registered", serviceID)
    }

    // Add the new service to our map.
    sr.services[serviceID] = &Service{
        ID:         serviceID,
        Name:       name,
        Address:    address,
        Port:       port,
        Tags:       tags,
        Status:     StatusPending, // Starts as pending until health-checked.
        LastHealth: time.Now(),
    }

    fmt.Printf("Registered new service: %s\n", serviceID)
    return nil
}

When a service shuts down gracefully, it should call Deregister. But what if it crashes? That's where our next component comes in. We need a health checker. It will periodically ask each service, "Are you okay?" If a service doesn't answer, we mark it as unhealthy.

We can check health in different ways: by making an HTTP request to a /health endpoint, by trying to open a TCP connection, or even by running a custom piece of code.

// The checks below also use the "net" and "net/http" packages; add them to the imports above.
type HealthCheckType int

const (
    HealthCheckHTTP HealthCheckType = iota
    HealthCheckTCP
)

type HealthCheck struct {
    Type         HealthCheckType
    HTTPEndpoint string // For HTTP checks.
    Address      string // For TCP checks.
    Port         int
    Interval     time.Duration // How often to check, e.g., every 10 seconds.
    Timeout      time.Duration // How long to wait for a response.
}

// HealthChecker runs all checks and updates the registry.
type HealthChecker struct {
    registry   *ServiceRegistry
    checks     map[string]*HealthCheck // Map from Service ID to its check config.
    httpClient *http.Client
}

// performCheck runs a single health check and records how long it took.
func (hc *HealthChecker) performCheck(serviceID string, check *HealthCheck) {
    var healthy bool

    start := time.Now()

    switch check.Type {
    case HealthCheckHTTP:
        // Try to make a GET request. resp is only non-nil when err is nil.
        resp, err := hc.httpClient.Get(check.HTTPEndpoint)
        if err == nil {
            if resp.StatusCode == http.StatusOK {
                healthy = true
            }
            resp.Body.Close()
        }
    case HealthCheckTCP:
        // Try to open a socket.
        address := fmt.Sprintf("%s:%d", check.Address, check.Port)
        conn, err := net.DialTimeout("tcp", address, check.Timeout)
        if err == nil {
            healthy = true
            conn.Close()
        }
    }

    latency := time.Since(start)
    fmt.Printf("check %s: healthy=%v, latency=%v\n", serviceID, healthy, latency)

    // Update the service's status in the registry.
    hc.registry.mu.Lock()
    defer hc.registry.mu.Unlock()
    if service, exists := hc.registry.services[serviceID]; exists {
        if healthy {
            service.Status = StatusHealthy
        } else {
            service.Status = StatusUnhealthy
        }
        service.LastHealth = time.Now()
    }
}

The health checker runs in a loop, performing each check at its defined interval. This keeps our registry's view of the world up-to-date. Now we have a list of services and we know which ones are healthy. The next step is to decide where to send incoming work. This is load balancing.

The simplest method is round-robin. Just go down the list, one after another.

// RoundRobinStrategy cycles through instances; it needs "sync/atomic" imported.
type RoundRobinStrategy struct {
    currentIndex uint32 // An atomic counter keeps Pick safe for concurrent callers.
    instances    []*Service
}

func (rr *RoundRobinStrategy) Pick() *Service {
    if len(rr.instances) == 0 {
        return nil
    }
    // Atomically increment the counter, then wrap around with modulo.
    // Subtracting 1 makes the first call land on index 0.
    index := (atomic.AddUint32(&rr.currentIndex, 1) - 1) % uint32(len(rr.instances))
    return rr.instances[index]
}

But what if some instances are more powerful than others? Or what if some are already handling many requests? We might want a "least connections" strategy.

// LeastConnectionsStrategy assumes Service gains a `Connections int32` field,
// incremented when a request starts and decremented when it finishes.
type LeastConnectionsStrategy struct{}

func (lc *LeastConnectionsStrategy) Pick(instances []*Service) *Service {
    if len(instances) == 0 {
        return nil
    }
    var selected *Service
    minConns := int32(1 << 30) // Larger than any realistic connection count.

    for _, instance := range instances {
        conns := atomic.LoadInt32(&instance.Connections)
        if conns < minConns {
            minConns = conns
            selected = instance
        }
    }
    return selected
}

Sometimes, the network is the bottleneck. A server might be healthy but slow because it's far away. A latency-aware strategy can track response times and prefer faster instances.

We need a way to route actual user requests. In an HTTP server, we can use a middleware. This middleware intercepts each request, figures out which service it's for, uses the load balancer to pick an instance, and then forwards the request.

// DiscoveryMiddleware also needs "net/http/httputil" and "net/url" imported.
// Note that `next` is never called here: every request is proxied to a backend.
func DiscoveryMiddleware(registry *ServiceRegistry, lb *LoadBalancer) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Extract the target service name from the request path or a header.
            // For example, a path like /api/users/ might route to "user-service".
            serviceName := "user-service" // Simplified; parse r.URL.Path in practice.

            // Get a list of healthy instances for this service.
            instances, err := registry.Discover(serviceName, true) // true = healthy only.
            if err != nil {
                http.Error(w, "Service not available", http.StatusServiceUnavailable)
                return
            }

            // Use the load balancer strategy to pick one.
            instance := lb.Pick(instances)
            if instance == nil {
                http.Error(w, "Service not available", http.StatusServiceUnavailable)
                return
            }

            // Forward the request to the chosen instance.
            // This is a simple reverse proxy setup.
            proxy := httputil.NewSingleHostReverseProxy(&url.URL{
                Scheme: "http",
                Host:   fmt.Sprintf("%s:%d", instance.Address, instance.Port),
            })
            proxy.ServeHTTP(w, r)
        })
    }
}
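The Discover call used above hasn't been shown yet. A sketch of what it does, written against pared-down stand-ins for the registry types so it compiles on its own — the real method lives on ServiceRegistry and takes the read lock:

```go
// Pared-down copies of the article's types for a self-contained sketch.
type status int

const (
    statusHealthy status = iota
    statusUnhealthy
)

type instance struct {
    Name   string
    Status status
}

// discover returns the instances registered under a name,
// optionally keeping only the healthy ones.
func discover(services map[string]*instance, name string, healthyOnly bool) []*instance {
    var out []*instance
    for _, inst := range services {
        if inst.Name != name {
            continue
        }
        if healthyOnly && inst.Status != statusHealthy {
            continue
        }
        out = append(out, inst)
    }
    return out
}
```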

This is the basic flow. But a production system needs more. It needs to watch for changes. If a health check fails, the registry should immediately notify the load balancer so it stops using that instance. We can use a watcher pattern.

type RegistryEvent struct {
    Type    string // "REGISTER", "DEREGISTER", "HEALTH_CHANGE"
    Service *Service
}

type RegistryWatcher interface {
    Notify(event RegistryEvent)
}

// The LoadBalancer can implement this interface.
func (lb *LoadBalancer) Notify(event RegistryEvent) {
    lb.mu.Lock()
    defer lb.mu.Unlock()

    switch event.Type {
    case "HEALTH_CHANGE":
        if event.Service.Status == StatusUnhealthy {
            // Remove this instance from the active pool.
            lb.removeInstance(event.Service.ID)
        } else {
            // Add it back.
            lb.addInstance(event.Service)
        }
    }
}

Dynamic routing means your system automatically adjusts to failures and new capacity. When you deploy a new version of a service, you can register the new instances, let them pass health checks, and the load balancer will start sending them traffic. You can then gracefully deregister the old ones.

Let's look at a more complete example, putting the registry, health checker, and a simple load balancer together.

func main() {
    // Create the core components.
    registry := NewServiceRegistry()
    healthChecker := NewHealthChecker(registry)
    lb := NewLoadBalancer(registry, "round_robin")

    // Start the health checker in the background.
    go healthChecker.Run()

    // Simulate a service registering.
    registry.Register("cart-service", "10.0.0.1", 8080, []string{"v1"})

    // Define its health check.
    healthChecker.AddCheck("cart-service-10.0.0.1:8080", &HealthCheck{
        Type:         HealthCheckHTTP,
        HTTPEndpoint: "http://10.0.0.1:8080/health",
        Interval:     10 * time.Second,
        Timeout:      2 * time.Second,
    })

    // Set up an HTTP server that uses our discovery middleware.
    router := http.NewServeMux()
    // Your actual application routes would go here on `router`.

    // Wrap the router with our discovery middleware.
    app := DiscoveryMiddleware(registry, lb)(router)

    fmt.Println("Gateway listening on :8080")
    if err := http.ListenAndServe(":8080", app); err != nil {
        fmt.Println("server error:", err)
    }
}

In this code, the gateway listens on port 8080. A request comes in for /api/cart. The middleware maps the path to a service name (hardcoded in the earlier snippet; a real extractor would parse it) and asks the registry for healthy "cart-service" instances. The registry checks its list, which the health checker keeps current. The load balancer picks one instance, say 10.0.0.1:8080, and the request is forwarded there.

What about more complex rules? Maybe you want to send 1% of traffic to a new version for testing. This is where routing policies come in. You can tag your services with versions and have the load balancer read those tags.

type RoutingPolicy struct {
    ServiceName string
    MatchTags   map[string]string // e.g., {"version": "canary"}
    Weight      int               // e.g., 1 for 1% traffic.
}

func (lb *LoadBalancer) PickWithPolicy(serviceName string, policies []RoutingPolicy) *Service {
    // First, get all healthy instances.
    instances := lb.registry.GetHealthyInstances(serviceName)

    // Filter instances based on the policy tags.
    var filteredInstances []*Service
    for _, instance := range instances {
        if lb.matchesPolicy(instance, policies) {
            filteredInstances = append(filteredInstances, instance)
        }
    }

    // If the filtered list has instances, pick from it based on weight.
    // Otherwise, fall back to the general pool.
    if len(filteredInstances) > 0 {
        // Implement weighted logic here.
        return lb.weightedPick(filteredInstances, policies)
    }
    return lb.Pick(instances) // Default strategy.
}

Building this yourself teaches you the mechanics, but for a real, large-scale system, you'd likely use existing tools like Consul, etcd, or Kubernetes' built-in service discovery. These provide the distributed, consistent storage we glossed over. Our in-memory map works for a single registry node, but what if that node crashes? You need multiple registry nodes that agree on the state. That's a problem solved by consensus algorithms like Raft, which is used by Consul and etcd.

The code patterns, however, remain similar. You register services, you check health, you balance load. The difference is in the durability and fault-tolerance of the registry itself.

The system I've walked you through is like the nervous system for your microservices. It knows where everything is, it feels when something is wrong, and it directs traffic smoothly. It turns a collection of independent, fragile processes into a resilient, adaptable application.

Start simple. Build a registry that keeps a list. Add a health checker that updates it. Make a load balancer that reads from it. You'll learn more from getting these three pieces talking to each other than from any diagram or lecture. Once you have that basic loop working—registration, health, routing—you've built the heart of the system. Everything after that is making it faster, more reliable, and easier to manage.
