
Iain McGinniss

Request routing for horizontally scaled services

Networked systems engineering is a fundamental aspect of modern software engineering. The double-edged sword of internet-connected services is the opportunity for your service to be utilized by anyone (growth! impact! profit!), but success can result in extremely unpredictable load spikes and overall growth in resource requirements to keep things running smoothly. First, we shall discuss the options for handling variable and increasing load, and then focus on how we can effectively route requests across a horizontally scaled service (a term we shall define momentarily). Through exploring this topic, we will also touch on some more advanced tools and strategies, such as API gateways and service meshes. I hope you enjoy the journey, and that it helps you make some informed choices in your next system design.

Vertical and horizontal scaling

In the early stages of your service's life, over-provisioning is the simplest strategy to handle variable load. You estimate the peak load based on some service-specific characteristics and a wet finger held to the breeze, and ensure that you have sufficient capacity to handle that peak load.

in the happy early days of your service, users generate a manageable amount of load on your service, and your server still has some unused resources that can be claimed by the service as needed

If an unexpected spike in load arrives, or you just can't keep up with demand, bad things will happen.

more users arrive, forming an unexpected ravenous mob. Your service now requires more resources to operate than the poor server can supply, and it begins to fail

As this estimated peak grows over time, one strategy to keep up is to vertically scale, which means moving to physical or virtual machines with more resources, such as more compute cores, memory, persistent storage space, or network bandwidth.

the server is replaced by a new one with more compute resources, and you can now satiate the desires of the mob, with some additional headroom for future growth

To make use of these additional resources, the service must typically employ a variety of techniques, such as process forking or multi-threading, bigger in-memory caches, and RAID configurations to increase disk I/O throughput and bandwidth.
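As a minimal sketch of the multi-threading idea (the job type and workload here are hypothetical), a service can size a worker pool to the number of cores on the machine, so that a vertically scaled-up server is automatically put to work:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// process stands in for whatever per-request work the service performs.
func process(job int) int { return job * job }

func main() {
	jobs := make(chan int)
	results := make(chan int)

	// Size the worker pool to the number of CPU cores available, so a
	// bigger machine is used without further code changes.
	workers := runtime.NumCPU()
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobs {
				results <- process(job)
			}
		}()
	}

	// Feed some work, then close the channel so workers can exit.
	go func() {
		for i := 0; i < 100; i++ {
			jobs <- i
		}
		close(jobs)
	}()
	go func() { wg.Wait(); close(results) }()

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println("processed sum:", sum)
}
```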

The key distinction with vertical scaling is that the service can handle additional load without the need to spread across multiple computers - it can maximize utilization of available resources on a single computer. A service that horizontally scales, in contrast, utilizes additional computers and a network to distribute additional load. This involves a very different implementation strategy, and raises an important question: how should requests be distributed across a dynamic set of instances?

with a horizontally scaled service, requests from users are distributed across multiple instances of the service - but how should these requests be distributed?

Most services written before the era of cloud computing utilized vertical scaling, as this was typically the only viable option - provisioning of resources to support horizontal scaling was not economically viable for organizations with small, on-premise data centers. It could take weeks to provision and install new hardware for use by a service, so system architects had to plan ahead, and building services to vertically scale on big, over-provisioned servers was simpler. A client I worked with many years ago utilized huge IBM mainframes that cost millions of dollars to provision and install into their on-premise data center, all because their service was monolithic and unable to horizontally scale. The client was pushing the limits of hardware that could be managed as a single machine, and once that path was exhausted, there would be no choice but to re-architect their system to horizontally scale.

Cloud computing providers support vertically scaling services through infrastructure-as-a-service (IaaS) products that offer a variety of virtual machine sizes, from cheap single-core and low-memory instances (e.g. Amazon's t2.nano instance type with 1 vCPU and 0.5GB of RAM), all the way up to monstrous instances with hundreds of cores and terabytes of memory (e.g. Google's m2-ultramem-416 instance type, with 416 vCPUs and 11.7TB of RAM). You will, of course, pay a steep price for such vertical scaling capability - pricing increases are linear in vCPU and memory to a point, then become bespoke and negotiated when you reach truly specialized hardware. The m2-ultramem-416 instance costs $50.91 per hour with a one year reservation (~$438k/yr), whereas a more typical n2-standard-16 instance with 16 vCPU and 64GB of RAM costs $0.79 per hour (~$7k/yr). If a service can horizontally scale, and more efficiently follow load, your maximum cost of using commodity instances like n2-standard-16 will often be an order of magnitude lower.

Freedom through constraint - PaaS and FaaS

Cloud computing also introduced other ways to think about service development, via platform-as-a-service (PaaS) or function-as-a-service (FaaS) offerings, e.g. Google App Engine, AWS Elastic Beanstalk, AWS Lambda, Azure Functions, and many others. With PaaS/FaaS systems, the compute infrastructure running your service ceases to be your concern, allowing you to focus on the higher level semantics of your service. From the developer's perspective, there is no "server", or alternatively, just one logical server with theoretically unlimited scaling. In reality, the limitations imposed on how a service is implemented by these technologies ensures that your service horizontally scales across multiple instances, in a way that is managed by the cloud provider. The restrictions may also allow for multi-tenancy, where multiple services (potentially even from multiple customers) can run on the same hardware at the same time, yielding resource utilization improvements and cost savings for the cloud provider, and maybe even for you.
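To make the "no server" model concrete, here is a minimal sketch of a Go function deployed to AWS Lambda; the event shape is a hypothetical example, and nothing in the code concerns itself with instances, processes, or scaling:

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/lambda"
)

// GreetEvent is a hypothetical event; a real service would use an event
// type matching its trigger (API Gateway, SQS, etc.).
type GreetEvent struct {
	Name string `json:"name"`
}

// handler is invoked once per event; the platform decides how many copies
// of this function run concurrently, and where.
func handler(ctx context.Context, evt GreetEvent) (string, error) {
	return fmt.Sprintf("hello, %s", evt.Name), nil
}

func main() {
	lambda.Start(handler)
}
```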

The loss of implementation freedom from using a PaaS or FaaS framework may not be acceptable for all services, certainly not without a broad re-think of how the service is implemented. Many organizations will instead choose to stick with vertical scaling utilizing the instance types made available by cloud providers, for as long as possible. With enough growth, a service will inevitably hit a point where it cannot utilize bigger servers effectively, or where bigger servers are simply not available.

Implementing services to horizontally scale on dynamically provisioned IaaS cloud resources is the middle ground that many organizations choose. This provides them with more direct control over when and how the system scales, but comes with a significant complexity cost. If your organization is using Docker Swarm or Kubernetes, you are likely self-managing horizontally scaled services, and may be immersed in the overhead and complexity of doing this safely and effectively.

Other good reasons to horizontally scale

Building a service to support some degree of horizontal scaling is a very complex topic in its own right. However, it is increasingly becoming a requirement of contemporary software engineering, for good reasons beyond just scalability. Horizontal scaling can also support reliability, maintainability, and efficiency goals.

Rather than having a single point of failure in our single entry point to the service, we can utilize multiple servers and work to ensure fail-over between servers is transparent. This fail-over can be "active-passive", where a single server is still responsible for all traffic, but when it fails we have a "warm" backup server that is available to take over within minutes. This type of configuration is common with traditional relational databases such as MySQL. Even better, an "active-active" configuration means that all servers are "hot" and capable of handling requests in short order, supporting recovery in under 30 seconds, and often immediately. In an active-active configuration, it is also possible to send requests to any server, and have the set of servers behave as a predictable, single "logical" service.

with an active-active configuration, requests are distributed across all instances, and when a server dies, clients can simply reconnect to another instance and recover
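A minimal sketch of the client's side of this recovery, assuming a hypothetical list of active-active instance URLs: try each instance in turn, and fall back to the next if a request fails.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// instances is a hypothetical set of active-active service instances.
var instances = []string{
	"https://a.service.example.com",
	"https://b.service.example.com",
	"https://c.service.example.com",
}

var client = &http.Client{Timeout: 2 * time.Second}

// get tries each instance in turn; the loss of one instance is recovered
// from by simply reissuing the request against another.
func get(path string) (*http.Response, error) {
	var lastErr error
	for _, base := range instances {
		resp, err := client.Get(base + path)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("all instances failed: %w", lastErr)
}

func main() {
	if resp, err := get("/healthz"); err == nil {
		fmt.Println("served by:", resp.Request.URL.Host, resp.Status)
		resp.Body.Close()
	}
}
```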

Aside from this reliability and recovery advantage, the ability to have multiple active-active servers and replace them at will also facilitates transparent maintenance. Rather than requiring downtime to roll out new releases of our service, as was required when replacing the deployment of a service on fixed hardware, we can instead introduce some new servers with a new version and migrate requests from the old set to the new set. We have flexibility in how this can be done, allowing for clever rollout strategies such as canaries, where we send a small percentage of traffic to the new service and ensure it performs acceptably before proceeding to a full rollout. Related to this, a blue-green deployment allows us to incrementally move from the old release to the new release, while retaining the ability to roll back quickly if any undesirable behavior is detected.

with a blue-green deployment, we can gradually migrate clients from the "blue" set of instances to the "green" set of instances (or vice versa) to facilitate zero-downtime maintenance
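On the routing side, a canary or gradual blue-green migration can be as simple as a weighted coin flip per request; a minimal sketch, where the split percentage and backend labels are illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickBackend sends roughly canaryPercent of requests to the "green"
// (new) instances and the rest to the "blue" (current) instances.
func pickBackend(canaryPercent float64) string {
	if rand.Float64()*100 < canaryPercent {
		return "green"
	}
	return "blue"
}

func main() {
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickBackend(5)]++ // 5% canary traffic to the new release
	}
	fmt.Println(counts) // roughly 9500 blue, 500 green
}
```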

Finally, by utilizing multiple smaller machines, the capacity we provision can more closely align to the actual load our service is experiencing at any point in time, resulting in more efficient utilization of resources. For services with a day-night load cycle (i.e. your service is interactive, and all your customers are within a narrow band of time zones, so you see significantly less load at night than during the day) you then have the opportunity to scale up and down periodically, potentially saving a significant amount of money compared to over-provisioning to be capable of handling your estimated peak load at all times. This type of dynamic scaling is a huge advantage of cloud infrastructure, and can also be automated by utilizing real-time aggregate metrics (e.g. CPU usage, network throughput, etc.) to decide how many instances should be active at any given time.
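The core of such an autoscaling policy is often just a target-tracking calculation over an aggregate metric; a sketch, where the target utilization and instance bounds are assumptions:

```go
package main

import (
	"fmt"
	"math"
)

// desiredInstances computes how many instances are needed to bring average
// CPU utilization back toward the target, clamped to sane bounds.
func desiredInstances(current int, avgCPU, targetCPU float64, min, max int) int {
	n := int(math.Ceil(float64(current) * avgCPU / targetCPU))
	if n < min {
		n = min
	}
	if n > max {
		n = max
	}
	return n
}

func main() {
	// 4 instances running at 85% average CPU, targeting 60%: scale to 6.
	fmt.Println(desiredInstances(4, 0.85, 0.60, 2, 20))
}
```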

So, how can we implement a service to horizontally scale? This question is so system-dependent that a succinct answer cannot be provided that covers all cases. One broad exception is in the case of "stateless" services - those which handle requests in an isolated, predictable way, with no side effects that are local to the service. A typical stateless service will utilize a data store with ACID properties, and will process requests based entirely on the contents of the request and manipulations of that data store. This service can be horizontally scaled through simple replication - the number of instances required is typically linearly dependent on request throughput. This attractive characteristic is why so much emphasis is placed on utilizing stateless services wherever possible.

DNS and client-side load balancing

So, you have a service that can be horizontally scaled, meaning that there are multiple service instances available to process requests. How can we effectively and evenly direct requests from clients to service instances?

At the simplest level, DNS records can map the canonical name for a service to multiple IP addresses of service instances that implement that service. This basic abstraction allows clients to "find" the service using a durable identifier, while allowing the service maintainer to change the set of instances handling requests over time, as needed.

A DNS record is a very flexible way to map the abstract to the real - it can map a name to multiple addresses (A/AAAA records), or to another DNS record (CNAME records). Combined with a reasonable "time to live" (TTL) value for the record, we have a reliable mechanism to propagate changes of our abstract-to-real mapping to users of a service in an efficient way.

A client can choose randomly between the available addresses - in aggregate, this will evenly distribute clients across server IP addresses. This is referred to as client-side load balancing, where clients possess sufficient intelligence to satisfy our goals, either through coordination (e.g. gRPC with Traffic Director) or as an emergent property of independent behavior in aggregate.

clients perform a DNS lookup for the service address, then randomly select between the available addresses
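A minimal sketch of this in Go, where the service name is hypothetical: resolve the name, then pick one of the returned addresses at random.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

func main() {
	// A single name can resolve to many A/AAAA records.
	addrs, err := net.LookupHost("service.example.com") // hypothetical name
	if err != nil {
		panic(err)
	}

	// Picking uniformly at random means that, in aggregate, clients spread
	// themselves roughly evenly across the advertised instances.
	chosen := addrs[rand.Intn(len(addrs))]
	fmt.Println("sending requests to", net.JoinHostPort(chosen, "443"))
}
```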

With the ability to add and remove IP addresses from the DNS mapping as needed, client-side load balancing can support both our scalability and efficiency goals. If we can create new instances of our service at will, we can dynamically auto-scale, and escape the limitations of vertical scaling (big, over-provisioned servers) in favor of horizontal scaling (small, cheap, easily replaced servers).

Client-side load balancing can work well if clients are uniform, meaning that the behavior of each client is roughly equivalent in terms of the demands they place on the system. This is often not the case, however - clients with a 1Gbps fiber connection can place significantly more load on file servers than those with a mediocre cellular connection, for instance. The types of requests that clients make may also result in significant variability in load, dependent upon the data associated with each user's requests. So, some careful evaluation must be made of whether client-side load balancing will work for your service or not. In gRPC and other client stacks, the client may attempt to self-distribute the load of the requests it generates by opening multiple connections to different server instances, and perform client-side round-robin distribution of requests across these servers. Even this can be problematic: if the client only opens a small number of connections (typically three), it may still impose its load on a small subset of all available server instances.
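The round-robin behavior described above amounts to little more than an atomic counter over a small set of pre-opened connections; a sketch, where the addresses stand in for a hypothetical connection pool (real clients such as gRPC manage this for you):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin cycles through a fixed set of server addresses (standing in
// for a small pool of open connections) for each outgoing request.
type roundRobin struct {
	addrs []string
	next  uint64
}

func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.addrs[(n-1)%uint64(len(r.addrs))]
}

func main() {
	rr := &roundRobin{addrs: []string{"10.0.0.1:443", "10.0.0.2:443", "10.0.0.3:443"}}
	for i := 0; i < 6; i++ {
		fmt.Println("request", i, "->", rr.pick())
	}
}
```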

There are also other limitations to this DNS-based approach to solving our reliability, scalability, and efficiency goals. Clients must do what we desire and expect - this is fine when we also control the client, but can be problematic when interacting with third-party controlled software like web browsers or clients implemented by other teams or organizations. Changes to our DNS records propagate erratically, depending on which DNS service our clients use. If we use a TTL of 1 minute, we can expect many (perhaps most) of our clients to see the change within a minute, but some may take significantly longer, due to configuration details of infrastructure that may be completely out of our control. There are also practical limits to how many IP addresses we can reference with our DNS records; managing hundreds of addresses per name may be feasible, but thousands or more is unrealistic - DNS servers do not guarantee to respond with all mappings for a name. When using UDP for DNS lookups, we are limited to what can fit in a packet. When using TCP, a DNS server may limit the number of responses to prevent slow-down for other clients.

Service-side load balancing

So, if we cannot rely on DNS and client-side behavior for fast, reliable changes to our service routing, what else can we do? Like most problems in computing science, we can solve it by adding a layer of indirection! Load balancers provide a more adaptable approach - point your DNS records at a TCP or HTTP load balancer, then manage a more dynamic "target set" behind that load balancer. This leaves the DNS records in a much more static configuration, while giving you immediate and localized control over where requests are routed behind that load balancer. Even if you're not using the load balancer for auto-scaling, it is still a very effective tool for handling rolling restarts and general maintenance of your service, without worrying about client-side effects.

Load balancers often only handle traffic at layer 3 or 4 in the OSI model - that is, they are packet- or connection-oriented, evenly distributing traffic across the target set at the IP, UDP, or TCP level. An inbound packet or connection arrives, and the load balancer decides which service instance to forward it to. This can be a simple strategy such as round-robin distribution, or a more weighted load balancing strategy based on metrics reported from service instances.
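To make the connection-oriented (L4) behavior concrete, here is a minimal sketch of a TCP load balancer: accept a connection, pick a backend round-robin, and copy bytes in both directions. The listen port and backend addresses are assumptions.

```go
package main

import (
	"io"
	"log"
	"net"
	"sync/atomic"
)

// Hypothetical backend instances behind the load balancer.
var backends = []string{"10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"}
var next uint64

func handle(client net.Conn) {
	defer client.Close()
	// Round-robin choice of backend for this connection; every byte of the
	// connection goes to the same backend (no per-request awareness).
	backend := backends[atomic.AddUint64(&next, 1)%uint64(len(backends))]
	server, err := net.Dial("tcp", backend)
	if err != nil {
		log.Printf("dial %s: %v", backend, err)
		return
	}
	defer server.Close()
	go io.Copy(server, client) // client -> backend
	io.Copy(client, server)    // backend -> client
}

func main() {
	ln, err := net.Listen("tcp", ":8443")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn)
	}
}
```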

L3/L4 load balancing was fine for most systems prior to the advent of HTTP/2, as HTTP/1.1 and other common internet application protocols are connection oriented. Requests are serviced serially on each connection, and each connection belongs to just one client. To achieve more concurrency in request handling, more connections were used. This is ultimately wasteful of bandwidth, with significantly more packets required for maintenance of TCP connection state. It also results in higher average and P99 request processing latency, and can exhaust the operating system's connection handling resources. In an environment where connections are created by clients over the internet (e.g. browsers), these relatively low throughput but highly variable connections can place an uneven load on the servers behind the load balancer, despite your best efforts.

Protocols such as HTTP/1.1 are effectively stateless, meaning that each request carries everything required to process it, and there is no explicit relationship between requests or expectation that requests must be processed in the order they are received. This opens the possibility of decoupling the set of connections into a load balancer from the set of connections out of the load balancer. Thousands of low-throughput inbound connections can be transformed into a much smaller number of high-throughput connections to the service instances, or alternatively, we can ensure that each request from the load balancer to a server uses its own connection, to prevent head-of-line blocking of request processing.

When a load balancer is capable of doing this type of request-level processing, we typically classify it as a layer 7 (aka. application layer) load balancer. Requests from multiple clients with separate connections may be multiplexed onto a single connection to a server. When using a protocol such as HTTP/2, which allows for requests to be processed and responded to out-of-order, this can result in a significant decrease in connection maintenance waste, or conversely much higher utilization of available resources.

Reverse proxies

Once you recognize the utility of a layer 7 load balancer in your architecture, many other possibilities become apparent. As HTTP requests carry information on the client-side view of the service intended to process a request (via the Host header), we can potentially use a single load balancer for multiple logical services. Going further, we could inspect the path, request method, query parameters, or perhaps even the body in making request routing decisions. With these features, we now have what many would call a reverse proxy - a service with a flexible configuration language that allows for more sophisticated routing decisions than just distributing requests blindly across a set of servers known to the load balancer by their IP addresses alone.

One of the most commonly used open source reverse proxies is nginx, though cloud vendors also typically provide their own managed options, such as AWS Application Load Balancer, Google Cloud Load Balancing, etc.

These systems typically allow for dynamic configuration of the reverse proxy through an API or a JSON/TOML/YAML configuration language, allowing the routing rules to be changed without any disruption to the currently active request processing. Reverse proxies typically have higher resource requirements and overhead than L3/L4 load balancers, but are still highly optimized and capable of handling upwards of 10k requests per second per instance on commodity hardware, and are also usually horizontally scalable to hundreds of instances and millions of requests per second. This meta-scaling problem when dealing with millions of requests per second is usually handled by having multiple L4 load balancers directing connections to the reverse proxies, which in turn direct requests to your service instances.

requests are routed from clients to the L4 load balancer, which divides them across the reverse proxy instances, which in turn divide them across the service instances according to their L7 rules

With a reverse proxy, we can start to implement some more sophisticated request routing patterns, such as routing based on request type. With browser-based web applications we must often define all our request endpoints on the same domain, in order to comply with the web application model where cookies and related security controls will not permit certain types of cross-domain requests. Our reverse proxy can maintain this illusion for the front-end, while splitting requests of different types, typically distinguished by path, to different upstream services that handle those requests. For example, our GraphQL API endpoint (maybe "/api/graphql" from the client perspective) may be serviced by an Apollo Gateway, while other API endpoints (^\/api\/(?!graphql).*$) might be handled by a service we implement, and everything else (^\/(?!api\/).*$) is handled by a static content server.
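A sketch of this kind of path-based fan-out using Go's standard library reverse proxy; the upstream addresses are hypothetical, and simple prefix checks stand in for the regex rules above (Go's RE2-based regexp has no lookahead):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// proxyTo builds a reverse proxy that forwards requests to a single upstream.
func proxyTo(rawURL string) *httputil.ReverseProxy {
	u, err := url.Parse(rawURL)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(u)
}

// Hypothetical upstream services behind the reverse proxy.
var (
	graphql = proxyTo("http://apollo-gateway.internal:4000")
	api     = proxyTo("http://api.internal:8080")
	static  = proxyTo("http://static.internal:8081")
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		switch {
		case r.URL.Path == "/api/graphql":
			graphql.ServeHTTP(w, r) // GraphQL traffic to the Apollo Gateway
		case strings.HasPrefix(r.URL.Path, "/api/"):
			api.ServeHTTP(w, r) // all other API traffic to our own service
		default:
			static.ServeHTTP(w, r) // everything else to the static content server
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```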

Deeper request inspection could also allow us to do things like route expensive requests to a separate set of servers, so that we can independently manage the auto-scaling for that request type from other requests. This can also be an effective tool to ensure that these potentially problematic requests do not impact the performance expectations of the other requests; if they are all handled by the same pool of servers, expensive requests may impact the latency and jitter of others through resource contention.

API gateways

So far, we have discussed request routing middleware that is primarily tasked with routing requests efficiently across our service instances, but otherwise does not concern itself with the implementation details of how responses are formulated for requests. However, once we introduce middleware that inspects the contents of requests as part of routing decisions, it is not a great conceptual leap from there to middleware that is also responsible for some common request processing tasks. For example, the middleware could be responsible for ensuring that requests carry valid authentication information, such as valid cookies or request signatures. The middleware could also perform tasks such as content encoding transformations, changing uncompressed responses to Brotli-compressed responses, resulting in lower bandwidth utilization when communicating with browser clients. Conceptually, any in-line transformation of requests or responses could be handled by the middleware. I refer to request routing middleware with this capability as an API gateway.
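For example, a gateway-style authentication check can be expressed as middleware wrapped around whatever routing handler sits behind it; a sketch, where the header name and validation logic are placeholders:

```go
package main

import (
	"log"
	"net/http"
)

// requireAuth rejects requests without a valid credential before they ever
// reach an upstream service.
func requireAuth(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		token := r.Header.Get("Authorization")
		if token == "" || !validToken(token) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func validToken(token string) bool {
	// Placeholder: a real gateway would verify a signature, introspect the
	// token, or consult a session store.
	return token == "Bearer let-me-in"
}

func main() {
	// upstream stands in for the routing/proxying handler the gateway wraps.
	upstream := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("hello from the service\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", requireAuth(upstream)))
}
```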

API gateways go beyond the capabilities of reverse proxies by providing an extension mechanism that allows for custom code to be executed as part of the request processing pipeline. Traefik Proxy is a good example of this, as it has a plugin mechanism that allows for custom Go code to influence request processing decisions. Kong also deserves a mention here, with the ability to write plugins in Lua, or integrate with external binaries written in practically any language. Most available API gateways provide a set of standard request processing plugins to handle authentication, rate limiting, content type transforms, and so on. In general, they provide a useful way to enforce some consistent request processing standards across all services, that can be implemented in one place, rather than requiring re-implementation across multiple services - particularly if those services are implemented in different languages.

Conclusion

The myriad of request processing middlewares does not end here - there is also the very trendy topic of service meshes that we could cover, but I choose to leave that as an exercise to interested readers, as it is a rapidly evolving and complex space (see: Istio, linkerd, Consul, Tanzu, etc).

So, what should you use in your own architecture? If you are writing something from scratch, I would strongly recommend looking at PaaS/FaaS options to avoid all of this complexity for as long as possible - the less time you have to spend thinking about auto-scaling and request processing, the more time you have to build out the value-providing aspects of your service. If you maintain existing services that are incompatible with a PaaS/FaaS approach, your cloud provider's managed load balancer / reverse proxy is likely the most straightforward option to use. If you find that you need a little more flexibility, an API gateway such as Traefik or Kong can be an excellent option; just be prepared to think much more deeply about the network layer of your application.
