<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: adrian cockcroft</title>
    <description>The latest articles on DEV Community by adrian cockcroft (@adrianco).</description>
    <link>https://dev.to/adrianco</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F355857%2F0b1ebec3-295d-4a7d-9b61-bf9e438d7e23.jpg</url>
      <title>DEV Community: adrian cockcroft</title>
      <link>https://dev.to/adrianco</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adrianco"/>
    <language>en</language>
    <item>
      <title>Measuring Energy Usage</title>
      <dc:creator>adrian cockcroft</dc:creator>
      <pubDate>Mon, 31 Oct 2022 23:17:27 +0000</pubDate>
      <link>https://dev.to/adrianco/measuring-energy-usage-5ip</link>
      <guid>https://dev.to/adrianco/measuring-energy-usage-5ip</guid>
      <description>&lt;p&gt;Decided to figure out how to measure the energy used by a desktop computer and see if I can figure out a way to identify different workloads.&lt;/p&gt;

&lt;p&gt;I have a Mac Studio M1 to run the workload, and an old MacBook laptop to run data collection on, so that it doesn't add to the workload.&lt;/p&gt;

&lt;p&gt;First thing we need is a power monitoring plug that has an API. The TP-Link Kasa platform seems like a good place to start. It has a python based API available &lt;a href="https://github.com/python-kasa/python-kasa"&gt;on GitHub&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% pip3 install python-kasa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ordered a &lt;a href="https://www.amazon.com/gp/product/B08LN3C7WK"&gt;Kasa KP115 smart plug from Amazon&lt;/a&gt; for $22.99.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x4Eyo7AX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://m.media-amazon.com/images/I/61%2BYr4-FRXL._AC_SL1500_.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x4Eyo7AX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://m.media-amazon.com/images/I/61%2BYr4-FRXL._AC_SL1500_.jpg" alt="Smart pług" width="880" height="1355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Set up the plug using the mobile app, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kasa
No host name given, trying discovery..
Discovering devices on 255.255.255.255 for 3 seconds
== PowerMeter - KP115(US) ==
    Host: 10.0.0.46
    Device state: ON

    == Generic information ==
    Time:         2022-10-31 15:21:56 (tz: {'index': 6, 'err_code': 0}
    Hardware:     1.0
    Software:     1.0.18 Build 210910 Rel.141202
    MAC (rssi):   10:27:F5:9C:37:12 (-49)
    Location:     {'latitude': 36.xxx, 'longitude': -121.xxx}

    == Device specific information ==
    LED state: True
    On since: 2022-10-31 12:35:15

    == Current State ==
    &amp;lt;EmeterStatus power=3.456 voltage=122.48 current=0.049 total=0.002&amp;gt;

    == Modules ==
    + &amp;lt;Module Schedule (schedule) for 10.0.0.46&amp;gt;
    + &amp;lt;Module Usage (schedule) for 10.0.0.46&amp;gt;
    + &amp;lt;Module Antitheft (anti_theft) for 10.0.0.46&amp;gt;
    + &amp;lt;Module Time (time) for 10.0.0.46&amp;gt;
    + &amp;lt;Module Cloud (cnCloud) for 10.0.0.46&amp;gt;
    + &amp;lt;Module Emeter (emeter) for 10.0.0.46&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, you have to specify the host (without it, the command crashes), and specifying the device type saves an extra discovery API call, so you get a single result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% kasa --host 10.0.0.46 --type plug emeter 
== Emeter ==
Current: 0.049 A
Voltage: 121.715 V
Power: 3.628 W
Total consumption: 0.002 kWh
Today: 0.002 kWh
This month: 0.002 kWh

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
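&lt;p&gt;As a sketch of how that CLI text could be post-processed, here's a small hypothetical parser (not part of python-kasa itself) that pulls the labelled readings out of the emeter output, assuming each reading is on a "Label: value unit" line like the output shown:&lt;/p&gt;

```python
import re

def parse_emeter(text):
    """Parse `kasa ... emeter` text output into {label: (value, unit)}.

    Assumes each reading looks like "Power: 3.628 W" (label, number, unit).
    Illustrative helper only; the real tool's format is defined in cli.py.
    """
    readings = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\w ]+):\s+([-\d.]+)\s+(\S+)", line)
        if m:
            label, value, unit = m.groups()
            readings[label.strip().lower()] = (float(value), unit)
    return readings

sample = """\
== Emeter ==
Current: 0.049 A
Voltage: 121.715 V
Power: 3.628 W
Total consumption: 0.002 kWh
"""
print(parse_emeter(sample)["power"])
```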



&lt;p&gt;Using a csh loop to capture a trace at one-second intervals, it's clear that the reading only updates about every 4 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% while 1
while? kasa --host 10.0.0.46 --type plug | egrep '(Time:|EmeterStatus)'
while? sleep 1
while? end
    Time:         2022-10-31 16:09:34 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=0.0 voltage=122.49 current=0.0 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:35 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=0.0 voltage=122.49 current=0.06 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:36 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=0.0 voltage=122.36 current=0.06 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:38 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.68 voltage=122.36 current=0.06 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:39 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.68 voltage=122.36 current=0.06 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:40 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.68 voltage=122.36 current=0.05 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:41 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.68 voltage=122.258 current=0.05 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:42 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.253 voltage=122.258 current=0.05 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:43 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.253 voltage=122.258 current=0.05 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:45 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.253 voltage=122.258 current=0.049 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:46 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.253 voltage=122.382 current=0.049 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:47 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.253 voltage=122.382 current=0.049 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:48 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.675 voltage=122.382 current=0.049 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:49 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.675 voltage=122.382 current=0.012 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:50 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.675 voltage=122.446 current=0.012 total=0.004&amp;gt;
    Time:         2022-10-31 16:09:52 (tz: {'index': 6, 'err_code': 0}
    &amp;lt;EmeterStatus power=3.675 voltage=122.446 current=0.012 total=0.004&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
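&lt;p&gt;Given a trace like this, energy can be estimated by integrating the power samples over time. A minimal sketch (the function and sampling details are my own, not from python-kasa):&lt;/p&gt;

```python
def integrate_energy_wh(samples):
    """Estimate energy in watt-hours from (timestamp_seconds, watts)
    samples via trapezoidal integration. The plug only refreshes its
    reading every ~4s, so repeated power values between refreshes
    are expected and integrate correctly."""
    wh = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        wh += (p0 + p1) / 2.0 * (t1 - t0) / 3600.0
    return wh

# One minute at a steady 3.6 W should come out to 0.06 Wh.
samples = [(t, 3.6) for t in range(61)]
print(round(integrate_energy_wh(samples), 6))
```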



&lt;p&gt;The output formatting is defined in cli.py, and awk can reduce it to a compact one-line record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kasa --host 10.0.0.46 --type plug | awk '/Time:|EmeterStatus/{printf("%s %s ", $2, $3)}'; echo
2022-10-31 16:23:11 power=1.017 voltage=122.608 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's all I have time to do for now, to be continued...&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Sustainability Transformation and DevSusOps</title>
      <dc:creator>adrian cockcroft</dc:creator>
      <pubDate>Thu, 24 Jun 2021 20:29:13 +0000</pubDate>
      <link>https://dev.to/aws/what-is-sustainability-transformation-32hi</link>
      <guid>https://dev.to/aws/what-is-sustainability-transformation-32hi</guid>
<description>&lt;p&gt;The transformation towards sustainable development practices is an emerging area. I learned a lot in my previous roles working for Amazon, and now that I'm working part time as a technology and strategy advisor, I'm planning to share some of the ideas and mental models needed to make sense of this from a developer perspective in a series of short posts here.&lt;/p&gt;

&lt;p&gt;I was part of the team that published the &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/sustainability-pillar/sustainability-pillar.html" rel="noopener noreferrer"&gt;Well Architected Pillar for Sustainability&lt;/a&gt;, which has detailed advice on how to optimize a workload to be more sustainable, and I'll incorporate some of this advice as I go along.&lt;/p&gt;

&lt;p&gt;To start with, most people are familiar with the phrase "Digital Transformation". It's become over-used and is a bit tired now. In essence though, it refers to the new businesses that were enabled by pervasive Internet and mobile connected customers, and the changes needed in old businesses to compete. We've had a decade or so to get used to this, so it's well understood, even if some of the laggards are still struggling to figure it out.&lt;/p&gt;

&lt;p&gt;On the other hand "Sustainability Transformation" is an emerging topic, poorly understood and with immature solutions to support it. It refers to the changes driven by our need to reduce the impact of business on the environment, including reducing greenhouse gas emissions, conserving clean water, and sending zero waste to landfill. The biggest of these is carbon dioxide reduction, as we need to move from extracting and burning fossil fuels to an economy based on renewable energy. This reaches throughout business operations, from the fuel burned to heat buildings and power vehicles (which is called scope 1), to the fuel used to generate the electricity we consume (which is called scope 2), to the energy used by things we buy, own and sell (which is called scope 3).&lt;/p&gt;

&lt;p&gt;The first problem is figuring out how much greenhouse gas is being generated. There are several different gases that matter, with different impacts. The main one is Carbon Dioxide, but Methane and CFC refrigerants are also a problem. They are combined together for measurement and reporting as Carbon Dioxide Equivalent - CO2e. The methodology published by the &lt;a href="//ghgprotocol.org"&gt;Greenhouse Gas Protocol&lt;/a&gt; is used to calculate carbon equivalents and provides detailed guidance on what to include in scope calculations.&lt;/p&gt;

&lt;p&gt;There are a few different ways to calculate carbon. &lt;/p&gt;

&lt;p&gt;The economic carbon intensity of a product or a business may be reported as the grams of CO2e per dollar spent, gCO2e/$, or metric tonnes of CO2e per million dollars, mtCO2e/$M. There are a million grams in a metric tonne, so the value is the same. Most companies start out by making an estimate of their carbon footprint using an Economic Input/Output (EIO) model that is based on the financial flows and uses industry average factors to relate spend to carbon. This is good enough for reporting, and to find out where most of the carbon is likely to be generated by the business. It's not useful for optimizing carbon reduction, because if you spend more on a low carbon energy source or raw material you will end up reporting more carbon not less.&lt;/p&gt;
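&lt;p&gt;That unit equivalence is easy to check with a line of arithmetic (the 120 is an illustrative number, not real data):&lt;/p&gt;

```python
# gCO2e per dollar and mtCO2e per million dollars are numerically equal:
# scaling dollars up by a million and grams up to tonnes (also a factor
# of a million) cancels out.
grams_per_dollar = 120.0
GRAMS_PER_TONNE = 1_000_000
DOLLARS_PER_MILLION = 1_000_000

tonnes_per_million_dollars = (
    grams_per_dollar * DOLLARS_PER_MILLION / GRAMS_PER_TONNE
)
print(tonnes_per_million_dollars)  # same number: 120.0
```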

&lt;p&gt;To get more accurate and actionable measurements, business processes need to be instrumented to calculate the carbon generated by raw material and product flows. For raw materials, carbon intensity is often reported as grams of CO2e per kilogram of the material - gCO2e/kg. For fuels that are burned this provides a more accurate way to estimate scope 1 than just basing it on the total amount spent on fuel.&lt;/p&gt;

&lt;p&gt;The carbon intensity of energy for scope 2 is measured as grams of CO2e per kWh. It depends on how and when that electricity was generated, and changes all the time. For example when it's sunny or windy, there's more solar and wind energy. The "grid mix" is usually reported by an energy supplier on an average monthly basis, however you have to wait for the bill, so an accurate scope 2 report will be delayed by a month or more.&lt;/p&gt;

&lt;p&gt;The embodied carbon from manufacturing is amortized over the lifetime of the item: gCO2e/year. This is part of scope 3. For example, if you use something like a mobile phone for longer, the gCO2e that was emitted to make it serves a useful purpose for longer.&lt;/p&gt;
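&lt;p&gt;The scope 2 and embodied-carbon arithmetic above can be sketched in a few lines (all the numbers here are made up for illustration):&lt;/p&gt;

```python
def scope2_kg(energy_kwh, grid_g_per_kwh):
    """Scope 2 emissions in kgCO2e: energy used times grid carbon intensity."""
    return energy_kwh * grid_g_per_kwh / 1000.0

def amortized_embodied_g_per_year(embodied_g, lifetime_years):
    """Embodied (scope 3) manufacturing carbon spread over the item's life."""
    return embodied_g / lifetime_years

print(scope2_kg(500, 400))  # 500 kWh at 400 gCO2e/kWh -> 200.0 kg
print(amortized_embodied_g_per_year(70_000, 2))  # a 70 kgCO2e phone kept 2 years
print(amortized_embodied_g_per_year(70_000, 4))  # kept 4 years: half per year
```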

&lt;p&gt;For a sustainability transformation, a business has to figure out how to measure its carbon footprint, and come up with a plan to change the way it powers everything, and change the products it's making, and even the markets that it operates in. That takes time, and I'll talk about timescales in my next post.&lt;/p&gt;

&lt;p&gt;From a developer perspective there are three main areas of interest. The &lt;strong&gt;first&lt;/strong&gt; is that most companies start with a manual spreadsheet based approach, but new disclosure regulations are driving the need to build some kind of data lake to report their carbon footprint, and to model their risk exposure to the impacts of climate change. The &lt;strong&gt;second&lt;/strong&gt; is that sustainability is becoming a product attribute of the things companies do and build, so it's turning up in design decisions. The &lt;strong&gt;third&lt;/strong&gt; is that the efficiency of the code we write and how we deploy it affects the carbon footprint of our IT systems. I call this &lt;a href="//twitter.com/DevSusOps"&gt;DevSusOps&lt;/a&gt;, adding sustainability concerns to development and operations.&lt;/p&gt;

&lt;p&gt;That's enough to start with, there's lots more to talk about, but I'd like to break up this discussion into a bunch of short posts. To learn more, a good readable document to study is this &lt;a href="https://ghgprotocol.org/sites/default/files/standards/ghg-protocol-revised.pdf" rel="noopener noreferrer"&gt;Greenhouse Gas Protocol paper&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;photo taken by Adrian at Point Lobos, California&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sustainability</category>
      <category>architecture</category>
      <category>cloud</category>
      <category>transformation</category>
    </item>
    <item>
      <title>If at first you don't get an answer...</title>
      <dc:creator>adrian cockcroft</dc:creator>
      <pubDate>Thu, 14 May 2020 23:23:20 +0000</pubDate>
      <link>https://dev.to/aws/if-at-first-you-don-t-get-an-answer-3e85</link>
      <guid>https://dev.to/aws/if-at-first-you-don-t-get-an-answer-3e85</guid>
<description>&lt;p&gt;give up quickly and ask someone else. That's my favorite microservices retry policy. To understand what works and doesn't work, we need to think about how distributed systems of services and databases interact with each other, not just a single call. As in my last post on &lt;a href="https://dev.to/aws/why-are-services-slow-sometimes-mn3"&gt;why systems are slow sometimes&lt;/a&gt;, I will start simple, pick out key points, and gradually explain the things that will kill your distributed system if you don't have timeouts and retries set up properly. It's also one of the easiest things to fix, once you understand what's going on, and it generally doesn't need new code, just configuration changes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A bad timeout and retry policy can break a distributed systems architecture, but is relatively easy to fix.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I often find the timeout and retry policy across an application is set to whatever the default is for the framework or sample code you copied when starting to write each microservice. If every microservice uses the same timeout, this is guaranteed to fail, because the microservices deeper in the system are still retrying when the microservices nearer the edge have given up waiting for them.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don't use the same default timeout settings across your distributed system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This causes an avalanche of extra calls, &lt;em&gt;work amplification&lt;/em&gt;, that is triggered when just one microservice or database deep in the system slows down for some reason. &lt;em&gt;I used to work on a system that fell over at 8pm every Monday night because the monolithic backend database backup was scheduled then, and database queries slowed down enough to trigger a retry storm that overloaded the application servers.&lt;/em&gt; The simplest way to reduce work amplification is to limit the number of retries to zero or one. I've seen defaults that are higher, and seen people react to seeing timeouts by increasing the number of retries, but this is just going to multiply the amount of extra work the system is trying to do when it's already struggling. Use zero retries in more deeply nested microservice architectures that will retry closer to the user.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Limit the number of retries to zero or one, to minimize &lt;em&gt;work amplification&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
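&lt;p&gt;To see why retry counts matter so much, here's a sketch of worst-case work amplification when every tier times out and retries (the tier counts are hypothetical):&lt;/p&gt;

```python
def worst_case_amplification(retries_per_tier):
    """Worst-case calls reaching the deepest tier when everything times out:
    each tier makes (retries + 1) attempts, and every attempt re-drives
    all the tiers below it, so the attempts multiply."""
    total = 1
    for retries in retries_per_tier:
        total *= retries + 1
    return total

print(worst_case_amplification([3, 3, 3]))  # 64 calls hit the back end
print(worst_case_amplification([1, 0, 0]))  # 2 -- one retry, near the edge only
```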

&lt;p&gt;This should be a fairly easy configuration change to implement, and a good checklist question for your inventory of microservices is how the timeout and retry settings are configured, and what is needed to change them. It's really nice if the settings can be changed dynamically and take effect in a few seconds. That provides a mechanism that may be able to bring a catatonic system back to life during an outage. Unfortunately in many cases the configuration is set once when the application starts, or could be hard coded in the application, or is so deep inside some imported library that no-one knows that a timeout setting exists, or how to change it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Discovering and documenting current settings and how to change timeouts and retries can be a pain, but it's worth it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many people don't realize that there are two different kinds of timeout, some coding platforms and libraries bundle them together, and others expose them separately. The best analogy is to consider the difference between sending a letter, and making a phone call. If you send a letter, with an RSVP, you wait for a response. Maybe after a timeout you send another letter. It's the simplest request/response mechanism, and &lt;a href="https://en.wikipedia.org/wiki/User_Datagram_Protocol" rel="noopener noreferrer"&gt;UDP based protocols&lt;/a&gt; like DNS work this way. A phone call is different because the first step is to make a connection by establishing a phone call. The other party has to pick up the phone and acknowledge that they can hear you, and you have to be sure you are talking to the right person. If they don't pick up, you retry the connection by calling again. Once the call is in progress, you make the request, and wait for the other party to respond. You can make several requests during a call. &lt;a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol" rel="noopener noreferrer"&gt;TCP based protocols&lt;/a&gt; which underlie many APIs, work this way. If the phone line goes dead, you have to call again, and establish a new connection.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Be clear about the difference between &lt;em&gt;connections&lt;/em&gt; and &lt;em&gt;requests&lt;/em&gt; for the protocols you are using.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;em&gt;connection timeout&lt;/em&gt; is how long you wait before giving up on getting a connection made to the next step in your request flow. The &lt;em&gt;request timeout&lt;/em&gt; is how long you wait for the rest of the entire flow to complete from that point. The common case is to bundle everything into the request at the application level, and have some lower level library handle the connections. Connections may be pre-setup during an initialization and authentication phase or occur when the first request is made. Subsequent application requests could cause an entire new connection to be made, or a &lt;em&gt;keep-alive&lt;/em&gt; option would keep it around to get a quicker response for the next request.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dig deep to find any hidden connection timeout and keep-alive settings.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How should you decide how to set the connection and request timeouts?&lt;br&gt;
While connection timeouts depend on the network latency to get to the service interface, request timeouts depend on the end-to-end time to call through many layers of the system. The closer to the end user, the longer the request timeout needs to be. Deeper into the system, request timeouts need to be shorter. For simple three-tier architectures like a front end web service, back end application service and database it's clear what order things are called in, and nested or &lt;em&gt;telescoped&lt;/em&gt; timeouts can be setup for each tier. The timeouts must be reduced at every step as you go from the edge to the back-end.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Edge timeouts must be long enough, and back end timeouts short enough, that the edge doesn't give up while the back end is still working.&lt;/p&gt;
&lt;/blockquote&gt;
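&lt;p&gt;A minimal sketch of telescoped timeouts, shrinking the request timeout by a fixed fraction at each tier (the 10 second edge budget and 0.7 factor are illustrative choices, not recommendations):&lt;/p&gt;

```python
def telescoped_timeouts_s(edge_timeout_s, tiers, shrink=0.7):
    """Assign each tier a request timeout that is a fixed fraction of
    the tier above it, so deeper tiers always give up first."""
    timeouts = []
    t = edge_timeout_s
    for _ in range(tiers):
        timeouts.append(round(t, 3))
        t *= shrink
    return timeouts

print(telescoped_timeouts_s(10.0, 3))  # [10.0, 7.0, 4.9] -- edge to back end
```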

&lt;p&gt;Web browsers and mobile apps often have a default timeout of many seconds before they give up. However, humans tend to hit the refresh button or go and look at something else if they are staring at a blank screen for too long. Some web site data says you should &lt;a href="https://www.hobo-web.co.uk/your-website-design-should-load-in-4-seconds/" rel="noopener noreferrer"&gt;target 2-4 seconds for page load time&lt;/a&gt;. However, if users are waiting in an app for something like a movie to start streaming, then they will wait a bit longer, maybe up to 10 seconds. If you have too many retries and long timeouts between your services, it's easy to add up to more than a few seconds, and while your system is still working to get a response, the end user will ignore the result as they have already given up and sent a new request. The best outcome is that your application returns before they retry manually, with a message that tells them you gave up, and asks if they want to try again. This minimizes the chance that you'll get flooded by work amplification.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For human interactions, if you can't respond within 10s, give up and respond with an error. Try to keep web page load times below 4s.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One question is what to do after giving up on a call. This can be problematic, as described in the Amazon Builders Library piece &lt;a href="https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/" rel="noopener noreferrer"&gt;Avoiding fallback in distributed systems&lt;/a&gt;, the fallback code to handle problems is often poorly tested, if at all. A common problem is that a system that times out calls elsewhere will crash, return bad or corrupted data, or freeze.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Injecting slow and failed responses from the dependencies of a service is an important chaos testing technique.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In my last post I talked about how end to end response time is made up of residence times for every step along the way, and those residence times are a combination of waiting time in a queue and service time getting the work done. During a timeout, there is an increase in wait time, and you can think of this as an additional timeout queue, where work is being parked that can't move forward. This can clog up systems by creating very long queues of stale requests. Processing them in order from the oldest first is often a bad policy, as you may never get to do work that someone is still actually waiting for. A better policy is to discard the entire queue of requests waiting for timeout when it hits a size limit. Each retry adds to service time, as the request is processed again.&lt;/p&gt;

&lt;p&gt;If there is only one instance of a service endpoint or database, you have no-where else to go. Attempting to connect to it repeatedly and quickly is unlikely to help it recover, so in this case you need to use a &lt;em&gt;back-off&lt;/em&gt; algorithm. This causes its own problems, so should only be used where there is no alternative. Some systems implement exponentially increasing back-off, but it's important to use &lt;em&gt;bounded exponential back-off&lt;/em&gt; with an upper limit, otherwise you are creating a big queue which will take too long to get connected, and upstream systems will already have given up waiting. This is also sometimes called &lt;em&gt;capped&lt;/em&gt; exponential back-off. Adding some randomness to the back-off is a better policy, as it will disperse the &lt;em&gt;thundering herd&lt;/em&gt; of clients that could end up backing off and returning in a synchronized manner. Deterministic jittered backoff is a way to give each instance a different back-off, but with a pattern that can be used as a signature for analysis purposes. &lt;a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/" rel="noopener noreferrer"&gt;Marc Brooker of AWS describes this technique in &lt;em&gt;Timeouts, Retries and Backoff with Jitter&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Only if you have to, use random or jittered back-off to disperse clients and avoid thundering herd behaviors.&lt;/p&gt;
&lt;/blockquote&gt;
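&lt;p&gt;A sketch of capped ("bounded") exponential back-off with full jitter, along the lines Marc Brooker describes; the base and cap values here are illustrative:&lt;/p&gt;

```python
import random

def backoff_delay_s(attempt, base_s=0.1, cap_s=5.0):
    """Delay before retry number `attempt`: a random duration up to
    min(cap, base * 2**attempt). Exponential growth, bounded by a cap,
    dispersed by full jitter so clients don't return in lock-step."""
    return random.random() * min(cap_s, base_s * (2 ** attempt))

random.seed(1)  # deterministic for the example only
delays = [backoff_delay_s(n) for n in range(10)]
print([round(d, 3) for d in delays])
```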

&lt;p&gt;Consider a deeply nested microservice architecture, where the request flow depends on what kind of request is being made, and the ideal request timeout ends up being flow dependent. If we can pass a timeout value through the chain of individual spans in the flow then dynamic timeouts are possible. One way to implement this is to pass a deadline timestamp through the flow. If you can't respond within the deadline, give up. This requires closely synchronized clocks across the system, and may be a useful technique for relatively large timeouts of tens of seconds to minutes. Some batch management systems implement deadline scheduling for long running jobs. For short latency microservice architectures a better policy would be to decrement a latency budget for each span in the flow, and give up when there's not enough budget left to get a result, but there is still time to respond back up and report the failure cleanly before the user times out at the edge. The question is how much latency to deduct at each stage, and to solve this properly would need some extra instrumentation based on analysis of flows through the system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Dynamic request timeout policies are needed for deeply nested microservices architectures with complex flows.&lt;/p&gt;
&lt;/blockquote&gt;
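&lt;p&gt;A sketch of the latency-budget idea: each span deducts its estimated cost, and a call gives up while there's still enough budget left to report failure cleanly (the margin and per-span costs are hypothetical):&lt;/p&gt;

```python
def remaining_budget_ms(budget_ms, span_cost_ms, respond_margin_ms=20):
    """Deduct a span's cost from the request's latency budget. Return None
    (give up now, while a clean failure can still be reported upstream)
    when the remainder would not cover the response margin."""
    rest = budget_ms - span_cost_ms
    return None if rest <= respond_margin_ms else rest

budget = 500
for cost in (120, 80, 250):  # estimated per-span costs along the flow
    budget = remaining_budget_ms(budget, cost)
    print(budget)  # 380, 300, 50 -- still enough to proceed each time
```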

&lt;p&gt;Connection timeouts depend on the number of &lt;em&gt;round trips&lt;/em&gt; needed to setup a connection (at least two, more for secure connections like HTTPS), and network latency to the next service in line, so they should be very short: a small multiple of the latency itself. Within a datacenter or AWS availability zone, round trip network latency is generally around a millisecond, and a few milliseconds between AWS availability zones. Sometimes people talk about one-way latency numbers, so be clear what is being discussed, as a round trip is twice as long.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use a short timeout for connections between co-located microservices, whether direct, or via a service mesh proxy like &lt;a href="https://www.envoyproxy.io/" rel="noopener noreferrer"&gt;Envoy&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.cloudping.co/grid" rel="noopener noreferrer"&gt;Round trip latency between AWS regions&lt;/a&gt; in the same continent is likely to be in the range 10-100ms, and between continents, a few hundred milliseconds. Calls to third party APIs or your own remote services will need a much higher timeout setting. Your measurements will vary, but will be limited by signals propagating at less than the speed of light, about 300,000 km/s, which is about 150 km round trip per millisecond. Processing, queueing and routing the network packets etc. will always make it much slower than this, but unfortunately the speed of light is not increasing so this is a fundamental latency limit!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Use a longer timeout for connections between AWS regions, and calls to third party APIs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you set your connection timeout to be shorter than the round trip, it will fail continuously, so it's common to use a high value. This works fine until there's a problem, then it magnifies the problem, and causes requests to timeout when they could be succeeding over a different connection. It's much better to fail fast and give up quickly on a connection that isn't going to work out. It would be ideal to have the connection timeout be adaptive: start out large, then shrink to fit the actual network latency situation. TCP does something like this for congestion control once connections are setup, but I don't know of any platforms that learn and adapt to latency, so I invite readers to let me know in the comments if any do, and I'll update this post.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Set connection timeouts to be about 10x the round trip latency, if you can...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The default connection retry policy is to just repeat the call. Like phoning someone over and over again on the same number. However, if that person has a home number and a mobile number, you might try the other number after the first failure, and increase your chance of getting through. It's common to have horizontally scaled services, and if you've built a stateless microservices architecture (that doesn't use sessions to route traffic to locally cached state), you should try to connect to a different instance of the same service, as your default connection retry policy. Unfortunately many microservices frameworks don't have this option, but it was built into the Netflix open source service mesh released in 2012: NetflixOSS Ribbon and the Eureka service registry implemented a Java based service mesh. Envoy implements a language agnostic "side-car" process based service mesh with a &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/faq/configuration/timeouts#http-grpc" rel="noopener noreferrer"&gt;configurable connection timeout&lt;/a&gt;, and &lt;a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#config-http-filters-router" rel="noopener noreferrer"&gt;defaults retries to one, with bounded and jittered backoff&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you have horizontally scaled services, don't retry connections to the same one that just failed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's a high probability that an instance of a microservice that failed to connect will fail again for the same reason. It could be overloaded, out of threads, or doing a long garbage collection. If an instance is not listening for connections because its application process has crashed or it has hit a thread or connection limit, you get a fast fail and a specific error return, as the connection will be refused immediately. It's like an unobtainable number error for a phone call. Immediately calling again in this case is clearly pointless.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fast failures are good, and don't retry to the same instance immediately if at all possible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've seen some microservice based applications behaving in a fragile or brittle manner, that collapsed when conditions weren't perfect. However after fixing their timeout and retry policies along these lines they became stable and robust, absorbing a lot of problems so that customers saw a slightly degraded rather than an offline service. I hope you find these ideas and references useful for improving your own service robustness.&lt;/p&gt;

&lt;p&gt;Thanks to Seth and Jim at Amazon for feedback, corrections and clarifications.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Photo taken by Adrian in March 2019 at the Roman Amphitheater, Mérida, Spain. A nice example of a low latency, high fan-out, communication system&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>microservices</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Why are services slow sometimes?</title>
      <dc:creator>adrian cockcroft</dc:creator>
      <pubDate>Wed, 29 Apr 2020 19:53:46 +0000</pubDate>
      <link>https://dev.to/aws/why-are-services-slow-sometimes-mn3</link>
      <guid>https://dev.to/aws/why-are-services-slow-sometimes-mn3</guid>
      <description>&lt;p&gt;You've built a service, you call it, it does something and returns a result, but how long does it take, and why does it take longer than your users would like some of the time? In this post I'll start with the basics and gradually introduce standardized terminology and things that make the answer to this question more complicated, while highlighting key points to know.&lt;/p&gt;

&lt;p&gt;To start with, we need a way to measure how long it takes, and to understand two fundamentally different points of view. If we measure the experience as an outside user calling the service, we measure how long it takes to respond. If we instrument our code to measure the requests from start to finish as they run, we're only measuring the service. This leads to the first key point: people get sloppy with terminology and often aren't clear about where they got their measurements.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Be careful to measure &lt;em&gt;Response Time&lt;/em&gt; at the user, and &lt;em&gt;Service Time&lt;/em&gt; at the service itself.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For real world examples, there are many steps in a process, and each step takes time. The time for each step is the &lt;em&gt;Residence&lt;/em&gt; time, and consists of some &lt;em&gt;Wait Time&lt;/em&gt; and some &lt;em&gt;Service Time&lt;/em&gt;. As an example, a user launches an app on their iPhone, and it calls a web service to authenticate the user. Why might that be slow sometimes? The amount of actual service time work required to generate the request in the phone, transmit it to a web service, look up the user, return the result and display the next step should be pretty much the same every time. The variation in response time is driven by waiting in line (&lt;em&gt;Queueing&lt;/em&gt;) for a resource that is also processing other requests. Network transmission from the iPhone to the authentication server is over many hops, and before each hop is a queue of packets waiting in line to be sent. If the queue is empty or short, the response will be quick, and if the queue is long, the response will be slow. When the request arrives at the server, it's also in a queue waiting for the CPU to start processing that request, and if a database lookup is needed, there's another queue to get that done.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Waiting in a &lt;em&gt;Queue&lt;/em&gt; is the main reason why &lt;em&gt;Response Time&lt;/em&gt; increases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most of the instrumentation we get from monitoring tools consists of measures of how often something completed, which we call &lt;em&gt;Throughput&lt;/em&gt;. In some cases we also have a measure of incoming work, the &lt;em&gt;Arrival Rate&lt;/em&gt;. For a simple case like a web service with a &lt;em&gt;Steady state&lt;/em&gt; workload, where one request results in one response, both are the same. However, retries and errors will increase arrivals without increasing throughput, rapidly changing workloads or very long requests like batch jobs will see temporary imbalances between arrivals and completed throughput, and it's possible to construct more complex request patterns.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Throughput&lt;/em&gt; is the number of completed successful requests. Look to see if &lt;em&gt;Arrival Rate&lt;/em&gt; is different, and to be sure you know what is actually being measured.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While it's possible to measure a single request flowing through the system using a tracing mechanism like &lt;a href="https://zipkin.io/" rel="noopener noreferrer"&gt;Zipkin&lt;/a&gt; or &lt;a href="https://aws.amazon.com/xray/" rel="noopener noreferrer"&gt;AWS X-Ray&lt;/a&gt;, this discussion is about how to think about the effect of large numbers of requests, and how they interact with each other. The average behavior is measured over a fixed time interval, which could be a second, minute, hour or day. There needs to be enough data to average together, and without going into the theory, a rule of thumb is that there should be at least 20 data points in an average.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For infrequent requests, pick a time period that averages at least 20 requests together to get useful measurements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If the time period is too coarse, it hides the variation in workloads. For example, for a video conferencing system, measuring hourly call rates will miss the fact that most calls start around the first minute of the hour, making it easy to get a peak that overloads the system, so per-second measurements are more appropriate.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For spiky workloads, use high resolution one-second average measurements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Monitoring tools vary, however it's rare to get direct measurements of how long the line is in the various queues. It's also not always obvious how much &lt;em&gt;Concurrency&lt;/em&gt; is available to process the queues. For most networks one packet at a time is in transit, but for CPUs each core or vCPU works in parallel to process the run queue. For databases, there is often a fixed maximum number of connections to clients which limits concurrency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For each step in the processing of a request, record or estimate the &lt;em&gt;Concurrency&lt;/em&gt; being used to process it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If we think about a system running in a steady state, with a stable average throughput and response time, then we can estimate the queue length simply by multiplying the throughput and the residence time. This is known as &lt;em&gt;Little's Law&lt;/em&gt;. It's very simple, and is often used by monitoring tools to generate queue length estimates, but it's only true for steady state averages of randomly arriving work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Little's Law&lt;/em&gt;: Average Queue = Average Throughput * Average Residence&lt;/p&gt;
&lt;/blockquote&gt;
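Plugging illustrative numbers into Little's Law (the throughput and residence figures here are made up):

```python
def littles_law_queue(avg_throughput_per_sec, avg_residence_sec):
    """Little's Law: average number in the system equals average
    completion rate times average time spent in the system.
    Only valid for steady-state averages of randomly arriving work."""
    return avg_throughput_per_sec * avg_residence_sec

# A service completing 200 requests/sec with 50ms average residence
# holds about 200 * 0.05 = 10 requests in flight on average.
print(littles_law_queue(200, 0.05))
```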

&lt;p&gt;To understand why this works, and when it doesn't, it's important to understand how work arrives at a service and what determines the gap between requests. If you are running a very simple performance test in a loop, the gap between requests is constant: Little's Law doesn't apply, queues will be short, and your test isn't very realistic unless you are trying to simulate a conveyor belt like situation. It's a common mistake to do this kind of test successfully, then put the service in production and watch it get slow or fall over at a much lower throughput than the test workload.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Constant rate loop tests don't generate queues, they simulate a conveyor belt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For real world Internet traffic, with many independent users who aren't coordinating and make single requests, the intervals between requests are random. So you want a load generator that randomizes the wait time between requests. Most systems do this with a uniform random distribution, which is better than a conveyor belt, but isn't correct. To simulate web traffic, and for Little's Law to apply, you need to use a &lt;em&gt;negative exponential distribution&lt;/em&gt;, as described in this &lt;a href="http://perfdynamics.blogspot.com/2012/05/load-testing-with-uniform-vs.html" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; by &lt;a href="https://twitter.com/DrQz" rel="noopener noreferrer"&gt;Dr Neil Gunther&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The proper random think time calculation is needed to generate more realistic queues.&lt;/p&gt;
&lt;/blockquote&gt;
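A minimal sketch of generating such think times, using the Python standard library's `random.expovariate` (note it takes a rate, 1/mean, rather than the mean itself):

```python
import random

def think_time(mean_seconds):
    """Draw an exponentially distributed think time, so that simulated
    arrivals form a Poisson process rather than a conveyor belt."""
    return random.expovariate(1.0 / mean_seconds)

# Sanity check: the sample mean approaches the requested mean.
samples = [think_time(2.0) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 2.0
```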

&lt;p&gt;However, it gets worse. It turns out that network traffic is not randomly distributed, it comes in bursts, and those bursts come in clumps. Think of what actually happens when a user starts an iPhone app: it doesn't make one request, it makes a burst of requests. In addition, users taking part in a flash sale will be synchronized to visit their app at around the same time, causing a clump of bursts of traffic. The &lt;a href="https://perfdynamics.blogspot.com/2010/05/load-testing-think-time-distributions.html" rel="noopener noreferrer"&gt;distribution is known as Pareto or hyperbolic&lt;/a&gt;. In addition, when networks reconfigure themselves, traffic is delayed while a queue builds up, which then floods the downstream systems with a thundering herd. Update - there's &lt;a href="https://github.com/DrQz/web-generator-toolkit" rel="noopener noreferrer"&gt;some useful work by Jim Brady and Neil Gunther&lt;/a&gt; on how to configure load testing tools to be more realistic, and a &lt;a href="https://arxiv.org/abs/1809.10663" rel="noopener noreferrer"&gt;CMG2019 paper by Jim Brady&lt;/a&gt; on measuring how well your test loads are behaving.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Truly real world workloads are more bursty and will have higher queues and long tail response times than common load test tools generate by default.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You should expect queues and response times to vary and to have a long tail of a few extremely slow requests even at the best of times when average utilization is low. So what happens as one of the steps in the process starts to get busy? For processing steps which don't have available concurrency (like a network transmission), as the utilization increases, the probability that requests contend with each other increases, and so does the residence time. The rule of thumb for networks is that they gradually start to get slow around 50-70% utilization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Plan to keep network utilization below 50% for good latency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Utilization is also &lt;a href="http://www.hpts.ws/papers/2007/Cockcroft_CMG06-utilization.pdf" rel="noopener noreferrer"&gt;problematic and measurements can be misleading&lt;/a&gt;, but it's defined as the proportion of time something is busy. For CPUs, where more executions happen in parallel, the slowdown starts at higher utilization, but kicks in harder and can surprise you. This makes intuitive sense if you think about the last available CPU as the point of contention. For example, if there are 16 vCPUs, the last CPU is the last 6.25% of capacity, so residence time kicks up sharply around 93.75% utilization. For a 100 vCPU system, it kicks up around 99% utilization. The formula that approximates this behavior for randomly arriving requests in steady state (the same conditions as apply for Little's Law) is R = S/(1-U^N).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Inflation of average residence time as utilization increases is reduced in multi-processor systems but "hits the wall" harder.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unpicking this, take the average utilization as a proportion, not a percentage, and raise it to the power of the number of processors. Subtract from 1, and divide into the average service time to get an estimate of the average residence time. If average utilization is low, dividing by a number near 1 means that average residence time is nearly the same as the average service time. For a network that has concurrency of N=1, 70% average utilization means that we are dividing by 0.3, and average residence time is about three times higher than at low utilization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Rule of thumb is to keep inflation of average residence time below 2-3x throughout the system to maintain a good average user visible response time. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a 16 vCPU system at 95% average utilization, 0.95^16 = 0.44 and we are dividing by 0.56, which roughly doubles average residence time. At 98% average utilization 0.98^16=0.72 and we are dividing by 0.28, so average residence time goes from acceptable to slow for only a 3% increase in average utilization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The problem with running a multiprocessor system at high average utilization is that small changes in workload levels have increasingly large impacts.&lt;/p&gt;
&lt;/blockquote&gt;
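The figures above can be reproduced from the approximation R = S/(1 - U^N), with service time normalized to 1 (a sketch, valid only under the steady-state random-arrival assumptions stated earlier):

```python
def residence_inflation(utilization, processors):
    """Approximate ratio of residence time to service time,
    R/S = 1 / (1 - U**N), for steady-state random arrivals."""
    return 1.0 / (1.0 - utilization ** processors)

# A network link (N=1) at 70% busy is roughly 3x slower than when idle.
print(residence_inflation(0.70, 1))
# 16 vCPUs: roughly 2x at 95% utilization, but over 3.5x at 98%.
print(residence_inflation(0.95, 16))
print(residence_inflation(0.98, 16))
```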

&lt;p&gt;There is a standard Unix/Linux metric called &lt;em&gt;Load Average&lt;/em&gt; which is poorly understood and has several problems. For Unix systems including Solaris/AIX/HPUX it records the number of operating system threads that are running and waiting to run on the CPU. For &lt;a href="https://perfcap.blogspot.com/2007/04/load-average-differences-between.html" rel="noopener noreferrer"&gt;Linux it also includes operating system threads blocked waiting for disk I/O&lt;/a&gt;. It then maintains three time-decayed values over &lt;a href="http://perfdynamics.blogspot.com/2007/04/how-long-should-queue-be.html" rel="noopener noreferrer"&gt;1 minute, 5 minutes and 15 minutes&lt;/a&gt;. The first thing to understand is that the metric dates back to single-CPU systems in the 1960s, and I would always divide the load average value by the number of vCPUs to get a measure that is comparable across systems. The second is that it's not measured across a fixed time interval like other metrics, so it's not the same kind of average, and it builds in a delayed response. The third is that the Linux implementation is a bug that has become institutionalized as a feature, and inflates the result. It's just a terrible metric to monitor with an alert or feed to an autoscaling algorithm.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Load Average&lt;/em&gt; doesn't measure load, and isn't an average. Best ignored.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If a system is over-run, and more work arrives than can be processed, we reach 100% utilization, and the formula has a divide by zero which predicts infinite residence time. In practice it's worse than this, because when the system becomes slow, the first thing that happens is that upstream users of the system retry their requests, which magnifies the amount of work to be done and causes a &lt;em&gt;retry storm&lt;/em&gt;. The system will have a long queue of work, and go into a catatonic state where it's not responding.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Systems that hit a sustained average utilization of 100% will become unresponsive, and build up large queues of work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When I look at how timeouts and retries are configured, I often find too many retries and timeouts that are far too long. This increases &lt;em&gt;work amplification&lt;/em&gt; and makes a retry storm more likely. &lt;a href="https://www.slideshare.net/adriancockcroft/evolution-of-microservices-craft-conference/29" rel="noopener noreferrer"&gt;I've talked about this in depth in the past&lt;/a&gt;, and have &lt;a href="https://dev.to/aws/if-at-first-you-don-t-get-an-answer-3e85"&gt;added a new post on this subject&lt;/a&gt;. The best strategy is a short timeout with a single retry, if possible on a different connection that goes to a different instance of the service.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Timeouts should never be set the same across a system, they need to be much longer at the front end edge, and much shorter deep inside the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The usual operator response is to clear an overloaded queue by rebooting the system, but a well designed system will limit its queues and shed work by silently dropping or generating fast-fail responses to incoming requests. Databases and other services with a fixed maximum number of connections behave like this. When you can't get another connection to make a request, you get a fast fail response. If the connection limit is set too low, the database will reject work that it has capacity to process, and if it's set too high, the database will slow down too much before it rejects more incoming work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Think about how to shed incoming work if the system reaches 100% average utilization, and what configuration limits you should set.&lt;/p&gt;
&lt;/blockquote&gt;
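One way to sketch this kind of fast-fail admission control in Python (illustrative; `ConnectionLimiter` and the "503 shed" response are invented names, not a specific framework's API):

```python
import threading

class ConnectionLimiter:
    """Fast-fail admission control: reject work beyond a fixed
    concurrency limit instead of letting a queue grow without bound."""
    def __init__(self, max_connections):
        self._slots = threading.Semaphore(max_connections)

    def try_handle(self, handler, request):
        # Non-blocking acquire: if no slot is free, shed the request
        # immediately rather than queueing it.
        if not self._slots.acquire(blocking=False):
            return "503 shed"     # fast fail: caller can retry elsewhere
        try:
            return handler(request)
        finally:
            self._slots.release()
```

Tuning `max_connections` is the same trade-off the text describes: too low rejects work the system could handle, too high lets it slow down before shedding.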

&lt;p&gt;The best way to maintain a good response time under extreme conditions is to fail fast and shed incoming load. Even under normal conditions, most real world systems have a long tail of slow response times. However with the right measurements we can manage and anticipate problems, and with careful design and testing it's possible to build systems that &lt;a href="https://aws.amazon.com/blogs/database/improving-business-continuity-with-amazon-aurora-global-database/" rel="noopener noreferrer"&gt;manage their maximum response time&lt;/a&gt; to user requests.&lt;/p&gt;

&lt;p&gt;To dig deeper into this topic, there's a lot of good information in &lt;a href="https://twitter.com/DrQz" rel="noopener noreferrer"&gt;Neil Gunther's&lt;/a&gt; blog, books, and training classes. I worked with Neil to develop a summer school performance training class that was hosted at Stanford in the late 1990s, and he's been running events ever since. For me, co-presenting with Neil was a formative experience, a deep dive into queueing theory that really solidified my mental models of how systems behave.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Picture of lines of waiting barrels taken by Adrian at a wine cellar in Bordeaux.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Scaling AWS costs to match the business</title>
      <dc:creator>adrian cockcroft</dc:creator>
      <pubDate>Tue, 21 Apr 2020 17:05:37 +0000</pubDate>
      <link>https://dev.to/aws/scaling-aws-costs-to-match-the-business-f9k</link>
      <guid>https://dev.to/aws/scaling-aws-costs-to-match-the-business-f9k</guid>
      <description>&lt;p&gt;I recently wrote a &lt;a href="https://medium.com/@adrianco/cloud-native-cost-optimization-f379c2f623e9" rel="noopener noreferrer"&gt;Medium post on cloud native cost optimization&lt;/a&gt;, in part to help customers who are currently dealing with rapid large and unexpected changes in their businesses due to the impact of COVID-19. Out of the ensuing discussion, a few things emerged. One is that I should try out dev.to for developer oriented posts, so this is my first post here. Another is that there's benefits and challenges in generating a metric that reports AWS cost per unit of business, so that's the subject of this discussion.&lt;/p&gt;

&lt;p&gt;The first challenge is to decide what your business does, and whether there is a dominant metric that measures the value you provide to customers. I was at Netflix in 2011 when we started to build tooling to optimize our AWS spend, and Netflix has a very focused business model and measured customer value as the number of "streaming starts per second" (SPS). i.e. The rate at which people decide to start watching a show on Netflix.&lt;/p&gt;

&lt;p&gt;We also had an AWS deployment model which tagged and attributed all the entities we created on AWS back to individuals and teams, and produced detailed billing on an hourly basis. Starting with a total AWS cost, dividing by SPS produced an hourly average total "cloud cost of value". Digging in further, the cost could be broken down by production delivery vs. test and development vs. data science vs. movie encoding etc. and individual teams were sent a weekly report showing their own share of the total cost, and how it was trending.&lt;/p&gt;
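As an illustrative calculation of such a "cloud cost of value" metric (the dollar and rate figures are invented, and this is not Netflix's actual tooling):

```python
def cost_per_unit(hourly_cloud_cost, units_per_second):
    """Cloud cost of value: total spend for the hour divided by the
    number of business-metric units delivered in that hour."""
    units_per_hour = units_per_second * 3600
    return hourly_cloud_cost / units_per_hour

# e.g. $5,000/hour of AWS spend while serving 1,000 streaming starts/second
print(cost_per_unit(5000, 1000))  # dollars per streaming start
```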

&lt;p&gt;I've found that many customers don't have good tagging and attribution set up, and so find it hard to work out what is driving their AWS bill. The first step is to come up with a percentage attribution metric for the bill, and drive it to cover most of the spend. I'd focus on this as a priority until it's in the 70-90% range, then clean up the rest over time.&lt;/p&gt;

&lt;p&gt;For a more typical complex and diverse business, with many points of value delivery, the trick is to pick a dominant expense that scales with customer activity. One of the travel industry customers I've been working with did this, and used the metric to drive a cost reduction program over the last nine months. Amongst many other optimizations they implemented autoscaling to drive up average utilization. When COVID-19 hit, their customer traffic dropped, and their autoscalers maintained high utilization on a smaller footprint, so their AWS bill for that workload automatically reduced.&lt;/p&gt;

&lt;p&gt;AWS is working with many customers and partners to help cost optimize in these uncertain times. After my &lt;a href="https://medium.com/@adrianco/cloud-native-cost-optimization-f379c2f623e9" rel="noopener noreferrer"&gt;Medium post&lt;/a&gt; Erik Peterson of CloudZero reached out to me to discuss their product which implements automatic tagging and allocation of metrics to help SaaS engineering teams continuously optimize their AWS spend. If this sounds interesting CloudZero Inc and AWS are offering through the end of May &lt;a href="//www.cloudzero.com/marketplacepromo"&gt;a 20% discount, $1k credit and a free 30-day trial with upfront waste assessment&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The original &lt;a href="https://github.com/Teevity/ice" rel="noopener noreferrer"&gt;NetflixOSS tool that implemented this was called Ice&lt;/a&gt;. Netflix outgrew it and passed it over to &lt;a href="https://www.teevity.com/" rel="noopener noreferrer"&gt;Teevity&lt;/a&gt;, who maintain their own version.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cost</category>
      <category>netflixoss</category>
      <category>cloudzero</category>
    </item>
  </channel>
</rss>
