Bobby Neelon
ERCOT vs. “The Cloud”: Designing for High Availability

Disclaimer

I am not an energy expert. While I have a lot of experience in the Upstream Oil and Gas industry, I’m not nearly as familiar with the nuances of the other portions of the energy value chain.

Also, I still have a lot to learn about the cloud, but as an AWS Certified Solutions Architect, I believe I am qualified to discuss this topic, as it is very high-level.

Background

As many of you are aware, a massive arctic blast/winter storm swept through much of the United States this past week and caught many in the south, especially my home state of Texas, with their proverbial “pants down.” Across my social media feeds there was plenty of blame being passed around to the tune of:

“[Insert the energy source my confirmation bias tells me I should hate] let us down!”

With all that came plenty of screenshots from the EIA’s Electric Grid Monitor Dashboard.

[Chart: EIA Electric Grid Monitor — electricity generation by energy source]

Proponents of renewables pointed out how Natural Gas “failed” us on and after February 15th, and proponents of fossil fuels pointed out how wind and solar power generation practically “disappeared” as the arctic blast blew in.

As with many things in life, it's not a zero-sum game, and identifying the true culprit is much more nuanced. Thus, it is not my intent to do that with this post. What I can tell you is that there was definitely a failure, somewhere in the value chain, to account for the demand put on the system.

With that in mind, and after looking at charts like the one above, it took me back to my AWS Solutions Architect Training where we learned about designing for high availability through auto-scaling virtual-machines.

Below, I will outline how cloud infrastructure is typically designed and how it compares to that of electricity generation systems.

On-Premise Server Deployment

In a traditional, on-premise server deployment, IT must understand and forecast not just the typical but also the maximum demand/load that will be placed on their infrastructure and procure resources to effectively handle those demands ahead of time.

In the simplest terms, it would look like this:

[Diagram: traditional on-premise deployment — fixed capacity provisioned above forecast peak load]

You have enough compute and storage to handle whatever foreseen load will be put on the system.

That’s great and all, but there are a couple issues with this approach:

  • Total available resources must be over-provisioned, sometimes extremely so, leading to unused capacity and unnecessary costs.

  • If forecasts are wrong or unforeseen events occur, demand can outpace the provisioned resources, and you end up with downtime for end users and more work for IT.
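
The trade-off above can be made concrete with a toy calculation. All numbers here are hypothetical, chosen only to illustrate the pattern: capacity sized for forecast peak load means most of it sits idle most of the time, and any demand above that fixed ceiling becomes downtime.

```python
# Hypothetical hourly demand for an on-premise deployment (arbitrary load units).
hourly_demand = [40, 35, 30, 55, 80, 120, 95, 60]

# On-prem capacity must be provisioned ahead of time, above the forecast peak.
provisioned_capacity = 130

# Average utilization of the fixed fleet: everything above this is paid-for idle capacity.
avg_utilization = sum(hourly_demand) / len(hourly_demand) / provisioned_capacity

# Any hour where demand exceeds the fixed capacity is downtime for end users.
downtime_hours = sum(1 for d in hourly_demand if d > provisioned_capacity)

print(f"Average utilization: {avg_utilization:.0%}")
print(f"Hours of downtime: {downtime_hours}")
```

With these made-up numbers, roughly half the provisioned capacity is wasted on an average hour, yet a single unforecast spike above 130 would still take the system down.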

The Cloud and Scalability

With regard to virtual machines, most major cloud providers offer three core pricing models/instance types:

On-Demand

These instances are “Pay-as-you-go” (usually per second billing). They are typically the most expensive but also the most flexible option. You can literally turn them on and off and only pay for the server instance type you chose and time you used.

Reserved

These instances are paid for ahead of time. Most major providers let you commit to a 1- or 3-year term, which can result in up to 70%+ cost savings vs. On-Demand instances. Reserved instances can minimize risk, make budgets more predictable, and satisfy policies that require longer-term commitments, but they are also the most rigid: you cannot change the specs once they are procured.

Spot

With these instances, you can bid for unused capacity in a cloud vendor's data center. You can save up to 90% of the cost when compared to On-Demand instances. However, if someone else bids higher than you, your instance will be taken away. Thus, while cheaper, Spot instances are inherently more unreliable.
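
To see how those discounts stack up over a year, here is a quick sketch using a made-up On-Demand rate (real prices vary by provider, region, and instance size; the 70%/90% discounts are the rough figures quoted above, not any vendor's actual pricing):

```python
# Hypothetical hourly price for one instance size (assumption for illustration).
on_demand_rate = 0.10

# Rough discounts from the tiers described above.
reserved_rate = on_demand_rate * 0.30  # ~70% savings, 1- or 3-year commitment
spot_rate = on_demand_rate * 0.10      # ~90% savings, can be reclaimed at any time

hours_per_year = 24 * 365
for name, rate in [("On-Demand", on_demand_rate),
                   ("Reserved", reserved_rate),
                   ("Spot", spot_rate)]:
    print(f"{name}: ${rate * hours_per_year:,.0f}/yr for one always-on instance")
```

The point isn't the dollar amounts (which are invented) but the ordering: you pay a premium for flexibility, less for commitment, and least for capacity that can vanish underneath you.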

Auto Scaling

A key advantage of a cloud-based infrastructure is how quickly one can respond to changes in resource needs. With Auto Scaling in the cloud, developers/IT can adjust resources automatically using APIs, scripts, and Auto Scaling groups driven by alarms, rules, schedules, etc.

There are two main types of server scaling.

  • Horizontal — Increase or decrease the count of smaller-sized servers to handle the varying load. This is generally handled with On-Demand instances and then capped off with Spot instances to handle "fringy" workload spikes.

  • Vertical — Increase the compute and/or storage of an existing instance to handle the varying load. Sometimes this is the only option (see AWS RDS instances, for example).

There are always exceptions, but generally, horizontal scaling is the preferred method in the cloud when available.
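
The core of a horizontal scaler is simple enough to sketch in a few lines. This is a minimal illustration of the decision logic, not any provider's actual API; the target, floor, and ceiling values are assumptions:

```python
import math

def desired_instance_count(current_load, target_load_per_instance,
                           min_instances=2, max_instances=20):
    """Keep average load per server near a target by scaling the fleet size.

    The floor guards availability (never fewer than min_instances),
    and the ceiling guards cost (never more than max_instances).
    """
    needed = math.ceil(current_load / target_load_per_instance)
    return max(min_instances, min(max_instances, needed))

print(desired_instance_count(450, 100))  # load spike -> scale out to 5
print(desired_instance_count(50, 100))   # quiet period -> floor of 2 holds
```

Real Auto Scaling groups layer cooldowns, health checks, and metric smoothing on top of this, but the shape of the decision is the same: measure load, divide by per-instance capacity, clamp to sane bounds.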

Designing for High Availability

As opposed to the traditional, on-premise workflow, one does not need to over-provision ahead of time in the cloud (that's not to say it doesn't happen, because it does, all the time!).

While not perfect, below is a quick mockup of how one would design a cloud VM infrastructure to handle spikes in demand while not getting gouged by costly On-Demand Instances.

[Diagram: mockup of a cloud VM infrastructure — a Reserved Instance base with On-Demand and Spot instances scaling to cover demand spikes]

Assuming a reasonable amount of monitoring history, infrastructure teams should have a good idea of the minimum resources needed to run their applications on a daily basis. For that baseline, you would not want to pay full price for On-Demand instances when you can save up to 70% using Reserved Instances. Thus, it's advantageous to build a "base" infrastructure with Reserved Instances.

Then, the spikes in demand can be managed by the aforementioned Auto-Scaling which uses a mix of On-Demand Instances and sometimes Spot Instances.
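
That layered design can be sketched as a simple allocation function. The base size, the On-Demand cap, and the demand numbers are all assumptions for illustration; the point is the order in which the tiers absorb load:

```python
def allocate(demand, reserved_base=100, on_demand_cap=50):
    """Split demand across pricing tiers, cheapest committed capacity first.

    Reserved Instances absorb the steady base, On-Demand absorbs routine
    spikes up to a cap, and Spot soaks up whatever "fringy" overflow remains.
    """
    reserved = min(demand, reserved_base)
    on_demand = min(max(demand - reserved_base, 0), on_demand_cap)
    spot = max(demand - reserved_base - on_demand_cap, 0)
    return {"reserved": reserved, "on_demand": on_demand, "spot": spot}

print(allocate(80))   # a quiet day fits entirely in the Reserved base
print(allocate(180))  # a spike spills into On-Demand, then Spot
```

Of course, leaning on Spot for the overflow means accepting that the overflow capacity can be reclaimed, which is exactly the trade-off explored in the grid comparison below.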

How Are the Two Related?

You may be asking yourself at this point, how are electricity generation systems and cloud computing related? Why am I drawing this comparison?

To quote Werner Vogels, CTO, Amazon.com:

“Everything fails, all the time.”

By this he means that we should design our systems with failure in mind and work backwards from it.

Well, from the data, it looks like the ERCOT energy grid (and most others) is built in much the same way! You have a mix of power options that come with varying degrees of availability, durability, pricing, and reliability that must be synthesized into a performant system.

The goal of most systems is to be as efficient as possible where efficiency is the ability to provide the best quality of service in the most affordable manner.

I will say, the comparisons I make below are primarily focused on reliability and durability of the correlating systems, as opposed to pricing. There are some parallels, and I’ll mention them where necessary, but pricing is definitely not one-to-one.

[Chart: ERCOT electricity generation over time, broken out by energy source — Nuclear, Coal, Natural Gas, Wind, Solar]

Nuclear ≈ 3-year Reserved Instances

From the chart above, Nuclear is clearly a very reliable energy source. It doesn't need to scale and, as I understand it, can't really be scaled up easily. You have what you have. It's efficient and clean and handles what it's asked to handle.

Coal ≈ 1-year Reserved Instances

Coal, while certainly "spikier" than nuclear, is similar in that it holds a fairly consistent wedge of electricity generation throughout the year and is only mildly scalable.

For all intents and purposes, you can count on Nuclear + Coal to account for about 350,000 MWh or so of electricity generation each day.

Natural Gas ≈ On-Demand Instances

This is where it gets more interesting. As you can see in the data, Natural Gas flexes its scalability muscles at different points in the year. Around February 8th, it went from 178,512 MWh to 899,328 MWh in a week: an increase of over 400%! Without that, many more Texans would have incurred property damage…or much worse!
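
For the skeptical reader, the arithmetic behind that figure checks out:

```python
# Week-over-week growth of Natural Gas generation, using the
# EIA figures quoted above (MWh).
before, after = 178_512, 899_328

pct_change = (after - before) / before * 100
print(f"{pct_change:.0f}% increase")  # roughly a 400% jump in one week
```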

To be balanced, it should be noted that there were issues with Natural Gas as demand reached never-before-seen levels. In this case, ERCOT and the downstream providers were not equipped to handle these loads. Natural Gas plants were taken offline, and there was freezing at the wellheads and in the midstream infrastructure. You can see that reflected in the charts as Natural Gas generation dropped 23%, from 899,328 MWh to 692,091 MWh, over two of the storm's coldest days.

This can very well happen to a cloud system, too, if Auto Scaling groups and logic are not employed properly or with enough foresight. Imagine a retail website on Cyber Monday. If there’s not enough compute or storage provisioned by the Auto Scaling logic to handle the demand/web traffic, the application will go offline and experience downtime which is akin to power outages and blackouts in the energy generation scenario.

Another parallel, at least from this past week, is price. Typically, Natural Gas is dirt cheap; it has even sold at negative prices at different points in the last few years. But with skyrocketing demand, prices went up 4,000% in some areas.

Renewables(Wind, Solar) ≈ Spot Instances

While this may be some of my bias coming through, Wind and Solar (as employed with current technology) are not very reliable energy sources. They cannot be stored in any meaningful quantity and are subject to Mother Nature’s whims. If the wind dies, no wind power. If the sun is blocked, no solar power. That said, they are relatively cheap and can, in the case of Wind, supply a large percentage of the grid’s power generation at various points in time.

Similarly, Spot instances are very affordable but somewhat unreliable. In AWS, for example, Spot pricing comes with the caveat that AWS can "pull the plug" and terminate Spot instances with just a two-minute warning. This is much like how cloudy weather or nighttime can roll in and take solar power offline, or how winds can die down and take wind power offline.

So What?!

My intent with this post was to be more descriptive as opposed to prescriptive. I noticed a correlation and found it noteworthy given recent events.

In either case (energy generation or cloud computing), there is a lot of nuance to developing a durable, reliable, highly available infrastructure, and I have really distilled it down to the simplest components here. There is SO much more to it.

In the months to come, more information will emerge about why ERCOT/Texas had such a notable grid failure this past week. I believe as it comes out, we'll see that this wasn't any one power source's fault, but rather a failure to prepare the system effectively and employ best practices used across North America.
See this from Bill King:

[Image: commentary from Bill King on the grid failure]

We have a diverse and typically resilient energy grid in Texas, but this time it wasn’t ready. We can only hope the “powers that be” learn from it and adapt!

Recommendations?

Do you agree with my assessment?

What are some other “cloud”/server design concepts that could have been applied to help the electrical grid last week?

  • Regional failover and/or connecting outside your VPC, i.e., tying into other grids for support?
  • Avoid single points of failure?
  • Is there an equivalent to load balancing in the electrical grid?

Thanks for reading!
