DEV Community

Ryan Scott Brown

AWS Re:Liability - The Status Page Status

It's been a rough few weeks for AWS. Every on-call dreads writing an update like this one, and if you wrote the ones that were on status.aws.amazon.com on December 7th, I feel your pain and this post isn't about you.

We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified root cause of the issue... We do not have an ETA for full recovery at this time.

Amazon Web Services is a giant with Q3 2021 revenues reported at 16.1 billion USD, a conference that takes over multiple Vegas hotels every year, and somewhere north of 50,000 employees. It has a wide impact on the tech world directly, and on the rest of the world indirectly because of huge customers including Netflix, Amazon retail, and McDonald's. Downtime is inevitable, but the severity and frequency of these events is a result of decisions made at AWS.

TL;DR

It really grinds my gears to hear AWS yammer on about cell-based deploys and minimizing blast radius during Re:Invent and watch the following week as the status page stays green for at least an hour during a regional outage, and then only admit AWS Console problems. Scroll down to Re:Prioritize if you want to know what AWS can do about it.

Re:Hash, Back to 2017

In 2017, an S3 team member executed a command at 9:37 AM PST that took down a larger set of servers than intended. This stopped S3 in us-east-1 from serving requests and took down other services including EC2 instance launches, EBS snapshot restores, and Lambda function executions. The summary includes the following paragraph:

From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. ... We will do everything we can to learn from this event and use it to improve our availability even further.

For the first two hours of the outage, AWS could not use their status page to provide customer updates on service status.

Re:Hash, Back to 2020

After the Great Kinesis Outage of 2020, AWS themselves pointed out that their communication with customers was lacking. Buried in the second-to-last paragraph, we find this quote:

We experienced some delays in communicating service status to customers during the early part of this event... we typically post to the Service Health Dashboard... we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event.

This is scary: the primary and secondary ways that AWS communicates outages to customers are the Personal Health Dashboard (obviously an AWS service, impacted by AWS issues) and the public status.aws.amazon.com, which itself runs on AWS. According to the above quote, it also depends on Cognito (at least). They continue:

We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators.

That still means the back-up method requires AWS services to be up. AWS is blessed with an interesting problem: using AWS is widespread enough that it would be hard for them to guarantee that a third party hosting their status page did not depend on them in some way. For 99.999% of companies, buying a SaaS like statuspage.io is sufficient to make sure your downtime doesn't take down your status page provider.

For AWS, the status page needs to be more reliable than AWS itself and can't (for pride) be hosted on GCP or another "competitor". This puts them in a sticky spot: hosting a status page dependent on the infrastructure it is telling the status of is bad; hosting a status page on someone else's infrastructure is embarrassing. It's disappointing AWS has so far chosen vanity over customers.

Re:Evaluation

The worst sin of the status page isn't just that it's AWS-reliant, though. For people who depend on AWS, the real status pages are AWS TAMs for customers who have them and Twitter/Reddit/group chats for those who don't.

Imagine a new customer starting their AWS account after watching Re:Invent talks with titles like:

They would expect, whenever there is a problem in AWS, for it to be limited in blast radius to some cell or group of cells. Three major regional service outages in three weeks would not be a reasonable expectation to gather from what AWS claims about its operations.

The week before Re:Invent 2021, AWS took two downtime events that were, by all outside observation, region-wide. In Ohio (us-east-2), AWS Lambda became unavailable for a large but indeterminate number of customers, and in Oregon (us-west-2), SNS became unavailable region-wide for at least 30 minutes.

The positioning of these downtime events directly ahead of Re:Invent is either coincidence, or a rushed deploy to get a Re:Invent feature out. In either case, region-size downtime doesn't inspire confidence in cell isolation for those services. In the latter case, it also indicates holes in testing and canary deployments.

As a customer, it feels like AWS is prioritizing adding features over serving existing systems. Customers should not think of Re:Invent as a harbinger of rushed releases and service downtime.

De:Centralization

No, AWS shouldn't become a DAO or get Ma' Bell'd to prevent systemic downtime. Don't reply about crypto, more than half of Ethereum workloads run on AWS anyway. But the centrality of us-east-1 to every region is scary. Among other things, AWS Account root logins were unavailable for the duration of the outage, and key services like AWS Organizations and audit trails for global services only run in us-east-1.

As of November 22, 2021, AWS CloudTrail will change how trails can be used to capture global service events. After the change, events created by CloudFront, IAM, and AWS STS will be recorded in the region in which they were created, the US East (N. Virginia) region, us-east-1. This makes CloudTrail's treatment of these services consistent with that of other AWS global services. – AWS Docs

CloudFront, IAM, and STS all being "global" services centered in us-east-1 violates the expectations users have when they hear "global service." This also moots "just build in us-east-2" and similar arguments. If STS is failing, ephemeral compute like ECS tasks and Lambda functions will not start.

@awscloud is too big, and has too many customers for the overall good of society.

"Well were things more reliable before @awscloud?" No! Good lord no! The difference is that I could have a bad day and take down a hospital. AWS has a bad day and takes down all the hospitals.

It's the simultaneous outage of everything that's the problem.
Corey Quinn

Coinbase, ironically one of the biggest names in the "decentralized currency / money laundering" space, took downtime along with medical records software, pharmacies, hospitals, nursing homes, factories, and robot vacuums that are controlled through an app.

There seem to be a few Cory/Coreys with opinions on this topic. Let's pivot to another example of centralization risk, everyone's favorite Big Stuck Boat the Ever Given. It clogged the Suez Canal by being so enormous it could get stuck sideways against each side, and the reason it was so large is for throughput at the cost of resilience. Two ships with half the cargo capacity each wouldn't be long enough to stick in the canal.

The Suez crisis illustrates one of the less-appreciated harms of monopoly: all of us are dunderheads at least some of the time. When a single person wields a lot of unchecked power, their follies, errors and blind-spots take on global consequence.

The "efficiencies" of the new class of megaships – the Ever Given weighs 220 kilotons and is as long as the Empire State Building – were always offset by risks, such as the risk of getting stuck in a canal or harbor.

Running a complex system is a game of risk mitigation: not just making a system that works as well as possible, but also making one that fails as well as possible. Build the Titanic if you must, but for the love of God, make sure it has enough life-boats.
Cory Doctorow

AWS has made us-east-1 a source of systemic risk for every single AWS customer. Centralizing IAM, CloudTrail, and CloudFront means many applications are directly dependent on the AWS region with a reputation for large outages.

We started in a garage, but we're not there anymore. We are big, we impact the world, and we are far from perfect. We must be humble and thoughtful about even the secondary effects of our actions. ...
LP #16

What would this value look like applied to status pages?

  • A Personal Health Dashboard that wasn't based in us-east-1
  • A public status page hosted on another provider
  • An SLA for the status page that includes the time between issue discovery and status page update
  • Updates as issues happen
  • Using the red "Service Disruption" icon when more than 50% of a service is impacted
  • Reducing the blast radius of us-east-1 for other services
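The 50% rule in the list above can be sketched as a small function. This is only an illustration: `status_label` and `impacted_fraction` are hypothetical names, how AWS would measure the impacted share of a service is not something this post knows, and only the 50% cutoff comes from the list. The label strings mirror the ones the Service Health Dashboard already uses.

```python
def status_label(impacted_fraction: float) -> str:
    """Pick a status label from the share of a service that is impacted.

    impacted_fraction is a hypothetical measurement between 0.0 and 1.0,
    for example the fraction of requests or customers seeing errors.
    """
    if not 0.0 <= impacted_fraction <= 1.0:
        raise ValueError("impacted_fraction must be between 0.0 and 1.0")
    if impacted_fraction > 0.5:
        # The rule proposed above: more than half impacted means red.
        return "Service disruption"
    if impacted_fraction > 0.0:
        # Assumed middle tier for any smaller, non-zero impact.
        return "Service degradation"
    return "Service is operating normally"


print(status_label(0.8))
```

The point of writing it down is that the rule is mechanical: once the measurement exists, nobody has to decide during an incident whether red is warranted.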

It's understandable that the status of a service can be green while an individual customer is completely down: Virginia alone likely has a million customers. But when multiple huge properties are unavailable, the green dot doesn't shift responsibility.

De:Layed Communication

The failures of the status page and Personal Health Dashboard are inexcusable. The recommendation "subscribe to an RSS feed to be notified of interruptions to each individual service" is a complete joke. Even the following day, the information on the status page doesn't reflect the scope of the previous day's issue.

  • og-aws Slack: 7:47 AM PST "is there an outage? My SSO doesn't load ..."
  • AWS: 8:22 AM PST We are investigating increased error rates for the AWS Management Console.
  • AWS: 8:26 AM PST We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to console.aws.amazon.com. So, to access the US-WEST-2 console, try us-west-2.console.aws.amazon.com.
  • og-aws Slack: 9:01 AM PST "us-east-1 cloudformation went black"
  • AWS: 9:37 AM PST We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.
  • AWS: 4:25 PM PST We are seeing improvements in the error rates and latencies in the AWS Management Console in the US-EAST-1 Region. We are continuing to work towards resolution
  • AWS: 5:14 PM PST Between 7:32 AM to 4:56 PM PST we experienced increased error rates and latencies for the AWS Management Console in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

The updates to the status page were neither up-to-date nor complete. AWS didn't admit the outage affected more than the Management Console until 45 minutes after I was unable to load the console myself, and I wasn't the first person to be impacted. The full scope of affected services isn't even reflected on the status page. API Gateway is listed as affected only between 3:23 PM PST and 5:23 PM PST, when I and others observed API Gateway problems until after 8 PM PST at least.
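To put numbers on the delay, here is a minimal sketch that computes the gaps between timestamps taken from the timeline above. The times are copied from this post (all PST on December 7th); the helper name `minutes_between` is just for illustration.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day clock times like '7:32 AM'."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


impact_start = "7:32 AM"   # start of console impact, per AWS's own resolution note
first_report = "7:47 AM"   # first og-aws Slack message
first_update = "8:22 AM"   # first status page post (Management Console only)
banner_update = "9:37 AM"  # banner acknowledging multiple AWS APIs

print(minutes_between(impact_start, first_update))   # 50
print(minutes_between(first_report, first_update))   # 35
print(minutes_between(impact_start, banner_update))  # 125
```

By AWS's own stated start time, the first post of any kind took 50 minutes, and the banner acknowledging multiple APIs took just over two hours.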

Most damningly, the service history for ECS starts at 3:32 PM PST with this update:

ECS has recovered from the issue earlier in the day, but we are still investigating task launch failures using the Fargate launch type. Task launches using the EC2 launch type are not impacted.

There is nothing on the status page about an issue earlier in the day. The service health was green, what issue?

Re:Prioritize the Customer

From the outside, any recommendation lacks context. I don't pretend to be smarter or better than the people building AWS, but I know the experience of being outside of AWS better than many of them. All these suggestions come from that context.

AWS services depending on other services is confusing from the outside. Service blast radius and dependency chains between services are hard for customers to reason about. The 2020 Kinesis outage took down CloudWatch Logs, CloudWatch Metrics, Cognito, AutoScaling, and EventBridge. AWS needs to treat customers like they are intelligent business partners. Previous architectural guidance from AWS has focused on multi-AZ architectures that survive one or more availability zone incidents. Recent issues have shown multiple region-level services that don't have failure boundaries at the availability zone, which previous guidance doesn't cover.

Beyond increased cellularization of key services (one of the promises after the 2020 Kinesis outage), it seems like a "Ring 0" partition or cell is required that allows AWS to build on other services without creating a Jenga tower opaque to service teams and customers.

Key services need the ability to support multiple regions. Cognito is #1 on my list because there is no way to export users to a new region without forcing a mass password reset or intercepting user passwords. S3 has cross-region replication, RDS allows read replicas across regions, but when it comes to user identities AWS doesn't have a reasonable answer. Further, we now know that AWS Root user logins can only happen if us-east-1 is available, and many DR plans I know of assume that to be the login of last resort in case of IAM or other failures.

AWS has the opportunity to show that it is learning and improving from each of these events. So far, a status page that AWS itself called out in 2017 as too dependent on AWS services, and that is still not fixed, isn't showing that.

The success of the Well-Architected Framework and Cloud Adoption Framework in teaching customers about drivers of success shows that customers are willing to follow guidance, and want to learn from AWS what we can do to control our destiny.

Are regions the new Availability Zone blast radius, and we should expect them to have major failures annually or more? Explain it.

Are these incidents being taken as lessons and causing huge architectural changes inside S3, STS, CloudTrail, and AWS networking? Explain it.

Is the AWS status page not actually intended to be used to understand the status of AWS? Admit it.

What Can You Do?

So far this post has been all about AWS. As a customer, you have power too.

As an industry, our Disaster Recovery plans are based on unrealistic models of how disasters occur. If your DR plan involves rebuilding your app in another cloud provider based on some amount of downtime, realize that a real disaster looks more like a widely degraded failure state with limited information about the scope and the recovery outlook. If your DR plan involves employees coming in to work when a nuke lands on Reston, Virginia, you clearly don't like your family.

Planning to fail into another region is difficult to retrofit, let alone failing over to a second cloud provider. Realistic resiliency that stays working needs to be exercised, and often looks like multiple active regions rather than one active region and a DR plan document. All DR plans are subject to justifications. If a single region is reliable enough for you, then carry on with writing fantasy documents about failing to a new region.

Testing partial failures inside a region is far more realistic, even if it is less protective. If you are in a safety-critical industry and AWS being down impacts you, the continuity of your business process probably needs to be evaluated in a new light. Applications going down shouldn't hurt people.
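As a rough sketch of what exercising a partial failure can look like, the snippet below wraps a dependency call so that a configurable share of calls fail, then checks that the caller degrades instead of crashing. Everything here is hypothetical: `with_partial_failure`, `fetch_profile`, and the cache fallback are stand-ins, not part of any AWS SDK or any particular chaos-testing tool.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency error during a failure test."""


def with_partial_failure(call: Callable[[], T], failure_rate: float,
                         rng: random.Random) -> T:
    """Run call(), but fail roughly failure_rate of the time."""
    if rng.random() < failure_rate:
        raise InjectedFault("simulated dependency failure")
    return call()


def fetch_profile() -> str:
    # Hypothetical stand-in for a call to a regional dependency.
    return "profile-from-dependency"


def fetch_profile_with_fallback(failure_rate: float, rng: random.Random) -> str:
    """Caller under test: serve a cached value when the dependency fails."""
    try:
        return with_partial_failure(fetch_profile, failure_rate, rng)
    except InjectedFault:
        return "profile-from-cache"


rng = random.Random(7)  # seeded so a test run is repeatable
results = [fetch_profile_with_fallback(0.3, rng) for _ in range(1000)]
print(results.count("profile-from-cache"), "of 1000 calls used the fallback")
```

A 30% failure rate is closer to the degraded state described above than an all-or-nothing switch, and it forces the fallback path to be exercised often enough to trust.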

Appendix A: AWS Status Page Banner

The full text of AWS' updates on the status page banner as of December 8th, 2021 is below in case it is removed.

  • [9:37 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified the root cause and are actively working towards recovery.
  • [10:12 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. We have identified root cause of the issue causing service API and console issues in the US-EAST-1 Region, and are starting to see some signs of recovery. We do not have an ETA for full recovery at this time.
  • [11:26 AM PST] We are seeing impact to multiple AWS APIs in the US-EAST-1 Region. This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. Services impacted include: EC2, Connect, DynamoDB, Glue, Athena, Timestream, and Chime and other AWS Services in US-EAST-1. The root cause of this issue is an impairment of several network devices in the US-EAST-1 Region. We are pursuing multiple mitigation paths in parallel, and have seen some signs of recovery, but we do not have an ETA for full recovery at this time. Root logins for consoles in all AWS regions are affected by this issue, however customers can login to consoles other than US-EAST-1 by using an IAM role for authentication.
  • [12:34 PM PST] We continue to experience increased API error rates for multiple AWS Services in the US-EAST-1 Region. The root cause of this issue is an impairment of several network devices. We continue to work toward mitigation, and are actively working on a number of different mitigation and resolution actions. While we have observed some early signs of recovery, we do not have an ETA for full recovery. For customers experiencing issues signing-in to the AWS Management Console in US-EAST-1, we recommend retrying using a separate Management Console endpoint (such as us-west-2.console.aws.amazon.com). Additionally, if you are attempting to login using root login credentials you may be unable to do so, even via console endpoints not in US-EAST-1. If you are impacted by this, we recommend using IAM Users or Roles for authentication. We will continue to provide updates here as we have more information to share.
  • [2:04 PM PST] We have executed a mitigation which is showing significant recovery in the US-EAST-1 Region. We are continuing to closely monitor the health of the network devices and we expect to continue to make progress towards full recovery. We still do not have an ETA for full recovery at this time.
  • [2:43 PM PST] We have mitigated the underlying issue that caused some network devices in the US-EAST-1 Region to be impaired. We are seeing improvement in availability across most AWS services. All services are now independently working through service-by-service recovery. We continue to work toward full recovery for all impacted AWS Services and API operations. In order to expedite overall recovery, we have temporarily disabled Event Deliveries for Amazon EventBridge in the US-EAST-1 Region. These events will still be received & accepted, and queued for later delivery.
  • [3:03 PM PST] Many services have already recovered, however we are working towards full recovery across services. Services like SSO, Connect, API Gateway, ECS/Fargate, and EventBridge are still experiencing impact. Engineers are actively working on resolving impact to these services.
  • [4:35 PM PST] With the network device issues resolved, we are now working towards recovery of any impaired services. We will provide additional updates for impaired services within the appropriate entry in the Service Health Dashboard.
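The workaround in the banner, using a region-specific console endpoint, follows the URL pattern AWS gave in its updates. A minimal sketch, with `regional_console_url` as a hypothetical helper name:

```python
def regional_console_url(region: str) -> str:
    """Build the region-specific console URL quoted in the AWS updates."""
    region = region.strip().lower()
    if not region:
        raise ValueError("region must not be empty")
    return f"https://{region}.console.aws.amazon.com/"


print(regional_console_url("US-WEST-2"))
```

Per the same banner, root logins were affected even on these endpoints, so this only helps when signing in with IAM users or roles.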

Appendix B: Full RSS Timeline

Below is every update pushed for any service during the full incident duration. The title of each update begins with the publish time, in chronological order.

2021-12-07T08:22:57 PST AWS Management Console Service Status

Informational message: [RESOLVED] Increased Error Rates

We are investigating increased error rates for the AWS Management Console.

2021-12-07T08:26:00 PST AWS Management Console Service Status

Informational message: [RESOLVED] Increased Error Rates

We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to https://console.aws.amazon.com/. So, to access the US-WEST-2 console, try https://us-west-2.console.aws.amazon.com/

2021-12-07T08:49:31 PST Amazon Elastic Compute Cloud (N. Virginia) Service Status

Service degradation: [RESOLVED] Increased API Error Rates

We are experiencing elevated error rates for EC2 APIs in the US-EAST-1 region. We have identified root cause and we are actively working towards recovery.

2021-12-07T08:53:40 PST Amazon Connect (N. Virginia) Service Status

Informational message: [RESOLVED] Degraded Contact Handling

We are experiencing degraded Contact handling by agents in the US-EAST-1 Region.

2021-12-07T08:57:51 PST Amazon DynamoDB (N. Virginia) Service Status

Service degradation: [RESOLVED] Increased API Error Rates

We are currently investigating increased error rates with DynamoDB Control Plane APIs, including the Backup and Restore APIs in US-EAST-1 Region.

2021-12-07T09:01:31 PST AWS Support Center Service Status

Informational message: [RESOLVED] Increased Error Rates

We are investigating increased error rates for the Support Center console and Support API in the US-EAST-1 Region.

2021-12-07T09:08:27 PST Amazon Connect (N. Virginia) Service Status

Service degradation: [RESOLVED] Degraded Contact Handling

We are experiencing degraded Contact handling by agents in the US-EAST-1 Region. Agents may experience issues logging in or being connected with end-customers.

2021-12-07T09:18:00 PST Amazon Connect (N. Virginia) Service Status

Service degradation: [RESOLVED] Degraded Contact Handling

We can confirm degraded Contact handling by agents in the US-EAST-1 Region. Agents may experience issues logging in or being connected with end-customers.

2021-12-07T13:55:59 PST AWS Support Center Service Status

Service degradation: [RESOLVED] Increased Error Rates

We continue to see increased error rates for the Support Center console and Support API in the US-EAST-1 Region. Support Cases successfully created via the console or the API may not be successfully routed to Support Engineers. We continue to work toward full resolution.

2021-12-07T14:54:31 PST Amazon EventBridge (N. Virginia) Service Status

Service degradation: [RESOLVED] Event Delivery Delays

We have temporarily disabled event deliveries in the US-EAST-1 Region. Customers who have EventBridge rules that trigger from 1st party AWS events (including CloudTrail), scheduled events via CloudWatch, events from 3rd parties, and events they post themselves via the PutEvents API action will not trigger targets. These events will still be received by EventBridge and will deliver once we recover.

2021-12-07T15:00:14 PST Amazon EventBridge (N. Virginia) Service Status

Service degradation: [RESOLVED] Event Delivery Delays

We have re-enabled event deliveries in the US-EAST-1 Region, but are experiencing event delivery latencies. Customers who have EventBridge rules that trigger from 1st party AWS events (including CloudTrail), scheduled events via CloudWatch, events from 3rd parties, and events they post themselves via the PutEvents API action will be delayed.

2021-12-07T15:13:00 PST AWS Support Center Service Status

Service is operating normally: [RESOLVED] Increased Error Rates

Between 7:33 AM and 2:25 PM PST, we experienced increased error rates for the Support Center console and Support API in the US-EAST-1 Region. This resulted in errors in creating support cases and delays in routing cases to Support Engineers. The issue has been resolved and our Support Engineering team is responding to cases. The service is operating normally.

2021-12-07T15:23:16 PST Amazon API Gateway (N. Virginia) Service Status

Informational message: [RESOLVED] Elevated Errors and Latencies

We continue to see increased error rates and latencies for invokes in the US-EAST-1 region. We have identified the root cause and are working towards resolution.

2021-12-07T15:31:24 PST Amazon Elastic Compute Cloud (N. Virginia) Service Status

Service is operating normally: [RESOLVED] Increased API Error Rates

Between 7:32 AM and 3:10 PM PST we experienced increased API error rates in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

2021-12-07T15:32:57 PST Amazon Elastic Container Service (N. Virginia) Service Status

Informational message: [RESOLVED] Elevated Fargate task launch failures

ECS has recovered from the issue earlier in the day, but we are still investigating task launch failures using the Fargate launch type. Task launches using the EC2 launch type are not impacted.

2021-12-07T15:40:39 PST Amazon DynamoDB (N. Virginia) Service Status

Service is operating normally: [RESOLVED] Increased API Error Rates

Between 7:40 AM and 2:25 PM PST, we experienced increased error rates with DynamoDB Control Plane APIs, including the Backup and Restore APIs in US-EAST-1 Region. Data plane operations were not impacted. The issue has been resolved and the service is operating normally.

2021-12-07T16:05:21 PST Amazon API Gateway (N. Virginia) Service Status

Service degradation: [RESOLVED] Elevated Errors and Latencies

We continue to see increased error rates and latencies for invokes in the US-EAST-1 region. We have identified the root cause and are continuing to work towards resolution.

2021-12-07T16:07:39 PST AWS Batch (N. Virginia) Service Status

Informational message: [RESOLVED] Increased Job Processing Delays

We have identified the root cause of increased delay in job state transitions of AWS Batch Jobs in the US-EAST-1 Region and continue to work toward resolution.

2021-12-07T16:25:28 PST AWS Management Console Service Status

Informational message: [RESOLVED] Increased Error Rates

We are seeing improvements in the error rates and latencies in the AWS Management Console in the US-EAST-1 Region. We are continuing to work towards resolution

2021-12-07T16:31:14 PST Amazon EventBridge (N. Virginia) Service Status

Service degradation: [RESOLVED] Event Delivery Delays

We continue to see event delivery latencies in the US-EAST-1 region. We have identified the root cause and are working toward recovery.

2021-12-07T16:41:50 PST Amazon API Gateway (N. Virginia) Service Status

Informational message: [RESOLVED] Elevated Errors and Latencies

We have seen improvement in error rates and latencies for invokes in the US-EAST-1 region. We continue to drive towards full recovery.

2021-12-07T16:44:27 PST Amazon Elastic Container Service (N. Virginia) Service Status

Informational message: [RESOLVED] Elevated Fargate task launch failures

ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. We have identified the root cause for the increased Fargate launch failures and are working towards recovery.

2021-12-07T16:47:46 PST Amazon Connect (N. Virginia) Service Status

Informational message: [RESOLVED] Degraded Contact Handling

We are seeing improvements to contact handling in the US-EAST-1 Region. We are continuing to work towards resolution

2021-12-07T17:10:53 PST Amazon Connect (N. Virginia) Service Status

Service is operating normally: [RESOLVED] Degraded Contact Handling

Between 7:25 AM PST and 4:47 PM PST we experienced degraded Contact handling, increased user login errors, and increased API error rates in the US-EAST-1 Region. During this time, end-customers may have experienced delays or errors when placing a call or starting a chat, and agents may have experienced issues logging in or being connected with end-customers. The issue has been resolved and the service is operating normally.

2021-12-07T17:14:09 PST AWS Management Console Service Status

Service is operating normally: [RESOLVED] Increased Error Rates

Between 7:32 AM to 4:56 PM PST we experienced increased error rates and latencies for the AWS Management Console in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

2021-12-07T17:20:13 PST AWS Batch (N. Virginia) Service Status

Informational message: [RESOLVED] Increased Job Processing Delays

We have seen improvement from the delay in job state transitions of AWS Batch Jobs in the US-EAST-1 Region and continue to work toward resolution.

2021-12-07T17:23:51 PST Amazon API Gateway (N. Virginia) Service Status

Service is operating normally: [RESOLVED] Elevated Errors and Latencies

Between 9:02 AM and 5:01 PM PST we experienced increased error rates and latencies for invokes in the US-EAST-1 Region. The issue has been resolved and the service is operating normally.

2021-12-07T17:31:42 PST Amazon Elastic Container Service (N. Virginia) Service Status

Informational message: [RESOLVED] Elevated Fargate task launch failures

ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. We have identified the root cause for the increased Fargate launch failures and are starting to see recovery. As we work towards full recovery, customers may experience insufficient capacity errors and these are being addressed as well.

2021-12-07T18:00:00 PST Amazon EventBridge (N. Virginia) Service Status

Informational message: [RESOLVED] Event Delivery Delays

Event delivery latency for new events in the US-EAST-1 Region have returned to normal levels. We continue to process a backlog of events.

2021-12-07T19:30:07 PST Amazon Elastic Container Service (N. Virginia) Service Status

Informational message: [RESOLVED] Elevated Fargate task launch failures

ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. Fargate task launches are currently experiencing increased insufficient capacity errors. We are working on addressing this. In the interim, tasks sizes smaller than 4vCPU are less likely to see insufficient capacity errors.

2021-12-07T20:02:29 PST AWS Batch (N. Virginia) Service Status

Informational message: [RESOLVED] Increased Job Processing Delays

Improvement from the delay in job state transitions of AWS Batch Jobs in the US-EAST-1 Region is accelerating, we continue to work towards full recovery.

2021-12-07T20:29:02 PST AWS Batch (N. Virginia) Service Status

Service is operating normally: [RESOLVED] Increased Job Processing Delays

Between 7:35 AM and 8:13 PM PST, we experienced increase job state transition delays of AWS Batch Jobs in the US-EAST-1 Region. The issue has been resolved and the service is now operating normally for new job submissions. Jobs that were delayed from earlier in the event will be processed in order until we clear the queue.

2021-12-07T21:21:58 PST Amazon EventBridge (N. Virginia) Service Status

Service is operating normally: [RESOLVED] Event Delivery Delays

Between 7:30 AM and 8:40 PM PST we experienced elevated event delivery latency in the US-EAST-1 Region. Event delivery latencies have returned to normal levels. Some CloudTrail events for API calls between 7:35 AM and 6:05 PM PST may be delayed but will be delivered in the coming hours.

2021-12-07T23:01:37 PST Amazon Elastic Container Service (N. Virginia) Service Status

Informational message: [RESOLVED] Elevated Fargate task launch failures

ECS has recovered from the issue earlier in the day. Task launches using the EC2 launch type are fully recovered. Fargate task launches are currently experiencing increased insufficient capacity errors. We are working on addressing this and have recently seen a decrease in these errors while continuing to work towards full recovery. In the interim, tasks sizes smaller than 4vCPU are less likely to see insufficient capacity errors.

2021-12-08T02:29:47 PST Amazon Elastic Container Service (N. Virginia) Service Status

Service is operating normally: [RESOLVED] Elevated Fargate task launch failures

Between 7:31 AM PST on December 7 and 2:20 AM PST on December 8, ECS experienced increased API error rates, latencies, and task launch failures. API error rates and latencies recovered by 6:10 PM PST on December 7. After this point, ECS customers using the EC2 launch type were fully recovered. ECS customers using the Fargate launch type along with EKS customers using Fargate continued to see decreasing impact in the form of insufficient capacity errors between 4:40 PM PST on December 7 and 2:20 AM on December 8. The service is now operating normally. A small set of customers may still experience low levels of insufficient capacity errors and will be notified using the Personal Health Dashboard in that case. There was no impact to running tasks during the event although any ECS task that failed health checks would have been stopped because of that failing health check.
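The entry headings in this appendix follow a regular pattern, so the first post for each service can be pulled out mechanically. The sketch below assumes headings shaped exactly like the ones above (timestamp, "PST", service name, "Service Status"); the function names are hypothetical and the sample list is a small subset of the real entries.

```python
import re
from datetime import datetime
from typing import Dict, List, Tuple

HEADING = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) PST (.+) Service Status$"
)


def parse_heading(line: str) -> Tuple[datetime, str]:
    """Split one heading into its publish time and service name."""
    match = HEADING.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized heading: {line!r}")
    published = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    return published, match.group(2)


def first_post_by_service(headings: List[str]) -> Dict[str, datetime]:
    """Earliest publish time seen for each service."""
    first: Dict[str, datetime] = {}
    for line in headings:
        published, service = parse_heading(line)
        if service not in first or published < first[service]:
            first[service] = published
    return first


sample = [
    "2021-12-07T08:22:57 PST AWS Management Console Service Status",
    "2021-12-07T08:49:31 PST Amazon Elastic Compute Cloud (N. Virginia) Service Status",
    "2021-12-07T15:23:16 PST Amazon API Gateway (N. Virginia) Service Status",
    "2021-12-07T15:32:57 PST Amazon Elastic Container Service (N. Virginia) Service Status",
    "2021-12-07T16:05:21 PST Amazon API Gateway (N. Virginia) Service Status",
]

for service, published in first_post_by_service(sample).items():
    print(published.strftime("%H:%M:%S"), service)
```

Run against the full list, this makes the late first posts for API Gateway and ECS easy to see next to the start times AWS reported in its resolution notes.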
