Is AWS Down? How to Check AWS Status and Fix Issues
Last Updated: February 13, 2026
Amazon Web Services (AWS) powers approximately 33% of the global cloud infrastructure market and supports millions of businesses worldwide, from startups to Fortune 500 companies. When AWS goes down, the impact is catastrophic: e-commerce sites lose revenue, SaaS applications become unavailable, APIs stop responding, mobile apps crash, and critical business operations grind to a halt. Whether you're a DevOps engineer managing production infrastructure, a developer deploying applications, a CTO monitoring business-critical systems, or a business owner relying on AWS for operations, this guide helps you quickly determine if AWS is experiencing an outage and what to do about it.
Quick Status Check: Is AWS Down Right Now?
Before troubleshooting, verify AWS's current status:
- API Status Check: Visit apistatuscheck.com/api/aws for real-time AWS monitoring across all major services
- AWS Service Health Dashboard: Check AWS's official status at status.aws.amazon.com
- Test Multiple Regions: AWS operates in 30+ regions — check if your specific region (us-east-1, eu-west-1, etc.) is affected
- Try Different Services: Test EC2 vs. S3 vs. Lambda independently — partial outages are common
- Check DownDetector: downdetector.com/status/amazon-web-services for user-reported issues
If multiple sources confirm problems, AWS is likely experiencing an outage. If only you're affected, follow the troubleshooting steps below.
💡 Pro tip: Sign up for free alerts to get notified the moment AWS goes down — critical for SRE teams running production workloads, e-commerce during peak sales, and financial services with strict SLAs.
AWS Infrastructure: What Can Break
AWS is composed of 200+ services that can fail independently. The most critical services include:
| Service | What It Does | Impact When Down |
|---|---|---|
| EC2 | Virtual servers and compute instances | Applications can't start, scale, or process requests |
| S3 | Object storage for files, backups, static assets | Websites break, file uploads fail, data inaccessible |
| Lambda | Serverless function execution | APIs fail, automation stops, event processing halts |
| RDS | Managed relational databases (MySQL, PostgreSQL, etc.) | Data layer fails, read/write operations timeout |
| CloudFront | Global CDN for content delivery | Websites slow down or show 5xx errors globally |
| Route 53 | DNS and domain management | Domain names stop resolving, traffic can't reach apps |
| API Gateway | API management and routing | REST/WebSocket APIs return errors or timeout |
| DynamoDB | NoSQL database | Real-time data access fails, high-traffic apps crash |
| ECS/EKS | Container orchestration (Docker/Kubernetes) | Containers can't start, scale, or communicate |
| SQS/SNS | Message queuing and notifications | Async processing stops, notifications don't send |
| Elastic Load Balancer | Traffic distribution across instances | Traffic routing fails, health checks stop working |
| IAM | Identity and access management | Authentication fails, API calls get permission errors |
| CloudWatch | Monitoring and logging | Can't see metrics, alarms don't trigger, blind to issues |
| VPC | Virtual private cloud networking | Network connectivity lost between resources |
Key insight: AWS often has region-specific outages where one region (e.g., us-east-1) fails while others remain operational. The us-east-1 region (Northern Virginia) is AWS's largest and most prone to cascading failures. Multi-region architectures are critical for high availability.
Common AWS Error Messages and What They Mean
| Error | Meaning | Fix |
|---|---|---|
| EC2: InsufficientInstanceCapacity | No available capacity in availability zone | Try different AZ or instance type, or wait for capacity |
| S3: 503 SlowDown | Too many requests to S3 bucket | Implement exponential backoff, use CloudFront CDN |
| S3: Access Denied | IAM permissions issue or bucket policy | Verify IAM policy, bucket ACL, and Block Public Access settings |
| Lambda: Service Unavailable | Lambda control plane outage | Check AWS status, retry with exponential backoff |
| Lambda: Task timed out after X seconds | Function exceeded timeout limit | Increase timeout in function config or optimize code |
| RDS: could not connect to server | Database unreachable or failed over | Check security groups, verify RDS is running, check region status |
| CloudFront: 502 Bad Gateway | Origin (EC2/S3/ALB) not responding | Check origin health, verify security groups allow CloudFront |
| CloudFront: 503 Service Unavailable | CloudFront distribution or origin issue | Check origin status, verify custom headers, check AWS status |
| Route 53: SERVFAIL | DNS resolution failure | Check hosted zone config, verify nameservers, check AWS status |
| API Gateway: 502 Bad Gateway | Lambda or backend integration failure | Check Lambda logs, verify integration timeout settings |
| API Gateway: 429 Too Many Requests | Rate limit exceeded | Implement throttling, request a limit increase |
| DynamoDB: ProvisionedThroughputExceededException | Exceeded read/write capacity | Enable auto-scaling or use on-demand billing |
| DynamoDB: ServiceUnavailable | DynamoDB control plane issue | Check AWS status, implement retry logic |
| ECS: Service is unable to place a task | No EC2 capacity or Fargate capacity exhausted | Check cluster capacity, try different AZ |
| EKS: cannot reach cluster | Control plane unreachable | Check AWS status for EKS, verify security groups and VPC config |
| SQS: QueueDoesNotExist | Queue deleted or wrong region | Verify queue name and region, check IAM permissions |
| SNS: EndpointDisabled | Endpoint bounced or failed too many times | Re-subscribe endpoint, verify endpoint is reachable |
| IAM: AccessDenied | Insufficient permissions | Review IAM policy, check resource-based policies |
| IAM: InvalidClientTokenId | AWS credentials invalid or rotated | Refresh credentials, check access key is active |
| VPC: Network is unreachable | Routing table, NAT gateway, or IGW issue | Check route tables, verify NAT/IGW attached and running |
Common AWS CLI Error Messages
| CLI Error | Meaning | Fix |
|---|---|---|
| Unable to locate credentials | AWS CLI not configured | Run aws configure with access key and secret key |
| Could not connect to the endpoint URL | Wrong region or endpoint | Verify --region flag matches resource location |
| An error occurred (DryRunOperation) | Dry-run mode enabled | Remove --dry-run flag to execute for real |
| A client error (RequestLimitExceeded) | API throttling | Implement exponential backoff and retry logic |
| You are not authorized to perform this operation | IAM permission missing | Check IAM policy for required actions |
| Invalid ID or does not exist | Resource not found or wrong region | Verify resource ID and region parameter |
| SSL validation failed | Certificate issue | Update AWS CLI or use --no-verify-ssl (not recommended for production) |
| timed out | Network connectivity or AWS service issue | Check internet, verify security groups, check AWS status |
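Several of the fixes above boil down to "implement exponential backoff and retry". A minimal bash sketch of that pattern is below; the function name and delay values are illustrative choices, not an AWS-provided tool:

```shell
#!/usr/bin/env bash
# retry_with_backoff <max_attempts> <command...>
# Retries a command with exponentially growing delays plus up to 1s of
# jitter, which is what throttling errors like RequestLimitExceeded and
# S3 503 SlowDown expect callers to do.
retry_with_backoff() {
  local max_attempts=$1; shift
  local attempt=1 delay=1
  while ! "$@"; do
    if (( attempt >= max_attempts )); then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    sleep $(( delay + RANDOM % 2 ))   # jitter avoids synchronized retries
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Example: retry a flaky S3 listing up to 5 times
# retry_with_backoff 5 aws s3 ls s3://your-bucket-name
```

The AWS CLI and SDKs also retry internally, so a wrapper like this is most useful around whole scripts or non-SDK tools.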
Step-by-Step Troubleshooting
Step 1: Confirm It's Not Just You
- Check apistatuscheck.com/api/aws for real-time status across all AWS services
- Visit status.aws.amazon.com and filter by your region
- Search Twitter/X for "AWS down" or "us-east-1 down" to see if others report issues
- Ask teammates or check internal monitoring (CloudWatch, Datadog, New Relic)
- Try from a different network or device (VPN, mobile hotspot)
- Check if other AWS regions are affected (try launching resources in us-west-2 if us-east-1 fails)
- Review DownDetector for geographic patterns and affected services
Step 2: Identify Which AWS Service Is Affected
- EC2: Try launching a new instance or connecting to existing ones via SSH/RDP
- S3: Test bucket access with AWS CLI: `aws s3 ls s3://your-bucket-name`
- Lambda: Check function execution logs in CloudWatch or trigger a test event
- RDS: Try connecting to database with mysql/psql client or check RDS console
- CloudFront: Test distribution URL directly and check origin health
- Route 53: Test DNS resolution: `nslookup yourdomain.com` or `dig yourdomain.com`
- API Gateway: Call API endpoint with curl and check response codes
- DynamoDB: Test table access with AWS CLI: `aws dynamodb scan --table-name YourTable --limit 1`
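To run these spot checks in one pass, a small wrapper can report per-service status without aborting on the first failure. The probe commands shown in comments are examples; substitute whichever services and resources you actually depend on:

```shell
#!/usr/bin/env bash
# probe <name> <command...> — print OK or FAILED for one service, never
# aborting, so a single run shows exactly which AWS services error out.
probe() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: OK"
  else
    echo "$name: FAILED"
  fi
}

# Illustrative probes (replace with your own resources):
# probe "S3"       aws s3 ls s3://your-bucket-name
# probe "EC2"      aws ec2 describe-instances --max-items 1
# probe "DynamoDB" aws dynamodb list-tables --max-items 1
```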
Step 3: Fix EC2 Instance Issues
EC2 Instances Not Launching
# Check available capacity in different AZs
aws ec2 describe-instance-type-offerings \
--location-type availability-zone \
--filters Name=instance-type,Values=t3.medium \
--region us-east-1
# Try launching in different AZ
aws ec2 run-instances \
--image-id ami-xxxxx \
--instance-type t3.medium \
--subnet-id subnet-xxxxx \
--region us-east-1
# Use different instance type if capacity unavailable
# Try t3.small instead of t3.medium, or switch to m5/c5 family
Cannot Connect to EC2 Instance
# Verify instance is running
aws ec2 describe-instances --instance-ids i-xxxxx
# Check security group allows SSH (port 22) or RDP (port 3389)
aws ec2 describe-security-groups --group-ids sg-xxxxx
# Test network connectivity
ping <instance-public-ip>
telnet <instance-public-ip> 22
# Connect via Session Manager (doesn't require SSH)
aws ssm start-session --target i-xxxxx
# Check system logs for boot issues
aws ec2 get-console-output --instance-id i-xxxxx
EC2 Status Checks Failing
- System Status Check Failed: AWS infrastructure issue — stop and start instance to migrate to new host
- Instance Status Check Failed: OS or application issue — check console output, reboot instance
- Auto Scaling not launching instances: Check capacity, verify launch template, check IAM service role
- Spot instances terminated: Spot price exceeded your maximum price or AWS reclaimed capacity — use On-Demand or diversify instance types
Step 4: Fix S3 Access Issues
S3 Access Denied Errors
# Test bucket access
aws s3 ls s3://your-bucket-name
# Check bucket policy and ACL
aws s3api get-bucket-policy --bucket your-bucket-name
aws s3api get-bucket-acl --bucket your-bucket-name
# Verify IAM permissions
aws iam get-user-policy --user-name your-username --policy-name your-policy
# Test with presigned URL (bypasses IAM)
aws s3 presign s3://your-bucket-name/file.txt --expires-in 3600
# Check Block Public Access settings (can override bucket policy)
aws s3api get-public-access-block --bucket your-bucket-name
S3 Slowdown or Timeout
- 503 SlowDown: Reduce request rate, implement exponential backoff
- Use multipart upload: For files >100MB, use multipart upload API
- Enable Transfer Acceleration: Faster uploads through CloudFront edge locations
- Use S3 in same region: Cross-region S3 access is slower
- Implement retry logic: S3 occasionally has transient errors — retry with backoff
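The multipart and concurrency behavior of the `aws s3` commands can be tuned in `~/.aws/config`; the values below are illustrative starting points, not AWS recommendations:

```ini
# ~/.aws/config — S3 transfer tuning for aws s3 cp/sync commands
[default]
s3 =
    multipart_threshold = 100MB
    multipart_chunksize = 16MB
    max_concurrent_requests = 20
```

Lowering `max_concurrent_requests` can also help ride out 503 SlowDown responses by reducing your request rate.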
S3 Website or Static Hosting Issues
# Verify bucket website configuration
aws s3api get-bucket-website --bucket your-bucket-name
# Check if index.html exists and is public
aws s3 ls s3://your-bucket-name/index.html
aws s3api get-object-acl --bucket your-bucket-name --key index.html
# Test website endpoint directly
curl http://your-bucket-name.s3-website-us-east-1.amazonaws.com
Step 5: Fix Lambda Function Failures
Lambda Timeouts
# Check function logs in CloudWatch
aws logs tail /aws/lambda/your-function-name --follow
# Increase timeout (max 15 minutes)
aws lambda update-function-configuration \
--function-name your-function-name \
--timeout 300
# Increase memory (more memory = more CPU)
aws lambda update-function-configuration \
--function-name your-function-name \
--memory-size 1024
# Check if function is throttled
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name Throttles \
--dimensions Name=FunctionName,Value=your-function-name \
--start-time 2026-02-13T00:00:00Z \
--end-time 2026-02-13T23:59:59Z \
--period 3600 \
--statistics Sum
Lambda Cold Start Issues
- Provisioned Concurrency: Pre-warm functions to eliminate cold starts
- Keep functions warm: Use an EventBridge (formerly CloudWatch Events) schedule to invoke every 5 minutes (costs apply)
- Optimize package size: Smaller deployment packages start faster
- Use Lambda layers: Share common dependencies across functions
- Switch to container images: Better cold start performance for large dependencies
Lambda VPC Connectivity Issues
- ENI creation delays: Lambda in VPC can take 10-30 seconds for first invocation
- NAT Gateway required: Private subnet Lambda needs NAT Gateway for internet access
- Security groups: Verify Lambda security group allows outbound traffic
- VPC endpoints: Use VPC endpoints for S3, DynamoDB to avoid NAT Gateway costs
- Consider removing VPC: If you don't need private resources, run Lambda outside VPC
Step 6: Fix RDS Connection Failures
Cannot Connect to RDS Database
# Check RDS instance status
aws rds describe-db-instances --db-instance-identifier your-db-instance
# Verify security group allows traffic from your IP/instance
aws ec2 describe-security-groups --group-ids sg-xxxxx
# Test connection from EC2 instance in same VPC
mysql -h your-db-instance.xxxxx.us-east-1.rds.amazonaws.com -u admin -p
# Check if RDS is in same VPC/subnet
aws rds describe-db-subnet-groups --db-subnet-group-name your-subnet-group
# Enable public accessibility (not recommended for production)
aws rds modify-db-instance \
--db-instance-identifier your-db-instance \
--publicly-accessible
RDS Performance Issues
- Too many connections: Check the max_connections parameter and use connection pooling
- High CPU/memory: Upgrade instance class or optimize queries
- Storage full: Increase allocated storage or enable autoscaling storage
- Read replica lag: Check replication lag in CloudWatch metrics
- IOPS exhausted: Upgrade to Provisioned IOPS or gp3 storage
- Parameter group issues: Review database parameter group settings
RDS Failover or Unavailability
- Multi-AZ failover: Automatic failover takes 60-120 seconds — use endpoint, not IP
- Backup restoration: Point-in-time recovery creates new instance, update connection strings
- Maintenance window: AWS performs updates during maintenance window — schedule carefully
- Snapshot restoration: Restoring from snapshot takes 10-30 minutes depending on size
Step 7: Fix CloudFront 5xx Errors
502 Bad Gateway
# Check origin health (EC2, ALB, S3)
curl -I https://your-origin.example.com
# Verify origin security group allows CloudFront IP ranges
# Download from: https://ip-ranges.amazonaws.com/ip-ranges.json
# Check CloudFront distribution settings
aws cloudfront get-distribution --id EXAMPLEID
# Review origin response timeout (default 30 seconds)
# Increase if origin is slow to respond
# Check custom headers required by origin
# Verify CloudFront sends expected headers
503 Service Unavailable
- Origin unavailable: EC2/ALB/S3 is down or overloaded
- Lambda@Edge timeout: Edge function took too long to execute
- Distribution disabled: Check if distribution is deployed and enabled
- Origin shield overloaded: Temporary shield issue — usually resolves quickly
504 Gateway Timeout
- Origin slow: Origin took >60 seconds to respond (CloudFront maximum)
- Optimize origin: Reduce origin response time or cache more aggressively
- Check origin keep-alive: Ensure origin supports persistent connections
- Lambda@Edge timeout: Edge function exceeded execution limit
Step 8: Fix Route 53 DNS Issues
Domain Not Resolving
# Test DNS resolution
nslookup yourdomain.com
dig yourdomain.com
# Check Route 53 hosted zone
aws route53 list-hosted-zones
aws route53 list-resource-record-sets --hosted-zone-id Z1234567890ABC
# Verify nameservers match registrar
aws route53 get-hosted-zone --id Z1234567890ABC
whois yourdomain.com
# Test from different DNS servers
dig @8.8.8.8 yourdomain.com
dig @1.1.1.1 yourdomain.com
# Check health checks if using failover routing
aws route53 get-health-check-status --health-check-id abc12345
DNS Propagation Delays
- TTL not expired: DNS changes take TTL seconds to propagate (default 300s)
- Registrar nameserver update: Nameserver changes take 24-48 hours globally
- ISP DNS caching: Some ISPs ignore TTL and cache longer
- Flush local DNS cache: Clear your computer's DNS cache to see changes immediately
Health Check Failures
- Endpoint unreachable: Verify target is accessible from internet
- String matching: If using string matching, ensure response contains expected string
- Health check interval: Shorter intervals (10s vs 30s) detect failures faster
- Health check regions: Route 53 checks from multiple regions — all must pass
Step 9: Fix API Gateway Errors
502 Bad Gateway
# Check Lambda integration
aws apigateway get-integration \
--rest-api-id abc123 \
--resource-id xyz789 \
--http-method GET
# Review Lambda execution logs
aws logs tail /aws/lambda/your-function-name --follow
# Verify Lambda has correct permissions
aws lambda get-policy --function-name your-function-name
# Test Lambda directly (bypass API Gateway)
aws lambda invoke \
--function-name your-function-name \
--payload '{"test": "data"}' \
response.json
429 Too Many Requests
- Throttle limits exceeded: Default 10,000 requests/second per account per region
- Request usage plan increase: Contact AWS support for higher limits
- Implement caching: Enable API Gateway caching to reduce backend load
- Use client-side throttling: Implement retry with exponential backoff
Integration Timeout
- 29-second limit: API Gateway has hard 29-second timeout
- Optimize backend: Lambda/backend must respond within 29 seconds
- Use async pattern: For long operations, return immediately and process asynchronously
- Don't rely on Lambda timeout: Setting the Lambda timeout above 29 seconds doesn't help, because API Gateway still cuts the request off at 29 seconds
Step 10: Network and VPC Troubleshooting
VPC Connectivity Issues
# Check route tables
aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-xxxxx"
# Verify Internet Gateway attached
aws ec2 describe-internet-gateways --filters "Name=attachment.vpc-id,Values=vpc-xxxxx"
# Check NAT Gateway status
aws ec2 describe-nat-gateways --filter "Name=vpc-id,Values=vpc-xxxxx"
# Test security group rules
aws ec2 describe-security-groups --group-ids sg-xxxxx
# Check network ACLs (often overlooked)
aws ec2 describe-network-acls --filters "Name=vpc-id,Values=vpc-xxxxx"
# Use VPC Flow Logs to debug traffic
aws ec2 describe-flow-logs --filter "Name=resource-id,Values=vpc-xxxxx"
Common VPC Misconfigurations
- No route to internet: Missing Internet Gateway or NAT Gateway route
- Wrong subnet type: Private subnet can't reach internet without NAT
- Security group vs NACL: Security groups are stateful, NACLs are stateless
- Overlapping CIDR blocks: VPC peering fails if CIDR blocks overlap
- VPC endpoint missing: S3/DynamoDB access from private subnet needs VPC endpoint
Step 11: IAM Permission Errors
Access Denied Troubleshooting
# Decode authorization error message
aws sts decode-authorization-message --encoded-message <error-message>
# Check current user identity
aws sts get-caller-identity
# List attached policies
aws iam list-attached-user-policies --user-name your-username
aws iam list-user-policies --user-name your-username
# Simulate IAM policy
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:user/your-username \
--action-names s3:GetObject \
--resource-arns arn:aws:s3:::your-bucket-name/*
Common IAM Issues
- Explicit deny wins: Deny statements override allow statements
- Resource-based policies: S3 buckets, Lambda functions have their own policies
- Service Control Policies: SCPs can block actions even if IAM policy allows
- Session policies: Assumed roles can have additional restrictions
- Permission boundaries: Limit maximum permissions of IAM entities
- MFA required: Some actions require MFA authentication
Step 12: Check AWS Service Health Dashboard
If AWS is genuinely down:
- Visit status.aws.amazon.com
- Filter by your region (us-east-1, eu-west-1, etc.)
- Check "Service Health" tab for current incidents
- Review "Event Log" for historical outages
- Subscribe to RSS feed for specific services and regions
- Most AWS outages resolve within 1-4 hours
- Set up alerts on API Status Check to know when it's back
Historical AWS Outages
Notable Incidents
December 2021 — us-east-1 Multi-Service Outage
The largest AWS outage in years affected us-east-1 for 7+ hours. EC2, Lambda, RDS, and DynamoDB all degraded simultaneously. The incident took down Netflix, Disney+, Robinhood, Coinbase, and thousands of other services. Root cause was internal network device failure that cascaded across availability zones. AWS's status page initially showed "green" for 30+ minutes while services were clearly down.
November 2020 — Kinesis Data Streams Failure
Kinesis outage in us-east-1 lasted 8+ hours and cascaded to dependent services including CloudWatch, Lambda, and Cognito. The incident prevented new service deployments across AWS because internal tools relied on Kinesis. Companies couldn't even diagnose the issue because CloudWatch logging was broken. This outage exposed AWS's own internal dependencies on shared infrastructure.
September 2020 — Cognito Authentication Outage
AWS Cognito failed globally for 4+ hours, preventing users from logging into thousands of applications. Mobile apps, SaaS platforms, and consumer applications all lost authentication capabilities. The incident occurred during US business hours, maximizing impact. Companies with custom authentication systems were unaffected, highlighting the risk of fully managed services.
July 2019 — us-east-1 EC2 Overheating
Physical infrastructure overheating in a single us-east-1 data center caused EC2 instance failures for 6+ hours. Instances randomly terminated without warning. Elastic Load Balancers failed to route traffic correctly. Auto Scaling groups couldn't launch replacement instances due to capacity constraints. The incident demonstrated that even AWS's physical infrastructure can fail catastrophically.
February 2017 — S3 Total Outage in us-east-1
S3 went completely offline in us-east-1 for 4+ hours due to a typo in a maintenance command. The outage took down websites, APIs, and services across the internet. AWS's own status page couldn't update because it used S3 for storage. Companies learned that "S3 is always available" was not true, and multi-region replication became standard practice.
September 2015 — DynamoDB Outage
DynamoDB experienced a 5+ hour outage in us-east-1 affecting high-traffic applications and gaming platforms. Read and write operations failed across multiple availability zones. AWS's incident response was slow, with initial status updates taking 90+ minutes. The outage demonstrated that even NoSQL "infinitely scalable" databases can fail completely.
October 2012 — Hurricane Sandy Impact
While not a service failure, Hurricane Sandy caused widespread internet connectivity issues affecting AWS us-east-1. Some customers lost connectivity to their instances despite AWS infrastructure remaining operational. The incident highlighted the importance of multi-region architectures for disaster recovery.
April 2011 — EBS Outage (Original us-east-1 Disaster)
The infamous EBS outage that lasted multiple days and destroyed customer data. A network misconfiguration triggered a cascading failure in EBS replication. Many startups lost their entire databases permanently. AWS learned from this incident and improved EBS reliability significantly, but it remains a cautionary tale about cloud infrastructure risks.
Outage Patterns
- us-east-1 most vulnerable: AWS's largest and oldest region fails more frequently
- Cascading failures: One service failure (Kinesis) often breaks dependent services (CloudWatch, Lambda)
- Internal dependencies: AWS's own tools rely on AWS services, creating circular dependencies during outages
- Status page delays: AWS status page often shows "green" for 15-60 minutes after actual outage begins
- Communication gaps: Initial incident communication is often vague and delayed
- Multi-AZ not enough: Even multi-AZ deployments can fail during region-wide outages
- Shared infrastructure: Managed services (RDS, Lambda) share underlying infrastructure that can fail simultaneously
- Peak hours: Outages during US business hours (9 AM - 5 PM EST) have maximum impact
What to Use When AWS Is Down
| Need | Alternative |
|---|---|
| Compute (EC2) | Google Cloud Compute Engine, Azure VMs, DigitalOcean Droplets |
| Object Storage (S3) | Google Cloud Storage, Azure Blob Storage, Backblaze B2, Cloudflare R2 |
| Serverless (Lambda) | Google Cloud Functions, Azure Functions, Cloudflare Workers |
| Database (RDS) | Google Cloud SQL, Azure Database, self-hosted on other clouds |
| CDN (CloudFront) | Cloudflare, Fastly, Akamai, Google Cloud CDN |
| DNS (Route 53) | Cloudflare DNS, Google Cloud DNS, NS1, Dyn |
| Container orchestration | Google GKE, Azure AKS, self-hosted Kubernetes |
| Message queues (SQS) | Google Pub/Sub, Azure Service Bus, RabbitMQ, Redis |
| NoSQL (DynamoDB) | Google Firestore, Azure Cosmos DB, MongoDB Atlas |
| Static site hosting | Netlify, Vercel, Cloudflare Pages, GitHub Pages |
For DevOps/SRE Teams During AWS Outages
If you're running production infrastructure on AWS:
- Multi-region architecture: Deploy critical services across 2+ AWS regions
- Multi-cloud strategy: Use Google Cloud or Azure as failover for critical workloads
- Health checks: Implement deep health checks that test actual functionality, not just HTTP 200
- Circuit breakers: Automatically fail over to backup systems when AWS degrades
- Status monitoring: Integrate API Status Check into incident response workflows
- Runbooks: Document manual failover procedures for AWS outages
- Cost awareness: Understand that multi-region costs 2-3x more but prevents revenue loss
- Terraform/IaC: Infrastructure as code enables rapid deployment to alternative regions/clouds
- Data replication: Continuously replicate critical data across regions (S3 cross-region, RDS read replicas)
- Communication plan: Have status page ready to inform customers of AWS-related issues
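The "deep health check" recommendation above can be as simple as asserting on an application-level marker in the health response instead of trusting any HTTP 200. A hedged sketch, where the `/health` path and the `"db":"ok"` field are assumptions about your app:

```shell
#!/usr/bin/env bash
# deep_health_ok <body> — pass only when the health response says its real
# dependencies are healthy (here, an assumed "db":"ok" JSON field), rather
# than treating any 200 response as healthy.
deep_health_ok() {
  echo "$1" | grep -q '"db"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage against a hypothetical endpoint:
# body=$(curl -fsS --max-time 5 https://app.example.com/health) \
#   && deep_health_ok "$body" && echo "healthy"
```

Route 53 string-matching health checks implement the same idea on the server side.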
For Development Teams During Outages
- Check region health: Try deploying to us-west-2 or eu-west-1 if us-east-1 is down
- Local development: Use LocalStack, AWS SAM local, or Docker Compose to work offline
- Test environments: Maintain test environment in different region from production
- Cache AWS CLI calls: Don't repeatedly query AWS APIs during outages (rate limiting)
- Document the outage: Log timestamps, error messages, and impact for post-incident reviews
- Alternative AWS accounts: Some teams maintain backup AWS accounts in different regions
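The region-failover idea in the list above can be scripted: try regions in preference order and use the first one whose probe succeeds. The probe is pluggable; a real one might run `aws ec2 describe-availability-zones --region "$1"` (shown as a comment), and the region list is an example:

```shell
#!/usr/bin/env bash
# pick_region <probe> <region...> — echo the first region whose probe
# command succeeds, or return nonzero if none respond.
pick_region() {
  local probe=$1; shift
  local region
  for region in "$@"; do
    if "$probe" "$region"; then
      echo "$region"
      return 0
    fi
  done
  return 1
}

# Example probe (illustrative): succeed if the region's EC2 API answers.
# ec2_probe() { aws ec2 describe-availability-zones --region "$1" >/dev/null 2>&1; }
# pick_region ec2_probe us-east-1 us-west-2 eu-west-1
```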
For Businesses Relying on AWS
- Financial impact assessment: Calculate cost per hour of AWS downtime (e-commerce, SaaS revenue)
- SLA awareness: AWS doesn't guarantee 100% uptime — understand your actual SLA
- Business continuity plan: Document procedures for operating during AWS outages
- Customer communication: Prepare status page updates explaining AWS-related incidents
- Insurance: Consider cyber insurance that covers cloud provider outages
- Vendor diversification: Don't put all infrastructure eggs in one cloud provider basket
- Monitoring investment: Track AWS uptime with API Status Check
- Legal contracts: Understand AWS's limited liability for service disruptions
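The "cost per hour" assessment above is simple arithmetic; the sketch below uses hypothetical placeholder figures, so substitute your own revenue and outage duration:

```shell
#!/usr/bin/env bash
# Back-of-envelope downtime cost. All numbers are hypothetical.
monthly_revenue=500000        # USD of online revenue per month (assumed)
hours_per_month=730           # average hours in a month
outage_hours=4                # a typical major regional outage

revenue_per_hour=$(( monthly_revenue / hours_per_month ))
echo "Revenue per hour: \$${revenue_per_hour}"
echo "Estimated loss for a ${outage_hours}h outage: \$$(( revenue_per_hour * outage_hours ))"
```

Comparing that hourly figure against the extra cost of a multi-region deployment makes the trade-off concrete.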
How API Status Check Helps Monitor AWS
API Status Check provides comprehensive AWS monitoring:
- Real-time uptime tracking: Monitor EC2, S3, Lambda, RDS, CloudFront, Route 53, and 20+ AWS services simultaneously
- Instant alerts: Email, SMS, Slack, Discord, webhook notifications within seconds of AWS degradation
- Region-specific monitoring: Track us-east-1, eu-west-1, ap-southeast-1, and all AWS regions independently
- Historical uptime data: See AWS outage trends, frequency, and duration over months/years
- Service-level granularity: Monitor specific services (Lambda in us-east-1) vs. entire regions
- API response time: Measure AWS API latency and detect performance degradation early
- Multi-region dashboards: Compare uptime across regions to identify geographic issues
- Integration support: Connect with PagerDuty, OpsGenie, Datadog, and incident management tools
- Compliance reporting: Generate uptime reports for SLA compliance and audits
- Cost correlation: Understand how AWS outages impact your business revenue and operations
Why This Matters for AWS Users
For SRE/DevOps Teams:
- Get alerted before customers complain about downtime
- Correlate application failures with AWS service health
- Make informed decisions during incidents (failover vs. wait)
- Track AWS reliability for vendor accountability
- Prove to stakeholders that outages were AWS-caused, not team-caused
For E-commerce/SaaS:
- Minimize revenue loss by detecting AWS issues instantly
- Communicate proactively with customers during AWS outages
- Understand AWS uptime trends for capacity planning
- Justify multi-region investments with historical outage data
For Engineering Managers:
- Track AWS as vendor risk in compliance frameworks
- Measure AWS impact on team productivity and on-call burden
- Make data-driven decisions about multi-cloud strategy
- Include AWS status in incident post-mortems
For CTOs/Decision Makers:
- Understand true cost of AWS downtime (revenue + reputation)
- Evaluate AWS vs. Google Cloud vs. Azure based on real uptime data
- Justify infrastructure investments (multi-region, multi-cloud)
- Present AWS reliability metrics to board and investors
Try it free — monitor AWS and 100+ other critical services with instant alerts.
Frequently Asked Questions
Is AWS down right now?
Check apistatuscheck.com/api/aws for real-time AWS status across all major services and regions, or visit AWS's official status page at status.aws.amazon.com. Also search Twitter/X for "AWS down" or your specific region like "us-east-1 down". If multiple sources confirm issues, AWS is likely experiencing an outage. Remember that AWS often has region-specific outages — us-east-1 may be down while eu-west-1 works perfectly.
Why can't I connect to my EC2 instance?
First, check if EC2 is experiencing an outage in your region at status.aws.amazon.com. If it's just you, common causes include: security group not allowing SSH (port 22) or RDP (port 3389) from your IP address, instance stopped or terminated, incorrect SSH key, VPN/firewall blocking access, or instance status checks failing. Try connecting via AWS Systems Manager Session Manager which doesn't require SSH port access.
Why is my Lambda function timing out?
Lambda timeouts are usually caused by: function exceeding configured timeout limit (max 15 minutes), cold starts in VPC (can add 10-30 seconds), inefficient code or database queries, waiting for external API that's slow/down, or insufficient memory allocation (Lambda CPU scales with memory). Check CloudWatch Logs for execution duration. During AWS outages, Lambda may experience degraded performance or "Service Unavailable" errors — check AWS status page for Lambda service health.
How long do AWS outages usually last?
Minor AWS outages typically resolve within 1-2 hours. Major regional outages (like us-east-1 incidents) can last 4-8 hours. The worst historical outages have lasted 24+ hours for complete service restoration. Service-specific outages (just Lambda or just S3) tend to resolve faster than multi-service cascading failures. Set up alerts at apistatuscheck.com to know immediately when services recover.
Why is S3 giving me Access Denied errors?
S3 Access Denied errors are usually caused by: missing IAM permissions for s3:GetObject or s3:PutObject, bucket policy blocking your request, S3 Block Public Access settings preventing access, incorrect bucket name or region, requester pays bucket requiring request payer header, or VPC endpoint policy restrictions. Use aws s3api get-bucket-policy and aws iam simulate-principal-policy to debug. During S3 outages, you may see 503 errors instead of Access Denied.
Why is CloudFront returning 502 or 503 errors?
CloudFront 502/503 errors typically indicate: origin (EC2/ALB/S3) is down or unreachable, origin security group not allowing CloudFront IP ranges, origin taking >60 seconds to respond (timeout), custom SSL certificate invalid or expired, or Lambda@Edge function failing. Check your origin health first with direct curl request. During AWS outages affecting EC2 or S3, CloudFront will return 5xx errors because the origin is unavailable. Use API Status Check to confirm if it's an AWS-wide issue.
Why isn't my Route 53 domain resolving?
DNS resolution failures are usually caused by: nameservers at registrar don't match Route 53 hosted zone, DNS records incorrectly configured (wrong IP, missing A record), TTL hasn't expired after recent changes (wait TTL seconds), Route 53 health checks failing for failover routing, or registrar account locked/suspended. Use dig yourdomain.com and whois yourdomain.com to diagnose. During Route 53 outages, DNS queries may return SERVFAIL or timeout completely.
What should I do during an AWS outage?
Check status.aws.amazon.com to confirm the outage and affected services/regions. If you have multi-region setup, failover to healthy region. Communicate with customers via status page about AWS-related issues. Avoid making infrastructure changes during outages (deployments, scaling) as they may fail unpredictably. Monitor API Status Check or AWS status page for recovery updates. Document the incident for post-mortem review. For critical services, consider having manual procedures ready (static maintenance page, direct database access, etc.).
How can I prevent AWS outages from affecting my business?
Deploy critical services across multiple AWS regions (multi-region architecture). Use multiple availability zones within each region (multi-AZ). Consider multi-cloud strategy with Google Cloud or Azure as fallback. Implement health checks and automatic failover with Route 53 or external DNS. Use CloudFront CDN to cache content and reduce origin dependency. Maintain database replicas in different regions. Have runbooks for manual failover procedures. Monitor AWS health with API Status Check for early warning. Remember: multi-region costs 2-3x more but prevents 100% revenue loss during regional outages.
Why does AWS status page show green when my services are down?
AWS's status page often has 15-60 minute delays before showing incidents. The status page itself has failed during outages (S3 outage in 2017). AWS defines "degraded" differently than complete failure — your use case may be broken while AWS shows yellow, not red. Some AWS internal services affect customers but aren't shown on status page. Use third-party monitoring like API Status Check, DownDetector, and Twitter/X to get real-time community reports. Trust your own monitoring over AWS status page during suspected outages.
Never miss an AWS outage again. Get free instant alerts when AWS or any critical service goes down. Essential for SRE teams, e-commerce platforms, SaaS applications, and any business running production workloads on AWS.