You wake up to this email from AWS:
Irregular Activity Detected for Your AWS Access Key
As part of our standard monitoring of AWS systems, we observed anomalous activity in your AWS account that indicated your AWS access key(s), along with the corresponding secret key, may have been inappropriately accessed by a third party.
Your stomach drops. The email links to a compromised access key: AKIA1234567890ABCDEF. User: app-integration-user. Event: GetCallerIdentity. Time: yesterday at 12:11:58 UTC. IP: 198.51.100.50.
AWS gives you four steps:
- Rotate the key.
- Check CloudTrail for unwanted activity.
- Review account for unexpected usage.
- Respond to the support case.
Four steps. Clean. Linear. Assumes everything goes right.
It won't.
What AWS Documentation Assumes
AWS's steps assume:
- CloudTrail is already enabled and logs are queryable.
- Someone on your team knows how to read CloudTrail.
- You have time to investigate without pressure.
- The only damage is the exposed key.
- Rotating the key is enough to fix it.
In reality:
- CloudTrail might not be enabled. Or enabled but logs are in an S3 bucket nobody checks.
- The person who set up the account left months ago.
- You have 4 hours before customers start calling about errors.
- The attacker might have created backdoor credentials, roles, or policies while they were in.
- Rotating the key stops them from using that key. But if they left a trail of IAM users, keys, or assumed roles behind, you're still exposed.
What Actually Happened
You look at the details. The compromised key belongs to app-integration-user. A user who was supposed to only send emails via SES. Instead, someone called GetCallerIdentity from IP 198.51.100.50 at 12:11 UTC.
(If the compromised key is your root account's access key, treat this as a P1 incident. IAM policies cannot restrict root. Deactivate or delete the root key immediately, since root rarely needs an access key at all. Then audit all root activity for at least the last 30 days and contact AWS Security right away.)
That one call tells you:
- The key was exfiltrated (not guessed by brute force).
- The attacker tested it immediately to confirm it works.
- They learned your account ID and the ARN of the identity behind the key.
- Anything else they did came after that test, so its timestamp is where your investigation starts.
Now you need to answer: What did they do next?
This is where the 4-step plan breaks down. AWS doesn't tell you how to find that out if your logs aren't ready.
The Three Things That Actually Save You
1. Access to CloudTrail, Even If It's Basic
If CloudTrail is off or inaccessible, you're blind. You can't answer the question: What happened after that GetCallerIdentity call?
If CloudTrail is on:
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=AccessKeyId,AttributeValue=AKIA1234567890ABCDEF \
--start-time 2026-05-21T12:00:00Z \
--end-time 2026-05-21T14:00:00Z \
--region us-east-1
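You'll see the management API calls made with that key in that region during the window. lookup-events only covers the last 90 days of management events, one region at a time, and it does not include data events such as S3 object reads. Not glamorous. Not a dashboard. But it works. And it shows you the sequence: GetCallerIdentity → what came next.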
From a typical reconnaissance scenario, that query might show:
GetCallerIdentity (12:11:58)
ListUsers (12:12:05)
ListAccessKeys (12:12:12)
ListRoles (12:12:19)
ListPolicies (12:12:25)
GetUser (12:12:33, targeting 'admin-user')
The attacker is doing reconnaissance, mapping your account structure. That tells you what they might do next: assume the admin role, create a backdoor key, or escalate.
Without CloudTrail, you're guessing. With CloudTrail, even basic, you have facts.
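To read the sequence at a glance, the same lookup can be trimmed to a timeline. This is a minimal sketch: the function name key_timeline is made up for illustration, and it assumes the AWS CLI is configured with permission to call cloudtrail:LookupEvents.

```shell
# Print a compact timeline (time, event name, service) for one access key.
# Usage: key_timeline ACCESS_KEY_ID START END [REGION]
# key_timeline is a hypothetical helper name, not an AWS command.
key_timeline() {
    key_id=$1
    start=$2
    end=$3
    region=${4:-us-east-1}
    # lookup-events returns newest events first; sort puts them in time order.
    aws cloudtrail lookup-events \
        --lookup-attributes "AttributeKey=AccessKeyId,AttributeValue=${key_id}" \
        --start-time "$start" \
        --end-time "$end" \
        --region "$region" \
        --query 'Events[].[EventTime,EventName,EventSource]' \
        --output text | sort
}
```

Example: key_timeline AKIA1234567890ABCDEF 2026-05-21T12:00:00Z 2026-05-21T14:00:00Z. Because the lookup is regional, you would likely repeat it for each region you use.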
2. A Playbook
The four AWS steps are necessary but insufficient. A playbook is what you execute while following those steps, what you execute before the key is fully rotated, and what you execute after you think it's over.
A minimal playbook for a compromised key looks like this:
Immediate (first 30 minutes):
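- Do NOT delete the exposed key yet. Deactivate it as soon as the replacement key is live (section 3 covers the order). An inactive key is blocked but can be re-enabled if something breaks, and its ID and last-used data stay visible in IAM while you investigate.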
- Query CloudTrail for all events from that key in the last 30 days (not just the past hour).
- Check if that key was used to assume any roles or create temporary credentials. If yes, those STS tokens are in the wild and valid until expiration. Monitor those roles' activity separately.
- In parallel: create a new key for the application using that user, update your code/deployment.
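The temporary-credentials check in the list above could look something like this. It is a sketch, not a complete detection: sts_calls_for_key is a made-up helper name, and it only finds AssumeRole, GetSessionToken and GetFederationToken calls recorded in the region you pass in.

```shell
# List calls that mint temporary credentials, made with one access key.
# Usage: sts_calls_for_key ACCESS_KEY_ID START [REGION]
# sts_calls_for_key is a hypothetical helper name, not an AWS command.
sts_calls_for_key() {
    key_id=$1
    start=$2
    region=${3:-us-east-1}
    aws cloudtrail lookup-events \
        --lookup-attributes "AttributeKey=AccessKeyId,AttributeValue=${key_id}" \
        --start-time "$start" \
        --region "$region" \
        --query "Events[?EventName=='AssumeRole' || EventName=='GetSessionToken' || EventName=='GetFederationToken'].[EventTime,EventName,EventId]" \
        --output text
}
```

Any rows returned mean temporary credentials may still be valid. The full CloudTrail record for each EventId shows which role or session was involved.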
Investigation (first 2 hours):
- Look for any new IAM users, roles, or policies created in the same timeframe.
- Check for any API calls to sensitive services: RDS, Secrets Manager, KMS, S3 policy changes.
- Check CloudTrail for any actions after the GetCallerIdentity test that were anomalous (deletes, policy changes, cross-account AssumeRole).
- Verify the alternate contacts (Billing, Operations, Security) were not modified. An attacker could reroute support tickets or recovery emails.
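Two of the investigation checks above can be scripted ahead of time. The sketch below uses made-up helper names, and its list of event names is a starting point rather than an exhaustive catalogue of persistence techniques. IAM is a global service whose events are recorded in us-east-1, which is why the region is fixed.

```shell
# Search CloudTrail for IAM changes that could leave a backdoor behind.
# Usage: iam_persistence_events START
# The event-name list is an assumption about what matters; extend it as needed.
iam_persistence_events() {
    start=$1
    for name in CreateUser CreateAccessKey CreateLoginProfile CreateRole \
                AttachUserPolicy AttachRolePolicy PutUserPolicy PutRolePolicy \
                UpdateAssumeRolePolicy; do
        aws cloudtrail lookup-events \
            --lookup-attributes "AttributeKey=EventName,AttributeValue=${name}" \
            --start-time "$start" \
            --region us-east-1 \
            --query 'Events[].[EventTime,EventName,Username]' \
            --output text
    done
}

# Print the three alternate contacts so you can compare them to what you expect.
show_alternate_contacts() {
    for contact_type in BILLING OPERATIONS SECURITY; do
        aws account get-alternate-contact --alternate-contact-type "$contact_type" \
            || echo "${contact_type}: not set, or not readable with these permissions"
    done
}
```

Any row from iam_persistence_events that your team cannot explain deserves a closer look before you move on to containment.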
Containment (2-4 hours):
- Once the new key is confirmed working in production, mark the exposed key as inactive in the console.
- If the attacker created backdoor credentials or roles, delete them.
- If they touched resources, take snapshots or note the state for forensics.
Post-incident (next business day):
- Review all other IAM users in the account. Are there other keys that should be rotated? Other users with overprivileged access?
- Check S3 bucket policies, security group rules, and VPC peering for unexpected changes.
- Enable MFA on all human IAM users and the root account.
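For the post-incident review, the IAM credential report gives one CSV covering every user. The sketch below assumes the report's standard column names (user, password_enabled, mfa_active, access_key_1_active and so on) and looks them up by header rather than by position. The helper names are made up, and fetch_credential_report may need a retry if the report is still being generated.

```shell
# Download the credential report as CSV on stdout.
# Assumes a base64 that accepts -d; some older macOS versions use -D instead.
fetch_credential_report() {
    aws iam generate-credential-report >/dev/null
    aws iam get-credential-report --query Content --output text | base64 -d
}

# Read a report on stdin; print users who have a console password but no MFA.
users_without_mfa() {
    awk -F, '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["password_enabled"]) == "true" && $(col["mfa_active"]) != "true" {
            print $(col["user"])
        }
    '
}

# Read a report on stdin; print active keys last rotated before CUTOFF (YYYY-MM-DD).
# Usage: stale_active_keys CUTOFF
stale_active_keys() {
    awk -F, -v cutoff="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        {
            for (k = 1; k <= 2; k++) {
                active  = $(col["access_key_" k "_active"])
                rotated = $(col["access_key_" k "_last_rotated"])
                # ISO dates compare correctly as strings.
                if (active == "true" && substr(rotated, 1, 10) < cutoff)
                    print $(col["user"]), "key" k, rotated
            }
        }
    '
}
```

Usage: fetch_credential_report > report.csv, then users_without_mfa < report.csv and stale_active_keys 2026-01-01 < report.csv.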
A playbook turns panic into a sequence. It answers the question "what do I do first?" before you need the answer.
3. Rotate the Key Without Breaking Your Application
Here's the trap: the application using app-integration-user is running in production right now. It's sending emails, and it's using that exposed key.
If you delete the key immediately, the application fails. Customers' emails don't send. You get paged. You panic. You revert.
If you rotate the key slowly, the application keeps working while the attacker still has access.
The solution is simple: rotate before you block.
- Create a new access key for app-integration-user right now.
- Update your application to use the new key (redeploy or restart).
- Test that the application works with the new key.
- Only then mark the exposed key as inactive in the console.
This takes 10 to 15 minutes if you have automation. If you don't, it takes longer. But it works.
The attacker can't use the old key once it's inactive. Your application never stops. You avoid the panic of choosing between security and uptime.
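The rotate-before-you-block sequence might be scripted roughly as follows. This is a sketch under assumptions: the helper names are made up, the deploy step is yours to fill in, and an IAM user can hold at most two access keys, so the create step fails if the user already has two.

```shell
# Step 1: create the replacement key. Prints "AccessKeyId SecretAccessKey".
# The secret is shown only once, so send it straight to your secret store.
create_replacement_key() {
    aws iam create-access-key --user-name "$1" \
        --query 'AccessKey.[AccessKeyId,SecretAccessKey]' --output text
}

# Step 2 is your own deployment: update the application's configuration
# and redeploy or restart it. There is no generic command for that.

# Step 3: confirm the new credentials authenticate before blocking the old key.
# A brand-new key can take a few seconds to become usable.
verify_key() {
    AWS_ACCESS_KEY_ID=$1 AWS_SECRET_ACCESS_KEY=$2 \
        aws sts get-caller-identity --query Arn --output text
}

# Step 4: block the exposed key. Inactive (not deleted) keeps this reversible.
deactivate_key() {
    aws iam update-access-key --user-name "$1" --access-key-id "$2" --status Inactive
}
```

For the scenario in this post, the last step would be deactivate_key app-integration-user AKIA1234567890ABCDEF, run only after the application is confirmed healthy on the new key.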
If the key is embedded in a third-party tool or service, contact the vendor right away. Tell them the key is compromised. Ask them to help you rotate it. In parallel, mark the key as inactive in AWS. Once the vendor confirms the new key is working, you're done.
How to Build This Capacity Before You Need It
You don't build incident response by responding to incidents. You build it by preparing for them.
Start with CloudTrail
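# Assumes the bucket already exists with a policy that lets CloudTrail write to it.
aws cloudtrail create-trail \
--name my-organization-trail \
--s3-bucket-name my-cloudtrail-logs-bucket \
--is-multi-region-trail \
--region us-east-1
# A trail created from the CLI does not log until you start it.
aws cloudtrail start-logging \
--name my-organization-trail \
--region us-east-1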
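CloudTrail's free tier covers 90 days of management event history in the console, plus one copy of management events delivered to S3 by a trail. Event history alone is not enough for forensics: it is per region, it covers management events only, and anything older than 90 days is gone unless a trail was already delivering logs to S3. The S3 copy is where you'll do real investigation.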
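A trail that exists but is not logging is an easy thing to miss, so it may be worth a check you can run on a schedule. A minimal sketch, with a made-up helper name:

```shell
# Succeed only if the named trail is currently logging.
# Usage: trail_is_logging TRAIL_NAME
# trail_is_logging is a hypothetical helper name, not an AWS command.
trail_is_logging() {
    status=$(aws cloudtrail get-trail-status --name "$1" \
        --query IsLogging --output text)
    case "$status" in
        [Tt]rue) return 0 ;;
        *)       return 1 ;;
    esac
}
```

Usage: trail_is_logging my-organization-trail || echo "trail is NOT logging".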
For long-term retention, query CloudTrail logs in S3 using Athena:
SELECT
useridentity.arn,
eventname,
sourceipaddress,
eventtime
FROM cloudtrail_logs
WHERE eventtime > '2026-05-21'
ORDER BY eventtime DESC
LIMIT 100;
Run this query in the Athena console after you've configured Athena tables on your CloudTrail S3 bucket (AWS CloudTrail documentation has the setup steps).
If you're in an AWS Organization, create one organization trail from the management account. It logs every member account. Do that instead of one trail per account.
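An organization-wide trail could be created along these lines. This sketch assumes you run it from the management account, that trusted access for CloudTrail is enabled in AWS Organizations, and that the bucket policy allows organization-wide delivery. The helper name is made up.

```shell
# Create and start a multi-region trail that covers every account in the organization.
# Usage: create_org_trail TRAIL_NAME BUCKET_NAME
# create_org_trail is a hypothetical helper name, not an AWS command.
create_org_trail() {
    aws cloudtrail create-trail \
        --name "$1" \
        --s3-bucket-name "$2" \
        --is-organization-trail \
        --is-multi-region-trail \
    && aws cloudtrail start-logging --name "$1"
}
```

Usage: create_org_trail my-organization-trail my-cloudtrail-logs-bucket.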
Write Your Playbook Now
Don't write it during an incident. Write it next Tuesday.
Use the structure above or a template. Share it with your team. Update it when you learn something new. Version control it (GitHub, not a Google Doc).
Test Key Rotation Without Pressure
Pick a test application (or a test user with a limited policy) and rotate its key. How long does it take? Where do you get stuck? Fix those problems now.
If you have 15 applications using access keys, and rotation takes 30 minutes per app, an incident will take you 7.5 hours under pressure. That's a problem.
If you can rotate a key in 5 minutes because you automated it or have a runbook, an incident is 1.25 hours. Still not fun. But survivable.
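The arithmetic is simple enough to keep in your runbook and rerun as your inventory changes. This sketch assumes keys are rotated one after another, not in parallel. The helper name is made up.

```shell
# Estimate total rotation time: number of apps times minutes per app.
# Usage: rotation_estimate APPS MINUTES_PER_APP
rotation_estimate() {
    awk -v apps="$1" -v mins="$2" 'BEGIN {
        total = apps * mins
        printf "%d minutes (%.2f hours)\n", total, total / 60
    }'
}

rotation_estimate 15 30   # 450 minutes (7.50 hours)
rotation_estimate 15 5    # 75 minutes (1.25 hours)
```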
The Real Problem Isn't Technical
The email from AWS assumes you're ready. The four steps assume you have the foundation.
Most teams don't.
Not because they're careless. Because incident response looks like overhead until the incident happens.
Then it becomes the only thing that matters.
The three things that save you — CloudTrail access, a playbook, and the ability to rotate a key in 15 minutes — aren't expensive. They don't require a SIEM or a SOC or a fancy tool.
They require 4 hours of work before something breaks.
That's the gap AWS doesn't mention.