
Matt

Posted on • Originally published at fortem.dev

Who Restarted Prod? How to Find It in CloudTrail

Who Restarted Prod? ECS Audit in CloudTrail

Originally published at https://fortem.dev/blog/ecs-audit-log-compliance
Every ECS change — UpdateService, StopTask, RunTask — lands in CloudTrail with who, when, and from where. Three CLI commands find the culprit in under 2 minutes.


Use Case · June 16, 2026 · 8 min read

ecs-audit-log · ecs-compliance-logging · aws-ecs-cloudtrail


Your ECS service restarted. Or a task was manually stopped. Or desiredCount dropped to zero and nobody admits it. The ECS console shows WHAT happened — not WHO. CloudTrail has the answer, and three CLI commands get you there in under two minutes.

TL;DR

  • CloudTrail captures every ECS API call (UpdateService, StopTask, RunTask, RegisterTaskDefinition) with who, when, and from where.
  • Event History is free for the last 90 days. Three CLI commands find the culprit in under 2 minutes.
  • The userIdentity field tells you human vs CI/CD vs AWS service. Root account activity in ECS is always suspicious.
  • Download the skill file: an AI agent runs the full fleet audit and produces a structured report automatically.

Why the ECS events tab doesn't tell you who did it

ECS events show WHAT happened — "service updated", "task stopped" — but not WHO. The userIdentity lives in CloudTrail, not in the ECS console. That's the gap most teams waste an hour trying to bridge.

You open the ECS service page. Under Events: "service my-api has started 1 tasks" at 14:23, "service my-api has stopped 1 running tasks" at 14:21. Something stopped your service and triggered a redeploy. The ECS console stops there — it doesn't record the API caller, the IAM identity, or whether it was a human clicking the console or Terraform applying a change.

| ECS Events tab | CloudTrail |
|---|---|
| Shows WHAT happened | Shows WHO did it, WHEN, and FROM WHERE |
| Service-level messages only | All API calls including StopTask, UpdateService, RunTask |
| No API caller info | userIdentity: human, CI/CD role, or AWS service |
| Kept for a few hours | 90-day Event History, free |
| Not queryable | Searchable by event name, username, resource, IP |

KEY INSIGHT: CloudTrail records ECS management API calls automatically, with no setup required. The 90-day Event History is free: there is nothing extra to enable or pay for. The only thing missing is knowing where to look.

Three commands to find the culprit in under 2 minutes

aws cloudtrail lookup-events with AttributeKey=EventName filters to specific actions. Pipe through jq to extract userIdentity.userName, eventTime, and sourceIPAddress. Covers the last 90 days at no charge.

Find who stopped a task

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=StopTask \
  --query 'Events[*].CloudTrailEvent' \
  --output text | \
jq -r '. | {
  time: .eventTime,
  who: (
    if .userIdentity.type == "IAMUser" then .userIdentity.userName
    elif .userIdentity.type == "AssumedRole" then .userIdentity.sessionContext.sessionIssuer.userName
    else .userIdentity.type
    end
  ),
  from: .sourceIPAddress,
  via: .userAgent,
  task: .requestParameters.task
}'

Find who updated a service (deployments, scale changes)

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=UpdateService \
  --query 'Events[*].CloudTrailEvent' \
  --output text | \
jq -r '. | {
  time: .eventTime,
  who: (
    if .userIdentity.type == "IAMUser" then .userIdentity.userName
    elif .userIdentity.type == "AssumedRole" then .userIdentity.sessionContext.sessionIssuer.userName
    else .userIdentity.type
    end
  ),
  via: .userAgent,
  service: .requestParameters.service,
  desiredCount: .requestParameters.desiredCount
}'

Narrow by specific user or role

# Find everything a specific IAM user did in the last 24h
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=Username,AttributeValue=john.smith \
  --start-time $(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null || date -u -v-24H +%Y-%m-%dT%H:%M:%SZ) \
  --query 'Events[*].{Time:EventTime,Event:EventName}' \
  --output table

Rate limit: lookup-events is capped at 2 requests/second per account per region. If you're scripting across many event types, add a 0.5s sleep between calls. Each request returns at most 50 events; pass --next-token to fetch the next page if you need more.
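A minimal sketch of that throttled loop, assuming a POSIX shell with the AWS CLI on PATH. The function name lookup_ecs_events and the LOOKUP_DELAY variable are illustrative, not part of any tool; --max-items 50 keeps each invocation to a single page.

```shell
# Seconds to wait between calls; 0.5s keeps a sequential loop
# at or below 2 requests/second. (LOOKUP_DELAY is a made-up name.)
LOOKUP_DELAY="${LOOKUP_DELAY:-0.5}"

# lookup_ecs_events EVENT_NAME...
# Runs one lookup-events call per event name, pausing between calls.
lookup_ecs_events() {
  for event_name in "$@"; do
    aws cloudtrail lookup-events \
      --lookup-attributes "AttributeKey=EventName,AttributeValue=${event_name}" \
      --max-items 50 \
      --query 'Events[*].{Time:EventTime,Event:EventName,User:Username}' \
      --output text
    sleep "$LOOKUP_DELAY"
  done
}
```

Usage: lookup_ecs_events StopTask UpdateService DeleteService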

Which ECS events map to which actions

UpdateService = scale change or deployment. StopTask = manual kill. RegisterTaskDefinition = new image or config. RunTask = standalone task launch. Each has a different userIdentity pattern worth knowing.

| Scenario | CloudTrail eventName | Who typically calls it |
|---|---|---|
| Service scaled up/down | UpdateService | Human, CI/CD, autoscaler |
| Deployment triggered | UpdateService + RunTask | CI/CD pipeline |
| Task manually stopped | StopTask | Human, script, ECS agent |
| New task definition | RegisterTaskDefinition | CI/CD pipeline, human |
| Service created/deleted | CreateService / DeleteService | Human, Terraform |
| Cluster deleted | DeleteCluster | Human, Terraform |

The most ambiguous one is StopTask. It appears in CloudTrail when a human manually stops a task, when a script does it, and when ECS itself stops a task during a rolling deployment. Check userIdentity.invokedBy — if it says ecs.amazonaws.com, ECS triggered the stop internally during service orchestration, not a human.
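The invokedBy check can be sketched as a small jq filter, assuming jq is installed and the input is raw CloudTrail StopTask event JSON. The function name classify_stop and the two labels are illustrative choices.

```shell
# classify_stop: reads one or more CloudTrail StopTask events (JSON) on stdin
# and labels each as an internal ECS stop or a stop requested by an outside caller.
classify_stop() {
  jq -c '{
    time: .eventTime,
    task: .requestParameters.task,
    initiator: (
      if .userIdentity.invokedBy == "ecs.amazonaws.com" then "ecs-internal"
      else "external-caller"
      end
    ),
    reason: (.requestParameters.reason // "")
  }'
}
```

To use it with the StopTask lookup shown earlier, keep the aws cloudtrail lookup-events part and replace the final jq stage with classify_stop. Events labeled external-caller are the ones worth tracing to a person or pipeline.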

Decoding userIdentity: human, CI/CD, or AWS service

userIdentity.type tells you who called the API: IAMUser = human, AssumedRole = CI/CD or Lambda, AWSService = autoscaler or ECS itself. Root type should never appear in ECS — alert immediately if it does.

| userIdentity.type | Meaning | How to extract the name |
|---|---|---|
| IAMUser | Human with IAM credentials | .userIdentity.userName |
| AssumedRole | CI/CD, Lambda, or human via role (including IAM Identity Center SSO sessions) | .userIdentity.sessionContext.sessionIssuer.userName |
| Root | AWS root account, alert immediately | type = Root is the signal |
| AWSService | AWS-owned service (autoscaling, ECS agent) | .userIdentity.invokedBy |
| AWSAccount | Cross-account call from another AWS account | .userIdentity.accountId |
| FederatedUser | Federated user with temporary credentials from GetFederationToken | .userIdentity.principalId |

The tricky one is AssumedRole. When a GitHub Actions pipeline runs aws ecs update-service, the CloudTrail event shows type: AssumedRole and the ARN of the role. The human-readable role name is in sessionContext.sessionIssuer.userName. That's the field to surface in your audit report — not the full ARN.
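When only the ARN is at hand, the role name can also be recovered from it. This sketch assumes the standard assumed-role ARN layout (arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION). The function names and the sample ARN are made up for illustration.

```shell
# role_from_arn ARN -> prints the role name from an assumed-role ARN.
role_from_arn() {
  rest="${1#*:assumed-role/}"   # strip everything up to "assumed-role/"
  printf '%s\n' "${rest%%/*}"   # keep the part before the session name
}

# session_from_arn ARN -> prints the session name (the last path segment).
session_from_arn() {
  printf '%s\n' "${1##*/}"
}

# Example with a hypothetical ARN:
#   role_from_arn    'arn:aws:sts::123456789012:assumed-role/deploy-prod/GitHubActions'
#   session_from_arn 'arn:aws:sts::123456789012:assumed-role/deploy-prod/GitHubActions'
```

The session name is set by whoever assumed the role, so it is a hint for the audit report rather than a verified identity.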

To distinguish console vs CLI vs Terraform, use the userAgent field:

| userAgent value | What called the API |
|---|---|
| console.amazonaws.com | AWS console (someone clicked) |
| aws-cli/2.* | AWS CLI (manual or script) |
| Terraform/1.* terraform-provider-aws/* | Terraform apply |
| github-actions/* | GitHub Actions CI/CD |
| ECS Console | ECS service console actions |
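The table can be turned into a rough classifier with a case statement. This is a sketch that only mirrors the patterns listed above. Real userAgent strings vary by tool version, so treat the labels as a first guess, and note that classify_user_agent is an illustrative name.

```shell
# classify_user_agent UA -> prints a coarse label for a CloudTrail userAgent string.
classify_user_agent() {
  case "$1" in
    console.amazonaws.com|"ECS Console") echo "console" ;;
    *terraform-provider-aws*|Terraform/*) echo "terraform" ;;
    github-actions/*) echo "github-actions" ;;
    aws-cli/*) echo "aws-cli" ;;
    *) echo "unknown" ;;
  esac
}
```

Usage: classify_user_agent "aws-cli/2.15.0 Python/3.11.6" prints aws-cli. Anything that falls through to unknown is worth reading by hand.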

KEY INSIGHT: If userIdentity.type is Root, stop everything else and investigate. Root credentials should never be used for routine ECS operations. A Root call in CloudTrail means either someone is using the root account directly (a security failure) or credentials were compromised.

Alerting in real time: EventBridge rule for critical ECS changes

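EventBridge can trigger a notification within seconds of a StopTask or UpdateService call, before you notice the incident. Two Terraform resources, a rule and a target, set it up. The example assumes an existing SNS topic and a CloudTrail trail that logs management events, which EventBridge needs in order to receive "AWS API Call via CloudTrail" events.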

Searching CloudTrail after an incident is reactive. EventBridge makes it proactive: you define a rule that matches specific CloudTrail events, and EventBridge triggers an SNS notification, Lambda, or Slack webhook immediately when the event occurs. For teams running 10+ ECS environments, catching a DeleteService before the on-call rotation starts saves significant incident response time.

Terraform: EventBridge rule for critical ECS events

resource "aws_cloudwatch_event_rule" "ecs_critical" {
  name        = "ecs-critical-changes"
  description = "Alert on destructive or suspicious ECS API calls"

  event_pattern = jsonencode({
    source      = ["aws.ecs"]
    detail-type = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["ecs.amazonaws.com"]
      eventName   = [
        "StopTask",
        "DeleteService",
        "DeleteCluster",
        "UpdateService"
      ]
    }
  })
}

resource "aws_cloudwatch_event_target" "ecs_critical_sns" {
  rule      = aws_cloudwatch_event_rule.ecs_critical.name
  target_id = "SendToSNS"
  arn       = aws_sns_topic.alerts.arn

  input_transformer {
    input_paths = {
      event   = "$.detail.eventName"
      who     = "$.detail.userIdentity.sessionContext.sessionIssuer.userName"
      time    = "$.time"
      service = "$.detail.requestParameters.service"
    }
    input_template = "\"ECS alert: <event> on <service> by <who> at <time>\""
  }
}

For UpdateService, add a second rule specifically for scale-to-zero: filter where requestParameters.desiredCount = 0. That's the most common accidental incident — someone running a cleanup script that hits the wrong environment.
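A sketch of the scale-to-zero rule, created here with aws events put-rule. The same pattern JSON could be passed to the event_pattern argument in Terraform. It assumes CloudTrail records desiredCount as a JSON number; the rule name ecs-scale-to-zero and both function names are hypothetical.

```shell
# scale_to_zero_pattern: prints an EventBridge event pattern that matches
# UpdateService calls whose requested desiredCount is exactly 0.
scale_to_zero_pattern() {
  cat <<'EOF'
{
  "source": ["aws.ecs"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["ecs.amazonaws.com"],
    "eventName": ["UpdateService"],
    "requestParameters": {
      "desiredCount": [{ "numeric": ["=", 0] }]
    }
  }
}
EOF
}

# create_scale_to_zero_rule [RULE_NAME]
# Creates (or updates) the rule; a target still has to be attached separately.
create_scale_to_zero_rule() {
  aws events put-rule \
    --name "${1:-ecs-scale-to-zero}" \
    --event-pattern "$(scale_to_zero_pattern)"
}
```

A deliberate scale-down to zero, such as a scheduled shutdown of a staging environment, will match too. Filtering on the caller or cluster is a reasonable next refinement.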

The Oct 2025 addition: ECS CloudTrail data events

Since October 2025, ECS supports CloudTrail data events for ContainerInstance agent API activity (ecs:Poll, ecs:StartTelemetrySession). These aren't in Event History — they require a CloudTrail trail or CloudTrail Lake.

AWS management events (UpdateService, StopTask, etc.) are what most teams need for incident response. The October 2025 addition is different: ECS now supports CloudTrail data events for ContainerInstance agent API calls — the low-level polling activity between the ECS agent and the control plane.

| | Management events | Data events (Oct 2025) |
|---|---|---|
| What they capture | UpdateService, StopTask, RunTask, etc. | ecs:Poll, ecs:StartTelemetrySession, ecs:PutSystemLogEvents |
| Cost | Free (Event History) | Additional CloudTrail charges |
| In Event History? | Yes, 90 days | No, trail or Lake required |
| Who needs them | Everyone (incident response) | EC2 launch type, compliance auditing |
| Resource type | n/a | AWS::ECS::ContainerInstance |

For most ECS Fargate teams, data events aren't needed for incident response — management events cover UpdateService and StopTask which is where incidents come from. Data events matter if you run EC2 launch type and need to audit ContainerInstance registration activity, or if compliance requires a full record of agent-to-control-plane communication. Enable them only if you have a specific requirement — at scale, ContainerInstance polling events generate significant volume and cost. Details in the ECS CloudTrail logging docs.
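If you do have such a requirement, enabling the data events on an existing trail might look like the sketch below. It assumes the trail uses advanced event selectors and takes the resource type from the table above. The function names are illustrative and the trail name is whatever yours is called. put-event-selectors replaces the trail's current selectors, so the sketch includes a management-events selector to keep those flowing. Check your existing selectors first with aws cloudtrail get-event-selectors.

```shell
# ecs_data_event_selectors: prints advanced event selectors that keep management
# events and add data events for ECS container instances.
ecs_data_event_selectors() {
  cat <<'EOF'
[
  {
    "Name": "Management events",
    "FieldSelectors": [
      { "Field": "eventCategory", "Equals": ["Management"] }
    ]
  },
  {
    "Name": "ECS container instance data events",
    "FieldSelectors": [
      { "Field": "eventCategory", "Equals": ["Data"] },
      { "Field": "resources.type", "Equals": ["AWS::ECS::ContainerInstance"] }
    ]
  }
]
EOF
}

# enable_ecs_data_events TRAIL_NAME
# Overwrites the trail's event selectors with the set printed above.
enable_ecs_data_events() {
  trail_name="${1:?usage: enable_ecs_data_events TRAIL_NAME}"
  aws cloudtrail put-event-selectors \
    --trail-name "$trail_name" \
    --advanced-event-selectors "$(ecs_data_event_selectors)"
}
```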

Download the skill file — let the AI agent do the audit

The skill file instructs an AI agent to pull all critical ECS CloudTrail events from the last 24 hours across every cluster in your account and produce a structured "who did what" report. Read-only — no changes applied.

ECS CloudTrail Audit Agent scans all clusters, pulls critical ECS events (Update

The agent lists all clusters, runs lookup-events for each critical event type, decodes the userIdentity, and produces a structured output: "Service X was updated at HH:MM by role deploy-prod via GitHub Actions from IP 140.82.114.3." It also flags Root account activity, unexpected source IPs, and scale-to-zero incidents. For teams where "who did this?" is a recurring post-incident question, this is the 2-minute version of the 20-minute manual process.

"To identify the user who initiates a StopTask API call, view StopTask in AWS CloudTrail for userIdentity information."

AWS Knowledge Center: Troubleshoot running task count changes in ECS


Book a 20-min fleet walkthrough: fortem.dev/book
