Matt

Posted on Jun 4 • Edited on Jun 30 • Originally published at fortem.dev

It's Friday at 6pm. Your Developer Can't Restart Staging Without You.

#aws #ecs #fargate #selfservice

Your Developer Can't Restart Staging Without You: The Fix

Originally published at https://fortem.dev/blog/ecs-staging-self-service
Platform engineers are the single point of failure for staging ops when developers can't safely act. Here's how ECS environment RBAC fixes it.

TL;DR

Platform engineers field 3–8 Slack interruptions per week because IAM has no environment-scoped ECS permissions — so they become the gatekeeper by default.
Five operations cause 80% of interruptions: restart a service, redeploy, view logs, flush Redis, run a one-off task. None require infrastructure knowledge.
Giving developers raw AWS Console access is the wrong fix: IAM scopes by ARN, not by environment, so any staging restart policy broad enough to maintain easily also covers production.
Fortem adds per-environment RBAC: developers see only their assigned environments and can act without AWS access. Setup takes 15 minutes. Production is off by default.

The 6pm Slack message

Without per-environment RBAC, every ECS staging restart requires a platform engineer — turning a 30-second fix into a 2–14 hour wait and an off-hours Slack message.

It's Friday, 6:47pm. You're at dinner. Your phone buzzes.

Jamie, 6:47 PM

hey — staging is down, orders-api won't start. i have a smoke test to finish before the monday deploy. can you take a look?

You open the AWS Console on your phone. Fargate console on mobile is a special kind of awful — tiny text, nested dropdowns, a task definition ARN you have to scroll sideways to read. You find the service, stop the broken task, wait for it to restart. The new task fails to start too. You check the CloudWatch logs. Missing environment variable. You update the task definition, force a new deployment. Fifteen minutes. By the time the service is healthy, dinner is cold and you've lost the conversation.

Monday, Jamie finishes the smoke test in 20 minutes and the deploy goes out fine.

Jamie didn't need you to debug a config issue. Jamie needed to restart a service. The entire incident was a permission problem — and it happens on most teams with 10+ environments, at least twice a week.

Why this keeps happening

IAM has no environment-scoped ECS permissions, and an unrestricted ecs:UpdateService grant applies to every service in the account, so platform engineers become the default gatekeeper rather than risk broad console access.

“Platform engineers become the single point of failure for staging ops when developers have no safe, scoped way to act.”

— Observed pattern across ECS teams at scale

Developers don't have scoped AWS access because the alternative — broad IAM — is dangerous. But 'no access' creates a single point of failure: the platform engineer. The middle ground — scoped, per-environment RBAC — is the solution nobody bothers to set up.

AWS IAM doesn't have environment-scoped permissions for ECS. You can grant someone ecs:UpdateService, but without a resource restriction that's access to every ECS service in the account, including production. You can scope it by resource ARN, but when your environments have 15 services each, maintaining those policies manually becomes its own full-time job. The challenges of managing ECS multi-environment strategies compound quickly once a team grows past three or four environments.
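To make the maintenance burden concrete, here is a minimal sketch of generating an ARN-scoped policy for one environment. The account ID, region, cluster name, and service names are placeholders, and the one-cluster-per-environment layout is an assumption, not something IAM requires. Every renamed or added service means regenerating and re-attaching a document like this.

```python
import json

def service_arn(region, account_id, cluster, service):
    # ECS service ARN: arn:aws:ecs:<region>:<account>:service/<cluster>/<service>
    return f"arn:aws:ecs:{region}:{account_id}:service/{cluster}/{service}"

def restart_policy(region, account_id, cluster, services):
    """Build an IAM policy document allowing ecs:UpdateService on the listed services only."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "RestartListedServices",
            "Effect": "Allow",
            "Action": ["ecs:UpdateService"],
            # One ARN per service: this list is what has to be kept in sync by hand.
            "Resource": [service_arn(region, account_id, cluster, s) for s in sorted(services)],
        }],
    }

# Hypothetical account, region, cluster, and service names.
policy = restart_policy("us-east-1", "123456789012", "staging", ["orders-api", "billing-api"])
print(json.dumps(policy, indent=2))
```

With 15 services per environment and several environments, the Resource lists multiply accordingly.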

So most platform engineers made the only rational decision available to them: they kept the keys and became the gatekeeper. Developers file a Slack request, platform engineer handles it, developers wait.

The platform engineer didn't choose to be a deployment gatekeeper. They became one because the alternative — handing over AWS Console access — was genuinely risky. The right answer is a permission layer that doesn't exist in native AWS.

The cost is invisible because it's spread across the week in small increments. A Slack ping here, a 15-minute console task there. But add them up: at 3–8 per week for a mid-sized team, that is roughly 12 to 32 interruptions a month. Each one breaks a flow state that takes 20 minutes to rebuild. Each Friday or weekend message is unpaid on-call work for a non-incident.

The 5 ops that cause 80% of interruptions

Five actions — restart a service, redeploy the latest image, read logs, flush Redis, and run a one-off task — account for 80% of platform-engineer interruptions on ECS staging teams.

Most platform engineers, when they audit their staging-ops interruptions, find the same five actions accounting for nearly all of them:

1. Restart a crashed or stuck service. A task died, maybe due to a failed health check or OOM. The developer knows it but can't restart it.

2. Redeploy the latest image. A new build was pushed to ECR. The developer wants to pick it up in staging without waiting for the next CI run to trigger a deployment.

3. Read logs. The service is behaving strangely. The developer needs to tail CloudWatch, not navigate five levels of AWS console to get there.

4. Flush a Redis cache. Bad data got written. A key needs to be cleared so the service reads fresh state. One operation, one line of code if they had access.

5. Run a one-off task. A database migration, a data backfill, a cleanup script. Not a deployment, just a single-run task against staging data.

None of these require infrastructure knowledge. None of them should require a platform engineer.
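As a rough illustration of how small these operations are, the sketch below maps three of them onto the AWS API calls they typically come down to. It assumes boto3-style clients (the method and parameter names follow boto3's ECS and CloudWatch Logs clients), and the cluster, service, and log group names are hypothetical. A recording stand-in replaces the real clients so the sketch runs without AWS access. It is not a description of how Fortem implements these actions.

```python
class RecordingClient:
    """Stand-in for a boto3 client: records calls instead of contacting AWS."""
    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute becomes a method that records its keyword arguments.
        def call(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return call

def restart_service(ecs, cluster, service):
    # Forcing a new deployment replaces running tasks without editing the task definition.
    return ecs.update_service(cluster=cluster, service=service, forceNewDeployment=True)

def run_one_off_task(ecs, cluster, task_definition, subnets):
    # A single Fargate task, separate from any service deployment.
    return ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "DISABLED"}
        },
    )

def recent_logs(logs, log_group, since_ms, limit=100):
    # since_ms is a Unix timestamp in milliseconds.
    return logs.filter_log_events(logGroupName=log_group, startTime=since_ms, limit=limit)

# Hypothetical names; pass real boto3 clients instead of RecordingClient to act on AWS.
ecs = RecordingClient()
restart_service(ecs, "staging", "orders-api")
print(ecs.calls)
```

Each operation is one API call with a handful of parameters, which is why the bottleneck is permission rather than expertise.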

Why raw AWS Console access is the wrong answer

IAM cannot scope ecs:UpdateService by environment, only by ARN. A policy loose enough to allow staging restarts without constant upkeep also covers production, so granting console access trades one risk for a bigger one.

The obvious first instinct: give them limited AWS access. Create a developer IAM role with read and restart permissions.

In practice, this goes wrong in predictable ways:

IAM doesn't scope ECS permissions by environment. It scopes by account, region, and ARN. Unless every staging ARN is spelled out, a policy that allows restarting staging services also allows restarting production services in the same account.
ARN-scoped policies break every time a service is renamed, a new environment is added, or an account is restructured. Someone has to maintain them.
AWS Console access gives visibility into things developers shouldn't see: secret ARNs, network config, IAM role names. Not a security catastrophe, but not ideal.
There's no environment-aware audit trail. CloudTrail tells you which IAM principal ran which API call, but not why, from what context, or what the environment state was before and after.

The right answer isn't broader AWS access — it's a permission layer that understands environments. Teams following ECS Fargate best practices at scale consistently land on environment-scoped RBAC rather than direct console access for exactly these reasons.
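The over-matching problem described in the list above can be checked with a few lines. This sketch approximates IAM's wildcard matching on Resource ARNs with Python's fnmatch, which is an approximation and not IAM's actual evaluator. The ARNs and the cluster-per-environment naming are hypothetical.

```python
from fnmatch import fnmatchcase

def resource_matches(pattern, arn):
    # Approximation of IAM wildcards: '*' matches any run of characters, '?' one character.
    # fnmatch also treats '[' specially, which IAM does not; none of these ARNs contain it.
    return fnmatchcase(arn, pattern)

staging = "arn:aws:ecs:us-east-1:123456789012:service/staging/orders-api"
production = "arn:aws:ecs:us-east-1:123456789012:service/production/orders-api"

broad = "arn:aws:ecs:us-east-1:123456789012:service/*"           # easy to maintain
scoped = "arn:aws:ecs:us-east-1:123456789012:service/staging/*"  # depends on cluster naming

for pattern in (broad, scoped):
    print(pattern)
    print("  matches staging:   ", resource_matches(pattern, staging))
    print("  matches production:", resource_matches(pattern, production))
```

The broad pattern matches both ARNs. The scoped pattern excludes production only while every staging service lives in a cluster literally named staging, which is the kind of convention that breaks on a rename or an account restructure.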

How Fortem solves it

Fortem adds SSO-based, per-environment RBAC to ECS Fargate: developers see only assigned environments and can restart, redeploy, view logs, and flush Redis without any AWS Console access.

Fortem's self-service layer gives each developer a scoped view of environments they own. You assign ownership in the dashboard — takes about 15 minutes for a typical team. From that point, developers log in via SSO and see only their environments.

Within their assigned environments, the full permission breakdown:

Action	Can do?	Notes
Restart a service	✓ Yes	Scoped to assigned environments only
Redeploy to latest image	✓ Yes	Uses the image already in the task definition
View / tail CloudWatch logs	✓ Yes	Real-time and historical
Flush Redis keys (pattern-matched)	✓ Yes	Pattern input required — no wildcard delete
Run one-off ECS tasks	✓ Yes	From pre-approved task definitions only
Pause / resume environment schedule	✓ Yes	Operator permission required
Touch any production resource	✗ No	Prod is a separate environment class; disabled by default
Access AWS credentials or secrets	✗ No	Fortem never exposes secrets to the UI
Modify task definitions or IAM roles	✗ No	Infrastructure config is platform-team only
See environments not assigned to them	✗ No	Environment scope is enforced server-side

The key point is the last four rows. Production is off. AWS credentials are off. Infrastructure config is off. The scope boundary is enforced server-side — it's not UI hiding.
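A server-side scope check of the kind the table describes might look like the following minimal sketch. The class names, action names, and the opt-in flag for production are all hypothetical and chosen for illustration. This is not Fortem's implementation.

```python
from dataclasses import dataclass, field

# Hypothetical action names mirroring the "Yes" rows available to developers.
DEVELOPER_ACTIONS = frozenset({"restart", "redeploy", "view_logs", "flush_redis", "run_task"})

@dataclass(frozen=True)
class Environment:
    name: str
    is_production: bool = False
    production_self_service: bool = False  # production stays off unless explicitly enabled

@dataclass
class AccessPolicy:
    assignments: dict = field(default_factory=dict)  # user -> set of environment names

    def assign(self, user, env_name):
        self.assignments.setdefault(user, set()).add(env_name)

    def is_allowed(self, user, env, action):
        # Deny anything outside the self-service action list (task definitions, IAM, ...).
        if action not in DEVELOPER_ACTIONS:
            return False
        # Deny production unless it was explicitly opted in.
        if env.is_production and not env.production_self_service:
            return False
        # Otherwise allow only environments assigned to this user.
        return env.name in self.assignments.get(user, set())

policy = AccessPolicy()
policy.assign("jamie@acme.co", "staging")
staging = Environment("staging")
prod = Environment("production", is_production=True)
print(policy.is_allowed("jamie@acme.co", staging, "restart"))
print(policy.is_allowed("jamie@acme.co", prod, "restart"))
```

Because the check runs on every request and denies by default, hiding a button in the UI is a convenience, not the security boundary.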

Fortem runs inside your own AWS account and uses a least-privilege IAM role with the minimum permissions needed to perform ECS operations. Your developers authenticate via SSO — they never interact with AWS directly.
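The Redis row in the permission table calls for a pattern and rules out a blanket wildcard delete. One way such a guard could work is sketched below, under the assumption that a pattern must start with a literal prefix before any wildcard. The minimum prefix length and the in-memory dict standing in for Redis are illustrative choices. Against a real Redis instance the same idea would use SCAN with a MATCH pattern followed by DEL or UNLINK.

```python
from fnmatch import fnmatchcase

WILDCARDS = "*?["

def validate_pattern(pattern, min_prefix=3):
    """Reject patterns that do not begin with a literal prefix, such as a bare '*'."""
    prefix_len = next(
        (i for i, ch in enumerate(pattern) if ch in WILDCARDS), len(pattern)
    )
    if prefix_len < min_prefix:
        raise ValueError(f"pattern {pattern!r} needs a literal prefix of at least {min_prefix} characters")
    return pattern

def flush_matching(store, pattern):
    """Delete keys matching the pattern from a dict that stands in for Redis."""
    validate_pattern(pattern)
    doomed = [key for key in store if fnmatchcase(key, pattern)]
    for key in doomed:
        del store[key]
    return len(doomed)

# Hypothetical keys.
store = {"session:1": "a", "session:2": "b", "cart:9": "c"}
deleted = flush_matching(store, "session:*")
print(deleted, sorted(store))
```

A request for "*" alone fails validation before any key is touched.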

Before and after

With per-environment self-service, a Friday staging restart drops from a 2–14 hour Slack wait to 40 seconds — and the platform engineer never has to open AWS Console on their phone at dinner.

Situation	Before	After
Staging service crashes Friday at 6pm	Developer Slacks platform team. Waits 2–14 hrs for someone to restart it.	Developer clicks Restart in Fortem. Service is up in 40 seconds.
New engineer needs to read staging logs	IAM ticket to security team. 1–3 business days. Maybe AWS Console access.	Platform engineer assigns log-viewer role in Fortem. Done in 2 minutes.
QA needs to flush Redis cache to test a bug	Blocked. Can't flush Redis without console access. Creates a ticket.	QA flushes specific key pattern in Fortem without touching AWS.
Developer wants to redeploy their branch to staging	Asks platform engineer. Gets queued. Usually done same day, sometimes tomorrow.	Developer triggers redeploy from Fortem. 3 clicks.
SOC 2 auditor asks who restarted staging last Tuesday	CloudTrail search, cross-reference IAM user, 2 hours of work.	Filter by environment and date in Fortem audit log. 30 seconds.

The most common feedback from platform engineers after turning on self-service: the first week felt like they gave something up. The second week, they realized the thing they gave up was being woken up on Friday night.

KEY INSIGHT: Platform engineers on mid-sized ECS teams field 3–8 Slack interruptions per week — and 5 operations (restart, redeploy, view logs, flush Redis, run one-off tasks) account for 80% of them. Scoped per-environment RBAC eliminates the platform engineer as single point of failure.

What gets logged

Every Fortem action records actor, environment, operation, timestamp, and service state before and after — queryable in 30 seconds with no CloudTrail cross-referencing required.

Every action through Fortem creates an audit entry: who, what environment, what action, what time, what the service state was before and after.

Audit log — staging / orders-api

Fri May 17 18:52  jamie@acme.co         Service restart         staging   HEALTHY after 38s
Fri May 17 16:30  sam@acme.co           Redeploy (latest)       qa-eu     Deployed sha:a3f2b1
Thu May 16 09:14  kai@acme.co           Redis flush             staging   Pattern: session:* (4 keys deleted)

When your SOC 2 auditor asks who restarted staging last Tuesday, this is the answer — filtered and exported in under 30 seconds. You don't need to cross-reference CloudTrail against IAM users against a timezone conversion.
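The audit entry described above (who, which environment, what action, when, and state before and after) can be modeled as a small record, and the auditor's question becomes a filter over it. The field names are hypothetical, the year 2024 is assumed because the sample log omits it, and the before-states are placeholders since the sample log does not show them.

```python
from dataclasses import dataclass
from datetime import date, datetime

@dataclass(frozen=True)
class AuditEntry:
    timestamp: datetime
    actor: str
    environment: str
    operation: str
    state_before: str
    state_after: str

def filter_entries(entries, environment, day):
    """Return the entries for one environment on one calendar day."""
    return [e for e in entries if e.environment == environment and e.timestamp.date() == day]

# Entries modeled on the sample log; year and state_before values are assumptions.
entries = [
    AuditEntry(datetime(2024, 5, 17, 18, 52), "jamie@acme.co", "staging", "Service restart", "UNKNOWN", "HEALTHY"),
    AuditEntry(datetime(2024, 5, 17, 16, 30), "sam@acme.co", "qa-eu", "Redeploy (latest)", "UNKNOWN", "DEPLOYED"),
    AuditEntry(datetime(2024, 5, 16, 9, 14), "kai@acme.co", "staging", "Redis flush", "UNKNOWN", "FLUSHED"),
]

for entry in filter_entries(entries, "staging", date(2024, 5, 17)):
    print(entry.timestamp, entry.actor, entry.operation, entry.state_after)
```

Storing the environment on each entry is what removes the CloudTrail cross-referencing step: the filter needs no mapping from ARNs or IAM users back to an environment.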

Audit retention is configurable: 90 days on the Fortem plan, 365 days on Enterprise.

Fortem adds per-environment RBAC to your ECS fleet — developers self-serve restarts, redeployments, and log access without AWS Console access. Setup takes 15 minutes and production stays off by default.

Book a 20-min call →

Map your fleet in 5 min: fortem.dev/audit