Matt

Posted on • Originally published at fortem.dev

It's Friday at 6pm. Your Developer Can't Restart Staging Without You.

Originally published at https://fortem.dev/blog/ecs-staging-self-service
Platform engineers become the single point of failure for staging ops when developers have no safe, scoped way to act. Here's how to fix it with ECS environment RBAC.


The 6pm Slack message

It's Friday, 6:47pm. You're at dinner. Your phone buzzes.

Jamie, 6:47 PM

hey — staging is down, orders-api won't start. i have a smoke test to finish before the monday deploy. can you take a look?

You open the AWS Console on your phone. Fargate console on mobile is a special kind of awful — tiny text, nested dropdowns, a task definition ARN you have to scroll sideways to read. You find the service, stop the broken task, wait for it to restart. The new task fails to start too. You check the CloudWatch logs. Missing environment variable. You update the task definition, force a new deployment. Fifteen minutes. By the time the service is healthy, dinner is cold and you've lost the conversation.

Monday, Jamie finishes the smoke test in 20 minutes and the deploy goes out fine.

Jamie didn't need you to debug a config issue. Jamie needed to restart a service. The entire incident was a permission problem — and it happens on most teams with 10+ environments, at least twice a week.

Why this keeps happening

AWS IAM has no native concept of an environment for ECS. You can grant someone ecs:UpdateService, but with a wildcard resource that is access to every ECS service in the account, including production. You can scope it by resource ARN, but when your environments have 15 services each, maintaining those policies manually becomes its own full-time job.
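To make the maintenance burden concrete, here is a minimal sketch of generating an ARN-scoped policy for one environment. The region, account ID, cluster name, and service names are placeholders, and the function name is hypothetical; only the ECS service ARN format and the two IAM action names come from AWS.

```python
import json


def staging_restart_policy(region, account_id, cluster, services):
    """Build an IAM policy document that allows restarting only the listed services."""
    resources = [
        f"arn:aws:ecs:{region}:{account_id}:service/{cluster}/{name}"
        for name in services
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RestartStagingServicesOnly",
                "Effect": "Allow",
                "Action": ["ecs:UpdateService", "ecs:DescribeServices"],
                "Resource": resources,
            }
        ],
    }


# Placeholder values for illustration only.
policy = staging_restart_policy(
    region="us-east-1",
    account_id="123456789012",
    cluster="staging",
    services=["orders-api", "payments-api", "web"],
)
print(json.dumps(policy, indent=2))
```

Every new service, rename, or new environment means regenerating and reattaching a document like this one, which is the upkeep described above.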

So most platform engineers made the only rational decision available to them: they kept the keys and became the gatekeeper. Developers file a Slack request, platform engineer handles it, developers wait.

The platform engineer didn't choose to be a deployment gatekeeper. They became one because the alternative — handing over AWS Console access — was genuinely risky. The right answer is a permission layer that doesn't exist in native AWS.

The cost is invisible because it's spread across the week in small increments. A Slack ping here, a 15-minute console task there. But add them up: 3–8 interruptions per week for a mid-sized team, or roughly 12–32 a month. Each one breaks a flow state that takes 20 minutes to rebuild. Each Friday or weekend message is unpaid on-call work for a non-incident.
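As a rough back-of-the-envelope check, the figures above can be multiplied out. This sketch assumes every interruption costs one 15-minute task plus one 20-minute refocus, and that a month is four weeks; both are simplifications, not measurements.

```python
TASK_MINUTES = 15      # "a 15-minute console task"
REFOCUS_MINUTES = 20   # time to rebuild flow state after each interruption
WEEKS_PER_MONTH = 4    # assumption: a four-week month


def monthly_cost_hours(interruptions_per_week):
    """Hours per month lost to staging-ops interruptions under the assumptions above."""
    minutes = interruptions_per_week * (TASK_MINUTES + REFOCUS_MINUTES) * WEEKS_PER_MONTH
    return minutes / 60


low = monthly_cost_hours(3)
high = monthly_cost_hours(8)
print(f"{low:.1f} to {high:.1f} hours per month")  # 7.0 to 18.7 hours per month
```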

The 5 ops that cause 80% of interruptions

Most platform engineers, when they audit their staging-ops interruptions, find the same five actions accounting for nearly all of them:

  1. Restart a crashed or stuck service. A task died, maybe due to a failed health check or OOM. The developer knows it; they just can't restart it.

  2. Redeploy the latest image. A new build was pushed to ECR. The developer wants to pick it up in staging without waiting for the next CI run to trigger a deployment.

  3. Read logs. The service is behaving strangely. The developer needs to tail CloudWatch, not navigate five levels of AWS console to get there.

  4. Flush a Redis cache. Bad data got written. A key needs to be cleared so the service reads fresh state. One operation, one line of code if they had access.

  5. Run a one-off task. A database migration, a data backfill, a cleanup script. Not a deployment, but a single-run task against staging data.

None of these require infrastructure knowledge. None of them should require a platform engineer.
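For a sense of how small these operations are, here is a sketch of each one as a thin wrapper around a client object. It assumes `ecs` and `logs` behave like boto3's ECS and CloudWatch Logs clients and `redis_client` behaves like a redis-py client; the wrapper names, the guard in `flush_keys`, and all arguments are illustrative, not anything Fortem ships.

```python
def restart_service(ecs, cluster, service):
    """Ops 1 and 2: force a new deployment of the current task definition.

    New tasks pull the image again, so a mutable tag such as `latest`
    picks up the most recent push.
    """
    return ecs.update_service(cluster=cluster, service=service, forceNewDeployment=True)


def recent_log_lines(logs, log_group, since_ms, limit=50):
    """Op 3: fetch recent CloudWatch log messages for one log group."""
    resp = logs.filter_log_events(logGroupName=log_group, startTime=since_ms, limit=limit)
    return [event["message"] for event in resp["events"]]


def flush_keys(redis_client, pattern):
    """Op 4: delete keys matching a pattern, iterating with SCAN instead of KEYS."""
    if pattern.strip() in ("", "*"):
        raise ValueError("refusing to delete every key")
    deleted = 0
    for key in redis_client.scan_iter(match=pattern):
        deleted += redis_client.delete(key)
    return deleted


def run_one_off(ecs, cluster, task_definition, subnets):
    """Op 5: run a single Fargate task, for example a migration."""
    resp = ecs.run_task(
        cluster=cluster,
        taskDefinition=task_definition,
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {"subnets": subnets, "assignPublicIp": "DISABLED"}
        },
    )
    return [task["taskArn"] for task in resp["tasks"]]
```

With real clients this would be called as, say, `restart_service(boto3.client("ecs"), "staging", "orders-api")`. The calls themselves are trivial; the hard part is deciding who may make them and against which environment.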

Why raw AWS Console access is the wrong answer

The obvious first instinct is: just give them limited AWS access. Create a developer IAM role with read and restart permissions.

In practice, this goes wrong in predictable ways:

  • IAM doesn't scope ECS permissions by environment; it scopes by account, region, and ARN. Unless every statement is pinned to staging ARNs, a policy that allows restarting staging services also allows restarting production services in the same account.
  • ARN-scoped policies break every time a service is renamed, a new environment is added, or an account is restructured. Someone has to maintain them.
  • AWS Console access gives visibility into things developers shouldn't see: secret ARNs, network config, IAM role names. Not a security catastrophe, but not ideal.
  • The audit trail lacks context. CloudTrail tells you which IAM principal ran which API call, but not why, from what context, or what the environment state was before and after.

The right answer isn't broader AWS access — it's a permission layer that understands environments.

How Fortem solves it

Fortem's self-service layer gives each developer a scoped view of environments they own. You assign ownership in the dashboard — takes about 15 minutes for a typical team. From that point, developers log in via SSO and see only their environments.

Within their assigned environments, here's exactly what they can and cannot do:

| Action | Can do? | Notes |
| --- | --- | --- |
| Restart a service | ✓ Yes | Scoped to assigned environments only |
| Redeploy to latest image | ✓ Yes | Uses the image already in the task definition |
| View / tail CloudWatch logs | ✓ Yes | Real-time and historical |
| Flush Redis keys (pattern-matched) | ✓ Yes | Pattern input required; no wildcard delete |
| Run one-off ECS tasks | ✓ Yes | From pre-approved task definitions only |
| Pause / resume environment schedule | ✓ Yes | Operator permission required |
| Touch any production resource | ✗ No | Prod is a separate environment class; disabled by default |
| Access AWS credentials or secrets | ✗ No | Fortem never exposes secrets to the UI |
| Modify task definitions or IAM roles | ✗ No | Infrastructure config is platform-team only |
| See environments not assigned to them | ✗ No | Environment scope is enforced server-side |

The key point is the last four rows. Production is off. AWS credentials are off. Infrastructure config is off. The scope boundary is enforced server-side — it's not just UI hiding.
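To illustrate what server-side enforcement means in general, here is a hypothetical sketch of a scope check that runs before any AWS call is made. It is not Fortem's implementation: the `Developer` type, the action names, and the environment registry are all invented for the example.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when a request falls outside the caller's scope."""


@dataclass(frozen=True)
class Developer:
    email: str
    environments: frozenset  # environment names assigned by the platform team


# Hypothetical action names mirroring the "Yes" rows of the table.
SELF_SERVICE_ACTIONS = frozenset(
    {"restart", "redeploy", "read_logs", "flush_redis", "run_task"}
)

# Hypothetical registry mapping each environment to its class.
ENVIRONMENT_CLASS = {"staging": "non-prod", "qa-eu": "non-prod", "prod": "production"}


def authorize(dev, environment, action):
    """Reject the request unless action, environment class, and ownership all pass."""
    if action not in SELF_SERVICE_ACTIONS:
        raise Forbidden(f"{action} is not a self-service action")
    if ENVIRONMENT_CLASS.get(environment) != "non-prod":
        raise Forbidden(f"{environment} is not open to self-service")
    if environment not in dev.environments:
        raise Forbidden(f"{dev.email} is not assigned to {environment}")


jamie = Developer("jamie@acme.co", frozenset({"staging"}))
authorize(jamie, "staging", "restart")  # returns None: the request may proceed
```

Because the check lives on the server, hiding a button in the UI is a convenience rather than the control: a crafted request for `prod`, or for an unassigned environment, fails the same way.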

No IAM changes required on your end. Fortem uses a cross-account role with the minimum permissions needed to perform ECS operations. Your developers authenticate via SSO — they never interact with AWS directly.

Before and after

| Situation | Before | After |
| --- | --- | --- |
| Staging service crashes Friday at 6pm | Developer Slacks platform team. Waits 2–14 hrs for someone to restart it. | Developer clicks Restart in Fortem. Service is up in 40 seconds. |
| New engineer needs to read staging logs | IAM ticket to security team. 1–3 business days. Maybe AWS Console access. | Platform engineer assigns log-viewer role in Fortem. Done in 2 minutes. |
| QA needs to flush Redis cache to test a bug | Blocked. Can't flush Redis without console access. Creates a ticket. | QA flushes specific key pattern in Fortem without touching AWS. |
| Developer wants to redeploy their branch to staging | Asks platform engineer. Gets queued. Usually done same day, sometimes tomorrow. | Developer triggers redeploy from Fortem. 3 clicks. |
| SOC 2 auditor asks who restarted staging last Tuesday | CloudTrail search, cross-reference IAM user, 2 hours of work. | Filter by environment and date in Fortem audit log. 30 seconds. |

The most common feedback from platform engineers after turning on self-service: the first week felt like they gave something up. The second week, they realized the thing they gave up was being woken up on Friday night.

What gets logged

Every action through Fortem creates an audit entry: who, what environment, what action, what time, what the service state was before and after.

Audit log — staging / orders-api

Fri May 17 18:52  jamie@acme.co         Service restart         staging   HEALTHY after 38s
Fri May 17 16:30  sam@acme.co           Redeploy (latest)       qa-eu     Deployed sha:a3f2b1
Thu May 16 09:14  kai@acme.co           Redis flush             staging   Pattern: session:*, 4 keys deleted

When your SOC 2 auditor asks who restarted staging last Tuesday, this is the answer — filtered and exported in under 30 seconds. You don't need to cross-reference CloudTrail against IAM users against a timezone conversion.
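As a sketch of why structured entries make that question cheap to answer, the sample log can be modeled as records and filtered by environment and date. The `AuditEntry` fields and the `find_entries` helper are hypothetical, and the year 2024 is an assumption, since the sample shows only weekday, month, and day.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class AuditEntry:
    at: datetime
    actor: str
    action: str
    environment: str
    outcome: str


# The three sample rows; the year is assumed.
ENTRIES = [
    AuditEntry(datetime(2024, 5, 17, 18, 52), "jamie@acme.co", "Service restart", "staging", "HEALTHY after 38s"),
    AuditEntry(datetime(2024, 5, 17, 16, 30), "sam@acme.co", "Redeploy (latest)", "qa-eu", "Deployed sha:a3f2b1"),
    AuditEntry(datetime(2024, 5, 16, 9, 14), "kai@acme.co", "Redis flush", "staging", "Pattern: session:*, 4 keys deleted"),
]


def find_entries(entries, environment, day, action=None):
    """Return entries for one environment on one calendar day, optionally one action."""
    return [
        e for e in entries
        if e.environment == environment
        and e.at.date() == day
        and (action is None or e.action == action)
    ]


hits = find_entries(ENTRIES, "staging", date(2024, 5, 17), action="Service restart")
print([e.actor for e in hits])  # ['jamie@acme.co']
```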

Audit retention is configurable: 90 days on the Fortem plan, 365 days on Enterprise.
