Orel Bello for AWS Community Builders

Posted on Feb 23

The Hard Truth About Platform Engineering Adoption

#aws #platformengineering #awscommunitybuilders #fintech

Intro

You know how it is. There's an age-old struggle between doing things the fast way and doing them the right way. The right way is usually slower, but we still have to deliver, and fast.

Platform Engineering adoption doesn't fail because of tooling.
It fails because of habits.

In a world where every request is critical and urgent, the struggle is real: do you fix it manually because it's urgent, or invest time in building automation that does it the right way? (I'll never forget the developer who opened a ticket with a severity of "Production is down" because his personal AWS account didn't work. For him, it was Production.)

So what would you do?

We're a small DevOps team supporting more than 200 engineers. Naturally, we started by handling requests manually, tying up loose ends and eliminating blockers.
After all, nobody wants to be the one slowing everyone down.

But if you continue doing things manually just to keep up with urgent requests, in the long run you'll slow the entire company down. What works for a company of 50 engineers doesn't always scale to 200 engineers.

That's where Platform Engineering comes into play.

What Platform Engineering Actually Means

The Platform Engineering concept is pretty simple. Instead of handing developers a fish, we give them a fishing rod.

Where a traditional DevOps team builds infrastructure for developers, a platform team gives developers the tools to build it themselves.

More importantly, Platform Engineering is about creating a default way of working.
Not just tools that developers can use, but paths they're expected to use.

The goal is to eliminate bottlenecks and accelerate innovation. If the DevOps team can't take a day off without the company going up in flames, something probably needs to change.

We need to adopt an enablement mindset. How can we give developers tools to work independently, while still keeping the organization's best practices?

Remember, the goal is to find ways to allow, not to block, although, as we will see later on, sometimes blocking is inevitable.

First Steps

The first thing we did was build a self-service platform, which we call the Buffet.

Now, whenever developers need a new MySQL user, a MongoDB cluster, a secret, or one of many other resources, they can create it automatically through the Buffet (which we implemented as a Slack bot), without waiting for DevOps.
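
To make the idea concrete, here is a minimal sketch of how a self-service bot can route requests to provisioning handlers. This is not the Buffet's actual code: the Request fields, the handler names, and the resource type strings are all hypothetical, and the handlers only return a message instead of touching real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Request:
    requester: str       # hypothetical: e.g. a Slack user handle
    resource_type: str   # hypothetical: e.g. "mysql_user"
    params: dict


# Registry mapping a resource type to the function that provisions it.
HANDLERS: Dict[str, Callable[[Request], str]] = {}


def handler(resource_type: str):
    """Decorator that registers a provisioning function for one resource type."""
    def register(func):
        HANDLERS[resource_type] = func
        return func
    return register


@handler("mysql_user")
def create_mysql_user(req: Request) -> str:
    # A real handler would talk to the database; this one only reports.
    return f"MySQL user '{req.params['username']}' requested by {req.requester}"


@handler("secret")
def create_secret(req: Request) -> str:
    return f"Secret '{req.params['name']}' requested by {req.requester}"


def dispatch(req: Request) -> str:
    """Route a request to its handler, or explain what is supported."""
    func = HANDLERS.get(req.resource_type)
    if func is None:
        supported = ", ".join(sorted(HANDLERS))
        return f"Unsupported resource '{req.resource_type}'. Try: {supported}"
    return func(req)
```

With a registry like this, supporting a new resource type means adding one decorated function, and anything unsupported gets a clear answer instead of a ticket.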

It's a win-win. Developers move much faster without waiting for DevOps, and DevOps has more capacity to focus on real work instead of manual support and acting as a help desk.

But is that all? No.

Self-service solves symptoms. It doesn't solve standardization.
Platform Engineering doesn't end here, and this is also where the real problems usually start.

The Challenges We Didn't Expect

Saying No

Migrating to a self-service portal instead of handling every request manually isn't always smooth. This is where DevOps needs to start saying "No."

Example:

"Can you please create me a MySQL user? It's urgent."
"We're moving all MySQL users to the self-service platform. If we keep doing this manually, we'll never finish building it. You'll need to use the Buffet."

This is usually the point where developers start feeling that DevOps is blocking them instead of helping.
But in order to invest in innovation, you need to stop doing things manually, even if it means developers will temporarily have to wait.

Short-term friction. Long-term acceleration.

It's not always pleasant, but it's necessary.

Lack of Standardization

That was just the tip of the iceberg. What came next hit us harder than expected.
The root of almost all our headaches was a lack of standardization.
Each team did things their own way, which made organization-wide improvements painful. That included:

Architecture
Not all services followed organization or AWS best practices.

Deployment
Teams wrote CI/CD pipelines differently, ran different pre-deploy checks, and deployed to AWS in their own ways.

Monitoring
Alarms, custom metrics, and dashboards varied widely.

Permissions
Access was inconsistent and sometimes overly permissive.

And that was just the beginning.

Every change felt risky, manual, and error-prone.

The real kicker was that making an organization-wide change required touching every repository individually. This created friction, extra manual work, and a high risk of mistakes.

Unicorn Startup Pressure

Doing all this while scaling a unicorn startup in a race for acquisition added even more pressure.

Legacy services, tight deadlines, and a high-growth environment made the transition especially tricky. There was no clean slate.

Everything had to keep working while we improved it.

So What Did We Do?

We tackled the most impactful challenges first.
Creating a self-service platform immediately eliminated a lot of manual work.

But as good as the self-service platform was, it handled only one aspect of Platform Engineering. Our biggest challenge remained the lack of standardization.

Building the Golden Path

We started creating a Golden Path, our organization's right way of doing things.

Our backend-platform team built the MDK (Melio Development Kit), an internal, opinionated CLI that generates and enforces AWS SAM service templates. Similar in spirit to the AWS CDK, it helps developers create a standardized template.yaml, which we use to deploy our services.
It wasn't only about building templates faster, although writing SAM templates manually is never fun.

More importantly, it finally allowed us to define how a service should look.

Let that sink in for a moment. It's a big deal.

With the MDK, we unlocked many opportunities:

Set best practices for AWS architecture, security, logging, tagging, and FinOps
No more wide IAM permissions
No more SQS without DLQs
No more using highly expensive resources without a reason
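
Rules like these can be checked mechanically once every service shares one template shape. The sketch below is not how the MDK works internally; it is an illustration that assumes the template has already been parsed into a Python dict, and it covers only two of the rules: SQS queues without a RedrivePolicy, and inline policy statements on AWS::Serverless::Function that allow Action "*". The function names are hypothetical.

```python
from typing import List, Set


def _as_list(value):
    """IAM and SAM accept a single item or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _dlq_targets(resources: dict) -> Set[str]:
    """Names of queues that other queues use as their dead-letter queue."""
    targets = set()
    for resource in resources.values():
        redrive = resource.get("Properties", {}).get("RedrivePolicy")
        target = redrive.get("deadLetterTargetArn") if isinstance(redrive, dict) else None
        if isinstance(target, dict) and "Fn::GetAtt" in target:
            ref = target["Fn::GetAtt"]
            targets.add(ref[0] if isinstance(ref, list) else ref.split(".")[0])
    return targets


def lint_template(template: dict) -> List[str]:
    """Return human-readable findings for the two rules described above."""
    resources = template.get("Resources", {})
    dlqs = _dlq_targets(resources)
    findings = []
    for name, resource in resources.items():
        rtype = resource.get("Type")
        props = resource.get("Properties", {})

        # A queue needs a RedrivePolicy unless it is itself someone's DLQ.
        if rtype == "AWS::SQS::Queue" and "RedrivePolicy" not in props and name not in dlqs:
            findings.append(f"{name}: SQS queue has no dead-letter queue (RedrivePolicy)")

        if rtype == "AWS::Serverless::Function":
            for policy in _as_list(props.get("Policies")):
                if not isinstance(policy, dict):
                    continue  # managed policy name or ARN; not inspected here
                for stmt in _as_list(policy.get("Statement")):
                    if stmt.get("Effect") == "Allow" and "*" in _as_list(stmt.get("Action")):
                        findings.append(f"{name}: IAM statement allows Action '*'")
    return findings
```

A check like this could run in CI or inside the generator itself, so a violation is caught before the template ever reaches AWS.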

From the developer side, this also meant less guesswork and fewer decisions when starting a new service.

And the best thing?

Want to add a new feature across the organization? No more opening pull requests on hundreds of repositories, each one looking different.

Just open one pull request.

Or so we thought.

Adoption Challenges

Developers were comfortable with how they used to work and weren't excited about migrating their services to the MDK.

The naive solution was to make this a cross-organization initiative, prioritize it with product managers, and work closely with R&D.

In practice, enforcement became necessary.

The harsh truth is that you can only get so far by asking nicely. To really move forward, you have to define guidelines and enforce them.

Enforcement can happen at multiple levels:

Infrastructure guardrails (for example, AWS Service Control Policies)
Deployment blocking for non-compliant services
Clear deprecation timelines for old versions

Example: announce that developers have one month to start using the MDK. Other deployment methods will be deprecated and blocked. Teams that don't migrate won't be able to deploy new versions.
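
A deadline like this is easier to enforce when the pipeline itself knows about it. The following is a minimal sketch of such a gate, assuming the pipeline can report which deployment method a service uses; the "mdk" identifier, the function name, and the message format are all hypothetical.

```python
from datetime import date
from typing import Tuple

APPROVED_METHOD = "mdk"  # hypothetical identifier reported by the pipeline


def may_deploy(method: str, today: date, deadline: date) -> Tuple[bool, str]:
    """Return (allowed, message) for one deployment attempt."""
    if method == APPROVED_METHOD:
        return True, "ok"
    days_left = (deadline - today).days
    if days_left >= 0:
        # Grace period: allow the deploy but make the deadline visible.
        return True, f"deprecated: {days_left} days left to migrate to {APPROVED_METHOD}"
    return False, f"blocked: '{method}' deployments ended on {deadline.isoformat()}"
```

Warning during the grace period matters as much as blocking after it: teams see the countdown on every deploy, so the cutoff is never a surprise.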

It sounds aggressive, but this is often the only way to make progress at scale.

The same applies to versions. If developers keep using an old version of the MDK, new features won't help. Deprecation and enforced upgrades are necessary.
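
Enforcing upgrades comes down to comparing the version a team runs against a minimum supported one. Here is a small sketch, assuming plain dotted numeric versions such as 1.10.0 with the same number of parts on both sides (the MDK's real versioning scheme is not described here, and the function names are hypothetical).

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.10.0' or 'v1.10.0' into (1, 10, 0) so parts compare as numbers."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def is_supported(installed: str, minimum: str) -> bool:
    """True when the installed version is at least the minimum supported one."""
    return parse_version(installed) >= parse_version(minimum)
```

Comparing parsed tuples instead of raw strings avoids a classic trap: as strings, "1.10.0" sorts before "1.9.0", which would wrongly block a newer version.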

Where We Are Now

Today, with both the MDK and the Buffet, which we continuously improve, we're on the right track. But there is still a long way to go.

One of the clearest examples of why standardization matters is tagging.

It may look insignificant, but tagging unlocks many capabilities: ABAC (attribute-based access control) permissions, which are critical for least-privilege access; cost allocation per team; and quickly finding the owners of budget-eating services. Tagging is the foundation for all of them.

When moving to a Platform Engineering approach, we always need to operate on two paths in parallel.

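We must define guidelines and enforce them. For example, all new services must include predefined tags, and non-compliant deployments should be blocked. AWS Config can help here by detecting resources that are missing tags (for example, with its required-tags managed rule), while the blocking itself needs a preventive control such as an SCP or a check in the deployment pipeline.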
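
A pipeline-side version of that tag rule can be very small. The sketch below assumes the template is already parsed into a dict, that every resource in it supports tags (not true of every resource type in practice), and that the required keys are the hypothetical team, service, and cost-center. It accepts both tag shapes found in templates: a list of Key/Value pairs and a plain key-to-value map.

```python
from typing import Dict, FrozenSet, List

REQUIRED_TAGS = frozenset({"team", "service", "cost-center"})  # hypothetical keys


def _tag_keys(tags) -> set:
    """Extract tag keys from a list of {'Key', 'Value'} dicts or a plain map."""
    if isinstance(tags, dict):
        return set(tags)
    return {tag["Key"] for tag in tags or [] if isinstance(tag, dict) and "Key" in tag}


def missing_tags(template: dict, required: FrozenSet[str] = REQUIRED_TAGS) -> Dict[str, List[str]]:
    """Map each non-compliant resource name to its sorted list of missing tags."""
    report = {}
    for name, resource in template.get("Resources", {}).items():
        present = _tag_keys(resource.get("Properties", {}).get("Tags"))
        missing = sorted(required - present)
        if missing:
            report[name] = missing
    return report
```

An empty report means the deployment may proceed; anything else can fail the build with a message that names the resource and the tags it lacks.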

At the same time, we must migrate existing services, which often takes much longer.

The same principle applies to CI/CD, monitoring and observability, and yes, also AI.

We live in a world of constant AI FOMO. Adopting every new tool immediately leads to fragmentation, the opposite of standardization and the enemy of Platform Engineering.

We need to choose the right tools, define guidelines, and invest in proper rollout with training sessions, tutorials, documentation, and even hackathons.

Just like DevOps, Platform Engineering is a mindset and a methodology. It should be reflected everywhere.
Security, AI, FinOps, CI/CD, monitoring. The same approach applies to all of them.

Enable, don't block. But make the right way the easiest way.

Lessons Learned and Conclusion

Platform Engineering is never done. It's an ongoing journey.
The most important rule is standardization. Define guidelines and enforce them. That's the key.

Don't do the work for developers. Give them the tools to do it themselves.

Always prioritize building self-service tools over manual, repetitive work, no matter how urgent it feels, unless it's truly P0.

Remember that the self-service platform is only part of the story. As big as it is, it's not the whole picture.

Start as early as you can. It will have a massive impact later, when you need to support many legacy services that don't follow organizational guidelines.

The beginning will be hard, but it pays off. Platform Engineering boosts productivity, even if at first it slows you down.

If DevOps scales people, Platform Engineering scales standards.

And just as important, explain to developers why you're doing this and how it benefits them. You'll need their cooperation.
