Trust Is the Architecture (Part 1)

#aws #serverless #platformengineering #governance

Over the past few years, I have written about serverless from a technical perspective, covering topics like latency, multi-region setups, and complex scenarios. That is all important, but this time I want to shift focus. Instead of building and deploying, I want to talk about what happens next: how to handle governance and answer the question:

How do I run this when the blast radius is large, the org is big, and somebody outside engineering needs proof?

This is Part 1, where I will introduce the main idea and share my perspective.

Looking back at my journey with serverless, I started out using managed services to deliver value and ship features faster. As traffic increased, I learned how concurrency, retries, throttling, and quotas work. That is when I added caching and other solutions to help the system scale. I also discovered the limits of the system the hard way, especially when multiple teams relied on the same shared components. That is when serverless stopped being just a way to build. At global scale, the problems become about:

How teams deploy and scale
How failures propagate
How data is accessed

My point of view has shifted from "Can we build it?" to "Can we trust it?"

When I say "trust", I am not talking about feelings but about how a system operates:

Can we trust that a tenant boundary is real?
Can we trust that production changes are controlled?
Can we trust that data does not leave the region?
Can we trust that when something goes wrong, we can explain what happened?
Can we trust that if an auditor asks, we have evidence, not just stories?

I have worked in places where trust is based on hope:

Daniele is careful
We all know not to touch that thing in production
I will fix it if it breaks

That is not bad, but it does not scale. Informal rules get forgotten, people get tired, and mistakes happen all the time. So when I say "Trust," I mean being able to make a claim about how the system works and back it up with evidence.

If I cannot rely on people, I start thinking in terms of claims like these:

"Only approved pipelines can deploy to production."
"EU personal data does not leave EU regions."
"A developer cannot disable logging."
"I can trace an incident end-to-end across services."

This means that "Trust" does not depend on people. It is the set of design decisions that lets the system itself ensure the right things happen and make the wrong things impossible, or at least extremely hard.
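One way to make this concrete is to write each claim down next to the control that enforces it and the place where evidence is recorded. The sketch below is only an illustration: the `TrustClaim` type, the `unbacked` helper, and the control and evidence pairings are assumptions made for this example, not a reference mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustClaim:
    statement: str    # the claim, readable by someone outside engineering
    enforced_by: str  # control that makes the wrong thing hard ("" if none yet)
    evidence: str     # where proof is recorded ("" if none yet)


# Illustrative pairings only; real controls depend on the organization.
CLAIMS = [
    TrustClaim("Only approved pipelines can deploy to production.",
               "SCP restricting deploy actions to the pipeline role",
               "AWS CloudTrail"),
    TrustClaim("EU personal data does not leave EU regions.",
               "Region guardrail on data stores",
               "AWS CloudTrail"),
    TrustClaim("A developer cannot disable logging.",
               "SCP denying cloudtrail:StopLogging",
               "AWS CloudTrail"),
    TrustClaim("I can trace an incident end-to-end across services.",
               "",  # no enforcing control yet, so this still relies on people
               "Logs and telemetry"),
]


def unbacked(claims):
    """Return the claims that lack either an enforcing control or evidence."""
    return [c for c in claims if not c.enforced_by or not c.evidence]


for claim in unbacked(CLAIMS):
    print("Still based on hope:", claim.statement)
```

Any claim that shows up in this list is one the system does not yet guarantee on its own.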

Serverless does not create the need for trust. It just makes the need more visible, since the application is split into many separate services. We can find some of the answers by combining:

AWS CloudTrail -> who did what, when
Logs and telemetry -> what happened at runtime
Snapshots -> what was true at a point in time
AWS Audit Manager -> automated evidence

If the application is global and regulated, or simply part of a large company, this becomes a core part of how the app is designed. The goal is to make sure the right evidence exists by default.
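As a small illustration of the "who did what, when" part, the sketch below reduces a CloudTrail record to the fields an investigator usually asks for first. The field names follow the CloudTrail record format; the sample values, the account ID, and the `who_did_what` helper are made up for this example.

```python
import json

# A trimmed CloudTrail record with made-up values.
RAW_RECORD = """
{
  "eventTime": "2024-05-01T10:15:30Z",
  "eventSource": "cloudtrail.amazonaws.com",
  "eventName": "StopLogging",
  "awsRegion": "eu-west-1",
  "userIdentity": {
    "type": "AssumedRole",
    "arn": "arn:aws:sts::111122223333:assumed-role/developer/alice"
  }
}
"""


def who_did_what(record: dict) -> dict:
    """Summarize a CloudTrail record as who / what / when / where."""
    identity = record.get("userIdentity", {})
    return {
        # Fall back to the identity type when no ARN is present.
        "who": identity.get("arn", identity.get("type", "unknown")),
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "when": record["eventTime"],
        "where": record["awsRegion"],
    }


summary = who_did_what(json.loads(RAW_RECORD))
print(summary)
```

The point is less the parsing than the habit: if every investigation starts from the same four fields, the evidence has to exist in that shape by default.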

When trust fails

When a system feels unsafe at scale, it is usually because of one of these problems:

1) Trust between teams

Goal: teams can ship independently without accidentally breaking each other, and without relying on heroics.

The reality is sometimes different:

Shared components with unclear ownership
Production access via long-lived admin roles
Open a ticket and wait for someone to solve it

That is why ownership, approval, and permissions need to be clear. For example, a good practice is to have:

AWS Organizations + Service Control Policies (SCPs) used as permission guardrails across the organization
Time-bound production access, for example temporary elevated access with TEAM
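To show what such a guardrail can look like, here is a minimal sketch that builds an SCP document denying a few deployment actions unless the caller is the pipeline role. The role name `deploy-pipeline`, the chosen actions, and the `deploy_guardrail` helper are assumptions for this example; a real policy needs to be reviewed against the actions each team actually uses.

```python
import json

# Hypothetical role name: whatever role the deployment pipeline assumes.
PIPELINE_ROLE_PATTERN = "arn:aws:iam::*:role/deploy-pipeline"


def deploy_guardrail(actions, allowed_principal_arn):
    """Build an SCP that denies the given actions to every other principal."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeployOutsidePipeline",
                "Effect": "Deny",
                "Action": sorted(actions),
                "Resource": "*",
                # The deny applies when the caller does NOT match the pipeline role.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": allowed_principal_arn}
                },
            }
        ],
    }


scp = deploy_guardrail(
    actions={"lambda:UpdateFunctionCode", "cloudformation:UpdateStack"},
    allowed_principal_arn=PIPELINE_ROLE_PATTERN,
)
print(json.dumps(scp, indent=2))
```

Generating the policy from code rather than editing JSON by hand also means the guardrail itself goes through review and version control.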

There is a trade-off: more structure means less freedom at first, but I have learned it leads to faster progress later.

2) Trust between systems - EDA

Goal: systems can interact through events without tight coupling.

The reality is sometimes different:

Events without owners
Schema changes without coordination
Accidentally ingesting sensitive data

This is where contracts come in, and tools like EventCatalog are essential because they make ownership, contracts, rules, and boundaries explicit.
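The idea of a contract can be sketched in a few lines. The example below is not EventCatalog's format or API; `EventContract`, `ORDER_PLACED`, the team name, and the field names are all hypothetical. It only shows the three things from the list above being checked: an owner exists, the version matches, and no undeclared (possibly sensitive) field slips in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventContract:
    name: str
    owner: str                 # the team accountable for this event
    version: int               # bumped on coordinated schema changes
    allowed_fields: frozenset  # anything else is treated as a violation


# Hypothetical contract for illustration.
ORDER_PLACED = EventContract(
    name="OrderPlaced",
    owner="orders-team",
    version=2,
    allowed_fields=frozenset({"orderId", "totalAmount", "currency"}),
)


def violations(contract: EventContract, event: dict) -> list:
    """Return human-readable problems; an empty list means the event conforms."""
    problems = []
    if not contract.owner:
        problems.append("contract has no owner")
    if event.get("version") != contract.version:
        problems.append(
            f"version {event.get('version')} does not match contract version {contract.version}"
        )
    undeclared = set(event.get("data", {})) - contract.allowed_fields
    if undeclared:
        problems.append(f"undeclared fields: {sorted(undeclared)}")
    return problems


event = {
    "version": 2,
    "data": {"orderId": "o-1", "totalAmount": 10, "currency": "EUR", "email": "a@example.com"},
}
print(violations(ORDER_PLACED, event))
```

Here the `email` field is flagged because nobody declared it, which is how accidental ingestion of sensitive data tends to start.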

There is a trade-off here too: more explicit contracts can feel slower at first, but the benefit is that I stop guessing about the cost of decoupling.

3) Trust across boundaries - data residency

Goal: prove where data is stored and processed, and prevent accidental cross-boundary movement.

Multi-region is relatively simple to implement. The hard part is compliance:

What data can be stored where
What processing is allowed where
What needs to stay inside the region (logs, audit, telemetry, tax, PII, etc.)

Guidance like the AWS Well-Architected SaaS Lens treats tenant isolation and deployment models as architectural decisions, not just operational ones. Compliance rules make it harder to centralize, so it is important to plan from day one that "regional by default" is the norm, not the exception.
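"Regional by default" can be expressed as a rule where data stays in its home region unless an exception has been written down. The sketch below is an assumption-heavy illustration: the data classes, the regions, and the `may_move` helper are invented for this example and do not represent any regulatory requirement.

```python
# Hypothetical exceptions: data class -> regions it may be copied to
# in addition to its home region. Anything not listed stays home.
EXPLICIT_EXCEPTIONS = {
    "anonymized-metrics": {"eu-west-1", "us-east-1"},
}


def may_move(data_class: str, home_region: str, target_region: str) -> bool:
    """Allow movement only within the home region or to a listed exception."""
    if target_region == home_region:
        return True
    # Unknown data classes have no exceptions, so they are denied by default.
    return target_region in EXPLICIT_EXCEPTIONS.get(data_class, set())


print(may_move("pii", "eu-central-1", "eu-central-1"))
print(may_move("pii", "eu-central-1", "us-east-1"))
print(may_move("anonymized-metrics", "eu-central-1", "us-east-1"))
```

The design choice that matters is the default: a new data class nobody has classified yet is treated as regional, so centralizing it requires a deliberate, recorded decision.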

4) Trust with external parties - auditors, regulators, customers

Goal: claims are backed up by evidence, quickly, repeatedly, and consistently.

I learned the hard way what evidence really means. For example:

Who had access?
Who changed production?
What data moved where?
Is logging complete and tamper-resistant?

Questions like these seem simple, but it is often hard to provide evidence after the application is built. Even when tools like CloudTrail are configured correctly, we still need to live with the reality of:

Logs exist but are inconsistent
Evidence exists but it takes weeks to assemble manually

That is why it is so important to follow best practices and use tools like AWS Audit Manager that help scale evidence collection by gathering and organizing data from AWS services and control sources. Most importantly, teams must follow the same patterns everywhere, making logs, metrics, and events consistent.
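Consistency is easier to keep when the pattern is enforced in code rather than by convention. As a minimal sketch, the helper below refuses to emit a log line unless a fixed set of fields is present. The field names in `REQUIRED_FIELDS` and the `audit_log` function are assumptions for this example; each organization would agree on its own set.

```python
import json
from datetime import datetime, timezone

# Hypothetical set of fields every team agrees to include in every log line.
REQUIRED_FIELDS = ("service", "tenant_id", "region", "correlation_id")


def audit_log(message: str, **context) -> str:
    """Return a JSON log line, or raise if a required field is missing."""
    missing = [name for name in REQUIRED_FIELDS if name not in context]
    if missing:
        raise ValueError(f"missing log fields: {missing}")
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        **context,
    }
    # Sorted keys keep the output stable, which helps when comparing log lines.
    return json.dumps(entry, sort_keys=True)


line = audit_log(
    "order accepted",
    service="orders",
    tenant_id="t-1",
    region="eu-west-1",
    correlation_id="c-123",
)
print(line)
```

With the same fields in every service, assembling evidence becomes a query instead of a weeks-long manual exercise.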

Conclusion

If this article sounds boring and focused on governance, that is because it is. When serverless goes global, the hardest challenges are not Lambda timeouts or DynamoDB keys. The challenges are:

making boundaries real (and preventing cross-boundary mistakes)
scaling delivery without central bottlenecks
making event-driven coupling visible and owned
producing evidence on demand

That is why I say trust is part of the architecture. Bureaucracy usually slows things down and adds extra work, but when it is automated and built into daily routines, it reduces confusion and helps prevent unsafe situations. In theory, this lets me move faster without breaking things.