In Part 1, I explained that at a global scale, trust is part of the architecture. Not trust as a feeling, but trust as something the system must enforce and prove.
In this article, I aim to explain the 3 distinct responsibilities that enable the system to grow organically.
Why global systems feel complex
Most complexity in global systems does not come from services. It comes from mixing concerns.
I have seen this pattern many times:
- CloudWatch dashboards are used to answer audit questions
- CloudTrail logs are pulled into debugging workflows
- Metrics start carrying tenant identifiers just to be safe
None of these is wrong in isolation, but together they create systems that are:
- Hard to operate
- Hard to explain
- Hard to defend
The problem is that the system is trying to answer too many different questions at once. It is the same idea as the Single Responsibility Principle in code, applied to the system as a whole.
My job brought me to a point where I stopped thinking in terms of architectures and started thinking in terms of responsibilities. No matter how the application is built, it must answer the same 3 questions:
- How does work actually happen?
- How can we prove that work happened correctly?
- How do we know whether the system is healthy?
When these responsibilities are clearly separated, decisions become easier. When they are mixed, every discussion becomes a mess.
Responsibility #1 — Doing the work (execution)
This is the responsibility most devs are comfortable with.
It is where:
- Business logic runs
- Requests are processed
- Events are handled
- Workflows progress
In AWS terms, this is:
- AWS Lambda
- Step Functions
- EventBridge
- DynamoDB
- SQS
- SNS
This responsibility answers one question only:
What does the system do for the business?
And it should be optimised for:
- Correctness
- Scalability
- Resilience
- Isolation
Problems start when this responsibility is overloaded.
Examples:
- Embedding compliance logic directly into business code
- Adding just-in-case logging everywhere without structure
- Leaking operational concerns into domain logic
Execution code should focus on doing the work, not explaining or defending it.
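To make this concrete, here is a minimal Python sketch of execution code that only does the work. Every name in it (`Order`, `price_order`, `handle`, `on_outcome`) is hypothetical and not taken from any AWS SDK. The assumption is that whatever records evidence or emits metrics is injected from outside, so the business logic never needs to know about it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Order:
    # Hypothetical domain object, for illustration only.
    order_id: str
    quantity: int
    unit_price_cents: int


def price_order(order: Order) -> int:
    """Pure business logic: no logging, audit, or metric calls."""
    if order.quantity <= 0:
        raise ValueError("quantity must be positive")
    return order.quantity * order.unit_price_cents


def handle(order: Order, on_outcome: Callable[[str, dict], None]) -> int:
    """Thin handler: runs the work, then reports the outcome to an injected hook.

    What the hook does with the outcome (evidence, metrics, nothing at all)
    is decided outside the execution code.
    """
    try:
        total = price_order(order)
    except ValueError as exc:
        on_outcome("order_rejected", {"order_id": order.order_id, "reason": str(exc)})
        raise
    on_outcome("order_priced", {"order_id": order.order_id, "total_cents": total})
    return total
```

With this shape, `price_order` can be tested and changed without touching anything that exists to explain or defend the work.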
Responsibility #2 — Proving the work (evidence and control)
This responsibility exists because someone outside the team will ask questions like:
- Who had access?
- Who changed production?
- What data moved where?
- Was logging enabled at the time?
This responsibility is not about debugging. It is about proof.
In AWS, this responsibility is expressed through things like:
- AWS CloudTrail
- IAM configuration and access records
- Configuration history
- Retention policies
- AWS Audit Manager
A common issue is teams trying to reuse execution or observability data as evidence. That usually fails because:
- Logs change format
- Metrics are aggregated
- Dashboards get deleted
- Devs remember things differently
Evidence systems must be:
- Complete
- Consistent
- Tamper‑resistant
This is why this responsibility must be separate from execution.
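As an illustration of what tamper-resistant can mean in practice, here is a small Python sketch of a hash-chained record list. It is a toy model under my own assumptions, not a description of how any AWS service stores its data, and the names (`append`, `verify`, `GENESIS`) are hypothetical. Each entry's hash covers the previous entry's hash, so editing an old record invalidates the chain from that point on.

```python
import hashlib
import json

# Hypothetical starting value for the first entry in the chain.
GENESIS = "0" * 64


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(chain: list, record: dict) -> None:
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)})


def verify(chain: list) -> bool:
    """Recompute every hash; any edited or reordered entry makes this False."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Strictly speaking, this makes tampering detectable rather than impossible. It also shows why a log line with a changing format or an aggregated metric cannot play the same role: neither can be verified after the fact.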
Responsibility #3 — Understanding the system (operations)
This responsibility answers a very different question:
Is the system healthy right now?
Not:
- What happened to tenant X?
- Who changed this?
But:
- Are errors increasing?
- Is latency degrading?
- Is this regional or global?
The answers come from:
- Metrics
- Alerts
- SLOs
- Dashboards
In AWS environments, this usually means:
- CloudWatch metrics
- Amazon Managed Prometheus
- Service telemetry
- Alerts
These services exist to trigger investigation and action.
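The operational questions above can be answered from aggregated numbers alone. A minimal Python sketch, assuming per-region request and error counts have already been collected for some time window. The `Window` type, the function names, and the threshold are hypothetical and only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Window:
    # Aggregated counts for one region over one time window (hypothetical shape).
    region: str
    requests: int
    errors: int


def error_rate(window: Window) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return window.errors / window.requests if window.requests else 0.0


def breaching_regions(windows: List[Window], max_error_rate: float) -> List[str]:
    """Regions whose error rate is above the agreed objective."""
    return sorted(w.region for w in windows if error_rate(w) > max_error_rate)


def scope(windows: List[Window], max_error_rate: float) -> str:
    """Classify the situation as 'healthy', 'regional', or 'global'."""
    breaching = breaching_regions(windows, max_error_rate)
    if not breaching:
        return "healthy"
    if len(breaching) == len(windows):
        return "global"
    return "regional"
```

Nothing here knows about tenants or about who changed what. It only says whether something needs attention and how wide the problem is.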
Why mixing responsibilities breaks systems
Once these 3 responsibilities are separated, many things become obvious. For example:
- Metrics with tenantId: usually an execution detail leaking into operations, which is not what metrics are meant for.
- CloudWatch dashboards as audit: dashboards explain system behaviour, while auditors need immutable, verifiable evidence.
- Debugging incidents by scrolling through CloudTrail: CloudTrail is excellent at answering who did what, but it is a terrible tool for answering why the system is behaving this way right now.
Each of these feels fine on its own, but at scale it creates confusion about what the system is actually trying to tell us.
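For the tenantId case, one way to keep the boundary visible is to decide up front which fields may become metric dimensions and send everything else to a structured log record. A minimal Python sketch, where `ALLOWED_DIMENSIONS` and `split_signal` are hypothetical names and the allowed set is only an example:

```python
from typing import Dict, Tuple

# Assumption: only low-cardinality fields are allowed as metric dimensions.
ALLOWED_DIMENSIONS = {"service", "region", "operation"}


def split_signal(context: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split request context into metric dimensions and per-request detail.

    Dimensions feed the operational signal. The detail (tenantId and similar)
    belongs in a structured log record, not in the metric.
    """
    dimensions = {k: v for k, v in context.items() if k in ALLOWED_DIMENSIONS}
    detail = {k: v for k, v in context.items() if k not in ALLOWED_DIMENSIONS}
    return dimensions, detail
```

For example, a context of `service`, `region`, and `tenantId` yields dimensions with only `service` and `region`, while `tenantId` ends up in the detail. In CloudWatch, each unique combination of dimension values is a separate metric, so a per-tenant dimension multiplies the number of metrics by the number of tenants.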
Benefits of separation
Once responsibilities are separated, conversations change.
Instead of:
Should we centralise logs?
I ask:
Which responsibility are we trying to serve?
Instead of:
Why can't I just add tenantId to metrics?
I ask:
Is this an operational signal or an accounting question?
Instead of:
Why is governance slowing us down?
I ask:
Which responsibility are we trying to satisfy?
The trade‑offs are the same, but they are made explicit.
Conclusion
From a governance angle, global systems do not fail because they are distributed. They fail because we ask one system to do everything at once.
Separating responsibilities does not reduce complexity; it puts complexity where it belongs.