When a system reaches a global scale, teams usually have plenty of data. The real problem is figuring out which signals matter.
Metrics, logs, traces, and audit records are all available. Dashboards are packed with information, and storage keeps increasing. When something goes wrong, the same questions come up:
- "The metrics look fine, but users are complaining"
- "The logs do not say much"
- "We cannot find the request everyone is talking about"
- "Are we even allowed to look at this data?"
In this article, I try to explain why these signals — metrics, logs, traces, and audit records — should be kept separate.
Responsibilities first
In Part 2, I introduced the idea of three responsibilities in a global serverless platform:
- Execution — doing the work
- Evidence — proving what happened
- Operations — understanding system health
Each of these serves a specific purpose. Problems begin when we use the same signals for different responsibilities just because it seems easier.
Metrics
Metrics are designed to answer one question quickly:
Is the system behaving normally?
They are designed to be:
- Aggregated
- Cheap to query
- Fast to evaluate
- Stable over time
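To make "fast to evaluate" concrete, here is a minimal sketch of the kind of check a metric alarm performs: aggregate counts over a window and compare against a threshold. The function name, data shape, and threshold are illustrative assumptions, not from the article.

```javascript
// Sketch: evaluate an aggregated error rate over a window of data points.
// Each point is a pre-aggregated minute bucket, the way metrics arrive.
function isSystemHealthy(dataPoints, maxErrorRate) {
  const total = dataPoints.reduce((sum, p) => sum + p.requests, 0);
  const errors = dataPoints.reduce((sum, p) => sum + p.errors, 0);
  if (total === 0) return true; // no traffic, nothing to alarm on
  return errors / total <= maxErrorRate;
}

// Five one-minute buckets with a spike, evaluated against a 2% budget
const dataPoints = [
  { requests: 1000, errors: 5 },
  { requests: 1200, errors: 8 },
  { requests: 900, errors: 120 }, // spike
  { requests: 1100, errors: 6 },
  { requests: 1000, errors: 7 },
];
console.log(isSystemHealthy(dataPoints, 0.02)); // → false
```

The important property is that the check never looks at individual requests — only at aggregates that stay cheap no matter how much traffic flows through.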
In AWS, metrics usually come from CloudWatch: emitted automatically by services such as Lambda, API Gateway, and DynamoDB, or published as custom metrics from application code.
Metrics are useful because they turn lots of events into clear trends.
This is why cardinality matters. If you include userId or tenantId, you can turn a few time-series into millions, which makes aggregation, alerting, and cost control much harder.
For example, a metric emission should look something like this:

```javascript
metrics.counter("orders_created_total").add(1, {
  region: process.env.AWS_REGION,
  service: "orders",
});
```
The above shows where to investigate, but not which user or tenant is involved.
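To make the cardinality point concrete, here is a rough back-of-the-envelope sketch. The counts are illustrative assumptions, not measurements:

```javascript
// Each unique combination of label values creates a separate time-series.
const regions = 3;
const services = 10;
const tenants = 100000;

// Low-cardinality labels: region + service
const seriesWithoutTenant = regions * services;

// Adding tenantId multiplies every existing series by the tenant count
const seriesWithTenant = regions * services * tenants;

console.log(seriesWithoutTenant); // 30
console.log(seriesWithTenant);    // 3000000
```

Thirty series are trivial to aggregate and alert on; three million are a storage and query problem before they are an observability problem.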
Logs
Logs exist to answer the question:
What happened inside this component?
Logs are:
- Local
- Detailed
- Contextual
- Tied to execution
In AWS, logs are typically collected via CloudWatch Logs, with Lambda and other services writing there automatically.
A log entry could look like this:

```json
{
  "level": "error",
  "message": "Failed to create invoice",
  "tenantId": "tenant-123",
  "correlationId": "abc-456",
  "service": "billing",
  "region": "eu-west-1"
}
```
Including tenantId is useful because:
- Logs are scoped
- Access is controlled
- Logs can be automatically deleted or archived
- Searches are intentional
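An entry like the one above can come from a small structured logger that carries tenant and correlation context with every line. This is a sketch, not a specific library's API; the field names simply follow the example:

```javascript
// Minimal structured logger: one JSON object per line, with the
// tenant/correlation context baked in at creation time.
function createLogger(baseContext) {
  const emit = (level, message, extra = {}) => {
    const entry = { level, message, ...baseContext, ...extra };
    console.log(JSON.stringify(entry));
    return entry; // returned to make the logger easy to test
  };
  return {
    info: (msg, extra) => emit("info", msg, extra),
    error: (msg, extra) => emit("error", msg, extra),
  };
}

const log = createLogger({
  service: "billing",
  region: "eu-west-1",
  tenantId: "tenant-123",
  correlationId: "abc-456",
});

log.error("Failed to create invoice");
```

Binding the context once, at logger creation, is what keeps every entry searchable by tenant and correlation ID without each call site having to remember them.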
Logs are not designed to be:
- Globally aggregated
- Queried continuously
- The single source of truth for compliance
Centralising logs across regions might seem convenient, but in systems that are aware of data residency, it often leads to compliance and access issues.
Traces
Traces answer a question that neither metrics nor logs can answer:
How did this request move through the system?
Traces map relationships across components.
In AWS, traces typically come from AWS X-Ray, or from OpenTelemetry via the AWS Distro for OpenTelemetry (ADOT).
A trace connects multiple spans:
- 1 span per Lambda invocation
- 1 span per downstream call (DynamoDB, HTTP and so on)
- 1 span per asynchronous hop (event bus, queue)
For example:

```javascript
const { SpanStatusCode } = require("@opentelemetry/api");

const span = tracer.startSpan("invoice.create");
span.setAttribute("correlation_id", correlationId);
try {
  await SomethingOnDynamoDB(); // auto-instrumentation often creates child spans
  await SomethingOnEB(); // publish spans live under this root span
  span.setStatus({ code: SpanStatusCode.OK });
} catch (e) {
  span.recordException(e);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw e;
} finally {
  span.end();
}
```
In this example, the goal is to make the intent clear. The root span represents a real business operation, while the child spans capture the downstream work. All of them are linked by the correlation ID.
In residency-aware applications, it is good practice to:
- Keep traces regional
- Explicitly control which attributes may be attached
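The second point can be enforced with a small helper that only attaches allow-listed attributes to a span. This is a sketch: the helper name and the allow-list contents are assumptions, and a production setup might do the same filtering in a span processor instead.

```javascript
// Only attributes on the allow-list ever reach the span.
// Anything else (emails, payloads, tenant secrets) is silently dropped.
const ALLOWED_SPAN_ATTRIBUTES = new Set([
  "correlation_id",
  "region",
  "service",
]);

function setAllowedAttributes(span, attributes) {
  const dropped = [];
  for (const [key, value] of Object.entries(attributes)) {
    if (ALLOWED_SPAN_ATTRIBUTES.has(key)) {
      span.setAttribute(key, value);
    } else {
      dropped.push(key);
    }
  }
  return dropped; // useful for auditing what was filtered out
}
```

The design choice is to deny by default: an attribute only crosses into the trace backend if someone deliberately added it to the list.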
Audit
Audit exists for one purpose:
Can we prove what happened later?
This responsibility is fundamentally different from observability.
In AWS, audit signals typically come from CloudTrail.
Audit systems are built to record everything reliably, not to make a dev comfortable during an incident. They optimise for completeness, immutability, and long-term retention.
In AWS, it usually looks like this:
- CloudTrail records control-plane actions (who did what, when, and under which identity)
- Trails are delivered to S3 as immutable records
- Investigations happen later, often by querying those logs with Athena
Correlation in audit is rarely automatic. It is time-based and identity-based:
- This change happened at this time
- Under this role or permission set
- Shortly before or after this incident
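That kind of correlation can be sketched as a simple filter over audit records: keep the events from a given identity within a window around the incident. The record shape and field names below are assumptions for illustration, not the CloudTrail schema:

```javascript
// Find audit events by a given identity within +/- windowMs of an incident.
function correlateAuditEvents(events, { identity, incidentTime, windowMs }) {
  const t = new Date(incidentTime).getTime();
  return events.filter((e) => {
    const eventTime = new Date(e.time).getTime();
    return e.identity === identity && Math.abs(eventTime - t) <= windowMs;
  });
}

const events = [
  { time: "2024-05-01T10:00:00Z", identity: "role/admin", action: "UpdateFunctionConfiguration" },
  { time: "2024-05-01T10:04:00Z", identity: "role/admin", action: "PutBucketPolicy" },
  { time: "2024-05-01T12:00:00Z", identity: "role/admin", action: "PutBucketPolicy" },
  { time: "2024-05-01T10:02:00Z", identity: "role/reader", action: "GetObject" },
];

const suspects = correlateAuditEvents(events, {
  identity: "role/admin",
  incidentTime: "2024-05-01T10:05:00Z",
  windowMs: 10 * 60 * 1000, // ten minutes
});
console.log(suspects.length); // → 2
```

Notice that nothing here joins on a correlation ID — the query is purely about who and when, which is exactly how audit investigations tend to work.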
Tools like AWS Audit Manager help by collecting and organising infrastructure evidence. Together with CloudTrail, they can tell you who did what to the platform, and when.
What is missing is the intent, and this responsibility lives in the application. This is why an application must emit domain-level audit events. These events explain what the system decided to do in business terms.
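A domain-level audit event can be as simple as a structured record emitted alongside the business operation. The shape below is a sketch — the field names are assumptions, and where the event goes (a queue, a dedicated append-only store) is up to the platform:

```javascript
// Build a domain audit event: what the system decided, in business terms.
function buildAuditEvent({ actor, action, subject, decision, correlationId }) {
  return Object.freeze({
    eventType: "domain.audit",
    occurredAt: new Date().toISOString(),
    actor,         // who triggered the decision (user, system, role)
    action,        // business action, e.g. "invoice.create"
    subject,       // what the action applied to
    decision,      // what the system decided, e.g. "approved"
    correlationId, // links back to logs and traces
  });
}

const event = buildAuditEvent({
  actor: "tenant-123/user-9",
  action: "invoice.create",
  subject: "invoice-777",
  decision: "approved",
  correlationId: "abc-456",
});
```

Freezing the object is a small signal of intent: audit events are facts, and nothing downstream should be able to rewrite them.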
Conclusion
Global serverless systems don’t fail because they lack observability. They fail because:
- The wrong signals are used for the wrong questions
- Responsibilities get blurred
- Convenience overrides intent
In a healthy system, the flow looks like this:
- Metrics alert you that something is wrong
- You identify where and when
- Logs and traces help you understand what happened
- Audit records explain who changed what and provide defensible proof