When a system reaches a global scale, teams usually have plenty of data. The real problem is figuring out which signals matter.
Metrics, logs, traces, and audit records are all available. Dashboards are packed with information, and storage keeps increasing. When something goes wrong, the same questions come up:
- "The metrics look fine, but users are complaining"
- "The logs do not say much"
- "We cannot find the request everyone is talking about"
- "Are we even allowed to look at this data?"
In this article, I try to explain why these signals — metrics, logs, traces, and audit records — should be kept separate.
Responsibilities first
In Part 2, I introduced the idea of three responsibilities in a global serverless platform:
- Execution — doing the work
- Evidence — proving what happened
- Operations — understanding system health
Each of these serves a specific purpose. Problems begin when we use the same signals for different responsibilities just because it seems easier.
Metrics
Metrics are designed to answer one question quickly:
Is the system behaving normally?
They are designed to be:
- Aggregated
- Cheap to query
- Fast to evaluate
- Stable over time
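To make "fast to evaluate" concrete, here is a minimal sketch of the kind of check a metric alarm performs: aggregate counts over a window and compare against a threshold. The function name, data shape, and threshold are illustrative assumptions, not from the article.

```javascript
// Sketch: evaluate an aggregated error rate over a window of data points.
// Each point is a pre-aggregated minute bucket, the way metrics arrive.
function isSystemHealthy(dataPoints, maxErrorRate) {
  const total = dataPoints.reduce((sum, p) => sum + p.requests, 0);
  const errors = dataPoints.reduce((sum, p) => sum + p.errors, 0);
  if (total === 0) return true; // no traffic, nothing to alarm on
  return errors / total <= maxErrorRate;
}

// Five one-minute buckets with a spike, evaluated against a 2% budget
const dataPoints = [
  { requests: 1000, errors: 5 },
  { requests: 1200, errors: 8 },
  { requests: 900, errors: 120 }, // spike
  { requests: 1100, errors: 6 },
  { requests: 1000, errors: 7 },
];
console.log(isSystemHealthy(dataPoints, 0.02)); // → false
```

The important property is that the check never looks at individual requests — only at aggregates that stay cheap no matter how much traffic flows through.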
In AWS, metrics usually come from CloudWatch: emitted automatically by services such as Lambda, API Gateway, and DynamoDB, or published as custom metrics from application code.
Metrics are useful because they turn lots of events into clear trends.
This is why cardinality matters. If you include userId or tenantId, you can turn a few time-series into millions, which makes aggregation, alerting, and cost control much harder.
For example, a metric emission should look something like this:

```javascript
metrics.counter("orders_created_total").add(1, {
  region: process.env.AWS_REGION,
  service: "orders",
});
```
The above shows where to investigate, but not which user or tenant is involved.
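To make the cardinality point concrete, here is a rough back-of-the-envelope sketch. The counts are illustrative assumptions, not measurements:

```javascript
// Each unique combination of label values creates a separate time-series.
const regions = 3;
const services = 10;
const tenants = 100000;

// Low-cardinality labels: region + service
const seriesWithoutTenant = regions * services;

// Adding tenantId multiplies every existing series by the tenant count
const seriesWithTenant = regions * services * tenants;

console.log(seriesWithoutTenant); // 30
console.log(seriesWithTenant);    // 3000000
```

Thirty series are trivial to aggregate and alert on; three million are a storage and query problem before they are an observability problem.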
Logs
Logs exist to answer the question:
What happened inside this component?
Logs are:
- Local
- Detailed
- Contextual
- Tied to execution
In AWS, logs are typically collected via CloudWatch Logs, with Lambda and other services writing there automatically.
A log entry could look like this:

```json
{
  "level": "error",
  "message": "Failed to create invoice",
  "tenantId": "tenant-123",
  "correlationId": "abc-456",
  "service": "billing",
  "region": "eu-west-1"
}
```
Including tenantId is useful because:
- Logs are scoped
- Access is controlled
- Logs can be automatically deleted or archived
- Searches are intentional
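An entry like the one above can come from a small structured logger that carries tenant and correlation context with every line. This is a sketch, not a specific library's API; the field names simply follow the example:

```javascript
// Minimal structured logger: one JSON object per line, with the
// tenant/correlation context baked in at creation time.
function createLogger(baseContext) {
  const emit = (level, message, extra = {}) => {
    const entry = { level, message, ...baseContext, ...extra };
    console.log(JSON.stringify(entry));
    return entry; // returned to make the logger easy to test
  };
  return {
    info: (msg, extra) => emit("info", msg, extra),
    error: (msg, extra) => emit("error", msg, extra),
  };
}

const log = createLogger({
  service: "billing",
  region: "eu-west-1",
  tenantId: "tenant-123",
  correlationId: "abc-456",
});

log.error("Failed to create invoice");
```

Binding the context once, at logger creation, is what keeps every entry searchable by tenant and correlation ID without each call site having to remember them.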
Logs are not designed to be:
- Globally aggregated
- Queried continuously
- The single source of truth for compliance
Centralising logs across regions might seem convenient, but in systems that are aware of data residency, it often leads to compliance and access issues.
Traces
Traces answer a question that neither metrics nor logs can answer:
How did this request move through the system?
Traces map relationships across components.
In AWS, traces typically come from AWS X-Ray, or from OpenTelemetry via the AWS Distro for OpenTelemetry (ADOT).
A trace connects multiple spans:
- 1 span per Lambda invocation
- 1 span per downstream call (DynamoDB, HTTP and so on)
- 1 span per asynchronous hop (event bus, queue)
For example:

```javascript
const { SpanStatusCode } = require("@opentelemetry/api");

const span = tracer.startSpan("invoice.create");
span.setAttribute("correlation_id", correlationId);
try {
  await SomethingOnDynamoDB(); // auto-instrumentation often creates child spans
  await SomethingOnEB(); // publish spans live under this root span
  span.setStatus({ code: SpanStatusCode.OK });
} catch (e) {
  span.recordException(e);
  span.setStatus({ code: SpanStatusCode.ERROR });
  throw e;
} finally {
  span.end();
}
```
In this example, the goal is to make the intent clear. The root span represents a real business operation, while the child spans capture the downstream work. All of them are linked by the correlation ID.
In residency-aware applications, it is good practice to:
- Keep traces regional
- Explicitly control which attributes may be attached
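The second point can be enforced with a small helper that only attaches allow-listed attributes to a span. This is a sketch: the helper name and the allow-list contents are assumptions, and a production setup might do the same filtering in a span processor instead.

```javascript
// Only attributes on the allow-list ever reach the span.
// Anything else (emails, payloads, tenant secrets) is silently dropped.
const ALLOWED_SPAN_ATTRIBUTES = new Set([
  "correlation_id",
  "region",
  "service",
]);

function setAllowedAttributes(span, attributes) {
  const dropped = [];
  for (const [key, value] of Object.entries(attributes)) {
    if (ALLOWED_SPAN_ATTRIBUTES.has(key)) {
      span.setAttribute(key, value);
    } else {
      dropped.push(key);
    }
  }
  return dropped; // useful for auditing what was filtered out
}
```

The design choice is to deny by default: an attribute only crosses into the trace backend if someone deliberately added it to the list.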
Audit
Audit exists for one purpose:
Can we prove what happened later?
This responsibility is fundamentally different from observability.
In AWS, audit signals typically come from CloudTrail.
Audit systems are built to record everything reliably, not to make a dev comfortable during an incident. They optimise for completeness, immutability, and long-term retention.
In AWS, it usually looks like this:
- CloudTrail records control-plane actions (who did what, when, and under which identity)
- Trails are delivered to S3 as immutable records
- Investigations happen later, often by querying those logs with Athena
Correlation in audit is rarely automatic. It is time-based and identity-based:
- This change happened at this time
- Under this role or permission set
- Shortly before or after this incident
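That kind of correlation can be sketched as a simple filter over audit records: keep the events from a given identity within a window around the incident. The record shape and field names below are assumptions for illustration, not the CloudTrail schema:

```javascript
// Find audit events by a given identity within +/- windowMs of an incident.
function correlateAuditEvents(events, { identity, incidentTime, windowMs }) {
  const t = new Date(incidentTime).getTime();
  return events.filter((e) => {
    const eventTime = new Date(e.time).getTime();
    return e.identity === identity && Math.abs(eventTime - t) <= windowMs;
  });
}

const events = [
  { time: "2024-05-01T10:00:00Z", identity: "role/admin", action: "UpdateFunctionConfiguration" },
  { time: "2024-05-01T10:04:00Z", identity: "role/admin", action: "PutBucketPolicy" },
  { time: "2024-05-01T12:00:00Z", identity: "role/admin", action: "PutBucketPolicy" },
  { time: "2024-05-01T10:02:00Z", identity: "role/reader", action: "GetObject" },
];

const suspects = correlateAuditEvents(events, {
  identity: "role/admin",
  incidentTime: "2024-05-01T10:05:00Z",
  windowMs: 10 * 60 * 1000, // ten minutes
});
console.log(suspects.length); // → 2
```

Notice that nothing here joins on a correlation ID — the query is purely about who and when, which is exactly how audit investigations tend to work.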
Tools like AWS Audit Manager help by collecting and organising infrastructure evidence. Together with CloudTrail, they can tell you who did what to the platform, and when.
What is missing is the intent, and this responsibility lives in the application. This is why an application must emit domain-level audit events. These events explain what the system decided to do in business terms.
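A domain-level audit event can be as simple as a structured record emitted alongside the business operation. The shape below is a sketch — the field names are assumptions, and where the event goes (a queue, a dedicated append-only store) is up to the platform:

```javascript
// Build a domain audit event: what the system decided, in business terms.
function buildAuditEvent({ actor, action, subject, decision, correlationId }) {
  return Object.freeze({
    eventType: "domain.audit",
    occurredAt: new Date().toISOString(),
    actor,         // who triggered the decision (user, system, role)
    action,        // business action, e.g. "invoice.create"
    subject,       // what the action applied to
    decision,      // what the system decided, e.g. "approved"
    correlationId, // links back to logs and traces
  });
}

const event = buildAuditEvent({
  actor: "tenant-123/user-9",
  action: "invoice.create",
  subject: "invoice-777",
  decision: "approved",
  correlationId: "abc-456",
});
```

Freezing the object is a small signal of intent: audit events are facts, and nothing downstream should be able to rewrite them.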
Conclusion
Global serverless systems don’t fail because they lack observability. They fail because:
- The wrong signals are used for the wrong questions
- Responsibilities get blurred
- Convenience overrides intent
In a healthy system, the flow looks like this:
- Metrics alert you that something is wrong
- You identify where and when
- Logs and traces help you understand what happened
- Audit records explain who changed what and provide defensible proof