Renaldi for AWS Community Builders


How to Monitor Event Delivery in Amazon EventBridge

When I first started using Amazon EventBridge more heavily, I realized something important pretty quickly: it is very easy to build event-driven flows, but much harder to know when delivery is degrading before things break.

This post is specifically about best practices for monitoring event delivery in EventBridge (not just target-side application monitoring). AWS actually has strong guidance here, and the key is to monitor a combination of retry, success, failure, DLQ, and latency metrics together instead of looking at a single number in isolation.

I’ll walk through:

  • what to monitor
  • what I alert on (and why)
  • an architecture pattern that works well
  • practical alarm design so you do not get spammed all day

Why this matters

AWS recommends monitoring EventBridge delivery behavior because underperforming or undersized targets can cause excessive retries, delivery delays, and permanent delivery failures. They also explicitly recommend combining multiple metrics and setting alarms/dashboards for early detection.

Also, by default, EventBridge retries failed target delivery for up to 24 hours and up to 185 times (with exponential backoff and jitter). If retries are exhausted, the event is dropped unless you configured a DLQ.

That default behavior is great for resilience, but it also means a problem can quietly turn into a backlog / latency issue if you are not watching the right signals.
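Both knobs live on the rule target itself. The sketch below shows the target configuration I would pass to EventBridge's `put_targets` API, with a retry policy pinned at the documented defaults and a DLQ attached; the rule name, function, and queue ARNs are hypothetical.

```python
# Target configuration for events.put_targets (names and ARNs are made up).
# RetryPolicy makes the default behavior explicit (up to 185 attempts over
# 24 hours); DeadLetterConfig catches events once retries are exhausted.
target = {
    "Id": "process-order-lambda",
    "Arn": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
    "RetryPolicy": {
        "MaximumRetryAttempts": 185,        # EventBridge's default cap
        "MaximumEventAgeInSeconds": 86400,  # 24 hours, the default window
    },
    "DeadLetterConfig": {
        "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    },
}

# Applied with:
# boto3.client("events").put_targets(Rule="orders-processing", Targets=[target])
```

Lowering `MaximumRetryAttempts` or `MaximumEventAgeInSeconds` trades resilience for faster DLQ visibility, which can be the right call on latency-sensitive paths.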

The architecture I use for monitoring EventBridge delivery

Here is the monitoring architecture pattern I like for production workloads: each rule target gets its own SQS dead-letter queue, and CloudWatch alarms watch both the EventBridge delivery metrics and the target service itself.

Best-practice notes in this architecture

  • I prefer a DLQ per rule target (or at least per critical target path), because AWS recommends configuring a dead-letter queue for each rule target to avoid losing undelivered events.
  • I keep CloudWatch alarms on EventBridge metrics and separate alarms on the target service (for example Lambda errors, API 5xx, Step Functions failures). EventBridge tells me about delivery behavior; target metrics tell me what is happening after delivery.
  • I treat the DLQ as a diagnostics feed, not just a parking lot.

The core EventBridge metrics I monitor (and why)

AWS’s EventBridge best-practices page specifically calls out a small set of metrics for delivery monitoring, and I use those as the foundation.

1) InvocationAttempts

This is the total number of times EventBridge attempted to invoke a target, including retries.

Why I care:

  • It gives me the denominator for delivery health.
  • It helps me see if retries are inflating traffic to a struggling target.

2) SuccessfulInvocationAttempts

This is the number of invocation attempts where EventBridge successfully delivered the event to the target.

Why I care:

  • It is the numerator for a success-rate metric.
  • On its own, it can look healthy even when retries are growing, so I always compare it with InvocationAttempts.

3) RetryInvocationAttempts

This is the number of invocation attempts that were retries. AWS explicitly says an increase here may be an early indication of an undersized target.

Why I care:

  • This is usually my earliest warning signal.
  • I can catch target stress before it becomes permanent failure.

4) FailedInvocations

This is the number of invocations that failed permanently (retries exhausted, or the error was not retryable), with some exclusions noted in the metric definition.

Why I care:

  • This is a direct “something is broken” signal.
  • If this rises and there is no DLQ, I am at risk of event loss after retries are exhausted.

5) InvocationsSentToDlq

This is the number of invocations moved to a dead-letter queue.

Why I care:

  • This tells me failed deliveries are now becoming operational work.
  • Even if the system is “surviving,” this still means customers or downstream processes may be impacted.

6) InvocationsFailedToBeSentToDlq

This is the nightmare metric: EventBridge tried to move failed invocations to the DLQ but could not (for example due to permissions, unavailable resource, or size issues). AWS documents this metric and notes it is emitted only when non-zero.

Why I care:

  • This is a critical alarm for me.
  • It means my failure safety net is not working.

7) IngestionToInvocationSuccessLatency

AWS recommends this latency metric for detecting event delivery delays. It measures end-to-end latency from ingestion to successful delivery and can surface delays from retries, timeouts, or slow target responses. It also includes the target’s time to respond successfully.

Why I care:

  • This catches degraded delivery even when there are no obvious hard failures.
  • It is great for detecting “everything is technically working, but much slower than usual.”

The most useful derived metric: SuccessfulInvocationRate

AWS explicitly recommends combining metrics and even gives a metric math example:

SuccessfulInvocationRate = SuccessfulInvocationAttempts / InvocationAttempts

I strongly agree with this. In practice, I always graph and alert on a success rate, not just raw counts.

CloudWatch metric math expression (example)

SuccessfulInvocationRate = SuccessfulInvocationAttempts / InvocationAttempts

My practical implementation tip

I only evaluate this alarm when there is meaningful traffic. Otherwise, low-volume rules can produce noisy alerts (for example, 1 failed attempt on a rarely used rule).

A common pattern is:

  • a volume guard (enough attempts in the window)
  • plus a success-rate threshold
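Both pieces fit in a single CloudWatch metric-math alarm. Here is a sketch of the `Metrics` array I would pass to `put_metric_alarm`; the rule name, volume floor, and 5-minute period are assumptions to tune per workload.

```python
# Volume-guarded success-rate expression. The IF() guard reports 1 (healthy)
# whenever attempts fall below the floor, so low-traffic windows cannot trip
# the alarm. Rule name "orders-processing" is hypothetical.
MIN_ATTEMPTS = 50  # volume guard: tune to your traffic

def events_metric(metric_id, name):
    """MetricDataQuery for an AWS/Events metric, summed over 5 minutes."""
    return {"Id": metric_id, "ReturnData": False,
            "MetricStat": {"Metric": {"Namespace": "AWS/Events",
                                      "MetricName": name,
                                      "Dimensions": [{"Name": "RuleName",
                                                      "Value": "orders-processing"}]},
                           "Period": 300, "Stat": "Sum"}}

metrics = [
    events_metric("attempts", "InvocationAttempts"),
    events_metric("successes", "SuccessfulInvocationAttempts"),
    {"Id": "rate", "ReturnData": True,
     "Expression": f"IF(attempts >= {MIN_ATTEMPTS}, successes / attempts, 1)",
     "Label": "SuccessfulInvocationRate (volume-guarded)"},
]

# Applied with: cloudwatch.put_metric_alarm(..., Metrics=metrics,
#     ComparisonOperator="LessThanThreshold", Threshold=0.999, ...)
```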

What I alert on (my starting best-practice alarm set)

Below is a practical alert set I use as a starting point. The exact thresholds should be tuned to your workload and business SLOs.

I treat these as starting thresholds, not universal truths.


Alert 1 (Critical): InvocationsFailedToBeSentToDlq > 0

Severity: P1 / Critical

Why: The DLQ safety net itself is failing. AWS notes this can happen due to permission issues, unavailable resources, or size limits.

What I do immediately

  • Check SQS queue existence and region
  • Check SQS resource policy / permissions
  • Check encryption / access configuration if applicable
  • Check message size / payload patterns
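The permission check is usually where I find the problem. The DLQ's SQS resource policy needs to allow the EventBridge service principal to send messages, scoped to the rule. A sketch of that policy (all ARNs hypothetical):

```python
import json

# SQS resource policy that lets EventBridge deliver to the DLQ.
# The aws:SourceArn condition scopes the grant to one rule so nothing
# else can write to the queue. ARNs are made up for illustration.
dlq_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowEventBridgeDlqDelivery",
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
        "Condition": {"ArnEquals": {
            "aws:SourceArn": "arn:aws:events:us-east-1:123456789012:rule/orders-processing"
        }},
    }],
}

# Applied with: sqs.set_queue_attributes(QueueUrl=queue_url,
#     Attributes={"Policy": json.dumps(dlq_policy)})
```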

Alert 2 (High): FailedInvocations > 0 on critical rules

Severity: P1 or P2 (depends on rule criticality)

Why: Permanent delivery failure is already happening. AWS documents FailedInvocations as permanently failed invocations.

Best practice

  • Scope alarms per rule for critical paths
  • Route alerts to the owning team, not one giant shared channel

Alert 3 (High): InvocationsSentToDlq > 0

Severity: P2 (or P1 for critical business flows)

Why: Failures are being captured safely, but delivery is still failing and backlog is building. AWS best-practice guidance specifically calls out watching FailedInvocations and InvocationsSentToDlq.

What I do

  • Triage DLQ message attributes (ERROR_CODE, ERROR_MESSAGE, RETRY_ATTEMPTS, etc.) to identify the root cause faster. EventBridge includes these attributes in DLQ messages.
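During triage I like to bucket DLQ messages by error code and target, so the dominant failure mode is visible at a glance. A small helper, assuming messages in the shape `sqs.receive_message` returns with message attributes requested (the attribute values below are illustrative):

```python
from collections import Counter

def summarize_dlq_messages(messages):
    """Count DLQ messages per (ERROR_CODE, TARGET_ARN) pair."""
    counts = Counter()
    for msg in messages:
        attrs = msg.get("MessageAttributes", {})
        code = attrs.get("ERROR_CODE", {}).get("StringValue", "UNKNOWN")
        target = attrs.get("TARGET_ARN", {}).get("StringValue", "UNKNOWN")
        counts[(code, target)] += 1
    return counts

# Example with two hypothetical messages:
sample = [
    {"MessageAttributes": {"ERROR_CODE": {"StringValue": "NO_PERMISSIONS"},
                           "TARGET_ARN": {"StringValue": "arn:aws:lambda:us-east-1:123456789012:function:fn"}}},
    {"MessageAttributes": {"ERROR_CODE": {"StringValue": "NO_PERMISSIONS"},
                           "TARGET_ARN": {"StringValue": "arn:aws:lambda:us-east-1:123456789012:function:fn"}}},
]
print(summarize_dlq_messages(sample).most_common(5))
```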

Alert 4 (Warning): RetryInvocationAttempts rising above baseline

Severity: Warning / P3

Why: AWS says an increase may be an early indication of an undersized target.

This is one of my favorite early-warning alerts.

How I tune it

  • For steady traffic: alert on absolute count per period
  • For bursty traffic: alert on retry rate (retry attempts / invocation attempts)
  • Use a longer evaluation window to avoid false positives during short spikes
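The longer-window idea maps directly onto CloudWatch's M-out-of-N evaluation. A sketch of the alarm parameters for the rate variant, with hypothetical rule name and thresholds:

```python
# Retry-rate alarm parameters. DatapointsToAlarm < EvaluationPeriods lets
# the alarm tolerate a short spike but fire on a sustained rise.
# Rule name and the 5% threshold are assumptions to tune.
alarm_params = {
    "AlarmName": "orders-processing-retry-rate",
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 0.05,        # alert when >5% of attempts are retries
    "EvaluationPeriods": 6,   # six 5-minute periods = 30-minute window
    "DatapointsToAlarm": 4,   # at least 4 of the 6 periods must breach
    "TreatMissingData": "notBreaching",
    "Metrics": [
        {"Id": "retries", "ReturnData": False,
         "MetricStat": {"Metric": {"Namespace": "AWS/Events",
                                   "MetricName": "RetryInvocationAttempts",
                                   "Dimensions": [{"Name": "RuleName", "Value": "orders-processing"}]},
                        "Period": 300, "Stat": "Sum"}},
        {"Id": "attempts", "ReturnData": False,
         "MetricStat": {"Metric": {"Namespace": "AWS/Events",
                                   "MetricName": "InvocationAttempts",
                                   "Dimensions": [{"Name": "RuleName", "Value": "orders-processing"}]},
                        "Period": 300, "Stat": "Sum"}},
        {"Id": "retry_rate", "ReturnData": True,
         "Expression": "retries / attempts", "Label": "Retry rate"},
    ],
}
# Applied with: boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```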

Alert 5 (Warning/High): SuccessfulInvocationRate below threshold

Severity: Warning or P2

Why: This gives a single health signal that accounts for retries and success together (exactly what AWS recommends via metric math).

Example starting thresholds

  • Warning: < 99.9% for 10 minutes (high-volume, sensitive systems)
  • High: < 99% for 5–10 minutes
  • Critical: < 95% for 5 minutes on critical paths

Again, tune these to your business tolerance.


Alert 6 (Warning): IngestionToInvocationSuccessLatency p95/p99 above baseline

Severity: Warning or P2

Why: AWS recommends using this metric to detect delays, and it surfaces retry effects and slow target responses.

My recommendation

  • Alert on p95 or p99 rather than average
  • Use a baseline per rule (or per target type), because latency expectations differ a lot
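CloudWatch lets a standard alarm target a percentile via `ExtendedStatistic`, which keeps this simple. A sketch of the alarm parameters; the rule name and the 5-second threshold are assumptions, and the threshold should come from each rule's observed baseline:

```python
# p99 delivery-latency alarm parameters (hypothetical rule and threshold).
# ExtendedStatistic="p99" alarms on the percentile rather than the average,
# so a slow tail is visible even when the mean looks fine.
latency_alarm = {
    "AlarmName": "orders-processing-delivery-latency-p99",
    "Namespace": "AWS/Events",
    "MetricName": "IngestionToInvocationSuccessLatency",
    "Dimensions": [{"Name": "RuleName", "Value": "orders-processing"}],
    "ExtendedStatistic": "p99",
    "Period": 300,
    "EvaluationPeriods": 3,          # sustained for 15 minutes
    "Threshold": 5000.0,             # milliseconds; set from the baseline
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",
}
# Applied with: boto3.client("cloudwatch").put_metric_alarm(**latency_alarm)
```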

Alert 7 (Warning): ThrottledRules > 0

Severity: Warning / P2 if persistent

Why: AWS documents that throttled rule executions may delay invocations.

This is especially useful when traffic grows and quota pressure starts to show up.


My dashboard layout (simple but effective)

I try not to overcomplicate the dashboard. I keep one dashboard section per critical rule / target path, and I always put these together:

Panel group A: Delivery health

  • InvocationAttempts
  • SuccessfulInvocationAttempts
  • RetryInvocationAttempts
  • FailedInvocations
  • InvocationsSentToDlq

Panel group B: Delivery quality

  • SuccessfulInvocationRate (metric math)
  • IngestionToInvocationSuccessLatency (p50, p95, p99)

Panel group C: Protection / risk

  • InvocationsFailedToBeSentToDlq
  • ThrottledRules
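The three panel groups above can be stamped out per rule with a small dashboard generator. A minimal sketch, assuming the `["Namespace", "MetricName", "DimName", "DimValue"]` widget metric format and a hypothetical rule name:

```python
import json

# Build a per-rule CloudWatch dashboard body mirroring the three panel
# groups. Rule name is hypothetical; widget sizes are arbitrary.
RULE = "orders-processing"

def widget(title, metric_names, stat="Sum"):
    """One graph widget over a list of AWS/Events metrics for RULE."""
    return {"type": "metric", "width": 8, "height": 6,
            "properties": {"title": title, "stat": stat, "period": 300,
                           "metrics": [["AWS/Events", m, "RuleName", RULE]
                                       for m in metric_names]}}

dashboard_body = json.dumps({"widgets": [
    widget("Delivery health", ["InvocationAttempts", "SuccessfulInvocationAttempts",
                               "RetryInvocationAttempts", "FailedInvocations",
                               "InvocationsSentToDlq"]),
    widget("Delivery quality", ["IngestionToInvocationSuccessLatency"], stat="p99"),
    widget("Protection / risk", ["InvocationsFailedToBeSentToDlq", "ThrottledRules"]),
]})

# Applied with: cloudwatch.put_dashboard(DashboardName=f"{RULE}-delivery",
#     DashboardBody=dashboard_body)
```

The SuccessfulInvocationRate metric-math panel would be added the same way with an `expression` entry in the widget's metrics list.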

This matches AWS’s “combine multiple metrics for a holistic overview” guidance, and in practice it makes troubleshooting much faster.


Practical runbook: how I interpret the pattern combinations

This is the part that saves the most time during incidents.

Pattern 1: Retries rising, success rate still high

Likely meaning: target is stressed but still mostly coping

Metrics pattern:

  • RetryInvocationAttempts up
  • SuccessfulInvocationRate mostly stable
  • latency rising (IngestionToInvocationSuccessLatency)

What I check

  • target concurrency / scaling
  • downstream dependency latency
  • throttling on target service

Pattern 2: Retries rising, success rate dropping, DLQ starts filling

Likely meaning: target is underprovisioned or misconfigured

Metrics pattern:

  • RetryInvocationAttempts up
  • SuccessfulInvocationRate down
  • InvocationsSentToDlq > 0

What I check

  • target capacity / rate limits
  • permission changes
  • endpoint availability / DNS / networking
  • recent deployments

Pattern 3: Failed invocations and DLQ-send failures

Likely meaning: primary path failing and safety net misconfigured

Metrics pattern:

  • FailedInvocations > 0
  • InvocationsFailedToBeSentToDlq > 0

What I do

  • page immediately
  • validate DLQ queue, region, permissions, and policy
  • verify EventBridge can SendMessage to the SQS DLQ

AWS’s DLQ guidance includes permission requirements and examples, and this is the first place I look when DLQ delivery fails.


Common mistakes I see (and try to avoid)

1) Only alerting on FailedInvocations

This catches issues late. By the time permanent failures show up, you may already have retries, delays, and customer impact.

Better: also alert on retries and latency.


2) No DLQ configured

AWS explicitly recommends configuring a DLQ for each rule target to avoid losing undelivered events.


3) No alarm on InvocationsFailedToBeSentToDlq

This is the “we thought we were safe, but we are not” blind spot.


4) One global alarm for all rules

This creates noisy alerts and slow triage.

Better: alarm per critical rule / target path, and map ownership.


5) Using static latency thresholds with no baseline thinking

Different targets behave differently. A Lambda target and an API destination will not share the same normal latency profile.


A practical starting checklist (copy this into your ops notes)

  • [ ] Configure DLQ for each critical EventBridge rule target
  • [ ] Alarm on InvocationsFailedToBeSentToDlq > 0 (critical)
  • [ ] Alarm on FailedInvocations > 0 for critical rules
  • [ ] Alarm on InvocationsSentToDlq > 0 for critical rules
  • [ ] Track RetryInvocationAttempts trend (early warning)
  • [ ] Create SuccessfulInvocationRate metric math and alert on sustained dips
  • [ ] Track IngestionToInvocationSuccessLatency p95/p99 and alert on sustained increase
  • [ ] Add ThrottledRules alert if traffic growth is expected
  • [ ] Build per-rule dashboards for critical flows
  • [ ] Document a DLQ triage + replay runbook

Final thoughts

If I had to summarize the best practice in one sentence, it would be this:

Do not monitor EventBridge delivery with a single failure metric. Monitor retries, success rate, DLQ activity, and delivery latency together.

That combination gives me an early warning signal, a confirmed-failure signal, and a customer-impact signal, which is exactly what I want in production.


References

  • AWS EventBridge monitoring best practices for event delivery (recommended metrics, success-rate metric math, retries/DLQ guidance). (AWS Documentation)
  • AWS EventBridge monitoring metrics reference (AWS/Events metrics and definitions). (AWS Documentation)
  • AWS EventBridge retry behavior (default retry duration and count). (AWS Documentation)
  • AWS EventBridge DLQ guidance (DLQ behavior, limitations, permissions, message attributes). (AWS Documentation)
