Renaldi for AWS Community Builders


How to Monitor Event Delivery in Amazon EventBridge

When I first started using Amazon EventBridge more heavily, I realized something important pretty quickly: it is very easy to build event-driven flows, but much harder to know when delivery is degrading before things break.

This post is specifically about best practices for monitoring event delivery in EventBridge (not just target-side application monitoring). AWS actually has strong guidance here, and the key is to monitor a combination of retry, success, failure, DLQ, and latency metrics together instead of looking at a single number in isolation.

I’ll walk through:

  • what to monitor
  • what I alert on (and why)
  • an architecture pattern that works well
  • practical alarm design so you do not get spammed all day

Why this matters

AWS recommends monitoring EventBridge delivery behavior because underperforming or undersized targets can cause excessive retries, delivery delays, and permanent delivery failures. They also explicitly recommend combining multiple metrics and setting alarms/dashboards for early detection.

Also, by default, EventBridge retries failed target delivery for up to 24 hours and up to 185 times (with exponential backoff and jitter). If retries are exhausted, the event is dropped unless you configured a DLQ.

That default behavior is great for resilience, but it also means a problem can quietly turn into a backlog / latency issue if you are not watching the right signals.
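Both knobs live on the rule target itself. The sketch below shows the target configuration I would pass to EventBridge's `put_targets` API, with a retry policy pinned at the documented defaults and a DLQ attached; the rule name, function, and queue ARNs are hypothetical.

```python
# Target configuration for events.put_targets (names and ARNs are made up).
# RetryPolicy makes the default behavior explicit (up to 185 attempts over
# 24 hours); DeadLetterConfig catches events once retries are exhausted.
target = {
    "Id": "process-order-lambda",
    "Arn": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
    "RetryPolicy": {
        "MaximumRetryAttempts": 185,        # EventBridge's default cap
        "MaximumEventAgeInSeconds": 86400,  # 24 hours, the default window
    },
    "DeadLetterConfig": {
        "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    },
}

# Applied with:
# boto3.client("events").put_targets(Rule="orders-processing", Targets=[target])
```

Lowering `MaximumRetryAttempts` or `MaximumEventAgeInSeconds` trades resilience for faster DLQ visibility, which can be the right call on latency-sensitive paths.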

The architecture I use for monitoring EventBridge delivery

Here is the monitoring architecture pattern I like for production workloads: each rule target gets its own SQS dead-letter queue, and CloudWatch alarms watch both the EventBridge delivery metrics and the target service itself.

Best-practice notes in this architecture

  • I prefer a DLQ per rule target (or at least per critical target path), because AWS recommends configuring a dead-letter queue for each rule target to avoid losing undelivered events.
  • I keep CloudWatch alarms on EventBridge metrics and separate alarms on the target service (for example Lambda errors, API 5xx, Step Functions failures). EventBridge tells me about delivery behavior; target metrics tell me what is happening after delivery.
  • I treat the DLQ as a diagnostics feed, not just a parking lot.

The core EventBridge metrics I monitor (and why)

AWS’s EventBridge best-practices page specifically calls out a small set of metrics for delivery monitoring, and I use those as the foundation.

1) InvocationAttempts

This is the total number of times EventBridge attempted to invoke a target, including retries.

Why I care:

  • It gives me the denominator for delivery health.
  • It helps me see if retries are inflating traffic to a struggling target.

2) SuccessfulInvocationAttempts

This is the number of invocation attempts where EventBridge successfully delivered the event to the target.

Why I care:

  • It is the numerator for a success-rate metric.
  • On its own, it can look healthy even when retries are growing, so I always compare it with InvocationAttempts.

3) RetryInvocationAttempts

This is the number of invocation attempts that were retries. AWS explicitly says an increase here may be an early indication of an undersized target.

Why I care:

  • This is usually my earliest warning signal.
  • I can catch target stress before it becomes permanent failure.

4) FailedInvocations

This is the number of invocations that failed permanently (retries exhausted, or the error was not retryable), with some exclusions noted in the metric definition.

Why I care:

  • This is a direct “something is broken” signal.
  • If this rises and there is no DLQ, I am at risk of event loss after retries are exhausted.

5) InvocationsSentToDlq

This is the number of invocations moved to a dead-letter queue.

Why I care:

  • This tells me failed deliveries are now becoming operational work.
  • Even if the system is “surviving,” this still means customers or downstream processes may be impacted.

6) InvocationsFailedToBeSentToDlq

This is the nightmare metric: EventBridge tried to move failed invocations to the DLQ but could not (for example due to permissions, unavailable resource, or size issues). AWS documents this metric and notes it is emitted only when non-zero.

Why I care:

  • This is a critical alarm for me.
  • It means my failure safety net is not working.

7) IngestionToInvocationSuccessLatency

AWS recommends this latency metric for detecting event delivery delays. It measures end-to-end latency from ingestion to successful delivery and can surface delays from retries, timeouts, or slow target responses. It also includes the target’s time to respond successfully.

Why I care:

  • This catches degraded delivery even when there are no obvious hard failures.
  • It is great for detecting “everything is technically working, but much slower than usual.”

The most useful derived metric: SuccessfulInvocationRate

AWS explicitly recommends combining metrics and even gives a metric math example:

SuccessfulInvocationRate = SuccessfulInvocationAttempts / InvocationAttempts

I strongly agree with this. In practice, I always graph and alert on a success rate, not just raw counts.

CloudWatch metric math expression (example)

SuccessfulInvocationRate = SuccessfulInvocationAttempts / InvocationAttempts

My practical implementation tip

I only evaluate this alarm when there is meaningful traffic. Otherwise, low-volume rules can produce noisy alerts (for example, 1 failed attempt on a rarely used rule).

A common pattern is:

  • a volume guard (enough attempts in the window)
  • plus a success-rate threshold
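Both pieces fit in a single CloudWatch metric-math alarm. Here is a sketch of the `Metrics` array I would pass to `put_metric_alarm`; the rule name, volume floor, and 5-minute period are assumptions to tune per workload.

```python
# Volume-guarded success-rate expression. The IF() guard reports 1 (healthy)
# whenever attempts fall below the floor, so low-traffic windows cannot trip
# the alarm. Rule name "orders-processing" is hypothetical.
MIN_ATTEMPTS = 50  # volume guard: tune to your traffic

def events_metric(metric_id, name):
    """MetricDataQuery for an AWS/Events metric, summed over 5 minutes."""
    return {"Id": metric_id, "ReturnData": False,
            "MetricStat": {"Metric": {"Namespace": "AWS/Events",
                                      "MetricName": name,
                                      "Dimensions": [{"Name": "RuleName",
                                                      "Value": "orders-processing"}]},
                           "Period": 300, "Stat": "Sum"}}

metrics = [
    events_metric("attempts", "InvocationAttempts"),
    events_metric("successes", "SuccessfulInvocationAttempts"),
    {"Id": "rate", "ReturnData": True,
     "Expression": f"IF(attempts >= {MIN_ATTEMPTS}, successes / attempts, 1)",
     "Label": "SuccessfulInvocationRate (volume-guarded)"},
]

# Applied with: cloudwatch.put_metric_alarm(..., Metrics=metrics,
#     ComparisonOperator="LessThanThreshold", Threshold=0.999, ...)
```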

What I alert on (my starting best-practice alarm set)

Below is a practical alert set I use as a starting point. The exact thresholds should be tuned to your workload and business SLOs.

I treat these as starting thresholds, not universal truths.


Alert 1 (Critical): InvocationsFailedToBeSentToDlq > 0

Severity: P1 / Critical

Why: The DLQ safety net itself is failing. AWS notes this can happen due to permission issues, unavailable resources, or size limits.

What I do immediately

  • Check SQS queue existence and region
  • Check SQS resource policy / permissions
  • Check encryption / access configuration if applicable
  • Check message size / payload patterns
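The permission check is usually where I find the problem. The DLQ's SQS resource policy needs to allow the EventBridge service principal to send messages, scoped to the rule. A sketch of that policy (all ARNs hypothetical):

```python
import json

# SQS resource policy that lets EventBridge deliver to the DLQ.
# The aws:SourceArn condition scopes the grant to one rule so nothing
# else can write to the queue. ARNs are made up for illustration.
dlq_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowEventBridgeDlqDelivery",
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "sqs:SendMessage",
        "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
        "Condition": {"ArnEquals": {
            "aws:SourceArn": "arn:aws:events:us-east-1:123456789012:rule/orders-processing"
        }},
    }],
}

# Applied with: sqs.set_queue_attributes(QueueUrl=queue_url,
#     Attributes={"Policy": json.dumps(dlq_policy)})
```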

Alert 2 (High): FailedInvocations > 0 on critical rules

Severity: P1 or P2 (depends on rule criticality)

Why: Permanent delivery failure is already happening. AWS documents FailedInvocations as permanently failed invocations.

Best practice

  • Scope alarms per rule for critical paths
  • Route alerts to the owning team, not one giant shared channel

Alert 3 (High): InvocationsSentToDlq > 0

Severity: P2 (or P1 for critical business flows)

Why: Failures are being captured safely, but delivery is still failing and backlog is building. AWS best-practice guidance specifically calls out watching FailedInvocations and InvocationsSentToDlq.

What I do

  • Triage DLQ message attributes (ERROR_CODE, ERROR_MESSAGE, RETRY_ATTEMPTS, etc.) to identify the root cause faster. EventBridge includes these attributes in DLQ messages.
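During triage I like to bucket DLQ messages by error code and target, so the dominant failure mode is visible at a glance. A small helper, assuming messages in the shape `sqs.receive_message` returns with message attributes requested (the attribute values below are illustrative):

```python
from collections import Counter

def summarize_dlq_messages(messages):
    """Count DLQ messages per (ERROR_CODE, TARGET_ARN) pair."""
    counts = Counter()
    for msg in messages:
        attrs = msg.get("MessageAttributes", {})
        code = attrs.get("ERROR_CODE", {}).get("StringValue", "UNKNOWN")
        target = attrs.get("TARGET_ARN", {}).get("StringValue", "UNKNOWN")
        counts[(code, target)] += 1
    return counts

# Example with two hypothetical messages:
sample = [
    {"MessageAttributes": {"ERROR_CODE": {"StringValue": "NO_PERMISSIONS"},
                           "TARGET_ARN": {"StringValue": "arn:aws:lambda:us-east-1:123456789012:function:fn"}}},
    {"MessageAttributes": {"ERROR_CODE": {"StringValue": "NO_PERMISSIONS"},
                           "TARGET_ARN": {"StringValue": "arn:aws:lambda:us-east-1:123456789012:function:fn"}}},
]
print(summarize_dlq_messages(sample).most_common(5))
```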

Alert 4 (Warning): RetryInvocationAttempts rising above baseline

Severity: Warning / P3

Why: AWS says an increase may be an early indication of an undersized target.

This is one of my favorite early-warning alerts.

How I tune it

  • For steady traffic: alert on absolute count per period
  • For bursty traffic: alert on retry rate (retry attempts / invocation attempts)
  • Use a longer evaluation window to avoid false positives during short spikes
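The longer-window idea maps directly onto CloudWatch's M-out-of-N evaluation. A sketch of the alarm parameters for the rate variant, with hypothetical rule name and thresholds:

```python
# Retry-rate alarm parameters. DatapointsToAlarm < EvaluationPeriods lets
# the alarm tolerate a short spike but fire on a sustained rise.
# Rule name and the 5% threshold are assumptions to tune.
alarm_params = {
    "AlarmName": "orders-processing-retry-rate",
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": 0.05,        # alert when >5% of attempts are retries
    "EvaluationPeriods": 6,   # six 5-minute periods = 30-minute window
    "DatapointsToAlarm": 4,   # at least 4 of the 6 periods must breach
    "TreatMissingData": "notBreaching",
    "Metrics": [
        {"Id": "retries", "ReturnData": False,
         "MetricStat": {"Metric": {"Namespace": "AWS/Events",
                                   "MetricName": "RetryInvocationAttempts",
                                   "Dimensions": [{"Name": "RuleName", "Value": "orders-processing"}]},
                        "Period": 300, "Stat": "Sum"}},
        {"Id": "attempts", "ReturnData": False,
         "MetricStat": {"Metric": {"Namespace": "AWS/Events",
                                   "MetricName": "InvocationAttempts",
                                   "Dimensions": [{"Name": "RuleName", "Value": "orders-processing"}]},
                        "Period": 300, "Stat": "Sum"}},
        {"Id": "retry_rate", "ReturnData": True,
         "Expression": "retries / attempts", "Label": "Retry rate"},
    ],
}
# Applied with: boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```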

Alert 5 (Warning/High): SuccessfulInvocationRate below threshold

Severity: Warning or P2

Why: This gives a single health signal that accounts for retries and success together (exactly what AWS recommends via metric math).

Example starting thresholds

  • Warning: < 99.9% for 10 minutes (high-volume, sensitive systems)
  • High: < 99% for 5–10 minutes
  • Critical: < 95% for 5 minutes on critical paths

Again, tune these to your business tolerance.


Alert 6 (Warning): IngestionToInvocationSuccessLatency p95/p99 above baseline

Severity: Warning or P2

Why: AWS recommends using this metric to detect delays, and it surfaces retry effects and slow target responses.

My recommendation

  • Alert on p95 or p99 rather than average
  • Use a baseline per rule (or per target type), because latency expectations differ a lot
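CloudWatch lets a standard alarm target a percentile via `ExtendedStatistic`, which keeps this simple. A sketch of the alarm parameters; the rule name and the 5-second threshold are assumptions, and the threshold should come from each rule's observed baseline:

```python
# p99 delivery-latency alarm parameters (hypothetical rule and threshold).
# ExtendedStatistic="p99" alarms on the percentile rather than the average,
# so a slow tail is visible even when the mean looks fine.
latency_alarm = {
    "AlarmName": "orders-processing-delivery-latency-p99",
    "Namespace": "AWS/Events",
    "MetricName": "IngestionToInvocationSuccessLatency",
    "Dimensions": [{"Name": "RuleName", "Value": "orders-processing"}],
    "ExtendedStatistic": "p99",
    "Period": 300,
    "EvaluationPeriods": 3,          # sustained for 15 minutes
    "Threshold": 5000.0,             # milliseconds; set from the baseline
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",
}
# Applied with: boto3.client("cloudwatch").put_metric_alarm(**latency_alarm)
```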

Alert 7 (Warning): ThrottledRules > 0

Severity: Warning / P2 if persistent

Why: AWS documents that throttled rule executions may delay invocations.

This is especially useful when traffic grows and quota pressure starts to show up.


My dashboard layout (simple but effective)

I try not to overcomplicate the dashboard. I keep one dashboard section per critical rule / target path, and I always put these together:

Panel group A: Delivery health

  • InvocationAttempts
  • SuccessfulInvocationAttempts
  • RetryInvocationAttempts
  • FailedInvocations
  • InvocationsSentToDlq

Panel group B: Delivery quality

  • SuccessfulInvocationRate (metric math)
  • IngestionToInvocationSuccessLatency (p50, p95, p99)

Panel group C: Protection / risk

  • InvocationsFailedToBeSentToDlq
  • ThrottledRules
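The three panel groups above can be stamped out per rule with a small dashboard generator. A minimal sketch, assuming the `["Namespace", "MetricName", "DimName", "DimValue"]` widget metric format and a hypothetical rule name:

```python
import json

# Build a per-rule CloudWatch dashboard body mirroring the three panel
# groups. Rule name is hypothetical; widget sizes are arbitrary.
RULE = "orders-processing"

def widget(title, metric_names, stat="Sum"):
    """One graph widget over a list of AWS/Events metrics for RULE."""
    return {"type": "metric", "width": 8, "height": 6,
            "properties": {"title": title, "stat": stat, "period": 300,
                           "metrics": [["AWS/Events", m, "RuleName", RULE]
                                       for m in metric_names]}}

dashboard_body = json.dumps({"widgets": [
    widget("Delivery health", ["InvocationAttempts", "SuccessfulInvocationAttempts",
                               "RetryInvocationAttempts", "FailedInvocations",
                               "InvocationsSentToDlq"]),
    widget("Delivery quality", ["IngestionToInvocationSuccessLatency"], stat="p99"),
    widget("Protection / risk", ["InvocationsFailedToBeSentToDlq", "ThrottledRules"]),
]})

# Applied with: cloudwatch.put_dashboard(DashboardName=f"{RULE}-delivery",
#     DashboardBody=dashboard_body)
```

The SuccessfulInvocationRate metric-math panel would be added the same way with an `expression` entry in the widget's metrics list.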

This matches AWS’s “combine multiple metrics for a holistic overview” guidance, and in practice it makes troubleshooting much faster.


Practical runbook: how I interpret the pattern combinations

This is the part that saves the most time during incidents.

Pattern 1: Retries rising, success rate still high

Likely meaning: target is stressed but still mostly coping

Metrics pattern:

  • RetryInvocationAttempts up
  • SuccessfulInvocationRate mostly stable
  • latency rising (IngestionToInvocationSuccessLatency)

What I check

  • target concurrency / scaling
  • downstream dependency latency
  • throttling on target service

Pattern 2: Retries rising, success rate dropping, DLQ starts filling

Likely meaning: target is underprovisioned or misconfigured

Metrics pattern:

  • RetryInvocationAttempts up
  • SuccessfulInvocationRate down
  • InvocationsSentToDlq > 0

What I check

  • target capacity / rate limits
  • permission changes
  • endpoint availability / DNS / networking
  • recent deployments

Pattern 3: Failed invocations and DLQ-send failures

Likely meaning: primary path failing and safety net misconfigured

Metrics pattern:

  • FailedInvocations > 0
  • InvocationsFailedToBeSentToDlq > 0

What I do

  • page immediately
  • validate DLQ queue, region, permissions, and policy
  • verify EventBridge can SendMessage to the SQS DLQ

AWS’s DLQ guidance includes permission requirements and examples, and this is the first place I look when DLQ delivery fails.


Common mistakes I see (and try to avoid)

1) Only alerting on FailedInvocations

This catches issues late. By the time permanent failures show up, you may already have retries, delays, and customer impact.

Better: also alert on retries and latency.


2) No DLQ configured

AWS explicitly recommends configuring a DLQ for each rule target to avoid losing undelivered events.


3) No alarm on InvocationsFailedToBeSentToDlq

This is the “we thought we were safe, but we are not” blind spot.


4) One global alarm for all rules

This creates noisy alerts and slow triage.

Better: alarm per critical rule / target path, and map ownership.


5) Using static latency thresholds with no baseline thinking

Different targets behave differently. A Lambda target and an API destination will not share the same normal latency profile.


A practical starting checklist (copy this into your ops notes)

  • [ ] Configure DLQ for each critical EventBridge rule target
  • [ ] Alarm on InvocationsFailedToBeSentToDlq > 0 (critical)
  • [ ] Alarm on FailedInvocations > 0 for critical rules
  • [ ] Alarm on InvocationsSentToDlq > 0 for critical rules
  • [ ] Track RetryInvocationAttempts trend (early warning)
  • [ ] Create SuccessfulInvocationRate metric math and alert on sustained dips
  • [ ] Track IngestionToInvocationSuccessLatency p95/p99 and alert on sustained increase
  • [ ] Add ThrottledRules alert if traffic growth is expected
  • [ ] Build per-rule dashboards for critical flows
  • [ ] Document a DLQ triage + replay runbook

Final thoughts

If I had to summarize the best practice in one sentence, it would be this:

Do not monitor EventBridge delivery with a single failure metric. Monitor retries, success rate, DLQ activity, and delivery latency together.

That combination gives me an early warning signal, a confirmed-failure signal, and a customer-impact signal, which is exactly what I want in production.


References

  • AWS EventBridge monitoring best practices for event delivery (recommended metrics, success-rate metric math, retries/DLQ guidance). (AWS Documentation)
  • AWS EventBridge monitoring metrics reference (AWS/Events metrics and definitions). (AWS Documentation)
  • AWS EventBridge retry behavior (default retry duration and count). (AWS Documentation)
  • AWS EventBridge DLQ guidance (DLQ behavior, limitations, permissions, message attributes). (AWS Documentation)
