Voskan Voskanyan

Event-Driven Architecture in Production: Designing with AWS EventBridge and Go

We recently finished moving a set of synchronous REST flows at SolarGenix.ai to an event-driven core. The goal wasn’t to chase buzzwords; we wanted fewer cross-service dependencies, clearer ownership, and predictable delivery under load. Below is what we actually shipped: EventBridge as the bus, Go for publishers/consumers, DynamoDB Streams for projections, and a small set of rules that kept us out of trouble.

Two quick numbers after cutover:

  • Fast-path reads p95: 48 ms -> 11 ms (UI paths now hit a keyed projection or cache).
  • DLQ rate: <0.1% over the last 30 days (and most of those were known poison messages we fixed and replayed).

Why we moved off "sync everything"

We had a handful of handlers that fanned out to multiple services. When one downstream slowed down, the whole user flow did too. We kept REST at the edge (users), but replaced side-effects with events:

  • Before: handler -> service A -> service B -> service C
  • After: handler (200 OK) -> publish proposal.published -> N consumers do their work independently

The day we shipped the first flow, on-call went noticeably quieter.


Bus and naming (obfuscated but real)

  • Bus: app-core-prod (staging mirrors prod names with -stg suffix)
  • Detail-type: proposal.published:v2 (the type:majorVersion convention) so we can route by version without parsing JSON
  • Rule example: proposal-published-v2 -> targets emailer, projector, analytics indexer
  • DLQ: one per target, e.g. dlq-proposal-emailer (ownership is obvious)

Event envelope and versioning

We keep a small, stable envelope and version the detail-type:

{
  "id": "01J9Q9W0R3W3S3DAXYVZ8R3M7V",
  "source": "app.solargenix",
  "type": "proposal.published",
  "version": "v2",
  "occurredAt": "2025-10-10T14:33:12Z",
  "data": {
    "proposalId": "pr_0f3b1e",
    "accountId": "acc_7a21",
    "publishedBy": "u_193",
    "currency": "USD"
  }
} 
  • Minor fields -> additive only.
  • Major changes -> dual-publish (v1 + v2) for a sprint; consumers opt in when ready.
  • Schemas live in the EventBridge Schema Registry and we generate Go bindings.
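The generated bindings aren't shown in the post; here is a hand-written sketch of what a Go type for this envelope could look like (names like ProposalData and parse are mine, not from the registry):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event mirrors the envelope above. In our setup the concrete type is
// generated from the EventBridge Schema Registry; this hand-written
// version has the same shape.
type Event struct {
	ID         string       `json:"id"`
	Source     string       `json:"source"`
	Type       string       `json:"type"`
	Version    string       `json:"version"`
	OccurredAt time.Time    `json:"occurredAt"`
	Data       ProposalData `json:"data"`
}

type ProposalData struct {
	ProposalID  string `json:"proposalId"`
	AccountID   string `json:"accountId"`
	PublishedBy string `json:"publishedBy"`
	Currency    string `json:"currency"`
}

const sampleEnvelope = `{
  "id": "01J9Q9W0R3W3S3DAXYVZ8R3M7V",
  "source": "app.solargenix",
  "type": "proposal.published",
  "version": "v2",
  "occurredAt": "2025-10-10T14:33:12Z",
  "data": {"proposalId": "pr_0f3b1e", "accountId": "acc_7a21", "publishedBy": "u_193", "currency": "USD"}
}`

// parse decodes an envelope; consumers switch on Type+Version before
// touching Data.
func parse(raw string) (Event, error) {
	var ev Event
	err := json.Unmarshal([]byte(raw), &ev)
	return ev, err
}

func main() {
	ev, err := parse(sampleEnvelope)
	if err != nil {
		panic(err)
	}
	fmt.Println(ev.Type+":"+ev.Version, ev.Data.ProposalID)
}
```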

Idempotency: assume duplicates

EventBridge and Streams are at-least-once. We treat duplicates as normal:

  • Every event has a stable id (we use ULID).
  • Consumers write eventId into a small idempotency table (DynamoDB) with a TTL >= retry window.
  • If the key exists, we skip the side effect and move on.

Minimal consumer shape (SQS buffer -> Lambda):

// Handle consumes events from an SQS buffer in front of the Lambda.
// Returning an error retries the whole batch; once a message exceeds the
// queue's maxReceiveCount, SQS moves it to the target's DLQ.
func Handle(ctx context.Context, e events.SQSEvent) error {
    for _, r := range e.Records {
        var ev Event
        if err := json.Unmarshal([]byte(r.Body), &ev); err != nil {
            return err // malformed body: lands in the DLQ after retries
        }

        seen, err := idem.Seen(ctx, ev.ID, 24*time.Hour)
        if err != nil {
            return err
        }
        if seen {
            continue // duplicate delivery; side effect already applied
        }

        if err := apply(ev); err != nil {
            return err // retried by SQS/Lambda
        }
    }
    return nil
}

Delivery guarantees, retries, and DLQs

Each rule target sets an explicit RetryPolicy and a DLQ:

{
  "Name": "proposal-published-v2",
  "EventPattern": { "source": ["app.solargenix"], "detail-type": ["proposal.published:v2"] },
  "Targets": [{
    "Arn": "arn:aws:lambda:...:function:proposal-emailer",
    "RetryPolicy": { "MaximumRetryAttempts": 185, "MaximumEventAgeInSeconds": 86400 },
    "DeadLetterConfig": { "Arn": "arn:aws:sqs:...:dlq-proposal-emailer" }
  }]
}

This is boring on purpose. When something breaks, it breaks locally (one rule/target), not across a chain.


DynamoDB Streams for projections

Where the record of truth lives in DynamoDB (e.g., counters, hot entities), we project changes out with Streams:

  • Streams retention is 24h; consumers must catch up within that window or lose records.
  • Ordering is per item, not global. If we care about causal order, we put a version and timestamps in the payload and validate before applying.
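Since ordering is only per item, the "validate before applying" step can be a monotonic version gate. A sketch with hypothetical names:

```go
package main

import "fmt"

// Row is a projection record carrying the last applied source version.
type Row struct {
	ID      string
	Version int64
	Total   int64
}

// applyIfNewer applies a change only when its version is strictly newer
// than what the projection has already seen; stale or duplicate stream
// records are dropped instead of overwriting fresher state.
func applyIfNewer(row *Row, version, total int64) bool {
	if version <= row.Version {
		return false // out-of-order or duplicate record: skip
	}
	row.Version = version
	row.Total = total
	return true
}

func main() {
	row := &Row{ID: "pr_0f3b1e"}
	fmt.Println(applyIfNewer(row, 2, 100)) // applies
	fmt.Println(applyIfNewer(row, 1, 90))  // stale, ignored
	fmt.Println(row.Total)
}
```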

Observability: alarms we actually watch

We wired a few CloudWatch alarms that correlate with user pain. Two that paid off immediately:

DLQ depth / age

{
  "AlarmName": "dlq-proposal-emailer-depth",
  "MetricName": "ApproximateNumberOfMessagesVisible",
  "Namespace": "AWS/SQS",
  "Dimensions": [{"Name":"QueueName","Value":"dlq-proposal-emailer"}],
  "Statistic": "Sum",
  "Period": 60,
  "EvaluationPeriods": 3,
  "Threshold": 5,
  "ComparisonOperator": "GreaterThanOrEqualToThreshold"
} 

EventBridge rule failure rate (per target)

Metric math on Invocations vs FailedInvocations for proposal-published-v2 -> proposal-emailer. We page when failures > 0 for 5 minutes. That caught a bad template deploy before users did.
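For reference, that alarm can be expressed with CloudWatch metric math over the AWS/Events FailedInvocations metric; this shape is illustrative, not our exact config:

```json
{
  "AlarmName": "proposal-emailer-failed-invocations",
  "Metrics": [
    {
      "Id": "fail",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Events",
          "MetricName": "FailedInvocations",
          "Dimensions": [{ "Name": "RuleName", "Value": "proposal-published-v2" }]
        },
        "Period": 60,
        "Stat": "Sum"
      },
      "ReturnData": false
    },
    { "Id": "failures", "Expression": "FILL(fail, 0)", "ReturnData": true }
  ],
  "EvaluationPeriods": 5,
  "Threshold": 0,
  "ComparisonOperator": "GreaterThanThreshold"
}
```

FILL(fail, 0) treats missing datapoints as zero so quiet minutes don't mask a failure streak.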


Migration notes (what actually happened)

  • We wrapped the old REST handler so it returns 200 quickly and publishes an event.
  • New consumers ran in shadow for a sprint and wrote their results next to the old path so we could diff.
  • We hit one snag: an emailer Lambda assumed global ordering; a burst of duplicate events produced repeat sends. Fix was trivial once we added idempotency with a 24h TTL and keyed by eventId + recipient.

Results after cutover

  • Fast-path p95 reads: 48 ms -> 11 ms (most UI paths now hit a projection or cache).
  • DLQ rate: <0.1% in 30 days; replays are scripted and boring.
  • Deploys: producers and consumers ship independently; on-call load is down because failures isolate to one rule/target.

Minimal Go publisher

// Publish sends one event to the bus. In real code, build the client once
// at startup instead of loading config on every call.
func Publish(ctx context.Context, bus, region string, ev Event) error {
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
    if err != nil {
        return err
    }
    cli := eventbridge.NewFromConfig(cfg)

    detailType := fmt.Sprintf("%s:%s", ev.Type, ev.Version) // e.g. proposal.published:v2
    payload, err := json.Marshal(ev)
    if err != nil {
        return err
    }

    out, err := cli.PutEvents(ctx, &eventbridge.PutEventsInput{
        Entries: []types.PutEventsRequestEntry{{
            EventBusName: &bus,
            Source:       aws.String(ev.Source),
            DetailType:   aws.String(detailType),
            Time:         aws.Time(ev.OccurredAt),
            Detail:       aws.String(string(payload)),
        }},
    })
    if err != nil {
        return err
    }
    // PutEvents can partially fail without returning an error.
    if out.FailedEntryCount > 0 {
        return fmt.Errorf("putevents: %d entries failed", out.FailedEntryCount)
    }
    return nil
}

EventBridge isn’t magic, but it let us decouple without inventing infrastructure. The patterns above (clean envelopes, type:version detail-types, idempotency, per-target DLQs, and a few boring alarms) were enough to make the system predictable.

If there’s interest, I’ll follow up with the exact fan-out rules we use and how we handle backfill/replay safely. If this was useful, follow my profile and the SolarGenix.ai page to catch the next deep dives and benchmarks as they land.
