We recently finished moving a set of synchronous REST flows at SolarGenix.ai to an event-driven core. The goal wasn’t to chase buzzwords; we wanted fewer cross-service dependencies, clearer ownership, and predictable delivery under load. Below is what we actually shipped: EventBridge as the bus, Go for publishers/consumers, DynamoDB Streams for projections, and a small set of rules that kept us out of trouble.
Two quick numbers after cutover:
- Fast-path reads p95: 48 ms -> 11 ms (UI paths now hit a keyed projection or cache).
- DLQ rate: <0.1% over the last 30 days (and most of those were known poison messages we fixed and replayed).
Why we moved off "sync everything"
We had a handful of handlers that fanned out to multiple services. When one downstream slowed down, the whole user flow did too. We kept REST at the edge (user-facing requests) but replaced the side effects with events:
- Before: handler -> service A -> service B -> service C
- After: handler (200 OK) -> publish proposal.published -> N consumers do their work independently
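For a sense of the handler shape after the change, here is a minimal sketch. The persist and publish arguments are hypothetical stand-ins for the datastore write we own and the EventBridge publish (a Publish function appears later in the post); none of this is our exact code.

import (
    "context"
    "encoding/json"
    "net/http"
)

// newPublishHandler: do the one write this service owns, emit the event, return.
// persist and publish are illustrative stand-ins, not our real collaborators.
func newPublishHandler(persist, publish func(ctx context.Context, proposalID string) error) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var req struct {
            ProposalID string `json:"proposalId"`
        }
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, "bad request", http.StatusBadRequest)
            return
        }
        // The only synchronous work left: the record this service owns.
        if err := persist(r.Context(), req.ProposalID); err != nil {
            http.Error(w, "internal error", http.StatusInternalServerError)
            return
        }
        // Side effects (email, projections, analytics) hang off the event,
        // not inline calls to services A/B/C.
        if err := publish(r.Context(), req.ProposalID); err != nil {
            http.Error(w, "internal error", http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}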
The day we shipped the first flow, on-call went noticeably quieter.
Bus and naming (obfuscated but real)
- Bus: app-core-prod (staging mirrors prod names with -stg suffix)
- Detail-type: proposal.published:v2 (format: type:major-version), so we can route by version without parsing JSON
- Rule example: proposal-published-v2 -> targets emailer, projector, analytics indexer
- DLQ: one per target, e.g. dlq-proposal-emailer (ownership is obvious)
Event envelope and versioning
We keep a small, stable envelope and version the detail-type:
{
  "id": "01J9Q9W0R3W3S3DAXYVZ8R3M7V",
  "source": "app.solargenix",
  "type": "proposal.published",
  "version": "v2",
  "occurredAt": "2025-10-10T14:33:12Z",
  "data": {
    "proposalId": "pr_0f3b1e",
    "accountId": "acc_7a21",
    "publishedBy": "u_193",
    "currency": "USD"
  }
}
- Minor changes -> additive fields only.
- Major changes -> dual-publish (v1 + v2) for a sprint; consumers opt in when ready.
- Schemas live in the EventBridge Schema Registry and we generate Go bindings.
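For reference, the Go binding for this event looks roughly like the struct below. It is hand-written here for illustration; the real types are generated from the registry, so field and type names may differ.

import "time"

// Shared envelope; Data is the per-event payload, shown here for proposal.published:v2.
type Event struct {
    ID         string                `json:"id"`
    Source     string                `json:"source"`
    Type       string                `json:"type"`
    Version    string                `json:"version"`
    OccurredAt time.Time             `json:"occurredAt"`
    Data       ProposalPublishedData `json:"data"`
}

type ProposalPublishedData struct {
    ProposalID  string `json:"proposalId"`
    AccountID   string `json:"accountId"`
    PublishedBy string `json:"publishedBy"`
    Currency    string `json:"currency"`
}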
Idempotency: assume duplicates
EventBridge and Streams are at-least-once. We treat duplicates as normal:
- Every event has a stable id (we use ULID).
- Consumers write eventId into a small idempotency table (DynamoDB) with a TTL >= retry window.
- If the key exists, we skip the side effect and move on.
Minimal consumer shape (SQS buffer -> Lambda):
func Handle(ctx context.Context, e events.SQSEvent) error {
    for _, r := range e.Records {
        var ev Event
        if err := json.Unmarshal([]byte(r.Body), &ev); err != nil {
            return err
        }
        seen, err := idem.Seen(ctx, ev.ID, 24*time.Hour)
        if err != nil {
            return err
        }
        if seen {
            continue // duplicate delivery; the side effect already happened
        }
        if err := apply(ev); err != nil {
            return err // retried by SQS/Lambda
        }
    }
    return nil
}
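The idem.Seen call above is just a conditional write into the idempotency table. A minimal sketch, assuming a table with a string partition key pk and a Unix-epoch ttl attribute (the Store type and attribute names are illustrative, not our exact schema):

import (
    "context"
    "errors"
    "strconv"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

type Store struct {
    db    *dynamodb.Client
    table string
}

// Seen records eventID and reports whether it was already present.
// The conditional put makes check-and-record a single atomic operation.
func (s *Store) Seen(ctx context.Context, eventID string, window time.Duration) (bool, error) {
    _, err := s.db.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String(s.table),
        Item: map[string]types.AttributeValue{
            "pk":  &types.AttributeValueMemberS{Value: eventID},
            "ttl": &types.AttributeValueMemberN{Value: strconv.FormatInt(time.Now().Add(window).Unix(), 10)},
        },
        ConditionExpression: aws.String("attribute_not_exists(pk)"),
    })
    if err != nil {
        var ccf *types.ConditionalCheckFailedException
        if errors.As(err, &ccf) {
            return true, nil // duplicate delivery; skip the side effect
        }
        return false, err
    }
    return false, nil
}

For the emailer we key by eventId + recipient rather than eventId alone (see the migration notes below), but the shape is the same.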
Delivery guarantees, retries, and DLQs
Each rule target sets an explicit RetryPolicy and a DLQ:
{
  "Name": "proposal-published-v2",
  "EventPattern": {
    "source": ["app.solargenix"],
    "detail-type": ["proposal.published:v2"]
  },
  "Targets": [{
    "Arn": "arn:aws:lambda:...:function:proposal-emailer",
    "RetryPolicy": { "MaximumRetryAttempts": 185, "MaximumEventAgeInSeconds": 86400 },
    "DeadLetterConfig": { "Arn": "arn:aws:sqs:...:dlq-proposal-emailer" }
  }]
}
This is boring on purpose. When something breaks, it breaks locally (one rule/target), not across a chain.
DynamoDB Streams for projections
Where the record of truth lives in DynamoDB (e.g., counters, hot entities), we project changes out with Streams:
- Streams retention is 24h; consumers have to catch up within that window or the records are gone.
- Ordering is per item, not global. If we care about causal order, we put a version and timestamps in the payload and validate before applying (see the sketch below).
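A sketch of that version check in a Streams consumer. The pk/version attribute names and the Projection interface are illustrative; wire proj in from main (for example via a closure) to get a standard Lambda handler.

import (
    "context"
    "strconv"

    "github.com/aws/aws-lambda-go/events"
)

// Projection is an illustrative interface over whatever store the projection lives in.
type Projection interface {
    CurrentVersion(ctx context.Context, key string) (int64, error)
    Apply(ctx context.Context, key string, version int64, image map[string]events.DynamoDBAttributeValue) error
}

// HandleStream applies new images in per-item order and skips anything stale.
func HandleStream(ctx context.Context, proj Projection, e events.DynamoDBEvent) error {
    for _, r := range e.Records {
        img := r.Change.NewImage
        if len(img) == 0 {
            continue // REMOVE events carry no new image
        }
        key := img["pk"].String()
        version, err := strconv.ParseInt(img["version"].Number(), 10, 64)
        if err != nil {
            return err
        }
        current, err := proj.CurrentVersion(ctx, key)
        if err != nil {
            return err
        }
        if version <= current {
            continue // stale or duplicate image; the projection is already past it
        }
        if err := proj.Apply(ctx, key, version, img); err != nil {
            return err
        }
    }
    return nil
}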
Observability: alarms we actually watch
We wired a few CloudWatch alarms that correlate with user pain. Two that paid off immediately:
DLQ depth / age
{
  "AlarmName": "dlq-proposal-emailer-depth",
  "MetricName": "ApproximateNumberOfMessagesVisible",
  "Namespace": "AWS/SQS",
  "Dimensions": [{ "Name": "QueueName", "Value": "dlq-proposal-emailer" }],
  "Statistic": "Sum",
  "Period": 60,
  "EvaluationPeriods": 3,
  "Threshold": 5,
  "ComparisonOperator": "GreaterThanOrEqualToThreshold"
}
EventBridge rule failure rate (per target)
Metric math on Invocations vs FailedInvocations for proposal-published-v2 -> proposal-emailer. We page when failures > 0 for 5 minutes. That caught a bad template deploy before users did.
Migration notes (what actually happened)
- We wrapped the old REST handler so it returns 200 quickly and publishes an event.
- New consumers ran in shadow for a sprint and wrote their results next to the old path so we could diff.
- We hit one snag: an emailer Lambda assumed global ordering, and a burst of duplicate events produced repeat sends. The fix was trivial once we added idempotency keyed by eventId + recipient with a 24h TTL.
Results after cutover
- Fast-path p95 reads: 48 ms -> 11 ms (most UI paths now hit a projection or cache).
- DLQ rate: <0.1% in 30 days; replays are scripted and boring.
- Deploys: producers and consumers ship independently; on-call load is down because failures isolate to one rule/target.
Minimal Go publisher
func Publish(ctx context.Context, bus, region string, ev Event) error {
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
    if err != nil {
        return err
    }
    cli := eventbridge.NewFromConfig(cfg)
    detailType := fmt.Sprintf("%s:%s", ev.Type, ev.Version) // e.g. proposal.published:v2
    payload, err := json.Marshal(ev)
    if err != nil {
        return err
    }
    out, err := cli.PutEvents(ctx, &eventbridge.PutEventsInput{
        Entries: []types.PutEventsRequestEntry{{
            EventBusName: &bus,
            Source:       aws.String(ev.Source),
            DetailType:   aws.String(detailType),
            Time:         aws.Time(ev.OccurredAt),
            Detail:       aws.String(string(payload)),
        }},
    })
    if err != nil {
        return err
    }
    // PutEvents can partially fail without returning an error; surface that too.
    if out.FailedEntryCount > 0 {
        return fmt.Errorf("put events: %d entries failed", out.FailedEntryCount)
    }
    return nil
}
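Calling it looks roughly like this. The bus name and IDs are the obfuscated examples from earlier; the region literal and the oklog/ulid dependency are placeholders for your own config and ID generator.

ev := Event{
    ID:         ulid.Make().String(), // stable id so consumers can dedupe
    Source:     "app.solargenix",
    Type:       "proposal.published",
    Version:    "v2",
    OccurredAt: time.Now().UTC(),
    Data: ProposalPublishedData{
        ProposalID:  "pr_0f3b1e",
        AccountID:   "acc_7a21",
        PublishedBy: "u_193",
        Currency:    "USD",
    },
}
if err := Publish(ctx, "app-core-prod", "us-east-1", ev); err != nil {
    return err // caller retries; consumers are idempotent, so duplicates are safe
}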
EventBridge isn’t magic, but it let us decouple without inventing infrastructure. The patterns above (clean envelopes, type:version detail-types, idempotency, per-target DLQs, and a few boring alarms) were enough to make the system predictable.
If there’s interest, I’ll follow up with the exact fan-out rules we use and how we handle backfill/replay safely. If this was useful, follow my profile and the SolarGenix.ai page to catch the next deep dives and benchmarks as they land.