We recently finished moving a set of synchronous REST flows at SolarGenix.ai to an event-driven core. The goal wasn’t to chase buzzwords; we wanted fewer cross-service dependencies, clearer ownership, and predictable delivery under load. Below is what we actually shipped: EventBridge as the bus, Go for publishers/consumers, DynamoDB Streams for projections, and a small set of rules that kept us out of trouble.
Two quick numbers after cutover:
- Fast-path reads p95: 48 ms -> 11 ms (UI paths now hit a keyed projection or cache).
- DLQ rate: <0.1% over the last 30 days (and most of those were known poison messages we fixed and replayed).
Why we moved off "sync everything"
We had a handful of handlers that fanned out to multiple services. When one downstream slowed down, the whole user flow did too. We kept REST at the edge (user-facing requests) but replaced the side effects with events:
- Before: handler -> service A -> service B -> service C
- After: handler (200 OK) -> publish proposal.published -> N consumers do their work independently
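For a sense of the handler shape after the change, here is a minimal sketch. The persist and publish arguments are hypothetical stand-ins for the datastore write we own and the EventBridge publish (a Publish function appears later in the post); none of this is our exact code.

import (
    "context"
    "encoding/json"
    "net/http"
)

// newPublishHandler: do the one write this service owns, emit the event, return.
// persist and publish are illustrative stand-ins, not our real collaborators.
func newPublishHandler(persist, publish func(ctx context.Context, proposalID string) error) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var req struct {
            ProposalID string `json:"proposalId"`
        }
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, "bad request", http.StatusBadRequest)
            return
        }
        // The only synchronous work left: the record this service owns.
        if err := persist(r.Context(), req.ProposalID); err != nil {
            http.Error(w, "internal error", http.StatusInternalServerError)
            return
        }
        // Side effects (email, projections, analytics) hang off the event,
        // not inline calls to services A/B/C.
        if err := publish(r.Context(), req.ProposalID); err != nil {
            http.Error(w, "internal error", http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}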
The day we shipped the first flow, on-call went noticeably quieter.
Bus and naming (obfuscated but real)
- Bus: app-core-prod (staging mirrors prod names with -stg suffix)
- Detail-type: proposal.published:v2 (format: type:major-version), so we can route by version without parsing JSON
- Rule example: proposal-published-v2 -> targets emailer, projector, analytics indexer
- DLQ: one per target, e.g. dlq-proposal-emailer (ownership is obvious)
Event envelope and versioning
We keep a small, stable envelope and version the detail-type:
{
  "id": "01J9Q9W0R3W3S3DAXYVZ8R3M7V",
  "source": "app.solargenix",
  "type": "proposal.published",
  "version": "v2",
  "occurredAt": "2025-10-10T14:33:12Z",
  "data": {
    "proposalId": "pr_0f3b1e",
    "accountId": "acc_7a21",
    "publishedBy": "u_193",
    "currency": "USD"
  }
}
- Minor changes -> additive fields only.
- Major changes -> dual-publish (v1 + v2) for a sprint; consumers opt in when ready.
- Schemas live in the EventBridge Schema Registry and we generate Go bindings.
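For reference, the Go binding for this event looks roughly like the struct below. It is hand-written here for illustration; the real types are generated from the registry, so field and type names may differ.

import "time"

// Shared envelope; Data is the per-event payload, shown here for proposal.published:v2.
type Event struct {
    ID         string                `json:"id"`
    Source     string                `json:"source"`
    Type       string                `json:"type"`
    Version    string                `json:"version"`
    OccurredAt time.Time             `json:"occurredAt"`
    Data       ProposalPublishedData `json:"data"`
}

type ProposalPublishedData struct {
    ProposalID  string `json:"proposalId"`
    AccountID   string `json:"accountId"`
    PublishedBy string `json:"publishedBy"`
    Currency    string `json:"currency"`
}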
Idempotency: assume duplicates
EventBridge and Streams are at-least-once. We treat duplicates as normal:
- Every event has a stable id (we use ULID).
- Consumers write eventId into a small idempotency table (DynamoDB) with a TTL >= retry window.
- If the key exists, we skip the side effect and move on.
Minimal consumer shape (SQS buffer -> Lambda):
func Handle(ctx context.Context, e events.SQSEvent) error {
    for _, r := range e.Records {
        var ev Event
        if err := json.Unmarshal([]byte(r.Body), &ev); err != nil {
            return err
        }
        seen, err := idem.Seen(ctx, ev.ID, 24*time.Hour)
        if err != nil {
            return err
        }
        if seen {
            continue // duplicate delivery; the side effect already happened
        }
        if err := apply(ev); err != nil {
            return err // retried by SQS/Lambda
        }
    }
    return nil
}
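The idem.Seen call above is just a conditional write into the idempotency table. A minimal sketch, assuming a table with a string partition key pk and a Unix-epoch ttl attribute (the Store type and attribute names are illustrative, not our exact schema):

import (
    "context"
    "errors"
    "strconv"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

type Store struct {
    db    *dynamodb.Client
    table string
}

// Seen records eventID and reports whether it was already present.
// The conditional put makes check-and-record a single atomic operation.
func (s *Store) Seen(ctx context.Context, eventID string, window time.Duration) (bool, error) {
    _, err := s.db.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String(s.table),
        Item: map[string]types.AttributeValue{
            "pk":  &types.AttributeValueMemberS{Value: eventID},
            "ttl": &types.AttributeValueMemberN{Value: strconv.FormatInt(time.Now().Add(window).Unix(), 10)},
        },
        ConditionExpression: aws.String("attribute_not_exists(pk)"),
    })
    if err != nil {
        var ccf *types.ConditionalCheckFailedException
        if errors.As(err, &ccf) {
            return true, nil // duplicate delivery; skip the side effect
        }
        return false, err
    }
    return false, nil
}

For the emailer we key by eventId + recipient rather than eventId alone (see the migration notes below), but the shape is the same.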
Delivery guarantees, retries, and DLQs
Each rule target sets an explicit RetryPolicy and a DLQ:
{
  "Name": "proposal-published-v2",
  "EventPattern": {
    "source": ["app.solargenix"],
    "detail-type": ["proposal.published:v2"]
  },
  "Targets": [{
    "Arn": "arn:aws:lambda:...:function:proposal-emailer",
    "RetryPolicy": { "MaximumRetryAttempts": 185, "MaximumEventAgeInSeconds": 86400 },
    "DeadLetterConfig": { "Arn": "arn:aws:sqs:...:dlq-proposal-emailer" }
  }]
}
This is boring on purpose. When something breaks, it breaks locally (one rule/target), not across a chain.
DynamoDB Streams for projections
Where the record of truth lives in DynamoDB (e.g., counters, hot entities), we project changes out with Streams:
- Streams retention is 24h; consumers have to catch up within that window or the records are gone.
- Ordering is per item, not global. If we care about causal order, we put a version and timestamps in the payload and validate before applying (see the sketch below).
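A sketch of that version check in a Streams consumer. The pk/version attribute names and the Projection interface are illustrative; wire proj in from main (for example via a closure) to get a standard Lambda handler.

import (
    "context"
    "strconv"

    "github.com/aws/aws-lambda-go/events"
)

// Projection is an illustrative interface over whatever store the projection lives in.
type Projection interface {
    CurrentVersion(ctx context.Context, key string) (int64, error)
    Apply(ctx context.Context, key string, version int64, image map[string]events.DynamoDBAttributeValue) error
}

// HandleStream applies new images in per-item order and skips anything stale.
func HandleStream(ctx context.Context, proj Projection, e events.DynamoDBEvent) error {
    for _, r := range e.Records {
        img := r.Change.NewImage
        if len(img) == 0 {
            continue // REMOVE events carry no new image
        }
        key := img["pk"].String()
        version, err := strconv.ParseInt(img["version"].Number(), 10, 64)
        if err != nil {
            return err
        }
        current, err := proj.CurrentVersion(ctx, key)
        if err != nil {
            return err
        }
        if version <= current {
            continue // stale or duplicate image; the projection is already past it
        }
        if err := proj.Apply(ctx, key, version, img); err != nil {
            return err
        }
    }
    return nil
}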
Observability: alarms we actually watch
We wired a few CloudWatch alarms that correlate with user pain. Two that paid off immediately:
DLQ depth / age
{
  "AlarmName": "dlq-proposal-emailer-depth",
  "MetricName": "ApproximateNumberOfMessagesVisible",
  "Namespace": "AWS/SQS",
  "Dimensions": [{ "Name": "QueueName", "Value": "dlq-proposal-emailer" }],
  "Statistic": "Sum",
  "Period": 60,
  "EvaluationPeriods": 3,
  "Threshold": 5,
  "ComparisonOperator": "GreaterThanOrEqualToThreshold"
}
EventBridge rule failure rate (per target)
Metric math on Invocations vs FailedInvocations for proposal-published-v2 -> proposal-emailer. We page when failures > 0 for 5 minutes. That caught a bad template deploy before users did.
Migration notes (what actually happened)
- We wrapped the old REST handler so it returns 200 quickly and publishes an event.
- New consumers ran in shadow for a sprint and wrote their results next to the old path so we could diff.
- We hit one snag: an emailer Lambda assumed global ordering, and a burst of duplicate events produced repeat sends. The fix was trivial once we added idempotency keyed by eventId + recipient with a 24h TTL.
Results after cutover
- Fast-path p95 reads: 48 ms -> 11 ms (most UI paths now hit a projection or cache).
- DLQ rate: <0.1% in 30 days; replays are scripted and boring.
- Deploys: producers and consumers ship independently; on-call load is down because failures isolate to one rule/target.
Minimal Go publisher
func Publish(ctx context.Context, bus, region string, ev Event) error {
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
    if err != nil {
        return err
    }
    cli := eventbridge.NewFromConfig(cfg)
    detailType := fmt.Sprintf("%s:%s", ev.Type, ev.Version) // e.g. proposal.published:v2
    payload, err := json.Marshal(ev)
    if err != nil {
        return err
    }
    out, err := cli.PutEvents(ctx, &eventbridge.PutEventsInput{
        Entries: []types.PutEventsRequestEntry{{
            EventBusName: &bus,
            Source:       aws.String(ev.Source),
            DetailType:   aws.String(detailType),
            Time:         aws.Time(ev.OccurredAt),
            Detail:       aws.String(string(payload)),
        }},
    })
    if err != nil {
        return err
    }
    // PutEvents can partially fail without returning an error; surface that too.
    if out.FailedEntryCount > 0 {
        return fmt.Errorf("put events: %d entries failed", out.FailedEntryCount)
    }
    return nil
}
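Calling it looks roughly like this. The bus name and IDs are the obfuscated examples from earlier; the region literal and the oklog/ulid dependency are placeholders for your own config and ID generator.

ev := Event{
    ID:         ulid.Make().String(), // stable id so consumers can dedupe
    Source:     "app.solargenix",
    Type:       "proposal.published",
    Version:    "v2",
    OccurredAt: time.Now().UTC(),
    Data: ProposalPublishedData{
        ProposalID:  "pr_0f3b1e",
        AccountID:   "acc_7a21",
        PublishedBy: "u_193",
        Currency:    "USD",
    },
}
if err := Publish(ctx, "app-core-prod", "us-east-1", ev); err != nil {
    return err // caller retries; consumers are idempotent, so duplicates are safe
}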
EventBridge isn’t magic, but it let us decouple without inventing infrastructure. The patterns above (clean envelopes, type:version detail-types, idempotency, per-target DLQs, and a few boring alarms) were enough to make the system predictable.
If there’s interest, I’ll follow up with the exact fan-out rules we use and how we handle backfill/replay safely. If this was useful, follow my profile and the SolarGenix.ai page to catch the next deep dives and benchmarks as they land.