Bernard Chika Uwaezuoke


Building Decoupled Event-Driven Microservices on AWS with SNS, SQS, Lambda, and DynamoDB

Introduction

Modern applications often begin as a single service where one request triggers several operations sequentially. An order-processing application, for example, might receive an order, process its payment, update inventory, and send a notification within the same synchronous request.

This design is simple at the beginning, but it becomes increasingly fragile as the application grows.

What happens when the payment provider is slow? What if the notification service is unavailable? Should an inventory failure prevent the application from accepting an otherwise valid order? How do we add a new analytics, fraud detection, or shipping service without modifying the original order service?

These are some of the problems that event-driven microservices are designed to solve.

In this project, I built a serverless, event-driven order-processing system using:

  • Amazon API Gateway
  • AWS Lambda
  • Amazon Simple Notification Service
  • Amazon Simple Queue Service
  • Amazon DynamoDB
  • Amazon CloudWatch
  • AWS Serverless Application Model

The central architectural decision was to use Amazon SNS for event distribution and Amazon SQS for durable, service-specific message buffering.

The result is an architecture in which payment, inventory, and notification services can operate, fail, recover, and scale independently.

Project repository: https://github.com/Donhadley22/aws-sns-sqs-microservices

Note: If the repository remains private, readers will not be able to access the source code until it is made public.

The problem with tightly coupled microservices

Consider the following synchronous workflow:

Client
  |
  v
Order Service
  |
  v
Payment Service
  |
  v
Inventory Service
  |
  v
Notification Service

Although the application has been divided into services, the services are still operationally dependent on one another.

The Order Service cannot finish until Payment responds. Payment might depend on Inventory, and Notification might become part of the same request chain.

This creates several problems.

Increased response time

The client must wait for every downstream operation to complete.

If payment takes two seconds, inventory takes one second, and notification takes three seconds, the request can take six seconds or longer.

Failure propagation

A notification failure could cause the entire order request to fail, even though the order and payment were successful.

Difficult scaling

Payment traffic, notification traffic, and inventory traffic may have different scaling patterns. A synchronous architecture makes it harder to scale them independently.

Difficult service expansion

Adding fraud detection, shipping, analytics, or audit services usually requires modifying the producing service or inserting more synchronous calls.

Tight runtime dependency

Every downstream service must be available at the same time as the upstream service.

This is precisely where messaging services become valuable.

Solution overview

The project uses the following architecture:

Client
   |
   v
Amazon API Gateway
   |
   v
Order Service Lambda
   |
   +------> Orders DynamoDB Table
   |
   v
Amazon SNS Order Events Topic
   |
   +------> Payment SQS Queue ------> Payment Lambda
   |                                     |
   |                                     +------> Payments Table
   |                                     |
   |                                     +------> SNS outcome event
   |
   +------> Inventory SQS Queue ----> Inventory Lambda
   |                                     |
   |                                     +------> Inventory Table
   |                                     |
   |                                     +------> SNS outcome event
   |
   +------> Notification SQS Queue -> Notification Lambda
                                             |
                                             +------> Notifications Table

Each SQS queue also has a dedicated dead-letter queue.

Amazon SNS and Amazon SQS are often used together because SNS can distribute one published event to multiple subscribers, while SQS stores each service's copy until that service is ready to process it. This means the producer and consumers do not need to be available at the same time.

Understanding the responsibility of each service

1. Order Service

The Order Service is exposed through Amazon API Gateway.

It is responsible for:

  • Receiving the HTTP request
  • Validating the order payload
  • Generating an order ID
  • Calculating the order total
  • Writing the order to DynamoDB
  • Publishing an OrderCreated event to Amazon SNS
  • Returning a 202 Accepted response

The service does not directly invoke Payment, Inventory, or Notification.

That is a critical design decision.

The Order Service knows that an order has been created, but it does not need to know every service interested in that event.

2. Payment Service

The Payment Service receives messages from its dedicated Payment SQS queue.

It:

  • Reads the OrderCreated event
  • Processes the payment rule
  • Stores the payment result
  • Publishes either PaymentCompleted or PaymentFailed

Because Payment owns its own queue, a payment-processing slowdown does not block Inventory or prevent the Order Service from accepting new requests.

3. Inventory Service

The Inventory Service independently receives its own copy of the OrderCreated event.

It:

  • Checks the requested items
  • Simulates stock reservation
  • Stores the inventory outcome
  • Publishes either InventoryReserved or InventoryFailed

Payment and Inventory can therefore process the same order in parallel.

Neither service needs to call the other.

4. Notification Service

The Notification Service is interested only in the results generated by Payment and Inventory.

It subscribes to:

PaymentCompleted
PaymentFailed
InventoryReserved
InventoryFailed

The service records a notification for the customer based on the event it receives.

In a production environment, this could be extended to deliver:

  • Email through Amazon SES
  • SMS through Amazon SNS
  • Mobile push notifications
  • Slack or Microsoft Teams notifications
  • Webhook callbacks
  • In-application notifications

Why use both SNS and SQS?

A frequent question from students is:

Why not publish directly to SQS, or trigger every service directly from SNS?

The answer lies in the different responsibilities of the two services.

Amazon SNS provides fan-out

Amazon SNS follows a publish-subscribe model.

The publisher sends one event to a topic. The topic distributes copies to all matching subscriptions.

The Order Service therefore publishes only once:

Order Service -> Order Events SNS Topic

SNS then distributes the event to:

Payment Queue
Inventory Queue

Additional consumers can be added later without changing the Order Service.

For example:

Fraud Detection Queue
Analytics Queue
Shipping Queue
Audit Queue

Amazon SQS provides durability and back-pressure

Amazon SQS stores messages until a consumer successfully processes them or their retention period expires.

This is important because consumers do not always process traffic at the same speed.

Suppose the system receives 10,000 orders during a promotional campaign. The Order Service can continue accepting requests while the Payment queue temporarily accumulates messages.

Payment workers can process the backlog according to their available concurrency.

The queue absorbs the traffic spike instead of allowing the spike to overload the Payment Service.

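This buffering is often called queue-based load leveling. It lets the system absorb back-pressure in the queue instead of passing it upstream to the Order Service.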
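To make the backlog idea concrete, here is a rough estimate of how long a queue takes to drain. The function name `estimateDrainTimeMs`, the concurrency of 50, and the 200 ms processing time are illustrative assumptions, not measurements from the project.

```javascript
// Rough drain-time estimate for a queue backlog.
// Assumes every message takes the same time and all workers stay busy.
function estimateDrainTimeMs(backlog, concurrency, msPerMessage) {
  // Each "round" processes up to `concurrency` messages in parallel.
  const rounds = Math.ceil(backlog / concurrency);
  return rounds * msPerMessage;
}

// 10,000 queued orders, 50 concurrent workers, 200 ms per payment (assumed values)
console.log(estimateDrainTimeMs(10000, 50, 200)); // 40000 (about 40 seconds)
```

The producer is unaffected by this number. It only tells the Payment team how long customers may wait for an outcome during a spike.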

Step-by-step event flow

Step 1: The client submits an order

The client sends:

POST /orders
Content-Type: application/json

Example request:

{
  "customerEmail": "student@example.com",
  "currency": "USD",
  "items": [
    {
      "sku": "LAPTOP-001",
      "quantity": 1,
      "price": 850
    },
    {
      "sku": "MOUSE-002",
      "quantity": 2,
      "price": 25
    }
  ],
  "paymentOutcome": "approved",
  "inventoryOutcome": "reserved"
}

API Gateway forwards the request to the Order Service Lambda function.
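The validation and total calculation performed by the Order Service could look roughly like the sketch below. The helper name `validateAndTotal` and its validation rules are assumptions made for illustration. The repository's actual checks may differ.

```javascript
// Hypothetical helper: validates an order payload and computes its total.
// Prices are treated as whole currency units, as in the example request.
function validateAndTotal(order) {
  const hasItems = order && Array.isArray(order.items) && order.items.length > 0;
  if (!hasItems || typeof order.customerEmail !== "string") {
    throw new Error("Order must include customerEmail and at least one item");
  }
  let totalAmount = 0;
  for (const item of order.items) {
    const validQuantity = Number.isInteger(item.quantity) && item.quantity > 0;
    const validPrice = typeof item.price === "number" && item.price >= 0;
    if (typeof item.sku !== "string" || !validQuantity || !validPrice) {
      throw new Error(`Invalid item: ${JSON.stringify(item)}`);
    }
    totalAmount += item.quantity * item.price;
  }
  return totalAmount;
}

const exampleOrder = {
  customerEmail: "student@example.com",
  currency: "USD",
  items: [
    { sku: "LAPTOP-001", quantity: 1, price: 850 },
    { sku: "MOUSE-002", quantity: 2, price: 25 }
  ]
};

console.log(validateAndTotal(exampleOrder)); // 900
```

Rejecting a bad payload here, before anything is stored or published, keeps malformed orders out of every downstream queue.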

Step 2: The order is stored

The Order Service generates an order ID and stores the order in the Orders DynamoDB table.

Its status changes from:

ORDER_RECEIVED

to:

EVENT_PUBLISHED

after the event is successfully published.

Step 3: The Order Service publishes an event

The event follows a consistent envelope:

{
  "eventId": "unique-event-id",
  "eventType": "OrderCreated",
  "eventVersion": "1.0",
  "source": "order-service",
  "occurredAt": "2026-07-02T10:00:00.000Z",
  "data": {
    "orderId": "unique-order-id",
    "customerEmail": "student@example.com",
    "items": [],
    "totalAmount": 900,
    "currency": "USD"
  }
}

This envelope provides several important fields:

  • eventId supports tracing and deduplication.
  • eventType identifies the business event.
  • eventVersion supports schema evolution.
  • source identifies the producer.
  • occurredAt records when the event happened.
  • data contains the business payload.

A well-defined event contract is essential because events become APIs between services.

Changing an event structure without version control can break several consumers simultaneously.
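As a minimal sketch of how the envelope could be produced and checked, the two helpers below build an event and reject envelopes a consumer does not understand. The names `createEvent` and `assertSupportedEvent`, and the rule of comparing only the major version, are assumptions for illustration.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical producer-side helper: wraps a business payload in the envelope.
function createEvent(eventType, source, data, now = new Date()) {
  return {
    eventId: randomUUID(),
    eventType,
    eventVersion: "1.0",
    source,
    occurredAt: now.toISOString(),
    data
  };
}

// Hypothetical consumer-side check: fail fast on missing fields
// or on a major version this consumer was not written for.
function assertSupportedEvent(event, supportedMajorVersion = "1") {
  const required = ["eventId", "eventType", "eventVersion", "source", "occurredAt", "data"];
  for (const field of required) {
    if (event[field] === undefined) {
      throw new Error(`Missing envelope field: ${field}`);
    }
  }
  const [major] = String(event.eventVersion).split(".");
  if (major !== supportedMajorVersion) {
    throw new Error(`Unsupported event version: ${event.eventVersion}`);
  }
  return event;
}

const orderCreated = createEvent("OrderCreated", "order-service", { orderId: "unique-order-id" });
console.log(assertSupportedEvent(orderCreated).eventType); // OrderCreated
```

Checking the version at the consumer boundary turns a silent schema mismatch into an explicit failure that retries and the DLQ can surface.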

Step 4: SNS applies subscription filters

The Payment subscription receives only OrderCreated events:

FilterPolicy:
  eventType:
    - OrderCreated

The Inventory subscription uses the same filter:

FilterPolicy:
  eventType:
    - OrderCreated

The Notification subscription receives only outcome events:

FilterPolicy:
  eventType:
    - PaymentCompleted
    - PaymentFailed
    - InventoryReserved
    - InventoryFailed

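By default, an SNS subscription receives every message published to its topic. A filter policy limits delivery to messages whose attributes or body match the subscription's conditions. Policies are evaluated against message attributes unless FilterPolicyScope is set to MessageBody, so with the default scope the publisher must send eventType as a message attribute and not only inside the JSON body.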

Filtering prevents irrelevant events from reaching queues and reduces unnecessary Lambda invocations.
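The matching rule behind these policies can be modelled in a few lines. This is a simplified sketch that covers only exact string matching on message attributes. Real SNS filter policies support more operators, and `matchesFilterPolicy` is a hypothetical name, not an AWS API.

```javascript
// Simplified model of SNS exact-match filtering on string message attributes.
// Every key in the policy must match (AND); any listed value may match (OR).
function matchesFilterPolicy(filterPolicy, messageAttributes) {
  return Object.entries(filterPolicy).every(([name, allowedValues]) =>
    allowedValues.includes(messageAttributes[name])
  );
}

const paymentPolicy = { eventType: ["OrderCreated"] };
const notificationPolicy = {
  eventType: ["PaymentCompleted", "PaymentFailed", "InventoryReserved", "InventoryFailed"]
};

console.log(matchesFilterPolicy(paymentPolicy, { eventType: "OrderCreated" }));      // true
console.log(matchesFilterPolicy(notificationPolicy, { eventType: "OrderCreated" })); // false
```

A message without an eventType attribute matches neither policy in this model, which is why the publisher has to set the attribute consistently.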

Step 5: Lambda processes messages from SQS

Each queue is configured as an event source for its corresponding Lambda function.

For example:

PaymentQueueEvent:
  Type: SQS
  Properties:
    Queue: !GetAtt PaymentQueue.Arn
    BatchSize: 5
    MaximumBatchingWindowInSeconds: 1
    FunctionResponseTypes:
      - ReportBatchItemFailures

The Lambda event source mapping polls SQS and sends messages to the function in batches.

The Payment and Inventory services use a batch size of five, while Notification uses a larger batch because notification records are lightweight.

Handling failures correctly

Building an event-driven system is not only about delivering messages. It is about designing what happens when processing fails.

This project separates failures into two categories.

Business failures

A business failure is a valid result that the application understands.

Examples include:

Payment declined
Product out of stock
Address validation failed
Customer account suspended

These should not normally be sent to a dead-letter queue.

The service should process the message successfully and publish a business outcome such as:

PaymentFailed
InventoryFailed

This allows other services to react to the failure.

For example, Notification can inform the customer that payment was declined.

Technical failures

A technical failure means the message could not be processed because of an unexpected problem.

Examples include:

Database unavailable
Malformed event
Dependency timeout
Permission failure
Uncaught application exception

These failures should be retried.

If processing continues to fail, the message should eventually move to a dead-letter queue.
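The distinction between the two categories can be sketched in code: a business failure is returned as an outcome, while a technical failure is thrown. The function name `decidePaymentOutcome`, the `reason` field, and the assumption that the simulated `paymentOutcome` flag travels inside the event data are all illustrative, not taken from the repository.

```javascript
// Hypothetical payment rule showing the two failure categories.
function decidePaymentOutcome(orderCreatedEvent) {
  const data = orderCreatedEvent && orderCreatedEvent.data;
  if (!data || typeof data.orderId !== "string") {
    // Technical failure: throw so the message is retried and eventually dead-lettered.
    throw new Error("Malformed OrderCreated event: missing data.orderId");
  }
  // Business failure: processing succeeds and the negative outcome becomes an event.
  const declined = data.paymentOutcome === "declined";
  return {
    eventType: declined ? "PaymentFailed" : "PaymentCompleted",
    data: { orderId: data.orderId, reason: declined ? "PAYMENT_DECLINED" : null }
  };
}

const declinedOrder = { data: { orderId: "order-1", paymentOutcome: "declined" } };
console.log(decidePaymentOutcome(declinedOrder).eventType); // PaymentFailed
```

Because the declined payment returns normally, the SQS message is deleted and never counts towards the DLQ threshold.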

Dead-letter queues

Every service queue in this project has its own DLQ:

Payment Queue ------> Payment DLQ
Inventory Queue ----> Inventory DLQ
Notification Queue -> Notification DLQ

The redrive policy is configured as follows:

RedrivePolicy:
  deadLetterTargetArn: !GetAtt PaymentDeadLetterQueue.Arn
  maxReceiveCount: 3

Once a message has been received three times without being deleted, SQS moves it to the relevant DLQ instead of delivering it again.
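The effect of maxReceiveCount can be illustrated with a small in-memory model. It is a sketch of the redrive rule only, with a hypothetical function name `deliverWithRedrive`. It does not call SQS and ignores timing.

```javascript
// In-memory model of an SQS redrive policy: once a message's receive count
// exceeds maxReceiveCount, it goes to the dead-letter queue instead of the consumer.
function deliverWithRedrive(message, processMessage, maxReceiveCount = 3) {
  let receiveCount = 0;
  while (true) {
    receiveCount += 1;
    if (receiveCount > maxReceiveCount) {
      return { status: "dead-lettered", attempts: maxReceiveCount };
    }
    try {
      processMessage(message);
      return { status: "processed", attempts: receiveCount };
    } catch (error) {
      // Not deleted: the message becomes visible again after the visibility timeout.
    }
  }
}

const alwaysFails = () => {
  throw new Error("Database unavailable");
};
console.log(deliverWithRedrive({ orderId: "order-1" }, alwaysFails).status); // dead-lettered
```

A message that succeeds on its third receive is still processed normally. Only the receive after that would be redirected.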

Dedicated DLQs provide failure isolation. A poison message for Payment does not pollute Inventory or Notification processing.

The DLQ retention period in this project is longer than the source queue retention period, which follows AWS guidance for preserving failed messages long enough for investigation.

CloudWatch alarms monitor the approximate number of visible messages (the ApproximateNumberOfMessagesVisible metric) in each DLQ.

In production, those alarms should notify the operations team through:

  • An SNS alert topic
  • Email
  • Slack
  • Microsoft Teams
  • PagerDuty
  • An incident-management platform

A DLQ should never become a graveyard where failed messages are ignored. Every DLQ needs an operational process for investigation, correction, replay, or deletion.

Partial batch failure handling

Assume Lambda receives five SQS messages and only one fails.

Without partial batch handling, the entire batch may be retried, including the four messages that were processed successfully.

This creates:

  • Duplicate work
  • Increased cost
  • Reduced throughput
  • More complicated data handling

The project enables:

FunctionResponseTypes:
  - ReportBatchItemFailures

Each consumer returns only the failed message IDs:

return {
  batchItemFailures: [
    {
      itemIdentifier: record.messageId
    }
  ]
};

Lambda then deletes the successfully processed messages from the queue, and only the reported failures become visible again for retry.

AWS recommends partial batch responses to avoid reprocessing successful messages when one record in an SQS batch fails.
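A fuller handler built around that return shape might look like the sketch below. `createBatchHandler` and `processRecord` are hypothetical names, and the per-record logic is injected so the example runs without AWS dependencies.

```javascript
// Builds an SQS batch handler around a per-record function (hypothetical: processRecord).
function createBatchHandler(processRecord) {
  return async function handler(event) {
    const batchItemFailures = [];
    for (const record of event.Records) {
      try {
        await processRecord(record);
      } catch (error) {
        // Report only this record; the rest of the batch is treated as successful.
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }
    // An empty list tells Lambda the whole batch succeeded.
    return { batchItemFailures };
  };
}

// Usage: the second record fails, so only its messageId is reported.
const handler = createBatchHandler(async (record) => {
  const body = JSON.parse(record.body);
  if (!body.orderId) throw new Error("Missing orderId");
});

handler({
  Records: [
    { messageId: "msg-1", body: JSON.stringify({ orderId: "order-1" }) },
    { messageId: "msg-2", body: JSON.stringify({}) }
  ]
}).then((result) => console.log(JSON.stringify(result)));
// {"batchItemFailures":[{"itemIdentifier":"msg-2"}]}
```

Catching errors per record matters here: an exception that escapes the handler causes the entire batch to be retried, regardless of the response type configured.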

Idempotency is not optional

Amazon SQS standard queues provide at-least-once delivery. This means the same message may occasionally be delivered more than once.

Therefore, a consumer must be able to process duplicate messages safely.

Imagine a Payment Service that charges a customer every time it receives a message. If the message is delivered twice, the customer could be charged twice.

That is unacceptable.

In this project, the Payment Service uses a conditional DynamoDB write:

await documentClient.send(
  new PutCommand({
    TableName: PAYMENTS_TABLE,
    Item: paymentResult,
    ConditionExpression: "attribute_not_exists(orderId)"
  })
);

The first message stores the payment.

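A duplicate message encounters the existing orderId: DynamoDB rejects the write with a ConditionalCheckFailedException, so a second payment record is never created. The consumer should treat that exception as a detected duplicate rather than as a technical failure.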

The project also generates deterministic outcome event IDs:

eventId: `${orderCreatedEvent.eventId}:payment-outcome`

If Payment stores its result but temporarily fails while publishing the outcome event, the original SQS message is retried.

On retry:

  • The duplicate database write is recognized.
  • The service safely republishes the outcome.
  • The deterministic event ID allows downstream consumers to detect the duplicate outcome.

Idempotency should be considered at every side-effect boundary:

  • Database writes
  • Payment requests
  • Inventory reservations
  • Emails
  • Webhooks
  • External API calls
  • Event publication
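The retry behaviour described above can be demonstrated with an in-memory stand-in for the table. `PaymentsStore`, `putIfAbsent`, and `handleOrderCreated` are hypothetical names, and the Map only imitates the semantics of the conditional write. It is not the project's DynamoDB code.

```javascript
// In-memory stand-in for a table with a conditional write.
class PaymentsStore {
  constructor() {
    this.items = new Map();
  }
  // Mirrors attribute_not_exists(orderId): refuses to overwrite an existing item.
  putIfAbsent(orderId, item) {
    if (this.items.has(orderId)) return false;
    this.items.set(orderId, item);
    return true;
  }
  get(orderId) {
    return this.items.get(orderId);
  }
}

// Returns true on first delivery and false when a duplicate was detected.
function handleOrderCreated(store, orderCreatedEvent, publish) {
  const { orderId } = orderCreatedEvent.data;
  const firstDelivery = store.putIfAbsent(orderId, { orderId, status: "APPROVED" });
  // On a duplicate, republish the stored result instead of creating a new one.
  publish({
    eventId: `${orderCreatedEvent.eventId}:payment-outcome`,
    eventType: "PaymentCompleted",
    data: store.get(orderId)
  });
  return firstDelivery;
}

const store = new PaymentsStore();
const published = [];
const incoming = { eventId: "evt-1", data: { orderId: "order-1" } };
handleOrderCreated(store, incoming, (event) => published.push(event));
handleOrderCreated(store, incoming, (event) => published.push(event));
console.log(store.items.size, published[0].eventId === published[1].eventId); // 1 true
```

Two deliveries produce one stored payment and two outcome events with the same eventId, which downstream consumers can deduplicate on.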

Visibility timeout configuration

When SQS delivers a message to a consumer, the message becomes temporarily invisible to other consumers.

This period is called the visibility timeout.

If the consumer successfully processes and deletes the message, processing is complete.

If the message is not deleted before the timeout expires, it becomes visible and can be processed again.

This project configures:

VisibilityTimeout: 120

The Lambda functions have:

Timeout: 15

At 120 seconds, the visibility timeout is therefore eight times the 15-second function timeout.

AWS recommends setting the queue visibility timeout to at least six times the Lambda function timeout to allow time for retries if processing is throttled.

A visibility timeout that is too short can cause the same message to become visible while the first consumer is still processing it.

A timeout that is unnecessarily long can delay retries after genuine failures.

It should be selected based on actual processing duration, batching configuration, throttling risk, and retry requirements.
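The six-times guideline is easy to check as arithmetic. The helper below is a sketch with a hypothetical name, `minimumVisibilityTimeout`. Adding the batching window on top of the six-times figure is an assumption of this sketch, chosen as a conservative reading of the guidance.

```javascript
// Minimum queue visibility timeout under the six-times guideline,
// plus the batching window configured on the event source mapping (assumed addition).
function minimumVisibilityTimeout(functionTimeoutSeconds, batchingWindowSeconds = 0) {
  return 6 * functionTimeoutSeconds + batchingWindowSeconds;
}

// Values from this project: Timeout 15, MaximumBatchingWindowInSeconds 1, VisibilityTimeout 120
const minimum = minimumVisibilityTimeout(15, 1);
console.log(minimum);        // 91
console.log(120 >= minimum); // true
```

The configured 120 seconds clears the threshold with some margin, so a throttled batch still has room for retries before its messages reappear.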

Long polling

The queues use:

ReceiveMessageWaitTimeSeconds: 20

This enables long polling.

Long polling allows SQS to wait briefly for a message instead of immediately returning an empty response. AWS documents that long polling can reduce empty and false-empty responses, with a maximum wait time of 20 seconds.

Although Lambda manages polling when SQS is configured as an event source, setting appropriate queue parameters remains useful for queue behavior and for any additional consumers that may poll the queue directly.

Infrastructure as Code with AWS SAM

The entire solution is defined in one AWS SAM template.

The template creates:

  • API Gateway routes
  • Four Lambda functions
  • One SNS topic
  • Three primary SQS queues
  • Three dead-letter queues
  • Four DynamoDB tables
  • SNS subscriptions
  • Queue policies
  • IAM permissions
  • Lambda event source mappings
  • CloudWatch alarms
  • CloudFormation outputs

This makes the architecture repeatable.

Instead of manually creating resources in the AWS Management Console, the environment can be deployed using:

npm run install:all
sam validate --lint
sam build
sam deploy --guided

The SAM build packages each Lambda service and creates the deployment artifacts.

[Screenshot: SAM build succeeded]

The guided deployment captures the stack name, AWS Region, environment parameter, IAM capability confirmation, and local SAM configuration file.

[Screenshot: SAM guided deployment settings]

After CloudFormation completes, the stack outputs include the topic, queue, table, and API details created by the template.

[Screenshot: SAM deployment completed]

The API URL can then be read from the CloudFormation stack outputs and used for test requests.

[Screenshot: API URL from CloudFormation outputs]

Subsequent deployments use:

sam build
sam deploy

The stack can be removed with:

sam delete --stack-name sns-sqs-microservices

Infrastructure as Code provides several practical advantages:

  • Consistent environments
  • Version-controlled infrastructure
  • Easier peer review
  • Repeatable deployment
  • Reduced configuration drift
  • Easier disaster recovery
  • A foundation for CI/CD automation

The deployed resources are visible across the AWS console. The Lambda functions represent the four services in the workflow.

[Screenshot: Lambda functions deployed in AWS]

The DynamoDB tables give each service its own persistence boundary.

[Screenshot: DynamoDB tables created by the stack]

CloudFormation keeps the whole environment grouped as one repeatable stack.

[Screenshot: CloudFormation stack status]

Testing the architecture

A serious microservices project should test more than the successful path.

This project contains three test scenarios.

Test 1: Successful order

{
  "paymentOutcome": "approved",
  "inventoryOutcome": "reserved"
}

Expected results:

Order stored
Payment approved
Inventory reserved
Payment notification recorded
Inventory notification recorded
No DLQ messages

The successful request returns a 202 Accepted response with the generated orderId, eventId, and EVENT_PUBLISHED status.

[Screenshot: Successful order API response]

Test 2: Business failure

{
  "paymentOutcome": "declined",
  "inventoryOutcome": "out_of_stock"
}

Expected results:

PaymentFailed published
InventoryFailed published
Failure notifications recorded
No DLQ messages

The services operated correctly. The business simply produced negative outcomes.

Test 3: Technical failure

{
  "simulateTechnicalFailure": "payment"
}

Expected behavior:

Payment processing throws an exception
The message becomes visible again
Lambda retries it
The receive count increases
The message moves to the Payment DLQ after three attempts
The Payment DLQ CloudWatch alarm enters ALARM state

Testing failure behavior is just as important as testing successful processing.

A system is not resilient merely because it has a queue. Resilience depends on retry policy, idempotency, timeout selection, DLQ handling, monitoring, and recovery procedures.

The transactional outbox consideration

The Order Service currently performs two operations:

1. Write the order to DynamoDB
2. Publish OrderCreated to SNS

These operations are not part of one atomic transaction.

Consider this sequence:

Order write succeeds
Application crashes
SNS publication never happens

The order exists, but Payment and Inventory never receive the event.

This is known as the dual-write problem.

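For a learning project, the current implementation records an EVENT_PUBLISH_FAILED state when publication fails, so the failure can be identified. That only helps while the function is still running to record it. After a crash between the two operations, the order would simply remain in ORDER_RECEIVED.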

For a production system, I would implement the transactional outbox pattern.

A DynamoDB-based version could work as follows:

  • Store the order and an outbox event in one DynamoDB transaction.
  • Enable DynamoDB Streams.
  • Invoke an event publisher from the stream.
  • Publish the outbox event to SNS.
  • Record successful publication or allow the outbox item to expire.

The transactional outbox pattern addresses inconsistent outcomes when an application must update a database and publish an event as part of the same business operation.
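The outbox steps listed above can be sketched with an in-memory table. This is a conceptual model only: `Table`, `transactWrite`, and `relayOutbox` are hypothetical names, the relay is called directly instead of being triggered by a stream, and no AWS service is involved.

```javascript
// In-memory stand-in for a table that supports atomic multi-item writes.
class Table {
  constructor() {
    this.orders = new Map();
    this.outbox = new Map();
  }
  // The order and its outbox event are written together or not at all.
  transactWrite(order, outboxEvent) {
    if (this.orders.has(order.orderId)) throw new Error("Order already exists");
    this.orders.set(order.orderId, order);
    this.outbox.set(outboxEvent.eventId, { ...outboxEvent, published: false });
  }
}

// Relay: publishes every pending outbox event, then marks it as published.
// A crash between publish and mark causes a duplicate publish, which consumers must tolerate.
function relayOutbox(table, publish) {
  let publishedCount = 0;
  for (const entry of table.outbox.values()) {
    if (entry.published) continue;
    const { published, ...event } = entry;
    publish(event);
    entry.published = true;
    publishedCount += 1;
  }
  return publishedCount;
}

const table = new Table();
table.transactWrite(
  { orderId: "order-1", status: "ORDER_RECEIVED" },
  { eventId: "evt-1", eventType: "OrderCreated", data: { orderId: "order-1" } }
);
// Even if the Order Service crashed at this point, the event is safely stored.
console.log(relayOutbox(table, (event) => console.log("publishing", event.eventType))); // 1
```

The order can no longer exist without its event. The price is at-least-once publication, which is why the idempotency rules earlier in the article still apply.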

Business benefits

1. Better customer response times

The API does not need to wait for every downstream service to finish.

It can acknowledge that the order was accepted while processing continues asynchronously.

This provides a more responsive customer experience.

2. Reduced blast radius

If Notification fails, Payment and Inventory can continue operating.

If Inventory experiences a backlog, Payment can still process messages.

Failures are isolated to individual queues and consumers.

3. Independent scalability

Each microservice scales according to its own workload.

For example:

Payment Service: 100 concurrent workers
Inventory Service: 40 concurrent workers
Notification Service: 500 concurrent workers

The business does not need to scale the complete application just because one function is under heavy load.

4. Easier business expansion

New consumers can subscribe to the event topic without changing the original publisher.

A business can add:

  • Fraud detection
  • Customer loyalty points
  • Data analytics
  • Shipping fulfilment
  • Audit logging
  • Recommendation engines

This reduces the cost and risk of adding new capabilities.

5. Traffic-spike protection

Queues absorb sudden increases in traffic.

This is particularly valuable for:

  • Flash sales
  • Ticket releases
  • Registration deadlines
  • Payroll processing
  • Month-end billing
  • Marketing campaigns

The producer can accept work while consumers process the backlog at a controlled rate.

6. Improved operational visibility

Each queue exposes useful CloudWatch metrics, including:

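  • Visible message count (ApproximateNumberOfMessagesVisible)
  • In-flight message count (ApproximateNumberOfMessagesNotVisible)
  • Age of the oldest message (ApproximateAgeOfOldestMessage)
  • Number of messages received (NumberOfMessagesReceived)
  • Number of messages deleted (NumberOfMessagesDeleted)
  • DLQ depth (ApproximateNumberOfMessagesVisible on the dead-letter queue)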

These metrics help teams identify which part of a business workflow is slowing down.

7. Cost-efficient serverless operation

The architecture uses managed services and does not require teams to operate message brokers or application servers.

The business pays based largely on usage while AWS manages the underlying infrastructure.

8. Stronger team autonomy

Different teams can own different services.

For example:

Checkout Team -> Order Service
Finance Team -> Payment Service
Operations Team -> Inventory Service
Customer Experience Team -> Notification Service

Teams can deploy and scale their services independently, provided they continue to respect the agreed event contracts.

Where this architecture can be applied

E-commerce order processing

OrderCreated
PaymentProcessed
InventoryReserved
ShipmentCreated
CustomerNotified

This is the most direct application of the project.

Financial transaction processing

TransactionSubmitted
FraudCheckRequested
ComplianceCheckCompleted
LedgerUpdated
CustomerAlerted

Additional controls are required for financial workloads, especially around exactly-once business effects, security, audit trails, and reconciliation.

User registration and onboarding

UserRegistered
IdentityVerificationRequested
WelcomeEmailRequested
CRMProfileCreated
AnalyticsEventRecorded

The registration API can remain responsive while secondary activities happen asynchronously.

Media-processing pipelines

FileUploaded
VirusScanRequested
VideoTranscodingRequested
ThumbnailGenerationRequested
MetadataExtractionRequested

Each processing stage can have its own queue and worker capacity.

Internet of Things workloads

DeviceReadingReceived
AnomalyDetectionRequested
DataArchived
AlertGenerated
DashboardUpdated

Queues help absorb bursts from thousands of devices.

Logistics and fulfilment

PackageCreated
WarehouseAssigned
DriverRequested
TrackingUpdated
CustomerNotified

Different services can react to fulfilment events without direct dependencies.

Insurance claims processing

ClaimSubmitted
DocumentValidationRequested
FraudAssessmentRequested
AdjusterAssigned
CustomerUpdated

Long-running workflows can be divided into independent processing stages.

Healthcare administrative workflows

AppointmentBooked
ReminderRequested
BillingRecordCreated
InsuranceCheckRequested
AuditRecordCreated

Sensitive workloads require appropriate security, privacy, encryption, logging, and regulatory controls.

CI/CD and DevOps automation

BuildCompleted
SecurityScanRequested
ArtifactPublished
DeploymentRequested
NotificationRequested

SNS and SQS can decouple build, scanning, deployment, audit, and notification tasks.

Where this architecture may not be appropriate

Event-driven architecture should not be adopted simply because it is modern.

It introduces:

  • Eventual consistency
  • More infrastructure components
  • More complex debugging
  • Schema-management requirements
  • Duplicate-delivery considerations
  • Distributed tracing challenges
  • More demanding operational procedures

A simple synchronous request may be better when:

  • The client requires an immediate final result.
  • The workflow contains only one fast and reliable dependency.
  • The application is small and unlikely to expand.
  • Eventual consistency is unacceptable.
  • The team lacks the operational maturity to manage distributed systems.

Architecture should solve a genuine business or engineering problem, not merely increase the number of AWS services in a diagram.

Key lessons from the project

Lesson 1: Microservices are not automatically decoupled

Breaking a monolith into multiple HTTP services does not remove coupling.

If every service synchronously depends on the next service, the system remains operationally coupled.

True decoupling requires careful control of dependencies, contracts, data ownership, and failure behavior.

Lesson 2: Give every consumer its own queue

Payment and Inventory should not compete for messages from one shared queue.

With one queue per microservice, each interested service receives its own copy and controls its own retries, backlog, scaling, and DLQ.

Lesson 3: Events should describe facts

Good event names describe something that has happened:

OrderCreated
PaymentCompleted
PaymentFailed
InventoryReserved

Avoid vague, command-style names such as:

ProcessData
HandleRequest
RunService

Business-oriented event names make architectures easier to understand.

Lesson 4: Business failures are not technical failures

A declined payment is not a processing exception.

It is a valid business outcome and should be published as an event.

Sending every negative outcome to a DLQ makes operations noisy and removes valuable business meaning.

Lesson 5: At-least-once delivery requires idempotency

Duplicate delivery is not an unusual edge case that can be ignored.

Every consumer that performs a side effect must define how duplicates are detected and handled.

Lesson 6: DLQs need ownership

Creating a DLQ is not enough.

Teams need:

  • Alerts
  • Runbooks
  • Dashboards
  • Investigation procedures
  • Redrive procedures
  • Message-retention policies
  • Service ownership

Lesson 7: Event schemas need governance

An event is a contract.

Production systems should define:

  • Schema ownership
  • Event versioning
  • Required and optional fields
  • Compatibility rules
  • Validation
  • Deprecation procedures
  • Consumer contract tests

Lesson 8: Observability must cross service boundaries

A single request can generate several events across several services.

Use fields such as:

eventId
correlationId
orderId
source
eventType
occurredAt

These values should appear consistently in structured logs so teams can trace a workflow from beginning to end.
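One way to keep those fields consistent is to route every log call through a single formatter. The helper name `formatLogLine` and the fallback of using the eventId when no correlationId is supplied are assumptions for illustration.

```javascript
// Hypothetical helper: formats one structured log line carrying an event's trace fields.
function formatLogLine(level, message, event, correlationId) {
  return JSON.stringify({
    level,
    message,
    eventId: event.eventId,
    // Assumed fallback: the first event in a workflow acts as its own correlation ID.
    correlationId: correlationId ?? event.eventId,
    orderId: event.data?.orderId,
    source: event.source,
    eventType: event.eventType,
    occurredAt: event.occurredAt
  });
}

const paymentEvent = {
  eventId: "evt-1:payment-outcome",
  eventType: "PaymentCompleted",
  source: "payment-service",
  occurredAt: "2026-07-02T10:00:01.000Z",
  data: { orderId: "order-1" }
};

console.log(formatLogLine("INFO", "Payment outcome published", paymentEvent, "evt-1"));
```

With every service emitting the same keys, a single CloudWatch Logs Insights query on orderId or correlationId can follow one order across all four functions.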

Lesson 9: Design failure paths before production

Successful requests are usually the easiest part of the system.

The difficult questions are:

What happens when publication fails?
What happens when a consumer times out?
What happens when the same event arrives twice?
What happens when a DLQ begins growing?
How is a failed event replayed safely?

These questions should be answered during architecture design, not after an incident.

Lesson 10: Start simple, then harden deliberately

This project is intentionally understandable.

A production version can later add:

  • Amazon Cognito or another identity provider
  • AWS WAF
  • AWS KMS customer-managed keys
  • AWS Secrets Manager
  • Amazon SES
  • AWS X-Ray tracing
  • CloudWatch dashboards
  • AWS Lambda Powertools
  • Event schema validation
  • Transactional outbox
  • CI/CD with GitHub Actions and AWS OIDC
  • Multi-account environments
  • Reserved or maximum concurrency
  • Replay and redrive automation

Starting with a clear foundation is better than introducing every possible service before the application has demonstrated the need.

Final thoughts

The primary value of this project is not simply that it uses Amazon SNS and Amazon SQS.

Its value is that it demonstrates several fundamental distributed-system principles:

Producers should not need to know every consumer.
Consumers should own their queues.
Failures should be isolated.
Messages should be retried safely.
Duplicate delivery should not create duplicate business effects.
Business outcomes should be represented as events.
Operations teams should be able to detect and recover failed work.

Amazon SNS handles event distribution.

Amazon SQS provides durability, buffering, retry isolation, and independent consumer control.

AWS Lambda provides serverless processing.

DynamoDB provides service-owned persistence.

AWS SAM makes the complete environment repeatable through Infrastructure as Code.

Together, these services create an architecture that can respond quickly to customers, absorb traffic spikes, isolate failures, scale individual workloads, and support new business capabilities without repeatedly rewriting the original Order Service.

That is the real purpose of decoupling: not to make the architecture diagram more impressive, but to make the business more adaptable and the system more resilient.
