Bernard Chika Uwaezuoke

Posted on Jul 2

Building Decoupled Event-Driven Microservices on AWS with SNS, SQS, Lambda, and DynamoDB

#serverless #aws #microservices

Introduction

Modern applications often begin as a single service where one request triggers several operations sequentially. An order-processing application, for example, might receive an order, process its payment, update inventory, and send a notification within the same synchronous request.

This design is simple at the beginning, but it becomes increasingly fragile as the application grows.

What happens when the payment provider is slow? What if the notification service is unavailable? Should an inventory failure prevent the application from accepting an otherwise valid order? How do we add a new analytics, fraud detection, or shipping service without modifying the original order service?

These are some of the problems that event-driven microservices are designed to solve.

In this project, I built a serverless, event-driven order-processing system using:

Amazon API Gateway
AWS Lambda
Amazon Simple Notification Service
Amazon Simple Queue Service
Amazon DynamoDB
Amazon CloudWatch
AWS Serverless Application Model

The central architectural decision was to use Amazon SNS for event distribution and Amazon SQS for durable, service-specific message buffering.

The result is an architecture in which payment, inventory, and notification services can operate, fail, recover, and scale independently.

Project repository: https://github.com/Donhadley22/aws-sns-sqs-microservices

Note: If the repository remains private, readers will not be able to access the source code until it is made public.

The problem with tightly coupled microservices

Consider the following synchronous workflow:

Client
  |
  v
Order Service
  |
  v
Payment Service
  |
  v
Inventory Service
  |
  v
Notification Service

Although the application has been divided into services, the services are still operationally dependent on one another.

The Order Service cannot finish until Payment responds. Payment might depend on Inventory, and Notification might become part of the same request chain.

This creates several problems.

Increased response time

The client must wait for every downstream operation to complete.

If payment takes two seconds, inventory takes one second, and notification takes three seconds, the request can take six seconds or longer.

Failure propagation

A notification failure could cause the entire order request to fail, even though the order and payment were successful.

Difficult scaling

Payment traffic, notification traffic, and inventory traffic may have different scaling patterns. A synchronous architecture makes it harder to scale them independently.

Difficult service expansion

Adding fraud detection, shipping, analytics, or audit services usually requires modifying the producing service or inserting more synchronous calls.

Tight runtime dependency

Every downstream service must be available at the same time as the upstream service.

This is precisely where messaging services become valuable.

Solution overview

The project uses the following architecture:

Client
   |
   v
Amazon API Gateway
   |
   v
Order Service Lambda
   |
   +------> Orders DynamoDB Table
   |
   v
Amazon SNS Order Events Topic
   |
   +------> Payment SQS Queue ------> Payment Lambda
   |                                     |
   |                                     +------> Payments Table
   |                                     |
   |                                     +------> SNS outcome event
   |
   +------> Inventory SQS Queue ----> Inventory Lambda
   |                                     |
   |                                     +------> Inventory Table
   |                                     |
   |                                     +------> SNS outcome event
   |
   +------> Notification SQS Queue -> Notification Lambda
                                             |
                                             +------> Notifications Table

Each SQS queue also has a dedicated dead-letter queue.

Amazon SNS and Amazon SQS are often used together because SNS can distribute one published event to multiple subscribers, while SQS stores each service's copy until that service is ready to process it. This means the producer and consumers do not need to be available at the same time.

Understanding the responsibility of each service

1. Order Service

The Order Service is exposed through Amazon API Gateway.

It is responsible for:

Receiving the HTTP request
Validating the order payload
Generating an order ID
Calculating the order total
Writing the order to DynamoDB
Publishing an OrderCreated event to Amazon SNS
Returning a 202 Accepted response

The service does not directly invoke Payment, Inventory, or Notification.

That is a critical design decision.

The Order Service knows that an order has been created, but it does not need to know every service interested in that event.

2. Payment Service

The Payment Service receives messages from its dedicated Payment SQS queue.

It:

Reads the OrderCreated event
Processes the payment rule
Stores the payment result
Publishes either PaymentCompleted or PaymentFailed

Because Payment owns its own queue, a payment-processing slowdown does not block Inventory or prevent the Order Service from accepting new requests.

3. Inventory Service

The Inventory Service independently receives its own copy of the OrderCreated event.

It:

Checks the requested items
Simulates stock reservation
Stores the inventory outcome
Publishes either InventoryReserved or InventoryFailed

Payment and Inventory can therefore process the same order in parallel.

Neither service needs to call the other.

4. Notification Service

The Notification Service is interested only in the results generated by Payment and Inventory.

It subscribes to:

PaymentCompleted
PaymentFailed
InventoryReserved
InventoryFailed

The service records a notification for the customer based on the event it receives.

In a production environment, this could be extended to deliver:

Email through Amazon SES
SMS through Amazon SNS
Mobile push notifications
Slack or Microsoft Teams notifications
Webhook callbacks
In-application notifications

Why use both SNS and SQS?

A frequent question from students is:

Why not publish directly to SQS, or trigger every service directly from SNS?

The answer lies in the different responsibilities of the two services.

Amazon SNS provides fan-out

Amazon SNS follows a publish-subscribe model.

The publisher sends one event to a topic. The topic distributes copies to all matching subscriptions.

The Order Service therefore publishes only once:

Order Service -> Order Events SNS Topic

SNS then distributes the event to:

Payment Queue
Inventory Queue

Additional consumers can be added later without changing the Order Service.

For example:

Fraud Detection Queue
Analytics Queue
Shipping Queue
Audit Queue

Amazon SQS provides durability and back-pressure

Amazon SQS stores messages until a consumer successfully processes them or their retention period expires.

This is important because consumers do not always process traffic at the same speed.

Suppose the system receives 10,000 orders during a promotional campaign. The Order Service can continue accepting requests while the Payment queue temporarily accumulates messages.

Payment workers can process the backlog according to their available concurrency.

The queue absorbs the traffic spike instead of allowing the spike to overload the Payment Service.

This pattern is commonly called queue-based load leveling, and it is often loosely described as back-pressure handling.

Step-by-step event flow

Step 1: The client submits an order

The client sends:

POST /orders
Content-Type: application/json

Example request:

{
  "customerEmail": "student@example.com",
  "currency": "USD",
  "items": [
    {
      "sku": "LAPTOP-001",
      "quantity": 1,
      "price": 850
    },
    {
      "sku": "MOUSE-002",
      "quantity": 2,
      "price": 25
    }
  ],
  "paymentOutcome": "approved",
  "inventoryOutcome": "reserved"
}

API Gateway forwards the request to the Order Service Lambda function.
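The validation and total calculation that the Order Service performs on a payload like this one could be sketched as follows. This is a minimal sketch rather than the repository's code: the function names validateOrder and calculateOrderTotal and the specific validation rules are assumptions made for illustration.

```javascript
// Hypothetical helpers; names and validation rules are illustrative assumptions.

// Returns a list of validation problems; an empty list means the payload is acceptable.
function validateOrder(payload) {
  const errors = [];
  if (!payload || typeof payload !== "object") {
    return ["payload must be a JSON object"];
  }
  if (typeof payload.customerEmail !== "string" || !payload.customerEmail.includes("@")) {
    errors.push("customerEmail must be an email address");
  }
  if (!Array.isArray(payload.items) || payload.items.length === 0) {
    errors.push("items must be a non-empty array");
    return errors;
  }
  payload.items.forEach((item, index) => {
    if (typeof item.sku !== "string" || item.sku.length === 0) {
      errors.push(`items[${index}].sku is required`);
    }
    if (!Number.isInteger(item.quantity) || item.quantity <= 0) {
      errors.push(`items[${index}].quantity must be a positive integer`);
    }
    if (typeof item.price !== "number" || item.price < 0) {
      errors.push(`items[${index}].price must be a non-negative number`);
    }
  });
  return errors;
}

// Sums quantity * price across all line items.
function calculateOrderTotal(items) {
  return items.reduce((total, item) => total + item.quantity * item.price, 0);
}
```

For the example request, the total works out to 1 x 850 + 2 x 25 = 900.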

Step 2: The order is stored

The Order Service generates an order ID and stores the order in the Orders DynamoDB table.

Its status changes from:

ORDER_RECEIVED

to:

EVENT_PUBLISHED

after the event is successfully published.

Step 3: The Order Service publishes an event

The event follows a consistent envelope:

{
  "eventId": "unique-event-id",
  "eventType": "OrderCreated",
  "eventVersion": "1.0",
  "source": "order-service",
  "occurredAt": "2026-07-02T10:00:00.000Z",
  "data": {
    "orderId": "unique-order-id",
    "customerEmail": "student@example.com",
    "items": [],
    "totalAmount": 900,
    "currency": "USD"
  }
}

This envelope provides several important fields:

eventId supports tracing and deduplication.
eventType identifies the business event.
eventVersion supports schema evolution.
source identifies the producer.
occurredAt records when the event happened.
data contains the business payload.

A well-defined event contract is essential because events become APIs between services.

Changing an event structure without version control can break several consumers simultaneously.
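One way to keep the envelope consistent across services is to build every event through a single small factory. The sketch below assumes a hypothetical createEvent helper; the name, its defaults, and the use of a random UUID for eventId are illustrative choices, not something the repository is known to do.

```javascript
const { randomUUID } = require("node:crypto");

// Hypothetical factory for the envelope shown above. The function name and
// defaults are assumptions, not the repository's implementation.
function createEvent({
  eventType,
  source,
  data,
  eventVersion = "1.0",
  eventId = randomUUID(),
  now = new Date(),
}) {
  if (!eventType || !source) {
    throw new Error("eventType and source are required");
  }
  return {
    eventId,
    eventType,
    eventVersion,
    source,
    occurredAt: now.toISOString(), // ISO 8601 timestamp in UTC
    data,
  };
}
```

Accepting eventId and now as optional parameters keeps the factory easy to test and leaves room for deterministic IDs where a service needs them.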

Step 4: SNS applies subscription filters

The Payment subscription receives only OrderCreated events:

FilterPolicy:
  eventType:
    - OrderCreated

The Inventory subscription uses the same filter:

FilterPolicy:
  eventType:
    - OrderCreated

The Notification subscription receives only outcome events:

FilterPolicy:
  eventType:
    - PaymentCompleted
    - PaymentFailed
    - InventoryReserved
    - InventoryFailed

By default, an SNS subscription receives every message published to its topic. A filter policy limits delivery to messages whose attributes or body match the subscription's conditions.

Filtering prevents irrelevant events from reaching queues and reduces unnecessary Lambda invocations.
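Unless a subscription sets its filter policy scope to the message body, SNS evaluates filter policies against message attributes, so the publisher has to send eventType as an attribute. The sketch below shows what the publish input might look like, together with a deliberately simplified matcher that illustrates exact-match filtering only. Both function names are hypothetical, and publishing eventType as a message attribute is an assumption about how the project is wired.

```javascript
// Builds the input for an SNS publish call. With the AWS SDK for JavaScript v3
// this object would be passed to `new PublishCommand(...)` from
// `@aws-sdk/client-sns`. Sending `eventType` as a message attribute is an
// assumption that mirrors the filter policies above.
function buildPublishInput(topicArn, event) {
  return {
    TopicArn: topicArn,
    Message: JSON.stringify(event),
    MessageAttributes: {
      eventType: { DataType: "String", StringValue: event.eventType },
    },
  };
}

// Simplified illustration of exact-match string filtering. Real SNS filter
// policies support more operators (prefix, numeric, anything-but, and so on).
function matchesFilterPolicy(filterPolicy, messageAttributes) {
  return Object.entries(filterPolicy).every(([name, allowedValues]) => {
    const attribute = messageAttributes[name];
    return attribute !== undefined && allowedValues.includes(attribute.StringValue);
  });
}
```

With these helpers, an OrderCreated message matches the Payment and Inventory policies but not the Notification policy, which is the routing described above.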

Step 5: Lambda processes messages from SQS

Each queue is configured as an event source for its corresponding Lambda function.

For example:

PaymentQueueEvent:
  Type: SQS
  Properties:
    Queue: !GetAtt PaymentQueue.Arn
    BatchSize: 5
    MaximumBatchingWindowInSeconds: 1
    FunctionResponseTypes:
      - ReportBatchItemFailures

The Lambda event source mapping polls SQS and sends messages to the function in batches.

The Payment and Inventory services use a batch size of five, while Notification uses a larger batch because notification records are lightweight.
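Inside the function, each SQS record carries the SNS message in its body. The shape of that body depends on whether raw message delivery is enabled on the subscription, which is not shown here, so the sketch below handles both cases. The helper name parseEventFromSqsRecord is hypothetical.

```javascript
// Extracts the application event from one SQS record.
// Assumption: the project's raw-message-delivery setting is not shown in the
// article, so both body shapes are handled.
function parseEventFromSqsRecord(record) {
  const body = JSON.parse(record.body);
  // Without raw message delivery, SNS wraps the payload in a notification
  // envelope whose `Message` field holds the original JSON as a string.
  if (body.Type === "Notification" && typeof body.Message === "string") {
    return JSON.parse(body.Message);
  }
  // With raw message delivery enabled, the body is the event itself.
  return body;
}
```

A JSON.parse failure here is a malformed-message case: letting the exception propagate for that record allows the queue's retry and dead-letter handling to deal with it.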

Handling failures correctly

Building an event-driven system is not only about delivering messages. It is about designing what happens when processing fails.

This project separates failures into two categories.

Business failures

A business failure is a valid result that the application understands.

Examples include:

Payment declined
Product out of stock
Address validation failed
Customer account suspended

These should not normally be sent to a dead-letter queue.

The service should process the message successfully and publish a business outcome such as:

PaymentFailed
InventoryFailed

This allows other services to react to the failure.

For example, Notification can inform the customer that payment was declined.

Technical failures

A technical failure means the message could not be processed because of an unexpected problem.

Examples include:

Database unavailable
Malformed event
Dependency timeout
Permission failure
Uncaught application exception

These failures should be retried.

If processing continues to fail, the message should eventually move to a dead-letter queue.
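The distinction between the two categories can be sketched in a consumer as follows: a business failure returns an outcome event, while a technical failure throws so that the message is retried. The function name decidePaymentOutcome is hypothetical, and the sketch assumes that the simulated paymentOutcome flag from the request is carried in the event's data.

```javascript
// Hypothetical payment rule. Assumption: the simulated `paymentOutcome` flag
// from the request is carried in the event's data.
function decidePaymentOutcome(orderCreatedEvent) {
  const data = orderCreatedEvent && orderCreatedEvent.data;
  if (!data || typeof data.orderId !== "string") {
    // Technical failure: the event cannot be interpreted. Throwing leaves the
    // message on the queue so it is retried and eventually dead-lettered.
    throw new Error("Malformed OrderCreated event: missing data.orderId");
  }
  if (data.paymentOutcome === "declined") {
    // Business failure: a valid, understood result, reported as an event.
    return {
      eventType: "PaymentFailed",
      data: { orderId: data.orderId, reason: "declined" },
    };
  }
  return { eventType: "PaymentCompleted", data: { orderId: data.orderId } };
}
```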

Dead-letter queues

Every service queue in this project has its own DLQ:

Payment Queue ------> Payment DLQ
Inventory Queue ----> Inventory DLQ
Notification Queue -> Notification DLQ

The redrive policy is configured as follows:

RedrivePolicy:
  deadLetterTargetArn: !GetAtt PaymentDeadLetterQueue.Arn
  maxReceiveCount: 3

After three unsuccessful processing attempts, SQS moves the message to the relevant DLQ.

Dedicated DLQs provide failure isolation. A poison message for Payment does not pollute Inventory or Notification processing.
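The redrive behavior can be illustrated with a small in-memory simulation. This is only a model for reasoning about receive counts, not how SQS is implemented, and the function name simulateRedrive is hypothetical.

```javascript
// In-memory illustration of a redrive policy; not how SQS is implemented.
// `handler` receives the current receive count and throws to signal failure.
function simulateRedrive(maxReceiveCount, handler) {
  let receiveCount = 0;
  while (true) {
    if (receiveCount >= maxReceiveCount) {
      // The next receive would exceed maxReceiveCount, so the message is
      // moved to the dead-letter queue instead of being delivered again.
      return { destination: "dlq", receiveCount };
    }
    receiveCount += 1;
    try {
      handler(receiveCount);
      // Success: the consumer deletes the message from the source queue.
      return { destination: "processed", receiveCount };
    } catch {
      // Failure: the message becomes visible again after the visibility timeout.
    }
  }
}
```

With maxReceiveCount set to 3, a handler that always throws ends in the DLQ after three receives, while one that succeeds on the second attempt never reaches it.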

The DLQ retention period in this project is longer than the source queue retention period, which follows AWS guidance for preserving failed messages long enough for investigation.

CloudWatch alarms monitor the approximate number of visible messages in each DLQ.

In production, those alarms should notify the operations team through:

An SNS alert topic
Email
Slack
Microsoft Teams
PagerDuty
An incident-management platform

A DLQ should never become a graveyard where failed messages are ignored. Every DLQ needs an operational process for investigation, correction, replay, or deletion.

Partial batch failure handling

Assume Lambda receives five SQS messages and only one fails.

Without partial batch handling, the entire batch may be retried, including the four messages that were processed successfully.

This creates:

Duplicate work
Increased cost
Reduced throughput
More complicated data handling

The project enables:

FunctionResponseTypes:
  - ReportBatchItemFailures

Each consumer returns only the failed message IDs:

return {
  batchItemFailures: [
    {
      itemIdentifier: record.messageId
    }
  ]
};

Lambda then makes only those failed messages available for retry.

AWS recommends partial batch responses to avoid reprocessing successful messages when one record in an SQS batch fails.
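A complete handler built around that return value might look like the sketch below. It assumes a hypothetical processRecord function supplied by each service; the factory name createBatchHandler is also an illustration rather than the repository's code.

```javascript
// Sketch of an SQS batch handler that reports only the failed records.
// `processRecord` is a hypothetical per-message function supplied by the service.
function createBatchHandler(processRecord) {
  return async function handler(event) {
    const batchItemFailures = [];
    for (const record of event.Records) {
      try {
        await processRecord(record);
      } catch (error) {
        // A real consumer would log the error with the messageId here.
        batchItemFailures.push({ itemIdentifier: record.messageId });
      }
    }
    // An empty list tells Lambda that every record in the batch succeeded.
    return { batchItemFailures };
  };
}
```

Catching errors per record is what makes the partial response meaningful: an uncaught exception would fail the whole invocation and return the entire batch to the queue.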

Idempotency is not optional

Amazon SQS standard queues provide at-least-once delivery. This means the same message may occasionally be delivered more than once.

Therefore, a consumer must be able to process duplicate messages safely.

Imagine a Payment Service that charges a customer every time it receives a message. If the message is delivered twice, the customer could be charged twice.

That is unacceptable.

In this project, the Payment Service uses a conditional DynamoDB write:

await documentClient.send(
  new PutCommand({
    TableName: PAYMENTS_TABLE,
    Item: paymentResult,
    ConditionExpression: "attribute_not_exists(orderId)"
  })
);

The first message stores the payment.

A duplicate message encounters the existing orderId, preventing a second payment record from being created.
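When the condition fails, DynamoDB rejects the write with a ConditionalCheckFailedException, and the consumer has to treat that as "already processed" rather than as an error. A minimal sketch of that handling is shown below. The wrapper name storePaymentOnce and the injected putPayment function are hypothetical; putPayment stands in for the conditional write shown above.

```javascript
// `putPayment` is a hypothetical function that performs the conditional write
// shown above (for example, by calling documentClient.send with the PutCommand).
// Resolves to true when this call stored the payment, false for a duplicate.
async function storePaymentOnce(putPayment, paymentResult) {
  try {
    await putPayment(paymentResult);
    return true;
  } catch (error) {
    // DynamoDB reports a failed condition with this error name.
    if (error.name === "ConditionalCheckFailedException") {
      return false;
    }
    // Anything else is a technical failure and should be retried.
    throw error;
  }
}
```

Returning a boolean instead of throwing lets the caller decide what a duplicate means, such as skipping the charge but still republishing the outcome event.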

The project also generates deterministic outcome event IDs:

eventId: `${orderCreatedEvent.eventId}:payment-outcome`

If Payment stores its result but temporarily fails while publishing the outcome event, the original SQS message is retried.

On retry:

The duplicate database write is recognized.
The service safely republishes the outcome.
The deterministic event ID allows downstream consumers to detect the duplicate outcome.
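The retry behavior above relies on downstream consumers recognizing an event ID they have already seen. The sketch below illustrates the idea with an in-memory set. Both helper names are hypothetical, and a real consumer would need a durable store (for example, another conditional DynamoDB write) because Lambda execution environments do not share memory.

```javascript
// Derives the outcome event ID from the triggering event, as described above.
function buildOutcomeEventId(orderCreatedEvent, suffix) {
  return `${orderCreatedEvent.eventId}:${suffix}`;
}

// Hypothetical in-memory deduplicator, for illustration only. A production
// consumer would record seen event IDs in a durable store instead.
function createDeduplicator() {
  const seen = new Set();
  return function isFirstDelivery(eventId) {
    if (seen.has(eventId)) {
      return false;
    }
    seen.add(eventId);
    return true;
  };
}
```

Because the outcome ID is derived from the original event ID, a republished outcome carries the same ID as the first attempt and is rejected by the check.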

Idempotency should be considered at every side-effect boundary:

Database writes
Payment requests
Inventory reservations
Emails
Webhooks
External API calls
Event publication

Visibility timeout configuration

When SQS delivers a message to a consumer, the message becomes temporarily invisible to other consumers.

This period is called the visibility timeout.

If the consumer successfully processes and deletes the message, processing is complete.

If the message is not deleted before the timeout expires, it becomes visible and can be processed again.

This project configures:

VisibilityTimeout: 120

The Lambda functions have:

Timeout: 15

The visibility timeout is therefore eight times the function timeout.

AWS recommends setting the queue visibility timeout to at least six times the Lambda function timeout to allow time for retries if processing is throttled.

A visibility timeout that is too short can cause the same message to become visible while the first consumer is still processing it.

A timeout that is unnecessarily long can delay retries after genuine failures.

It should be selected based on actual processing duration, batching configuration, throttling risk, and retry requirements.
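The six-times guideline is easy to check in code. The sketch below uses hypothetical helper names and also adds the batching window to the minimum; including the window is an assumption on my part, and it makes the check slightly stricter.

```javascript
// Applies the guideline of six times the function timeout. Adding the
// batching window is an assumption that makes the check slightly stricter.
function minimumVisibilityTimeout(functionTimeoutSeconds, batchingWindowSeconds = 0) {
  return 6 * functionTimeoutSeconds + batchingWindowSeconds;
}

// Returns true when the configured visibility timeout meets the guideline.
function isVisibilityTimeoutSufficient(
  visibilityTimeoutSeconds,
  functionTimeoutSeconds,
  batchingWindowSeconds = 0
) {
  return (
    visibilityTimeoutSeconds >=
    minimumVisibilityTimeout(functionTimeoutSeconds, batchingWindowSeconds)
  );
}
```

For this project, 6 x 15 + 1 = 91 seconds, so the configured 120 seconds satisfies the guideline with some margin.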

Long polling

The queues use:

ReceiveMessageWaitTimeSeconds: 20

This enables long polling.

Long polling allows SQS to wait briefly for a message instead of immediately returning an empty response. AWS documents that long polling can reduce empty and false-empty responses, with a maximum wait time of 20 seconds.

Although Lambda manages polling when SQS is configured as an event source, setting appropriate queue parameters remains useful for queue behavior and for any additional consumers that may poll the queue directly.

Infrastructure as Code with AWS SAM

The entire solution is defined in one AWS SAM template.

The template creates:

API Gateway routes
Four Lambda functions
One SNS topic
Three primary SQS queues
Three dead-letter queues
Four DynamoDB tables
SNS subscriptions
Queue policies
IAM permissions
Lambda event source mappings
CloudWatch alarms
CloudFormation outputs

This makes the architecture repeatable.

Instead of manually creating resources in the AWS Management Console, the environment can be deployed using:

npm run install:all
sam validate --lint
sam build
sam deploy --guided

The SAM build packages each Lambda service and creates the deployment artifacts.

During the guided deployment, SAM prompts for the stack name, AWS Region, environment parameter, and IAM capability confirmation, and can save the answers to a local SAM configuration file.

After CloudFormation completes, the stack outputs include the topic, queue, table, and API details created by the template.

The API URL can then be read from the CloudFormation stack outputs and used for test requests.

Subsequent deployments use:

sam build
sam deploy

The stack can be removed with:

sam delete --stack-name sns-sqs-microservices

Infrastructure as Code provides several practical advantages:

Consistent environments
Version-controlled infrastructure
Easier peer review
Repeatable deployment
Reduced configuration drift
Easier disaster recovery
A foundation for CI/CD automation

The deployed resources are visible across the AWS console. The Lambda functions represent the four services in the workflow.

The DynamoDB tables give each service its own persistence boundary.

CloudFormation keeps the whole environment grouped as one repeatable stack.

Testing the architecture

A serious microservices project should test more than the successful path.

This project contains three test scenarios.

Test 1: Successful order

{
  "paymentOutcome": "approved",
  "inventoryOutcome": "reserved"
}

Expected results:

Order stored
Payment approved
Inventory reserved
Payment notification recorded
Inventory notification recorded
No DLQ messages

The successful request returns a 202 Accepted response with the generated orderId, eventId, and EVENT_PUBLISHED status.

Test 2: Business failure

{
  "paymentOutcome": "declined",
  "inventoryOutcome": "out_of_stock"
}

Expected results:

PaymentFailed published
InventoryFailed published
Failure notifications recorded
No DLQ messages

The services operated correctly. The business simply produced negative outcomes.

Test 3: Technical failure

{
  "simulateTechnicalFailure": "payment"
}

Expected behavior:

Payment processing throws an exception
The message becomes visible again
Lambda retries it
The receive count increases
The message moves to the Payment DLQ after three attempts
The Payment DLQ CloudWatch alarm enters ALARM state

Testing failure behavior is just as important as testing successful processing.

A system is not resilient merely because it has a queue. Resilience depends on retry policy, idempotency, timeout selection, DLQ handling, monitoring, and recovery procedures.

The transactional outbox consideration

The Order Service currently performs two operations:

1. Write the order to DynamoDB
2. Publish OrderCreated to SNS

These operations are not part of one atomic transaction.

Consider this sequence:

Order write succeeds
Application crashes
SNS publication never happens

The order exists, but Payment and Inventory never receive the event.

This is known as the dual-write problem.

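For a learning project, the current implementation records an EVENT_PUBLISH_FAILED state when the publish call fails, so the failure can be identified. A crash between the two steps would not reach that code path, and the order would remain in ORDER_RECEIVED.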

For a production system, I would implement the transactional outbox pattern.

A DynamoDB-based version could work as follows:

Store the order and an outbox event in one DynamoDB transaction.
Enable DynamoDB Streams.
Invoke an event publisher from the stream.
Publish the outbox event to SNS.
Record successful publication or allow the outbox item to expire.

The transactional outbox pattern addresses inconsistent outcomes when an application must update a database and publish an event as part of the same business operation.
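The first step of that outline, writing the order and the outbox event atomically, could be sketched as follows. The helper name, the outbox table, and its item layout are all hypothetical, since the project does not implement the pattern yet.

```javascript
// Builds the input for a DynamoDB transactional write that stores the order
// and its outbox event together. With the AWS SDK for JavaScript v3 document
// client, this object would be passed to `new TransactWriteCommand(...)`.
// The outbox table name and item layout are hypothetical.
function buildOrderWithOutboxTransaction({ ordersTable, outboxTable, order, event }) {
  return {
    TransactItems: [
      {
        Put: {
          TableName: ordersTable,
          Item: order,
          // Reject the transaction if the order already exists.
          ConditionExpression: "attribute_not_exists(orderId)",
        },
      },
      {
        Put: {
          TableName: outboxTable,
          Item: {
            eventId: event.eventId,
            eventType: event.eventType,
            payload: JSON.stringify(event),
          },
          ConditionExpression: "attribute_not_exists(eventId)",
        },
      },
    ],
  };
}
```

Because both writes succeed or fail together, an order can no longer exist without a corresponding event waiting to be published from the stream.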

Business benefits

1. Better customer response times

The API does not need to wait for every downstream service to finish.

It can acknowledge that the order was accepted while processing continues asynchronously.

This provides a more responsive customer experience.

2. Reduced blast radius

If Notification fails, Payment and Inventory can continue operating.

If Inventory experiences a backlog, Payment can still process messages.

Failures are isolated to individual queues and consumers.

3. Independent scalability

Each microservice scales according to its own workload.

For example:

Payment Service: 100 concurrent workers
Inventory Service: 40 concurrent workers
Notification Service: 500 concurrent workers

The business does not need to scale the complete application just because one function is under heavy load.

4. Easier business expansion

New consumers can subscribe to the event topic without changing the original publisher.

A business can add:

Fraud detection
Customer loyalty points
Data analytics
Shipping fulfilment
Audit logging
Recommendation engines

This reduces the cost and risk of adding new capabilities.

5. Traffic-spike protection

Queues absorb sudden increases in traffic.

This is particularly valuable for:

Flash sales
Ticket releases
Registration deadlines
Payroll processing
Month-end billing
Marketing campaigns

The producer can accept work while consumers process the backlog at a controlled rate.

6. Improved operational visibility

Each queue exposes useful CloudWatch metrics, including:

Visible message count
In-flight message count
Age of the oldest message
Number of messages received
Number of messages deleted
DLQ depth

These metrics help teams identify which part of a business workflow is slowing down.

7. Cost-efficient serverless operation

The architecture uses managed services and does not require teams to operate message brokers or application servers.

The business pays based largely on usage while AWS manages the underlying infrastructure.

8. Stronger team autonomy

Different teams can own different services.

For example:

Checkout Team -> Order Service
Finance Team -> Payment Service
Operations Team -> Inventory Service
Customer Experience Team -> Notification Service

Teams can deploy and scale their services independently, provided they continue to respect the agreed event contracts.

Where this architecture can be applied

E-commerce order processing

OrderCreated
PaymentProcessed
InventoryReserved
ShipmentCreated
CustomerNotified

This is the most direct application of the project.

Financial transaction processing

TransactionSubmitted
FraudCheckRequested
ComplianceCheckCompleted
LedgerUpdated
CustomerAlerted

Additional controls are required for financial workloads, especially around exactly-once business effects, security, audit trails, and reconciliation.

User registration and onboarding

UserRegistered
IdentityVerificationRequested
WelcomeEmailRequested
CRMProfileCreated
AnalyticsEventRecorded

The registration API can remain responsive while secondary activities happen asynchronously.

Media-processing pipelines

FileUploaded
VirusScanRequested
VideoTranscodingRequested
ThumbnailGenerationRequested
MetadataExtractionRequested

Each processing stage can have its own queue and worker capacity.

Internet of Things workloads

DeviceReadingReceived
AnomalyDetectionRequested
DataArchived
AlertGenerated
DashboardUpdated

Queues help absorb bursts from thousands of devices.

Logistics and fulfilment

PackageCreated
WarehouseAssigned
DriverRequested
TrackingUpdated
CustomerNotified

Different services can react to fulfilment events without direct dependencies.

Insurance claims processing

ClaimSubmitted
DocumentValidationRequested
FraudAssessmentRequested
AdjusterAssigned
CustomerUpdated

Long-running workflows can be divided into independent processing stages.

Healthcare administrative workflows

AppointmentBooked
ReminderRequested
BillingRecordCreated
InsuranceCheckRequested
AuditRecordCreated

Sensitive workloads require appropriate security, privacy, encryption, logging, and regulatory controls.

CI/CD and DevOps automation

BuildCompleted
SecurityScanRequested
ArtifactPublished
DeploymentRequested
NotificationRequested

SNS and SQS can decouple build, scanning, deployment, audit, and notification tasks.

Where this architecture may not be appropriate

Event-driven architecture should not be adopted simply because it is modern.

It introduces:

Eventual consistency
More infrastructure components
More complex debugging
Schema-management requirements
Duplicate-delivery considerations
Distributed tracing challenges
More demanding operational procedures

A simple synchronous request may be better when:

The client requires an immediate final result.
The workflow contains only one fast and reliable dependency.
The application is small and unlikely to expand.
Eventual consistency is unacceptable.
The team lacks the operational maturity to manage distributed systems.

Architecture should solve a genuine business or engineering problem, not merely increase the number of AWS services in a diagram.

Key lessons from the project

Lesson 1: Microservices are not automatically decoupled

Breaking a monolith into multiple HTTP services does not remove coupling.

If every service synchronously depends on the next service, the system remains operationally coupled.

True decoupling requires careful control of dependencies, contracts, data ownership, and failure behavior.

Lesson 2: Give every consumer its own queue

Payment and Inventory should not compete for messages from one shared queue.

With one queue per microservice, each interested service receives its own copy and controls its own retries, backlog, scaling, and DLQ.

Lesson 3: Events should describe facts

Good event names describe something that has happened:

OrderCreated
PaymentCompleted
PaymentFailed
InventoryReserved

Avoid vague names such as:

ProcessData
HandleRequest
RunService

Business-oriented event names make architectures easier to understand.

Lesson 4: Business failures are not technical failures

A declined payment is not a processing exception.

It is a valid business outcome and should be published as an event.

Sending every negative outcome to a DLQ makes operations noisy and removes valuable business meaning.

Lesson 5: At-least-once delivery requires idempotency

Duplicate delivery is not an unusual edge case that can be ignored.

Every consumer that performs a side effect must define how duplicates are detected and handled.

Lesson 6: DLQs need ownership

Creating a DLQ is not enough.

Teams need:

Alerts
Runbooks
Dashboards
Investigation procedures
Redrive procedures
Message-retention policies
Service ownership

Lesson 7: Event schemas need governance

An event is a contract.

Production systems should define:

Schema ownership
Event versioning
Required and optional fields
Compatibility rules
Validation
Deprecation procedures
Consumer contract tests

Lesson 8: Observability must cross service boundaries

A single request can generate several events across several services.

Use fields such as:

eventId
correlationId
orderId
source
eventType
occurredAt

These values should appear consistently in structured logs so teams can trace a workflow from beginning to end.
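A small helper can keep those fields consistent across services. The sketch below is hypothetical: the function name, the log shape, and the fallback of correlationId to eventId when no correlation ID is present are assumptions, not the project's logging code.

```javascript
// Hypothetical structured-logging helper; field names follow the list above.
// Falling back to eventId when correlationId is absent is an assumption.
function buildLogEntry(level, message, event, extra = {}) {
  return JSON.stringify({
    level,
    message,
    eventId: event.eventId,
    correlationId: event.correlationId ?? event.eventId,
    orderId: event.data?.orderId,
    source: event.source,
    eventType: event.eventType,
    occurredAt: event.occurredAt,
    ...extra,
  });
}
```

A consumer would write the result with console.log so that each line is a single JSON object that can be searched by orderId or correlationId in CloudWatch Logs.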

Lesson 9: Design failure paths before production

Successful requests are usually the easiest part of the system.

The difficult questions are:

What happens when publication fails?
What happens when a consumer times out?
What happens when the same event arrives twice?
What happens when a DLQ begins growing?
How is a failed event replayed safely?

These questions should be answered during architecture design, not after an incident.

Lesson 10: Start simple, then harden deliberately

This project is intentionally understandable.

A production version can later add:

Amazon Cognito or another identity provider
AWS WAF
AWS KMS customer-managed keys
AWS Secrets Manager
Amazon SES
AWS X-Ray tracing
CloudWatch dashboards
AWS Lambda Powertools
Event schema validation
Transactional outbox
CI/CD with GitHub Actions and AWS OIDC
Multi-account environments
Reserved or maximum concurrency
Replay and redrive automation

Starting with a clear foundation is better than introducing every possible service before the application has demonstrated the need.

Final thoughts

The primary value of this project is not simply that it uses Amazon SNS and Amazon SQS.

Its value is that it demonstrates several fundamental distributed-system principles:

Producers should not need to know every consumer.
Consumers should own their queues.
Failures should be isolated.
Messages should be retried safely.
Duplicate delivery should not create duplicate business effects.
Business outcomes should be represented as events.
Operations teams should be able to detect and recover failed work.

Amazon SNS handles event distribution.

Amazon SQS provides durability, buffering, retry isolation, and independent consumer control.

AWS Lambda provides serverless processing.

DynamoDB provides service-owned persistence.

AWS SAM makes the complete environment repeatable through Infrastructure as Code.

Together, these services create an architecture that can respond quickly to customers, absorb traffic spikes, isolate failures, scale individual workloads, and support new business capabilities without repeatedly rewriting the original Order Service.

That is the real purpose of decoupling: not to make the architecture diagram more impressive, but to make the business more adaptable and the system more resilient.