Introduction
Modern applications often begin as a single service where one request triggers several operations sequentially. An order-processing application, for example, might receive an order, process its payment, update inventory, and send a notification within the same synchronous request.
This design is simple at the beginning, but it becomes increasingly fragile as the application grows.
What happens when the payment provider is slow? What if the notification service is unavailable? Should an inventory failure prevent the application from accepting an otherwise valid order? How do we add a new analytics, fraud detection, or shipping service without modifying the original order service?
These are some of the problems that event-driven microservices are designed to solve.
In this project, I built a serverless, event-driven order-processing system using:
- Amazon API Gateway
- AWS Lambda
- Amazon Simple Notification Service
- Amazon Simple Queue Service
- Amazon DynamoDB
- Amazon CloudWatch
- AWS Serverless Application Model
The central architectural decision was to use Amazon SNS for event distribution and Amazon SQS for durable, service-specific message buffering.
The result is an architecture in which payment, inventory, and notification services can operate, fail, recover, and scale independently.
Project repository: https://github.com/Donhadley22/aws-sns-sqs-microservices
Note: If the repository remains private, readers will not be able to access the source code until it is made public.
The problem with tightly coupled microservices
Consider the following synchronous workflow:
Client
|
v
Order Service
|
v
Payment Service
|
v
Inventory Service
|
v
Notification Service
Although the application has been divided into services, the services are still operationally dependent on one another.
The Order Service cannot finish until Payment responds. Payment might depend on Inventory, and Notification might become part of the same request chain.
This creates several problems.
Increased response time
The client must wait for every downstream operation to complete.
If payment takes two seconds, inventory takes one second, and notification takes three seconds, the request can take six seconds or longer.
Failure propagation
A notification failure could cause the entire order request to fail, even though the order and payment were successful.
Difficult scaling
Payment traffic, notification traffic, and inventory traffic may have different scaling patterns. A synchronous architecture makes it harder to scale them independently.
Difficult service expansion
Adding fraud detection, shipping, analytics, or audit services usually requires modifying the producing service or inserting more synchronous calls.
Tight runtime dependency
Every downstream service must be available at the same time as the upstream service.
This is precisely where messaging services become valuable.
Solution overview
The project uses the following architecture:
Client
  |
  v
Amazon API Gateway
  |
  v
Order Service Lambda
  |
  +------> Orders DynamoDB Table
  |
  v
Amazon SNS Order Events Topic
  |
  +------> Payment SQS Queue ------> Payment Lambda
  |                                     |
  |                                     +------> Payments Table
  |                                     |
  |                                     +------> SNS outcome event
  |
  +------> Inventory SQS Queue ----> Inventory Lambda
  |                                     |
  |                                     +------> Inventory Table
  |                                     |
  |                                     +------> SNS outcome event
  |
  +------> Notification SQS Queue -> Notification Lambda
                                        |
                                        +------> Notifications Table
Each SQS queue also has a dedicated dead-letter queue.
Amazon SNS and Amazon SQS are often used together because SNS can distribute one published event to multiple subscribers, while SQS stores each service's copy until that service is ready to process it. This means the producer and consumers do not need to be available at the same time.
Understanding the responsibility of each service
1. Order Service
The Order Service is exposed through Amazon API Gateway.
It is responsible for:
- Receiving the HTTP request
- Validating the order payload
- Generating an order ID
- Calculating the order total
- Writing the order to DynamoDB
- Publishing an OrderCreated event to Amazon SNS
- Returning a 202 Accepted response
The service does not directly invoke Payment, Inventory, or Notification.
That is a critical design decision.
The Order Service knows that an order has been created, but it does not need to know every service interested in that event.
2. Payment Service
The Payment Service receives messages from its dedicated Payment SQS queue.
It:
- Reads the OrderCreated event
- Processes the payment rule
- Stores the payment result
- Publishes either PaymentCompleted or PaymentFailed
Because Payment owns its own queue, a payment-processing slowdown does not block Inventory or prevent the Order Service from accepting new requests.
3. Inventory Service
The Inventory Service independently receives its own copy of the OrderCreated event.
It:
- Checks the requested items
- Simulates stock reservation
- Stores the inventory outcome
- Publishes either InventoryReserved or InventoryFailed
Payment and Inventory can therefore process the same order in parallel.
Neither service needs to call the other.
4. Notification Service
The Notification Service is interested only in the results generated by Payment and Inventory.
It subscribes to:
PaymentCompleted
PaymentFailed
InventoryReserved
InventoryFailed
The service records a notification for the customer based on the event it receives.
In a production environment, this could be extended to deliver:
- Email through Amazon SES
- SMS through Amazon SNS
- Mobile push notifications
- Slack or Microsoft Teams notifications
- Webhook callbacks
- In-application notifications
Why use both SNS and SQS?
A frequent question from students is:
Why not publish directly to SQS, or trigger every service directly from SNS?
The answer lies in the different responsibilities of the two services.
Amazon SNS provides fan-out
Amazon SNS follows a publish-subscribe model.
The publisher sends one event to a topic. The topic distributes copies to all matching subscriptions.
The Order Service therefore publishes only once:
Order Service -> Order Events SNS Topic
SNS then distributes the event to:
Payment Queue
Inventory Queue
Additional consumers can be added later without changing the Order Service.
For example:
Fraud Detection Queue
Analytics Queue
Shipping Queue
Audit Queue
Amazon SQS provides durability and back-pressure
Amazon SQS stores messages until a consumer successfully processes them or their retention period expires.
This is important because consumers do not always process traffic at the same speed.
Suppose the system receives 10,000 orders during a promotional campaign. The Order Service can continue accepting requests while the Payment queue temporarily accumulates messages.
Payment workers can process the backlog according to their available concurrency.
The queue absorbs the traffic spike instead of allowing the spike to overload the Payment Service.
This buffering is often called queue-based load leveling, and it is sometimes loosely described as back-pressure handling.
Step-by-step event flow
Step 1: The client submits an order
The client sends:
POST /orders
Content-Type: application/json
Example request:
{
  "customerEmail": "student@example.com",
  "currency": "USD",
  "items": [
    {
      "sku": "LAPTOP-001",
      "quantity": 1,
      "price": 850
    },
    {
      "sku": "MOUSE-002",
      "quantity": 2,
      "price": 25
    }
  ],
  "paymentOutcome": "approved",
  "inventoryOutcome": "reserved"
}
API Gateway forwards the request to the Order Service Lambda function.
Step 2: The order is stored
The Order Service generates an order ID and stores the order in the Orders DynamoDB table.
The order is first stored with the status:
ORDER_RECEIVED
After the event is successfully published, the status changes to:
EVENT_PUBLISHED
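The validation, total calculation, and ID generation that the Order Service performs before storing the order can be sketched as follows. This is a minimal illustration, not the repository's code: the function name buildOrder and the specific validation rules are assumptions, while the field names come from the example request above.

```javascript
const { randomUUID } = require("node:crypto");

// Validates the request body and derives the fields the Order Service stores.
// The validation rules are illustrative assumptions.
function buildOrder(payload) {
  if (!payload || typeof payload.customerEmail !== "string") {
    throw new Error("customerEmail is required");
  }
  if (!Array.isArray(payload.items) || payload.items.length === 0) {
    throw new Error("items must be a non-empty array");
  }

  let totalAmount = 0;
  for (const item of payload.items) {
    const validQuantity = Number.isInteger(item.quantity) && item.quantity > 0;
    const validPrice = typeof item.price === "number" && item.price >= 0;
    if (typeof item.sku !== "string" || !validQuantity || !validPrice) {
      throw new Error(`invalid item: ${JSON.stringify(item)}`);
    }
    // Sum of quantity * price across all line items.
    totalAmount += item.quantity * item.price;
  }

  return {
    orderId: randomUUID(),
    customerEmail: payload.customerEmail,
    currency: payload.currency || "USD",
    items: payload.items,
    totalAmount,
    status: "ORDER_RECEIVED",
  };
}
```

For the example request, one laptop at 850 plus two mice at 25 each gives a totalAmount of 900. A real service would likely represent money in minor units (cents) to avoid floating-point rounding.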
Step 3: The Order Service publishes an event
The event follows a consistent envelope:
{
  "eventId": "unique-event-id",
  "eventType": "OrderCreated",
  "eventVersion": "1.0",
  "source": "order-service",
  "occurredAt": "2026-07-02T10:00:00.000Z",
  "data": {
    "orderId": "unique-order-id",
    "customerEmail": "student@example.com",
    "items": [],
    "totalAmount": 900,
    "currency": "USD"
  }
}
This envelope provides several important fields:
- eventId supports tracing and deduplication.
- eventType identifies the business event.
- eventVersion supports schema evolution.
- source identifies the producer.
- occurredAt records when the event happened.
- data contains the business payload.
A well-defined event contract is essential because events become APIs between services.
Changing an event structure without version control can break several consumers simultaneously.
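One way to keep the envelope consistent is to build every event through a single helper. The sketch below is an assumption about how that could look, not the project's actual code: createEvent and toPublishInput are hypothetical names, and copying eventType into an SNS message attribute assumes that subscriptions filter on message attributes (the SNS default) rather than on the message body.

```javascript
const { randomUUID } = require("node:crypto");

// Wraps a business payload in the envelope described above.
function createEvent(eventType, source, data, now = new Date()) {
  return {
    eventId: randomUUID(),
    eventType,
    eventVersion: "1.0",
    source,
    occurredAt: now.toISOString(),
    data,
  };
}

// Builds the input object for an SNS Publish call.
// Assumption: eventType is duplicated as a message attribute so that
// attribute-based subscription filter policies can match on it.
function toPublishInput(topicArn, event) {
  return {
    TopicArn: topicArn,
    Message: JSON.stringify(event),
    MessageAttributes: {
      eventType: { DataType: "String", StringValue: event.eventType },
    },
  };
}
```

The returned object has the shape expected by the SNS Publish API, so it could be passed to a PublishCommand in the AWS SDK for JavaScript. The topic ARN is supplied by the caller, for example from an environment variable.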
Step 4: SNS applies subscription filters
The Payment subscription receives only OrderCreated events:
FilterPolicy:
  eventType:
    - OrderCreated
The Inventory subscription uses the same filter:
FilterPolicy:
  eventType:
    - OrderCreated
The Notification subscription receives only outcome events:
FilterPolicy:
  eventType:
    - PaymentCompleted
    - PaymentFailed
    - InventoryReserved
    - InventoryFailed
By default, an SNS subscription receives every message published to its topic. A filter policy limits delivery to messages whose attributes or body match the subscription's conditions.
Filtering prevents irrelevant events from reaching queues and reduces unnecessary Lambda invocations.
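To make the routing concrete, the matching rule for these three policies can be modelled in a few lines. This is a deliberately simplified model for illustration, not how SNS is implemented: it only handles exact string matches on message attributes, whereas real filter policies also support operators such as prefix and numeric matching. The function name is hypothetical.

```javascript
// Simplified model of attribute-based filtering: every key in the policy must
// appear in the message attributes with one of the listed string values.
function matchesFilterPolicy(policy, attributes) {
  return Object.entries(policy).every(([key, allowedValues]) =>
    allowedValues.includes(attributes[key])
  );
}

// The policies from this step, expressed as plain objects.
const paymentPolicy = { eventType: ["OrderCreated"] };
const notificationPolicy = {
  eventType: ["PaymentCompleted", "PaymentFailed", "InventoryReserved", "InventoryFailed"],
};
```

Under this model, an OrderCreated event matches the Payment (and Inventory) policy but not the Notification policy, while a PaymentFailed event matches only the Notification policy.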
Step 5: Lambda processes messages from SQS
Each queue is configured as an event source for its corresponding Lambda function.
For example:
PaymentQueueEvent:
  Type: SQS
  Properties:
    Queue: !GetAtt PaymentQueue.Arn
    BatchSize: 5
    MaximumBatchingWindowInSeconds: 1
    FunctionResponseTypes:
      - ReportBatchItemFailures
The Lambda event source mapping polls SQS and sends messages to the function in batches.
The Payment and Inventory services use a batch size of five, while Notification uses a larger batch because notification records are lightweight.
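Inside the function, each SQS record has to be unwrapped before the business event is available. When an SNS subscription delivers to SQS without raw message delivery, the SQS body is an SNS notification document whose Message field contains the published string. The sketch below assumes that shape and also tolerates raw delivery; extractEvent is a hypothetical helper name, and I have not confirmed which delivery mode the project uses.

```javascript
// Extracts the business event from one SQS record delivered to Lambda.
function extractEvent(record) {
  const body = JSON.parse(record.body);

  // Without raw message delivery, SNS wraps the published message in a
  // notification document; with raw delivery, the body is the event itself.
  const isSnsEnvelope = body.Type === "Notification" && typeof body.Message === "string";
  const event = isSnsEnvelope ? JSON.parse(body.Message) : body;

  if (!event.eventId || !event.eventType) {
    throw new Error(`record ${record.messageId} does not contain a valid event`);
  }
  return event;
}
```

A malformed body makes JSON.parse throw, which the consumer can treat as a technical failure for that record.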
Handling failures correctly
Building an event-driven system is not only about delivering messages. It is about designing what happens when processing fails.
This project separates failures into two categories.
Business failures
A business failure is a valid result that the application understands.
Examples include:
Payment declined
Product out of stock
Address validation failed
Customer account suspended
These should not normally be sent to a dead-letter queue.
The service should process the message successfully and publish a business outcome such as:
PaymentFailed
InventoryFailed
This allows other services to react to the failure.
For example, Notification can inform the customer that payment was declined.
Technical failures
A technical failure means the message could not be processed because of an unexpected problem.
Examples include:
Database unavailable
Malformed event
Dependency timeout
Permission failure
Uncaught application exception
These failures should be retried.
If processing continues to fail, the message should eventually move to a dead-letter queue.
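In code, the distinction usually comes down to what is returned and what is thrown. The sketch below illustrates the idea under stated assumptions: TechnicalError, decidePaymentOutcome, and the isApproved rule are hypothetical names, not taken from the repository.

```javascript
// Thrown for problems that should be retried and may end up in the DLQ.
class TechnicalError extends Error {}

// Decides the payment outcome for an OrderCreated event.
// Business results are returned as event types; technical problems are thrown.
function decidePaymentOutcome(event, isApproved) {
  if (!event || !event.data || typeof event.data.totalAmount !== "number") {
    throw new TechnicalError("malformed OrderCreated event");
  }
  // A declined payment is a normal result, so it is returned, not thrown.
  return isApproved(event.data) ? "PaymentCompleted" : "PaymentFailed";
}
```

A consumer built this way deletes the message after a PaymentFailed outcome, because processing succeeded. Only a thrown error leaves the message on the queue for retry.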
Dead-letter queues
Every service queue in this project has its own DLQ:
Payment Queue ------> Payment DLQ
Inventory Queue ----> Inventory DLQ
Notification Queue -> Notification DLQ
The redrive policy is configured as follows:
RedrivePolicy:
  deadLetterTargetArn: !GetAtt PaymentDeadLetterQueue.Arn
  maxReceiveCount: 3
After a message has been received three times without being successfully processed and deleted, SQS moves it to the relevant DLQ.
Dedicated DLQs provide failure isolation. A poison message for Payment does not pollute Inventory or Notification processing.
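The effect of maxReceiveCount can be illustrated with a small in-memory simulation. This is only a model of the behavior described above, with hypothetical names; it ignores visibility timeouts and treats every failed attempt as one receive.

```javascript
// Simulates redrive: the message is retried until it is processed or has been
// received maxReceiveCount times, after which it goes to the dead-letter queue.
function deliverWithRedrive(message, processMessage, maxReceiveCount) {
  let receiveCount = 0;
  while (receiveCount < maxReceiveCount) {
    receiveCount += 1;
    try {
      processMessage(message);
      return { destination: "processed", receiveCount };
    } catch (error) {
      // Not deleted: in SQS the message would become visible again later.
    }
  }
  return { destination: "dead-letter-queue", receiveCount };
}
```

With maxReceiveCount set to 3, a message whose processing always throws ends up in the dead-letter queue after three receives, while a message that fails once and then succeeds is processed on the second receive.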
The DLQ retention period in this project is longer than the source queue retention period, which follows AWS guidance for preserving failed messages long enough for investigation.
CloudWatch alarms monitor the approximate number of visible messages in each DLQ.
In production, those alarms should notify the operations team through:
- An SNS alert topic
- Slack
- Microsoft Teams
- PagerDuty
- An incident-management platform
A DLQ should never become a graveyard where failed messages are ignored. Every DLQ needs an operational process for investigation, correction, replay, or deletion.
Partial batch failure handling
Assume Lambda receives five SQS messages and only one fails.
Without partial batch handling, the entire batch may be retried, including the four messages that were processed successfully.
This creates:
- Duplicate work
- Increased cost
- Reduced throughput
- More complicated data handling
The project enables:
FunctionResponseTypes:
- ReportBatchItemFailures
Each consumer returns only the failed message IDs:
return {
  batchItemFailures: [
    {
      itemIdentifier: record.messageId
    }
  ]
};
Lambda then makes only those failed messages available for retry.
AWS recommends partial batch responses to avoid reprocessing successful messages when one record in an SQS batch fails.
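A fuller version of that return statement needs a loop that catches failures per record. The sketch below shows the pattern under simplifying assumptions: collectBatchItemFailures and processRecord are hypothetical names, and the processor is synchronous for brevity, whereas a real handler would normally await an asynchronous one.

```javascript
// Processes every record and reports only the ones that failed.
// processRecord is a hypothetical per-message function supplied by the service.
function collectBatchItemFailures(records, processRecord) {
  const batchItemFailures = [];
  for (const record of records) {
    try {
      processRecord(record);
    } catch (error) {
      // Only this record is reported; the others are treated as successful.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  // An empty list tells Lambda that the whole batch succeeded.
  return { batchItemFailures };
}
```

The key design point is that one record's exception never escapes the loop. If it did, Lambda would treat the entire batch as failed.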
Idempotency is not optional
Amazon SQS standard queues provide at-least-once delivery. This means the same message may occasionally be delivered more than once.
Therefore, a consumer must be able to process duplicate messages safely.
Imagine a Payment Service that charges a customer every time it receives a message. If the message is delivered twice, the customer could be charged twice.
That is unacceptable.
In this project, the Payment Service uses a conditional DynamoDB write:
await documentClient.send(
  new PutCommand({
    TableName: PAYMENTS_TABLE,
    Item: paymentResult,
    ConditionExpression: "attribute_not_exists(orderId)"
  })
);
The first message stores the payment.
A duplicate message encounters the existing orderId, preventing a second payment record from being created.
The project also generates deterministic outcome event IDs:
eventId: `${orderCreatedEvent.eventId}:payment-outcome`
If Payment stores its result but temporarily fails while publishing the outcome event, the original SQS message is retried.
On retry:
- The duplicate database write is recognized.
- The service safely republishes the outcome.
- The deterministic event ID allows downstream consumers to detect the duplicate outcome.
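The retry behavior above depends on telling a duplicate write apart from a genuine failure. When a conditional write fails its condition, DynamoDB returns an error named ConditionalCheckFailedException. The sketch below shows one way to use that. It is an illustration with hypothetical helper names, and writePayment is a synchronous stand-in for the PutCommand call so the example stays self-contained.

```javascript
// Derives the outcome event ID from the triggering event, so a retry
// produces the same ID instead of a fresh random one.
function paymentOutcomeEventId(orderCreatedEvent) {
  return `${orderCreatedEvent.eventId}:payment-outcome`;
}

// True when the error indicates that the conditional write was rejected,
// which for this table means the payment was already stored.
function isDuplicatePayment(error) {
  return Boolean(error) && error.name === "ConditionalCheckFailedException";
}

// Runs the conditional write and reports whether this delivery was a duplicate.
function storePaymentOnce(writePayment, paymentResult) {
  try {
    writePayment(paymentResult);
    return { duplicate: false };
  } catch (error) {
    if (isDuplicatePayment(error)) return { duplicate: true };
    throw error; // genuine technical failure: let SQS retry the message
  }
}
```

In both the first-delivery and duplicate cases the consumer can go on to publish the outcome with the same deterministic event ID, which is what makes the republish safe.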
Idempotency should be considered at every side-effect boundary:
- Database writes
- Payment requests
- Inventory reservations
- Emails
- Webhooks
- External API calls
- Event publication
Visibility timeout configuration
When SQS delivers a message to a consumer, the message becomes temporarily invisible to other consumers.
This period is called the visibility timeout.
If the consumer successfully processes and deletes the message, processing is complete.
If the message is not deleted before the timeout expires, it becomes visible and can be processed again.
This project configures:
VisibilityTimeout: 120
The Lambda functions have:
Timeout: 15
The visibility timeout is therefore significantly longer than the function timeout.
AWS recommends setting the queue visibility timeout to at least six times the Lambda function timeout to allow time for retries if processing is throttled.
A visibility timeout that is too short can cause the same message to become visible while the first consumer is still processing it.
A timeout that is unnecessarily long can delay retries after genuine failures.
It should be selected based on actual processing duration, batching configuration, throttling risk, and retry requirements.
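The arithmetic behind the configuration can be written down as a quick check. The helper below is a hypothetical illustration. Adding the batching window on top of six times the function timeout reflects my reading of the AWS guidance and should be treated as an assumption.

```javascript
// Lower bound for the queue visibility timeout, in seconds:
// six times the function timeout, plus the batching window if one is used.
function minimumVisibilityTimeout(functionTimeoutSeconds, batchingWindowSeconds = 0) {
  return 6 * functionTimeoutSeconds + batchingWindowSeconds;
}

// Values from this project: Timeout 15, MaximumBatchingWindowInSeconds 1.
const requiredSeconds = minimumVisibilityTimeout(15, 1); // 6 * 15 + 1 = 91
const configuredSeconds = 120;
const isConfigurationSafe = configuredSeconds >= requiredSeconds; // true
```

With a 15-second function timeout the lower bound is 91 seconds, so the configured 120 seconds leaves some headroom.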
Long polling
The queues use:
ReceiveMessageWaitTimeSeconds: 20
This enables long polling.
Long polling allows SQS to wait briefly for a message instead of immediately returning an empty response. AWS documents that long polling can reduce empty and false-empty responses, with a maximum wait time of 20 seconds.
Although Lambda manages polling when SQS is configured as an event source, setting appropriate queue parameters remains useful for queue behavior and for any additional consumers that may poll the queue directly.
Infrastructure as Code with AWS SAM
The entire solution is defined in one AWS SAM template.
The template creates:
- API Gateway routes
- Four Lambda functions
- One SNS topic
- Three primary SQS queues
- Three dead-letter queues
- Four DynamoDB tables
- SNS subscriptions
- Queue policies
- IAM permissions
- Lambda event source mappings
- CloudWatch alarms
- CloudFormation outputs
This makes the architecture repeatable.
Instead of manually creating resources in the AWS Management Console, the environment can be deployed using:
npm run install:all
sam validate --lint
sam build
sam deploy --guided
The SAM build packages each Lambda service and creates the deployment artifacts.
The guided deployment captures the stack name, AWS Region, environment parameter, IAM capability confirmation, and local SAM configuration file.
After CloudFormation completes, the stack outputs include the topic, queue, table, and API details created by the template.
The API URL can then be read from the CloudFormation stack outputs and used for test requests.
Subsequent deployments use:
sam build
sam deploy
The stack can be removed with:
sam delete --stack-name sns-sqs-microservices
Infrastructure as Code provides several practical advantages:
- Consistent environments
- Version-controlled infrastructure
- Easier peer review
- Repeatable deployment
- Reduced configuration drift
- Easier disaster recovery
- A foundation for CI/CD automation
The deployed resources are visible across the AWS console. The Lambda functions represent the four services in the workflow.
The DynamoDB tables give each service its own persistence boundary.
CloudFormation keeps the whole environment grouped as one repeatable stack.
Testing the architecture
A serious microservices project should test more than the successful path.
This project contains three test scenarios.
Test 1: Successful order
{
  "paymentOutcome": "approved",
  "inventoryOutcome": "reserved"
}
Expected results:
Order stored
Payment approved
Inventory reserved
Payment notification recorded
Inventory notification recorded
No DLQ messages
The successful request returns a 202 Accepted response with the generated orderId, eventId, and EVENT_PUBLISHED status.
Test 2: Business failure
{
  "paymentOutcome": "declined",
  "inventoryOutcome": "out_of_stock"
}
Expected results:
PaymentFailed published
InventoryFailed published
Failure notifications recorded
No DLQ messages
The services operated correctly. The business simply produced negative outcomes.
Test 3: Technical failure
{
  "simulateTechnicalFailure": "payment"
}
Expected behavior:
Payment processing throws an exception
The message becomes visible again
Lambda retries it
The receive count increases
The message moves to the Payment DLQ after three attempts
The Payment DLQ CloudWatch alarm enters ALARM state
Testing failure behavior is just as important as testing successful processing.
A system is not resilient merely because it has a queue. Resilience depends on retry policy, idempotency, timeout selection, DLQ handling, monitoring, and recovery procedures.
The transactional outbox consideration
The Order Service currently performs two operations:
1. Write the order to DynamoDB
2. Publish OrderCreated to SNS
These operations are not part of one atomic transaction.
Consider this sequence:
Order write succeeds
Application crashes
SNS publication never happens
The order exists, but Payment and Inventory never receive the event.
This is known as the dual-write problem.
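For a learning project, the current implementation records an EVENT_PUBLISH_FAILED state so the failure can be identified. This covers a publish call that returns an error, but not a crash between the two operations, which would leave the order in the ORDER_RECEIVED state.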
For a production system, I would implement the transactional outbox pattern.
A DynamoDB-based version could work as follows:
- Store the order and an outbox event in one DynamoDB transaction.
- Enable DynamoDB Streams.
- Invoke an event publisher from the stream.
- Publish the outbox event to SNS.
- Record successful publication or allow the outbox item to expire.
The transactional outbox pattern addresses inconsistent outcomes when an application must update a database and publish an event as part of the same business operation.
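The first step of that list, writing the order and the outbox item atomically, could be prepared as shown below. This is a sketch of a design I have not implemented in the project: the outbox table, its attribute names, and the helper name are all hypothetical. The returned object follows the TransactItems shape accepted by DynamoDB transactional writes.

```javascript
// Builds the input for one DynamoDB transaction that stores the order and its
// outbox event together, so neither can exist without the other.
function buildOrderWithOutbox(order, event, tables, nowMs = Date.now()) {
  return {
    TransactItems: [
      {
        Put: {
          TableName: tables.orders,
          Item: order,
          ConditionExpression: "attribute_not_exists(orderId)",
        },
      },
      {
        Put: {
          TableName: tables.outbox,
          Item: {
            eventId: event.eventId,
            eventType: event.eventType,
            payload: JSON.stringify(event),
            // Epoch seconds, suitable for a DynamoDB TTL attribute.
            expiresAt: Math.floor(nowMs / 1000) + tables.outboxTtlSeconds,
          },
          ConditionExpression: "attribute_not_exists(eventId)",
        },
      },
    ],
  };
}
```

A stream-triggered publisher would then read new outbox items and publish the stored payload to SNS. Because stream records can also be delivered more than once, the deduplication concerns discussed earlier still apply.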
Business benefits
1. Better customer response times
The API does not need to wait for every downstream service to finish.
It can acknowledge that the order was accepted while processing continues asynchronously.
This provides a more responsive customer experience.
2. Reduced blast radius
If Notification fails, Payment and Inventory can continue operating.
If Inventory experiences a backlog, Payment can still process messages.
Failures are isolated to individual queues and consumers.
3. Independent scalability
Each microservice scales according to its own workload.
For example:
Payment Service: 100 concurrent workers
Inventory Service: 40 concurrent workers
Notification Service: 500 concurrent workers
The business does not need to scale the complete application just because one function is under heavy load.
4. Easier business expansion
New consumers can subscribe to the event topic without changing the original publisher.
A business can add:
- Fraud detection
- Customer loyalty points
- Data analytics
- Shipping fulfilment
- Audit logging
- Recommendation engines
This reduces the cost and risk of adding new capabilities.
5. Traffic-spike protection
Queues absorb sudden increases in traffic.
This is particularly valuable for:
- Flash sales
- Ticket releases
- Registration deadlines
- Payroll processing
- Month-end billing
- Marketing campaigns
The producer can accept work while consumers process the backlog at a controlled rate.
6. Improved operational visibility
Each queue exposes useful CloudWatch metrics, including:
- Visible message count
- In-flight message count
- Age of the oldest message
- Number of messages received
- Number of messages deleted
- DLQ depth
These metrics help teams identify which part of a business workflow is slowing down.
7. Cost-efficient serverless operation
The architecture uses managed services and does not require teams to operate message brokers or application servers.
The business pays based largely on usage while AWS manages the underlying infrastructure.
8. Stronger team autonomy
Different teams can own different services.
For example:
Checkout Team -> Order Service
Finance Team -> Payment Service
Operations Team -> Inventory Service
Customer Experience Team -> Notification Service
Teams can deploy and scale their services independently, provided they continue to respect the agreed event contracts.
Where this architecture can be applied
E-commerce order processing
OrderCreated
PaymentProcessed
InventoryReserved
ShipmentCreated
CustomerNotified
This is the most direct application of the project.
Financial transaction processing
TransactionSubmitted
FraudCheckRequested
ComplianceCheckCompleted
LedgerUpdated
CustomerAlerted
Additional controls are required for financial workloads, especially around exactly-once business effects, security, audit trails, and reconciliation.
User registration and onboarding
UserRegistered
IdentityVerificationRequested
WelcomeEmailRequested
CRMProfileCreated
AnalyticsEventRecorded
The registration API can remain responsive while secondary activities happen asynchronously.
Media-processing pipelines
FileUploaded
VirusScanRequested
VideoTranscodingRequested
ThumbnailGenerationRequested
MetadataExtractionRequested
Each processing stage can have its own queue and worker capacity.
Internet of Things workloads
DeviceReadingReceived
AnomalyDetectionRequested
DataArchived
AlertGenerated
DashboardUpdated
Queues help absorb bursts from thousands of devices.
Logistics and fulfilment
PackageCreated
WarehouseAssigned
DriverRequested
TrackingUpdated
CustomerNotified
Different services can react to fulfilment events without direct dependencies.
Insurance claims processing
ClaimSubmitted
DocumentValidationRequested
FraudAssessmentRequested
AdjusterAssigned
CustomerUpdated
Long-running workflows can be divided into independent processing stages.
Healthcare administrative workflows
AppointmentBooked
ReminderRequested
BillingRecordCreated
InsuranceCheckRequested
AuditRecordCreated
Sensitive workloads require appropriate security, privacy, encryption, logging, and regulatory controls.
CI/CD and DevOps automation
BuildCompleted
SecurityScanRequested
ArtifactPublished
DeploymentRequested
NotificationRequested
SNS and SQS can decouple build, scanning, deployment, audit, and notification tasks.
Where this architecture may not be appropriate
Event-driven architecture should not be adopted simply because it is modern.
It introduces:
- Eventual consistency
- More infrastructure components
- More complex debugging
- Schema-management requirements
- Duplicate-delivery considerations
- Distributed tracing challenges
- More demanding operational procedures
A simple synchronous request may be better when:
- The client requires an immediate final result.
- The workflow contains only one fast and reliable dependency.
- The application is small and unlikely to expand.
- Eventual consistency is unacceptable.
- The team lacks the operational maturity to manage distributed systems.
Architecture should solve a genuine business or engineering problem, not merely increase the number of AWS services in a diagram.
Key lessons from the project
Lesson 1: Microservices are not automatically decoupled
Breaking a monolith into multiple HTTP services does not remove coupling.
If every service synchronously depends on the next service, the system remains operationally coupled.
True decoupling requires careful control of dependencies, contracts, data ownership, and failure behavior.
Lesson 2: Give every consumer its own queue
Payment and Inventory should not compete for messages from one shared queue.
With one queue per microservice, each interested service receives its own copy and controls its own retries, backlog, scaling, and DLQ.
Lesson 3: Events should describe facts
Good event names describe something that has happened:
OrderCreated
PaymentCompleted
PaymentFailed
InventoryReserved
Avoid vague names such as:
ProcessData
HandleRequest
RunService
Business-oriented event names make architectures easier to understand.
Lesson 4: Business failures are not technical failures
A declined payment is not a processing exception.
It is a valid business outcome and should be published as an event.
Sending every negative outcome to a DLQ makes operations noisy and removes valuable business meaning.
Lesson 5: At-least-once delivery requires idempotency
Duplicate delivery is not an unusual edge case that can be ignored.
Every consumer that performs a side effect must define how duplicates are detected and handled.
Lesson 6: DLQs need ownership
Creating a DLQ is not enough.
Teams need:
- Alerts
- Runbooks
- Dashboards
- Investigation procedures
- Redrive procedures
- Message-retention policies
- Service ownership
Lesson 7: Event schemas need governance
An event is a contract.
Production systems should define:
- Schema ownership
- Event versioning
- Required and optional fields
- Compatibility rules
- Validation
- Deprecation procedures
- Consumer contract tests
Lesson 8: Observability must cross service boundaries
A single request can generate several events across several services.
Use fields such as:
eventId
correlationId
orderId
source
eventType
occurredAt
These values should appear consistently in structured logs so teams can trace a workflow from beginning to end.
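A small logging helper is one way to keep those fields consistent across services. The sketch below is an assumption about how that could look: buildLogEntry is a hypothetical name, and falling back to eventId when an event carries no correlationId is my own choice, since the envelope shown earlier does not include that field.

```javascript
// Builds one structured log line (a JSON string) carrying the tracing fields.
function buildLogEntry(level, message, event, extra = {}) {
  return JSON.stringify({
    level,
    message,
    eventId: event.eventId,
    // Assumption: fall back to eventId when no correlationId was propagated.
    correlationId: event.correlationId || event.eventId,
    orderId: event.data ? event.data.orderId : undefined,
    source: event.source,
    eventType: event.eventType,
    occurredAt: event.occurredAt,
    ...extra,
  });
}
```

A consumer could call console.log(buildLogEntry("info", "payment stored", event)) so that CloudWatch Logs receives one JSON object per line, which can then be filtered by orderId or correlationId.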
Lesson 9: Design failure paths before production
Successful requests are usually the easiest part of the system.
The difficult questions are:
What happens when publication fails?
What happens when a consumer times out?
What happens when the same event arrives twice?
What happens when a DLQ begins growing?
How is a failed event replayed safely?
These questions should be answered during architecture design, not after an incident.
Lesson 10: Start simple, then harden deliberately
This project is intentionally understandable.
A production version can later add:
- Amazon Cognito or another identity provider
- AWS WAF
- AWS KMS customer-managed keys
- AWS Secrets Manager
- Amazon SES
- AWS X-Ray tracing
- CloudWatch dashboards
- AWS Lambda Powertools
- Event schema validation
- Transactional outbox
- CI/CD with GitHub Actions and AWS OIDC
- Multi-account environments
- Reserved or maximum concurrency
- Replay and redrive automation
Starting with a clear foundation is better than introducing every possible service before the application has demonstrated the need.
Final thoughts
The primary value of this project is not simply that it uses Amazon SNS and Amazon SQS.
Its value is that it demonstrates several fundamental distributed-system principles:
Producers should not need to know every consumer.
Consumers should own their queues.
Failures should be isolated.
Messages should be retried safely.
Duplicate delivery should not create duplicate business effects.
Business outcomes should be represented as events.
Operations teams should be able to detect and recover failed work.
Amazon SNS handles event distribution.
Amazon SQS provides durability, buffering, retry isolation, and independent consumer control.
AWS Lambda provides serverless processing.
DynamoDB provides service-owned persistence.
AWS SAM makes the complete environment repeatable through Infrastructure as Code.
Together, these services create an architecture that can respond quickly to customers, absorb traffic spikes, isolate failures, scale individual workloads, and support new business capabilities without repeatedly rewriting the original Order Service.
That is the real purpose of decoupling: not to make the architecture diagram more impressive, but to make the business more adaptable and the system more resilient.







