BrycePC for AWS Community Builders

Building “The Better Store”  — Part 4: Implementing a Microservices Architecture with Cloud Native Patterns and AWS Services

The previous two articles in this series introduced Domain Driven Design principles and how they may be used to define an appropriate microservices architecture for our sample cloud-native e-commerce system, ‘The Better Store’.

This article continues from the proposed DDD tactical design to formulate an implementation of our decomposed microservices, while describing and adopting popular cloud-native patterns to reap the advantages they provide.

The focus here will be on defining an implementation with the following feature scope, as defined in the previous Strategic Patterns section:

Scenario 1: OrderPurchased
**When** I submit valid Card details
**And** Payment is approved
**Then** My order details with cart details will be stored in the Order Repository
**And** The order details will be eventually persisted to the reporting database
**And** An electronic Receipt will be emailed to me
**And** A shipping order is sent for Fulfilment
**And** I will be directed back to the store’s home page, with a notice confirming the order number.

This logical flow is represented by the following:


Figure 1: Logical flow representing completion of the Order Purchased scenario.

So now that we have identified the resources and the required interactions between them for implementing this flow, we need to determine exactly how to implement these in a way that provides optimal scalability, resilience and performance, while considering cost, with AWS as the platform of choice. Some questions we can start asking ourselves while looking at the above high-level implementation designs are:

  1. What is the best backend hosting technology for compute and data services that scales quickly on demand, can be easily scaled across regions for a global implementation, is resilient, and provides cost optimization for both development and production environments in terms of compute and operational/maintenance costs?
  2. What are the best methods for enabling communications between services and data stores which provide scalability, resilience and cost optimization that accommodate both development and production environments?
  3. What are the best methods for managing transactions and errors?

Microservice Architecture Patterns

A number of patterns and best practices which help to answer these questions have been defined for microservice architectures; Chris Richardson provides a great overview and illustration of these at his website https://microservices.io/. A cut-down version illustrating those considered for The Better Store is shown below:


Figure 2. A summary of Microservice Patterns, highlighting key patterns for discussion with The Better Store in yellow.

Descriptions of the patterns used by The Better Store, and why, follow.

Application Patterns

Decomposition
A. Decompose By Subdomain describes how Domain Driven Design may be used to decompose a business’s domain into decoupled subdomains or Bounded Contexts, each of which may be considered as a candidate for a microservice implementation. This topic has been the focus of our previous articles in this series.

B. The Self Contained Service pattern describes how services within an application are decoupled and can be updated and deployed with minimal risk of impacting other services. It also means services ideally should not synchronously call other services, or resources they do not own such as shared databases, as doing so increases the risk of release issues.
Consider for example the following alternative implementations for order fulfilment requests upon receiving payment confirmation from a card merchant. The first illustrates tight coupling between Order and Fulfillment services and use of a shared database. In this topology, changes to either of the Order or Fulfilment services, or the shared database, risk impacting request processing, and consequently customers not receiving their items as expected if not appropriately remediated. Side effects can also include:

  1. Unexpected cloud usage charges; for example, an issue causing a downstream service call to wait and eventually time out will also impact its upstream requesting services.
  2. Larger system availability issues due to exhausted resources (for example AWS Lambda concurrency, or database connections) if requests are not able to complete.
  3. Potential issues in error handling when errors are returned to the payment and receipt system, which would need to be considered.


Figure 3: Illustrating tight coupling between Order and Fulfilment services and use of a shared database.

An alternative, decoupled and more resilient solution is shown below, noting that the client does not require data to be returned in a response; its calls may be made asynchronously. This solution offers the following:

  1. Confirm Payment calls from the Payment system are placed on a queue with ‘Guaranteed At Least Once’ delivery; and a successful response is ALWAYS returned to it. The client just needs to know that the message has been delivered successfully.
  2. Both the Order and Fulfilment services receive requests asynchronously; it does not matter if they are not currently running, as the messages will wait until they are next available.
  3. If either service fails in processing a request then processing may be configured to retry for a set number of times, after-which they may be placed on a “Dead Letter Queue” for appropriate error remediation.
  4. Both Order and Fulfilment services in this way are effectively decoupled, including having their own databases, such that an error introduced to one service should minimise impact for the other.


Figure 4: An alternative decoupled solution
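The retry and Dead Letter Queue behaviour described in points 1–3 can be sketched as follows. This is a hypothetical in-memory simulation only; the `consume` function, handler and `MAX_RETRIES` names are illustrative, not AWS SDK calls:

```javascript
// Minimal in-memory sketch (assumed names) of queue consumption with
// retries and a Dead Letter Queue, as in the decoupled solution above.

const MAX_RETRIES = 3;

// Deliver each queued message to a handler; on repeated failure, move it
// to the dead letter queue for later remediation instead of losing it.
function consume(queue, handler, deadLetterQueue) {
  for (const message of queue) {
    let delivered = false;
    for (let attempt = 1; attempt <= MAX_RETRIES && !delivered; attempt++) {
      try {
        handler(message);
        delivered = true;
      } catch (err) {
        // Treated as transient: the message remains eligible for retry.
      }
    }
    if (!delivered) deadLetterQueue.push(message);
  }
}

// Example: the second order always fails, so it lands on the DLQ.
const orderQueue = [{ orderId: 'A1' }, { orderId: 'A2' }];
const dlq = [];
consume(orderQueue, (msg) => {
  if (msg.orderId === 'A2') throw new Error('downstream unavailable');
  console.log(`Order ${msg.orderId} processed`);
}, dlq);

console.log(`DLQ contains ${dlq.length} message(s)`); // DLQ contains 1 message(s)
```

In the real system, SQS provides the queue and redelivery, and the DLQ threshold is the queue's configured `maxReceiveCount`.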

A further note for The Better Store: this pattern has also been taken to describe how each service should own and define all of its application and infrastructure dependencies (data storage, security resources, etc.), such that it can be deployed quickly and independently across environments. This includes, for example, each service having its own security roles, firewall definitions, databases (see below), SSL certificates and domain records defined; external/shared dependencies, such as the VPC in which services reside, are kept to a minimum.
In a future DevSecOps article, Infrastructure as Code (IaC) using AWS Cloudformation will be described, for the creation of fully-encapsulated Cloudformation stacks which may be used to deploy instances of independent microservices, including resources that they require.

The example below illustrates the Order Cloudformation stack, which defines all of its required resources, and shared infrastructure stacks that it depends on.


Figure 5: The Order Service encapsulated for deployment as a self-contained AWS Cloudformation stack.

To conclude this section on Self Contained Services, we can name some candidate AWS services for implementing the resources described:

  • Guaranteed at-least-once queueing: SQS
  • Dead Letter Queues: SQS
  • Asynchronous messaging between services (“Remote Procedure Invocation/RPC”): SNS, EventBridge, DynamoDB Streams
  • Data stores: DynamoDB, RDS/RDS Aurora
  • Self-contained IaC: Cloudformation
  • Compute processing for processes of short duration, with fast scaling capabilities: Lambda

Application Architecture
C. Monolithic: where an application is built and deployed from a single source code repository. This may have advantages for smaller applications and startups while the codebase is new and small, offering reduced complexity. However, continued growth over time without checks can yield a system that is harder to change, scale and deploy, a phenomenon coined the ‘Big Ball of Mud’ (Foote & Yoder).

D. MSA: as already discussed, is focused on change agility when decomposing applications into multiple services. It cares less about code reuse in contrast to earlier architectures; it may be that duplicate code can sometimes exist between services, but this does allow such code to be modified if required in an application, knowing that other applications will not be affected as a result.

Database Architecture
E. Database per Service: is another decoupling approach recommended for microservices. Traditional relational databases can grow large to support data models shared by multiple services, while they also enforce relational constraints and atomic transactions across tables to maintain data integrity.
Adopting the Database per Service pattern implies the following:

  1. The database is split into objects specific to each microservice, which means breaking the relational constraints and ACID transaction support otherwise provided.
  2. The application’s architecture needs to be refactored to cater for the loss of these constraints in order to preserve data integrity. Patterns that may assist include ‘Saga’ and ‘Idempotent Consumer’, which are introduced below.

The main advantage of Database per Service is again change agility; any change required to a database should generally only impact its owning microservice. This greatly reduces the risk of issues and the amount of regression testing that may otherwise be required when making changes. Furthermore, each service is free to use the database technology most suitable for its needs (aka polyglot persistence), for example:
  • The Order microservice is expected to use AWS DynamoDB, a serverless NoSQL database which scales well for high-demand, and is capable of replicating data across regions for potential future global scalability of the application.
  • The Reports microservice is expected to use AWS Aurora Serverless, to receive orders in batches, which supports complex relational queries using SQL to provide overnight reports. Its serverless nature is expected to provide cost optimisation for its low intended traffic, while any cold-starts in its activity will not impact users.

F. Saga is a pattern that addresses the problem of managing business transactions that span multiple services and/or databases, for example when implementing the Database per Service pattern described above, and where a distributed transaction (e.g. via a two-phase commit) is either too complex or not possible. It describes a process whereby such transactions are implemented as a sequence of partial transactions against each of the participant databases. If any single step of the transaction fails, then previous changes are rolled back by running compensating transactions in the reverse order.

An example of a choreography-based saga including compensating-rollback of transactions is given below (where system behaviour is asynchronously event-driven):


Figure 6: Saga pattern illustrating compensating actions to roll-back a transaction.
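As a rough sketch of the compensating behaviour in Figure 6, the following hypothetical NodeJS snippet runs a sequence of local saga steps and rolls back completed steps in reverse order on failure; the step names and actions are illustrative only:

```javascript
// Hypothetical saga runner: each step pairs an action with a compensating
// transaction. If a step fails, compensations for completed steps run in
// reverse order, undoing the partial transactions.

function runSaga(steps) {
  const completed = []; // steps whose actions succeeded
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
    } catch (err) {
      // Roll back in reverse order using compensating transactions.
      for (const done of completed.reverse()) done.compensate();
      return { status: 'ROLLED_BACK', failedAt: step.name };
    }
  }
  return { status: 'COMPLETED' };
}

// Example: fulfilment fails, so the order creation is compensated.
const log = [];
const result = runSaga([
  { name: 'createOrder', action: () => log.push('order created'),
    compensate: () => log.push('order cancelled') },
  { name: 'createShipment', action: () => { throw new Error('no stock'); },
    compensate: () => log.push('shipment cancelled') },
]);
console.log(result.status, '-', log.join(', '));
// ROLLED_BACK - order created, order cancelled
```

In a choreography-based saga the "steps" are not run by a central coordinator as above; each service reacts to events and emits its own compensation events, but the rollback ordering principle is the same.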

Application Infrastructure Patterns

Communication Patterns
G. Remote Procedure Invocations (RPI): refers to the use of standard synchronous request/reply protocols for inter-service communications, for example via REST or gRPC. These have advantages over Remote Procedure Calls (RPCs) between services, which depend on a specific programming language being used between client and server, such as an SDK call between a NodeJS application and AWS’s NodeJS SDK.

Inter-service communications using standard protocols are sometimes necessary for processing of requests, and RPI’s use of the request/reply pattern allows this to be achieved simply. The pattern however does result in tight-coupling between services involved, as discussed for the Self-Contained Service pattern above.

_Candidate AWS services: API Gateway, AppSync_

H. Messaging: refers to the use of asynchronous message channels for inter-service communications, in-contrast to synchronous Remote Procedure Invocations.
As noted for the Self-Contained Service pattern above, asynchronous messaging aims to decouple services and increase overall system availability, such that a change to one service should generally be seamless to the services that communicate with it.

The pattern also includes different types of communication; for example:

  1. Notification; a sender sends a message to a recipient, and does not expect a reply.
  2. Request/asynchronous response — where the recipient replies eventually. The sender does not block waiting.
  3. Publish/subscribe — a service sends messages to 0, 1 or more subscribers. These may also be ‘durable consumers’; to guarantee they will receive messages eventually if they are not currently running.

_Candidate AWS services: SQS, EventBridge, SNS._
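A minimal in-memory sketch of the publish/subscribe style above follows; this is not an AWS API, and the `EventBus` class and event names are illustrative only:

```javascript
// Illustrative in-memory event bus demonstrating publish/subscribe:
// a publisher sends to 0, 1 or more subscribers without waiting for replies.

class EventBus {
  constructor() { this.subscribers = new Map(); } // eventType -> handlers
  subscribe(eventType, handler) {
    const handlers = this.subscribers.get(eventType) || [];
    handlers.push(handler);
    this.subscribers.set(eventType, handlers);
  }
  publish(eventType, payload) {
    // The publisher neither knows who the subscribers are nor blocks
    // waiting for a reply (the Notification style above).
    for (const handler of this.subscribers.get(eventType) || []) {
      handler(payload);
    }
  }
}

const bus = new EventBus();
bus.subscribe('OrderConfirmed', (e) => console.log(`Email receipt for ${e.orderId}`)); // prints "Email receipt for A1"
bus.subscribe('OrderConfirmed', (e) => console.log(`Ship order ${e.orderId}`));        // prints "Ship order A1"
bus.publish('OrderConfirmed', { orderId: 'A1' });
```

A durable consumer differs from this sketch in that the broker (e.g. SQS behind EventBridge) retains messages for subscribers that are not currently running.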

I. Idempotent Consumer: is a key pattern that requires full consideration when implementing microservices; it means that a service must be able to handle a request received more than once with no side effects; i.e., the outcome of processing a request repeatedly must be the same as processing it once.

The reason why this pattern is so important is that a number of AWS services guarantee ‘At Least Once’ message delivery to consumers; i.e. no messages will be lost, but duplicates may be received and the consuming service must be able to deal with these.

Such examples include:

  1. SQS may redeliver a message to a consumer if it was previously consumed but not acknowledged as processed before its visibility timeout elapsed.
  2. Asynchronous requests may be automatically retried by some AWS services on encountering an error. For example, if an error is thrown from an asynchronously-invoked Lambda function, Lambda will automatically retry processing two further times in case the error was transient, and if still unsuccessful will place the request on a Dead Letter Queue if one is configured.
  3. Failed message deliveries from EventBridge, SQS and SNS all may result in messages being retried, and being delivered to a Dead Letter Queue if a threshold has been exceeded.

Designing idempotent consumers also benefits error handling. For example, if a single request contains 100 records of which only one fails, the request can be safely retried after correcting the failing record; resending the other 99 records will not result in any change to the system.

Methods for implementing this pattern may include:

  • Ensuring that every request has a unique identifier, and recording receipt of these in a data store as messages are received and processed. Any subsequent receipt of the same message may be ignored.
  • For some applications, designing requests to carry the full state of the record to be processed, such that repeating the update in the data store results in no change.
  • Ensuring that requests include a timestamp from the sending system, and only processing a record if this is newer than the last timestamp received by the consumer.
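The first method above, recording unique request identifiers, might be sketched as follows; the names are hypothetical, and a production version would use a durable store such as DynamoDB (e.g. with conditional writes) rather than an in-memory set:

```javascript
// Sketch of an idempotent consumer: processed message IDs are recorded,
// and duplicate deliveries (possible under at-least-once delivery) are skipped.

const processedIds = new Set(); // in production: a durable store, e.g. DynamoDB

function handleMessage(message, apply) {
  if (processedIds.has(message.id)) {
    return 'DUPLICATE_IGNORED'; // already processed: no side effect
  }
  apply(message); // perform the actual business logic exactly once
  processedIds.add(message.id);
  return 'PROCESSED';
}

// At-least-once delivery may hand us the same message twice:
let total = 0;
const msg = { id: 'evt-123', amount: 50 };
console.log(handleMessage(msg, (m) => { total += m.amount; })); // PROCESSED
console.log(handleMessage(msg, (m) => { total += m.amount; })); // DUPLICATE_IGNORED
console.log(total); // 50
```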

J. API Gateway: is often implemented in front of a service to act as a single entry point for its clients, providing the following capabilities:

  1. They define a service’s Published Language (refer to DDD Strategic Patterns) / interface contract via an open-standard specification such as Swagger or OpenAPI, providing a shared understanding between developers and consumers of the required input data for requests and the expected output. These specifications are intended to provide all the information that consumers require; the inner workings of the service do not need to be known.
  2. They may serve simply as a proxy layer to underlying services, while offering additional capabilities such as authentication, authorisation, request throttling (e.g. to protect the system from unexpected surges in traffic), WAF and transport-based encryption.

Candidate AWS services: Api Gateway, AppSync (supporting GraphQL)

Observability
K. Metrics: provide a continuous stream of data points over time as a measure of the performance and health of an application and its resources, for monitoring and potential remediation. Example metrics include:

  • Counts of consumer requests and errors over time
  • Average, maximum and minimum request durations for request processing (latency) over time
  • CPU (%) and system memory (e.g. MB) used over time.

Candidate AWS services: Cloudwatch Metrics

L. Log Aggregation: refers to a centralized logging service that aggregates logs from multiple service instances, for easy accessibility and analysis.


Figure 7: Screenshot of AWS Cloudwatch Insights, which allows log groups to be queried using a SQL-like syntax for fast analysis and troubleshooting.

Candidate AWS services: Cloudwatch Logs, Cloudwatch Insights, OpenSearch

M. Distributed Tracing: provides the ability to determine how a single request traverses multiple services for its processing within a distributed system, made possible by allocating a unique trace id to each request when it is first received.
Distributed tracing provides the following benefits:

  1. It enables developers to understand the flow of processing events for a request.
  2. Can help identify performance bottlenecks at different processing points in the system.


Figure 8: Screenshot of an AWS X-ray trace for processing of a single request.

Candidate AWS services: XRay, Open Telemetry

N. Dashboards: provide a graphical collection of metrics for a defined portion of the system, to give a holistic view of its behaviour.

The following provides an example specific to the Order service, providing a view of resources that it contains:


Figure 9: Cloudwatch Dashboard constructed specifically for monitoring resources belonging to the Order service.

Candidate AWS services: Cloudwatch Dashboards, OpenSearch (Kibana), Grafana

O. Alarms: These may be used to notify IT staff that manual intervention is required when certain system metric thresholds are exceeded.

Examples include:

  1. Request volumes are higher than the system’s capacity, causing throttling of some requests (throttling metric > 0).
  2. Asynchronous requests have failed processing following x amount of retries, and have been placed in the configured Dead Letter Queue (which has an alarm threshold > 0 for a defined period).
  3. Synchronous requests to a lambda are failing; where the lambda Error metric threshold is > 0.
  4. Relational database CPU is > 90% for a defined period; vertical scaling may need to be considered.

Candidate AWS services: Cloudwatch Alarms, OpenSearch (Kibana), Grafana, SNS (notifications)
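As an illustration of how such threshold alarms behave, the following hypothetical snippet evaluates a metric series against a threshold over consecutive periods, loosely mimicking Cloudwatch’s “datapoints to alarm” behaviour; function and parameter names are illustrative only:

```javascript
// Illustrative alarm evaluation: fire when the metric breaches the
// threshold for `periods` consecutive data points (cf. examples 1-4 above).

function evaluateAlarm(metricValues, threshold, periods) {
  let consecutive = 0;
  for (const v of metricValues) {
    consecutive = v > threshold ? consecutive + 1 : 0;
    if (consecutive >= periods) return 'ALARM';
  }
  return 'OK';
}

// Example 4 above: relational database CPU > 90% for a defined period.
console.log(evaluateAlarm([50, 95, 96, 97], 90, 3)); // ALARM
console.log(evaluateAlarm([50, 95, 60, 97], 90, 3)); // OK (breaches not consecutive)
```

In Cloudwatch the actual evaluation is configured declaratively (threshold, evaluation periods, datapoints to alarm), with SNS typically used to deliver the notification.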

Infrastructure Patterns

Deployment
P. Services/Host or VM: Involves deploying a number of services, or potentially an entire system, on a single host or VM.
This may initially provide advantages of simplicity and efficient resource utilization in contrast to a Service/VM pattern, but it also has the following disadvantages:

  1. Difficulty in isolating resource usage between services and reduced availability because of this; an errant service, or host issue will impact multiple services.
  2. Potential difficulty, or reduced efficiency, in horizontally scaling a single small service when larger, more resource-consuming services must be scaled along with it.
  3. Maintenance of the underlying Operating System is typically the responsibility of the cloud account holder, including OS updates and security patching.
  4. Horizontal scaling involves instantiating new VMs, which can be slow due to the required startup of their OS and other underlying services.

Candidate AWS services: EC2 (shared or dedicated hosting)

Q. Service/Host or VM: Involves deploying single services into their own dedicated hosts or VMs. This provides advantages over Services/Host or VM, in that services are isolated from each other, at the cost of having to maintain and pay for additional hosts or VMs.
Horizontal scaling of entire VMs is slow, but potentially faster than when hosting multiple services per VM.

Candidate AWS services: EC2 (shared or dedicated hosting), Beanstalk

R. Service/Container: Involves packaging services as Docker images and deploying them into isolated containers.
Benefits of container vs VM deployments include:

  1. Horizontal scaling of container instances is much faster than starting new VMs and their underlying operating systems.
  2. The container image also encapsulates the runtime that the service requires, which provides portability and consistent deployments of services into different environments. Note that unless serverless options are used, maintenance of the underlying host VM, including its OS, is still required.

Candidate AWS services: Beanstalk, ECS, EKS, App Runner

S. Serverless: Refers to the deployment of services to compute platforms which hide the underlying server details; the cloud provider instead assumes responsibility for managing the hosts, their associated infrastructure, and OS patching.
Typically the implementor only needs to specify the amount of memory (GB) and/or the number of vCPUs to be allocated to a service executable.

Candidate AWS services: ECS (Fargate), EKS (Fargate), Lambda.
Of note, while the first two services above provide container-based hosting of services, Lambda provides Function as a Service (FaaS) capabilities, where each implementation provides a single compute function only; these are designed to run transactions of short duration but can scale very quickly based on consumer demand.

Next, we will look at how some of these patterns may be used for our chosen Use Cases.

Implementation

Having considered the microservice patterns and the candidate AWS services for implementing them, we conclude by defining an implementation view, as illustrated below:


Figure 10: Implementation view for Order Purchased scenario

Decisions made for this architecture include:

  1. Order, fulfilment and reporting services will be implemented as separate decoupled AWS microservices; each of which will be defined as separate AWS Cloudformation stack instances (tbs-app-order-prod, tbs-app-reports-prod, tbs-app-fulfilment-prod) that are defined in their own GitHub repositories, matching the stack names. This aligns well to the Decompose by Subdomain and Self Contained Service patterns.
  2. The Database per Service pattern will be used such that the tbs-app-order and tbs-app-reports services each have their own database best-suited to their use-case. The tbs-app-order service will implement DynamoDB as a highly-scalable and potentially global database that can accommodate high traffic volumes, where structured data and complex query capabilities are not required. The tbs-app-reports database will implement AWS Aurora Serverless v2, to subscribe to batched order confirmation updates and store these in a structured manner that allows complex queries for monthly reporting. As its traffic volumes are expected to be low and immediate responses are not required by clients (cold-starts of the database can be accommodated), scaling down to 0 CPU will be used for cost optimization. Finally, it is expected that queries can target the database’s read-only endpoint so as not to impact writes to the database (though again this is probably not required for its expected traffic).
  3. Inter-service communications will be asynchronous using messaging where possible, primarily via AWS EventBridge. EventBridge offers similar capabilities to SNS for implementing the publish/subscribe pattern; it has somewhat higher latency, which matters little for asynchronous communications, and it has advantages over SNS including more integration options, content-based subscriptions (i.e. subscribers can select to receive requests based on their content), and event archiving (although this, and event sourcing, is not considered further here; it has its own complexities). SQS is used for guaranteed message delivery, where it is integrated with API Gateway to receive payment confirmation messages from the payment system (Stripe). In this way the webhook we configure Stripe to invoke has very high availability; success responses will always be returned to Stripe as messages are placed on the queue for processing.
  4. Asynchronous services will implement the Idempotent Consumer pattern, to support the Guaranteed At Least Once delivery properties of SQS and EventBridge, and the automated retry (2 times) behaviour of asynchronously-triggered AWS Lambda functions when they throw an error. Dead Letter Queues (SQS) will be configured for SQS queues, EventBridge and Lambda functions where appropriate, to ensure that errored Lambda requests are not lost, and that retries from services such as EventBridge do not continue forever!
  5. Synchronous messaging via Remote Procedure Invocation will be implemented as RESTful APIs using API Gateway. Examples of its use include requests from the client website to retrieve and post data, where the client is dependent on information returned in the response.
  6. AWS Cloudwatch will be used for monitoring metrics and logs of services and their associated resources, and for providing monitoring dashboards and alarm capabilities. AWS X-ray will be used for distributed tracing of requests received. Note that other services such as OpenSearch and Managed Grafana are also available and may provide greater capabilities; Cloudwatch has been chosen for its simplicity of implementation while providing sufficient capabilities at low cost for our needs.
  7. Serverless resources will be used where possible for our implementations, for the reduced maintenance that would otherwise be required for managing and patching servers, their generally faster horizontal scalability, and the Pay As You Go model, which is generally favourable, especially for non-production systems! AWS Lambda is currently used to provide all compute functionality for The Better Store, as all of its request processing workloads are small and of very short duration (i.e. < 10 seconds, where AWS Lambda allows request processing durations of up to 15 minutes).

To conclude, sample code (as Cloudformation templates and NodeJS implementations) for the services described here and other supporting stacks may be viewed on GitHub.

Coming soon in Part Five: Use of DevSecOps; specifically AWS Cloudformation for defining both applications and infrastructure as code, for fully-automated deployments.

References

  1. “A pattern language for microservices”, Richardson, Chris. _web_ (2023)
  2. “Big Ball of Mud”, Foote & Yoder (University of Illinois), _web_ (1999)
  3. “The Better Store Documentation”, Cummock, B. _web_ (2025)
  4. “The Better Store Github Repository”, Cummock, B. _web_ (2025)

Disclaimer: The views and opinions expressed in this article are those of the author only.
