Devam Parikh

Posted on Oct 13, 2023

Testing Application Resilience: How to Stop Amazon ElastiCache Cluster and Manage Traffic

#aws #networking #devops

Introduction

As developers, it is crucial to test the resiliency of our applications and understand how they handle failures or disruptions. In this blog post, we will explore a scenario where we need to stop an Amazon ElastiCache cluster to see how our application behaves when Redis is unavailable. Although ElastiCache clusters cannot be stopped, we will discuss alternative approaches to achieve our testing objective.

Understanding Amazon ElastiCache

Amazon ElastiCache for Redis is a managed in-memory data store that gives modern applications low-latency access to data. It can serve as a cache or as a primary data store. When a cluster has replica nodes, Redis replicates writes from the primary node to the replicas asynchronously.

Challenges with Stopping ElastiCache Cluster

ElastiCache has no stop or pause operation. A cluster can be rebooted or deleted, but it cannot be shut down and later resumed in place, and deleting a cluster only to recreate it after a test is heavy-handed. However, we can explore other methods to create scenarios where our application experiences Redis unavailability.

Blocking Incoming Traffic using Security Groups

To simulate Redis unavailability, we can block incoming traffic to the ElastiCache cluster. Security groups act as virtual firewalls, controlling inbound and outbound traffic. By removing all the inbound rules for the ElastiCache cluster, we can prevent any incoming requests from reaching it.
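As a rough illustration, removing the inbound rules could be scripted along these lines. This is a minimal sketch, not the exact procedure used in this post: it assumes the `ec2` argument is a boto3 EC2 client (for example `boto3.client("ec2")`), and the helper names and the security group ID in the comments are hypothetical placeholders. The removed rules are returned so they can be put back after the test.

```python
from typing import Any


def revoke_all_inbound(ec2: Any, group_id: str) -> list:
    """Remove every inbound rule from a security group.

    `ec2` is assumed to be a boto3 EC2 client. The removed rules are
    returned so the caller can restore them once the test is over.
    """
    response = ec2.describe_security_groups(GroupIds=[group_id])
    rules = response["SecurityGroups"][0]["IpPermissions"]
    if rules:
        ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=rules)
    return rules


def restore_inbound(ec2: Any, group_id: str, saved_rules: list) -> None:
    """Re-add inbound rules previously returned by revoke_all_inbound."""
    if saved_rules:
        ec2.authorize_security_group_ingress(
            GroupId=group_id, IpPermissions=saved_rules
        )


# Hypothetical usage (the group ID is a placeholder, not a real resource):
#   import boto3
#   ec2 = boto3.client("ec2")
#   saved = revoke_all_inbound(ec2, "sg-0123456789abcdef0")
#   ... run the resilience test ...
#   restore_inbound(ec2, "sg-0123456789abcdef0", saved)
```

Keeping the saved rules around matters here: without them, restoring access after the test means rebuilding the security group by hand.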

However, it is essential to understand that security groups are stateful and rely on connection tracking[1]. When an inbound rule is removed, connections that are already being tracked are not interrupted immediately. Thus, our application may stay connected to the ElastiCache cluster through the connections it opened before the change.

Addressing the Issue

Two methods can be used to tackle this issue:

1. Restarting the Application: By restarting the application, existing connections will be terminated, forcing the application to establish new connections. This can validate the application's ability to handle Redis unavailability.

2. Using Network ACLs: Network Access Control Lists (ACLs)[2] operate at the subnet level and allow or deny specific inbound or outbound traffic. Unlike security groups, network ACLs are stateless, meaning they don't automatically allow response traffic. Introducing a network ACL that blocks traffic in either direction can break existing connections.
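The difference between the two behaviours can be pictured with a toy model. The sketch below is not how AWS implements either feature; the class names are made up for illustration, a "flow" is reduced to a (source, destination, port) tuple, and real connection tracking has more cases than this.

```python
from dataclasses import dataclass, field


@dataclass
class StatefulFilter:
    """Toy security group: remembers flows it has already allowed."""
    allowed_ports: set
    tracked: set = field(default_factory=set)

    def permits(self, flow: tuple) -> bool:
        if flow in self.tracked:
            return True  # an established flow keeps working
        if flow[2] in self.allowed_ports:
            self.tracked.add(flow)  # start tracking the new flow
            return True
        return False


@dataclass
class StatelessFilter:
    """Toy network ACL: checks every packet against the current rules."""
    allowed_ports: set

    def permits(self, flow: tuple) -> bool:
        return flow[2] in self.allowed_ports


existing = ("app", "redis", 6379)
security_group = StatefulFilter({6379})
network_acl = StatelessFilter({6379})

# The application connects while the rules still allow Redis traffic.
security_group.permits(existing)
network_acl.permits(existing)

# Now remove the rules from both filters.
security_group.allowed_ports.clear()
network_acl.allowed_ports.clear()

still_connected_via_sg = security_group.permits(existing)       # True
new_connection_via_sg = security_group.permits(("app2", "redis", 6379))  # False
still_connected_via_nacl = network_acl.permits(existing)        # False
```

In this model the stateful filter keeps serving the flow it already knows about and only rejects new ones, which is why a restart is needed in method 1, while the stateless filter cuts off the existing flow as soon as its rules change.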

Network ACL in Depth

You can either use the default VPC network ACL or create a custom one with rules similar to security groups for extra VPC security at no extra cost.

The following diagram depicts a VPC with two subnets, each having its network ACL. When traffic enters the VPC (such as from a peered VPC, VPN connection, or the internet) the router directs it to its destination.

Network ACL A controls which traffic can enter subnet 1 and which traffic can leave subnet 1 for destinations outside it. Similarly, network ACL B regulates traffic entering and leaving subnet 2.

Creating a Custom Network ACL

As illustrated in the figure below, this is how I've configured the denial of incoming traffic from my application to the ElastiCache cluster.

A network ACL comprises both inbound and outbound rules, each capable of allowing or denying traffic. These rules are numbered from 1 to 32766.
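A deny rule of this kind could also be created from code. The following is a sketch under assumptions, not the configuration shown in this post: `ec2` is assumed to be a boto3 EC2 client, the rule number 90, the port 6379 (the Redis default), and the function names are illustrative choices, and the ACL ID and CIDR in the comments are placeholders.

```python
from typing import Any

REDIS_PORT = 6379  # default Redis port; adjust if the cluster uses another


def deny_redis_inbound(ec2: Any, network_acl_id: str, app_cidr: str,
                       rule_number: int = 90) -> dict:
    """Add an inbound deny rule for Redis traffic coming from app_cidr.

    `ec2` is assumed to be a boto3 EC2 client. The rule number should be
    lower than any existing allow rule that matches the same traffic.
    """
    params = {
        "NetworkAclId": network_acl_id,
        "RuleNumber": rule_number,
        "Protocol": "6",        # protocol number 6 is TCP
        "RuleAction": "deny",
        "Egress": False,        # False means this is an inbound rule
        "CidrBlock": app_cidr,
        "PortRange": {"From": REDIS_PORT, "To": REDIS_PORT},
    }
    ec2.create_network_acl_entry(**params)
    return params


def remove_redis_deny(ec2: Any, network_acl_id: str,
                      rule_number: int = 90) -> None:
    """Delete the inbound deny rule once the test is finished."""
    ec2.delete_network_acl_entry(
        NetworkAclId=network_acl_id, RuleNumber=rule_number, Egress=False
    )


# Hypothetical usage (IDs and CIDR are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   deny_redis_inbound(ec2, "acl-0123456789abcdef0", "10.0.1.0/24")
#   ... observe how the application behaves ...
#   remove_redis_deny(ec2, "acl-0123456789abcdef0")
```

The network ACL must be the one associated with the subnet that hosts the ElastiCache nodes, otherwise the rule has no effect on the traffic being tested.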

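When determining whether to allow or deny traffic, AWS evaluates the rules sequentially, starting with the lowest numbered rule. If a rule matches the traffic, it is applied, and no further rules are assessed. A deny rule therefore only takes effect if its number is lower than any allow rule that matches the same traffic.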
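The first-match behaviour can be sketched in a few lines. This is a simplified model for illustration only: the `Rule` type and `evaluate` function are hypothetical, rules match on destination port alone, and traffic that matches no numbered rule is denied to mirror the default catch-all rule of a network ACL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int
    action: str      # "allow" or "deny"
    port_from: int
    port_to: int


def evaluate(rules: list, port: int) -> str:
    """Return the action of the lowest-numbered rule that matches port."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action  # first match wins, later rules are ignored
    return "deny"  # no numbered rule matched


rules = [
    Rule(number=100, action="allow", port_from=0, port_to=65535),
    Rule(number=90, action="deny", port_from=6379, port_to=6379),
]

redis_result = evaluate(rules, 6379)  # "deny": rule 90 is checked before 100
https_result = evaluate(rules, 443)   # "allow": only rule 100 matches
```

If the deny rule were numbered 110 instead of 90, rule 100 would match first and Redis traffic would still be allowed.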

Conclusion

Testing application resilience is essential to ensure smooth operation in challenging scenarios. While stopping an ElastiCache cluster is not feasible because the service offers no stop operation, alternative approaches such as blocking incoming traffic using security groups or employing network ACLs can help simulate Redis unavailability. By understanding the statefulness of security groups and the statelessness of network ACLs, we can effectively test our application's behaviour when critical resources are not available.

In summary, remember these key points:

ElastiCache clusters cannot be stopped, so Redis unavailability has to be simulated another way.
Security groups are stateful, meaning existing connections persist when rules are modified.
Network ACLs are stateless and can be used to block traffic, potentially breaking existing connections.

Reference:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-connection-tracking.html
[2] https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html
