OS Clusters in SAP on AWS landscapes

#saponaws #aws #sap #cluster

Cluster software is designed to provide automatic high availability and failover in an enterprise environment, so that the business can operate without disruption. Public cloud environments, including AWS, provide resiliency by virtualizing the infrastructure, which makes failures of the underlying physical hardware transparent to the application. However, the infrastructure is not the only layer that can fail. The operating system and its kernel, the storage, the network, and the application itself can also fail and disrupt business processes, so it is important to use design patterns that tolerate such failures. An entire AWS Availability Zone can also suffer a network issue or, in the worst case, a natural disaster, which can lead to an outage for SAP users. Let us look at how clusters use specific AWS services in SAP environments.

The standard SAP ABAP application architecture has two single points of failure. The first is ABAP SAP Central Services (ASCS), which consists of a message server and an enqueue server process. The message server receives requests and dispatches them to an appropriate application server, while the enqueue server manages locks to guarantee application-level consistency. The second is the database, which is the largest component in terms of size and is also key to performance.
To keep SAP on AWS running in a stable manner, several cluster designs are possible. In this blog we will deep-dive into two of them: Pacemaker and Windows Server Failover Cluster (WSFC).

Pacemaker is available for the SUSE Linux and Red Hat Linux operating systems at a small additional cost on top of the operating system licensing. Compared with third-party solutions, this offering is popular because of its low cost and low maintenance. On AWS, there are a few key items to consider.

1. STONITH - STONITH is an abbreviation for Shoot The Other Node In The Head, and it means what it says: a misbehaving node is forcibly removed from the cluster. It is a device or service that is managed as part of the cluster. On AWS with SUSE the fencing agent is named external/ec2 (Red Hat ships fence_aws). It fences a node by stopping or rebooting the EC2 instance through the AWS API, which ensures that no I/O can come from the faulty node. The surviving nodes decide to fence based on the cluster's quorum policy.

2. Overlay IP Agent - It is a cluster resource that manages the Overlay IP, the address that clients use to reach the active node. On failover it updates the route tables so that the Overlay IP points to the new active EC2 instance. Together with fencing, this helps avoid a split-brain situation in a 2-node cluster, because only one node receives traffic at a time. We will deep-dive on this agent in another article.

3. Route Table - The Overlay IP has to be routed to the currently active cluster node (EC2 instance). This is achieved with an entry in the VPC route tables, which the cluster updates when a failover happens.

4. AWS Transit Gateway - A scalable cloud router that allows traffic from on-premises networks and other VPCs to be routed to the Overlay IP.

5. Network Load Balancer - An alternative to Transit Gateway. It is primarily intended for TCP load balancing, but it can also be used to route traffic to the active cluster node.

6. EFS / FSx for Lustre / FSx for Windows File Server - Cluster setups require a shared file system between the participating servers that provides consistent reads and writes. On AWS, this is available via FSx for Lustre for Linux operating systems, while FSx for Windows File Server serves the purpose for Windows. The services are fully managed by AWS, and EFS as well as Multi-AZ deployments of FSx for Windows File Server span Availability Zones. Elastic File System (EFS) also provides file services and is essentially a managed NAS.
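
To make the interplay of the first three items more concrete, here is a minimal sketch in plain Python. It is only a toy model and not the actual cluster agents: the classes, instance IDs and the Overlay IP are made up for illustration, and no AWS API is called. In a real cluster the route change is made against the VPC route table through the EC2 ReplaceRoute API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an EC2 instance that is a cluster member."""
    instance_id: str
    running: bool = True


@dataclass
class RouteTable:
    """Toy model of a VPC route table: destination CIDR -> target instance ID."""
    routes: dict = field(default_factory=dict)

    def replace_route(self, cidr: str, instance_id: str) -> None:
        # Models the effect of replacing one route entry.
        self.routes[cidr] = instance_id


def fence(node: Node) -> None:
    """STONITH step: power the node off so it can issue no further I/O."""
    node.running = False


def fail_over(table: RouteTable, overlay_cidr: str, failed: Node, standby: Node) -> None:
    """Fence the failed node first, then point the Overlay IP at the standby."""
    fence(failed)
    if not standby.running:
        raise RuntimeError("no healthy node left to take over the Overlay IP")
    table.replace_route(overlay_cidr, standby.instance_id)


primary = Node("i-primary")      # made-up instance IDs
secondary = Node("i-secondary")
table = RouteTable({"192.168.10.10/32": primary.instance_id})  # made-up Overlay IP

fail_over(table, "192.168.10.10/32", failed=primary, standby=secondary)
print(table.routes)  # {'192.168.10.10/32': 'i-secondary'}
```

The order matters in this sketch: the failed node is fenced before the route is moved, so there is never a moment when both nodes could serve the Overlay IP.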

All these services work in combination as depicted below

Pacemaker, developed by ClusterLabs (https://clusterlabs.org/pacemaker/), is a cluster resource manager that can detect and recover from machine and application failures. Features such as quorum, constraints that control which resources may run on the same machine, and handling of quorum loss provide the resiliency expected of enterprise-scale applications. Pacemaker is developed for Linux-based operating systems and is integrated into both Red Hat and SUSE Linux, which makes it the default choice for SAP HANA database clusters. It can also be used where SAP central services run on one of these two operating systems. On AWS, both operating systems are supported and available under subscription-based as well as bring-your-own-license models.
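
The idea of quorum can be illustrated with a few lines of Python. This is a simplified sketch that assumes one vote per node and a plain majority rule. The function names are made up for illustration, and special two-node settings that a real cluster stack may offer are ignored.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return expected_votes // 2 + 1


def has_quorum(live_votes: int, expected_votes: int) -> bool:
    """True if the surviving partition holds a majority of the expected votes."""
    return live_votes >= quorum_threshold(expected_votes)


# (expected votes, votes still alive)
for expected, live in [(3, 2), (3, 1), (2, 1)]:
    print(expected, live, has_quorum(live, expected))
```

Under these assumptions a 3-node cluster keeps quorum after losing one node, while a 2-node cluster that loses one node does not. This is why 2-node setups depend so heavily on fencing and additional safeguards against split-brain.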

Windows Server Failover Cluster (WSFC) is Microsoft's cluster management software, available natively on Windows Server. It is used in SAP landscapes where the application layer runs ASCS and ERS (Enqueue Replication Server) on Windows. From an AWS perspective, the design principle is closely aligned with the Pacemaker setup. Windows setups require Active Directory integration to work seamlessly, so this design pattern additionally requires Microsoft Active Directory in the same VPC or a peered VPC.

In addition to the cluster solutions above, there are also other partner products below that can be used to build cluster setups. I will cover a comparison of these products, with pros and cons, in another blog post.