<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ebrahim Gomaa</title>
    <description>The latest articles on DEV Community by Ebrahim Gomaa (@devyetii).</description>
    <link>https://dev.to/devyetii</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F236353%2Fce78a93f-2449-47d8-839e-6a391d90b57c.jpg</url>
      <title>DEV Community: Ebrahim Gomaa</title>
      <link>https://dev.to/devyetii</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devyetii"/>
    <language>en</language>
    <item>
      <title>Container Migration Methodology</title>
      <dc:creator>Ebrahim Gomaa</dc:creator>
      <pubDate>Sun, 10 Oct 2021 18:48:07 +0000</pubDate>
      <link>https://dev.to/devyetii/container-migration-methodology-3gll</link>
      <guid>https://dev.to/devyetii/container-migration-methodology-3gll</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Businesses have recently developed great interest in container-based architectures because they are better suited to agile development practices.&lt;/p&gt;

&lt;p&gt;The following chart from Grand View Research shows the growth of the application&lt;br&gt;
container market size in the United States.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fUqlFlGt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j33hmxjuvf15gk4d3o6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fUqlFlGt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j33hmxjuvf15gk4d3o6k.png" alt="container market size in the United States"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this guide, you, as an architect or operations engineer, will find your way through migrating to an AWS container service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why migrate to the AWS Cloud?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agility and product improvement&lt;/strong&gt; over regular instances&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Savings&lt;/strong&gt; by improving resource utilization and flexible scaling up/down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Elasticity&lt;/strong&gt; of scaling up/down according to the business needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster Innovation&lt;/strong&gt; by moving the control plane to the cloud, freeing more resources for innovation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy globally in minutes&lt;/strong&gt; via 77 AZs across 24 geographic Regions worldwide&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why migrate to AWS container services?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DevOps driver —&lt;/strong&gt; using managed container technology to support DevOps on the cloud platform further improves efficiency and increases the pace of customer migration to containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform as a Service (PaaS) building —&lt;/strong&gt; build your own containerized platform on the cloud and combine it with your DevOps operations for improved efficiency and flexibility, and reduce complexity by unifying the DevOps and production environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations simplification —&lt;/strong&gt; use managed container services to relieve the management and operations burden and gain efficiency from deeply integrated AWS services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expectation of the same user experience as native Kubernetes&lt;/strong&gt; with services like EKS, which provides the convenience of a hosted service plus the freedom of open source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Digital transformation&lt;/strong&gt; which involves the development of digital technologies and support capabilities to create a vibrant digital business model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IoT/ML innovation —&lt;/strong&gt; model training and deployment in the cloud&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep integration with AWS —&lt;/strong&gt; leverage the breadth and depth of AWS cloud technologies, including networking, security, and monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security and compliance —&lt;/strong&gt; AWS offers 210 security-, compliance-, and governance-related services and key features, plus isolation between containers and granular per-container access permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AWS Capabilities for Containers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;For Serverless computing for containers, use &lt;a href="https://aws.amazon.com/fargate/"&gt;AWS Fargate&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;If you want to manage your computing environment for containers, use &lt;a href="https://aws.amazon.com/ec2/"&gt;Amazon EC2 (Elastic Compute Cloud)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For deeply-integrated AWS container orchestration, use &lt;a href="https://aws.amazon.com/ecs/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;amp;whats-new-cards.sort-order=desc&amp;amp;ecs-blogs.sort-by=item.additionalFields.createdDate&amp;amp;ecs-blogs.sort-order=desc"&gt;Amazon ECS (Elastic Container Service)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For managed Kubernetes-in-the-cloud service with zero-refactoring migration, use &lt;a href="https://aws.amazon.com/eks/"&gt;Amazon EKS (Elastic Kubernetes Service)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For fully-managed container registry (container images library), use &lt;a href="https://aws.amazon.com/ecr/"&gt;Amazon ECR (Elastic Container Registry)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;For managing microservice architectures across multiple compute infrastructure services (EC2 - ECS - EKS - Fargate), use &lt;a href="https://aws.amazon.com/app-mesh/?aws-app-mesh-blogs.sort-by=item.additionalFields.createdDate&amp;amp;aws-app-mesh-blogs.sort-order=desc&amp;amp;whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;amp;whats-new-cards.sort-order=desc"&gt;AWS App Mesh&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
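&lt;p&gt;As a minimal sketch of how these choices combine, an eksctl ClusterConfig can declare both an EC2-managed node group and a Fargate profile for the same EKS cluster. The cluster name, region, namespace, and sizes below are placeholders, not values from this guide:&lt;/p&gt;

```yaml
# Hypothetical eksctl configuration: one EKS cluster mixing
# EC2-managed worker nodes with serverless Fargate capacity.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: demo-cluster        # placeholder name
  region: us-east-1         # placeholder region

managedNodeGroups:
  - name: ec2-workers       # EC2-backed capacity you manage
    instanceType: m5.large
    desiredCapacity: 2

fargateProfiles:
  - name: serverless        # pods in this namespace run on Fargate
    selectors:
      - namespace: serverless-apps
```

&lt;p&gt;Creating the cluster from such a file would then be a single &lt;code&gt;eksctl create cluster -f cluster.yaml&lt;/code&gt; command.&lt;/p&gt;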

&lt;h2&gt;
  
  
  Strategy for Container Management
&lt;/h2&gt;

&lt;p&gt;Before formulating the migration plan, architects should evaluate the customer's preparation for the migration process to ensure that it will solve their problems, and to provide a basis for the customer's decisions throughout the plan. The following are the aspects for evaluating the customer's preparation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Business Capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Business Target —&lt;/strong&gt; Targeted benefits from the migration process. Involved roles include business managers, financial managers, budget owners, migration decision makers and stakeholders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;People —&lt;/strong&gt; Which of the customer's IT personnel will be involved in the migration process, the staffing demands for the process according to technical skills, and training the technical staff on the new technology. Involved roles include HR, staffing specialists and people managers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance —&lt;/strong&gt; Evaluating the teams involved in the process and the people in charge of them, as well as the final decision maker in the decision chain. It also involves evaluating the effectiveness of project management tools and communication. For the customer, provide a way to measure project results (e.g. cost reduction ratio). Involved roles include CIOs, PMs and Enterprise Architects.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technical Capabilities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Platform —&lt;/strong&gt; Customer's familiarity with the AWS cloud platform and basic container services within it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container Platform —&lt;/strong&gt; Customer's familiarity with the AWS cloud container platform and their related skill set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assessment Method —&lt;/strong&gt; Consider the &lt;a href="https://aws.amazon.com/blogs/apn/apn-partners-can-now-conduct-aws-immersion-day-workshops-for-their-customers/"&gt;AWS APN Immersion Day&lt;/a&gt; tool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proof of Concept (PoC) —&lt;/strong&gt; Understand and assess the customer's familiarity with the AWS platform and services through a simple PoC. After the demo, the customer should know the basics of EKS, how it differs from other services and platforms, and the best method to understand EKS.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Compliance —&lt;/strong&gt; Evaluate the customer's requirements on AWS infrastructure security, IAM roles and RBAC, such as node IAM, cluster IAM and Pod execution roles. Also evaluate the service accounts the customer needs and their minimum permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring —&lt;/strong&gt; Metrics, tooling and methodology to build the monitoring system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alarm —&lt;/strong&gt; Customer's requirements for alarms, alarm indicators and impact of alarms on business.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analysis —&lt;/strong&gt; Customer's analysis requirements in the container operations field, such as attacks and error root cause analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release Management —&lt;/strong&gt; Tools and process used by the customer for release management and the migration method or optimization plan for release through the evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disaster Tolerance —&lt;/strong&gt; Customer's disaster tolerance requirements and recovery plan.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
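&lt;p&gt;For the service-account permissions mentioned above, EKS supports IAM roles for service accounts (IRSA): a Kubernetes service account is annotated with an IAM role ARN, and pods using it receive only that role's permissions. The account name, namespace, and role below are hypothetical:&lt;/p&gt;

```yaml
# Hypothetical service account bound to an IAM role via IRSA,
# giving pods only that role's least-privilege permissions.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-backend                # placeholder name
  namespace: production            # placeholder namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-backend-role
```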

&lt;h2&gt;
  
  
  Container Migration Maturity Model
&lt;/h2&gt;

&lt;p&gt;This graph shows how difficult the migration project will be depending on two factors: platform operation capabilities (which affect the customer's use of the new container platform during and after migration) and the source tech stack (which affects the difficulty and workload of the migration). The operations team's capabilities and functions also determine whether the customer can achieve the migration goal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XEYPZdRR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0qyk3ab7njscm8204bz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XEYPZdRR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0qyk3ab7njscm8204bz3.png" alt="Migration process difficulty"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical challenges of container and orchestration tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt; — Monitoring methods change from typical servers to containers and services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging —&lt;/strong&gt; Knowing which container on which host to collect logs from is difficult, especially as containers are added, removed and moved between hosts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Troubleshooting —&lt;/strong&gt; Difficulty analyzing container failures based on past behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security —&lt;/strong&gt; The impact of rapid development of the community version on security guarantees. Permission management poses new challenges to operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network —&lt;/strong&gt; Increased difficulty of network planning and design.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Evaluating the source cluster tech stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compatible K8s —&lt;/strong&gt; Migrating to Amazon EKS from a K8s cluster built on AWS or another cloud provider with tools like kubeadm. This is the easiest case because of infrastructure consistency and similar tooling, with some considerations around network plugins, Ingress and image repositories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variant of K8s —&lt;/strong&gt; A K8s cluster provided by a third-party platform like Red Hat OpenShift. Differences in its deployment from a typical K8s cluster add to the difficulty of the migration process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heterogeneous container orchestration engine —&lt;/strong&gt; Migrating from an orchestration stack other than Kubernetes. The huge difference in design concepts and implementation adds to the difficulty of the migration process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerization —&lt;/strong&gt; Migrating from server deployment to a containerized application. This introduces three main risks: lack of developer support in the customer's IT team, lack of deployment instruction documents, and the customer's microservice requirements. It's therefore considered the most technically difficult.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Mobilization stage
&lt;/h2&gt;

&lt;p&gt;This stage has 5 main goals: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Investigate the goals of migration&lt;/li&gt;
&lt;li&gt;Build the migration team&lt;/li&gt;
&lt;li&gt;Assign roles and responsibilities&lt;/li&gt;
&lt;li&gt;Evaluate migration methods&lt;/li&gt;
&lt;li&gt;Formulate migration project plan&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Discover
&lt;/h3&gt;

&lt;p&gt;Understand the customer's technical and business goals and migration targets through questionnaires and interviews.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Discover Business Information (DBI)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;What is the target application system for migration? Which business unit does it belong to? Do they have any important business activities recently?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration cycle&lt;/strong&gt; — When does it start? What is the time span? Is there a clear deadline?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration expectations&lt;/strong&gt; — What is the goal to achieve? Does the customer have any clear metrics?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Personnel&lt;/strong&gt; — What is the number of personnel responsible for the migration? What kind of skills do they possess? Which modules are they responsible for?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — What is the labor cost? Migration cost? Dual environment parallel run cost? Target cluster planning cost?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discover Technical Information (DTI)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Where is the platform of the source cluster?&lt;/li&gt;
&lt;li&gt;What is the source cluster’s actual usage of computing, storage, and network resources?&lt;/li&gt;
&lt;li&gt;Is the application stateful or stateless?&lt;/li&gt;
&lt;li&gt;What are the dependencies among applications?&lt;/li&gt;
&lt;li&gt;What is the technology stack used by the platform where the source cluster is located?&lt;/li&gt;
&lt;li&gt;Is there any source container cluster-specific information?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Report
&lt;/h3&gt;

&lt;p&gt;Prepare a research report including, but not limited to, the following information to support selecting migration methods and &lt;strong&gt;output&lt;/strong&gt; solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster information&lt;/li&gt;
&lt;li&gt;Image repositories&lt;/li&gt;
&lt;li&gt;Log collection subsystem&lt;/li&gt;
&lt;li&gt;Monitoring subsystem&lt;/li&gt;
&lt;li&gt;CI/CD subsystem&lt;/li&gt;
&lt;li&gt;Business impact&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Choose a migration method
&lt;/h3&gt;

&lt;p&gt;Depending on the container migration maturity model (People - Tooling - Source platform), you can recommend a suitable migration method to the customer. Here we discuss the migration methods for the different &lt;strong&gt;source platforms&lt;/strong&gt; mentioned before.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compatible K8s —&lt;/strong&gt; If the source cluster is on AWS, it's easy to migrate with AWS tools, whether stateless or stateful. If it's on another cloud platform, it's easy to migrate with AWS tools when stateless; if stateful, you'll need third-party partner software.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variant Kubernetes —&lt;/strong&gt; Depends on the source platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heterogenous container orchestration engine —&lt;/strong&gt; From the design concept to the specific deployment, different container orchestration engines differ a lot, so this type of migration project becomes very complicated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerization —&lt;/strong&gt; The most complex, requires refactoring of the whole containerized architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With CI/CD system —&lt;/strong&gt; Configuring and labeling the network and worker nodes enables you to automate releases targeting EKS.&lt;/li&gt;
&lt;/ul&gt;
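&lt;p&gt;As a sketch of the CI/CD point above, an AWS CodeBuild buildspec might build an image, push it to ECR, and roll it out to EKS. The registry URI, cluster name, and deployment name are placeholders, not values from this guide:&lt;/p&gt;

```yaml
# Hypothetical CodeBuild buildspec: build, push to ECR, deploy to EKS.
version: 0.2
phases:
  pre_build:
    commands:
      # Authenticate Docker to the (placeholder) ECR registry
      - aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com
      # Point kubectl at the (placeholder) EKS cluster
      - aws eks update-kubeconfig --name demo-cluster --region us-east-1
  build:
    commands:
      # Tag the image with the commit that produced it
      - docker build -t 111122223333.dkr.ecr.us-east-1.amazonaws.com/demo-app:$CODEBUILD_RESOLVED_SOURCE_VERSION .
      - docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/demo-app:$CODEBUILD_RESOLVED_SOURCE_VERSION
  post_build:
    commands:
      # Roll the (placeholder) deployment to the freshly pushed image
      - kubectl set image deployment/demo-app demo-app=111122223333.dkr.ecr.us-east-1.amazonaws.com/demo-app:$CODEBUILD_RESOLVED_SOURCE_VERSION
```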

&lt;h3&gt;
  
  
  Planning
&lt;/h3&gt;

&lt;p&gt;Formulate a guidance plan for the migration process. Using project management best practices and agile delivery, include the following in your plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review project management methods, tools, and capabilities for gap analysis.&lt;/li&gt;
&lt;li&gt;Define project management methods and tools, and how to use them in the project.&lt;/li&gt;
&lt;li&gt;Define project communication methods and problem escalation mechanisms.&lt;/li&gt;
&lt;li&gt;Develop a project task scheduling table, and clarify project risks and solutions.&lt;/li&gt;
&lt;li&gt;Decide the composition of the migration team, and clarify the responsibilities of the team.&lt;/li&gt;
&lt;li&gt;Outline the resources and costs required to migrate the target environment to AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For technical planning, include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover the application dependency, which is critical for project prioritization and planning.&lt;/li&gt;
&lt;li&gt;Clarify the migration priority of the applications, and select the appropriate application system for migration verification.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AWS Landing Zone
&lt;/h3&gt;

&lt;p&gt;The following configuration is predefined for you when you use &lt;a href="https://aws.amazon.com/solutions/implementations/aws-landing-zone/"&gt;AWS Landing Zone&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Account structure —&lt;/strong&gt; Initial multi-account structure with baseline security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network structure —&lt;/strong&gt; Basic network configurations for network isolation, connectivity between AWS and the local network, and user-configurable network access and management options. You should still plan the EKS Pod IP pool based on the characteristics of the CNI network plug-in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Account security baseline —&lt;/strong&gt; Settings for AWS CloudTrail, AWS Config, IAM, cross-account access, Amazon VPC and Amazon GuardDuty&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS user access management —&lt;/strong&gt; A framework for cross-account user IAM based on Microsoft Active Directory, centralized cost management and reporting, and creation and management of users and permissions for the Amazon EKS cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Skills/Cloud Center of Excellence (CCoE)
&lt;/h3&gt;

&lt;p&gt;This is a group of people with AWS and Amazon EKS experience whom you should train to lead the migration process. You should also design how the CCoE team will lead and perform the migration tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operations Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS basic environment management —&lt;/strong&gt; Customers must operate and maintain computing, storage, network and permissions; use managed services to reduce this workload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container cluster operations —&lt;/strong&gt; Worker node management (managed and unmanaged), worker node upgrade methods, dynamic scaling of worker nodes, Pod capacity management, application deployment, and so on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring —&lt;/strong&gt; Monitoring the status of hosts, pods and application servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logging —&lt;/strong&gt; Collecting and processing logs of hosts and pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Release management —&lt;/strong&gt; Version control, CI/CD (DevOps)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change management —&lt;/strong&gt; Deployment and process description of the change management tools to manage changes in the original process throughout the migration process&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security and compliance
&lt;/h3&gt;

&lt;p&gt;According to the customer's needs, and following these best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Design&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;For cloud infrastructure security, see &lt;a href="https://d1.awsstatic.com/architecture-diagrams/ArchitectureDiagrams/dod-scca-multiaccount-ra.pdf"&gt;Secure Cloud Computing Architecture (SCCA) on AWS GovCloud (US)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;IAM Roles — User roles, resource roles, and Pod roles&lt;/li&gt;
&lt;li&gt;Managed or unmanaged Node Groups&lt;/li&gt;
&lt;li&gt;Control SSH login&lt;/li&gt;
&lt;li&gt;EC2 security group — Security group reference and port opening between services&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Network isolation — VPC, subnet, AWS PrivateLink, VPC peering&lt;/li&gt;
&lt;li&gt;Restrict network access to API server endpoint&lt;/li&gt;
&lt;li&gt;Open the private endpoint of the API server&lt;/li&gt;
&lt;li&gt;Protecting service load balancers — Network Load Balancer (NLB) or Application Load Balancer (ALB)? ALB ingress or NGINX ingress?&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Build secure images — Content addressable image identifier (CAIID)&lt;/li&gt;
&lt;li&gt;Use vulnerability scanning — Images scanner tools&lt;/li&gt;
&lt;li&gt;Image Repository — Use private image repository&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runtime Security —&lt;/strong&gt; Restricting Pod permissions

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Namespace —&lt;/strong&gt; Provide scoping for cluster objects; allow fine-grained cluster object management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RBAC —&lt;/strong&gt; Manage the authorization of the cluster based on the least privilege principle, with periodic audits to protect customers from external threats and internal misconfiguration or accidents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restrict the runtime permissions —&lt;/strong&gt; Minimizing capabilities of the running containers to protect from malicious and misbehaving containers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pod security strategy —&lt;/strong&gt; Enforcing K8s and EKS security best practices (e.g. Not running as root - not sharing host node's process or network space - enforcing SELinux)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
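&lt;p&gt;The runtime restrictions listed above translate directly into a pod spec. Here is a minimal sketch; the pod name, namespace and image are placeholders:&lt;/p&gt;

```yaml
# Hypothetical pod applying the runtime-security practices above:
# non-root user, no privilege escalation, minimal capabilities.
apiVersion: v1
kind: Pod
metadata:
  name: restricted-app            # placeholder name
spec:
  securityContext:
    runAsNonRoot: true            # refuse to start as root
  containers:
    - name: app
      image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/demo-app:latest
      securityContext:
        allowPrivilegeEscalation: false   # block setuid-style escalation
        readOnlyRootFilesystem: true      # immutable container filesystem
        capabilities:
          drop: ["ALL"]                   # drop all Linux capabilities
```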

&lt;h3&gt;
  
  
  Migration
&lt;/h3&gt;

&lt;p&gt;Implement your migration plan, starting with a simple application chosen by the customer to trial the migration experience. This stage is preceded by designing four plans: the migration plan (AWS architecture, application architecture, operations process), the testing plan, the cutover plan, and a rollback plan in case of an unsuccessful cutover. The CCoE team should lead the migration process, and you can also use automated tooling. It's good practice to set a customer-specific checklist to confirm migration completion before cutover.&lt;/p&gt;

&lt;h3&gt;
  
  
  Validation
&lt;/h3&gt;

&lt;p&gt;Test the migration before cutover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functional verification&lt;/li&gt;
&lt;li&gt;Performance verification&lt;/li&gt;
&lt;li&gt;Disaster Recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cutover
&lt;/h3&gt;

&lt;p&gt;Switch the transaction flow to the new system under close observation. If any abnormal behavior is detected, execute the rollback plan. This process requires the playbook and runbook to be produced beforehand. It should also be rehearsed before execution, with extensive CCoE team support, given the team's wide experience across diverse migrations.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Web Application Hosting in the AWS Cloud</title>
      <dc:creator>Ebrahim Gomaa</dc:creator>
      <pubDate>Wed, 08 Sep 2021 22:41:15 +0000</pubDate>
      <link>https://dev.to/awsmenacommunity/web-application-hosting-in-the-aws-cloud-19j3</link>
      <guid>https://dev.to/awsmenacommunity/web-application-hosting-in-the-aws-cloud-19j3</guid>
      <description>&lt;h2&gt;
  
  
  An overview of traditional web hosting
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27t5puiw8nr8vwiwqize.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27t5puiw8nr8vwiwqize.png" alt="Traditional"&gt;&lt;/a&gt;&lt;br&gt;
This image depicts the traditional architecture of a three-tier web app. In the following sections, we'll show how easily this architecture can be built using AWS.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Web app hosting in the cloud using AWS
&lt;/h2&gt;

&lt;p&gt;After studying the value of moving to the cloud and deciding it's better for your case, this section helps you architect your application in the cloud using AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS can solve common web app hosting problems
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cost effectiveness&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Leverage automatic scaling (up/down) based on traffic to cut out unused capacity at non-peak times and ensure cost-effective use of resources.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0ahpdzghryoq7ymvd5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw0ahpdzghryoq7ymvd5w.png" alt="Cost Effectiveness"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fast-response scalability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Fast-responding scalability in case of unexpected loads, compared to the downtime encountered by traditionally hosted apps during unexpected peaks.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Managing Different environments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Easily and cost-effectively manage environments (test/beta/staging) to ensure the quality of the application at different stages of its development lifecycle, spinning up this parallel fleet only when and as needed. You can also use the parallel fleet as a staging environment for your new release and leverage &lt;a href="https://en.wikipedia.org/wiki/Blue-green_deployment" rel="noopener noreferrer"&gt;Blue-Green Deployment&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
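&lt;p&gt;One common way to implement the blue-green cutover mentioned above is with Route 53 weighted records: shifting the weights moves traffic between the two fleets. The domain, zone, and DNS names in this CloudFormation fragment are placeholders:&lt;/p&gt;

```yaml
# Hypothetical CloudFormation fragment: weighted DNS records for
# blue-green deployment. Shift the Weight values to move traffic.
Resources:
  BlueRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: app.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: blue
      Weight: 100                   # all traffic to the blue fleet
      ResourceRecords:
        - blue-lb.example.com       # placeholder load balancer DNS name
  GreenRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: example.com.
      Name: app.example.com.
      Type: CNAME
      TTL: "60"
      SetIdentifier: green
      Weight: 0                     # raise gradually during cutover
      ResourceRecords:
        - green-lb.example.com      # placeholder load balancer DNS name
```

&lt;p&gt;The low TTL keeps the cutover (and any rollback) responsive while weights change.&lt;/p&gt;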

&lt;h3&gt;
  
  
  An AWS Cloud architecture for web app hosting
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dqxvobaumakeorbbhjb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2dqxvobaumakeorbbhjb.png" alt="AWS Cloud Arch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DNS services with Amazon Route 53&lt;/strong&gt; — simplify domain management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/cache-hit-ratio-explained.html" rel="noopener noreferrer"&gt;Edge caching&lt;/a&gt; with Amazon CloudFront&lt;/strong&gt; — decrease the latency of content delivered to users&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge security for Amazon CloudFront with AWS WAF&lt;/strong&gt; — customer-defined rules to filter malicious traffic (XSS, SQL injection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing with Elastic Load Balancing (ELB)&lt;/strong&gt; — spread load over Availability Zones and use AWS Auto Scaling groups for redundancy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDoS protection with AWS Shield&lt;/strong&gt; — automatic protection against network- and transport-layer DDoS attacks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firewalls with security groups&lt;/strong&gt; — host-level stateful firewalls for both web and app servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching with Amazon ElastiCache&lt;/strong&gt; — leverage Redis or Memcached to lower the latency of frequent requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed databases with Amazon RDS&lt;/strong&gt; — highly available, multi-AZ DB architecture with six possible DB engines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static storage and backups with Amazon S3&lt;/strong&gt; — simple, HTTP-based object storage for backups and assets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Components of AWS Cloud web app hosting architecture
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network Management&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Security groups provide host-level security&lt;/li&gt;
&lt;li&gt;Amazon VPC :

&lt;ul&gt;
&lt;li&gt;enables running resources in an isolated network that you defined.&lt;/li&gt;
&lt;li&gt;helps create hybrid architecture via hardware VPNs to extend your datacenter using AWS cloud.&lt;/li&gt;
&lt;li&gt;Works with both IPv4 and IPv6.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Content Delivery&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;A CDN (Content Delivery Network) provides a network of edge locations to deliver your content in a geo-dispersed fashion through edge caching.&lt;/li&gt;
&lt;li&gt;For dynamic content, CDN retrieves data from the origin server&lt;/li&gt;
&lt;li&gt;You can use CloudFront as a global network of your static, dynamic and streaming content.&lt;/li&gt;
&lt;li&gt;CloudFront is optimized for working with AWS services (like S3 and EC2) with a pay-as-you-go pricing method.&lt;/li&gt;
&lt;li&gt;Any other edge caching solution should work well in the AWS cloud.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Managing public DNS&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Route 53 is a scalable and highly-available AWS-optimized cloud DNS service. It's also fully compliant with IPv6&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Host Security&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use EC2 security groups, which are analogous to firewalls, to limit inbound access to your instance to only specific subnets, IP addresses and resources.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Load Balancing&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Amazon ELB (Elastic Load Balancer) is used to distribute incoming traffic across multiple targets in the same AZ or across multiple AZs.&lt;/li&gt;
&lt;li&gt;It offers four types of load balancers, all providing high availability, scalability, and security.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Finding hosts and services&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Most IP addresses in AWS are dynamic.&lt;/li&gt;
&lt;li&gt;EC2 instances receive both public and private DNS endpoints; the public endpoint is reachable over the internet.&lt;/li&gt;
&lt;li&gt;You should assign a static IP address (an Elastic IP in AWS terminology) to instances and services that require consistent endpoints, such as primary databases, central file servers, and EC2-hosted load balancers.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Caching&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Amazon ElastiCache is a highly available, scalable in-memory cache web service that's protocol-compliant with Memcached and Redis.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;DB (Config, Backups and failover)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Using Amazon RDS&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Provides access to popular DB engines in the cloud&lt;/li&gt;
&lt;li&gt;Supports MySQL, PostgreSQL, MariaDB, Oracle, Microsoft SQL Server, and Amazon Aurora&lt;/li&gt;
&lt;li&gt;Easy and flexible scalability of both compute resources and storage capacity&lt;/li&gt;
&lt;li&gt;Backup with retention periods&lt;/li&gt;
&lt;li&gt;Multi-AZ deployments for increased availability&lt;/li&gt;
&lt;li&gt;Read replicas to scale out for heavy read workloads&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Hosting an RDBMS on an EC2 instance&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Install your RDBMS of choice on an EC2 instance&lt;/li&gt;
&lt;li&gt;Ultimate flexibility of architecture to fit your requirements&lt;/li&gt;
&lt;li&gt;Use Amazon EBS for fault-tolerant storage of data and logs.&lt;/li&gt;
&lt;li&gt;For demanding workloads, you can use Amazon EBS Provisioned IOPS and specify the IOPS required.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Non-relational DBs&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Amazon DynamoDB : Cloud-native solution with all AWS goodness out-of-the-box.&lt;/li&gt;
&lt;li&gt;Amazon DocumentDB : Ready for JSON data at scale. Compatible with MongoDB&lt;/li&gt;
&lt;li&gt;Amazon Keyspaces : Full compatibility with Apache Cassandra&lt;/li&gt;
&lt;li&gt;Amazon Neptune : Reliable and fully managed graph DB.&lt;/li&gt;
&lt;li&gt;Amazon QLDB (Quantum Ledger DB) : Fully managed ledger DB with transparent, immutable and cryptographically verifiable transaction log owned by a central authority.&lt;/li&gt;
&lt;li&gt;Amazon Timestream : Serverless time series DB for IoT and operational applications.&lt;/li&gt;
&lt;li&gt;You may use EC2 to host any other non-relational DB you're working with&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Storage and backups&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use Amazon S3 for static storage like files and media.&lt;/li&gt;
&lt;li&gt;Use Amazon EBS as attachable storage volumes with EC2 instances.&lt;/li&gt;
&lt;li&gt;An EBS volume has a lifecycle independent of the instance it's attached to.&lt;/li&gt;
&lt;li&gt;You can take a snapshot of an EBS volume and store it on S3. Since only changed blocks are stored, more frequent snapshots decrease snapshot time.&lt;/li&gt;
&lt;li&gt;EBS volumes go as large as 16 TB, and you can stripe multiple volumes for increased I/O performance.&lt;/li&gt;
&lt;li&gt;Use EBS Provisioned IOPS to meet the needs of your I/O-intensive workloads - from 16,000 IOPS (all instance types) to 64,000 IOPS (Nitro-based instances), with the io2 Block Express volume type offering up to 256,000 IOPS and 64 TB of storage.&lt;/li&gt;
&lt;/ul&gt;
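&lt;p&gt;The incremental nature of EBS snapshots can be sketched with a toy model (the block contents and helper names are hypothetical, not the EBS API): each snapshot records only blocks that changed since the previous one.&lt;/p&gt;

```python
# Toy model of incremental EBS-style snapshots: a snapshot stores only the
# blocks that differ from the previous snapshot's state, so frequent
# snapshots stay small and fast.
def take_snapshot(volume, previous_state=None):
    previous_state = previous_state or {}
    return {b: d for b, d in volume.items() if previous_state.get(b) != d}

volume = {0: "boot", 1: "app-v1", 2: "logs"}
snap1 = take_snapshot(volume)          # first snapshot: all 3 blocks copied
state_after_snap1 = dict(volume)

volume[1] = "app-v2"                   # only block 1 changes afterwards
snap2 = take_snapshot(volume, state_after_snap1)

print(len(snap1), len(snap2))  # 3 1 - the incremental snapshot is much smaller
```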


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Automatic Scaling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use Auto Scaling along with CloudWatch and Amazon ELB (Elastic Load Balancer) to scale your fleet up/down/in/out automatically based on monitored metrics.&lt;/li&gt;
&lt;li&gt;Use Auto Scaling groups to scale different layers of the application independently.&lt;/li&gt;
&lt;li&gt;You can also scale EC2 instances manually using the EC2 API&lt;/li&gt;
&lt;/ul&gt;
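&lt;p&gt;The scale-out/scale-in decision can be sketched as a simple policy function; the thresholds, step size, and bounds below are illustrative assumptions, not AWS defaults:&lt;/p&gt;

```python
# Toy scaling policy in the spirit of Auto Scaling driven by CloudWatch
# alarms: scale out when average CPU is high, scale in when it is low,
# and always stay within the group's min/max bounds.
def desired_capacity(current, avg_cpu, min_size=2, max_size=10,
                     high=70.0, low=25.0):
    if avg_cpu > high:
        target = current + 1      # scale out one instance at a time
    elif low > avg_cpu:
        target = current - 1      # scale in
    else:
        target = current          # metric is in the comfortable band
    return max(min_size, min(max_size, target))

print(desired_capacity(4, 85.0))  # 5
print(desired_capacity(4, 10.0))  # 3
print(desired_capacity(2, 10.0))  # 2 - clamped at min_size
```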


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Additional Security Features&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The large scale of the AWS network helps protect you against DDoS attacks by scaling your app in response to traffic spikes using ELB, CloudFront, and Route 53.&lt;/li&gt;
&lt;li&gt;AWS Shield : Managed service that protects you against various forms of DDoS attack. Its standard offering is free, active in your account by default, and protects against common attacks. The advanced offering provides near-real-time visibility into the attack, integration with other services, and access to the AWS DDoS Response Team for large-scale, sophisticated attacks.&lt;/li&gt;
&lt;li&gt;AWS WAF (Web Application Firewall) : Works with CloudFront or Application Load Balancer to protect your apps against XSS, SQL injection, and DDoS attacks. Also comes with a full-featured API to help you automate.&lt;/li&gt;
&lt;li&gt;AWS Firewall Manager : Centrally configure and manage firewall rules across your accounts and applications in &lt;strong&gt;AWS Organizations&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Failover&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Availability Zones are physically separated locations that provide redundancy and fault tolerance. It's recommended to deploy your EC2 instances across multiple AZs, and to distribute resources among AZs so that the loss of one AZ doesn't compromise availability; most AWS managed services handle this distribution for you.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key considerations on using AWS for web app hosting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  No more physical appliances
&lt;/h3&gt;

&lt;p&gt;No more hardware firewalls, routers, or load balancers in your AWS Cloud architecture; everything is a software solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Firewalls everywhere
&lt;/h3&gt;

&lt;p&gt;Every host is locked down with a firewall. Analyze the traffic between hosts within your architecture to determine which ports need to be open, and create security groups accordingly. You can use network access control lists within Amazon VPC for subnet-level lockdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consider the availability of multiple Datacenters
&lt;/h3&gt;

&lt;p&gt;Think of AZs within an AWS region as separate Datacenters, logically and physically separated. You can use Amazon VPC to keep your resources in the same logical network while leveraging AZs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Treat hosts as ephemeral and dynamic
&lt;/h3&gt;

&lt;p&gt;Make no assumptions about a host's IP address, location, or everlasting availability. The key to fault tolerance and high scalability is a dynamic design that fits the elastic nature of the cloud.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consider containers and serverless
&lt;/h3&gt;

&lt;p&gt;Consider modernizing your application using containers and serverless technologies, leveraging services like AWS Fargate and AWS Lambda for more agile apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consider automated deployment
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Lightsail : Simple app development VPS with everything needed to build a Web app or website. Ideal for simple workloads and quick deployments.&lt;/li&gt;
&lt;li&gt;AWS Elastic Beanstalk : Easy-to-use service for deploying and scaling web apps developed with popular technologies (Ruby, Node.js, Docker, ...) on familiar servers (Apache, NGINX, ...)&lt;/li&gt;
&lt;li&gt;AWS App Runner : Quickly deploy your containerized web apps at scale, with no prior infrastructure experience required.&lt;/li&gt;
&lt;li&gt;AWS Amplify : Framework of tools and services to help front-end web and mobile developers to build scalable products with an AWS-powered backend. Also used for deploying static web apps.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Migration to an AWS cloud architecture requires some consideration and changes, but really pays off.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>webdev</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Amazon Aurora MySQL Database Administrator’s Handbook</title>
      <dc:creator>Ebrahim Gomaa</dc:creator>
      <pubDate>Tue, 24 Aug 2021 10:37:46 +0000</pubDate>
      <link>https://dev.to/awsmenacommunity/amazon-aurora-mysql-database-administrator-s-handbook-146d</link>
      <guid>https://dev.to/awsmenacommunity/amazon-aurora-mysql-database-administrator-s-handbook-146d</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Aurora MySQL is a managed relational DB engine compatible with MySQL 5.6 &amp;amp; 5.7. You can still use the drivers, connectors, and tools you're used to from MySQL with (almost) no change. Aurora MySQL DB clusters provide features like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One primary read/write (RW) instance and up to 15 read-only (RO) replicas&lt;/li&gt;
&lt;li&gt;Any RO instance can be promoted to RW in case of failure of the primary instance&lt;/li&gt;
&lt;li&gt;Dynamic cluster endpoint (i.e. URI or address) always pointing to the primary instance even in case of failover&lt;/li&gt;
&lt;li&gt;A reader endpoint including all RO replicas, updated as replicas are added or removed&lt;/li&gt;
&lt;li&gt;Admins can create custom DNS endpoints containing their own configuration of DB instances within a single cluster&lt;/li&gt;
&lt;li&gt;Improved scalability using internal connection pools and thread multiplexing for each server&lt;/li&gt;
&lt;li&gt;Almost zero-down-time DB restart/recovery&lt;/li&gt;
&lt;li&gt;Almost Real-Time metadata accessible by application developers enabling them to create smart drivers and connect directly to the instances based on their roles (RW - RO)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But to get the most out of these perks, DBAs need to learn the best practices, because any sub-optimal configuration of applications, drivers, connectors, or proxies can lead to unexpected downtime and performance issues. Consider this article &lt;em&gt;The Aurora MySQL configuration best practices cookbook&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  DNS Endpoints
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.aws.amazon.com%2FAmazonRDS%2Flatest%2FAuroraUserGuide%2Fimages%2FAuroraArch001.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.aws.amazon.com%2FAmazonRDS%2Flatest%2FAuroraUserGuide%2Fimages%2FAuroraArch001.png" alt="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/images/AuroraArch001.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/images/AuroraArch001.png" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/images/AuroraArch001.png&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As you can see from the diagram above, Aurora DB has some &lt;em&gt;Compute&lt;/em&gt; instances connected to a Multi-tenant (can serve many clusters), Multi-attach (can have multiple instances attached to it) &lt;em&gt;Storage&lt;/em&gt; volume. The compute instances are one primary RW instance (M) and up to 15 RO replicas (R) - per cluster. RO instances can take over the RW instance in case of failure.&lt;/p&gt;

&lt;p&gt;But how do you connect to these instances in an optimal way? Aurora supports four types of DNS endpoints.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Endpoint&lt;/strong&gt; : Follows the primary instance, even across failovers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reader Endpoint&lt;/strong&gt; : Includes all RO instances under a single DNS CNAME, so it can be used for round-robin load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance Endpoints&lt;/strong&gt; : Connect directly to a specific instance (RW or RO)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Endpoints&lt;/strong&gt; : User-defined DNS endpoints containing a selected group of instances within a single cluster&lt;/li&gt;

&lt;p&gt;You can use any of the four types wherever suitable to reach an optimal configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connection Handling in Aurora MySQL and MySQL
&lt;/h3&gt;

&lt;p&gt;MySQL Community Edition dedicates one OS thread from the &lt;code&gt;mysqld&lt;/code&gt; process to each connection (one-thread-per-connection). With a large number of user connections, this leads to scalability issues, such as high memory usage even when some connections are idle, and heavy context-switching overhead between threads.&lt;/p&gt;

&lt;p&gt;As a solution, Aurora MySQL supports a thread pool approach (group of threads ready for any connection on-demand usage). Those threads are never dedicated to any single connection usage. Threads are multiplexed, that is, when a thread is being used by a connection and it's not actively executing (e.g. waiting for IO), the thread can switch to another connection to do useful work; thus gaining best utilization and serving many connections with just a few threads. The thread pool also scales up and down automatically according to usage, no manual configuration required.&lt;/p&gt;

&lt;p&gt;Although thread pooling reduces the server-side cost of maintaining connections, it doesn't eliminate the cost of setting up and terminating connections, especially when a connection carries session-level configuration (like &lt;code&gt;SET variable_name = value&lt;/code&gt;). This process involves an exchange of several network packets. For busy workloads with short-lived connections (like Online Transaction Processing), consider using an application-side connection pool.&lt;/p&gt;
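&lt;p&gt;The idea behind an application-side pool can be sketched in a few lines of Python. Here &lt;code&gt;connect&lt;/code&gt; is a stand-in for a real driver call, not an actual MySQL API; the point is that the handshake cost is paid once per pooled connection, not once per request:&lt;/p&gt;

```python
import queue

# Minimal application-side connection pool sketch. Reusing a pooled
# connection skips the setup/teardown round trips that even a server-side
# thread pool cannot avoid.
class ConnectionPool:
    def __init__(self, connect, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())   # pay the handshake cost once, up front

    def borrow(self):
        return self._pool.get()         # blocks if every connection is busy

    def give_back(self, conn):
        self._pool.put(conn)

handshakes = 0
def connect():                          # stand-in for a real driver connect()
    global handshakes
    handshakes += 1
    return object()

pool = ConnectionPool(connect, size=2)
for _ in range(100):                    # 100 "requests"...
    c = pool.borrow()
    pool.give_back(c)
print(handshakes)  # 2 - not 100: connections are reused
```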

&lt;h3&gt;
  
  
  Common Misconceptions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;❌ &lt;strong&gt;No need for application-side connection pool when a server-side connection pool is used&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As mentioned before, server-side pooling has the limitation that it doesn't eliminate the overhead of setting up and terminating connections. So if your application opens and closes connections very frequently and executes few statements per connection, you need application-side pooling. Even if your connections are long-lived, you may benefit from app-side pooling to avoid large bursts of new connection attempts, i.e. connection surges. You can use the &lt;code&gt;tcpdump&lt;/code&gt; tool to monitor your connections and compare overhead packets with useful processing packets to help you make the decision.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ &lt;strong&gt;Idle connections don't use memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Incorrect! Both the OS and the database process allocate in-memory descriptors for each connection. Although Aurora MySQL typically uses less memory per connection than MySQL CE, the overhead is non-zero. So, basically, avoid opening more connections in your app-side pool than you need.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;❌ &lt;strong&gt;Downtime depends entirely on DB stability and features&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Incorrect! Your app design and configuration also matters. For this, read the next section to know how your practices can help user traffic recover faster following a DB event.&lt;/p&gt;
&lt;h3&gt;
  
  
  Best Practices
&lt;/h3&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Using Smart Drivers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Although the Aurora MySQL &lt;em&gt;Cluster&lt;/em&gt; and &lt;em&gt;Reader&lt;/em&gt; endpoints abstract (hide) the topology of the cluster, taking the topology into account when designing your connector helps greatly in eliminating delays that occur because of DNS updates. For this reason, Aurora MySQL provides a near-real-time metadata table ( &lt;code&gt;INFORMATION_SCHEMA.REPLICA_HOST_STATUS&lt;/code&gt; ) carrying information about the instances in the cluster and their roles, and it can be queried from any instance in the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo40kasvij2k2v4575c8y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo40kasvij2k2v4575c8y.png" alt="replica_host_status table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Example query against the metadata table. Source : the original paper&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Smart drivers&lt;/em&gt; are drivers/connectors that utilize this table to improve query routing, rather than relying only on the high-level DNS endpoints; for example, round-robin load balancing read-only connections across the reader instances. One example is the MariaDB Connector/J for Java.&lt;/p&gt;

&lt;p&gt;Note that using a smart connector doesn't replace the rest of the best practices; you still need to apply the other recommendations in this article to reach an optimal configuration. Also note that connectors with Aurora-specific features may not be officially verified by AWS and need to be kept up to date, as they receive far more updates than barebones connectors. &lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;DNS Caching&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb: the DNS caching TTL of Aurora endpoints is 5 seconds.&lt;/strong&gt; Your configuration should NEVER exceed this limit. Caching may occur at the network layer, in the OS, or in your application, so make sure no caching layer exceeds the TTL.&lt;br&gt;
Exceeding the TTL means working with outdated DNS data, which may lead to connecting to a demoted primary instance as if it were still the primary, connection failures to reader instances after scaling up/down due to stale IPs, or unequal distribution of traffic among reader instances.&lt;/p&gt;
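&lt;p&gt;A cache that honors the 5-second TTL might look like the sketch below. The resolver and clock are injected stand-ins, not a real DNS client, and the hostnames and IPs are hypothetical:&lt;/p&gt;

```python
import time

AURORA_DNS_TTL = 5.0  # seconds - Aurora endpoint records use a 5s TTL

# Minimal DNS cache sketch that refuses to serve entries older than the
# TTL. A stale entry forces a fresh lookup, so a failover (new primary IP)
# is picked up within 5 seconds.
class DnsCache:
    def __init__(self, resolve, clock=time.monotonic):
        self._resolve, self._clock = resolve, clock
        self._cache = {}  # hostname to (ip, fetched_at)

    def lookup(self, host):
        entry = self._cache.get(host)
        if entry is not None and AURORA_DNS_TTL > self._clock() - entry[1]:
            return entry[0]                     # still fresh: serve cached IP
        ip = self._resolve(host)                # expired or missing: re-resolve
        self._cache[host] = (ip, self._clock())
        return ip

now = [0.0]
ips = iter(["10.0.1.5", "10.0.2.9"])            # primary changes after failover
cache = DnsCache(lambda h: next(ips), clock=lambda: now[0])
print(cache.lookup("db.cluster"))  # 10.0.1.5
now[0] = 3.0
print(cache.lookup("db.cluster"))  # 10.0.1.5 - cached, still within the TTL
now[0] = 6.0
print(cache.lookup("db.cluster"))  # 10.0.2.9 - TTL expired, fresh lookup
```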
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Connection management and pooling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always close connections explicitly; don't rely on the development language/framework to close them automatically, as there may be scenarios where this doesn't happen.&lt;/li&gt;
&lt;li&gt;If you can't rely on client-interactive applications to close idle connections, use the &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_interactive_timeout" rel="noopener noreferrer"&gt;&lt;code&gt;interactive_timeout&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://dev.mysql.com/doc/refman/8.0/en/server-system-variables.html#sysvar_wait_timeout" rel="noopener noreferrer"&gt;&lt;code&gt;wait_timeout&lt;/code&gt;&lt;/a&gt; MySQL variables to cap idle connection lifetime.&lt;/li&gt;
&lt;li&gt;As mentioned before, use connection pooling to protect your DB against surges, or if you make thousands of short-lived connections per second. If your framework doesn't support connection pooling, consider using a connection proxy like ProxySQL.&lt;/li&gt;
&lt;li&gt;Best practices with managing connection pools and proxies :

&lt;ul&gt;
&lt;li&gt;Check the health of a borrowed connection before using it. This can be as simple as a &lt;code&gt;SELECT 1&lt;/code&gt;, or selecting the &lt;code&gt;@@innodb_read_only&lt;/code&gt; variable to also learn the role of the Aurora instance you're communicating with - &lt;code&gt;true&lt;/code&gt; if it's a reader instance.&lt;/li&gt;
&lt;li&gt;Periodically health-check the connections&lt;/li&gt;
&lt;li&gt;Recycle ALL connections periodically by closing them and opening new ones. This helps save resources and prevents &lt;a href="https://stackoverflow.com/questions/587965/what-is-runaway-query" rel="noopener noreferrer"&gt;runaway queries&lt;/a&gt; and zombie connections (connections with abandoned clients).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
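&lt;p&gt;A borrow-time health check can be as small as the sketch below. Here &lt;code&gt;sqlite3&lt;/code&gt; stands in for a real MySQL driver so the example is self-contained; with a MySQL connector the same &lt;code&gt;SELECT 1&lt;/code&gt; probe applies:&lt;/p&gt;

```python
import sqlite3

# Probe a borrowed connection with the classic `SELECT 1` before handing
# it to application code. A pool would discard and replace any connection
# that fails the probe.
def is_healthy(conn):
    try:
        return conn.execute("SELECT 1").fetchone() == (1,)
    except Exception:
        return False

conn = sqlite3.connect(":memory:")
print(is_healthy(conn))   # True
conn.close()
print(is_healthy(conn))   # False - execute() raises on a closed connection
```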

&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Connection Scaling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you scale out, the number of connections grows in proportion to the number of application server instances, given a fixed number of connections per server. In extreme cases this may limit DB scalability, as most of the connections are typically idle yet still consume server resources.&lt;/p&gt;

&lt;p&gt;To address this, you can reduce the number of connections per server to the minimum applicable, although this doesn't scale well as your app grows. A much better solution is to use a proxy between the application servers and the DB. A proxy comes with many features out of the box, like a configurable fixed number of connections, query caching, connection buffering, and load balancing. Proxies like ProxySQL, ScaleArc, and MaxScale are compatible with the MySQL protocol. For further scalability and availability, you can run multiple proxy instances behind the same DNS endpoint.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Transaction Management and Autocommit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Autocommit mode ensures that every statement runs in its own transaction, which is committed automatically. This mode is recommended because disabling it leaves transactions open, possibly for a long time, blocking the garbage collection mechanism and filling the garbage collection backlog, which leads to excessive storage consumption and query slowness.&lt;/p&gt;

&lt;p&gt;It's recommended to always use autocommit mode and to double-check that it's enabled on both the application and DB sides, especially in the application, where it may not be enabled by default. When you do need multi-statement transactions, manage them explicitly using &lt;code&gt;START/BEGIN TRANSACTION&lt;/code&gt; and &lt;code&gt;COMMIT/ROLLBACK&lt;/code&gt;, and finish them as soon as possible. These recommendations apply whenever you're using InnoDB.&lt;/p&gt;

&lt;p&gt;You can also monitor transaction time using the &lt;code&gt;information_schema.innodb_trx&lt;/code&gt; table. &lt;code&gt;trx_started&lt;/code&gt; is the starting time of the transaction, so you can use it to calculate a transaction's age and investigate transactions whose age is on the order of minutes.&lt;/p&gt;

&lt;p&gt;For garbage collection backlog monitoring, use the &lt;code&gt;trx_rseg_history_len&lt;/code&gt; counter in the &lt;code&gt;information_schema.innodb_metrics&lt;/code&gt; table. If it's on the order of tens of thousands, garbage collection is delayed. If it's in the millions, the situation is dangerous and needs investigation.&lt;/p&gt;

&lt;p&gt;Note that garbage collection in Aurora is cluster-wide, meaning that any performance issue will affect all the instances, so you need to monitor all the instances.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Connection Handshakes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Opening a new DB session usually involves executing several setup statements, such as setting session variables, which significantly affects latency-sensitive applications. You can inspect the driver's internal operations using Aurora Advanced Auditing, the General Query Log, or a network-level packet trace with &lt;code&gt;tcpdump&lt;/code&gt;. Know the purpose of each statement and its effect on subsequent queries. If the number of round trips spent on handshake operations is significant relative to actual work, consider disabling some handshake statements or using connection pooling.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Load Balancing with the Reader Endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DNS resolution of the Reader endpoint is load balanced in round-robin fashion for every new connection; all queries on the same connection are executed against the same instance. This may lead to unequal usage of read replicas, a long ramp-up delay for newly added instances, and applications continuing to send traffic to stopped instances when DNS is cached too long. Be sure to follow the DNS caching best practices mentioned earlier.&lt;/p&gt;
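&lt;p&gt;A toy model of why long-lived connections can defeat Reader-endpoint balancing (replica names here are hypothetical): round-robin happens only at connection time, so every query on an existing connection hits the same replica.&lt;/p&gt;

```python
import itertools

# Simulate Reader-endpoint DNS: each NEW connection gets the next replica
# in round-robin order, but queries reuse whatever replica the connection
# was pinned to when it was opened.
replicas = itertools.cycle(["reader-1", "reader-2", "reader-3"])

def new_connection():
    return next(replicas)      # DNS round-robin picks a replica once

conn = new_connection()
queries = [conn for _ in range(1000)]   # 1000 queries over one connection
print(set(queries))  # {'reader-1'} - all traffic pinned to a single replica
```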


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Designing for Fault Tolerance and Quick Recovery&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As you scale your application up, you're likely to add more instances (DB, application, ...) and also to face more issues. Design your application to be resilient in these situations. Keep your application up to date with failovers of the Aurora primary instance (failover occurs within 30 seconds of the failure). Also track newly created reader instances so you can start sending traffic to them, and removed instances so you can stop. Not following these practices may lead to longer downtime.&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Server Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configuration Variable &lt;code&gt;max_connections&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This variable limits the number of connections per Aurora DB instance. The best practice is to keep it slightly higher than the number of connections you expect to open. But beware if you're using &lt;code&gt;performance_schema&lt;/code&gt;: its memory usage grows in proportion to the value of this variable, so it may lead to out-of-memory (OOM) issues on smaller instances, like T2 and T3 instances with less than 8 GB of memory. In that case you may need to disable &lt;code&gt;performance_schema&lt;/code&gt; or keep &lt;code&gt;max_connections&lt;/code&gt; at its default.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Configuration Variable &lt;code&gt;max_connect_errors&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This variable controls the number of successive failed connection requests allowed for a given client host. The client is shown the following error on exceeding this limit:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Host &lt;span class="s1"&gt;'*host_name*'&lt;/span&gt; is blocked because of many connection 
errors ...
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;A common incorrect practice is setting this variable very high to avoid client connectivity issues. However, this is dangerous, as it may hide serious application issues that need developer action, or worse, DDoS attempts trying to take down the system.&lt;/p&gt;

&lt;p&gt;If your client application is facing the "host is blocked" problem, use the &lt;code&gt;aborted_connects&lt;/code&gt; diagnostic counter along with the &lt;code&gt;host_cache&lt;/code&gt; table to &lt;strong&gt;identify and fix the problem in your application.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Note that this variable has no effect if &lt;code&gt;skip_name_resolve&lt;/code&gt; is set to 1 (default).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Aurora is really great ❤️, however, you still need to apply best practices to ensure smooth integration, reduced downtime and scalability. This article will help you apply these best practices with little to no engineering effort.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
