Amazon Aurora MySQL Database Administrator’s Handbook

#aws #database #devops

Introduction

Aurora MySQL is a managed relational DB engine compatible with MySQL 5.6 & 5.7. You can still use the drivers, connectors and tools you used to with MySQL with (almost) no charge. Aurora MySQL DB clusters provides features like :

One primary Read/Write (RW) instance, 15 replicas of Read-only (RO) instances
Any RO instance can be promoted to RW in case of failure of the primary instance
Dynamic cluster endpoint (i.e. URI or address) always pointing to the primary instance even in case of failover
Reader endpoint including all RO Replicas, updates when replicas are added or removed
Admin can create custom DNS endpoints containing his own configuration of DB instances within a single cluster
Improved scalability using internal connection pools and thread multiplexing for each server
Almost zero-down-time DB restart/recovery
Almost Real-Time metadata accessible by application developers enabling them to create smart drivers and connect directly to the instances based on their roles (RW - RO)

But to get the most out of these perks, DBAs need to learn the best practices, because any sub-optimal configuration for applications, drivers, connector or proxies can lead to unexpected downtime and performance issues. And you can consider this article The Aurora MySQL configuration best practices Cookbook.

DNS Endpoints

Source: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/images/AuroraArch001.png

As you can see from the diagram above, Aurora DB has some Compute instances connected to a Multi-tenant (can serve many clusters), Multi-attach (can have multiple instances attached to it) Storage volume. The compute instances are one primary RW instance (M) and up to 15 RO replicas (R) - per cluster. RO instances can take over the RW instance in case of failure.

But how to connect to these instances in an optimum way ? Here, Aurora supports 4 types of DNS endpoint.

Cluster Endpoint : Following the Primary Instance even in case of failover
Reader Endpoint : Include all RO instances under a single DNS CNAME, so can be used for Round-robin Load Balancing
Instance Endpoints : Connect directly to some instance (RW or RO)
Custom Endpoints : User-defined DNS endpoints containing a selected group on instance within a single cluster

You can use any of the 4 types where you see suitable to reach the optimum configuration.

Connection Handling in Aurora MySQL and MySQL

MySQL Community Edition dedicated one OS thread from the mysqld process for each connection (one-thread-per-connection). This leads to many scalability issues (in case of large number of user connections) like high memory usage even if some connections are idle, also the huge context-switching overhead between multiple threads.

As a solution, Aurora MySQL supports a thread pool approach (group of threads ready for any connection on-demand usage). Those threads are never dedicated to any single connection usage. Threads are multiplexed, that is, when a thread is being used by a connection and it's not actively executing (e.g. waiting for IO), the thread can switch to another connection to do useful work; thus gaining best utilization and serving many connections with just a few threads. The thread pool also scales up and down automatically according to usage, no manual configuration required.

Although thread pooling reduces server-side cost of maintaining connections, it comes with the limitation of setting up and terminating the connections, especially when this connection has session-level configuration (like SET variable_name = value ). This process involves an exchange of several network packets. For busy workloads with short-lived connections (like Online Transaction Processing), consider using an application-side connection pool.

Common Misconceptions

❌ No need for application-side connection pool when a server-side connection pool is used

As mentioned before, server-side pooling has the limitation that it doesn't eliminate the overhead of setting up and terminating the connection. So if your application is doing very frequent opening/closing connections, and few statements are executed per connection; then you need application-side pooling. Even if your connections are long-lived, you may benefit from app-side pooling to avoid large bursts of new connection attempts i.e. connection surges. You can use tcpdump tool to monitor your connection and compare overhead packets versus useful processing packets to help you take the decision.
❌ Idle connections don't use memory

Incorrect! Both OS and database processes allocate in-memory descriptors for each connection. Although Aurora MySQL typically uses less memory than MySQL CE, this overhead in Aurora MySQL is non-zero. So basically, avoid opening too more connection inn your app-side pool than you need.
❌ Downtime depends entirely on DB stability and features

Incorrect! Your app design and configuration also matters. For this, read the next section to know how your practices can help user traffic recover faster following a DB event.

Best Practices
Using Smart Drivers

Although Aurora MySQL Cluster and Reader endpoints abstracts (hides) the topology of the cluster, taking the topology into account while designing your connector helps greatly in eliminating delays occur because of DNS updates. For this reason, Aurora MySQL provides a near-real-time Metadata table ( INFORMATION_SCHEMA.REPLICA_HOST_STATUS ) carrying information about the instances in the cluster and their roles and can be queried from any instance in the cluster.

Example query against the metadata table. Source : the original paper

Smart drivers are drivers/connector that utilize this table to improve queries, not only depending on high-level DNS endpoint, also round-robin load-balancing read-only connections to the reader instances. Example for this is The MariaDB Connector/J for Java.

Note that using the smart connector doesn't compensate for the rest of best practices, you still need to manage some other stuff - following in the article - to reach to the optimal connection. Also note that theses connector that has Aurora-specific features may not be officially verified by AWS and needs to be up-to-date as they encounter much more updates than the barebones connectors.
DNS Caching

Rule of thumb: DNS Caching TTL of Aurora endpoints is 5 seconds. Your configuration should NEVER further exceed this limit. Caching may occur on network layer, the OS or your application, so make sure all caching layers will not exceed the TTL limit.
Exceeding the TTL limit means having outdated DNS data, which may lead to reaching to a demoted primary instance as if it is still the primary instance, connection failure to the reader instances after scaling up/down due to usage of old IPs or unequal utilization of traffic among reader instances.
Connection management and pooling
- Always keep the connections closed and don't rely on the development language/framework to close them automatically as there may be scenarios where this is not the case.
- If you can't rely on client-interactive applications to close idle connection use interactive_timeout and wait_timeout MySQL variables to keep idle connection wait time suitable.
- As mentioned before, use connection pooling to protect your DB against surges, also if you make thousands of short-lived connections per second. If your framework doesn't support connection pooling, mind using connection proxies like ProxySQL.
- Best practices with managing connection pools and proxies :
  - Check the health of the borrowed connection before using. This can be as easy as SELECT 1 or show the value of @innodb_read_only variables to further know the role of the Aurora instance you're communicating with - true if it's a reader instance.
  - Periodically health-check the connections
  - Recycle ALL connections periodically by closing and reopening new ones. This helps save resources and prevent runaway queries and zombie connections (connection with abandoned clients).
Connection Scaling

Scaling up, number of connections increases proportional to number of application server instances, given that you dedicated a fixed number of connections per server. This may limit DB scalability in some extreme cases, as most of the connections are typically idle, yet taking up server resources.

To solve this case you may reduce number of connections per server to the minimum applicable, although this solution doesn't scale well as your app grows up. A much better solution is to use proxy between the application servers and the DB. It comes with many features out of the box, like configurable fixed number of connections, query caching, connection buffering and load balancing. Proxies like ProxySQL, ScaleArc and MaxScale are compatible with MySQL protocol. For further scalability and availability you may use multiple proxy instances behind the same DNS endpoint.
Transaction Management and Autocommit

Autocommit mode ensures that every statement runs in its own transaction which is commited automatically. This mode is recommended because disabling this mode means that the transaction is open and may stay open for long, blocking garbage collection mechanism and filling the garbage collection backlog, leading to excessive storage consumption and query slowness.

It's recommended to always use the autocommit mode and double-check it's enabled in the applications and the DB sides, especially at the applications as it may not be enabled by default. You should also manage transactions manually using START/BEGIN TRANSACTION and COMMIT/ROLLBACK as soon as you finish. These recommendations apply whenever you're using innoDB.

Also you can monitor transaction time using information_schema.innodb_trx table. trx_started is the starting time of the transaction so you can use it to calculate its age and investigate it if age is in the order of minutes.

For garbage collection backlog monitoring, use trx_rseg_history_len counter in the information_schema.innodb_metrics table. If in the order of ten thousands, g.c. is delayed. If millions, the situation is dangerous and needs investigate.

Note that garbage collection in Aurora is cluster-wide, meaning that any performance issue will affect all the instances, so you need to monitor all the instances.
Connection Handshakes

Usually opening a new DB session involve executing many set up statements, like setting up session variables, which affects greatly the applications sensitive to latency. You should be aware of internal operations of the driver using Aurora Advanced Audit, General Query Log or network-level packet trace using tcpdump. You know the purpose of each statement and its effect on subsequent queries. If you find that number of roundtrips taken for handshake operations are significant relative to actual work, consider disabling some handshake statements or using connection pooling.
Load Balancing with the Reader Endpoint

DNS resolution of Reader endpoint is load balanced in round robin fashion for every new connection, this means that for the same connection, all the queries are executed against the same instance. This may lead to unequal usage of Read Replicas, long initial delay for newly added instances and applications keep sending traffic to stopped instances in case of DNS caching. Be sure to use DNS caching best practices mentioned before.
Designing for Fault Tolerance and Quick Recovery

When you scale your application up, you're more likely to add more instances (db, application, ...) also to face more issues. You should design your application to be resilient in facing these situations. You should keep your application up-to-date with the failover of Aurora primary instances (occurs within 30 seconds after the failure). You should also keep up to date with the new Reader instances created to start sending traffic to them, as well as the removed instances to stop sending traffic to them. Not following best practices may lead to longer downtime.
Server Configuration
- Configuration Variable max_connections
  
  This variable limits number of connections per Aurora DB instance. The best practice is to keep this slightly higher than what you expect to open, but beware if you're using performance_schema as its memory usage increases proportional to the value of this variable, s it may lead to OOM isues on smaller instances, like T2 and T3 instances with less than 8GB memory. In this case you may need to disable performance_schema or keep max_connections to the default.
- Configuration Variable max_connect_errors
  
  This variable controls number of successive failed connection requests for a given client host. The client is shown the following error on exceeding this limit
```
Host '*host_name*' is blocked because of many connection 
errors ...
```
  A common incorrect practice is keeping this variable very high to avoid client connectivity issues. However, this is dangerous as it may hide serious issues with the applications that needs developer action, or even worse, DDoS attacks trying to take down the system.
  
  If your client application is facing the "host is blocked" problem, use aborted_connects diagnostic counters along with host_cache table to identify and fix the problem in your application.
  
  Note that this variable has no effect if skip_name_resolve is set to 1 (default).

Conclusion

Aurora is really great ❤️, however, you still need to apply best practices to ensure smooth integration, reduced downtime and scalability. This article will help you apply these best practices with little to no engineering effort.