Devops Makeit-run

Posted on • Originally published at make-it.run

MySQL Memory Management and Replication Best Practices for High-Load Environments

Understanding MySQL’s Memory Use and OOM Risks:

MySQL heavily utilizes caching and memory buffers, with the InnoDB buffer pool commonly set to 70–80% of system RAM.

In high-load, replicated setups, multiple buffer pools, caches, and per-connection settings can amplify the memory footprint.

Over-provisioning these relative to physical RAM risks triggering the Linux OOM killer, which often terminates the mysqld process first due to its high memory usage.

DBAs should conservatively size innodb_buffer_pool_size, be mindful of cumulative buffer allocations, and use monitoring tools (like Performance Schema, pmap) to inspect real-time usage.
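As a sanity check on cumulative allocations, the classic DBA rule of thumb is global buffers plus per-connection buffers times max_connections. The helper below is a rough illustration of that arithmetic (the parameter defaults mirror common my.cnf values), not an exact accounting — MySQL rarely allocates every per-session buffer for every connection at once:

```python
# Rough, worst-case estimate of mysqld's memory footprint.
# This is the classic "global + per-connection * max_connections" rule
# of thumb, useful only as an upper-bound sanity check.

def mysqld_footprint_mb(innodb_buffer_pool_mb, key_buffer_mb,
                        max_connections, sort_buffer_kb=256,
                        join_buffer_kb=256, read_buffer_kb=128,
                        per_thread_stack_kb=1024):
    global_mb = innodb_buffer_pool_mb + key_buffer_mb
    per_conn_kb = (sort_buffer_kb + join_buffer_kb
                   + read_buffer_kb + per_thread_stack_kb)
    return global_mb + max_connections * per_conn_kb / 1024

# Example: 24 GB buffer pool plus 500 connections on a 32 GB host.
estimate = mysqld_footprint_mb(24 * 1024, 32, 500)
print(f"~{estimate:.0f} MB worst case")  # leave headroom below physical RAM
```

If the worst-case number lands near or above physical RAM, shrink the buffer pool or per-session buffers before the OOM killer does it for you.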

  1. Why MySQL Often Gets Killed and OS Configuration:

Linux’s OOM killer targets high-memory processes, so mysqld is frequently the victim. Lowering its oom_score_adj reduces the chance of mysqld being killed, but may merely shift the risk to other processes. Kernel overcommit settings (vm.overcommit_memory) also matter: never-overcommit mode (=2) turns surprise OOM kills into immediate allocation errors, which can be the safer failure mode.

On containers or VMs, ensure cgroup limits are properly set to avoid OOM within resource-constrained environments.
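One way to wire up these OS-level protections — the file paths and limit values below are illustrative, not prescriptive:

```ini
# /etc/sysctl.d/90-mysql.conf -- never-overcommit mode: failed allocations
# return an error to mysqld instead of inviting the OOM killer
vm.overcommit_memory = 2
vm.overcommit_ratio  = 90

# systemd drop-in, e.g. /etc/systemd/system/mysqld.service.d/oom.conf:
# [Service]
# OOMScoreAdjust=-600   # make mysqld a less attractive OOM-killer target
# MemoryMax=28G         # cgroup ceiling; keep it below physical RAM
```

Remember `sysctl --system` and `systemctl daemon-reload` after editing, and that protecting mysqld this way means some other process will be sacrificed first.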

  2. Tuning MySQL Buffers & Multi-database Considerations: In addition to tuning the InnoDB buffer pool, carefully configure other buffers such as key_buffer_size (if you still use MyISAM), disable the query cache unless strictly needed (it was removed entirely in MySQL 8.0), and keep per-session buffers (e.g., sort_buffer_size, join_buffer_size) low in high-concurrency setups, since each connection can allocate its own copy. Manage tmp_table_size and max_heap_table_size to avoid excessive in-memory temp tables. Large numbers of tables/databases inflate the data dictionary; in Percona Server, innodb_dict_size_limit can cap that memory.
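A minimal my.cnf sketch along these lines — all values are placeholders for a dedicated 32 GB host and must be sized to your workload:

```ini
# my.cnf sketch for a dedicated 32 GB host -- values are illustrative
[mysqld]
innodb_buffer_pool_size = 24G    # ~75% of RAM on a dedicated DB host
key_buffer_size         = 32M    # keep small unless MyISAM is in real use
sort_buffer_size        = 256K   # per session -- keep low at high concurrency
join_buffer_size        = 256K   # per session
tmp_table_size          = 64M    # in-memory temp table ceiling...
max_heap_table_size     = 64M    # ...effective limit is the lower of the two
```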

Monitoring Memory Usage Effectively: Utilize SHOW GLOBAL STATUS and Performance Schema’s memory tables (like memory_summary_global_by_event_name) for component-level memory tracking.

Particularly with Group Replication, monitor for hidden memory consumers in plugin caches, and ensure limits like group_replication_message_cache_size are set if available.
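For example, the Performance Schema table mentioned above can rank instrumented components by current allocation:

```sql
-- Top memory consumers by instrumented component (MySQL 5.7+ / 8.0)
SELECT event_name,
       current_number_of_bytes_used / 1024 / 1024 AS current_mb
FROM performance_schema.memory_summary_global_by_event_name
ORDER BY current_number_of_bytes_used DESC
LIMIT 10;
```

Watch for unexpectedly large `memory/group_rpl/...` or plugin-related events on Group Replication nodes.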

Version choice matters: newer releases add memory instrumentation and control knobs (for example, broader Performance Schema memory instruments and Group Replication cache limits) that older versions simply lack.

Architecture Best Practices – Replication Setup and GTIDs

Recommended replication topology is active/passive (one writable master, multiple read replicas) with a clearly defined failover plan.

GTIDs (Global Transaction IDs) should be enabled for streamlined failover and data consistency. Although GTIDs require care with binlog settings and backup restores, they are modern best practice for robust replication management.
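A typical GTID configuration looks like this for a freshly provisioned server; note that enabling GTIDs on a live topology requires stepping gtid_mode through its permissive phases (OFF_PERMISSIVE, then ON_PERMISSIVE, then ON) rather than flipping it directly:

```ini
[mysqld]
gtid_mode                = ON
enforce_gtid_consistency = ON
log_bin                  = binlog
log_replica_updates      = ON   # spelled log_slave_updates before MySQL 8.0.26
```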

  1. Replication Modes – Async, Semi-sync, and Chained Topologies:

Default replication is asynchronous (max performance, but risk of data loss).

Semi-synchronous mode is a compromise: the master waits for at least one replica’s acknowledgment, trading a bit of latency for higher durability.

For large clusters, chained replication (replicating to intermediaries) can alleviate network/CPU load on the primary.

Turn off binlog on leaf replicas where failover promotion is not needed, to save disk/CPU and reduce replication lag.

Multi-region strategies use async links for distant replicas.
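Enabling semi-synchronous replication is a plugin install plus a couple of variables. The sketch below uses the pre-8.0.26 plugin and variable names; newer releases also ship `rpl_semi_sync_source` / `rpl_semi_sync_replica` equivalents:

```sql
-- On the primary
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms; fall back to async after this

-- On each replica
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
```

The timeout is the key durability/availability dial: when no replica acknowledges within it, the primary silently degrades to asynchronous replication.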

  2. Disk and Binary Log Management: Disk I/O is often a bottleneck. Separate data/log files and binary/relay logs across distinct (ideally SSD) volumes to avoid contention.

Tune sync_binlog for the durability/performance trade-off: sync_binlog=1 fsyncs every commit group (maximum safety), while 0 or values greater than 1 are faster but can lose recent transactions on a crash. Consider binary log transaction compression (MySQL 8.0.20+) if I/O-bound. Replicas should not write binary logs unless they propagate replication. Optimize disk layout and use fast storage for both data and logs.
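Putting this section together in one my.cnf fragment — the paths and retention period are illustrative:

```ini
[mysqld]
datadir   = /var/lib/mysql           # data files on one volume...
log_bin   = /binlog/mysql-bin        # ...binary logs on another (ideally SSD)
relay_log = /binlog/relay-bin
sync_binlog = 1                      # fsync per commit group (safest); >1 trades safety for speed
binlog_transaction_compression = ON  # MySQL 8.0.20+: zstd-compress transaction payloads
binlog_expire_logs_seconds = 259200  # purge binary logs after 3 days
```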

Multi-threaded Replication and Crash Safety: Modern MySQL supports parallel replication using multiple SQL threads, greatly reducing replication lag in multi-core environments.

Enable and tune parallel workers, but beware CPU contention and ensure up-to-date MySQL patches to avoid bugs.
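A starting-point configuration for the parallel applier — these are the 8.0.26+ variable names (older releases use the `slave_*` spellings), and the worker count is a guess to be tuned against observed lag:

```ini
[mysqld]
replica_parallel_workers      = 4              # start near core count, tune by lag
replica_parallel_type         = LOGICAL_CLOCK  # parallelize by commit grouping, not schema
replica_preserve_commit_order = ON             # replicas commit in the primary's order

# On the primary: widen what the replicas may apply in parallel
binlog_transaction_dependency_tracking = WRITESET
```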

For crash safety, use

```ini
relay_log_info_repository = TABLE   # the default in MySQL 8.0
relay_log_recovery        = 1
```

to recover gracefully after replica crashes.

Monitor parallel replication and adjust thread count based on observed lag and CPU utilization.
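Two quick checks for applier health (`SHOW REPLICA STATUS` is the 8.0.22+ spelling of `SHOW SLAVE STATUS`):

```sql
-- Overall lag and thread state
SHOW REPLICA STATUS\G

-- Per-worker view of the parallel applier
SELECT worker_id, service_state, last_applied_transaction
FROM performance_schema.replication_applier_status_by_worker;
```

If one worker is consistently the laggard, the workload may not parallelize well and more threads will only burn CPU.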

Query, Schema, and I/O Tuning for Replication

Optimize master queries to minimize heavy or long-running transactions, as large changes slow down all replicas.

Use online schema change tools for large table changes.

Index appropriately and consider sharding for high-write tables.

Use high-performance storage (SSD/NVMe), monitor disk latency, and, in the cloud, provision IOPS appropriately.

Adjust sync settings (like innodb_flush_log_at_trx_commit) for performance vs durability balance.
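The main knob and its trade-offs, annotated:

```ini
[mysqld]
# innodb_flush_log_at_trx_commit:
#   1 = fsync the redo log at every commit (full durability, slowest)
#   2 = write at commit, fsync ~once per second (can lose <=1s on an OS crash)
#   0 = write+fsync ~once per second (can lose <=1s even on a mysqld crash)
innodb_flush_log_at_trx_commit = 2   # common choice on replicas; keep 1 on the primary
```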

Use chained replication and connection throttling to reduce replay overload.

Monitoring, Alerting, and Summary of Best Practices

Continuously monitor both system and MySQL-level memory and replication health, using tools like Percona PMM or Prometheus.


Set up alerts for OOM events, high replication lag, or memory spikes.
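If you export metrics with mysqld_exporter, a Prometheus alerting rule for replication lag might look like this sketch (the threshold, duration, and labels are illustrative):

```yaml
# prometheus-rules.yml sketch, assuming mysqld_exporter metric names
groups:
  - name: mysql
    rules:
      - alert: MySQLReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.instance }} is lagging behind the primary"
```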

Best Practices

Right-size memory and tune the OS, use an optimized replication topology and disk layout, tune the binary logs, and configure crash-safe replication settings.

Regular monitoring and thoughtful configuration help maintain a stable and performant high-load MySQL replication environment.
