In a high availability MySQL HeatWave DB system, automatic failover is a critical mechanism that ensures business continuity when the primary instance becomes unavailable. Understanding how to identify when a failover has occurred and how to properly test failover scenarios is essential for database administrators managing mission-critical applications.
This guide walks you through recognizing failover events from error messages and console indicators, understanding the failover process, and following best practices for testing failover scenarios without disrupting production systems.
Understanding Failover in High Availability
What is Failover?
Failover is the automatic process where one of the secondary MySQL instances is promoted to become the new primary instance when the current primary fails or becomes unavailable. The promotion happens automatically without manual intervention, minimizing downtime and ensuring continuous availability.
Common Causes of Failover
Failover can be triggered by several scenarios:
- Availability Domain Failure: Complete outage of the data center hosting the primary instance
- Primary Database Instance Failure: Hardware failure, software crash, or critical system error on the primary instance
- Network Connectivity Issues: Loss of network connectivity between the primary instance and the MySQL HeatWave Service management layer
- Block Storage Issues: Persistent storage failures affecting the primary instance's ability to read or write data
- Maintenance Activities: Certain planned maintenance operations that affect the primary instance
The Failover Process
When failover occurs:
- MySQL HeatWave Service detects the primary instance failure
- One of the secondary instances is automatically selected for promotion
- The selected secondary is promoted to primary status
- The DB system endpoint is reassigned to the newly promoted primary instance
- The IP address remains the same, but all existing connections are closed
- Client applications must reconnect to resume operations
- If a HeatWave cluster was attached, it's automatically recreated and data is reloaded
Identifying a Failover: Error Messages and Indicators
Primary Indicators of Failover
1. Console Notification
The most straightforward way to verify if a failover has occurred is through the OCI Console:
Steps to Check:
- Open the navigation menu
- Select Databases → HeatWave MySQL → DB Systems
- Choose your compartment from the List Scope
- Click the name of your DB system to open the DB System Details page
If a failover has occurred, a message is displayed stating: "Current placement (<DomainName>) differs from preferred placement, due to failover or maintenance activity".
This message indicates that the current primary instance is not in your originally preferred location due to automatic failover.
2. MySQL Event Notifications
When a failover happens, a MySQL - Automatic Recovery event is emitted on the DB system with the additionalDetails.isFailover property set to true.
Configure event rules to receive notifications when these events occur.
Error Messages During Failover
When a failover is in progress or has just occurred, applications may encounter various error messages. Understanding these errors helps you identify failover events and implement appropriate retry logic.
Error 1290 (HY000): Read-Only Mode Error
Error Message:
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option
so it cannot execute this statement
What It Means:
This error occurs when attempting to execute write operations (INSERT, UPDATE, DELETE, DDL statements) on an instance that is in super-read-only mode.
When You'll See It:
- During the brief window when a primary is being demoted to secondary
- If your application still has connections to the old primary after failover
- When attempting writes to a secondary instance
What to Do:
- Implement connection retry logic in your application
- Close existing connections and reconnect so you are routed to the new primary (a reconnect-and-verify sketch follows this list)
- Ensure your application uses the DB system endpoint, not instance-specific IPs
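A minimal sketch of that recovery path, assuming MySQL Connector/Python: on Error 1290 the client drops the stale connection, reconnects to the DB system endpoint, and checks @@super_read_only to confirm the new primary accepts writes before retrying. The endpoint hostname, credentials, and the write statement are placeholders, not values from this article.
import time
import mysql.connector

ER_READ_ONLY = 1290   # "running with the --super-read-only option"

def open_connection():
    # <DB_SYSTEM_ENDPOINT> and credentials are placeholders for your own values
    return mysql.connector.connect(host="<DB_SYSTEM_ENDPOINT>", user="admin",
                                   password="<PASSWORD>", database="your_database",
                                   connection_timeout=10)

def execute_write(sql, params, attempts=5):
    conn = open_connection()
    for _ in range(attempts):
        try:
            cur = conn.cursor()
            cur.execute(sql, params)
            conn.commit()
            cur.close()
            return
        except mysql.connector.Error as err:
            if err.errno != ER_READ_ONLY:
                raise
            conn.close()                      # stale connection to the old primary
            time.sleep(2)                     # give the promotion a moment to finish
            conn = open_connection()          # the endpoint now routes to the new primary
            cur = conn.cursor()
            cur.execute("SELECT @@super_read_only")
            writable = cur.fetchone()[0] == 0
            cur.close()
            if not writable:
                continue                      # still mid-failover; loop and retry
    raise RuntimeError("DB system still read-only after retries")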
Error 3100 (HY000): Replication Hook Error
Error Message:
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'
What It Means:
This error indicates the system is starting the failover process, and transaction commits are being blocked or rolled back.
Context:
This error can also indicate a transaction size problem when the message includes "Transaction of size X exceeds specified limit".
When You'll See It:
- At the beginning of a failover event
- When large transactions exceed the group replication transaction size limit
- During the brief period when the primary instance is being transitioned
What to Do:
- Transactions in progress will be rolled back automatically
- Implement exponential backoff retry logic
- Reconnect and retry the transaction
- For transaction size errors, break large transactions into smaller batches (a batching sketch follows this list)
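To illustrate the batching advice in the last item, here is a minimal sketch that commits rows in fixed-size chunks so no single transaction approaches the group replication size limit. The table name, columns, and batch size are assumptions made for the example.
import mysql.connector

def insert_in_batches(conn, rows, batch_size=1000):
    # "orders" and its columns are hypothetical; substitute your own schema
    sql = "INSERT INTO orders (id, amount) VALUES (%s, %s)"
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cur.executemany(sql, batch)   # one modest-sized transaction per chunk
        conn.commit()                 # commit before moving to the next chunk
    cur.close()
Choosing the batch size is a trade-off: smaller batches stay further below the limit but pay more commit overhead.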
Error 2013 (HY000): Connection Lost During Query
Error Message:
ERROR 2013 (HY000): Lost connection to MySQL server during query
What It Means:
The connection to the MySQL server was lost while executing a query, typically because the primary instance became unavailable during failover.
When You'll See It:
- During active failover when the primary instance goes offline
- When long-running queries are interrupted by failover
- During network disruptions affecting the primary
What to Do:
- Close the broken connection
- Establish a new connection to the DB system endpoint
- Retry the query on the new primary
- Implement idempotent query patterns where possible (an idempotent write example follows below)
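One common way to make a retried write idempotent is to key it on a unique identifier and use INSERT ... ON DUPLICATE KEY UPDATE, so replaying the statement after a lost connection cannot create duplicates. A minimal sketch with a hypothetical table and columns:
import mysql.connector

# Assumes a table with a unique key on event_id (hypothetical schema)
UPSERT = """
    INSERT INTO events (event_id, payload)
    VALUES (%s, %s)
    ON DUPLICATE KEY UPDATE payload = VALUES(payload)
"""

def record_event(conn, event_id, payload):
    cur = conn.cursor()
    cur.execute(UPSERT, (event_id, payload))  # replaying this statement is harmless
    conn.commit()
    cur.close()
Because the statement is keyed on event_id, retrying it after Error 2013 either inserts the row or overwrites it with the same payload, never duplicating data.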
Error Log Indicators
MySQL error logs provide detailed information about failover events. You can query the Performance Schema to view these logs.
Viewing Error Logs
Connect to your DB system using MySQL Shell or MySQL Client and run:
SELECT * FROM performance_schema.error_log
WHERE SUBSYSTEM = 'Repl'
ORDER BY LOGGED DESC
LIMIT 50;
Key Error Codes Indicating Failover
MY-013213 and MY-011507: Primary Election Change
These log entries indicate that a failover has occurred and a secondary instance has been promoted to primary through a Group Replication primary election change.
Example Log Entry:
| date_time | 35 | System | MY-013214 | Repl | Plugin group_replication reported: 'Starting group operation local execution: Primary election change' |
| date_time | 0 | System | MY-011507 | Repl | Plugin group_replication reported: 'Plugin 'group_replication' has been started' |
MY-011608: Transaction Size Error
This error indicates that a transaction exceeded the group_replication_transaction_size_limit.
Example Log Entry:
[ERROR] [MY-011608] [Repl] Plugin group_replication reported:
'Error on session 423. Transaction of size 90386497 exceeds specified limit 85899345.
To increase the limit please adjust group_replication_transaction_size_limit option.'
MY-011566: Super Read-Only Mode Change
This log message indicates that super_read_only has been switched off when an instance is promoted from secondary to primary, or switched on when it is demoted from primary to secondary.
What It Indicates:
- An instance role change has occurred
- The instance is transitioning between primary and secondary status
- Part of the normal failover process
Additional Failover Indicators
Binary Log Position Differences
After a failover, the current binary log file name and position of the new primary may differ from those of the old primary.
This is because each instance manages its binary logs independently; the same transaction may be written to different binary log files and positions on different instances.
To Check Binary Log Position:
SHOW MASTER STATUS;
On MySQL 8.2 and later, SHOW MASTER STATUS is deprecated in favor of SHOW BINARY LOG STATUS.
Compare the output before and after failover to identify changes; a small capture script is sketched below.
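A small sketch that records the binary log coordinates so you can diff them across a failover or switchover, assuming MySQL Connector/Python and an account allowed to run the statement.
import mysql.connector

def binlog_position(conn):
    cur = conn.cursor(dictionary=True)
    try:
        cur.execute("SHOW BINARY LOG STATUS")   # MySQL 8.2 and later
    except mysql.connector.Error:
        cur.execute("SHOW MASTER STATUS")       # older releases
    row = cur.fetchone()
    cur.close()
    return row["File"], row["Position"]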
Connection String Behavior
While the DB system endpoint IP address remains constant during failover, all existing connections are terminated and must be re-established. Monitor your application's connection pool for sudden spikes in connection closures and new connection attempts.
Testing Failover: Process and Best Practices
Why Test Failover?
Regular failover testing is essential for:
- Validating your disaster recovery procedures
- Ensuring applications handle failovers gracefully
- Training operations teams on recovery processes
- Identifying configuration issues before real failures occur
- Meeting compliance and audit requirements
- Building confidence in your high availability architecture
Switchover vs. Failover
Failover: Automatic, unplanned promotion of a secondary to primary due to primary instance failure.
Switchover: Manual, planned promotion of a secondary to primary for testing, maintenance, or optimization purposes.
When you perform a switchover, the preferred placement and current placement change to the newly selected placement of the primary instance, and the DB system endpoint IP address does not change.
Use switchover to safely test failover scenarios without actually causing a failure.
Step-by-Step Failover Testing Process
Phase 1: Set Up a Test DB System
Never test failover on production systems during business hours.
1. Create a Non-Production HA DB System
- Use a development or staging environment
- Configure it identically to production (same shape, configuration, HA settings)
- Ensure the system has high availability enabled
2. Import Representative Data
- Load a production-like dataset
- Include tables with primary keys (HA requirement)
- Consider data sensitivity and masking requirements
3. Deploy a Sample Application
- Use a test application that mimics production workload
- Implement connection pooling and retry logic
- Add monitoring and logging capabilities
- Configure the application to use the DB system endpoint (a minimal pooled test client is sketched below)
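As a starting point for such a test client, here is a minimal sketch using MySQL Connector/Python's built-in connection pool. The endpoint, credentials, and workload query are placeholders you would replace with your own.
import logging
import time
import mysql.connector
from mysql.connector import pooling

logging.basicConfig(level=logging.INFO)

# Endpoint, credentials, and database are placeholders
pool = pooling.MySQLConnectionPool(
    pool_name="ha_test",
    pool_size=8,
    host="<DB_SYSTEM_ENDPOINT>",
    user="admin",
    password="<PASSWORD>",
    database="your_database",
    connection_timeout=10,
)

def run_workload(iterations=1000):
    errors = 0
    for i in range(iterations):
        try:
            conn = pool.get_connection()                 # borrow from the pool
            cur = conn.cursor()
            cur.execute("SELECT COUNT(*) FROM orders")   # stand-in workload query
            cur.fetchall()
            cur.close()
            conn.close()                                 # return connection to the pool
        except mysql.connector.Error as err:
            errors += 1
            logging.warning("iteration %d failed: %s", i, err)
            time.sleep(1)                                # back off briefly during failover
    logging.info("completed with %d errors", errors)

if __name__ == "__main__":
    run_workload()
Run it continuously while you execute the switchover; the warning count and timestamps give a rough measure of the disruption window.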
Phase 2: Establish Baseline Monitoring
Before triggering the switchover, establish baseline metrics (a simple snapshot script follows this list):
1. Application Performance Metrics
- Query response times
- Transaction throughput
- Connection pool statistics
- Error rates
2. Database Metrics
- Active connections
- Current binary log position
- Replication lag (should be minimal in HA systems)
- Current primary instance location
3. System Health Indicators
- CPU and memory utilization
- Storage I/O metrics
- Network latency between availability domains
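A minimal sketch of a database-side baseline snapshot, assuming MySQL Connector/Python. It records a few of the metrics above (current connections, aborted-connection counters, the instance's read-only state, and its hostname) so you can diff them after the switchover; endpoint and credentials are placeholders.
import json
import mysql.connector

def snapshot(conn):
    cur = conn.cursor()
    metrics = {}
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    metrics["threads_connected"] = cur.fetchone()[1]
    cur.execute("SHOW GLOBAL STATUS LIKE 'Aborted_connects'")
    metrics["aborted_connects"] = cur.fetchone()[1]
    cur.execute("SHOW GLOBAL STATUS LIKE 'Aborted_clients'")
    metrics["aborted_clients"] = cur.fetchone()[1]
    cur.execute("SELECT @@super_read_only, @@hostname")
    read_only, host = cur.fetchone()
    metrics["super_read_only"] = read_only
    metrics["hostname"] = host
    cur.close()
    return metrics

conn = mysql.connector.connect(host="<DB_SYSTEM_ENDPOINT>", user="admin",
                               password="<PASSWORD>", database="your_database")
print(json.dumps(snapshot(conn), indent=2, default=str))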
Phase 3: Execute Switchover
Switchover causes a short period of downtime while the DB system endpoint is redirected to the newly promoted primary instance, requiring all database connections to be reopened.
Using the Console:
- Navigate to your DB System details page
- Click the Actions menu
- Select Switchover
- Choose the target availability domain or fault domain (different from current)
- Confirm the switchover operation
Using OCI CLI:
oci mysql db-system update \
--db-system-id <YOUR_DB_SYSTEM_OCID> \
--availability-domain <TARGET_AD> \
--force
Modern Switchover Behavior (for systems created/upgraded after February 25, 2025):
New transactions are blocked when switchover is initiated, running transactions are allowed to complete, and uncommitted transactions are rolled back when the connection to the previous primary is broken.
Phase 4: Monitor the Switchover
During Switchover (typically 30-60 seconds):
1. Watch Application Behavior
- Monitor for connection errors (Error 1290, 2013, 3100)
- Observe connection pool drain and refill
- Track any failed transactions
- Verify automatic retry mechanisms
2. Check Error Logs:
SELECT * FROM performance_schema.error_log
WHERE LOGGED > DATE_SUB(NOW(), INTERVAL 5 MINUTE)
ORDER BY LOGGED DESC;
3. Verify HeatWave Cluster Status (if applicable)
When the current placement changes in a switchover, the HeatWave cluster is detached from the previous primary instance and either reused (same AD) or deleted and recreated (different AD). Check the plugin status:
SHOW GLOBAL STATUS LIKE 'rapid_plugin_bootstrapped';
4. Monitor Database Management Metrics
- Use OCI Database Management console
- Track performance metrics during transition
- Observe connection patterns
After Switchover:
1. Verify New Primary Location
- Check OCI Console for current placement
- Confirm switchover to target AD/FD succeeded
- Verify "Current placement" matches target
2. Validate Application Recovery
- All connections successfully re-established
- Transaction throughput returns to baseline
- Query performance is consistent
- No persistent error conditions
3. Check Data Consistency (a small validation script is sketched after this list)
- Verify recent transactions completed successfully
- Confirm no data loss occurred
- Test read and write operations
- Validate application state consistency
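To make the read/write validation above repeatable, here is a minimal post-switchover check, assuming MySQL Connector/Python and a hypothetical ha_heartbeat table created only for testing. It reconnects to the endpoint, confirms the instance is writable, and performs a round-trip write and read; endpoint and credentials are placeholders.
import time
import mysql.connector

def validate_after_switchover():
    conn = mysql.connector.connect(host="<DB_SYSTEM_ENDPOINT>", user="admin",
                                   password="<PASSWORD>", database="your_database")
    cur = conn.cursor()

    cur.execute("SELECT @@super_read_only, @@hostname")
    read_only, host = cur.fetchone()
    assert not read_only, f"{host} is still read-only; switchover not complete"

    # ha_heartbeat is a throwaway table used only by this check
    token = int(time.time())
    cur.execute("CREATE TABLE IF NOT EXISTS ha_heartbeat (id INT PRIMARY KEY, ts BIGINT)")
    cur.execute("REPLACE INTO ha_heartbeat (id, ts) VALUES (1, %s)", (token,))
    conn.commit()

    cur.execute("SELECT ts FROM ha_heartbeat WHERE id = 1")
    assert cur.fetchone()[0] == token, "read after write did not return the new value"

    cur.close()
    conn.close()
    print(f"validated new primary {host}: writable, write/read round trip OK")

validate_after_switchover()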
Phase 5: Document and Analyze
Key Metrics to Record:
| Metric | Before Switchover | During Switchover | After Recovery |
|---|---|---|---|
| Total Downtime | N/A | ___ seconds | N/A |
| Connections Dropped | 0 | ___ | 0 |
| Failed Transactions | 0 | ___ | 0 |
| Recovery Time | N/A | ___ seconds | N/A |
| HeatWave Reload Time | N/A | ___ seconds | N/A |
Analysis Questions:
- Did the application handle failover gracefully?
- Were retry mechanisms effective?
- Did any transactions fail permanently?
- How quickly did the system return to normal operation?
- Were there any unexpected issues?
- Did monitoring and alerting work as expected?
Switchover to Return to Preferred Placement
After testing, you can switch back to the original preferred placement:
Option 1: Switchover Back
Perform another switchover to return the primary to the original AD/FD.
Option 2: Update Preferred Placement
You can also switch over to the existing primary instance, which changes the preferred placement to match the current placement with no downtime.
This updates your preference to match the current state without any service interruption.
Best Practices for Handling Failover
1. Test on Non-Production Systems First
Critical Rule: Never test failover on production systems during business hours or peak traffic periods.
Why This Matters:
- Switchover causes brief downtime (30-60 seconds)
- Connections are forcibly closed
- Uncommitted transactions are rolled back
- Long-running queries may be aborted
- Application behavior under failover may be unpredictable
Best Practice:
- Maintain a dedicated test environment that mirrors production
- Perform failover drills regularly, for example quarterly or twice a year
- Test during maintenance windows for production validation
- Document all test results and improvements made
2. Use Switchover to Simulate Failover
Recommendation: Use manual switchover operations to test disaster recovery procedures rather than artificially creating failures.
Benefits:
- Controlled, predictable process
- Ability to schedule during low-traffic periods
- Minimizes risk of actual data loss
- Provides clean before/after states for analysis
- Allows testing of application behavior when the primary instance changes to a different availability domain
Testing Scenarios:
- Same AD, Different Fault Domain: Minimal impact, tests intra-datacenter failover
- Different AD: Simulates datacenter failure, tests cross-AD latency and behavior
- With Active HeatWave Cluster: Tests analytics workload recovery
- Under Load: Tests failover during peak transaction volume
3. Monitor Error Logs and Track Aborted Transactions
What to Monitor:
Error Logs:
-- Check for failover-related errors in the last hour
SELECT LOGGED, ERROR_CODE, SUBSYSTEM, DATA
FROM performance_schema.error_log
WHERE SUBSYSTEM IN ('Repl', 'Server')
AND LOGGED > DATE_SUB(NOW(), INTERVAL 1 HOUR)
ORDER BY LOGGED DESC;
Transaction Status:
-- Monitor transaction metrics
SHOW GLOBAL STATUS LIKE 'Com_%';
SHOW GLOBAL STATUS LIKE 'Aborted_%';
HeatWave Cluster Status (if applicable):
-- Verify HeatWave is operational
SHOW GLOBAL STATUS LIKE 'rapid%';
Best Practice:
- Set up automated monitoring alerts for failover events
- Log all errors during switchover testing
- Track transaction rollback rates
- Monitor connection pool health (a simple counter-polling sketch follows this list)
- Review application logs for retry attempts and eventual success
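As a lightweight complement to the queries above, the following sketch polls a few status counters and prints a warning when the aborted-connection counters jump, which is a typical signature of a failover window. The thresholds and poll interval are arbitrary assumptions, and a production monitor would also need its own reconnect logic, since its connection can be dropped by the very failover it is watching for.
import time
import mysql.connector

def fetch_counters(conn):
    cur = conn.cursor()
    cur.execute(
        "SELECT VARIABLE_NAME, VARIABLE_VALUE FROM performance_schema.global_status "
        "WHERE VARIABLE_NAME IN ('Aborted_clients', 'Aborted_connects', 'Threads_connected')")
    rows = dict(cur.fetchall())
    cur.close()
    return {name: int(value) for name, value in rows.items()}

def watch(conn, interval=10):
    previous = fetch_counters(conn)
    while True:
        time.sleep(interval)
        current = fetch_counters(conn)
        for name in ("Aborted_clients", "Aborted_connects"):
            delta = current[name] - previous[name]
            if delta > 5:   # arbitrary threshold for this sketch
                print(f"WARNING: {name} increased by {delta} in {interval}s -- possible failover")
        previous = current

conn = mysql.connector.connect(host="<DB_SYSTEM_ENDPOINT>", user="admin",
                               password="<PASSWORD>")
watch(conn)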
4. Implement Robust Application-Level Retry Logic
Connection Retry Pattern:
import time
import mysql.connector
from mysql.connector import Error

def connect_with_retry(max_retries=5, base_delay=1):
    """Connect to MySQL with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            connection = mysql.connector.connect(
                host='<DB_SYSTEM_ENDPOINT>',
                database='your_database',
                user='admin',
                password='your_password',
                connect_timeout=10
            )
            return connection
        except Error as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Connection failed: {e}. Retrying in {delay}s...")
            time.sleep(delay)
Transaction Retry Pattern:
def execute_with_retry(connection, query, max_retries=3):
    """Execute a query, reconnecting and retrying on failover-related errors."""
    for attempt in range(max_retries):
        try:
            cursor = connection.cursor()
            cursor.execute(query)
            connection.commit()
            return cursor
        except mysql.connector.Error as e:
            if e.errno in (1290, 2013, 3100) and attempt < max_retries - 1:
                # Failover-related error: reconnect and retry on the new primary
                connection = connect_with_retry()
                continue
            raise
Key Principles:
- Always use the DB system endpoint, never instance-specific IPs
- Implement exponential backoff (1s, 2s, 4s, 8s, 16s)
- Limit retry attempts to avoid infinite loops
- Make transactions idempotent when possible
- Log all retry attempts for monitoring
5. Move Primary Back After Unplanned Failover
After an automatic failover occurs in production:
The current placement differs from the preferred placement, and a message is displayed indicating "Current placement differs from preferred placement, due to failover or maintenance activity".
Decision Points:
Option A: Leave in New Location
- System is stable and operational
- Performance is acceptable
- No immediate reason to change
- Update preferred placement to match current
Option B: Switchover Back to Preferred Location
- Original location has specific advantages (latency, compliance)
- Want to maintain consistent placement for operational procedures
- Schedule switchover during next maintenance window
- Perform during low-traffic period
How to Move Back:
1. Verify original location is healthy
- Check that the original AD/FD has recovered
- Ensure no ongoing infrastructure issues
2. Schedule maintenance window
- Choose low-traffic period
- Notify stakeholders of brief downtime
- Prepare rollback plan
3. Execute switchover:
oci mysql db-system update \
--db-system-id <OCID> \
--availability-domain <ORIGINAL_AD> \
--force
4. Monitor recovery
- Verify applications reconnect successfully
- Check HeatWave cluster reload (if applicable)
- Confirm performance returns to baseline
Best Practice: Don't rush to move back immediately after failover. Validate the stability of the current state first, then plan a controlled switchover during a maintenance window.
6. Configure Connection Timeouts Appropriately
Recommended Settings (note that in MySQL HeatWave, server variables are typically managed through the DB system's Configuration rather than with SET GLOBAL, so adjust them there if the statements below are not permitted):
-- Connection timeout (seconds to wait for initial connection)
SET GLOBAL connect_timeout = 10;
-- Interactive timeout (idle connection timeout)
SET GLOBAL interactive_timeout = 28800;
-- Wait timeout (connection wait time)
SET GLOBAL wait_timeout = 28800;
For Applications Using PrivateLink:
- Set shorter connect_timeout values (5 seconds)
- Implement aggressive retry logic
- Failovers and switchovers take longer when using Query PrivateLink endpoints
7. Test with HeatWave Cluster
If your DB system has a HeatWave cluster attached:
Additional Considerations:
1. Cluster Behavior During Switchover
- If new primary is in the same AD as previous primary, the existing HeatWave cluster is reused and reattached
- If new primary is in a different AD, the existing HeatWave cluster is deleted and a new one created in the same AD as the new primary
2. Data Reload
- Tables are automatically reloaded from HeatWave Storage Layer
- Monitor reload progress:
SHOW GLOBAL STATUS LIKE 'rapid%';
- Test analytics queries after reload completes
3. Long-Running Queries
- When the current placement changes in a switchover, long-running queries offloaded to the HeatWave cluster can be aborted
- Reconnect and rerun queries after the cluster is ready
- Implement query retry logic for analytics workloads (a retry sketch follows below)
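A minimal sketch of that retry logic, assuming MySQL Connector/Python: it waits until the HeatWave plugin reports itself bootstrapped, then reruns the analytics query. The query, timeout, and poll interval are placeholders, and a real implementation would also reconnect if the session itself was dropped by the switchover.
import time
import mysql.connector

def wait_for_heatwave(conn, timeout_s=1800, poll_s=30):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        cur = conn.cursor()
        cur.execute("SHOW GLOBAL STATUS LIKE 'rapid_plugin_bootstrapped'")
        row = cur.fetchone()
        cur.close()
        if row and row[1] == 'YES':
            return True
        time.sleep(poll_s)        # cluster still being recreated or reloaded
    return False

def run_analytics_with_retry(conn, query, attempts=3):
    for _ in range(attempts):
        try:
            cur = conn.cursor()
            cur.execute(query)
            return cur.fetchall()
        except mysql.connector.Error:
            if not wait_for_heatwave(conn):
                raise             # cluster never became ready within the timeout
    raise RuntimeError("analytics query kept failing after HeatWave recovered")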
Advanced Monitoring and Alerting
Setting Up Event Rules for Failover Notifications
OCI Events Configuration:
- Create an Event Rule:
{
"eventType": ["com.oraclecloud.mysql.recoveryevent"],
"data": {
"additionalDetails": {
"isFailover": "true"
}
}
}
- Configure Actions:
- Send notification to OCI Notifications topic
- Trigger OCI Functions for automated response
- Log to OCI Logging for audit trail
- Integrate with PagerDuty, Slack, or other alerting systems
Creating Custom Monitoring Dashboards
Key Metrics to Track (a sample OCI Monitoring query is sketched after this list):
1. High Availability Metrics
- Current vs. Preferred Placement
- Time since last placement change
- Number of failover events (historical)
- Replication lag (should be minimal)
2. Application Health
- Connection success rate
- Transaction throughput
- Query latency (P50, P95, P99)
- Error rates by type
3. Database Performance
- Active connections
- Buffer pool hit ratio
- InnoDB row lock waits
- Slow query count
4. HeatWave Cluster (if applicable)
- Cluster status
- Node availability
- Memory utilization
- Data scanned per query
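If you build such a dashboard programmatically, the OCI Python SDK's Monitoring client can pull these metrics. The sketch below is assumption-heavy: the metric namespace (oci_mysql_database), metric name (CurrentConnections), and MQL expression are examples and should be verified against the MySQL HeatWave metrics documentation before use; the compartment OCID is a placeholder.
from datetime import datetime, timedelta, timezone

import oci

# Uses the default profile in ~/.oci/config
config = oci.config.from_file()
monitoring = oci.monitoring.MonitoringClient(config)

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

details = oci.monitoring.models.SummarizeMetricsDataDetails(
    namespace="oci_mysql_database",           # assumed namespace -- verify
    query='CurrentConnections[1m].mean()',    # assumed metric and MQL -- verify
    start_time=start,
    end_time=end,
)

response = monitoring.summarize_metrics_data(
    compartment_id="<COMPARTMENT_OCID>",
    summarize_metrics_data_details=details,
)

for item in response.data:
    for point in item.aggregated_datapoints:
        print(point.timestamp, point.value)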
Troubleshooting Common Failover Issues
Issue 1: Applications Not Reconnecting After Failover
Symptoms:
- Persistent connection errors after failover
- Application appears "stuck" or unresponsive
- Connection pool shows zero active connections
Causes:
- Application using cached, instance-specific IP addresses
- No retry logic implemented
- Connection pool not detecting stale connections
Solutions:
- Ensure application uses DB system endpoint hostname, not IP
- Configure connection pool with connection validation:
// HikariCP example
config.setConnectionTestQuery("SELECT 1");
config.setValidationTimeout(3000);
- Implement retry logic with exponential backoff
- Set appropriate connection timeout values
Issue 2: Frequent Transaction Rollbacks
Symptoms:
- High rate of Error 3100 messages
- Transactions consistently failing
- Data inconsistencies
Causes:
- Transactions too large for shape's memory
- Long-running transactions spanning failover window
- Non-idempotent transaction design
Solutions:
- Break large transactions into smaller batches
- Reduce transaction duration
- Use appropriate shape for transaction size requirements
- Design idempotent operations that can be safely retried
Issue 3: HeatWave Cluster Not Ready After Failover
Symptoms:
- Analytics queries failing after failover
- rapid_plugin_bootstrapped shows NO
Causes:
- Cluster still reloading data from Storage Layer
- Cluster deleted and recreating (different AD switchover)
- Storage Layer issues
Solutions:
- Wait for automatic reload to complete:
SHOW GLOBAL STATUS LIKE 'rapid_plugin_bootstrapped';
- Monitor reload progress:
SELECT * FROM performance_schema.rpd_nodes;
- If reload fails, manually reload the schema (sys.heatwave_load takes schema names):
CALL sys.heatwave_load(JSON_ARRAY('your_schema'), NULL);
- Or reload a single table with:
ALTER TABLE your_schema.your_table SECONDARY_LOAD;
Identifying and testing failover in MySQL HeatWave high availability DB systems is a critical skill for database administrators and DevOps teams. By understanding the error messages that indicate failover, implementing proper monitoring, and regularly testing switchover scenarios in non-production environments, you can ensure your applications handle failover gracefully with minimal disruption.
Key takeaways:
- Recognize failover indicators: Watch for specific error codes (1290, 3100, 2013) and console messages
- Use switchover for testing: Never test on production; use manual switchover to simulate failover safely
- Monitor thoroughly: Track error logs, transaction status, and application behavior during and after failover
- Implement retry logic: Applications must handle connection failures and transaction rollbacks gracefully
- Plan recovery: After automatic failover, assess whether to switch back during a maintenance window
By following these best practices and regularly testing your disaster recovery procedures, you can build confidence in your high availability architecture and ensure your applications remain resilient in the face of infrastructure failures.