Migrating relational databases to an AWS S3 data lake using AWS DMS, Part I –
AWS Database Migration Service (AWS DMS) is a cloud service to migrate relational databases, NoSQL databases, data warehouses, and other types of data stores into the AWS Cloud, or between cloud and on-premises setups, efficiently and securely. DMS supports several types of source and target databases, such as Oracle, Microsoft SQL Server, MySQL, PostgreSQL, Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3.
Observations during the data migration –
While designing and building an AWS S3 data lake and a data warehouse in Amazon Redshift, the data sources were on-premises Oracle, MS SQL Server, MySQL, and PostgreSQL relational databases, plus MongoDB. AWS DMS was used for the initial full load and the daily incremental data transfer from these sources into AWS S3. This series of posts explains the various challenges faced during the actual data migration with different relational databases.
Modified date not populated properly at the source
AWS DMS is used for full load and change data capture (CDC) from source databases. AWS DMS captures changed records based on the transaction logs, but a properly updated modified-date column helps apply deduplication logic and extract the latest modified record for a given row on the target in S3.
In case a modified date is not available for a table, or it is not updated properly, AWS DMS provides transformation rules to add a new column while extracting data from the source database. Here the AR_H_CHANGE_SEQ header helps to add a new column whose value is a unique incrementing number from the source database, consisting of a timestamp and an auto-incrementing number.
The code example below adds a new column, DMS_CHANGE_SEQ, to the target, populated with a unique incrementing number from the source. This is a 35-digit unique number: the first 16 digits hold the timestamp and the next 19 digits hold a record ID number incremented by the database.
{
  "rule-type": "transformation",
  "rule-id": "2",
  "rule-name": "2",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%"
  },
  "rule-action": "add-column",
  "value": "DMS_CHANGE_SEQ",
  "expression": "$AR_H_CHANGE_SEQ",
  "data-type": {
    "type": "string",
    "length": 100
  }
}
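Downstream, this column can drive the deduplication step mentioned above: for each primary key, keep only the row with the highest DMS_CHANGE_SEQ. A minimal sketch in Python follows; the row layout and the key/column names are illustrative, not from the actual pipeline.

```python
# Deduplicate CDC records on the target: for each primary key, keep the
# row with the highest DMS_CHANGE_SEQ (the column added by the rule above).
# Row layout and key/column names here are illustrative assumptions.

def dedupe_latest(rows, key="id", seq="DMS_CHANGE_SEQ"):
    latest = {}
    for row in rows:
        k = row[key]
        # AR_H_CHANGE_SEQ values are fixed-width digit strings, so a
        # string comparison orders them the same way as numbers.
        if k not in latest or row[seq] > latest[k][seq]:
            latest[k] = row
    return list(latest.values())

# 35-digit sequence values: 16-digit timestamp part + 19-digit record ID part
rows = [
    {"id": 1, "name": "old",  "DMS_CHANGE_SEQ": "0" * 15 + "1" + "0" * 18 + "1"},
    {"id": 1, "name": "new",  "DMS_CHANGE_SEQ": "0" * 15 + "2" + "0" * 18 + "4"},
    {"id": 2, "name": "only", "DMS_CHANGE_SEQ": "0" * 15 + "1" + "0" * 18 + "2"},
]
deduped = dedupe_latest(rows)  # two rows survive; id 1 keeps "new"
```

In practice the same keep-the-max-sequence logic would typically run in the query engine reading S3 (for example, a window function partitioned by primary key), but the principle is the same.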
Enabling supplemental logging for Oracle as a source
For Oracle as a source database, AWS DMS requires that at least minimal supplemental logging is enabled on the source database to capture ongoing changes. This adds additional column information to the redo logs so the changes can be identified at the source.
Supplemental logging can be enabled for primary keys, unique keys, a set of columns, or all columns. All-column supplemental logging captures every column of a changed record on the source tables, which helps overwrite the full record in the target, in this case the AWS S3 layer.
All-column supplemental logging increases the redo log size, as every column of the table is written to the logs. One needs to size the redo and archive logs accordingly to accommodate this additional information.
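As a sketch, minimal database-level supplemental logging plus all-column logging on a specific table can be enabled with statements like the ones below. The schema and table names are illustrative; the exact set of tables and logging levels depends on your DMS task configuration.

```sql
-- Minimal (database-level) supplemental logging, required by DMS CDC
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;

-- All-column supplemental logging for one table
-- (schema and table names are illustrative)
ALTER TABLE app_schema.orders ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;

-- Verify the database-level settings
SELECT supplemental_log_data_min, supplemental_log_data_all
  FROM v$database;
```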
AWS DMS performance issue – network bandwidth
The initial full load from the on-premises sources such as Oracle and MS SQL Server worked fine, and so did change data capture most of the time.
There was a moderate number of transactions during most of the day in a given month, except for the daily end-of-business-day processing post-midnight and the month-end activities. It was observed that DMS migration tasks went out of sync or failed during these windows due to high latency.
There could be different reasons for high latency, such as source and target latency, load on the source and target systems, replication instance configuration, and network speed and bandwidth.
In this case, the source and target systems were not under heavy load, and the replication instance was well configured to handle the incoming data, with minimal load on it.
Reviewing the source, target, and replication instance metrics in the logs led to the observations below –
CDCLatencySource – the gap, in seconds, between the last event captured from the source endpoint and the current system timestamp of the AWS DMS instance.
Generally, CDCIncomingChanges (the total number of change events at a point in time that are waiting to be applied to the target) kept increasing from zero to thousands during the early-morning reconciliation activities.
Similarly, CDCLatencySource kept increasing from zero to a few thousand seconds, up to 10–12K seconds, during the daily post-midnight reconciliation activities, and reached up to 40K seconds during month-end activities.
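These metrics are published to CloudWatch, so latency spikes like the ones above can be detected programmatically. The sketch below scans a GetMetricStatistics-shaped response for CDCLatencySource datapoints above a threshold; the response here is hand-written sample data, and the one-hour threshold is an illustrative choice, not a DMS recommendation.

```python
# Sketch: flag CDCLatencySource datapoints above a latency threshold.
# The dict mirrors the shape of a CloudWatch GetMetricStatistics response
# for the AWS/DMS namespace; the datapoint values are made up.

response = {
    "Label": "CDCLatencySource",
    "Datapoints": [
        {"Timestamp": "2023-05-01T00:00:00Z", "Maximum": 120.0,   "Unit": "Seconds"},
        {"Timestamp": "2023-05-01T01:00:00Z", "Maximum": 9800.0,  "Unit": "Seconds"},
        {"Timestamp": "2023-05-01T02:00:00Z", "Maximum": 40200.0, "Unit": "Seconds"},
    ],
}

THRESHOLD_SECONDS = 3600  # alert when source latency exceeds one hour

spikes = [dp for dp in response["Datapoints"] if dp["Maximum"] > THRESHOLD_SECONDS]
for dp in spikes:
    print(dp["Timestamp"], dp["Maximum"])
```

In a real setup the response would come from a CloudWatch API call filtered by the replication task and instance identifiers, feeding an alarm instead of a print loop.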
Upon further log analysis and review of other metrics, it was observed that –
The AWS DMS metric NetworkReceiveThroughput shows the incoming traffic on the DMS replication instance, covering both customer database and DMS traffic. This metric helps identify network-related issues, if any, between the source database and the DMS replication instance.
Network bandwidth issue between source and AWS network
It was observed that the network receive throughput peaked at about 30 MB/s (250 Mb/s) over the VPN connection between the source and AWS, and that connection was also shared with other applications.
Connectivity between the source and target databases is critical for a successful data migration. One should make sure that sufficient bandwidth is set up between the on-premises (or other cloud) source databases and the AWS environment before performing the actual data migration.
A VPN tunnel such as AWS Site-to-Site VPN or Oracle Cloud Infrastructure (OCI) Site-to-Site VPN can provide throughput up to 1.25 Gbps. This is sufficient for migrating small tables or tables with light DML traffic.
For large data migrations with heavy transactions per second on the tables, one should consider AWS Direct Connect. It provides a dedicated private connection, with bandwidths such as 1 Gbps and 10 Gbps supported.
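A quick back-of-envelope calculation shows why the shared VPN became the bottleneck. The ~30 MB/s figure is the observed VPN throughput from this migration; the 100 GB backlog size is an illustrative assumption, not a measured value.

```python
# Back-of-envelope: how long a change backlog takes to drain at a given
# network throughput. 30 MB/s is the observed shared-VPN throughput;
# the 100 GB backlog is an assumed, illustrative figure.

def drain_seconds(backlog_gb, throughput_mb_per_s):
    return backlog_gb * 1024 / throughput_mb_per_s

# 100 GB of post-midnight changes over the shared ~30 MB/s VPN link:
vpn = drain_seconds(100, 30)      # roughly 3,400 seconds, close to an hour

# The same backlog over a 10 Gbps (~1,250 MB/s) Direct Connect link:
dx = drain_seconds(100, 1250)     # under two minutes
```

With the VPN also carrying other applications' traffic, the effective rate was even lower, which lines up with the multi-thousand-second CDCLatencySource values seen during reconciliation windows.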
Conclusion –
This is Part I of a multi-part series on relational database migration challenges with AWS DMS and the solutions implemented for them. AWS DMS network and performance related issues and other factors will be covered in the next posts. Most of the challenges mentioned in this series can occur during a database migration, and these solutions can serve as a reference.