Migrating relational databases to an Amazon S3 data lake on AWS using AWS DMS: Part V
This is the fifth blog in a multi-part series on considerations and observations from migrating relational databases to an Amazon S3 data lake using AWS DMS.
In the previous blogs, the following considerations were discussed:
- Part I – Oracle DB CDC processing and network bandwidth
- Part II – MS SQL Server LOB and Datetime columns
- Part III – MongoDB load balancing and deleted records
- Part IV – Network bandwidth issues during the migrations, in detail
This blog focuses on full load data migration settings and configurations for an Oracle source; most of the observations apply to other RDBMS sources as well.
Problem statement:
With several major relational databases in scope for migration to the AWS cloud, the actual data movement was the critical task of this requirement. For small and medium-sized tables across all the databases, full load data migration completed very quickly.
For Oracle, however, with large tables in scope and a heavy transaction rate on the source, full load tasks took quite a long time: more than a day for the 12-15 large tables, and with ETL transformations and data validations the complete process used to take 2-3 days.
Source tables in scope for this Oracle data load were as follows:
- Total tables: 120
- Large to very large tables: 15
- Rows in large tables: 100M - 500M
The DMS full load tasks were configured with a simple selection rule, shown below for one of the tables:
{
  "rule-type": "selection",
  "rule-id": "1",
  "rule-name": "1",
  "object-locator": {
    "schema-name": "fin",
    "table-name": "charge_dtls"
  },
  "rule-action": "include",
  "filters": []
}
Also, this complete data loading and processing was performed 3-4 times during the development phase, for reasons such as the following:
- DMS tasks failed because the source was unavailable for a few hours over the weekend for maintenance activities
- The modified timestamp used to process changed records was not populated correctly in the source tables
- Changes in business requirements, and reloads due to data issues
These frequent reloads added an overall delay of 2-3 days for each complete load-and-process cycle of the large tables, along with additional cost.
Solution overview:
After analyzing the configured full load tasks and the overall time they required to complete, it was decided to use the parallel load options of full load tasks to improve load completion times.
The full load process is straightforward to understand. Data is extracted from the source in bulk and loaded directly into the target. You can specify the number of tables to extract and load in parallel on the AWS DMS console under Advanced Settings.
To speed up the migration and make it more efficient, you can use parallel load for selected relational tables, views, and collections. In other words, AWS DMS can migrate a single segmented table, view, or collection using several threads in parallel. To do this, it splits the full-load task into threads, with each table segment allocated to its own thread.
One can use the parallel load option for Oracle, Microsoft SQL Server, MySQL, Sybase, and IBM Db2 LUW sources, based on partitions and subpartitions. Doing this can improve the overall full load duration.
One can configure a parallel full load of partitioned data sources to Amazon S3 targets. This approach improves the overall load times for migrating partitioned data from supported source database engines to the S3 target.
To improve the load times of partitioned source data, one can create S3 target subfolders mapped to the partitions of every table in the source database. These partition-bound subfolders allow AWS DMS to run parallel processes to populate each subfolder on the target.
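As an illustration, assuming a hypothetical bucket name and yearly partitions (both of which will differ in a real task), the resulting target layout would look roughly like this, with one subfolder per source partition that AWS DMS can populate in parallel:

s3://my-data-lake-bucket/fin/charge_dtls/2024/...
s3://my-data-lake-bucket/fin/charge_dtls/2025/...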
To configure a parallel full load to an S3 target, AWS DMS supports three parallel-load rule types in the table-settings rule of table mapping:
- partitions-auto
- partitions-list
- ranges
Parallel load comes in different configurations for loading data in parallel using separate threads. Here, the ranges type was used, with column-value boundaries defined on a specific column, as follows:
{
  "rule-type": "table-settings",
  "rule-id": "1",
  "rule-name": "1",
  "object-locator": {
    "schema-name": "fin",
    "table-name": "charge_dtls"
  },
  "parallel-load": {
    "type": "ranges",
    "columns": ["charge_Id"],
    "boundaries": [
      ["1000000"],
      ["2000000"],
      ["3000000"],
      ["4000000"],
      ["5000000"],
      ["6000000"],
      ["7000000"],
      ["8000000"],
      ["9000000"],
      ["10000000"]
    ]
  }
}
Here the charge_dtls table is loaded in parallel using the ranges option on the charge_Id column. With ten boundaries defined, AWS DMS creates eleven segments (the last covers values above the final boundary), each loaded on its own thread.
First, one needs to identify a column on which to create the boundaries; a column with a primary key, unique constraint, or index is a good choice.
The ranges type helps in scenarios where the table has no data partitions, as was the case here.
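If no single column splits the data evenly, the ranges type also accepts multiple columns, with each boundary supplying one value per column. Below is a minimal sketch, assuming a hypothetical charge_year column alongside charge_Id; the boundary values are purely illustrative:

{
  "rule-type": "table-settings",
  "rule-id": "2",
  "rule-name": "2",
  "object-locator": {
    "schema-name": "fin",
    "table-name": "charge_dtls"
  },
  "parallel-load": {
    "type": "ranges",
    "columns": ["charge_year", "charge_Id"],
    "boundaries": [
      ["2024", "5000000"],
      ["2025", "5000000"]
    ]
  }
}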
In our case it reduced the overall full load time for these tables by 50-60%, and the complete data loading, ETL jobs, and data validation activity finished within 1.5 days, compared with 2-3 days before.
A full load task in DMS can also be configured with other parallel load options, such as the following:
- Partitions-auto: for tables with partitions already defined at creation time. For example, if the table above were created with a partition on charge_year, it would be configured as follows:
{
  "rule-type": "table-settings",
  "rule-id": "1",
  "rule-name": "1",
  "object-locator": {
    "schema-name": "fin",
    "table-name": "charge_dtls"
  },
  "parallel-load": {
    "type": "partitions-auto"
  }
}
- Partitions-list: the above table can be loaded using a partition list on the same partition column charge_year, to filter and fetch data only from specific partitions. The example below fetches data only for the 2024 and 2025 partition values of charge_year:
{
  "rule-type": "table-settings",
  "rule-id": "1",
  "rule-name": "1",
  "object-locator": {
    "schema-name": "fin",
    "table-name": "charge_dtls"
  },
  "parallel-load": {
    "type": "partitions-list",
    "partitions": ["2024", "2025"]
  }
}
MaxFullLoadSubTasks is the task setting that controls the number of tables or table segments loaded in parallel, with a default of 8 and a maximum of 49. It needs to be tuned accordingly when running full load with parallel configurations, or when loading many small tables in parallel, as in the sketch below.
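For reference, here is a minimal sketch of the relevant fragment of the task settings JSON; the value 16 is only an illustrative choice, and all other FullLoadSettings fields are omitted:

{
  "FullLoadSettings": {
    "MaxFullLoadSubTasks": 16
  }
}

With the ten boundaries in the ranges example above (which yield eleven segments), the default of 8 would cap concurrency, so raising this value lets all segments load at the same time.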
Conclusion:
In this blog, we discussed the parallel load options available for full load data migration, which improve full load performance and reduce the required data loading time. These options are often overlooked, and data loading is left to complete with the default configurations.