Manish Kumar

Posted on May 18

Cross‑Account Database Migration with AWS DMS: A Production‑Ready Architecture

#aws #dms #architecture

This blog presents an objective, engineer‑first view of cross‑account AWS DMS setups for multi‑account AWS environments. It covers:

Why do you need cross‑account DMS?
Connectivity patterns: VPC peering, public‑internet, Transit Gateway, and PrivateLink.
Terraform‑based, multi‑account IaC with provider aliases and workspaces.
Validation, monitoring, alerts, costs, and common DMS limits.

1. Why Cross‑Account DMS Is Needed

Organizations that run multiple AWS accounts often keep source and target databases in different accounts for reasons such as:

Security and governance (least‑privilege, separation of duties).
Billing and charge‑back (each team owns its own account).
Compliance (ISO, SOC, PCI environments strictly isolated).
Mergers and migrations (green‑field target account vs legacy source account).

AWS DMS itself does not care about the account boundary; all it requires is:

Network reachability from the DMS replication instance in one account to the source and target endpoints.
Enough IAM permissions and security‑group rules to read and write data.

The core challenge is therefore networking and permissions, not DMS configuration.

2. Connectivity Patterns Overview

For cross‑account DMS, you typically choose one of four patterns:

VPC peering between the DMS‑account VPC and the source‑DB‑account VPC.
Public‑internet‑based with tight IP allowlists.
Transit Gateway‑based routing across many accounts.
PrivateLink (if using services that expose endpoint‑services).

This blog focuses first on VPC peering and public‑internet patterns, then expands on Transit Gateway and PrivateLink.

3. Pattern 1 — VPC Peering

3.1 Architecture

Account A (DMS account): hosts the DMS replication instance in a private VPC.
Account B (Source account): hosts the source database (for example, Aurora MySQL) in its own VPC.
A VPC peering connection links the two VPCs.
Route tables in both VPCs route each other’s CIDRs via the peering connection.
Security groups on the database allow inbound traffic from the DMS VPC CIDR or the DMS security‑group ID.

All traffic stays inside AWS’s backbone; no public IP exposure is required.

3.2 Console Steps

Step 1: Create VPC peering (Account A)

VPC → Peering Connections → Create Peering Connection.
Name: dms‑src‑db‑peering.
VPC requester: the VPC where DMS runs.
VPC accepter: select Another account and enter Account B’s account ID plus the ID of the VPC that hosts the source DB.

Step 2: Accept peering (Account B)

In Account B’s VPC console, accept the pending peering request.
Wait for status Active.

Step 3: Update route tables

In Account A, add a route to the source‑DB VPC CIDR via the peering connection.
In Account B, add a route to the DMS VPC CIDR via the same peering connection.
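Steps 1 to 3 can also be captured in Terraform. The following is a minimal standalone sketch, assuming the aws.account_a and aws.account_b provider aliases defined in section 7.1; every variable is a hypothetical input rather than a value from a real environment.

```hcl
# Hypothetical inputs; supply real IDs and CIDRs for your environment.
variable "account_b_id" { type = string }
variable "dms_vpc_id" { type = string }
variable "source_vpc_id" { type = string }
variable "dms_vpc_cidr" { type = string }
variable "source_vpc_cidr" { type = string }
variable "dms_route_table_id" { type = string }
variable "source_route_table_id" { type = string }

# Step 1: Account A requests the peering connection.
resource "aws_vpc_peering_connection" "dms_to_source" {
  provider      = aws.account_a
  vpc_id        = var.dms_vpc_id
  peer_vpc_id   = var.source_vpc_id
  peer_owner_id = var.account_b_id
  auto_accept   = false # cross-account requests are accepted on the peer side

  tags = { Name = "dms-src-db-peering" }
}

# Step 2: Account B accepts it.
resource "aws_vpc_peering_connection_accepter" "source_accepts" {
  provider                  = aws.account_b
  vpc_peering_connection_id = aws_vpc_peering_connection.dms_to_source.id
  auto_accept               = true
}

# Step 3: one route in each account, pointing at the other VPC's CIDR.
resource "aws_route" "dms_to_source" {
  provider                  = aws.account_a
  route_table_id            = var.dms_route_table_id
  destination_cidr_block    = var.source_vpc_cidr
  vpc_peering_connection_id = aws_vpc_peering_connection.dms_to_source.id
}

resource "aws_route" "source_to_dms" {
  provider                  = aws.account_b
  route_table_id            = var.source_route_table_id
  destination_cidr_block    = var.dms_vpc_cidr
  vpc_peering_connection_id = aws_vpc_peering_connection_accepter.source_accepts.id
}
```

If the subnets use more than one route table, each table that serves the DMS or database subnets needs its own aws_route.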

Step 4: Configure security groups (Account B)

In the DB’s security group, add an inbound rule for the database port (3306, 5432, 5439, etc.) with the DMS VPC CIDR as the source.
In Account A, ensure the DMS security group permits outbound traffic to the DB VPC CIDR.
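The security-group rules from Step 4 might look like the following in Terraform. This is a sketch under assumptions: the security-group IDs and CIDRs are hypothetical variables, the port is MySQL's 3306, and the provider aliases are the ones from section 7.1.

```hcl
# Hypothetical inputs.
variable "source_db_sg_id" { type = string } # DB security group in Account B
variable "dms_sg_id" { type = string }       # DMS security group in Account A
variable "dms_vpc_cidr" { type = string }
variable "source_vpc_cidr" { type = string }

# Account B: allow the DMS VPC to reach the database port.
resource "aws_security_group_rule" "db_ingress_from_dms" {
  provider          = aws.account_b
  type              = "ingress"
  security_group_id = var.source_db_sg_id
  protocol          = "tcp"
  from_port         = 3306
  to_port           = 3306
  cidr_blocks       = [var.dms_vpc_cidr]
  description       = "MySQL from the DMS VPC over peering"
}

# Account A: allow the DMS instance to open connections to the DB VPC.
resource "aws_security_group_rule" "dms_egress_to_db" {
  provider          = aws.account_a
  type              = "egress"
  security_group_id = var.dms_sg_id
  protocol          = "tcp"
  from_port         = 3306
  to_port           = 3306
  cidr_blocks       = [var.source_vpc_cidr]
  description       = "MySQL to the source DB VPC over peering"
}
```

Narrowing cidr_blocks to the DMS subnets rather than the whole VPC CIDR tightens the rule further.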

Step 5: Create DMS resources (Account A)

Create a DMS replication instance in the DMS VPC.
Create Source and Target endpoints.
Create a Replication task with full‑load‑and‑cdc or one‑time full‑load.

4. Pattern 2 — Public Internet

If VPC peering is not allowed (CIDR overlap, governance, or cross‑account restriction), you can still use DMS over the public internet, provided you tighten the security model.

4.1 How it works

The DMS replication instance in Account A is in a public or NAT‑gateway‑backed subnet.
The source DB in Account B allows inbound traffic only from the DMS instance’s public IP or the NAT gateway’s public IP.
Optionally, an SNS + Lambda stack in Account B updates the DB security group automatically when the DMS instance IP changes.
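For the static case, where the DMS instance leaves Account A through a NAT gateway with a fixed Elastic IP, the allowlist can be a single /32 rule. The sketch below assumes that setup; the variable names are hypothetical and the provider alias comes from section 7.1.

```hcl
# Hypothetical inputs.
variable "dms_nat_public_ip" { type = string } # Elastic IP of the NAT gateway in Account A
variable "source_db_sg_id" { type = string }   # DB security group in Account B

# Account B: admit only the single public IP that DMS traffic egresses from.
resource "aws_security_group_rule" "db_ingress_from_dms_nat" {
  provider          = aws.account_b
  type              = "ingress"
  security_group_id = var.source_db_sg_id
  protocol          = "tcp"
  from_port         = 3306
  to_port           = 3306
  cidr_blocks       = ["${var.dms_nat_public_ip}/32"]
  description       = "MySQL from the DMS NAT gateway only"
}
```

Because this traffic crosses the public internet, it is also worth setting ssl_mode on the aws_dms_endpoint (for example, "require") so the connection is encrypted.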

4.2 Security‑group automation example

A typical pattern:

SNS topic in Account A publishes replication‑instance‑creation events.
Lambda function in Account B is subscribed to that topic.
Lambda (using a role it can assume in Account A for the DMS call):
- Calls dms:DescribeReplicationInstances to get the public IP.
- Calls ec2:AuthorizeSecurityGroupIngress and ec2:RevokeSecurityGroupIngress on the DB security group.
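The publishing half of this pattern can be sketched in Terraform as an SNS topic plus a DMS event subscription. This is an illustration under assumptions: the resource names and the replication_instance_id variable are hypothetical, and the cross-account topic policy, the Lambda subscription, and the Lambda code itself are deliberately left out.

```hcl
# Hypothetical input: the identifier of the replication instance to watch.
variable "replication_instance_id" { type = string }

# Account A: topic that receives replication-instance events.
resource "aws_sns_topic" "dms_events" {
  provider = aws.account_a
  name     = "dms-replication-instance-events"
}

# Account A: ask DMS to publish lifecycle events for the instance to the topic.
resource "aws_dms_event_subscription" "instance_lifecycle" {
  provider         = aws.account_a
  name             = "dms-instance-lifecycle"
  enabled          = true
  source_type      = "replication-instance"
  source_ids       = [var.replication_instance_id]
  event_categories = ["creation", "failover", "configuration change"]
  sns_topic_arn    = aws_sns_topic.dms_events.arn
}
```

For the Lambda in Account B to subscribe, the topic also needs an access policy that grants Account B sns:Subscribe.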

5. Transit Gateway Option

5.1 When Transit Gateway Is Better Than Peering

Transit Gateway is preferable to VPC peering in large, multi‑account environments because:

It avoids an N‑squared mesh of VPC peers.
Centralized routing and policy enforcement simplify compliance and change control.
You can integrate VPN, Direct Connect, and multiple accounts through a single hub.

Tradeoffs vs peering:

Peering is simpler for one‑off or few‑account links but does not scale.
Transit Gateway scales well but adds cost and routing‑table complexity.

5.2 Scaling and Route Management

In a Transit Gateway‑based setup:

Each account’s VPC is attached to the Transit Gateway (TGW).
TGW route tables define which VPCs can talk to which others (for example, “DMS VPCs” → “DB VPCs”).
You can group attachments into a small number of TGW route tables (for example, one per environment, tagged Environment = Prod) to control connectivity at scale rather than per-VPC.
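The "DMS VPCs" to "DB VPCs" idea can be expressed with two TGW route tables. The sketch below assumes the Transit Gateway and both VPC attachments already exist, with default route-table association and propagation disabled on the TGW; the IDs are hypothetical variables and the provider alias is the one from section 7.1.

```hcl
# Hypothetical inputs: an existing TGW and its two VPC attachments.
variable "transit_gateway_id" { type = string }
variable "dms_attachment_id" { type = string }
variable "db_attachment_id" { type = string }

# Route table used by traffic arriving from DMS VPCs.
resource "aws_ec2_transit_gateway_route_table" "dms" {
  provider           = aws.account_a
  transit_gateway_id = var.transit_gateway_id
  tags               = { Name = "dms-vpcs", Environment = "Prod" }
}

# Route table used by traffic arriving from DB VPCs.
resource "aws_ec2_transit_gateway_route_table" "db" {
  provider           = aws.account_a
  transit_gateway_id = var.transit_gateway_id
  tags               = { Name = "db-vpcs", Environment = "Prod" }
}

# DMS attachment looks up routes in "dms", which learns only DB VPC CIDRs.
resource "aws_ec2_transit_gateway_route_table_association" "dms" {
  provider                       = aws.account_a
  transit_gateway_attachment_id  = var.dms_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.dms.id
}

resource "aws_ec2_transit_gateway_route_table_propagation" "db_into_dms" {
  provider                       = aws.account_a
  transit_gateway_attachment_id  = var.db_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.dms.id
}

# Return path: DB attachment looks up routes in "db", which learns only DMS VPC CIDRs.
resource "aws_ec2_transit_gateway_route_table_association" "db" {
  provider                       = aws.account_a
  transit_gateway_attachment_id  = var.db_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.db.id
}

resource "aws_ec2_transit_gateway_route_table_propagation" "dms_into_db" {
  provider                       = aws.account_a
  transit_gateway_attachment_id  = var.dms_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.db.id
}
```

With this layout, adding another database VPC means one more association and one more propagation, not a new set of per-VPC peering links.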

Each VPC attachment is billed per hour, and data sent through the TGW is billed per GB, so over-provisioning attachments or allowing unnecessary traffic can increase cost quickly.

5.3 Overlapping VPCs and Cross‑Account Complexity

CIDR overlaps are still a problem: Transit Gateway cannot route between VPCs whose CIDRs overlap within the same route table.
Best practice: assign non‑overlapping CIDR ranges at the organization level (for example, 10.0.x.0/24 per account).
When multiple teams share a Transit Gateway, ensure resource‑based permissions and tagging are used to prevent unauthorized VPC attachments.
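One way to keep attachments under control is to share the Transit Gateway through AWS RAM with named accounts only and to require explicit acceptance of each attachment. The sketch below assumes Account A owns the TGW and uses the provider aliases from section 7.1; all variables are hypothetical. If the accounts are not in the same AWS Organization with RAM sharing enabled, Account B would also need to accept the share invitation, which is not shown.

```hcl
# Hypothetical inputs.
variable "account_b_id" { type = string }
variable "source_vpc_id" { type = string }
variable "source_subnet_ids" { type = list(string) }

# Account A: hub TGW that does NOT auto-accept attachments from other accounts.
resource "aws_ec2_transit_gateway" "hub" {
  provider                       = aws.account_a
  description                    = "Hub for DMS and database VPCs"
  auto_accept_shared_attachments = "disable"
}

# Account A: share the TGW with Account B only.
resource "aws_ram_resource_share" "tgw" {
  provider                  = aws.account_a
  name                      = "dms-tgw-share"
  allow_external_principals = true
}

resource "aws_ram_resource_association" "tgw" {
  provider           = aws.account_a
  resource_arn       = aws_ec2_transit_gateway.hub.arn
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

resource "aws_ram_principal_association" "account_b" {
  provider           = aws.account_a
  principal          = var.account_b_id
  resource_share_arn = aws_ram_resource_share.tgw.arn
}

# Account B: request an attachment for the source DB VPC.
resource "aws_ec2_transit_gateway_vpc_attachment" "source" {
  provider           = aws.account_b
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = var.source_vpc_id
  subnet_ids         = var.source_subnet_ids

  depends_on = [
    aws_ram_resource_association.tgw,
    aws_ram_principal_association.account_b,
  ]
}

# Account A: the TGW owner explicitly accepts the attachment.
resource "aws_ec2_transit_gateway_vpc_attachment_accepter" "source" {
  provider                      = aws.account_a
  transit_gateway_attachment_id = aws_ec2_transit_gateway_vpc_attachment.source.id
}
```

VPC route tables in each account still need routes that point at the Transit Gateway; those are omitted here for brevity.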

6. PrivateLink Option

PrivateLink is useful when you want to expose the source or target database as a VPC endpoint service and let DMS reach it without routing through the public internet or a VPC‑peering mesh.

The database sits behind a Network Load Balancer (NLB) that is published as a VPC endpoint service.
The DMS VPC creates an interface VPC endpoint pointing to that service.
Traffic stays private, and the NLB can be in a different account than the DMS instance.

PrivateLink is ideal for SaaS‑style databases or centrally managed data platforms, but it adds cost and operational complexity compared with direct VPC peering or Transit Gateway.
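A PrivateLink setup could be sketched as follows. It assumes an NLB that already fronts the database in Account B (its target group and listener are not shown), the provider aliases from section 7.1, and hypothetical variable names throughout.

```hcl
# Hypothetical inputs.
variable "account_a_id" { type = string }
variable "db_nlb_arn" { type = string } # existing NLB in front of the database
variable "dms_vpc_id" { type = string }
variable "dms_subnet_ids" { type = list(string) }
variable "dms_endpoint_sg_id" { type = string }

# Account B: publish the NLB as an endpoint service, visible to Account A only.
resource "aws_vpc_endpoint_service" "source_db" {
  provider                   = aws.account_b
  acceptance_required        = false
  network_load_balancer_arns = [var.db_nlb_arn]
  allowed_principals         = ["arn:aws:iam::${var.account_a_id}:root"]
}

# Account A: interface endpoint in the DMS VPC that connects to the service.
resource "aws_vpc_endpoint" "dms_to_db" {
  provider           = aws.account_a
  vpc_id             = var.dms_vpc_id
  service_name       = aws_vpc_endpoint_service.source_db.service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = var.dms_subnet_ids
  security_group_ids = [var.dms_endpoint_sg_id]
}

# DNS name to use as server_name on the DMS source endpoint.
output "db_privatelink_dns_name" {
  value = aws_vpc_endpoint.dms_to_db.dns_entry[0].dns_name
}
```

Setting acceptance_required to true adds a manual approval step on the Account B side, at the price of one more resource to manage.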

7. Terraform Implementation (Enterprise Multi‑Account Pattern)

7.1 Multi‑account providers

For cross‑account DMS, configure separate AWS providers:

# main.tf
provider "aws" {
  alias = "account_a"

  region = "us-east-1"
  profile = "account-a"
}

provider "aws" {
  alias = "account_b"

  region = "us-east-1"
  profile = "account-b"
}

Then, in account‑specific modules:

# account_a/dms.tf
resource "aws_dms_replication_instance" "cross_account" {
  provider = aws.account_a

  replication_instance_id         = "dms-replication-instance-cross-account"
  replication_instance_class      = "dms.t3.medium"
  allocated_storage               = 50
  vpc_security_group_ids          = [aws_security_group.dms_sg.id]
  replication_subnet_group_id     = aws_dms_replication_subnet_group.dms_subnet.id
}

resource "aws_dms_endpoint" "src_db" {
  provider = aws.account_a

  endpoint_id         = "src-db"
  endpoint_type       = "source"
  engine_name         = "mysql"
  server_name         = "aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com"
  port                = 3306
  username            = "myuser"
  password            = "mypassword" # placeholder only; pass real credentials via a sensitive variable or Secrets Manager
}

resource "aws_dms_replication_task" "cross_account_task" {
  provider = aws.account_a

  replication_task_id         = "cross-account-migration-task"
  replication_instance_arn    = aws_dms_replication_instance.cross_account.replication_instance_arn
  source_endpoint_arn         = aws_dms_endpoint.src_db.endpoint_arn
  target_endpoint_arn         = aws_dms_endpoint.tgt_redshift.endpoint_arn
  migration_type              = "full-load-and-cdc"
  table_mappings              = jsonencode({
    rules = [
      {
        rule-type  = "selection"
        rule-id    = "1"
        rule-name  = "1"
        object-locator = {
          schema-name = "%"
          table-name  = "%"
        }
        rule-action = "include"
      }
    ]
  })
}
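The resources above reference a security group, a replication subnet group, and a target endpoint that the listing does not define. A possible shape for them is sketched below; every variable and value is a hypothetical placeholder, and the Redshift target is assumed only because the task refers to tgt_redshift.

```hcl
# Hypothetical inputs.
variable "dms_vpc_id" { type = string }
variable "dms_subnet_ids" { type = list(string) } # at least two subnets in different AZs
variable "redshift_host" { type = string }
variable "redshift_database" { type = string }
variable "redshift_username" { type = string }
variable "redshift_password" {
  type      = string
  sensitive = true
}

resource "aws_security_group" "dms_sg" {
  provider = aws.account_a
  name     = "dms-replication-sg"
  vpc_id   = var.dms_vpc_id

  # Outbound only; DMS initiates connections to the source and target.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_dms_replication_subnet_group" "dms_subnet" {
  provider                             = aws.account_a
  replication_subnet_group_id          = "dms-cross-account-subnets"
  replication_subnet_group_description = "Subnets for the cross-account DMS instance"
  subnet_ids                           = var.dms_subnet_ids
}

resource "aws_dms_endpoint" "tgt_redshift" {
  provider      = aws.account_a
  endpoint_id   = "tgt-redshift"
  endpoint_type = "target"
  engine_name   = "redshift"
  server_name   = var.redshift_host
  port          = 5439
  database_name = var.redshift_database
  username      = var.redshift_username
  password      = var.redshift_password
}
```

DMS expects the dms-vpc-role IAM role to exist before a subnet group is created, and a Redshift target additionally needs an S3 staging bucket and access role; both are outside this sketch.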

7.2 Remote state and workspaces

Use remote state (for example, terraform { backend "s3" { ... } }) to coordinate changes between Account A and Account B.
Use Terraform workspaces (or environment‑specific stacks) to separate prod and staging without duplicating configuration.
In CI/CD pipelines, run plan and apply separately for each account, checking that VPC peering, TGW attachments, and security‑group rules are aligned.
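A remote-state and workspace setup along these lines might look like the following sketch. The bucket name, key, and per-environment instance classes are hypothetical, and state locking is left out for brevity.

```hcl
terraform {
  backend "s3" {
    bucket               = "example-org-terraform-state" # hypothetical bucket
    key                  = "dms/cross-account.tfstate"
    region               = "us-east-1"
    encrypt              = true
    workspace_key_prefix = "env" # non-default workspaces are stored under env/<workspace>/
  }
}

locals {
  # terraform.workspace is "default", "staging", "prod", and so on.
  environment = terraform.workspace

  # Hypothetical sizing per environment.
  instance_class_by_env = {
    staging = "dms.t3.medium"
    prod    = "dms.c5.large"
  }

  # Fall back to a small class for any workspace not listed above.
  replication_instance_class = lookup(local.instance_class_by_env, terraform.workspace, "dms.t3.medium")
}
```

Running terraform workspace select prod before plan and apply then picks the prod state file and the prod instance class from the same configuration.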

8. Troubleshooting

This section assumes you have already reviewed the core connectivity and IAM steps above.

8.1 Basic prereqs

Before deep‑troubleshooting:

Confirm DMS replication instance is available.
Confirm source and target databases are reachable from an EC2 test instance in the same VPC.
Check the DMS task log group in CloudWatch Logs for error-level messages (lines tagged ]E:).

Console:
DMS → Replication instances → select instance → Events tab.

CLI:

aws dms describe-replication-instances \
  --filters Name=replication-instance-id,Values=dms-replication-instance-cross-account \
  --query 'ReplicationInstances[].ReplicationInstanceStatus' \
  --profile account-a

8.2 Validation and connectivity commands

From a test EC2 instance in the same subnets as the DMS replication instance (you cannot log in to the replication instance itself):

TCP connectivity to MySQL/Aurora:

nc -zv aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com 3306

PostgreSQL:

nc -zv pg-source.cluster-xyz.us-east-1.rds.amazonaws.com 5432

DNS resolution:

nslookup aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com
dig aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com

If nc fails but nslookup succeeds, the problem is most likely a security-group rule or route-table entry, not DNS.

8.3 Security‑group and VPC checks (CLI)

Security group rules (Account B):

aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --profile account-b

Route tables (Accounts A and B):

aws ec2 describe-route-tables \
  --route-table-ids rtb-0123456789abcdef0 \
  --query 'RouteTables[].Routes[]' \
  --profile account-a

Public IP of DMS instance or NAT gateway (Account A):

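# DMS-managed network interfaces normally carry the description "DMSNetworkInterface"; adjust the filter if yours differs.
aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=DMSNetworkInterface" \
  --query 'NetworkInterfaces[].Association.PublicIp' \
  --profile account-a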

If the public IP changes, update the DB‑side security group accordingly or use Lambda automation.

8.4 IAM and DMS‑specific issues

Symptoms:

Endpoint test succeeds, but task fails to start.
Errors like Premigration assessment failed or Can’t start replication task.

Check:

The IAM roles DMS relies on (dms-vpc-role, dms-cloudwatch-logs-role, and any service-access role referenced by an endpoint) exist in Account A and grant the permissions the source and target engines need.
Cross-account Secrets Manager or S3 policies allow the DMS role from Account A to read the secret or S3 bucket in Account B.

CLI:

aws iam get-role \
  --role-name dms-vpc-role \
  --profile account-a

Also check the CloudWatch task logs for error-level messages (lines tagged ]E:) and inspect the terraform plan output to make sure the DMS IAM roles are created in the intended account.
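For the Secrets Manager case, the Account B side of the cross-account grant can be sketched as a resource policy on the secret. The variable names are hypothetical, and the sketch assumes the provider aliases from section 7.1.

```hcl
# Hypothetical inputs.
variable "source_db_secret_arn" { type = string }       # secret in Account B
variable "dms_secret_access_role_arn" { type = string } # role in Account A that DMS assumes

# Resource policy: let the Account A role read this one secret.
data "aws_iam_policy_document" "allow_dms_read" {
  provider = aws.account_b

  statement {
    sid       = "AllowDmsRoleFromAccountA"
    effect    = "Allow"
    actions   = ["secretsmanager:GetSecretValue", "secretsmanager:DescribeSecret"]
    resources = ["*"] # in a secret's resource policy, "*" refers to the secret itself

    principals {
      type        = "AWS"
      identifiers = [var.dms_secret_access_role_arn]
    }
  }
}

resource "aws_secretsmanager_secret_policy" "source_db" {
  provider   = aws.account_b
  secret_arn = var.source_db_secret_arn
  policy     = data.aws_iam_policy_document.allow_dms_read.json
}
```

Two further pieces are needed for this to work and are not shown: an identity policy on the Account A role that allows the same actions, and a customer-managed KMS key on the secret whose key policy lets that role decrypt (the default aws/secretsmanager key cannot be used across accounts).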

9. Monitoring & Alerting

9.1 CloudWatch metrics to watch

For each DMS task, monitor:

| Metric | What it tells you |
| --- | --- |
| `CDCLatencySource` / `CDCLatencyTarget` | Seconds of lag in capturing changes from the source / applying them to the target. |
| `FullLoadThroughputRowsTarget` / `FullLoadThroughputBandwidthTarget` | Full-load throughput to the target (rows per second / KB per second). |
| `StorageConsumption` | How much storage your task uses on the replication instance. |
| `CPUUtilization` | CPU load on the DMS instance. |
| `FreeStorageSpace` | Bytes of storage remaining. |

You can view these in the DMS console (Tasks → select task → Monitoring table) or via CloudWatch.

9.2 CloudWatch alarms

Create alarms for:

High CDC latency (for example, above 30 seconds for realtime‑style workloads).
High CPUUtilization (for example, >70% for prolonged periods).
Low FreeStorageSpace (for example, <10 GB).
Task status change to Failed or Stopped.

Example CLI alarm for CDC latency:

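# ReplicationTaskIdentifier is the task's resource ID (the last segment of the task ARN); replace the placeholder below.
aws cloudwatch put-metric-alarm \
  --alarm-name "dms-cross-account-cdc-latency-high" \
  --alarm-description "CDC latency too high on cross-account migration task" \
  --namespace AWS/DMS \
  --metric-name CDCLatencyTarget \
  --dimensions Name=ReplicationInstanceIdentifier,Value=dms-replication-instance-cross-account \
               Name=ReplicationTaskIdentifier,Value=TASK_RESOURCE_ID \
  --statistic Average \
  --period 300 \
  --threshold 30 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:dms-alerts \
  --profile account-a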
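The instance-level alarms from the list above can be kept in Terraform alongside the DMS resources. The sketch below covers the CPU and free-storage thresholds; the variable names are hypothetical, and the thresholds simply mirror the examples given earlier.

```hcl
# Hypothetical inputs.
variable "replication_instance_id" { type = string }
variable "alerts_topic_arn" { type = string }

# Fires when less than 10 GiB of storage remains for 15 minutes.
resource "aws_cloudwatch_metric_alarm" "dms_low_storage" {
  provider            = aws.account_a
  alarm_name          = "dms-cross-account-free-storage-low"
  namespace           = "AWS/DMS"
  metric_name         = "FreeStorageSpace"
  dimensions          = { ReplicationInstanceIdentifier = var.replication_instance_id }
  statistic           = "Minimum"
  period              = 300
  evaluation_periods  = 3
  comparison_operator = "LessThanThreshold"
  threshold           = 10 * 1024 * 1024 * 1024 # FreeStorageSpace is reported in bytes
  alarm_actions       = [var.alerts_topic_arn]
}

# Fires when average CPU stays above 70% for 15 minutes.
resource "aws_cloudwatch_metric_alarm" "dms_high_cpu" {
  provider            = aws.account_a
  alarm_name          = "dms-cross-account-cpu-high"
  namespace           = "AWS/DMS"
  metric_name         = "CPUUtilization"
  dimensions          = { ReplicationInstanceIdentifier = var.replication_instance_id }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  comparison_operator = "GreaterThanThreshold"
  threshold           = 70
  alarm_actions       = [var.alerts_topic_arn]
}
```

Both alarms use only the instance dimension, so one pair covers every task running on that replication instance.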

9.3 EventBridge notifications

Configure EventBridge rules to capture DMS‑related events:

DMS replication task state changes.
DMS instance state changes.

This lets you send notifications to SNS, SQS, or Lambda for automated remediation or escalation.
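A broad EventBridge rule for DMS could be sketched as below. It matches every event whose source is aws.dms and forwards it to an SNS topic; the topic ARN is a hypothetical variable, and the pattern can be narrowed with a detail-type filter once you know which events you care about.

```hcl
# Hypothetical input: an existing SNS topic for alerts.
variable "alerts_topic_arn" { type = string }

# Match all events emitted by AWS DMS in this account and region.
resource "aws_cloudwatch_event_rule" "dms_events" {
  provider      = aws.account_a
  name          = "dms-state-changes"
  description   = "Forward AWS DMS events for alerting"
  event_pattern = jsonencode({ source = ["aws.dms"] })
}

# Send matched events to the alerts topic.
resource "aws_cloudwatch_event_target" "dms_events_to_sns" {
  provider = aws.account_a
  rule     = aws_cloudwatch_event_rule.dms_events.name
  arn      = var.alerts_topic_arn
}
```

The SNS topic's access policy must allow events.amazonaws.com to publish to it; that policy is not shown here.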

10. Production Best Practices

Minimize DMS‑instance lifetime: spin up the replication instance only for the migration window; delete it afterward to save cost.
Pre‑test connectivity and permissions from a test EC2 instance before creating DMS resources.
Use VPC peering or TGW instead of public‑internet for CDC‑style workloads.
Automate security‑group updates with Lambda+SNS for IP‑based patterns.
Validate IAM roles and policies across accounts early in the pipeline.

11. Cost Considerations

11.1 DMS instance pricing

AWS DMS charges per hour for provisioned replication instances (for example, dms.t3.medium or dms.c5.large), plus storage; DMS Serverless is billed by the capacity it consumes.

On-demand: price per hour scales with instance size (from roughly a few cents per hour for dms.t3.micro to several dollars per hour for r6i.32xlarge-class instances).
Serverless: billed per DMS capacity unit (DCU) hour, which can be attractive for short-lived migrations but can become costly for long-running CDC.

Best practice: size the DMS instance based on the volume of changes per second and the expected migration window, then downsize or delete it once migration completes (a replication instance cannot be stopped, only modified or deleted).

11.2 NAT Gateway costs

If the DMS instance is in a private subnet, it typically uses a NAT gateway for outbound internet traffic:

NAT Gateway pricing: an hourly charge plus a per-GB data-processing charge, on top of standard data-transfer-out rates.
For large migrations, egress traffic through the NAT gateway can add up, especially if the DMS instance also talks to public S3 buckets or Secrets Manager endpoints.

To reduce cost, use VPC endpoints (for S3, Secrets Manager, etc.) instead of NAT‑gateway‑based public‑internet traffic.
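As an illustration, the two endpoints mentioned above could be declared like this. The sketch assumes us-east-1 (as in the provider blocks in section 7.1) and hypothetical variable names for the VPC, route tables, subnets, and security group.

```hcl
# Hypothetical inputs.
variable "dms_vpc_id" { type = string }
variable "dms_route_table_ids" { type = list(string) }
variable "dms_subnet_ids" { type = list(string) }
variable "endpoint_sg_id" { type = string } # must allow HTTPS (443) from the DMS subnets

# Gateway endpoint: S3 traffic follows a route-table entry instead of the NAT gateway.
resource "aws_vpc_endpoint" "s3" {
  provider          = aws.account_a
  vpc_id            = var.dms_vpc_id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.dms_route_table_ids
}

# Interface endpoint: Secrets Manager calls resolve to private IPs inside the VPC.
resource "aws_vpc_endpoint" "secretsmanager" {
  provider            = aws.account_a
  vpc_id              = var.dms_vpc_id
  service_name        = "com.amazonaws.us-east-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.dms_subnet_ids
  security_group_ids  = [var.endpoint_sg_id]
  private_dns_enabled = true
}
```

Gateway endpoints for S3 carry no hourly charge, while interface endpoints are billed per hour and per GB, so they pay off mainly when they replace meaningful NAT traffic.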

11.3 Transit Gateway vs VPC peering costs

VPC peering has no direct hourly charge, but it does not scale to many accounts and requires manual route‑table maintenance.
Transit Gateway:
- Hourly charge per attachment (for example, about $0.05/hour per VPC attachment in some regions).
- Data-processing charge per GB sent through the TGW (for example, $0.02/GB).
For a handful of cross‑account links, VPC peering is usually cheaper; for tens to hundreds of accounts, TGW can be more cost‑effective and operationally simpler despite the per‑attachment and per‑GB fees.

11.4 CloudWatch log and monitoring costs

CloudWatch Logs: DMS streams logs to CloudWatch, which you can retain for a configured period.
CloudWatch Alarms and Metrics: alarms are cheap per metric, but if you create hundreds of DMS‑task‑specific alarms, costs grow.
EventBridge: cost per million events is low, but complex routing rules are harder to maintain.

Best practice: use log‑retention policies (for example, 30 days) and reuse alarm templates for CDC‑latency and CPU across tasks.

11.5 Cross‑AZ transfer and CDC runtime impact

Data transferred between the DMS instance and a database in the same Availability Zone is free, but cross-AZ traffic (for example, DMS instance in us-east-1a and DB in us-east-1b) is billed at standard data-transfer rates and adds latency.
For long‑running CDC workloads, CDC latency is driven by:
- Quantity and size of transactions.
- Schema design (large tables, LOBs, triggers, etc.).
- DMS instance size and network path.
Larger DMS instances reduce latency but increase hourly cost; for some workloads, it may be cheaper to run CDC for a longer period on a smaller instance than upgrade to a large instance for a short burst.

12. Common DMS Limits / Pitfalls

AWS DMS enforces quotas per account and region; exceeding them causes API errors or resource creation failures.

12.1 Key limits

Tasks per replication instance: up to 200 tasks.
Tasks per account: up to 600 tasks.
Replication instances per account: up to 60 instances.
Endpoints per account: up to 1,000 endpoints.
API request throttling: bursts of up to 200 requests, with the quota refilling at 8 requests per second.

If you run many parallel migrations, plan to distribute tasks across multiple DMS instances or accounts.

12.2 CDC and large‑table behavior

CDC latency can spike if the DMS instance cannot keep up with the transaction rate.
Large tables can slow down full‑load; DMS may split the table into multiple sessions, but that still consumes memory and network.
LOB handling (for example, very large CLOBs or BLOBs) can exhaust DMS storage or cause CDC delays.

12.3 Unsupported features and engines

DMS does not support all engines and all features:

Some SQL Server row‑size changes above 8000 bytes can be mishandled.
Some advanced features such as cross‑engine triggers, complex stored‑procedure logic, and certain column‑types may not migrate correctly.

Always run a test‑migration and inspect CDC behavior before cutover.

13. Conclusion

Cross-account DMS deployments are conceptually straightforward but require careful attention to networking, IAM, and cost.

VPC peering is the simplest option for one‑to‑one account links.
Transit Gateway scales better for multi‑account enterprises but adds cost.
Public‑internet‑based setups work when peering is not allowed, but they demand strict IP‑based security‑group rules and automation.
Terraform with multi‑account providers, remote state, and workspaces can codify and repeat this pattern across environments.
Monitoring, alarms, and EventBridge notifications protect production CDC workloads from latency spikes or failures.
Cost‑aware engineering—right‑sizing DMS instances, minimizing NAT‑gateway usage, and respecting DMS limits—ensures that migrations are both fast and affordable.
