Using Mapping Tables to Merge Data with Auto-Number Keys Referenced by Other Tables

Introduction

The problem and solution can be demonstrated with the following picture:

Fig. 1 Expected result

We have two instances of databases with similar structures that were not designed to be combined. The tables use auto-number primary keys, which are also used as foreign keys — a common model that works fine until the data needs to be merged.

Now, we want to combine all this data into a target database. The goal is to retain both the original target parents with their children, plus the source parents with their children. Since we use an auto-number primary key, each source row in the target DB will get new IDs (e.g., 3 and 4), and these new values must be used for corresponding child rows.

We’ll discuss a solution based on source-ID-to-target-ID mapping tables. This approach was recently implemented in the ID-Remap Merge feature of KS DB Merge Tools. We’ll also briefly cover alternative solutions. The implementation details will focus on several popular relational DBMSs — Oracle, MySQL, SQL Server, PostgreSQL, and SQLite — but the approach can generally be applied to any relational DBMS.

Our Solution

When we insert source parent rows into the target database, we need to keep a mapping of the original parent ID from the source DB to its new value in the target DB. Then, we use this mapping to assign the correct parent ID values for child rows.

The examples below are provided for SQL Server, followed by clarifications for other DBMSs.

First, let’s create a mapping table in the target DB:

CREATE TABLE dbo.MapParent (
  SourceID int primary key,
  TargetID int not null);

Next, we loop through source Parent rows and generate INSERT statements for the target DB, keeping new parent ID values in the mapping:

INSERT INTO dbo.Parent (Name)
OUTPUT
  1, -- source ParentID
  INSERTED.ParentID
INTO dbo.MapParent
VALUES ('Source parent #1');

Result of this statement in the target DB will be a new Parent table row with new auto-number ID and new row in the dbo.MapParent table:

Fig. 6 MapParent table

The OUTPUT statement to extract the inserted ID value is specific to SQL Server. For other DBMSs, we can use the following methods:

Oracle:

DECLARE target_id NUMBER(19);
BEGIN
  INSERT INTO "Parent" ("Name")
  VALUES ('Source parent #1')
  RETURNING "ParentID" INTO target_id;

  INSERT INTO "MapParent" VALUES (1, target_id);

  COMMIT;
END;

MySQL / MariaDB:

INSERT INTO `Parent` (`Name`)
  VALUES ('Source parent #1');

INSERT INTO `MapParent`
  VALUES (2, LAST_INSERT_ID());

PostgreSQL:

INSERT INTO "Parent" ("Name")
 VALUES ('Source parent #1');

INSERT INTO "MapParent"
 VALUES (2, lastval());

SQLite:

INSERT INTO "Parent" ("Name")
  VALUES ('Source parent #1');
INSERT INTO "MapParent"
  VALUES (2, last_insert_rowid());

Going Further

Since we use auto-numbers for primary keys in our database, our Child primary key is most likely also an auto-number. But what if our Child primary key is used as a foreign key for another table, say GrandChild? We need to use the same approach: create a MapChild table with SourceID and TargetID columns to keep the mapping for child rows and populate it during INSERT INTO Child.

In real-world applications, we may have complex objects at the application level that are represented in the database as many rows in many related tables linked by foreign keys. In our Parent-Child-GrandChild example, we must ensure that we merge a parent row with all its children, and all sub-children of each child, and so on. We can obtain this information from the database metadata:

SQL Server: sys.foreign_keys and sys.foreign_key_columns
Oracle: SYS.ALL_CONSTRAINTS and SYS.ALL_CONS_COLUMNS
MySQL/MariaDB: INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS and INFORMATION_SCHEMA.KEY_COLUMN_USAGE
PostgreSQL: INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS and INFORMATION_SCHEMA.KEY_COLUMN_USAGE views or underlying pg_constraint and other system catalogs
SQLite: PRAGMA foreign_key_list(table-name) can be used to get parent relations; child relations can be retrieved based on parent relations for all tables, and all tables can be taken from the sqlite_master table.

General Algorithm for Merging a Parent with All Levels of Children

Set the “current” table to the source parent that we want to merge with all children. Later, the current table will be replaced with child tables, but the steps for the current table will not change; this is a recursive algorithm.
Create a mapping table for the current table.
Read the required rows from the source DB for the current table. If the current table is a parent table, “required” means the rows needed for the business goal (e.g., importing a specific parent or parents matching certain criteria). Otherwise, if the current table is not a parent table, we deal with its children and need to process only those children related to our parent. This requires temporary storage of a list of IDs for each processed table. For example, if ParentID 2 is referenced by ChildIDs 3 and 4, then when processing the GrandChild table, we only need its rows with ChildID 3 and 4.
For each row, generate INSERT statements, populating the target table and mapping table from step 2. If the current table is one of the child tables, then foreign key values referencing previously processed parents must be replaced with a SELECT TargetID FROM Mapping subquery.
Find all direct child tables (those with foreign keys to the current table).
For each table from step 5, change the current table to that child and repeat all steps starting from step 2.

As a result, we will generate a SQL script that creates mapping tables and inserts all required source data from the parent and all its children. Executing such a script will give us the expected result shown in Fig. 1. If necessary, we can add DROP TABLE statements for the mapping tables because they are no longer needed for INSERTs. However, these tables can be useful for further troubleshooting, they can be used to identify rows merged from the source database. They can also be used to divide a complex merge into parts, for example, merging reference data first, verifying the merge result, and then proceeding with the rest of the data using the mapping tables created for the reference data.

Caveats

Circular References: Unfortunately, foreign key dependencies may form cycles. For example, our schema could evolve as follows (pseudo-SQL without data types, to demonstrate relations between tables and the historical order of schema changes):

CREATE TABLE Parent (ParentID PRIMARY KEY, Name);
CREATE TABLE Child (ChildID PRIMARY KEY, ParentID NOT NULL REFERENCES Parent(ParentID), Name);
ALTER TABLE Parent ADD COLUMN FavoriteNephewChildID NULL REFERENCES Child(ChildID);

The word “Nephew” in the column name is intentional to emphasize that this ID should belong to the child of another parent. Our algorithm can’t be used here because when we process the Parent table, the Child mapping does not exist yet. And that's not just a mistake in the algorithm. Logically, it may happen that we need to merge a parent referencing a child that is not yet merged, and vice versa. The solution is to split the Parent table merge into two steps:

First, merge Parent rows without FavoriteNephewChildID (set it to NULL) and generate a mapping for Parent.
Next, merge Child rows using the Parent mapping and generate a mapping for Child.
Finally, update Parent.FavoriteNephewChildID using the Parent and Child mappings.

In this example, we have two tables referencing each other, but in general, there could be longer cycles, such as T2 referencing T1, T3 referencing T2, T4referencing T3, and finally T1 referencing T4. The solution will be the same: merge one of the nullable foreign key columns as NULL first and then update it in an additional step.

Other Considerations:

Triggers: Processed tables in the target database may have INSERT triggers that change related data in other tables. We need to decide whether this is desired. If not (e.g., we want to import these related tables separately), additional actions will be required, depending on the DBMS, which may or may not support trigger disabling.
Calculated Columns: Calculated columns need to be excluded from INSERT statements generated in step 4 of our algorithm.
Changed Columns: It may happen that the source and target databases use slightly different schemas (e.g., different versions of applications), and the processed target tables are missing columns present in the source. In this case, we need to generate INSERTs only for common columns. Of course, if the source table is missing a NOT NULL column present in the target, the entire merge operation may not be possible.

Conclusion

Combining databases with auto-number primary keys presents a number of challenges. By using the recursive algorithm discussed in this article, you can confidently merge parent-child relationships and handle more complex data hierarchies without losing integrity or causing conflicts.