DEV Community

Konstantin Semenenkov
Konstantin Semenenkov

Posted on • Originally published at Medium

Using Mapping Tables to Merge Data with Auto-Number Keys Referenced by Other Tables

Introduction

The problem and solution can be demonstrated with the following picture:

Fig. 1 Expected result
Fig. 1 Expected result

We have two instances of databases with similar structures that were not designed to be combined. The tables use auto-number primary keys, which are also used as foreign keys — a common model that works fine until the data needs to be merged.

Now, we want to combine all this data into a target database. The goal is to retain both the original target parents with their children, plus the source parents with their children. Since we use an auto-number primary key, each source row in the target DB will get new IDs (e.g., 3 and 4), and these new values must be used for corresponding child rows.

We’ll discuss a solution based on source-ID-to-target-ID mapping tables. This approach was recently implemented in the ID-Remap Merge feature of KS DB Merge Tools. We’ll also briefly cover alternative solutions. The implementation details will focus on several popular relational DBMSs — Oracle, MySQL, SQL Server, PostgreSQL, and SQLite — but the approach can generally be applied to any relational DBMS.

Other Solutions

It’s clear that auto-number primary and foreign keys are not suitable for such operations. A more natural solution would be to use GUIDs, which are unique across all databases and can be copied between databases as-is, preserving referential integrity and avoiding conflicts with existing data:

Fig. 2 Using GUIDs
Fig. 2 Using GUIDs

However, this solution is not suitable for our task because we can’t change the database design. It may be a legacy database, or there may be other reasons why we can’t modify the structure of the target database.

Another approach requiring a redesign would be to change the key structure by adding a DatabaseID value and changing keys to be composite keys consisting of that DatabaseID and an auto-number:

Fig. 3 Composite keys
Fig. 3 Composite keys

We also don’t consider some tricky ways, like to import source records with negative IDs:

Fig. 4 Negative IDs
Fig. 4 Negative IDs

Or shifting ID values to some number space not used by target IDs:

Fig. 5 Shifting IDs
Fig. 5 Shifting IDs

These methods can work under certain circumstances, but they are generally less applicable. Negative IDs can be used only once, and shifting IDs creates gaps in ID values, reducing the range of available IDs.

Our Solution

When we insert source parent rows into the target database, we need to keep a mapping of the original parent ID from the source DB to its new value in the target DB. Then, we use this mapping to assign the correct parent ID values for child rows.

The examples below are provided for SQL Server, followed by clarifications for other DBMSs.

First, let’s create a mapping table in the target DB:

CREATE TABLE dbo.MapParent (
  SourceID int primary key,
  TargetID int not null);
Enter fullscreen mode Exit fullscreen mode

Next, we loop through source Parent rows and generate INSERT statements for the target DB, keeping new parent ID values in the mapping:

INSERT INTO dbo.Parent (Name)
OUTPUT
  1, -- source ParentID
  INSERTED.ParentID
INTO dbo.MapParent
VALUES ('Source parent #1');
Enter fullscreen mode Exit fullscreen mode

Result of this statement in the target DB will be a new Parent table row with new auto-number ID and new row in the dbo.MapParent table:

Fig. 6 MapParent table
Fig. 6 MapParent table

The OUTPUT statement to extract the inserted ID value is specific to SQL Server. For other DBMSs, we can use the following methods:

  • Oracle:
DECLARE target_id NUMBER(19);
BEGIN
  INSERT INTO "Parent" ("Name")
  VALUES ('Source parent #1')
  RETURNING "ParentID" INTO target_id;

  INSERT INTO "MapParent" VALUES (1, target_id);

  COMMIT;
END;

Enter fullscreen mode Exit fullscreen mode
  • MySQL / MariaDB:
INSERT INTO `Parent` (`Name`)
  VALUES ('Source parent #1');

INSERT INTO `MapParent`
  VALUES (2, LAST_INSERT_ID());

Enter fullscreen mode Exit fullscreen mode
  • PostgreSQL:
INSERT INTO "Parent" ("Name")
 VALUES ('Source parent #1');

INSERT INTO "MapParent"
 VALUES (2, lastval());

Enter fullscreen mode Exit fullscreen mode
  • SQLite:
INSERT INTO "Parent" ("Name")
  VALUES ('Source parent #1');
INSERT INTO "MapParent"
  VALUES (2, last_insert_rowid());
Enter fullscreen mode Exit fullscreen mode

Going Further

Since we use auto-numbers for primary keys in our database, our Child primary key is most likely also an auto-number. But what if our Child primary key is used as a foreign key for another table, say GrandChild? We need to use the same approach: create a MapChild table with SourceID and TargetID columns to keep the mapping for child rows and populate it during INSERT INTO Child.

In real-world applications, we may have complex objects at the application level that are represented in the database as many rows in many related tables linked by foreign keys. In our Parent-Child-GrandChild example, we must ensure that we merge a parent row with all its children, and all sub-children of each child, and so on. We can obtain this information from the database metadata:

  • SQL Server: sys.foreign_keys and sys.foreign_key_columns
  • Oracle: SYS.ALL_CONSTRAINTS and SYS.ALL_CONS_COLUMNS
  • MySQL/MariaDB: INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS and INFORMATION_SCHEMA.KEY_COLUMN_USAGE
  • PostgreSQL: INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS and INFORMATION_SCHEMA.KEY_COLUMN_USAGE views or underlying pg_constraint and other system catalogs
  • SQLite: PRAGMA foreign_key_list(table-name) can be used to get parent relations; child relations can be retrieved based on parent relations for all tables, and all tables can be taken from the sqlite_master table.

General Algorithm for Merging a Parent with All Levels of Children

  1. Set the “current” table to the source parent that we want to merge with all children. Later, the current table will be replaced with child tables, but the steps for the current table will not change; this is a recursive algorithm.
  2. Create a mapping table for the current table.
  3. Read the required rows from the source DB for the current table. If the current table is a parent table, “required” means the rows needed for the business goal (e.g., importing a specific parent or parents matching certain criteria). Otherwise, if the current table is not a parent table, we deal with its children and need to process only those children related to our parent. This requires temporary storage of a list of IDs for each processed table. For example, if ParentID 2 is referenced by ChildIDs 3 and 4, then when processing the GrandChild table, we only need its rows with ChildID 3 and 4.
  4. For each row, generate INSERT statements, populating the target table and mapping table from step 2. If the current table is one of the child tables, then foreign key values referencing previously processed parents must be replaced with a SELECT TargetID FROM Mapping subquery.
  5. Find all direct child tables (those with foreign keys to the current table).
  6. For each table from step 5, change the current table to that child and repeat all steps starting from step 2.

As a result, we will generate a SQL script that creates mapping tables and inserts all required source data from the parent and all its children. Executing such a script will give us the expected result shown in Fig. 1. If necessary, we can add DROP TABLE statements for the mapping tables because they are no longer needed for INSERTs. However, these tables can be useful for further troubleshooting, they can be used to identify rows merged from the source database. They can also be used to divide a complex merge into parts, for example, merging reference data first, verifying the merge result, and then proceeding with the rest of the data using the mapping tables created for the reference data.

Caveats

Circular References: Unfortunately, foreign key dependencies may form cycles. For example, our schema could evolve as follows (pseudo-SQL without data types, to demonstrate relations between tables and the historical order of schema changes):

  1. CREATE TABLE Parent (ParentID PRIMARY KEY, Name);
  2. CREATE TABLE Child (ChildID PRIMARY KEY, ParentID NOT NULL REFERENCES Parent(ParentID), Name);
  3. ALTER TABLE Parent ADD COLUMN FavoriteNephewChildID NULL REFERENCES Child(ChildID);

The word “Nephew” in the column name is intentional to emphasize that this ID should belong to the child of another parent. Our algorithm can’t be used here because when we process the Parent table, the Child mapping does not exist yet. And that's not just a mistake in the algorithm. Logically, it may happen that we need to merge a parent referencing a child that is not yet merged, and vice versa. The solution is to split the Parent table merge into two steps:

  • First, merge Parent rows without FavoriteNephewChildID (set it to NULL) and generate a mapping for Parent.
  • Next, merge Child rows using the Parent mapping and generate a mapping for Child.
  • Finally, update Parent.FavoriteNephewChildID using the Parent and Child mappings.

In this example, we have two tables referencing each other, but in general, there could be longer cycles, such as T2 referencing T1, T3 referencing T2, T4referencing T3, and finally T1 referencing T4. The solution will be the same: merge one of the nullable foreign key columns as NULL first and then update it in an additional step.

Other Considerations:

  • Triggers: Processed tables in the target database may have INSERT triggers that change related data in other tables. We need to decide whether this is desired. If not (e.g., we want to import these related tables separately), additional actions will be required, depending on the DBMS, which may or may not support trigger disabling.
  • Calculated Columns: Calculated columns need to be excluded from INSERT statements generated in step 4 of our algorithm.
  • Changed Columns: It may happen that the source and target databases use slightly different schemas (e.g., different versions of applications), and the processed target tables are missing columns present in the source. In this case, we need to generate INSERTs only for common columns. Of course, if the source table is missing a NOT NULL column present in the target, the entire merge operation may not be possible.

Conclusion

Combining databases with auto-number primary keys presents a number of challenges. By using the recursive algorithm discussed in this article, you can confidently merge parent-child relationships and handle more complex data hierarchies without losing integrity or causing conflicts.

References

Top comments (0)