<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Konstantin Semenenkov</title>
    <description>The latest articles on DEV Community by Konstantin Semenenkov (@konstantin_semenenkov).</description>
    <link>https://dev.to/konstantin_semenenkov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1579859%2Fda5c1db2-2173-4bea-96b7-f884ea868a2c.jpg</url>
      <title>DEV Community: Konstantin Semenenkov</title>
      <link>https://dev.to/konstantin_semenenkov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/konstantin_semenenkov"/>
    <language>en</language>
    <item>
      <title>Using Mapping Tables to Merge Data with Auto-Number Keys Referenced by Other Tables</title>
      <dc:creator>Konstantin Semenenkov</dc:creator>
      <pubDate>Thu, 05 Sep 2024 16:26:50 +0000</pubDate>
      <link>https://dev.to/konstantin_semenenkov/using-mapping-tables-to-merge-data-with-auto-number-keys-referenced-by-other-tables-25p9</link>
      <guid>https://dev.to/konstantin_semenenkov/using-mapping-tables-to-merge-data-with-auto-number-keys-referenced-by-other-tables-25p9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The problem and solution can be demonstrated with the following picture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0deqov31v7q4sfn77hd.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0deqov31v7q4sfn77hd.jpg" alt="Fig. 1 Expected result" width="800" height="405"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 1 Expected result&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We have two instances of databases with similar structures that were not designed to be combined. The tables use auto-number primary keys, which are also used as foreign keys — a common model that works fine until the data needs to be merged.&lt;/p&gt;

&lt;p&gt;Now, we want to combine all this data into a target database. The goal is to retain both the original target parents with their children, plus the source parents with their children. Since we use an auto-number primary key, each source row in the target DB will get new IDs (e.g., 3 and 4), and these new values must be used for corresponding child rows.&lt;/p&gt;

&lt;p&gt;We’ll discuss a solution based on source-ID-to-target-ID mapping tables. This approach was recently implemented in the ID-Remap Merge feature of &lt;a href="https://ksdbmerge.tools/" rel="noopener noreferrer"&gt;KS DB Merge Tools&lt;/a&gt;. We’ll also briefly cover alternative solutions. The implementation details will focus on several popular relational DBMSs — Oracle, MySQL, SQL Server, PostgreSQL, and SQLite — but the approach can generally be applied to any relational DBMS.&lt;/p&gt;
&lt;h2&gt;
  
  
  Other Solutions
&lt;/h2&gt;

&lt;p&gt;It’s clear that auto-number primary and foreign keys are not suitable for such operations. A more natural solution would be to use GUIDs, which are unique across all databases and can be copied between databases as-is, preserving referential integrity and avoiding conflicts with existing data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4druon9wcx274famuj5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk4druon9wcx274famuj5.jpg" alt="Fig. 2 Using GUIDs" width="800" height="676"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 2 Using GUIDs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, this solution is not suitable for our task because we can’t change the database design. It may be a legacy database, or there may be other reasons why we can’t modify the structure of the target database.&lt;/p&gt;

&lt;p&gt;Another approach requiring a redesign would be to change the key structure by adding a &lt;code&gt;DatabaseID&lt;/code&gt; value and changing keys to be composite keys consisting of that &lt;code&gt;DatabaseID&lt;/code&gt; and an auto-number:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntc4mevsmz4qo9h4gkfs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fntc4mevsmz4qo9h4gkfs.jpg" alt="Fig. 3 Composite keys" width="800" height="572"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 3 Composite keys&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We also won’t consider trickier workarounds, such as importing source records with negative IDs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0pgsxjzrl84y2ysr007.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk0pgsxjzrl84y2ysr007.jpg" alt="Fig. 4 Negative IDs" width="800" height="411"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 4 Negative IDs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Or shifting ID values to some number space not used by target IDs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcym2xklmqm71t68c5hts.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcym2xklmqm71t68c5hts.jpg" alt="Fig. 5 Shifting IDs" width="800" height="411"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 5 Shifting IDs&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These methods can work under certain circumstances, but they are generally less applicable. Negative IDs can be used only once, and shifting IDs creates gaps in ID values, reducing the range of available IDs.&lt;/p&gt;
&lt;h2&gt;
  
  
  Our Solution
&lt;/h2&gt;

&lt;p&gt;When we insert source parent rows into the target database, we need to keep a mapping of the original parent ID from the source DB to its new value in the target DB. Then, we use this mapping to assign the correct parent ID values for child rows.&lt;/p&gt;

&lt;p&gt;The examples below are provided for SQL Server, followed by clarifications for other DBMSs.&lt;/p&gt;

&lt;p&gt;First, let’s create a mapping table in the target DB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE dbo.MapParent (
  SourceID int PRIMARY KEY,
  TargetID int NOT NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we loop through source Parent rows and generate &lt;code&gt;INSERT&lt;/code&gt; statements for the target DB, keeping new parent ID values in the mapping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO dbo.Parent (Name)
OUTPUT
  1, -- source ParentID
  INSERTED.ParentID
INTO dbo.MapParent
VALUES ('Source parent #1');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result of this statement in the target DB is a new &lt;code&gt;Parent&lt;/code&gt; table row with a new auto-number ID, plus a new row in the &lt;code&gt;dbo.MapParent&lt;/code&gt; table:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hkcsaic3lugc72m1yck.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hkcsaic3lugc72m1yck.jpg" alt="Fig. 6 MapParent table" width="232" height="60"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Fig. 6 MapParent table&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;OUTPUT&lt;/code&gt; clause used to capture the inserted ID value is specific to SQL Server. Other DBMSs offer the following alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Oracle:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECLARE target_id NUMBER(19);
BEGIN
  INSERT INTO "Parent" ("Name")
  VALUES ('Source parent #1')
  RETURNING "ParentID" INTO target_id;

  INSERT INTO "MapParent" VALUES (1, target_id);

  COMMIT;
END;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;MySQL / MariaDB:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO `Parent` (`Name`)
  VALUES ('Source parent #1');

INSERT INTO `MapParent`
  VALUES (2, LAST_INSERT_ID());

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;PostgreSQL:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO "Parent" ("Name")
 VALUES ('Source parent #1');

INSERT INTO "MapParent"
 VALUES (2, lastval());

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;SQLite:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO "Parent" ("Name")
  VALUES ('Source parent #1');
INSERT INTO "MapParent"
  VALUES (2, last_insert_rowid());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
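&lt;p&gt;For SQLite specifically, the whole pattern can also be exercised from the host language. Below is a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module (table and column names follow the article’s example; a pre-existing target row is assumed so that the new target ID differs from the source ID):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stands in for the target DB
con.execute('CREATE TABLE "Parent" ("ParentID" INTEGER PRIMARY KEY, "Name" TEXT)')
con.execute('CREATE TABLE "MapParent" ("SourceID" INTEGER PRIMARY KEY, "TargetID" INTEGER NOT NULL)')

# The target already holds one parent, so the next auto-number differs from the source ID.
con.execute('INSERT INTO "Parent" ("Name") VALUES (?)', ("Existing target parent",))

# Insert a source row (source ParentID 1) and record source ID -> new target ID.
con.execute('INSERT INTO "Parent" ("Name") VALUES (?)', ("Source parent #1",))
con.execute('INSERT INTO "MapParent" VALUES (?, last_insert_rowid())', (1,))

print(con.execute('SELECT * FROM "MapParent"').fetchall())  # [(1, 2)]
```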



&lt;h2&gt;
  
  
  Going Further
&lt;/h2&gt;

&lt;p&gt;Since we use auto-numbers for primary keys in our database, our &lt;code&gt;Child&lt;/code&gt; primary key is most likely also an auto-number. But what if our &lt;code&gt;Child&lt;/code&gt; primary key is used as a foreign key for another table, say &lt;code&gt;GrandChild&lt;/code&gt;? We need to use the same approach: create a &lt;code&gt;MapChild&lt;/code&gt; table with &lt;code&gt;SourceID&lt;/code&gt; and &lt;code&gt;TargetID&lt;/code&gt; columns to keep the mapping for child rows and populate it during &lt;code&gt;INSERT INTO Child&lt;/code&gt;.&lt;/p&gt;
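&lt;p&gt;In SQLite terms, inserting a child row through the mapping can be sketched with a subquery (the mapping row, source parent 1 to target parent 3, is assumed for illustration):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stands in for the target DB
con.executescript("""
CREATE TABLE "Parent" ("ParentID" INTEGER PRIMARY KEY, "Name" TEXT);
CREATE TABLE "Child" ("ChildID" INTEGER PRIMARY KEY,
    "ParentID" INTEGER NOT NULL REFERENCES "Parent"("ParentID"), "Name" TEXT);
CREATE TABLE "MapParent" ("SourceID" INTEGER PRIMARY KEY, "TargetID" INTEGER NOT NULL);
-- Assume the parent merge already produced this mapping: source 1 became target 3.
INSERT INTO "MapParent" VALUES (1, 3);
""")

# The child row referenced ParentID 1 in the source DB; the subquery remaps it.
con.execute("""
INSERT INTO "Child" ("ParentID", "Name")
VALUES ((SELECT "TargetID" FROM "MapParent" WHERE "SourceID" = 1), 'Source child #1')
""".replace("INSERT INTO", "INSERT INTO"))
print(con.execute('SELECT "ParentID", "Name" FROM "Child"').fetchone())  # (3, 'Source child #1')
```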

&lt;p&gt;In real-world applications, we may have complex objects at the application level that are represented in the database as many rows in many related tables linked by foreign keys. In our Parent-Child-GrandChild example, we must ensure that we merge a parent row with all its children, and all sub-children of each child, and so on. We can obtain this information from the database metadata:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQL Server: &lt;code&gt;sys.foreign_keys&lt;/code&gt; and &lt;code&gt;sys.foreign_key_columns&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Oracle: &lt;code&gt;SYS.ALL_CONSTRAINTS&lt;/code&gt; and &lt;code&gt;SYS.ALL_CONS_COLUMNS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;MySQL/MariaDB: &lt;code&gt;INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS&lt;/code&gt; and &lt;code&gt;INFORMATION_SCHEMA.KEY_COLUMN_USAGE&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;PostgreSQL: &lt;code&gt;INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS&lt;/code&gt; and &lt;code&gt;INFORMATION_SCHEMA.KEY_COLUMN_USAGE&lt;/code&gt; views or underlying &lt;code&gt;pg_constraint&lt;/code&gt; and other system catalogs&lt;/li&gt;
&lt;li&gt;SQLite: &lt;code&gt;PRAGMA foreign_key_list(table-name)&lt;/code&gt; can be used to get parent relations; child relations can be retrieved based on parent relations for all tables, and all tables can be taken from the &lt;code&gt;sqlite_master&lt;/code&gt; table.&lt;/li&gt;
&lt;/ul&gt;
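&lt;p&gt;For SQLite, the last bullet can be sketched as follows: enumerate the tables from &lt;code&gt;sqlite_master&lt;/code&gt; and invert each table’s &lt;code&gt;PRAGMA foreign_key_list&lt;/code&gt; output to find the children of a given table (an illustrative helper, not the tool’s actual code):&lt;/p&gt;

```python
import sqlite3

def child_tables(con, parent_table):
    """Return (child table, FK column) pairs for every table referencing parent_table."""
    children = []
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for t in tables:
        # Each foreign_key_list row: (id, seq, table, from, to, on_update, on_delete, match)
        for fk in con.execute(f'PRAGMA foreign_key_list("{t}")'):
            if fk[2] == parent_table:
                children.append((t, fk[3]))
    return children

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "Parent" ("ParentID" INTEGER PRIMARY KEY, "Name" TEXT);
CREATE TABLE "Child" ("ChildID" INTEGER PRIMARY KEY,
    "ParentID" INTEGER NOT NULL REFERENCES "Parent"("ParentID"));
""")
print(child_tables(con, "Parent"))  # [('Child', 'ParentID')]
```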

&lt;h2&gt;
  
  
  General Algorithm for Merging a Parent with All Levels of Children
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Set the “current” table to the source parent that we want to merge with all children. Later, the current table will be replaced with child tables, but the steps for the current table will not change; this is a recursive algorithm.&lt;/li&gt;
&lt;li&gt;Create a mapping table for the current table.&lt;/li&gt;
&lt;li&gt;Read the required rows from the source DB for the current table. If the current table is a parent table, “required” means the rows needed for the business goal (e.g., importing a specific parent or parents matching certain criteria). Otherwise, if the current table is not a parent table, we deal with its children and need to process only those children related to our parent. This requires temporary storage of a list of IDs for each processed table. For example, if &lt;code&gt;ParentID&lt;/code&gt; 2 is referenced by &lt;code&gt;ChildIDs&lt;/code&gt; 3 and 4, then when processing the &lt;code&gt;GrandChild&lt;/code&gt; table, we only need its rows with &lt;code&gt;ChildID&lt;/code&gt; 3 and 4.&lt;/li&gt;
&lt;li&gt;For each row, generate &lt;code&gt;INSERT&lt;/code&gt; statements, populating the target table and mapping table from step 2. If the current table is one of the child tables, then foreign key values referencing previously processed parents must be replaced with a &lt;code&gt;SELECT TargetID FROM Mapping&lt;/code&gt; subquery.&lt;/li&gt;
&lt;li&gt;Find all direct child tables (those with foreign keys to the current table).&lt;/li&gt;
&lt;li&gt;For each table from step 5, change the current table to that child and repeat all steps starting from step 2.&lt;/li&gt;
&lt;/ol&gt;
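&lt;p&gt;The steps above can be condensed into a runnable sketch. This illustrative Python/SQLite version (not the tool’s actual implementation) keeps each mapping in a dictionary rather than a mapping table for brevity, and assumes single-column integer keys and acyclic foreign keys:&lt;/p&gt;

```python
import sqlite3

def child_tables(con, parent_table):
    """Step 5: tables holding a foreign key to parent_table, via SQLite metadata."""
    result = []
    for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
        for fk in con.execute(f'PRAGMA foreign_key_list("{name}")'):
            if fk[2] == parent_table:
                result.append((name, fk[3]))  # (child table, FK column)
    return result

def merge_subtree(src, tgt, table, id_filter=None, fk=None, parent_map=None):
    """Steps 2-6 for the 'current' table: copy the required source rows,
    record source ID -> new target ID, then recurse into child tables."""
    info = list(src.execute(f'PRAGMA table_info("{table}")'))
    pk = next(r[1] for r in info if r[5])       # single-column PK assumed
    cols = [r[1] for r in info if not r[5]]
    select = f'SELECT "{pk}", ' + ", ".join(f'"{c}"' for c in cols) + f' FROM "{table}"'
    params = ()
    if fk is not None:                          # step 3: only children of merged parents
        select += f' WHERE "{fk}" IN ({",".join("?" * len(id_filter))})'
        params = tuple(id_filter)
    mapping = {}
    for src_id, *values in src.execute(select, params):
        if fk is not None:                      # step 4: remap the foreign key
            values[cols.index(fk)] = parent_map[values[cols.index(fk)]]
        cur = tgt.execute(
            f'INSERT INTO "{table}" (' + ", ".join(f'"{c}"' for c in cols) +
            f') VALUES ({",".join("?" * len(cols))})', values)
        mapping[src_id] = cur.lastrowid
    for child, child_fk in child_tables(src, table):
        if mapping:                             # step 6: recurse into child tables
            merge_subtree(src, tgt, child, list(mapping), child_fk, mapping)
    return mapping

SCHEMA = """
CREATE TABLE "Parent" ("ParentID" INTEGER PRIMARY KEY, "Name" TEXT);
CREATE TABLE "Child" ("ChildID" INTEGER PRIMARY KEY,
    "ParentID" INTEGER NOT NULL REFERENCES "Parent"("ParentID"), "Name" TEXT);
"""
src = sqlite3.connect(":memory:"); src.executescript(SCHEMA)
tgt = sqlite3.connect(":memory:"); tgt.executescript(SCHEMA)
src.executescript("""
INSERT INTO "Parent" ("Name") VALUES ('Source parent #1'), ('Source parent #2');
INSERT INTO "Child" ("ParentID", "Name") VALUES (1, 'Child 1-1'), (2, 'Child 2-1');
""")
tgt.executescript("""
INSERT INTO "Parent" ("Name") VALUES ('Target parent #1'), ('Target parent #2');
""")

parent_map = merge_subtree(src, tgt, "Parent")
print(parent_map)  # {1: 3, 2: 4}
print(tgt.execute('SELECT "ParentID", "Name" FROM "Child"').fetchall())
```

&lt;p&gt;Running this against the article’s Parent/Child example reproduces Fig. 1: the two source parents get new IDs 3 and 4, and their children are inserted pointing at those new IDs.&lt;/p&gt;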

&lt;p&gt;As a result, we will generate a SQL script that creates mapping tables and inserts all required source data from the parent and all its children. Executing such a script will give us the expected result shown in Fig. 1. If necessary, we can add &lt;code&gt;DROP TABLE&lt;/code&gt; statements for the mapping tables, since they are no longer needed once the &lt;code&gt;INSERT&lt;/code&gt;s have run. However, these tables can be useful for further troubleshooting: they identify the rows merged from the source database. They can also be used to divide a complex merge into parts, for example, merging reference data first, verifying the merge result, and then proceeding with the rest of the data using the mapping tables created for the reference data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Circular References&lt;/strong&gt;: Unfortunately, foreign key dependencies may form cycles. For example, our schema could evolve as follows (pseudo-SQL without data types, to demonstrate relations between tables and the historical order of schema changes):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;CREATE TABLE Parent (ParentID PRIMARY KEY, Name);&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CREATE TABLE Child (ChildID PRIMARY KEY, ParentID NOT NULL REFERENCES Parent(ParentID), Name);&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALTER TABLE Parent ADD COLUMN FavoriteNephewChildID NULL REFERENCES Child(ChildID);&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The word “Nephew” in the column name is intentional, to emphasize that this ID should belong to the child of another parent. Our algorithm can’t be used here as-is, because when we process the &lt;code&gt;Parent&lt;/code&gt; table, the &lt;code&gt;Child&lt;/code&gt; mapping does not exist yet. And this is not merely a flaw in the algorithm: logically, we may need to merge a parent referencing a child that is not yet merged, and vice versa. The solution is to split the &lt;code&gt;Parent&lt;/code&gt; table merge into two steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, merge &lt;code&gt;Parent&lt;/code&gt; rows without &lt;code&gt;FavoriteNephewChildID&lt;/code&gt; (set it to &lt;code&gt;NULL&lt;/code&gt;) and generate a mapping for &lt;code&gt;Parent&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Next, merge &lt;code&gt;Child&lt;/code&gt; rows using the &lt;code&gt;Parent&lt;/code&gt; mapping and generate a mapping for &lt;code&gt;Child&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Finally, update &lt;code&gt;Parent.FavoriteNephewChildID&lt;/code&gt; using the &lt;code&gt;Parent&lt;/code&gt; and &lt;code&gt;Child&lt;/code&gt; mappings.&lt;/li&gt;
&lt;/ul&gt;
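&lt;p&gt;In SQLite terms, the three bullets might look like this (the source values, parent 1 pointing at nephew child 5, are assumed for illustration):&lt;/p&gt;

```python
import sqlite3

con = sqlite3.connect(":memory:")  # plays the role of the target DB
con.executescript("""
CREATE TABLE "Parent" ("ParentID" INTEGER PRIMARY KEY, "Name" TEXT,
    "FavoriteNephewChildID" INTEGER REFERENCES "Child"("ChildID"));
CREATE TABLE "Child" ("ChildID" INTEGER PRIMARY KEY,
    "ParentID" INTEGER NOT NULL REFERENCES "Parent"("ParentID"), "Name" TEXT);
CREATE TABLE "MapParent" ("SourceID" INTEGER PRIMARY KEY, "TargetID" INTEGER NOT NULL);
CREATE TABLE "MapChild" ("SourceID" INTEGER PRIMARY KEY, "TargetID" INTEGER NOT NULL);
""")

# Step 1: merge the Parent row with the cyclic FK forced to NULL.
con.execute('INSERT INTO "Parent" ("Name", "FavoriteNephewChildID")'
            " VALUES ('Source parent #1', NULL)")
con.execute('INSERT INTO "MapParent" VALUES (1, last_insert_rowid())')

# Step 2: merge the Child row through the Parent mapping (source ChildID 5 assumed).
con.execute("""
INSERT INTO "Child" ("ParentID", "Name")
VALUES ((SELECT "TargetID" FROM "MapParent" WHERE "SourceID" = 1), 'Source child')
""")
con.execute('INSERT INTO "MapChild" VALUES (5, last_insert_rowid())')

# Step 3: patch the deferred column through both mappings
# (in the source DB, parent 1 pointed at child 5).
con.execute("""
UPDATE "Parent"
SET "FavoriteNephewChildID" = (SELECT "TargetID" FROM "MapChild" WHERE "SourceID" = 5)
WHERE "ParentID" = (SELECT "TargetID" FROM "MapParent" WHERE "SourceID" = 1)
""")
print(con.execute('SELECT * FROM "Parent"').fetchone())  # (1, 'Source parent #1', 1)
```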

&lt;p&gt;In this example, we have two tables referencing each other, but in general, there could be longer cycles, such as &lt;code&gt;T2&lt;/code&gt; referencing &lt;code&gt;T1&lt;/code&gt;, &lt;code&gt;T3&lt;/code&gt; referencing &lt;code&gt;T2&lt;/code&gt;, &lt;code&gt;T4&lt;/code&gt; referencing &lt;code&gt;T3&lt;/code&gt;, and finally &lt;code&gt;T1&lt;/code&gt; referencing &lt;code&gt;T4&lt;/code&gt;. The solution is the same: first merge one of the nullable foreign key columns as &lt;code&gt;NULL&lt;/code&gt;, then update it in an additional step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Other Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggers&lt;/strong&gt;: Processed tables in the target database may have &lt;code&gt;INSERT&lt;/code&gt; triggers that change related data in other tables. We need to decide whether this is desired. If not (e.g., we want to import these related tables separately), additional actions will be required, depending on the DBMS, which may or may not support trigger disabling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculated Columns&lt;/strong&gt;: Calculated columns need to be excluded from &lt;code&gt;INSERT&lt;/code&gt; statements generated in step 4 of our algorithm.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changed Columns&lt;/strong&gt;: It may happen that the source and target databases use slightly different schemas (e.g., different versions of applications), and the processed target tables are missing columns present in the source. In this case, we need to generate &lt;code&gt;INSERT&lt;/code&gt;s only for common columns. Of course, if the source table is missing a &lt;code&gt;NOT NULL&lt;/code&gt; column present in the target, the entire merge operation may not be possible.&lt;/li&gt;
&lt;/ul&gt;
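&lt;p&gt;The last point, restricting the generated &lt;code&gt;INSERT&lt;/code&gt;s to common columns, can be sketched with SQLite metadata (the helper name and the extra &lt;code&gt;LegacyNote&lt;/code&gt; column are invented for the demo):&lt;/p&gt;

```python
import sqlite3

def common_columns(src, tgt, table):
    """Columns present in both schemas, excluding the remapped primary key."""
    src_cols = [r[1] for r in src.execute(f'PRAGMA table_info("{table}")') if not r[5]]
    tgt_cols = {r[1] for r in tgt.execute(f'PRAGMA table_info("{table}")')}
    return [c for c in src_cols if c in tgt_cols]

src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
# The source schema is one version ahead and has an extra column.
src.execute('CREATE TABLE "Parent" ("ParentID" INTEGER PRIMARY KEY, "Name" TEXT, "LegacyNote" TEXT)')
tgt.execute('CREATE TABLE "Parent" ("ParentID" INTEGER PRIMARY KEY, "Name" TEXT)')

cols = common_columns(src, tgt, "Parent")
print(cols)  # ['Name']
print(f'INSERT INTO "Parent" ({", ".join(cols)}) VALUES ({",".join("?" * len(cols))})')
```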

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Combining databases with auto-number primary keys presents a number of challenges. By using the recursive algorithm discussed in this article, you can confidently merge parent-child relationships and handle more complex data hierarchies without losing integrity or causing conflicts.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/3824639/how-do-you-merge-tables-with-autonumber-primary-keys" rel="noopener noreferrer"&gt;How do you merge tables with autonumber primary keys?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/playlist?list=PLC6q4oJciOHPVGDJuC_JuEEASGtXB-nff" rel="noopener noreferrer"&gt;How to Merge Tables With Autonumber Primary Keys Referenced by Foreign Keys&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Implementing UI Automation Testing for Desktop Applications Dealing With Different DBMS</title>
      <dc:creator>Konstantin Semenenkov</dc:creator>
      <pubDate>Wed, 05 Jun 2024 20:22:39 +0000</pubDate>
      <link>https://dev.to/konstantin_semenenkov/implementing-ui-automation-testing-for-desktop-applications-dealing-with-different-dbms-1fkg</link>
      <guid>https://dev.to/konstantin_semenenkov/implementing-ui-automation-testing-for-desktop-applications-dealing-with-different-dbms-1fkg</guid>
      <description>&lt;h2&gt;
  
  
  This article can be useful for:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Those who participate in UI automation testing of desktop apps; the real-world experience of building a testing system may be of interest.&lt;/li&gt;
&lt;li&gt;Anyone building software that needs to work with different relational database management systems (DBMSs).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A brief history of the topic
&lt;/h2&gt;

&lt;p&gt;Just a few words about the &lt;a href="https://ksdbmerge.tools/"&gt;project&lt;/a&gt; discussed here. This section does not contain any technical details, so it can be skipped if you're in a hurry. The only technical information in this section is the list of DBMSs related to our discussion.&lt;/p&gt;

&lt;p&gt;The project is a set of diff and merge tools for popular relational DBMSs. Initially created for internal use in Access development, it was primarily used to compare module and form code between different versions of the same Access project. Later, a fork was made for SQL Server, using the same UI to compare schema, data, and programmable objects. Further forks were developed for MySQL, SQLite, and PostgreSQL. With tooling for the metadata of several DBMSs in place, a tool was created for Cross-DBMS scenarios, focusing mainly on data. Having built the Cross-DBMS tool, I realized it sorely lacked Oracle support, so a kernel for working with Oracle was implemented for it, along with a separate tool for Oracle Database.&lt;/p&gt;

&lt;p&gt;All tools are made using Visual Studio Community Edition and utilize the .NET Framework and WPF.&lt;/p&gt;

&lt;h2&gt;
  
  
  About the tests
&lt;/h2&gt;

&lt;p&gt;After one of the disastrous releases, where changes in one area unexpectedly broke another, it became clear that UI automation tests were necessary. Although there were unit tests for some components, they did not check the functionality of the entire application. Manually testing all the functionality of each release would be too time-consuming, and besides, humans (at least I) are lazy and make mistakes. If it can be automated, it should be automated. The first tests were built using the &lt;a href="https://github.com/TestStack/White"&gt;TestStack.White&lt;/a&gt; library. Since that library is no longer maintained, the tests are now gradually migrating to the &lt;a href="https://github.com/FlaUI/FlaUI"&gt;FlaUI&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;It was decided to use the &lt;a href="https://specflow.org/"&gt;SpecFlow&lt;/a&gt; BDD framework, which conveniently describes the steps for using the application and the expected results. A typical test looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: 01 Sample data diff test
  Given [System] Set template folder to '04_DataDiff'
    And [App] Start with 'dd1;dd2' loaded
   When [HomeTab] Click on the 'left' panel summary 'Total' for 'Table definitions'
    And [ObjectList] On the 'left' panel click on 'Compare data (all records)' for 'Client'
   Then [DataDiff] 'right' panel totals is '2;0;1;1', per page '2;0;1;1'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each step, code is written that uses the UI automation library to check and manipulate the UI. Over time, however, a complication arose with parameterized tests, where the same steps needed to be reused for different cases. SpecFlow supports parameterized tests via Scenario Outline, where each case is described by a single row in the Examples table. Thus, for one case, we can only provide a set of scalar values. This was insufficient to describe the expected result of a complex UI consisting of multiple grids, each needing to be described by a separate table. For such tests, another approach was developed: each case is described by a row in an Excel table, with columns describing the actions and the expected UI state. Since an Excel cell can contain multiline text, a special syntax was adopted to describe multiple UI grids in one cell, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Columns (Name; Nullability)]
- id; NOT NULL
- name; NULL

[Constraints (Type; Columns)]
- PRIMARY KEY; id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From a one-dimensional list of parameters in SpecFlow, we moved to a four-dimensional set of parameters: a set of Excel cells, each containing a set of tables, with each table being a set of rows and columns. Using Excel or its alternatives is convenient, as there are ready-to-use tools for viewing and editing. But the xlsx file format has significant drawbacks: it is hard to track test history, compare revisions, and gauge the volume of existing tests. Therefore, in the future, these scenarios are planned to be moved from Excel to some text format, most likely XML, which will require developing a UI for editing them.&lt;/p&gt;
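&lt;p&gt;As an illustration of that cell syntax, a minimal parser (not the actual test-framework code) could turn one cell into a dictionary of expected grids:&lt;/p&gt;

```python
def parse_grids(cell_text):
    """Parse the multi-grid cell syntax into {grid title: {'columns': [...], 'rows': [...]}}."""
    grids, current = {}, None
    for line in cell_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('['):                 # e.g. [Columns (Name; Nullability)]
            title, cols = line[1:-1].split(' (', 1)
            current = grids.setdefault(
                title, {'columns': cols.rstrip(')').split('; '), 'rows': []})
        elif line.startswith('- '):              # one grid row per bullet
            current['rows'].append(line[2:].split('; '))
    return grids

cell = """[Columns (Name; Nullability)]
- id; NOT NULL
- name; NULL

[Constraints (Type; Columns)]
- PRIMARY KEY; id"""

expected = parse_grids(cell)
print(expected['Columns']['rows'])  # [['id', 'NOT NULL'], ['name', 'NULL']]
```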

&lt;p&gt;A typical test proceeds as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prepare one or two database instances. Only one instance is used if the test checks the application's behavior when only one database is opened, or when the same database is compared to itself. In most cases, databases are described as scripts. For tests described using Excel, the script is assembled from spreadsheet content on the fly. For applications working with Access or SQLite, ready-made DB files are sometimes used instead of scripts. Specifically for Access, not all database objects can be created with an SQL script, even when only dealing with tables. For SQLite, ready-made files are used to perform a more complete end-to-end testing cycle, which is particularly important for protected files.
Where possible:

&lt;ul&gt;
&lt;li&gt;Databases are created in parallel to speed up the process.&lt;/li&gt;
&lt;li&gt;Database files are created on a RAM Drive, sometimes with the entire database server placed there. Using a RAM Drive not only speeds up test execution but also prolongs the life of SSDs or HDDs.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The application is launched and opens the databases. If possible, and if not crucial for the test, specifying the database is done via the command line to save time clicking through database open dialogs.&lt;/li&gt;
&lt;li&gt;The test clicks through the application and checks UI elements according to the scenario steps. When the application generates and executes SQL scripts, the text of these scripts is extracted and written to the test execution console output. This often helps in understanding a problem without debugging the test.&lt;/li&gt;
&lt;li&gt;If the test fails, a screenshot of the application is saved.&lt;/li&gt;
&lt;li&gt;The application is closed, and the databases are deleted. In most cases, database deletion occurs in the background, saving some time. Deletion is controlled by a switch; sometimes databases are retained for further analysis and debugging. Tests for AccdbMerge also check for any non-closed Access processes, which are used by the application to process database files.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1 and 2 can take significant time. Creating one database with two tables for two different cases and launching the application once can be significantly faster than creating a database with one table twice and launching the application twice (once for each database). Therefore, some cases are combined, if possible, so that one database is created for all cases, and the application is launched only once.&lt;/p&gt;

&lt;h2&gt;
  
  
  DBMS-specific things
&lt;/h2&gt;

&lt;p&gt;All of the listed relational DBMSs deal with tables; tables consist of columns and rows. SQL is everywhere. It would seem that I could easily take an existing test from one application and use it for another. Fortunately, sometimes this is almost possible. If a new feature is being developed that is planned for the tools across different DBMSs, the test for it is usually written first for SQLite, since those tests are the fastest. Often such a test and its scripts can be reused for other products with minimal changes. But, as we know, the devil is in the details. Different data types, different database management capabilities, different SQL. As a result, the tests still differ significantly. Let's talk about those differences. The sections below list the specifics of each DBMS separately, in the order the tests were created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Microsoft Access
&lt;/h2&gt;

&lt;p&gt;Possibly the biggest issue for my tests is the limited SQL support in Access, both for DDL and DML. Access has Queries, similar to Views in other DBMSs, and there is a CREATE VIEW statement described in the documentation, but &lt;a href="https://stackoverflow.com/questions/11367959/create-view-in-ms-access-2007"&gt;it does not work&lt;/a&gt;. There are many data types that cannot be used in either CREATE TABLE or INSERT. As a result, a script can create a database file with tables using simple data types like numbers or strings, but for anything more complex, pre-prepared mdb and accdb files often have to be used. However, even with prepared database templates, simply copying them is sometimes not enough. A common practice in Access development is to split the database into a backend and frontend, resulting in linked tables in the frontend that need their links to the backend updated after copying.&lt;/p&gt;

&lt;p&gt;Another problem with interacting with Access is that it updates from time to time, causing some tests to stop working. My application stops working even though nothing has changed in it; only Access has changed. They have twice broken external availability of the DAO libraries (&lt;a href="https://techcommunity.microsoft.com/t5/access-blog/breaking-ace-out-of-the-bubble/bc-p/2606712/highlight/true#M213"&gt;first&lt;/a&gt;, &lt;a href="https://techcommunity.microsoft.com/t5/access-blog/breaking-ace-out-of-the-bubble/bc-p/3641817/highlight/true#M289"&gt;second&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;The most frequent support request I receive is that AccdbMerge cannot connect to Access, which is always fixed only by restoring Office, without changing anything in AccdbMerge.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQL Server
&lt;/h2&gt;

&lt;p&gt;I started my programming career in the early 2000s, and SQL Server was the first database engine I worked with. So, SQL Server is a sort of standard DBMS for me, and for a long time, all other DBMSs were learned through comparison with SQL Server.&lt;/p&gt;

&lt;p&gt;Perhaps the most interesting result of making tests for SQL Server was some incompatibility with .NET data types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The SQL Server &lt;strong&gt;decimal&lt;/strong&gt; data type can exceed the capacity of the .NET &lt;strong&gt;decimal&lt;/strong&gt; data type&lt;/li&gt;
&lt;li&gt;The SQL Server &lt;strong&gt;uniqueidentifier&lt;/strong&gt; and the .NET &lt;strong&gt;Guid&lt;/strong&gt; &lt;a href="https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/comparing-guid-and-uniqueidentifier-values"&gt;have different internal presentation and sorting rules&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  MySQL and MariaDB
&lt;/h2&gt;

&lt;p&gt;The main problem during my work on these tools was the frequency of new releases of these DBMSs; each new release needs to be checked for compatibility with the application. At first, this mostly concerned MariaDB, since MySQL stayed on version 8.0 for a long time. Recently, however, MySQL released versions 8.1 through 8.4 within a short interval.&lt;/p&gt;

&lt;p&gt;From my experience working with different DBMSs, MySQL ranks second after Access in terms of the number of bugs. And sometimes they don't get fixed for years. One example, which in particular is one of the reasons for the differences between tests, is non-working inline foreign keys. There are a lot of questions on this topic on StackOverflow and &lt;a href="https://stackoverflow.com/questions/24313143/mysql-how-to-declare-foreign-key-inline"&gt;one of them&lt;/a&gt; contains links to a bunch of related bugs in MySQL.&lt;/p&gt;

&lt;h2&gt;
  
  
  SQLite
&lt;/h2&gt;

&lt;p&gt;SQLite has several distinctive features, but the one that matters most for my tools and their tests is its handling of data types. Not only can a column's type be omitted entirely, but even when types are declared, by default you can easily store a value of a different type. Fortunately, version 3.37.0 introduced the concept of STRICT tables to address this, though it does not eliminate the need to keep supporting regular tables.&lt;/p&gt;
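
&lt;p&gt;A minimal illustration of the difference (the table names are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- a regular table accepts a value of any type in any column:
CREATE TABLE Person (ID INTEGER, Name TEXT);
INSERT INTO Person VALUES ('not-a-number', 42);        -- succeeds

-- a STRICT table (SQLite 3.37.0+) enforces the declared types:
CREATE TABLE PersonStrict (ID INTEGER, Name TEXT) STRICT;
INSERT INTO PersonStrict VALUES ('not-a-number', 42);  -- error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;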

&lt;p&gt;The &lt;a href="https://www.sqlite.org/formatchng.html"&gt;compatibility&lt;/a&gt; between database files and library versions is admirable. However, given this remarkable compatibility, it was doubly strange to face &lt;a href="https://sqlite.org/forum/forumpost/d2c637bafbbff69d"&gt;the end of support&lt;/a&gt; for one of the methods of protecting database files.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL
&lt;/h2&gt;

&lt;p&gt;Similar to SQL Server, I did not run into any significant issues with PostgreSQL. The main difficulty from a development standpoint was the huge number of data types and the wide variety of database object kinds and their properties. It was also new to me to encounter NaN and +/- infinity values in some numeric and timestamp data types.&lt;/p&gt;
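
&lt;p&gt;For example, PostgreSQL happily accepts these values, which the application UI then has to render somehow:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- NaN is a valid numeric value:
SELECT 'NaN'::numeric;

-- timestamps can be infinitely far in the past or future:
SELECT 'infinity'::timestamp, '-infinity'::timestamp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;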

&lt;h2&gt;
  
  
  Oracle
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, before running a test we must first create a database. For SQL Server, MySQL, and PostgreSQL, this is done simply by sending a CREATE DATABASE command to the database server. Not so for Oracle: the database server itself is your database, and you don't get another one. Instead of creating a database, I had to create a user, which comes with a new schema, and the test then interacts with objects within that schema.&lt;/p&gt;

&lt;p&gt;Since the schema name is displayed in the application as part of the table name, and the schema is different each time, special variables had to be introduced and processed. For other DBMS tests, a table is referred to as TableName, dbo.TableName, or public.TableName, which is the real name of the object. For Oracle tests, I had to write $schema.TABLENAME and then replace $schema in the code with the real schema name before looking for that object in the application UI. At first I thought I simply didn't know something about Oracle, but then I came across the source code of one of the DB Fiddles: its authors did the same thing.&lt;/p&gt;
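
&lt;p&gt;The per-test "database" setup for Oracle can be sketched like this (the user name and password are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- creating a user also creates its schema:
CREATE USER testrun1 IDENTIFIED BY secret;
GRANT CONNECT, RESOURCE TO testrun1;
-- the test scripts refer to $schema.TABLENAME,
-- and $schema is replaced with TESTRUN1 at run time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;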

&lt;p&gt;Another feature of Oracle is that an empty string is always treated as NULL. Unlike in any other DBMS here, it is impossible to save an empty string as a non-NULL value.&lt;/p&gt;
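
&lt;p&gt;A quick way to see this behavior (TableName is a hypothetical table with a nullable Name column):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO TableName (ID, Name) VALUES (1, '');

-- in Oracle this finds the row just inserted:
SELECT COUNT(*) FROM TableName WHERE Name IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;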

&lt;p&gt;I use Windows, and all my desktop apps run only on Windows. Another peculiarity of working with Oracle was the need to use Docker Desktop, since I could not find native Windows builds of the latest Oracle versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other
&lt;/h2&gt;

&lt;p&gt;In addition to the listed differences, there are several more points that are specific to each DBMS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Case-sensitivity and quoted identifiers work differently for object names. Some DBMSs leave a CamelCase object name as is, others convert it to lowercase, and still others to uppercase. In some DBMSs CamelCase can be preserved by surrounding the name with double quotes, but not every DBMS uses double quotes: MySQL and MariaDB use backticks (`), while SQL Server (by default) and Access use square brackets ([]).&lt;/li&gt;
&lt;li&gt;Creating database objects takes time, and it can sometimes be sped up by wrapping all statements in a transaction. However, not every DBMS supports transactions for DDL statements. Additionally, some DDL commands cannot be combined with others and must be sent to the DBMS server as a separate request. &lt;/li&gt;
&lt;li&gt;When filling tables with data, it is convenient to use multi-row INSERT statements like this:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO TableName
  (ID, Name)
VALUES
  (1, 'Smith'),
  (2, 'Doe')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But not every DBMS supports this syntax, and some support it only in certain versions. For the test to work on all required DBMS versions, the script has to be split into separate INSERTs.&lt;/p&gt;
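
&lt;p&gt;So the portable fallback in the test scripts is the one-row-per-statement form:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO TableName (ID, Name) VALUES (1, 'Smith');
INSERT INTO TableName (ID, Name) VALUES (2, 'Doe');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;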

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we discussed the practical experience of building a UI automation testing system for desktop applications that interact with various relational DBMSs. The necessity of automated UI testing became evident after a release highlighted the risks of manual testing. Each DBMS presented unique challenges, from Access's limited SQL capabilities and frequent updates to Oracle's unconventional database creation requirements. The experience highlighted the importance of flexibility and adaptability in automated testing to accommodate the nuances of different DBMSs.&lt;/p&gt;

&lt;h2&gt;
  
  
  P. S.
&lt;/h2&gt;

&lt;p&gt;According to the publishing rules, I have to note that I have used AI for title, summary and proofreading of the rest of the text. But the text itself was handwritten, including the SQL for the cover image :).&lt;/p&gt;

</description>
      <category>testing</category>
      <category>database</category>
      <category>sql</category>
      <category>dotnet</category>
    </item>
  </channel>
</rss>
