SQL: Multi field mixed deduplication followed by numbering #eg81

#sql #development #devdiscuss #github

A SQL Server database table stores personnel records from multiple sources. If two records have the same value in any one of the Name, Phone, or Email fields, they belong to the same person. Null means the value is unknown: when a field is null in both records, that field does not count as a match, and whether the records are duplicates depends on the other fields. Duplication is also transitive: if A and B are the same, and B and C are the same, then A and C are the same.
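The matching rule can be sketched as a small predicate. This is an illustration in Python rather than SPL, and the lowercase keys `name`, `phone`, `email` and the function name `is_duplicate` are assumptions made for the example, not names from the original table or script.

```python
FIELDS = ("name", "phone", "email")  # hypothetical keys standing in for Name, Phone, Email


def is_duplicate(a, b):
    """Return True when at least one field is non-null in both records and equal.

    A null (None) field never matches anything, including another null.
    """
    return any(a[f] is not None and a[f] == b[f] for f in FIELDS)


a = {"name": "Ann", "phone": None, "email": "ann@x.com"}
b = {"name": None, "phone": "555", "email": "ann@x.com"}
c = {"name": "Bob", "phone": "555", "email": None}
d = {"name": "Eve", "phone": None, "email": None}

print(is_duplicate(a, b))  # True: same email
print(is_duplicate(b, c))  # True: same phone
print(is_duplicate(a, c))  # False: no shared non-null value
print(is_duplicate(a, d))  # False: phone is null in both, which is not a match
```

Here `a` and `c` are not directly duplicates, but they still belong to the same person because both match `b`. The predicate alone does not capture that transitivity; it has to be handled when the numbers are assigned.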

Requirement: Add a personnel number 'no' as a computed column, find the duplicate records, and assign each set of duplicate records its own 'no'.

SPL code

A1: Query the database through JDBC.

A2: Add a new number column 'no', whose default value is the record's sequence number #.

A3: Use a loop to traverse the records and adjust the 'no' column. If any 'no' was adjusted during a traversal, traverse again, and stop only when a full traversal adjusts nothing. Each traversal goes from top to bottom: the i-th record is compared in sequence with every record from the (i+1)-th to the last. If two records are duplicates, synchronize their 'no' values by setting both to the smaller of the two. A null field value never matches any other record. This works because a logical AND between null and any value evaluates to false.
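The traversal described in A3 can be sketched as follows. This is a Python illustration of the same idea under stated assumptions, not the author's SPL script: the record layout, the helper `same_person`, and the function `assign_numbers` are hypothetical names introduced for the example.

```python
FIELDS = ("name", "phone", "email")  # hypothetical keys for Name, Phone, Email


def same_person(a, b):
    # A field counts only when it is non-null in both records and equal.
    return any(a[f] is not None and a[f] == b[f] for f in FIELDS)


def assign_numbers(records):
    """Repeat top-to-bottom passes until a whole pass adjusts no number."""
    no = list(range(1, len(records) + 1))  # default: the record number
    changed = True
    while changed:
        changed = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                # Skip pairs that are already synchronized.
                if no[i] != no[j] and same_person(records[i], records[j]):
                    no[i] = no[j] = min(no[i], no[j])
                    changed = True
    return no


records = [
    {"name": "Ann", "phone": "111", "email": None},
    {"name": None, "phone": "111", "email": "ann@x.com"},
    {"name": "Bob", "phone": None, "email": "ann@x.com"},
    {"name": "Eve", "phone": "222", "email": None},
    {"name": "Bob", "phone": None, "email": None},
]
print(assign_numbers(records))  # [1, 1, 1, 4, 1]
```

Records 1, 2, 3, and 5 are chained together through phone, email, and name, so they end up sharing the number 1, while record 4 keeps its own number. The resulting numbers identify each group but are not necessarily consecutive.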

The count function returns the number of members that meet a condition. While the i-th record is compared with the records from the (i+1)-th to the last, any adjustment of 'no' makes the inner count greater than 0. That in turn makes the outer count greater than 0, which satisfies the condition for continuing the loop.

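The condition 'no!=T.no' is still required: it guarantees that every adjustment makes at least one 'no' value strictly smaller. Without it, a pair of duplicates whose 'no' values are already equal would still be counted as an adjustment on every pass, and an endless loop may occur.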
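To see why the inequality check matters, the following Python sketch counts the adjustments made in a single pass, with and without the check. It is an illustration only: `one_pass`, its `guard` flag, and the record layout are hypothetical and do not come from the SPL script.

```python
FIELDS = ("name", "phone", "email")  # hypothetical keys for Name, Phone, Email


def same_person(a, b):
    # A field counts only when it is non-null in both records and equal.
    return any(a[f] is not None and a[f] == b[f] for f in FIELDS)


def one_pass(records, no, guard):
    """Run one top-to-bottom pass and return how many adjustments it counted."""
    adjustments = 0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if guard and no[i] == no[j]:
                continue  # already synchronized, nothing can get smaller
            if same_person(records[i], records[j]):
                no[i] = no[j] = min(no[i], no[j])
                adjustments += 1
    return adjustments


records = [
    {"name": "Ann", "phone": "111", "email": None},
    {"name": None, "phone": "111", "email": "ann@x.com"},
]

no = [1, 2]
print(one_pass(records, no, guard=True))   # 1: record 2 is renumbered to 1
print(one_pass(records, no, guard=True))   # 0: nothing left to adjust, the loop can stop
print(one_pass(records, no, guard=False))  # 1: the same pair is counted again
print(one_pass(records, no, guard=False))  # 1: and again, so the count never reaches 0
```

With the check, each counted adjustment strictly lowers some number, and numbers cannot decrease forever, so a pass with zero adjustments is eventually reached. Without it, the count stays positive as long as any duplicate pair exists.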

