DEV Community: Danwycliff Ndwiga

Deleting Duplicates in SQL

Danwycliff Ndwiga — Fri, 01 Nov 2024 12:53:17 +0000

In SQL, handling duplicate records is essential for maintaining data accuracy, optimizing query performance, and ensuring consistent results. This article explores some practical techniques to identify and delete duplicate rows using SQL queries.

Delete Duplicates Using a Unique Identifier

Consider the following query:

DELETE FROM cars
WHERE id IN (
    SELECT MAX(id)
    FROM cars
    GROUP BY model, brand
    HAVING COUNT(1) > 1
);

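The subquery groups the rows in cars by model and brand and keeps only the groups that contain more than one row. For each of those groups, MAX(id) returns a single id (the highest one), and the DELETE removes that row. Because only one row is removed per group, a group with three or more copies still contains duplicates afterwards, so the statement has to be rerun until it deletes no more rows.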
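To see the one-row-per-pass behaviour, here is a minimal sketch that runs the same DELETE against an in-memory SQLite database through Python's sqlite3 module. The sample rows and the variable names are invented for illustration; other databases may need the subquery wrapped differently.

```python
import sqlite3

# Hypothetical sample data: three Corollas, two Civics, one Golf.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO cars (id, model, brand) VALUES (?, ?, ?)",
    [
        (1, "Corolla", "Toyota"),
        (2, "Corolla", "Toyota"),
        (3, "Corolla", "Toyota"),
        (4, "Civic", "Honda"),
        (5, "Civic", "Honda"),
        (6, "Golf", "Volkswagen"),
    ],
)

DELETE_MAX = """
DELETE FROM cars
WHERE id IN (
    SELECT MAX(id)
    FROM cars
    GROUP BY model, brand
    HAVING COUNT(1) > 1
)
"""

# First pass removes the highest id of each duplicated group (ids 3 and 5).
first_pass = conn.execute(DELETE_MAX).rowcount
# The Corolla group still has two rows, so a second pass removes id 2.
second_pass = conn.execute(DELETE_MAX).rowcount

remaining_ids = [row[0] for row in conn.execute("SELECT id FROM cars ORDER BY id")]
print(first_pass, second_pass, remaining_ids)  # 2 1 [1, 4, 6]
```

The second pass is only needed because one group started with three copies; with at most two copies per group a single pass is enough.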

Delete Duplicates Using a Self-Join

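In this approach, we use a self-join to identify and delete duplicate rows in the cars table, keeping only the row with the smallest id for each unique combination of model and brand. The join pairs every row (c1) with each row (c2) that shares its model and brand, and the condition c1.id < c2.id selects c2 whenever a row with a smaller id exists in the same group.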

DELETE FROM cars
WHERE id IN (
    SELECT c2.id
    FROM cars c1
    JOIN cars c2 ON c1.model = c2.model
                 AND c1.brand = c2.brand
    WHERE c1.id < c2.id
);
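The self-join can be tried out with a short sketch like the one below, which assumes an in-memory SQLite database driven from Python's sqlite3 module. The sample rows are hypothetical; the SQL is the statement from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO cars (id, model, brand) VALUES (?, ?, ?)",
    [
        (1, "Corolla", "Toyota"),
        (2, "Corolla", "Toyota"),
        (3, "Corolla", "Toyota"),
        (4, "Civic", "Honda"),
        (5, "Civic", "Honda"),
        (6, "Golf", "Volkswagen"),
    ],
)

# Every row that has a same-model, same-brand partner with a smaller id
# shows up as c2 and is deleted, so all extra copies go in one pass.
deleted = conn.execute(
    """
    DELETE FROM cars
    WHERE id IN (
        SELECT c2.id
        FROM cars c1
        JOIN cars c2 ON c1.model = c2.model
                     AND c1.brand = c2.brand
        WHERE c1.id < c2.id
    )
    """
).rowcount

survivors = [row[0] for row in conn.execute("SELECT id FROM cars ORDER BY id")]
print(deleted, survivors)  # 3 [1, 4, 6]
```

Unlike the MAX(id) query, this removes every extra copy in a single statement, at the cost of a join whose size grows with the number of duplicates per group.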

Delete Duplicates Using a Window Function

DELETE FROM cars
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY model, brand ORDER BY id) AS rn
        FROM cars
    ) AS x
    WHERE x.rn > 1
);

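In the inner subquery, ROW_NUMBER() assigns a sequential number to each row within each group of duplicates (defined by model and brand), starting at 1 for the lowest id. The outer query then deletes every row whose number is greater than 1, so only the first row of each group remains.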
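The sketch below first prints the row numbers and then runs the delete. It assumes Python's sqlite3 module linked against SQLite 3.25 or newer (the first release with window functions), and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO cars (id, model, brand) VALUES (?, ?, ?)",
    [
        (1, "Corolla", "Toyota"),
        (2, "Corolla", "Toyota"),
        (3, "Corolla", "Toyota"),
        (4, "Civic", "Honda"),
        (5, "Civic", "Honda"),
        (6, "Golf", "Volkswagen"),
    ],
)

# Inspect the numbering: rn restarts at 1 for each (model, brand) group.
numbered = conn.execute(
    """
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY model, brand ORDER BY id) AS rn
    FROM cars
    ORDER BY id
    """
).fetchall()
print(numbered)  # [(1, 1), (2, 2), (3, 3), (4, 1), (5, 2), (6, 1)]

# Delete everything except the first row of each group.
conn.execute(
    """
    DELETE FROM cars
    WHERE id IN (
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY model, brand ORDER BY id) AS rn
            FROM cars
        ) AS x
        WHERE x.rn > 1
    )
    """
)
kept = [row[0] for row in conn.execute("SELECT id FROM cars ORDER BY id")]
print(kept)  # [1, 4, 6]
```

Changing the ORDER BY inside OVER (for example to ORDER BY id DESC) changes which copy is numbered 1 and therefore which copy survives.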

Delete Duplicates Using the MIN Function

DELETE FROM cars
WHERE id NOT IN (
    SELECT MIN(id)
    FROM cars
    GROUP BY model, brand
);

The inner subquery SELECT MIN(id) FROM cars GROUP BY model, brand finds the lowest id for each unique combination of model and brand. This ensures that exactly one record for each model and brand pair is retained.
The DELETE FROM cars WHERE id NOT IN (...) statement removes every record whose id is not the minimum for its model and brand group. In effect, this keeps the row with the lowest id (the oldest record, if ids are assigned in increasing order) and removes the duplicates.
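Because a DELETE is hard to undo, it can help to preview the affected rows with a SELECT that uses the same WHERE clause. The following is a minimal sketch of that habit, assuming an in-memory SQLite database via Python's sqlite3 module and invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER PRIMARY KEY, model TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO cars (id, model, brand) VALUES (?, ?, ?)",
    [
        (1, "Corolla", "Toyota"),
        (2, "Corolla", "Toyota"),
        (3, "Corolla", "Toyota"),
        (4, "Civic", "Honda"),
        (5, "Civic", "Honda"),
        (6, "Golf", "Volkswagen"),
    ],
)

# The condition shared by the preview and the delete.
NOT_MIN = "id NOT IN (SELECT MIN(id) FROM cars GROUP BY model, brand)"

# Preview: which ids would be removed?
to_delete = [
    row[0]
    for row in conn.execute(f"SELECT id FROM cars WHERE {NOT_MIN} ORDER BY id")
]
print(to_delete)  # [2, 3, 5]

# Run the delete and confirm one row per (model, brand) is left.
conn.execute(f"DELETE FROM cars WHERE {NOT_MIN}")
left = conn.execute("SELECT id, model, brand FROM cars ORDER BY id").fetchall()
print(left)  # [(1, 'Corolla', 'Toyota'), (4, 'Civic', 'Honda'), (6, 'Golf', 'Volkswagen')]
```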

Understanding and Using Window Functions in SQL

Danwycliff Ndwiga — Thu, 31 Oct 2024 13:28:34 +0000

Introduction

Window functions are SQL functions that perform calculations across a set of rows related to the current row (a window or partition) without collapsing those rows into a single output row.
While aggregate functions such as SUM and AVG, when used with GROUP BY, collapse each group into a single result, window functions keep all rows in the result set and apply calculations over a specific range of rows.

Syntax

The general syntax of a SQL window function is:

<window_function>() OVER (
    [PARTITION BY <partition_column>]
    [ORDER BY <order_column>]
    [<frame_clause>]
)

The PARTITION BY clause divides the data into partitions (subsets) based on one or more columns.
The ORDER BY clause orders the rows within each partition, often by a time column or a numeric value.
The frame clause defines the window frame, specifying the exact range of rows to include in each calculation relative to the current row.
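As an illustration of the three clauses working together, the sketch below computes a running total per region. It assumes a hypothetical sales table in an in-memory SQLite database (version 3.25 or newer) accessed through Python's sqlite3 module; the table, columns, and values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (region, day, amount) VALUES (?, ?, ?)",
    [
        ("east", 1, 10),
        ("east", 2, 20),
        ("east", 3, 30),
        ("west", 1, 5),
        ("west", 2, 15),
    ],
)

running = conn.execute(
    """
    SELECT region, day, amount,
           SUM(amount) OVER (
               PARTITION BY region                                -- one window per region
               ORDER BY day                                       -- rows processed in day order
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- frame: start of partition up to this row
           ) AS running_total
    FROM sales
    ORDER BY region, day
    """
).fetchall()

for row in running:
    print(row)
# ('east', 1, 10, 10)
# ('east', 2, 20, 30)
# ('east', 3, 30, 60)
# ('west', 1, 5, 5)
# ('west', 2, 15, 20)
```

Every input row is still present in the output; the total restarts when the region changes because each partition gets its own window.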

Types of window functions

There are four main types of window functions:

Aggregate Window Functions: Functions like SUM(), AVG(), COUNT(), MIN(), and MAX(), used with the OVER clause.
Ranking Window Functions: Functions like ROW_NUMBER(), RANK(), DENSE_RANK(), and NTILE(), which rank rows within an ordered window.
Value Window Functions: Functions like LAG(), LEAD(), FIRST_VALUE(), and LAST_VALUE(), which access values from other rows within the window.
Analytic Window Functions: Functions like CUME_DIST() and PERCENT_RANK(), which provide statistical insights.
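One function from each of the four groups can be combined in a single query, as in the rough sketch below. It assumes a hypothetical scores table in an in-memory SQLite database (3.25 or newer) accessed through Python's sqlite3 module; the names and scores are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores (name, score) VALUES (?, ?)",
    [("A", 90), ("B", 80), ("C", 80), ("D", 70)],
)

rows = conn.execute(
    """
    SELECT name,
           score,
           AVG(score)  OVER ()                          AS avg_score,   -- aggregate
           RANK()      OVER (ORDER BY score DESC)       AS rnk,         -- ranking
           LAG(score)  OVER (ORDER BY score DESC, name) AS prev_score,  -- value
           CUME_DIST() OVER (ORDER BY score DESC)       AS cume         -- analytic
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()

for row in rows:
    print(row)
# ('A', 90, 80.0, 1, None, 0.25)
# ('B', 80, 80.0, 2, 90, 0.75)
# ('C', 80, 80.0, 2, 80, 0.75)
# ('D', 70, 80.0, 4, 80, 1.0)
```

The tied scores show the differences: RANK() gives both rows rank 2 and skips 3, CUME_DIST() reports the same fraction for both, and LAG() returns NULL (None in Python) for the first row because there is no previous row.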

SQL "SELECT INTO" vs "INSERT INTO SELECT" statements.

Danwycliff Ndwiga — Wed, 30 Oct 2024 18:32:24 +0000

The "SELECT INTO" statement copies data from one table into a new table.
The syntax of the statement is as follows:

SELECT *
INTO newtable [IN externaldb]
FROM oldtable
WHERE condition;

We can also copy only some columns into the new table:

SELECT column1, column2, column3, ...
INTO newtable [IN externaldb]
FROM oldtable
WHERE condition;
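Not every database accepts this syntax. SQLite, for instance, has no SELECT INTO, so the sketch below swaps in CREATE TABLE ... AS SELECT, which likewise creates the new table from the query result. The example assumes Python's sqlite3 module with an in-memory database; the table contents and the city filter are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE oldtable (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO oldtable (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Nairobi"), (2, "Brian", "Mombasa"), (3, "Carol", "Nairobi")],
)

# SQLite stand-in for:
#   SELECT id, name INTO newtable FROM oldtable WHERE city = 'Nairobi';
# newtable does not exist beforehand; the statement creates it.
conn.execute(
    """
    CREATE TABLE newtable AS
    SELECT id, name
    FROM oldtable
    WHERE city = 'Nairobi'
    """
)

cursor = conn.execute("SELECT * FROM newtable ORDER BY id")
new_columns = [col[0] for col in cursor.description]
new_rows = cursor.fetchall()
print(new_columns, new_rows)  # ['id', 'name'] [(1, 'Asha'), (3, 'Carol')]
```

Only the selected columns and the rows matching the WHERE condition end up in the new table.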

INSERT INTO SELECT

The INSERT INTO SELECT statement copies data from one table and inserts it into another table.

The INSERT INTO SELECT statement requires that the data types in source and target tables match.

Note: The existing records in the target table are unaffected.

INSERT INTO table2
SELECT * FROM table1
WHERE condition;

We can also copy only some columns from one table into another table:

INSERT INTO table2 (column1, column2, column3, ...)
SELECT column1, column2, column3, ...
FROM table1
WHERE condition;
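A minimal sketch of INSERT INTO SELECT, assuming two small hypothetical tables in an in-memory SQLite database accessed through Python's sqlite3 module. The target table already holds one row so that the effect on existing records is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, name TEXT, city TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO table1 (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Nairobi"), (2, "Brian", "Mombasa"), (3, "Carol", "Nairobi")],
)
# A row that is already in the target table before the copy.
conn.execute("INSERT INTO table2 (id, name) VALUES (100, 'existing')")

# Copy two columns of the matching rows into the existing table.
copied = conn.execute(
    """
    INSERT INTO table2 (id, name)
    SELECT id, name
    FROM table1
    WHERE city = 'Nairobi'
    """
).rowcount

target_rows = conn.execute("SELECT id, name FROM table2 ORDER BY id").fetchall()
print(copied, target_rows)  # 2 [(1, 'Asha'), (3, 'Carol'), (100, 'existing')]
```

The pre-existing row is still there after the insert, and the column list on the INSERT side lines up by position with the column list on the SELECT side.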

Conclusion

SELECT INTO creates a new table while INSERT INTO SELECT requires an existing table.
SELECT INTO is for creating backup or temporary tables while INSERT INTO SELECT is used to transfer data between existing tables.