Structured Query Language/Eliminate Duplicates
< Structured Query LanguageThe Challenge
Sometimes people detect that corrupt or unwanted data exist in their database, eg: they forgot to create an Primary Key at CREATE TABLE time and got identical values in the key column or they recognize that values in one or the combination of multiple columns are not unique - as a violation of the business rules. The situation is commonly detected when they issue an ALTER TABLE ... ADD PRIMARY KEY (...)
or CREATE INDEX index_1 ON table_1 (col_1, col_2)
command.
In such cases data must be corrected or some rows must be deleted. As the first case depends strongly on the individual situation, we focus ourself on the later one. In general the SQL command will consists of two parts: The DELETE itself and a second part, were the pending rows are identified. In very complicate situations it may get necessary to use more than one SQL command (which always is declarative by definition) - maybe a CUSOR with a loop over the affected rows and additional actions depending on the values in different columns.
DELETE -- the DELETE command needs no additional specification
FROM mytable
WHERE ... -- identify the unwanted rows
;
The solutions we discuss here are closely related to the explanations in Structured Query Language/Retrieve Top N Rows per Group. Over there we locate certain rows within groups. The same must be done here as we want to delete only dedicated rows, at least one must be kept in the affected groups.
We use the same table product on this page. The duty consists in eliminating all but one rows where the product prize is identical to any other prize within the same product group with the goal that the combination of product_group plus prize gets unique.
Identify affected Rows
A first approach to the situation may be a 'sniffing' in the data with a GROUP BY
clause for a listing of possibly affected rows.
SELECT product_group, prize, COUNT(*)
FROM product
GROUP BY product_group, prize -- create groups
HAVING COUNT(*) > 1; -- count the number of rows within each group
product_group | prize | count
---------------+--------+-------
Laptop | 675 | 2
-- Count the number of groups where such a problem exists
SELECT COUNT(*) FROM
(SELECT product_group, prize, COUNT(*)
FROM product
GROUP BY product_group, prize
HAVING COUNT(*) > 1
) tmp;
count
-------
1
But the GROUP BY
clause is not very helpfull as it is not possible to show columns other than the grouping columns and the result of some system functions like COUNT()
(in rare cases a sort over a timestamp together with MAX(id)
does help). The question is: how can we identify the 'right' and the 'wrong' rows? We need access to other columns of the rows to identify them. In the best case we get access to the rows ID.
To see such details we replace the GROUP BY
clause by a window function (this is not the only possible solution). The following SQL command uses the same grouping over the two columns product_group and prize. And it uses a similar way to count affected rows. The main difference is that we see and have access to all columns of all rows.
SELECT product.*,
COUNT(*) OVER (PARTITION BY product_group, prize ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt
FROM product;
id | name | product_group | prize | cnt
----+-----------------+---------------+--------+-----
8 | Workstation ONE | Desktop | 499 | 1
1 | Big Fat One | Desktop | 545 | 1
5 | Solid | Desktop | 565 | 1
11 | Tripple-A | Desktop | 580 | 1
10 | Rusty | Laptop | 390 | 1
3 | Angle | Laptop | 398 | 1
9 | Air | Laptop | 450 | 1
7 | WhiteHorse | Laptop | 675 | 2
2 | SmartAndElegant | Laptop | 675 | 2
13 | AllDay Basic | Smartphone | 75 | 1
4 | Wizzard 7 | Smartphone | 380 | 1
12 | Oxygen 8 | Smartphone | 450 | 1
6 | AllRounder | Smartphone | 535 | 1
This SELECT
offers everything we need: The last column cnt counts the number of unique product_group/prize combinations. And the column id gives us access to every single row.
In the next step we expand this query and shift it into a subselect (window functions cannot be used in a WHERE
clause, only their results). The rows with a counter of '1' are not of interest, we eliminate them from the further processing, orders the remaining rows in a deterministic way, and compute an additional column for the position within each group.
SELECT tmp.*
FROM (
SELECT product.*,
COUNT(*) OVER (PARTITION BY product_group, prize ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt,
ROW_NUMBER() OVER (PARTITION BY product_group, prize ORDER BY id) as position_within_group
FROM product
) tmp
WHERE tmp.cnt > 1;
id | name | product_group | prize | cnt | position_within_group
----+-----------------+---------------+--------+-----+-----------------------
2 | SmartAndElegant | Laptop | 675 | 2 | 1
7 | WhiteHorse | Laptop | 675 | 2 | 2
Up to this point our algorithm to identify problematic rows is easy, clear and the same for all use cases: create groups over the columns of interest with the PARTITION BY
clause, count the number of rows within the groups, and eliminate groups with a counter of '1'. But now you have to decide how to go further on. Which of the rows shall survive and wich ones shall be deleted (or modified?)? The answer depends strongly on your business logic, the manner how the data was gone into the table, the expectations of your customers, and much more. So you have to take your own decision.
On this page we choose a very simple solution: The row with the smallest ID shall survive, all others will be deleted. For testing purposes we retrieve the rows we intend to delete, namely those with a position greater 1.
SELECT tmp.*
FROM
(SELECT product.*,
COUNT(*) OVER (PARTITION BY product_group, prize ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt,
ROW_NUMBER() OVER (PARTITION BY product_group, prize ORDER BY id) as position_within_group
FROM product
) tmp
WHERE tmp.cnt > 1
AND tmp.position_within_group > 1;
id | name | product_group | prize | cnt | position_within_group
----+------------+---------------+--------+-----+-----------------------
7 | WhiteHorse | Laptop | 675 | 2 | 2
-- or retrieve rows which will survive:
...
AND tmp.position_within_group = 1;
Delete Rows
If this is what you expect, you can delete the rows in the final step. Reduce the upper command to retrieve only the IDs, shift it into a subselect, and use its result as the input for a DELETE
command.
BEGIN TRANSACTION;
DELETE
FROM PRODUCT
WHERE id IN
(SELECT tmp.id
FROM
(SELECT product.*,
COUNT(*) OVER (PARTITION BY product_group, prize ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt,
ROW_NUMBER() OVER (PARTITION BY product_group, prize ORDER BY id) as position_within_group
FROM product
) tmp
WHERE tmp.cnt > 1
AND tmp.position_within_group > 1
);
COMMIT; -- or: ROLLBACK;