Structured Query Language/Eliminate Duplicates

The Challenge

Sometimes people detect that corrupt or unwanted data exist in their database, eg: they forgot to create an Primary Key at CREATE TABLE time and got identical values in the key column or they recognize that values in one or the combination of multiple columns are not unique - as a violation of the business rules. The situation is commonly detected when they issue an ALTER TABLE ... ADD PRIMARY KEY (...) or CREATE INDEX index_1 ON table_1 (col_1, col_2) command.

In such cases data must be corrected or some rows must be deleted. As the first case depends strongly on the individual situation, we focus ourself on the later one. In general the SQL command will consists of two parts: The DELETE itself and a second part, were the pending rows are identified. In very complicate situations it may get necessary to use more than one SQL command (which always is declarative by definition) - maybe a CUSOR with a loop over the affected rows and additional actions depending on the values in different columns.

DELETE              -- the DELETE command needs no additional specification
FROM   mytable
WHERE ...           -- identify the unwanted rows
;

The solutions we discuss here are closely related to the explanations in Structured Query Language/Retrieve Top N Rows per Group. Over there we locate certain rows within groups. The same must be done here as we want to delete only dedicated rows, at least one must be kept in the affected groups.

We use the same table product on this page. The duty consists in eliminating all but one rows where the product prize is identical to any other prize within the same product group with the goal that the combination of product_group plus prize gets unique.

Identify affected Rows

A first approach to the situation may be a 'sniffing' in the data with a GROUP BY clause for a listing of possibly affected rows.

SELECT product_group, prize, COUNT(*)
FROM   product
GROUP BY product_group, prize      -- create groups
HAVING COUNT(*) > 1;               -- count the number of rows within each group

 product_group | prize  | count
---------------+--------+-------
 Laptop        | 675    |     2

-- Count the number of groups where such a problem exists
SELECT COUNT(*) FROM
  (SELECT product_group, prize, COUNT(*)
   FROM   product
   GROUP BY product_group, prize
   HAVING COUNT(*) > 1
  ) tmp;

 count
-------
     1

But the GROUP BY clause is not very helpfull as it is not possible to show columns other than the grouping columns and the result of some system functions like COUNT() (in rare cases a sort over a timestamp together with MAX(id) does help). The question is: how can we identify the 'right' and the 'wrong' rows? We need access to other columns of the rows to identify them. In the best case we get access to the rows ID.

To see such details we replace the GROUP BY clause by a window function (this is not the only possible solution). The following SQL command uses the same grouping over the two columns product_group and prize. And it uses a similar way to count affected rows. The main difference is that we see and have access to all columns of all rows.

SELECT product.*,
       COUNT(*) OVER (PARTITION BY product_group, prize ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt
FROM   product;

 id |      name       | product_group | prize  | cnt
----+-----------------+---------------+--------+-----
  8 | Workstation ONE | Desktop       | 499    |   1
  1 | Big Fat One     | Desktop       | 545    |   1
  5 | Solid           | Desktop       | 565    |   1
 11 | Tripple-A       | Desktop       | 580    |   1
 10 | Rusty           | Laptop        | 390    |   1
  3 | Angle           | Laptop        | 398    |   1
  9 | Air             | Laptop        | 450    |   1
  7 | WhiteHorse      | Laptop        | 675    |   2
  2 | SmartAndElegant | Laptop        | 675    |   2
 13 | AllDay Basic    | Smartphone    |  75    |   1
  4 | Wizzard 7       | Smartphone    | 380    |   1
 12 | Oxygen 8        | Smartphone    | 450    |   1
  6 | AllRounder      | Smartphone    | 535    |   1

This SELECT offers everything we need: The last column cnt counts the number of unique product_group/prize combinations. And the column id gives us access to every single row.

In the next step we expand this query and shift it into a subselect (window functions cannot be used in a WHERE clause, only their results). The rows with a counter of '1' are not of interest, we eliminate them from the further processing, orders the remaining rows in a deterministic way, and compute an additional column for the position within each group.

SELECT tmp.*
FROM   (
  SELECT product.*,
         COUNT(*)     OVER (PARTITION BY product_group, prize ORDER BY id
                            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt,
         ROW_NUMBER() OVER (PARTITION BY product_group, prize ORDER BY id) as position_within_group
  FROM   product
  ) tmp
WHERE  tmp.cnt > 1;

 id |      name       | product_group | prize  | cnt | position_within_group
----+-----------------+---------------+--------+-----+-----------------------
  2 | SmartAndElegant | Laptop        | 675    |   2 |                     1
  7 | WhiteHorse      | Laptop        | 675    |   2 |                     2

Up to this point our algorithm to identify problematic rows is easy, clear and the same for all use cases: create groups over the columns of interest with the PARTITION BY clause, count the number of rows within the groups, and eliminate groups with a counter of '1'. But now you have to decide how to go further on. Which of the rows shall survive and wich ones shall be deleted (or modified?)? The answer depends strongly on your business logic, the manner how the data was gone into the table, the expectations of your customers, and much more. So you have to take your own decision.

On this page we choose a very simple solution: The row with the smallest ID shall survive, all others will be deleted. For testing purposes we retrieve the rows we intend to delete, namely those with a position greater 1.

SELECT tmp.*
FROM  
  (SELECT product.*,
          COUNT(*)     OVER (PARTITION BY product_group, prize ORDER BY id
                             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt,
          ROW_NUMBER() OVER (PARTITION BY product_group, prize ORDER BY id) as position_within_group
   FROM   product
  ) tmp
WHERE  tmp.cnt > 1
AND    tmp.position_within_group > 1;

 id |    name    | product_group | prize  | cnt | position_within_group
----+------------+---------------+--------+-----+-----------------------
  7 | WhiteHorse | Laptop        | 675    |   2 |                     2

-- or retrieve rows which will survive:
...
AND    tmp.position_within_group = 1;

Delete Rows

If this is what you expect, you can delete the rows in the final step. Reduce the upper command to retrieve only the IDs, shift it into a subselect, and use its result as the input for a DELETE command.

BEGIN TRANSACTION;

DELETE
FROM   PRODUCT
WHERE  id IN
  (SELECT tmp.id
   FROM
     (SELECT product.*,
             COUNT(*)     OVER (PARTITION BY product_group, prize ORDER BY id
                                ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as cnt,
             ROW_NUMBER() OVER (PARTITION BY product_group, prize ORDER BY id) as position_within_group
      FROM   product
     ) tmp
   WHERE  tmp.cnt > 1
   AND    tmp.position_within_group > 1
  );

COMMIT;       -- or: ROLLBACK;

This article is issued from Wikibooks. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.