No Duplicates in Subset

Yesterday I was helping a friend of mine with an interesting problem.

There is a big table ~4MLN rows, it is wide (has many columns, lets say 30). We are interested in getting rows, which contain duplicates in 5 columns, we do not care what data are in other columns, at least not with regard to being a double or not. The type of data stored in those 5 columns are dates, numerical or text, you name it. The question is:

How to retrieve all columns of a table for the rows which have duplicates in the 5 previously selected columns?

The question is not so trivial. Very first idea that comes to everyone’s mind is do a GROUP BY with HAVING clause, to select the doubles. But what next? Doing a join over five columns, from which some are text is not computationally friendly idea. You also cannot use id (primary key) in the SELECT statement as GROUP BY will not work. Or can you? No, you cannot.

My idea was quite simple. Get the ids of rows that do not have duplicates and then take everything else. But again, you cannot use GROUP BY. Of course you can., but you have to think how to get the id in the SELECT while grouping over other columns. The query I suggested looked like this

SELECT * FROM my_table WHERE id NOT IN
    (
     SELECT
          max(id) AS new_id
     FROM
          my_table
     GROUP BY
          a, b, c, d, e
     HAVING
          max(id)=min(id)
    );

It works because if and only if a row does not have any duplicates in the table the max(id) and min(id) will be equal. Hence, we need to take everything else.

To add even some speed to the task divide it into two steps. First create a table from the inner SELECT statment, and create an index on the column new_id, then use new table in the inner SELECT statment, or perform LEFT JOIN to the base table, and select all rows where data were not joined.

Leave a Reply