Best way for finding duplicate entries in table?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • andreas.maurer1971@web.de

    Best way for finding duplicate entries in table?

    Hi all,

    since a few years I use the following statement to find duplicate
    entries in a table:

    SELECT t1.id, t2.id,...
    FROM table AS t1 INNER JOIN table AS t2
    ON t1.field = t2.field
    WHERE t1.id < t2.id

    Of course Indexes are set on id (Primary Key, Autoincrement) and field
    (normal Index) and are, according to EXPLAIN, used (index "field").

    This statement works quite well but takes ages (> 10 minutes) of time
    when using it on a table with tenthousands of rows.

    Now my question:
    Are there better and especially faster ways to find duplicate entries
    within a table without having to create a seperate table which will be
    filled with a SELECT DISTINCT... statement from the "old" table?

    Thank's in advance for your help and time,

    Andy Maurer

  • Aggro

    #2
    Re: Best way for finding duplicate entries in table?

    andreas.maurer1 971@web.de wrote:
    [color=blue]
    > Now my question:
    > Are there better and especially faster ways to find duplicate entries
    > within a table without having to create a seperate table which will be
    > filled with a SELECT DISTINCT... statement from the "old" table?[/color]

    select id,count(field) as amount
    from table group by field having amount > 1;

    Comment

    • andreas.maurer1971@web.de

      #3
      Re: Best way for finding duplicate entries in table?



      Aggro schrieb:[color=blue]
      > andreas.maurer1 971@web.de wrote:
      >[color=green]
      > > Now my question:
      > > Are there better and especially faster ways to find duplicate entries
      > > within a table without having to create a seperate table which will be
      > > filled with a SELECT DISTINCT... statement from the "old" table?[/color]
      >
      > select id,count(field) as amount
      > from table group by field having amount > 1;[/color]

      Hi Aggro,

      thank you very much for your help, it shows me duplicate entries but,
      as well due to a not better specified question of mine, some problems
      are left open.
      First of all your statement, when issued in phpmyadmin, tells me that
      it found in total all rows of the table. Do you have any idea, why?
      Another problem I have is: It shows me the duplicate entries in "field"
      but only one of the many keys related to this entry. Therefore, if I
      wanted to delete all but one of these entries, I have to issue a lot of
      SELECT and DELETE statements in order to clean my table.

      Of course I always could create another (temporary) table and fill it
      with DISTINCT entries of table1 but this is sometimes not an option for
      me.

      Do you have another idea how to fix this?

      Thank's again,

      Andy

      Comment

      • Bill Karwin

        #4
        Re: Best way for finding duplicate entries in table?

        andreas.maurer1 971@web.de wrote:[color=blue][color=green]
        >>select id,count(field) as amount
        >>from table group by field having amount > 1;[/color][/color]
        [color=blue]
        > Another problem I have is: It shows me the duplicate entries in "field"
        > but only one of the many keys related to this entry. Therefore, if I
        > wanted to delete all but one of these entries, I have to issue a lot of
        > SELECT and DELETE statements in order to clean my table.[/color]

        This should give the list of fields that occur more than once in the
        database:

        SELECT t2.`field`
        FROM `table` AS t2
        GROUP BY t2.`field`
        HAVING COUNT(t2.`field `) > 1;

        This should give the list of id's in whose records those values for
        field occur:

        SELECT t.id
        FROM `table` AS t
        WHERE t.`field` IN (
        SELECT t2.`field`
        FROM `table` AS t2
        GROUP BY t2.`field`
        HAVING COUNT(t2.`field `) > 1);

        This should give the list of the lowest id for each given value in field:

        SELECT MIN(t3.id)
        FROM `table` AS t3
        GROUP BY t3.`field`;

        Putting it all together, this should delete records who contain values
        for field that appear on more than one row, but do not delete such
        records with the lowest id for a given value of `field`:

        DELETE FROM `table`
        WHERE id IN (
        SELECT t.id
        FROM `table` AS t
        WHERE t.`field` IN (
        SELECT t2.`field`
        FROM `table` AS t2
        GROUP BY t2.`field`
        HAVING COUNT(t2.`field `) > 1))
        AND id NOT IN (
        SELECT MIN(t3.id)
        FROM `table` AS t3
        GROUP BY t3.`field`);

        I'm assuming that you want to keep each record with the lowest id value.
        In order to do this in a single operation, you must have some kind of
        uniform policy like this, that applies to all cases. You could use
        MAX(t3.id) instead to delete all but the most recent records, but other
        than those two options, the alternatives get more complex to implement.

        Regards,
        Bill K.

        Comment

        Working...