Finding duplicate records

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Thomas Arthur Seidel

    Finding duplicate records

    Hello to all,

    I have a small or big problem with a customer data base, where during a change of
    system we might have created duplicate records. This should be easy to find, you
    might think, but, we are talking about roughly 10000 records or less in a total
    volume of 1 MIO records or more.

    I have considered a strategy: The station ID and a field with something like a
    sequence number are supposed to be unique during that period. The sequence number
    will repeat roughly every two - four years, so in the period of three months,
    which is my problematic time, the sequence number (SeqNo) must be unique for each
    of the 40 stations (StationID) that I record in this data base table.

    I did run a simple 'create index' on these two fields, and, as to be expected, the
    routine fails, blahblah, meaning, there are duplicates. Fine. So, how can I write
    a query which lists to me all pairs (StationID, SeqNo) of data, where two or more
    rows are identically?

    If it makes things easier, I have a copy of that data base on a separate server,
    not on the live system I mean. I could strip off all data which is not related to
    that period of time.

    Can someone give to me a hint? Or a select statement?

    TS
  • rhaazy

    #2
    Re: Finding duplicate records

    On Aug 1, 2:07 pm, Thomas Arthur Seidel <thomas.arthur. sei...@gmx.de>
    wrote:
    Hello to all,
    >
    I have a small or big problem with a customer data base, where during a change of
    system we might have created duplicate records. This should be easy to find, you
    might think, but, we are talking about roughly 10000 records or less in atotal
    volume of 1 MIO records or more.
    >
    I have considered a strategy: The station ID and a field with something like a
    sequence number are supposed to be unique during that period. The sequence number
    will repeat roughly every two - four years, so in the period of three months,
    which is my problematic time, the sequence number (SeqNo) must be unique for each
    of the 40 stations (StationID) that I record in this data base table.
    >
    I did run a simple 'create index' on these two fields, and, as to be expected, the
    routine fails, blahblah, meaning, there are duplicates. Fine. So, how canI write
    a query which lists to me all pairs (StationID, SeqNo) of data, where twoor more
    rows are identically?
    >
    If it makes things easier, I have a copy of that data base on a separate server,
    not on the live system I mean. I could strip off all data which is not related to
    that period of time.
    >
    Can someone give to me a hint? Or a select statement?
    >
    TS
    You need to first figure out what you classify as a duplicate record?
    If only one field in a record is different from the next record is it
    a therefor not duplicate? If the only problem is a particular field
    over a given period, then you simply run a procedure to count the
    number of occurances for every ID over a given time. Or better yet
    run a procedure that only returns fields that have a count greater
    than 1.

    Comment

    • Plamen Ratchev

      #3
      Re: Finding duplicate records

      Try this:

      SELECT SeqNo, StationID, COUNT(*)
      FROM Table
      GROUP BY SeqNo, StationID
      HAVING COUNT(*) 1;


      Plamen Ratchev

      Comment

      • Thomas Arthur Seidel

        #4
        Re: Finding duplicate records

        To classify what is a duplicate record: If (StationID,SeqN o) is repeating. Mainly,
        if there is more then one record of the same SeqNo for one StationId. Each of the
        two values individually may repeat: Many records do come from the same station,
        and every station starts with the same sequence Number (SeqNo), it starts with
        "1", increments with every record, continues for years, until it is reset to "1"
        again some nice and warm day, when the accountant does decide this. For the period
        of three months, where I want to find my problem, for sure there is no regular way
        to have two or more transactions from the same station with a repeated SeqNo. The
        SeqNo must be unique for each station during this time.
        TS



        rhaazy wrote:
        On Aug 1, 2:07 pm, Thomas Arthur Seidel <thomas.arthur. sei...@gmx.de>
        wrote:
        >Hello to all,
        >>
        >I have a small or big problem with a customer data base, where during a change of
        >system we might have created duplicate records. This should be easy to find, you
        >might think, but, we are talking about roughly 10000 records or less in a total
        >volume of 1 MIO records or more.
        >>
        >I have considered a strategy: The station ID and a field with something like a
        >sequence number are supposed to be unique during that period. The sequence number
        >will repeat roughly every two - four years, so in the period of three months,
        >which is my problematic time, the sequence number (SeqNo) must be unique for each
        >of the 40 stations (StationID) that I record in this data base table.
        >>
        >I did run a simple 'create index' on these two fields, and, as to be expected, the
        >routine fails, blahblah, meaning, there are duplicates. Fine. So, how can I write
        >a query which lists to me all pairs (StationID, SeqNo) of data, where two or more
        >rows are identically?
        >>
        >If it makes things easier, I have a copy of that data base on a separate server,
        >not on the live system I mean. I could strip off all data which is not related to
        >that period of time.
        >>
        >Can someone give to me a hint? Or a select statement?
        >>
        >TS
        >
        You need to first figure out what you classify as a duplicate record?
        If only one field in a record is different from the next record is it
        a therefor not duplicate? If the only problem is a particular field
        over a given period, then you simply run a procedure to count the
        number of occurances for every ID over a given time. Or better yet
        run a procedure that only returns fields that have a count greater
        than 1.

        Comment

        • Thomas Arthur Seidel

          #5
          Re: Finding duplicate records

          Works !
          Unfortunately it proved also, that my idea of duplicate records was false.

          TS

          Plamen Ratchev wrote:
          Try this:
          >
          SELECT SeqNo, StationID, COUNT(*)
          FROM Table
          GROUP BY SeqNo, StationID
          HAVING COUNT(*) 1;
          >
          >
          Plamen Ratchev
          http://www.SQLStudio.com

          Comment

          Working...