Removing duplicates

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • TVining
    New Member
    • Dec 2007
    • 26

    Removing duplicates

    Okay, I have a table with the following fields (amongst others):
    [ID] (an autonumber)
    [Date] (YYYY-MM-DD HH:MM:SS)
    [Title] (255 character text)
    [xcount] (Numeric)

    What I want to do is delete the duplicate posts. They are duplicate ONLY if [title] is the same, and the [Date] is within 3 days of each other. I want to delete the lower numbered posts, or those with the lowest [xcount].

    Anyone have a good idea for me?

    Thanks!

    TIger
  • coolsti
    Contributor
    • Mar 2008
    • 310

    #2
    I was just going to give you a suggestion of how you might go about this, but then I realize there is a specific problem in your problem specification.

    You are saying that two rows are "duplicate" if the ID is the same and the date field is within 3 days of each other. I have a problem deciding when to decide that rows are duplicates.

    Take this case: Let us say you have 10 rows in your table with the same title, and with different values in the date field, but where these values are 2 days apart from each other. In other words, the total date range here spans about 20 days. Now, which rows can be considered duplicates to which other rows?

    So your basic problem is not really well defined.

    Comment

    • TVining
      New Member
      • Dec 2007
      • 26

      #3
      Sorry.

      I get about 500 records per day. These are press releases.

      Sometimes, I have a person who has family members in 2 seperate states, so I get 2 news releases. I only want to keep one. I don't care WHICH one. Those would be duplicates.

      Sometimes, the news agency has an issue and re-releases the same news 2 days in a row. Those would be duplicates.

      Sometimes "Mike Smith" graduates school, then 2 months later he gets promoted, those would not be duplicates, but would have the same title.

      Here's an example. I would like the system to look at this and come up with 2 posts, one for "arrived for duty" and one for "graduated BMT": http://www.bmtgrad.com/HOMETOWNNEWS/...ip+D.+Smith%22

      [xcount] is if they have any comments. If someone has commented, I would like to keep THAT post.

      That make more sense?

      Comment

      • r035198x
        MVP
        • Sep 2006
        • 13225

        #4
        See if this helps.

        Comment

        • coolsti
          Contributor
          • Mar 2008
          • 310

          #5
          There is still the conceptual problem of knowing when to delete what, since even as you describe it, you can still have a long series of rows in your table that have adjacent entries with dates within 3 days of each other, making it difficult to know what is a duplicate or not.

          I think what you might better want to do is to not think in terms of "deleting" what you consider to be duplicate, but to instead refuse to "insert" a new row if it can be considered a duplicate. This would remove the problem.

          Let us say a new entry is to be added and it has the same title to another entry that was made 2 days ago, and does not include comments or whatever that would make it a non-duplicate. Then you simply do not do the insert. It is never entered and does not need to be deleted. Then 2 days later, another entry is to be made with the same title. But now a simple query finds that the latest entry with the same title is now from 4 days ago, and so you decide this is not a duplicate, and you do the insert.

          Maybe this helps? It sometimes is easier to "not insert in the first place" than to decide what to "delete afterwards".

          Comment

          • TVining
            New Member
            • Dec 2007
            • 26

            #6
            I appreciate the help on this. I'm down to only about 2,000 duplicate posts by deleting a few a day! :-P

            The problem on this solution is that I use an automated script to make the [TITLE] Field. All records are inserted with the same title. [NEWS RELEASE], as that is how they arrive via email.
            I then run a script which renames them based upon the first several words of the release. "Mr John B Smith has graduated from Johnson College", "Mr Micheal Jones has been promoted to senior sales engineer" My script goes through and finds everything after "Mr, Mrs or Ms" and takes the words up to "Has" so the title would be "John B Smith", "Michael Jones", etc.
            Arrgh, If my head wasn't shaven, I'd pull my hair out.

            Comment

            • TVining
              New Member
              • Dec 2007
              • 26

              #7
              Originally posted by r035198x
              See if this helps.
              That does help a lot, but how would I write in the part about them being posted within 72 hours.

              Tony

              Comment

              • r035198x
                MVP
                • Sep 2006
                • 13225

                #8
                First transform the problem to that of selecting records such that their date is within three days of another record. You then run the delete against that result set.

                Comment

                Working...