Is there a more efficient way to get unique results than by using DISTINCT?

  • Jesse Jones
    New Member
    • Jan 2011
    • 51

    Is there a more efficient way to get unique results than by using DISTINCT?

    I am successfully calling around half a million records, but I have a problem with duplicate records. Without using the DISTINCT function I can call these records in half a second, but when using the DISTINCT function it takes about five minutes for Access to weed out the duplicate records. Is there a more efficient way to get rid of these duplicate records?

    (I am most familiar with SQL and only moderately familiar with Access, so I am working in the SQL View. If your solution is better implemented in the Design View I can work with that fine also.)

    Let me know if you need more information.
    Thank you!
  • code green
    Recognized Expert Top Contributor
    • Mar 2007
    • 1726

    #2
    I have found GROUP BY to be quicker than DISTINCT.
    Not much, but I have never tried with that many records.

    Post the SQL and maybe somebody can suggest a more efficient query.
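    For illustration (made-up table and field names, since I haven't seen your SQL), the two forms look like this:

    Code:
    -- Using DISTINCT:
    SELECT DISTINCT [CustomerID], [City]
    FROM [tblOrders];

    -- The equivalent GROUP BY, which I have found marginally quicker:
    SELECT [CustomerID], [City]
    FROM [tblOrders]
    GROUP BY [CustomerID], [City];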


    • TheSmileyCoder
      Recognized Expert Moderator Top Contributor
      • Dec 2009
      • 2322

      #3
      Is there any possibility of removing the duplicate records? Another possibility is to use a smaller query (i.e. one on fewer fields) to determine uniqueness, and then join that query to your main query, along the lines of the sketch below.
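      Something like this, with purely hypothetical table and field names:

      Code:
      -- A narrow query, saved as qryUniqueKeys, that does the DISTINCT work
      -- on as few (ideally indexed) fields as possible:
      SELECT DISTINCT [CustomerID]
      FROM [tblOrders];

      -- The main query then joins to that to pick up the wide field list,
      -- so DISTINCT never has to compare the whole record:
      SELECT C.*
      FROM [tblCustomers] AS C INNER JOIN qryUniqueKeys AS Q
        ON C.[CustomerID] = Q.[CustomerID];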


      • NeoPa
        Recognized Expert Moderator MVP
        • Oct 2006
        • 32661

        #4
        Interesting question, Jesse. Personally, I would have expected a DISTINCT predicate to be implemented in such a way as to be more efficient than a GROUP BY, but I guess there may be complications with optimisation of the basic concept when many fields are involved, and particularly when non-indexed fields are involved. That isn't from testing, just the logic that DISTINCT has less work to do and less checking involved. Each implementation has its own way of doing things though, hence I expect you're finding DISTINCT slower than GROUP BY in your case.

        I'd be interested to know what determines, in your case, that two records are duplicates. Typically this is done by comparing the PKs, but it may be that you are comparing every field (essentially the full record), which of course is orders of magnitude less efficient.

        Smiley's idea is a good one going forward, but I'd make sure, if at all possible, that the compared field(s) are all included in a single index. How practical that is, only you, knowing your use of the data, can determine. This is one of the areas where db design experience comes in.
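        For example, if duplicates are determined by comparing [Field1] and [Field2] in a table [MyTable] (names purely illustrative), a single multi-field index can be built with a DDL query:

        Code:
        CREATE INDEX idxDupeCheck
        ON [MyTable] ([Field1], [Field2]);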


        • Jesse Jones
          New Member
          • Jan 2011
          • 51

          #5
          Here is my code:

          Code:
          SELECT [Call Log].[Client ID],
                 DCount("[Call ID]","[Call Log]",
                        "[Call Code]<>'NOM' AND [Client ID]=" & [Clients].[Client ID]) AS CountactsCount,
                 [Call Log].[Call Code],
                 Clients.[First Name], Clients.[Last Name], Clients.Sfx,
                 Clients.[Mailing Address], Clients.City, Clients.ST, Clients.Zip,
                 Clients.Phone, Clients.Type, Clients.[Time Zone], Clients.Gender,
                 DCount("Phone","Clients") AS countTotal
          FROM Clients LEFT JOIN [Call Log]
            ON Clients.[Client ID] = [Call Log].[Client ID]
          WHERE DCount("[Call ID]","[Call Log]",
                       "[Call Code]<>'NOM' AND [Client ID]=" & [Clients].[Client ID]) = 0;

          Here's the background: I have a table of clients and a table that serves as a call log. I want to find all of the people who have not given a reply to our inquiry. We do this by using the DCount shown above to count how many times in the call log a client has given us a response other than the no-response code. (In other words, how many times they have actually responded. If that count is >0, we don't need to call them again.)

          This works. The problem is that we get a record back for every call to that client. I only want the client to show once. DISTINCT works, but it takes quite a long time. Any thoughts would help.

          Thanks.


          • NeoPa
            Recognized Expert Moderator MVP
            • Oct 2006
            • 32661

            #6
            Jesse, if you're going to use a Domain Aggregate function within a query then you can't really expect anything in the way of performance. I'm only surprised it waited for you to add the DISTINCT predicate before slowing down. If you have a record source that needs to be included in the logic of your query then a JOIN of some form is a much more logical approach to use.
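            A sketch of that approach (the saved query name qryResponders is my own invention, and this is untested against your data):

            Code:
            -- qryResponders: every client who has given at least one real response.
            SELECT DISTINCT [Client ID]
            FROM [Call Log]
            WHERE [Call Code] <> 'NOM';

            -- Main query: a frustrated outer join keeps only the clients with no
            -- match in qryResponders, replacing both the row-by-row DCount() and
            -- the need for DISTINCT (each client can appear at most once).
            SELECT Clients.*
            FROM Clients LEFT JOIN qryResponders
              ON Clients.[Client ID] = qryResponders.[Client ID]
            WHERE qryResponders.[Client ID] Is Null;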

            Unfortunately, most RDBMSs run so fast that even SQL like this example performs acceptably until large numbers of records are involved. People often don't even consider their design structure until they reach that stage, by which time it is really too late. I'm not saying it's beyond repair, but that sort of approach does tend to waste a great deal of your time in the long run if you want to design a database that can handle the sort of numbers you're talking about.


            • Jesse Jones
              New Member
              • Jan 2011
              • 51

              #7
              Okay, thanks for the thought. I'm sorry, but I'm pretty unschooled in this. This database was my first and I am learning as I go. Unfortunately (from a technical standpoint), the demand for this information has grown to a degree that was incomprehensible a few months ago. Am I reading that right to understand that I'm pretty much in big trouble? The size of this is only going to grow. I am in the process of converting to SQL Server, but we won't get there for a few weeks at least, and I need this filter running in the next day or two.

              Any suggestions?


              • NeoPa
                Recognized Expert Moderator MVP
                • Oct 2006
                • 32661

                #8
                Originally posted by Jesse Jones:
                Am I reading that right to understand that I'm pretty much in big trouble?
                I would say yes.

                SQL Server certainly gives you more options, but I would stick with my earlier advice of designing a query to process all the inputs required. Domain Aggregate functions simply do not work efficiently within SQL.


                • Rabbit
                  Recognized Expert MVP
                  • Jan 2007
                  • 12517

                  #9
                  For the WHERE clause I would try a subquery, and for the count in the SELECT clause I would try a cross join if it's feasible; if not, I would use a subquery there as well. I would also make sure all the appropriate indexes have been created. Something along the lines of the sketch below.
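                  Using the table and field names from post #5 (untested, and trimmed to a few fields for clarity):

                  Code:
                  SELECT Clients.[Client ID], Clients.[First Name],
                         Clients.[Last Name], Clients.Phone, Totals.countTotal
                  FROM Clients,
                       (SELECT Count(Phone) AS countTotal
                        FROM Clients) AS Totals
                  WHERE Clients.[Client ID] NOT IN
                        (SELECT [Client ID]
                         FROM [Call Log]
                         WHERE [Call Code] <> 'NOM');

                  The comma in the FROM clause is the cross join: it attaches the single-row total to every client without a per-row DCount. Note that NOT IN can itself be slow on large tables, so NeoPa's outer-join version may still be the faster option for the filter itself.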

