delete duplicates after import

  • jroycestone
    New Member
    • Oct 2011
    • 6

    delete duplicates after import

    New member here, impressed with what I have searched so far...

    So every day, data will be imported into a table named for the month (i.e. 10_2011)... however, this data WILL contain duplicates which I need to get rid of. I am breaking the data up into daily imports into monthly tables because each day's data runs 500-700 rows, and I wanted to keep the tables a manageable size. I'm not married to this logic, so if someone has a better idea, please share!

    I have used snippets of code I found here, but I can't seem to make it work... Please, if someone could help me, that would be great. I'm not a board-certified programmer, but I do know enough to break stuff!

    Code:
    Private Sub cmdImport_Click()
        Dim strDay As String
        Dim strMonth As String
        Dim strFilePath As String
        Dim xlobj As Object
        Dim wsobj As Object
        Dim rngobj As Object
        Dim strDataRange As String

        strDay = Format(Now(), "mm" & "." & "dd" & "." & "yyyy")
        strMonth = Format(Now(), "mm" & "_" & "yyyy")

        strFilePath = "<hidden>" & strDay & ".xls"

        'Name the worksheet's used range so TransferSpreadsheet can import it
        Set xlobj = GetObject(strFilePath)
        Set wsobj = xlobj.Worksheets("Sheet1")
        Set rngobj = wsobj.UsedRange
        strDataRange = "DataRange"
        rngobj.Name = strDataRange

        DoCmd.SetWarnings False
        DoCmd.TransferSpreadsheet acImport, 8, strMonth, _
            strFilePath, True, strDataRange
        DoCmd.SetWarnings True

        Set xlobj = Nothing

    'routine for deleting dups created EVERY DAY when the Excel file is imported
        Dim db As DAO.Database
        Dim rs As DAO.Recordset
        Dim strAccount As String

        Set db = CurrentDb
        Set rs = db.OpenRecordset(strMonth)

    StartFile:
        rs.MoveFirst
        'Skip forward to the first record not yet processed (Temp = False)
        Do Until rs!Temp = False
            If Not rs.EOF Then
                rs.MoveNext
            Else
                GoTo EndFile
            End If
        Loop

        strAccount = rs!Account_No
        rs.Edit                 'DAO requires Edit/Update to change a field
        rs!Temp = True
        rs.Update
        rs.MoveNext

        'Delete every later record carrying the same account number
        Do Until rs.EOF
            If rs!Account_No = strAccount Then
                rs.Delete
            End If
            rs.MoveNext
        Loop

        GoTo StartFile
    EndFile:

        'remove temp field from table

        rs.Close
        Set rs = Nothing
        Set db = Nothing
    End Sub
  • patjones
    Recognized Expert Contributor
    • Jun 2007
    • 931

    #2
    In regard to your data structure, I probably would not keep so many tables around that hold exactly the same kind of data. Depending on how much information each record holds, Access can comfortably handle tens of thousands of records in a single table, or more. This is a digression from the point of your question, though.

    I want to start by asking what the Temp column holds, and what you are looking to accomplish by looping through the recordset until rs!Temp is False (the loop just after the import)?

    I see at least one issue further on in the code, but if you can answer this first then we can move on...

    Pat

    Comment

    • jroycestone
      New Member
      • Oct 2011
      • 6

      #3
      That was a code snippet that I "borrowed" from someone else... If there is a better procedure for looping through the data I would love to see it!

      I am very much a beginner, so I could totally rip out that part of the code. As you can see, the procedure runs the import and then is supposed to check for and delete the dups.

      Comment

      • patjones
        Recognized Expert Contributor
        • Jun 2007
        • 931

        #4
        Generally, a copy and paste of code from some other source isn't successful without some degree of modification to fit the code to your particular situation.

        So I go back to my previous questions: what is the purpose of the Temp column and why are you looping through the recordset looking for rs!Temp = False?

        I'm asking because it's not clear to me how that particular section of code is related to checking for duplicates in your table, so you need to be able to explain how it is relevant.

        Comment

        • NeoPa
          Recognized Expert Moderator MVP
          • Oct 2006
          • 32662

          #5
          My first advice would be to avoid the idea of multiple tables for similar data. You're not working with spreadsheets here. You're working with a database, and that requires some different thinking. A field that reflects the month will enable you to store all the data and report on it easily by month or otherwise. Filtering by month gives you the same result as multiple tables without all the restrictions and mess that go along with that approach.
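
          For example (using invented names - [tblJobs] and a [JobMonth] text field, neither of which comes from your database), reporting on a single month is just a filter :

          ```sql
          SELECT *
          FROM tblJobs
          WHERE JobMonth = "10_2011";
          ```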

          As far as handling duplicates is concerned I would decide first what you understand by a duplicate. That's an important first step, without which all the rest of it is fairly meaningless.

          With that new understanding in mind, I would proceed on the following basis :
          1. Import the day's data into a special import table.
          2. Remove any duplicates from the new data (If this involves comparing with previous days' data from the same month then so be it).
          3. Report on or log any removals or updates if that fits your requirements.
          4. Append the newly imported and tidied data into the main table, ensuring the month field is set appropriately.
          5. Clear down the temporary data from the daily import table for the next run.
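
          As a very rough sketch only (the names tblImport, tblMain and JobMonth, the saved query qryDeDupe, and the file path are all invented for illustration - adjust to your own database), that routine might look like :

          ```vba
          Private Sub cmdDailyImport_Click()
              '1. Import the day's spreadsheet into a staging (import) table
              DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel8, _
                  "tblImport", "C:\Data\" & Format(Date, "mm.dd.yyyy") & ".xls", True

              '2. Remove duplicates from the new data
              '   (qryDeDupe is a saved Delete query implementing YOUR definition of a duplicate)
              CurrentDb.Execute "qryDeDupe", dbFailOnError

              '3. Stamp the month on the new rows, then append them to the main table
              CurrentDb.Execute "UPDATE tblImport " & _
                  "SET JobMonth = '" & Format(Date, "mm_yyyy") & "'", dbFailOnError
              CurrentDb.Execute "INSERT INTO tblMain SELECT * FROM tblImport", dbFailOnError

              '4. Clear down the staging table ready for the next run
              CurrentDb.Execute "DELETE FROM tblImport", dbFailOnError
          End Sub
          ```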

          Comment

          • NeoPa
            Recognized Expert Moderator MVP
            • Oct 2006
            • 32662

            #6
            Another point to understand, and so avoid in future - it's surprisingly common in less experienced coders: it is never necessary to concatenate literal string values together for the benefit of the code. Take your line #10 :
            Code:
            strDay = Format(Now(), "mm" & "." & "dd" & "." & "yyyy")
            There is absolutely no reason why this should not be written much more simply as :
            Code:
            strDay = Format(Date(), "mm.dd.yyyy")
            Of course there are many reasons why it should be done that way.

            NB. Using Now() instead of Date() for date values is another mistake liable to cause confusion (and well worth avoiding).

            Comment

            • jroycestone
              New Member
              • Oct 2011
              • 6

              #7
              The thing that is hard for me to wrap my head around is using 1 table for ALL the daily imports forever more... We're talking over 500 rows of data per day. That would be over 10,000 rows per month, meaning over 120,000 rows per year! That seems like it would be way too big for Access to handle. Am I wrong in thinking that?

              So the daily order of operations as you see it would be: import data into a "temp" table, compare the temp data to the current "master" data table, make duplicate corrections and any data cleanup, append data from the temp table to the master table, and clear data from the temp table to be used the next day. Correct?

              Comment

              • NeoPa
                Recognized Expert Moderator MVP
                • Oct 2006
                • 32662

                #8
                Originally posted by jroycestone
                The thing that is hard for me to wrap my head around is using 1 table for ALL the daily imports forever more... We're talking over 500 rows of data per day. That would be over 10,000 rows per month, meaning over 120,000 rows per year! That seems like it would be way too big for Access to handle. Am I wrong in thinking that?
                Yes. Quite wrong. Certainly there are limits to the capacity of an Access database, but these limits are not avoided by managing a set amount of data in multiple tables. Quite the contrary, as each table has its own overheads. You will be able to manage more data in a single table than in multiple ones.

                That said, if the data is vast and is never cleared down (increasing forever) it may well be wise to consider another database engine within which to store your data. It can still be accessed from an Access front-end database, but the size limitations would be less constraining. Something for you to consider.

                Originally posted by jroycestone
                So the daily order of operations as you see it would be: import data into a "temp" table, compare the temp data to the current "master" data table, make duplicate corrections and any data cleanup, append data from the temp table to the master table, and clear data from the temp table to be used the next day. Correct?
                If you reread what I posted you will see I didn't say the highlighted bit as you have it. That may be the approach, but that depends on your definition of duplicate data. It may well be that you only need to check for duplicates within the new Temp data. Otherwise it sounds like you have the idea pretty well.
                Last edited by NeoPa; Oct 5 '11, 03:07 PM. Reason: Typo

                Comment

                • patjones
                  Recognized Expert Contributor
                  • Jun 2007
                  • 931

                  #9
                  There is some dependence on what kind of data the table is going to hold, but it's not inconceivable that an Access table can store in excess of 100,000 records. I have one database that is currently in excess of 30,000 unique records, and there really hasn't been a performance hit so far.

                  If you have the ability to put your data on a server and link your Access front end to it, you can go into millions of records.

                  At some point you will probably want to consider archiving the data somewhere else; but for the sake of proper database design you should use one table and add a column for month.

                  The overall process as you have explained it is pretty much how you would want to do things. I recently completed an application that lets the user pick a spreadsheet with a file dialog, imports the selected file into a temporary table, removes duplicates, and commits the clean data to the permanent table... all with a couple of button clicks. The import is a little slow on account of the spreadsheets containing anywhere from 10,000 records up, but it works.

                  Comment

                  • jroycestone
                    New Member
                    • Oct 2011
                    • 6

                    #10
                    So here is my "tentative" plan: have one table to store all the data, and archive quarterly to keep the data somewhat limited.

                    Now, the duplicate compare is needed as I described it. I'll explain a little more. The spreadsheet that gets imported daily has scheduled jobs for today and into the future. So every day's import will contain jobs for dates beyond today, and those jobs will continue to appear in the spreadsheet every day until they are out of our queue. Hopefully that explains how the data carry over from day to day.

                    Comment

                    • NeoPa
                      Recognized Expert Moderator MVP
                      • Oct 2006
                      • 32662

                      #11
                      That still leaves out much of the information necessary to develop logic that reflects your requirements. Let's see what we can do to be getting on with anyway, though.

                      Your post would indicate to me that you need to ensure that any job in the [Temp] table doesn't clash with any other job in the same table, as well as any job already existing in the main table. How do you recognise which of the duplicates is required? It seems you may be in a position where a job imported today has more up-to-date information with it than the same job imported on a previous day. In that case you'd need to remove the record from the main table. Somewhat different from simply tidying up the data in the [Temp] table, but can still be done.

                      Comment

                      • jroycestone
                        New Member
                        • Oct 2011
                        • 6

                        #12
                        jeeez, you must be experienced with data, as I didn't think of that either!

                        Yes, I would need the latest "version" of the row of data to overwrite the old version. There would be no possible way for there to be duplicates within the [Temp] table; they only arise when comparing data from the [Temp] table to the [Master] table.

                        There are multiple fields that need to be checked to make sure the data are truly duplicate. Account_NO and OrderDate are the two main fields.

                        Comment

                        • NeoPa
                          Recognized Expert Moderator MVP
                          • Oct 2006
                          • 32662

                          #13
                          So, the [Temp] table never needs to be checked for internal consistency, but it may contain new records as well as updates for existing ones.

                          Jet SQL has no facility for ...

                          WOW!! I just learned something new and very surprising when I researched this. It turns out Jet SQL can and does handle updating and appending in the same UPDATE query. That makes the suggestion somewhat simpler :

                          Create an UPDATE query that specifies all fields (except any AutoNumbers - That will need to be left to the Automatic process and will be unlikely to match any previous data). A link I found on how to do this is ACC2000: How to Update and Append Records in One Update Query.
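
                          As a sketch of what that query looks like (Account_NO and OrderDate are your key fields from post #12; JobStatus is a placeholder standing in for whichever other fields your table actually has) :

                          ```sql
                          UPDATE [Temp] LEFT JOIN [Master]
                              ON ([Temp].Account_NO = [Master].Account_NO)
                             AND ([Temp].OrderDate = [Master].OrderDate)
                          SET [Master].Account_NO = [Temp].Account_NO,
                              [Master].OrderDate = [Temp].OrderDate,
                              [Master].JobStatus = [Temp].JobStatus;
                          ```

                          Because of the LEFT JOIN, rows in [Temp] with no match in [Master] create new rows there, while matched rows are simply overwritten with the latest values.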

                          Comment

                          • jroycestone
                            New Member
                            • Oct 2011
                            • 6

                            #14
                            Awesome... that seems to work perfectly! Thank you so much for your help NeoPa!!!!!!

                            Comment

                            • NeoPa
                              Recognized Expert Moderator MVP
                              • Oct 2006
                              • 32662

                              #15
                              Thank you for the question. I've always worked on the assumption that this facility was non-existent. It's a pretty handy thing to know about and I'm very pleased to have discovered it :-)

                              Comment
