Importance of primary keys in UNION

  • Killer42
    Recognized Expert Expert
    • Oct 2006
    • 8429

    Importance of primary keys in UNION

    Hi all.

    As has been mentioned here recently, I have a front-end database which just includes some queries and a bunch of (rather large) linked tables. For reasons of (very poor) performance, I had to do quite a bit of fiddling around to get my VBA code to build queries which would query each table individually, then UNION the results. Doing a single UNION query produced performance I probably could have bettered with a pencil and paper.
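
    For illustration, the per-table approach I ended up with looks something like this (the table and field names here are invented for the example, not my real ones):
    Code:
    SELECT [AcctNo], [TxnDate], [Amount]
    FROM [tblData2005]
    WHERE [AcctNo] = 'X1234'
    UNION ALL
    SELECT [AcctNo], [TxnDate], [Amount]
    FROM [tblData2006]
    WHERE [AcctNo] = 'X1234'
    Putting the WHERE inside each SELECT lets Jet use each table's own indexes before the rows are combined, and UNION ALL avoids the duplicate-removal sort that a plain UNION forces.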

    Something mentioned in this forum (by either NeoPa or mmccarthy, I think) inspired me to do some experimentation (which I had previously been putting off due to time pressure). This has led to the conclusion that the performance of a UNION query is hugely improved if the tables involved have primary keys. This may not come as a surprise to you, but I just hadn’t had time to check it out.

    My big tables don’t have a primary key because (a) there isn’t room (they’re right up against the 2GB limit) and (b) it wasn’t necessary when working with one table at a time – performance was quite acceptable. I’ve only just moved to the split design to try and consolidate everything into a single interface.

    What I’d like to know is, have I stuffed up my testing, or am I really likely to see an enormous speed improvement when doing simple queries against a bunch of “unioned” tables by adding a primary key to the tables? (The primary key that I added was totally unrelated to my test queries or to the real ones I’ll be using.)
  • PEB
    Recognized Expert Top Contributor
    • Aug 2006
    • 1418

    #2
    The PK is very important, and take care to use only a Long Integer PK! Index the fields you search on!

    It's very fast afterwards!

    :)

    No PK - no hope for fast performance!

    Comment

    • MMcCarthy
      Recognized Expert MVP
      • Aug 2006
      • 14387

      #3
      Originally posted by Killer42

      What I’d like to know is, have I stuffed up my testing, or am I really likely to see an enormous speed improvement when doing simple queries against a bunch of “unioned” tables by adding a primary key to the tables? (The primary key that I added was totally unrelated to my test queries or to the real ones I’ll be using.)
      You'll definitely see an improvement, as PEB says. In cases where it's just not possible to add a primary key, try to index at least one field in the table. Even an indexed field with duplicates should give some improvement.

      Comment

      • NeoPa
        Recognized Expert Moderator MVP
        • Oct 2006
        • 32633

        #4
        I'm open to correction here but I'm quite happy this is along the right tracks.
        Firstly, FYI, the size limit of Access 2K and beyond is 4GB. 97 was the last one limited to 2GB.
        Beware though, a Unicode feature was installed for text fields which can double your text data unless you use 'Unicode Compression', which should be safe for most databases in the English speaking world.
        Indices - what a subject!
        When an RDBMS (Access for instance) processes a recordset, it will look for an index that will help the processing it's currently on. So, if you run
        Code:
        SELECT [CName], [Surname], [DoB]
        FROM [tblStaff]
        ORDER BY [Surname]
        then having an index based on [Surname] will enable that to be used.
        If there is no index that will help, it will have to sort the data first before use.
        This is where the small PK comes in
        Originally posted by PEB
        The PK is very important, and take care to use only a Long Integer PK! Index the fields you search on!
        If you can't provide keys for all the various queries that use the table, then those that don't have keys will sort, and here's the nub, if there is a PK then it will sort via that - if no PK at all, it needs to create a temp table and sort the whole dataset :(.

        Having said that, my recommendation would be to use the PK for your main unique index - though there is clearly something to be said for PEB's approach, especially where you need to process the recordset from many different directions in different queries etc - as it were.
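
        For completeness, an index to support the ORDER BY in my example could be created like this (same example table as above):
        Code:
        CREATE INDEX idxSurname ON [tblStaff] ([Surname])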

        Comment

        • Killer42
          Recognized Expert Expert
          • Oct 2006
          • 8429

          #5
          Hi all.

          Don't have time to read and respond to everything just yet, but I wanted to make a couple of points.
          • I'm running Access 2003, but my databases are in Access 2000 format in case I need to work on them at home.
          • I have run into the 2GB limit, plenty of times. The last time was probably less than a fortnight ago. If I remember correctly, you get an "Invalid command" message, or something similar. It doesn't give any reasonable indication as to what's wrong.
          • All my search fields are indexed (allowing duplicates)
          • Performance is fine as long as I'm dealing with each 2GB database individually. They're around 12-15 million records each, and searching is nice and quick (couple of seconds). Once I UNION a couple of them, and do a simple search on a field which is indexed in both, it takes hours.
          • The "PK" I added was not integer, and had nothing directly to do with the search. What makes the difference appears to be simply whether there is a PK.

          NeoPa, I'll read your lengthy message some time today (it's morning here now).

          Comment

          • Killer42
            Recognized Expert Expert
            • Oct 2006
            • 8429

            #6
            Originally posted by NeoPa
            I'm open to correction here but I'm quite happy this is along the right tracks.
            Firstly, FYI, the size limit of Access 2K and beyond is 4GB. 97 was the last one limited to 2GB.
            Beware though, a Unicode feature was installed for text fields which can double your text data unless you use 'Unicode Compression', which should be safe for most databases in the English speaking world.
            Indices - what a subject!
            When an RDBMS (Access for instance) processes a recordset, it will look for an index that will help the processing it's currently on. So, if you run
            Code:
            SELECT [CName], [Surname], [DoB]
            FROM [tblStaff]
            ORDER BY [Surname]
            then having an index based on [Surname] will enable that to be used.
            If there is no index that will help, it will have to sort the data first before use.
            This is where the small PK comes in

            If you can't provide keys for all the various queries that use the table, then those that don't have keys will sort, and here's the nub, if there is a PK then it will sort via that - if no PK at all, it needs to create a temp table and sort the whole dataset :(.

            Having said that, my recommendation would be to use the PK for your main unique index - though there is clearly something to be said for PEB's approach, especially where you need to process the recordset from many different directions in different queries etc - as it were.
            Phew! Finally found the time to read all of this. Here's my response...
            • All of the fields I'm likely to want to search on are indexed (none unique). The indexes are exactly the same in all the databases. In fact, apart from the data, the DBs are identical.
            • I don't have any unique field on the records. Used to have a PK called ID (automatically added by Access, I think) but removed it to save space.
            • As I said, performance was wonderful on all (alright, most) queries until I started "unioning" the databases.
            • Some of the databases had the records in a different sequence to what I really prefer, so the new PK I'm adding is made up of a couple of fields and returns them in the preferred order.
            • It is obviously not doing any major amount of sorting now that my sample database has a PK, as the response is back to a second or two, and I’m searching (and returning) millions of records.
            • As for creating a temp table and so on, I can believe it may have been doing that when I didn’t have the PK, based on the ridiculous length of time and amount of disk activity.
            • Anyone know why Access can’t make use of the keys on the unioned tables? It’s a very simple structure, and query.

            I’ll go ahead and add a PK to all my tables (which will involve splitting each into two parts, since they are up against the size limit). Should be able to tell you in a few days whether it worked. Not much to do on my part, but dealing with these volumes, Access tends to chug along all day in the background. :) I've had a delete query running for 3-4 hours now, and it's probably around 70-80% finished.
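
             For the record, the sort of key I'm adding amounts to this (field names invented for the example):
             Code:
             ALTER TABLE [tblData2005]
             ADD CONSTRAINT PrimaryKey PRIMARY KEY ([AcctNo], [TxnDate])
             A two-field key like that also returns the records in my preferred order, though the extra index costs space, which matters when the file is already near the 2GB limit.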

            Comment

            • MMcCarthy
              Recognized Expert MVP
              • Aug 2006
              • 14387

              #7
              Originally posted by Killer42
              • Anyone know why Access can’t make use of the keys on the unioned tables? It’s a very simple structure, and query.
              Essentially, Access treats each index correctly when it's on an individual table or query.

              The minute you perform a UNION, Access can't reconcile the indexes in the various sets and pretty much ignores them.

              One solution to your problem which I have used in the past is to create a VBA routine to take in each of the tables from the various databases and to create a new recordset based on a new table structure and run a DoCmd.RunSQL INSERT query to append each record. The new table would of course have an autonumber primary key.

              It works much faster than the UNION query, and your only problem is to work out how to dynamically program it to move through the ODBC and recordset connections. I used a table with each database path and table name as field values.
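
              As a sketch, the append step for each source database amounts to something like this (the path and names are invented for the example):
              Code:
              INSERT INTO [tblAllYears]
              SELECT *
              FROM [tblData] IN 'C:\Data\Year2005.mdb'
              The VBA routine just loops through the path/table list and builds and runs one of these per source database.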

              Any questions on this approach let me know.

              Mary

              Comment

              • MMcCarthy
                Recognized Expert MVP
                • Aug 2006
                • 14387

                #8
                BTW

                I did this using DAO, not ADO, which gives problems when changing connections during a process.

                Comment

                • Killer42
                  Recognized Expert Expert
                  • Oct 2006
                  • 8429

                  #9
                  Originally posted by mmccarthy
                  Essentially, Access treats each index correctly when it's on an individual table or query.

                  The minute you perform a UNION, Access can't reconcile the indexes in the various sets and pretty much ignores them.

                  One solution to your problem which I have used in the past is to create a VBA routine to take in each of the tables from the various databases and to create a new recordset based on a new table structure and run a DoCmd.RunSQL INSERT query to append each record. The new table would of course have an autonumber primary key.

                  It works much faster than the UNION query, and your only problem is to work out how to dynamically program it to move through the ODBC and recordset connections. I used a table with each database path and table name as field values.

                  Any questions on this approach let me know.

                  Mary
                  Um...

                  As far as I can see, my main problem would still be the same - the data is too big to fit in one table. So how does this help?

                  Anyway, I don't really need an alternative method that badly, now. I have the single table from each of the databases linked to the front-end. It comes down to two choices. If a PK provides acceptable performance, I'll go back to using a UNION query which connects up the data for all of the years (it's one year per database). If not, I'll just have to union the results of the queries on each database. I'm doing that now, but don't like it.

                  Comment

                  • MMcCarthy
                    Recognized Expert MVP
                    • Aug 2006
                    • 14387

                    #10
                    Originally posted by Killer42
                    Um...

                    As far as I can see, my main problem would still be the same - the data is too big to fit in one table. So how does this help?
                    Using this method you never actually link the tables or run any queries, so the only thing you have is the new table. How many records are we talking about here? I've done it with a couple of million.

                    Comment

                    • Killer42
                      Recognized Expert Expert
                      • Oct 2006
                      • 8429

                      #11
                      Originally posted by mmccarthy
                      Using this method you never actually link the tables or run any queries, so the only thing you have is the new table. How many records are we talking about here? I've done it with a couple of million.
                      I'm afraid you've lost me completely as to what "this method" is.

                      I have a separate database for each year's data, with typically 12 to 15 million records, each database approaching 2GB in size (lucky coincidence). Each consists of a single table. They have identical structure and indexes. It's simply the same database copied, cleaned out and repopulated for each year.

                      I'd like to have a front-end database which just includes a union query to pull them all together, then play with that. It's not all that important - I can just continue to use them separately, but it's sometimes a little inconvenient.
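
                       Roughly, the union query I have in mind is this (paths invented for the example):
                       Code:
                       SELECT * FROM [tblData] IN 'C:\Data\Year2005.mdb'
                       UNION ALL
                       SELECT * FROM [tblData] IN 'C:\Data\Year2006.mdb'
                       The IN clause points each SELECT at an external back-end, so the front-end doesn't even need linked tables for this.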

                      Comment

                      • NeoPa
                        Recognized Expert Moderator MVP
                        • Oct 2006
                        • 32633

                        #12
                        I posted this elsewhere, but it really should have come in here.
                        Originally posted by NeoPa
                        Killer,

                        You should seriously consider porting your back end database to SQL Server 2005.
                        I believe the free version is now much more flexible than the 2K MSDE.
                        It will handle your databases (/tables) without strain.
                        Unfortunately, the SQL Server forums are not as active as these here in the Access section (No MM of course).

                        PS. Sorry about the 2GB limit - I thought it had gone from 2GB to 4GB. It must have been from 1GB to 2GB.

                        Comment

                        • MMcCarthy
                          Recognized Expert MVP
                          • Aug 2006
                          • 14387

                          #13
                          Originally posted by Killer42
                          I'm afraid you've lost me completely as to what "this method" is.

                          I have a separate database for each year's data, with typically 12 to 15 million records, each database approaching 2GB in size (lucky coincidence). Each consists of a single table. They have identical structure and indexes. It's simply the same database copied, cleaned out and repopulated for each year.

                          I'd like to have a front-end database which just includes a union query to pull them all together, then play with that. It's not all that important - I can just continue to use them separately, but it's sometimes a little inconvenient.
                          Sorry Killer

                          I didn't realise how many records you were talking about.

                          Comment

                          • Killer42
                            Recognized Expert Expert
                            • Oct 2006
                            • 8429

                            #14
                            Originally posted by mmccarthy
                            I didn't realise how many records you were talking about.
                            Just for the record, the error message when Access tries to go beyond 2GB was "Invalid argument". Why do I bring this up now? I'll give you one guess... :(

                            Comment

                            • NeoPa
                              Recognized Expert Moderator MVP
                              • Oct 2006
                              • 32633

                              #15
                               I don't think Access checks whether or not it's reached its ceiling - it just crashes (apparently randomly) wherever its internal code fails :(.

                              Comment
