Grouping Query Range

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • NeoPa
    Recognized Expert Moderator MVP
    • Oct 2006
    • 32662

    #16
    Originally posted by Stewart Ross Inverness
    Hi. The most flexible way to do what you ask is as Fish indicated - use a separate band table. That way you can use bands with unequal boundaries (as illustrated in the attached zipped database).

    The banding table is added to the base data without using INNER, LEFT or RIGHT JOINs; instead, the rows of the banding table are selected where the value being compared is between the band minimum and maximum.
    ...
    Nice idea Stewart.

    I'm curious as to why you wouldn't use an INNER JOIN to link the tables though?

    I would expect that to work better than a simple WHERE clause.

    Comment

    • Stewart Ross
      Recognized Expert Moderator Specialist
      • Feb 2008
      • 2545

      #17
      Hi NeoPa. Reason for no joins is that there are no common fields involved. The comparison is of a single discrete value in one table against a minimum to maximum range of values in the banding table, not key against key. It is the principle of the histogram, where discrete values of a population are grouped over specified intervals into separate but contiguous bands, with counts for each band subsequently displayed in the typical column chart form.

      This kind of banding application is one of those rare occasions when such a Cartesian join between tables is essential, with the inherent multiplication of rows stopped by the banding range always matching just one of the rows in the banding table.

      The reason a band number is included is to be able to order the bands correctly if required - and to be able to left-join the banding table back to the results later if necessary (because, depending on the range of the data, the result bandings are likely to be non-contiguous, and left-joining the bandings back will allow inclusion of all bands, even if there are no results for particular bands).

      Hope this helps to explain the approach.

      -Stewart

      Comment

      • FishVal
        Recognized Expert Specialist
        • Jun 2007
        • 2656

        #18
        I agree with NeoPa. I would prefer to JOIN tables though with only two tables involved it seems not to have a great advantage.

        Regards,
        Fish

        Comment

        • Stewart Ross
          Recognized Expert Moderator Specialist
          • Feb 2008
          • 2545

          #19
          I agree entirely, Fish. When working with relational tables there are always joins involved, and in all of my applications I define the joins and set relational constraints as a matter of routine. However, the banding application is one of these rare occasions when this is simply not possible - in set algebra terms the sets involved are disjoint and cannot be equijoined or outer joined.

          When I analyse several years worth of student data, involving around 30,000 rows, the time penalties and delays are in the subsequent multiple-field grouping and summations not in the formation of a banded base table involving joining of the 30,000 rows to a small table of around 8 rows using a where clause to restrict the rows joined.

          -Stewart

          Comment

          • FishVal
            Recognized Expert Specialist
            • Jun 2007
            • 2656

            #20
            Well. As I could recall, JOIN
            [code=sql]
            ... FROM t1 ... JOIN t2 ON blnExpr
            [/code]
            is a later SQL syntax improvement for cartesian product filtering
            [code=sql]
            ... FROM t1, t2 WHERE blnExpr
            [/code]
            The only difference known to me is that Expr from the first snippet must be evaluated on fields from both tables being joined.
            All the rest seems to be allowed - any kind and combination of arithmetic and logic operations, VBA calls etc. So long it turns into boolean.

            Advantage of JOIN syntax is that it allows outer joins and explicitely define the order of joining.

            In context of OP's question both syntaxes performs more or less equally.

            Regards,
            Fish

            P.S. Did you mean something else?
            Last edited by FishVal; May 28 '08, 09:08 PM. Reason: PS added

            Comment

            • Stewart Ross
              Recognized Expert Moderator Specialist
              • Feb 2008
              • 2545

              #21
              Hi Fish, and thanks for a very thoughtful reply. What I was referring to is that for the most part Cartesian products are avoided, and seen as 'a bad thing', because of the row multiplication involved, but there are occasions when no other approach will do. The lack of an explicit join could be detrimental in some circumstances, but as your reply clarifies the JOIN and WHERE approaches have similar roots and will perform similarly given the small scale of the banding table involved.

              Where I am coming from is that the use of a Cartesian product in this case is quite deliberate and intentional, as it is dictated by the disjoint nature of the datasets - a very different case to the accidental creation of a cross-multipled table for related data by neglecting to join two tables correctly.

              One other difference in the JOIN and WHERE approaches is that with explicit JOINs the database engine query optimisers can avoid projecting the Cartesian product to begin with, thereby operating more efficiently than applying a WHERE clause after the event might do - depending on how efficient the underlying operations and the optimising algorithms really are (which we can only infer from overall performance, as the optimisations are not visible or alterable by us).

              Thank you again for your thoughtful reply which provides further explanation of the JOIN vs WHERE approaches. Really interesting, even if we are drifting a bit from the original thread!

              -Stewart

              Comment

              • FishVal
                Recognized Expert Specialist
                • Jun 2007
                • 2656

                #22
                Originally posted by Stewart Ross Inverness
                .....
                Where I am coming from is that the use of a Cartesian product in this case is quite deliberate and intentional, as it is dictated by the disjoint nature of the datasets - a very different case to the accidental creation of a cross-multipled table for related data by neglecting to join two tables correctly.
                .....
                Sorry, Stewart.

                What "disjoint nature" is all about?
                Just couldn't understand why it does "dictate". ;)

                Regards,
                Fish

                Comment

                • Stewart Ross
                  Recognized Expert Moderator Specialist
                  • Feb 2008
                  • 2545

                  #23
                  Hi Fish. Disjoint is a term from set theory and just means that there are no elements in common. Set theory underlies the relational model developed by EF Codd in the 60s, updated by him since and (imperfectly) implemented in the various standards for SQL ever since.

                  When there is a relation between tables T1 and T2 the underlying sets have at least one element of overlap. There is no such overlap in the case of a range table (where each row represents a paired minimum to maximum range), and the data which contains discrete values (not min and max tuples). The sets are disjoint, therefore no common fields, so no way to join them on a relationship between fields.

                  To illustrate: age part of student data set:

                  {16, 19, 18, 18, 15, 64, 20,...,32}

                  Part of range set:

                  {{14, 15}, {16, 17}, {18, 21},...{61, 100}}

                  The sets are disjoint, so no value-based join is possible. If the product of the set of N values with a range set of M values was created there would be N x M rows in the result - the Cartesian product. But there is a functional relationship between ages and age ranges, which is that the discrete value will match one of the tuples where its value is between the values of the tuple elements - so although no 1-1 join is possible the two datasets of N x M rows can still be combined to give just N rows for the result by using the WHERE clause to find the correct range.

                  There is a more formal treatment of disjoint sets at Wikipedia - linked here.

                  I hope this helps.

                  -Stewart
                  Originally posted by FishVal
                  Sorry, Stewart.

                  What "disjoint nature" is all about?
                  Just couldn't understand why it does "dictate". ;)

                  Regards,
                  Fish

                  Comment

                  • FishVal
                    Recognized Expert Specialist
                    • Jun 2007
                    • 2656

                    #24
                    Ok, Stewart.

                    You have not convinced me though. What you are talking about is table PK/FK relations based on key equalness. Table join, on the other hand, is a tool to combine records on some criteria regardless of PK/FK relationships. Sure it is commonly used to get denormalized view of normalized tables, but the area of table join application is definitely not limited by this.

                    Expressions like
                    [code=sql]
                    SELECT * FROM t1, t2 WHERE t1.f1>t2.UpperB ound AND t1.f1<t2.LowerB ound;
                    [/code]
                    and
                    [code=sql]
                    SELECT * FROM t1 INNER JOIN t2 ON t1.f1>t2.UpperB ound AND t1.f1<t2.LowerB ound;
                    [/code]
                    are equivalents and both working.
                    The advantage of JOIN syntax is, as I've stated already, explicit join order and possibility to perform outer (LEFT, RIGHT) joins.
                    BTW, earlier, outer joins had a special syntax:
                    [code=sql]
                    ... FROM t1, t2 WHERE t1.f1 *= t2.f2
                    [/code]
                    was used to make left join
                    and
                    [code=sql]
                    ... FROM t1, t2 WHERE t1.f1 =* t2.f2
                    [/code]
                    for right join.
                    In some RDBMS this syntax is accepted to provide backward compatibility but Jet SQL doesn't support it.

                    Regards,
                    Fish

                    Comment

                    • Stewart Ross
                      Recognized Expert Moderator Specialist
                      • Feb 2008
                      • 2545

                      #25
                      Thank you, Fish, for another very interesting reply. I live and learn - I have not myself used table joins based on inequality criteria as you have shown. You are quite right that I am looking at the relationships from a PK/FK perspective. In those terms, there is a fundamental difference in the joining of the sets involved. In a PK/FK relationship all FKs are a subset of the PK set; in a table relationship where a table join as you have shown is involved one set is not a subset of the other. Different views of the same situation - and both ways valid. Really interesting!

                      -Stewart

                      Comment

                      • Stewart Ross
                        Recognized Expert Moderator Specialist
                        • Feb 2008
                        • 2545

                        #26
                        Update on post 25: In the query editor SQL view Access accepts the table join version of the SQL just as Fish has said, but the query editor cannot then show the query in design view. The SQL table join statement does indeed work equivalently otherwise.

                        Code:
                        SELECT [Age Histogram Base].Type, [Age Histogram Base].Age, [Age Histogram Base].Gender, [Age Band].[Band], [Age Band].[Band Label]
                        FROM [Age Histogram Base] INNER JOIN [Age Band]
                        ON [Age Histogram Base].Age >= [Age Band].[Min] And [Age Histogram Base].Age <= [Age Band].[Max];
                        Timing update
                        Running a quick test of timing differences on an enlarged set of data (221k rows) the summary crosstab based on the WHERE version of the banded data took 6 seconds to run. The same crosstab based on the table-join version took just under 30 seconds to run.

                        On the simple Count of the banded rows both versions ran in about 6 seconds.

                        -Stewart
                        Last edited by Stewart Ross; May 29 '08, 09:14 AM. Reason: added timing data

                        Comment

                        • NeoPa
                          Recognized Expert Moderator MVP
                          • Oct 2006
                          • 32662

                          #27
                          Originally posted by NeoPa
                          Nice idea Stewart.

                          I'm curious as to why you wouldn't use an INNER JOIN to link the tables though?

                          I would expect that to work better than a simple WHERE clause.
                          You boys seem to have sorted all this without me (You're just too quick :D).

                          FYI Fish correctly interpreted exactly the point I was raising.

                          I would expect, conceptually, that the INNER JOIN should work more efficiently, although I accept that this may well be negligible for small amounts of locally held data. It may prove more important if linking to remote (Non-Access) tables though.

                          Comment

                          Working...