Challenge: Can you optimize this?

  • octangle

    Challenge: Can you optimize this?

    This code is attempting to find records that have a RegJrnID that does
    not occur more than one time in the table.

    The reason that I want to find records with non-duplicated RegJrnID
    values is to create "reversal" records for these such that the reversal
    record has identical values for every column except the TaxableAmount
    which will contain a negative amount. (see: example data below).

    /* Set up */

    CREATE TABLE t1(RegJrnID INTEGER, InvoiceDate VARCHAR(8), InvoiceNumber
    VARCHAR(20), TaxableAmount DECIMAL(32,8))

    /* Example data */

    INSERT INTO t1 VALUES (1, '20060101', '2321323', 100.00)
    INSERT INTO t1 VALUES (9, '20060213', '2130009', 40.01)
    INSERT INTO t1 VALUES (3, '20060101', '9402293', 512.44)
    INSERT INTO t1 VALUES (1, '20060104', '2321323', -100.00)
    INSERT INTO t1 VALUES (4, '20060105', '9302221', 612.12)
    INSERT INTO t1 VALUES (5, '20060105', '0003235', 18.11)
    INSERT INTO t1 VALUES (6, '20060111', '5953432', 2101.21)
    INSERT INTO t1 VALUES (3, '20060111', '9402293', -512.44)
    INSERT INTO t1 VALUES (7, '20060115', '4234444', 44.52)
    INSERT INTO t1 VALUES (8, '20060115', '0342222', 95.21)
    INSERT INTO t1 VALUES (6, '20060119', '5953432', -2101.21)
    INSERT INTO t1 VALUES (2, '20060101', '5440033', 231.01)

    /* Show what's in the table - just because */

    SELECT * FROM t1 ORDER BY RegJrnID, InvoiceDate

    /* Query for records to reverse */

    SELECT *
    FROM t1 a

    /* Ignore records that have already been reversed */

    WHERE a.RegJrnID != ALL

    /* This subselect finds reversed records (i.e. those that have a
    duplicate RegJrnID) */

    (
    SELECT b.RegJrnID
    FROM t1 b
    GROUP BY b.RegJrnID
    HAVING COUNT(*) > 1
    )

    /* User selection criteria are appended here */

    /* AND InvoiceNumber >= '5000000' AND InvoiceNumber <= '7500000' */

    /* Make the results look pretty (optional) */

    ORDER BY RegJrnID
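
    For reference, here is a hypothetical sketch of what the reversal insert
    itself might look like (an assumption on my part: every column except
    TaxableAmount is simply copied, although the example data suggests the
    reversal may carry a later InvoiceDate; the row selection reuses the
    query above):

    /* Hypothetical reversal insert (sketch only) */

    INSERT INTO t1 (RegJrnID, InvoiceDate, InvoiceNumber, TaxableAmount)
    SELECT a.RegJrnID, a.InvoiceDate, a.InvoiceNumber, -a.TaxableAmount
    FROM t1 a
    WHERE a.RegJrnID != ALL
    (
    SELECT b.RegJrnID
    FROM t1 b
    GROUP BY b.RegJrnID
    HAVING COUNT(*) > 1
    )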

    /* Housekeeping */

    DROP TABLE t1

  • Alexander Kuznetsov

    #2
    Re: Challenge: Can you optimize this?

    There are many ways to accomplish that. I would start with something
    like this (untested):

    select pos.RegJrnID
    from
    (
    select * from t1 where TaxableAmount > 0
    ) pos
    left outer join
    (
    select * from t1 where TaxableAmount < 0
    ) neg
    on pos.RegJrnID = neg.RegJrnID
    where neg.RegJrnID is null


    • octangle

      #3
      Re: Challenge: Can you optimize this?

      Here's the tested (and slightly modified) version of your code...

      SELECT pos.*

      FROM
      (
      SELECT * FROM t1 WHERE TaxableAmount > 0
      ) pos
      LEFT OUTER JOIN
      (
      SELECT * FROM t1 WHERE TaxableAmount < 0
      ) neg
      ON pos.RegJrnID = neg.RegJrnID

      WHERE neg.RegJrnID IS NULL

      /* Make the results look pretty (optional) */

      ORDER BY pos.RegJrnID


      • octangle

        #4
        Re: Challenge: Can you optimize this?

        According to the SQL Query Analyzer your query wins going head to head
        with the original: yours represents 43.43% of the batch and the original
        56.57%.

        A 13% improvement!

        Thanks!


        • Hugo Kornelis

          #5
          Re: Challenge: Can you optimize this?

          On 6 Jun 2006 13:24:54 -0700, octangle wrote:
          >This code is attempting to find records that have a RegJrnID that does
          >not occur more than one time in the table.
          >
          >The reason that I want to find records with non-duplicated RegJrnID
          >values is to create "reversal" records for these such that the reversal
          >record has identical values for every column except the TaxableAmount
          >which will contain a negative amount. (see: example data below).

          Hi octangle,

          Thanks for providing CREATE TABLE and INSERT statements. This made it
          very easy to set up a test DB and fun to find an answer.

          What worries me is that there's no primary key in your table. I hope
          that you just forgot to include it in the script and that your real
          table does have a key!

          Here's a much quicker way. Running both your version and my version with
          execution plan displayed, yours took 72% and mine 28%. Removing the
          ORDER BY changed this to 64% / 36%. Still a nice gain.

          SELECT RegJrnID, MAX(InvoiceDate),
          MAX(InvoiceNumber), MAX(TaxableAmount)
          FROM t1
          GROUP BY RegJrnID
          HAVING COUNT(*) = 1
          --ORDER BY RegJrnID

          And here's another one, but its correctness depends on some assumptions
          I had to make because you forgot to include the primary key. With ORDER
          BY, it's slightly more expensive than the previous version. With the
          ORDER BY commented out, it only costs half as much!

          SELECT RegJrnID, InvoiceDate,
          InvoiceNumber, TaxableAmount
          FROM t1 AS a
          WHERE NOT EXISTS
          (SELECT *
          FROM t1 AS b
          WHERE a.RegJrnID = b.RegJrnID
          AND a.InvoiceDate <> b.InvoiceDate)
          --ORDER BY RegJrnID

          (Note - I have compared these queries using the sample data you provided
          on a SQL Server 2005 database on my computer. Results will probably vary
          on your database, especially if your table has indexes, your data
          distribution is not like the sample data, and/or you are running another
          version of SQL Server. I recommend that you test out the various
          suggestions yourself before deciding.)

          --
          Hugo Kornelis, SQL Server MVP


          • octangle

            #6
            Re: Challenge: Can you optimize this?

            Upon further review...

            As written in the previous post, records with a TaxableAmount of 0 will
            not be found to be reversed... so in an attempt to remedy this I
            modified the LEFT OUTER JOIN as follows:

            FROM
            (
            SELECT * FROM t1 WHERE TaxableAmount >= 0
            ) pos
            LEFT OUTER JOIN
            (
            SELECT * FROM t1 WHERE TaxableAmount < 0
            ) neg
            ON pos.RegJrnID = neg.RegJrnID

            This succeeds at finding the 0 TaxableAmount records... but once a
            companion reversal record is inserted into the database, both records
            (the original and the reversal) are found by subsequent queries using
            this technique... since these records are retrieved by the query as
            records to reverse, they get reversed again (making a total of 4
            instances of the original record instead of the 2 that are needed).
            And if we repeat the process, a 0 TaxableAmount record will double its
            instances every time the process is run...
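
            To make the doubling concrete, here is a minimal repro sketch (the two
            0.00 rows are hypothetical additions for illustration; the same pair
            appears in the augmented test data later in this thread):

            /* Hypothetical demonstration of the doubling (sketch only) */
            INSERT INTO t1 VALUES (10, '20060421', '0000033', 0.00)  /* original */
            INSERT INTO t1 VALUES (10, '20060517', '0000033', 0.00)  /* reversal */

            /* The >= 0 / < 0 join now returns BOTH RegJrnID 10 rows as records to
               reverse, because neither row has a companion with TaxableAmount < 0 */
            SELECT pos.*
            FROM (SELECT * FROM t1 WHERE TaxableAmount >= 0) pos
            LEFT OUTER JOIN (SELECT * FROM t1 WHERE TaxableAmount < 0) neg
            ON pos.RegJrnID = neg.RegJrnID
            WHERE neg.RegJrnID IS NULL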

            Now the questions are...

            Can the above LEFT OUTER JOIN be fixed?

            OR

            Should 0 TaxableAmounts be processed in their own pass with their own
            query (yuck)?

            OR

            Is the original query really better because it works for all
            TaxableAmounts despite the fact that it's 13% slower...

            OR

            Is there another option????


            • Alexander Kuznetsov

              #7
              Re: Challenge: Can you optimize this?

              one more question: what if there is only one row with negative amount?

              INSERT INTO t1 VALUES (11, '20060101', '2321323', -100.00)

              and there is no corresponding row with positive amount? Nothing in the
              posted DDL prevents you from that. In fact, originally I was
              considering the query posted by Hugo, but realized it would return that
              single row with negative amount and assumed it incorrect. It looks like
              there might be no basis for my assumption.


              • Erland Sommarskog

                #8
                Re: Challenge: Can you optimize this?

                Alexander Kuznetsov (AK_TIREDOFSPAM@hotmail.COM) writes:
                > one more question: what if there is only one row with negative amount?
                >
                > INSERT INTO t1 VALUES (11, '20060101', '2321323', -100.00)
                >
                > and there is no corresponding row with positive amount? Nothing in the
                > posted DDL prevents you from that. In fact, originally I was
                > considering the query posted by Hugo, but realized it would return that
                > single row with negative amount and assumed it incorrect. It looks like
                > there might be no basis for my assumption.

                Having watched the thread from the sidelines, I think the real problem
                is that the data model needs improvement. I would add a bit column
                "isbalanceentry" or some such. And of course add a primary key.
                (InvoiceNumber, isbalanceentry) looks like a candidate.

                Better get the data model in order, before looking at smart queries.
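
                A minimal sketch of what that could look like (the constraint name is a
                placeholder, and this assumes the existing data would actually satisfy
                the new key):

                /* Hypothetical schema change (sketch only) */
                ALTER TABLE t1 ALTER COLUMN InvoiceNumber VARCHAR(20) NOT NULL
                ALTER TABLE t1 ADD isbalanceentry bit NOT NULL DEFAULT 0
                ALTER TABLE t1 ADD CONSTRAINT PK_t1
                    PRIMARY KEY (InvoiceNumber, isbalanceentry)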




                --
                Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se



                • Hugo Kornelis

                  #9
                  Re: Challenge: Can you optimize this?

                  On 6 Jun 2006 17:45:58 -0700, Alexander Kuznetsov wrote:
                  >one more question: what if there is only one row with negative amount?
                  >
                  >INSERT INTO t1 VALUES (11, '20060101', '2321323', -100.00)
                  >
                  >and there is no corresponding row with positive amount? Nothing in the
                  >posted DDL prevents you from that. In fact, originally I was
                  >considering the query posted by Hugo, but realized it would return that
                  >single row with negative amount and assumed it incorrect. It looks like
                  >there might be no basis for my assumption.

                  Hi Alexander,

                  I've seen nothing in the original post that justifies special treatment
                  of negative amounts. If these should be excluded, then my version can
                  still be used - just add AND MAX(TaxableAmount) >= 0 to the HAVING
                  clause.
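
                  For reference, that version would then read something like this
                  (untested, just the grouped query above with the extra condition):

                  SELECT RegJrnID, MAX(InvoiceDate),
                  MAX(InvoiceNumber), MAX(TaxableAmount)
                  FROM t1
                  GROUP BY RegJrnID
                  HAVING COUNT(*) = 1
                  AND MAX(TaxableAmount) >= 0
                  --ORDER BY RegJrnID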

                  However, I agree with Erland that a redesign might be a better choice if
                  something like that is the case.

                  --
                  Hugo Kornelis, SQL Server MVP


                  • Alexander Kuznetsov

                    #10
                    Re: Challenge: Can you optimize this?

                    Erland Sommarskog wrote:
                    >
                    > Having watched the thread from the sidelines, I think the real problem
                    > is that the data model needs improvement. I would add a bit column
                    > "isbalanceentry" or some such. And of course add a primary key.
                    > (InvoiceNumber, isbalanceentry) looks like a candidate.
                    >

                    I concur that the schema might need some work. I was thinking that a
                    single nullable column negated_date might be sufficient, so that
                    instead of inserting one more row one just needs to update negated_date.
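
                    A rough sketch of that idea (hypothetical column and values, untested):

                    /* Mark a row as negated in place instead of inserting a negative row */
                    ALTER TABLE t1 ADD negated_date VARCHAR(8) NULL

                    /* "Reversing" a transaction then becomes an update... */
                    UPDATE t1 SET negated_date = '20060620' WHERE RegJrnID = 5

                    /* ...and the records still to reverse are simply: */
                    SELECT * FROM t1 WHERE negated_date IS NULL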


                    • octangle

                      #11
                      Challenge: Can you optimize this? (summary)

                      Below is an aggregate script that includes everyone's suggested queries
                      so far...

                      Based upon feedback I have beefed up the test records to more
                      accurately reflect all of the potential scenarios that need to be
                      handled by this query.

                      The original query (Query attempt #1 (Octangle)) generates the desired
                      result set and therefore is the benchmark of correctness for my
                      purposes.

                      MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW

                      /* Set up */

                      CREATE TABLE t1(RegJrnID INTEGER, InvoiceDate VARCHAR(8), InvoiceNumber
                      VARCHAR(20), TaxableAmount DECIMAL(32,8))

                      INSERT INTO t1 VALUES (0, '20060120', '0000033', 0.00)
                      INSERT INTO t1 VALUES (1, '20060101', '2321323', 100.00)
                      INSERT INTO t1 VALUES (9, '20060213', '2130009', 40.01)
                      INSERT INTO t1 VALUES (11, '20060324', '3321110', -1200.16)
                      INSERT INTO t1 VALUES (3, '20060101', '9402293', 512.44)
                      INSERT INTO t1 VALUES (1, '20060104', '2321323', -100.00)
                      INSERT INTO t1 VALUES (13, '20051127', '1034501', -77.50)
                      INSERT INTO t1 VALUES (4, '20060105', '9302221', 612.12)
                      INSERT INTO t1 VALUES (5, '20060105', '0003235', 18.11)
                      INSERT INTO t1 VALUES (10, '20060421', '0000033', 0.00)
                      INSERT INTO t1 VALUES (6, '20060111', '5953432', 2101.21)
                      INSERT INTO t1 VALUES (3, '20060111', '9402293', -512.44)
                      INSERT INTO t1 VALUES (12, '20060606', '0000001', 4431.55)
                      INSERT INTO t1 VALUES (7, '20060115', '4234444', 44.52)
                      INSERT INTO t1 VALUES (8, '20060115', '0342222', 95.21)
                      INSERT INTO t1 VALUES (6, '20060119', '5953432', -2101.21)
                      INSERT INTO t1 VALUES (2, '20060101', '5440033', 231.01)
                      INSERT INTO t1 VALUES (10, '20060517', '0000033', 0.00)
                      INSERT INTO t1 VALUES (11, '20060324', '3321110', 1200.16)
                      INSERT INTO t1 VALUES (12, '20060606', '0000001', -4431.55)

                      /* Show what's in the table */

                      SELECT * FROM t1 ORDER BY RegJrnID, InvoiceDate

                      /* Query for records to reverse */

                      /* Query attempt #1 (Octangle) */

                      /* Pros: correct */
                      /* Cons: slow */

                      SELECT *
                      FROM t1 a

                      /* Ignore records that have already been reversed */

                      WHERE a.RegJrnID != ALL

                      /* This subselect finds reversed records (i.e. those that have a
                      duplicate RegJrnID) */

                      (
                      SELECT b.RegJrnID
                      FROM t1 b
                      GROUP BY b.RegJrnID
                      HAVING COUNT(*) > 1
                      )

                      /* User selection criteria are appended here */

                      /* AND InvoiceNumber >= '5000000' AND InvoiceNumber <= '7500000' */

                      /*ORDER BY RegJrnID; * Make the results look pretty (optional) */

                      /* Query attempt #2 (Alexander) */

                      /* Pros: faster */
                      /* Cons: misses 0 TaxableAmounts */

                      SELECT pos.*
                      FROM
                      (
                      SELECT * FROM t1 WHERE TaxableAmount > 0
                      ) pos
                      LEFT OUTER JOIN
                      (
                      SELECT * FROM t1 WHERE TaxableAmount < 0
                      ) neg
                      ON pos.RegJrnID = neg.RegJrnID
                      WHERE neg.RegJrnID IS NULL
                      /*ORDER BY pos.RegJrnID * Make the results look pretty (optional) */

                      /* Query attempt #3 (Alexander - tweaked by Octangle) */

                      /* Pros: faster */
                      /* Cons: finds too many 0 TaxableAmounts */

                      SELECT pos.*
                      FROM
                      (
                      SELECT * FROM t1 WHERE TaxableAmount >= 0
                      ) pos
                      LEFT OUTER JOIN
                      (
                      SELECT * FROM t1 WHERE TaxableAmount < 0
                      ) neg
                      ON pos.RegJrnID = neg.RegJrnID
                      WHERE neg.RegJrnID IS NULL
                      /*ORDER BY pos.RegJrnID * Make the results look pretty (optional) */

                      /* Query attempt #4 (Hugo) */

                      /* Pros: correct, fastest, returns results in RegJrnID order with
                      ORDER BY clause */

                      SELECT RegJrnID, MAX(InvoiceDate) as "InvoiceDate",
                      MAX(InvoiceNumber) as "InvoiceNumber", MAX(TaxableAmount) as
                      "TaxableAmount"
                      FROM t1
                      GROUP BY RegJrnID
                      HAVING COUNT(*) = 1

                      /* Query attempt #5 (Hugo) */

                      /* Pros: fast */
                      /* Cons: not correct */

                      SELECT RegJrnID, InvoiceDate, InvoiceNumber, TaxableAmount
                      FROM t1 AS a
                      WHERE NOT EXISTS
                      (
                      SELECT *
                      FROM t1 AS b
                      WHERE a.RegJrnID = b.RegJrnID
                      AND a.InvoiceDate <> b.InvoiceDate
                      )
                      /*ORDER BY RegJrnID * Make the results look pretty (optional) */

                      /* Housekeeping */

                      DROP TABLE t1

                      MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW

                      Queries as percent of batch (when just executing the queries in the
                      above script)

                      Query #1: 22.66% - Correct
                      Query #2: 19.77%
                      Query #3: 20.27%
                      Query #4: 12.63% - Correct
                      Query #5: 20.67%

                      Queries as percent when compared to only the original query (Query #1)

                      Query #1: 50.00% - Correct
                      Query #2: 42.58%
                      Query #3: 43.19%
                      Query #4: 32.14% - Correct
                      Query #5: 43.67%

                      At this point it looks like the clear winner is Query #4 by Hugo!

                      MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW

                      To address some of the observations/comments:

                      1. Negative transactions are possible - I augmented the test data to
                      include this case.
                      2. This is for a commercial product that has numerous existing
                      customers; I inherited the data model that this table is based upon...
                      my coding constraints are:
                      - I cannot add any columns (due to how we version, a column change
                      would force this release to be considered a major release and not a
                      minor release as desired)
                      - I should not add any indexes/primary keys/uniqueness constraints for
                      performance reasons (see below)

                      The purpose of this table is to store processed transaction results. It
                      needs to be as efficient as possible for insertions, so as to not slow
                      down the transaction processing engine. Reporting (and reversing groups
                      of transactions) are secondary concerns and it is acceptable for these
                      functions to be slower.

                      I sincerely want to thank everyone who chipped in a comment or
                      suggestion on this...


                      • Alexander Kuznetsov

                        #12
                        Re: Challenge: Can you optimize this? (summary)

                        1. If negative transactions are possible, then my query is incorrect.
                        2. You probably need a much larger set of test data to test different
                        approaches against. Also I would say there are at least 2 possible
                        situations:
                        - most transactions are already negated.
                        - most transactions have not been negated yet.
                        In different situations, different queries may be best. I would try
                        both and see whether one and the same query wins in both cases.
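
                        For what it's worth, here is one untested sketch of how a larger test
                        set could be generated (row counts, dates and the 1-in-10 reversal
                        ratio are made-up assumptions):

                        /* Hypothetical test-data generator (sketch only) */
                        DECLARE @i INT
                        SET @i = 100   /* start above the existing RegJrnIDs */
                        WHILE @i < 10000
                        BEGIN
                            INSERT INTO t1 VALUES (@i, '20060601',
                                RIGHT('0000000' + CONVERT(VARCHAR(7), @i), 7), 100.00)
                            /* roughly 1 in 10 transactions also gets its reversal */
                            IF @i % 10 = 0
                                INSERT INTO t1 VALUES (@i, '20060602',
                                    RIGHT('0000000' + CONVERT(VARCHAR(7), @i), 7), -100.00)
                            SET @i = @i + 1
                        END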

                        Good luck!


                        • octangle

                          #13
                          Re: Challenge: Can you optimize this? (summary)


                          Alexander Kuznetsov wrote:
                          > 1. If negative transactions are possible, then my query is incorrect.
                          > 2. You probably need a much larger set of test data to test different
                          > approaches against. Also I would say there are at least 2 possible
                          > situations:
                          > - most transactions are already negated.
                          > - most transactions have not been negated yet.
                          > In different situations, different queries may be best. I would try
                          > both and see whether one and the same query wins in both cases.
                          >
                          > Good luck!

                          FYI

                          The normal situation would be that most transactions are not negated in
                          this table. Negation would only occur if the billing system was
                          undoing a billing run for some technical or business reason...
                          Therefore negating transactions would be rare, theoretically...

                          I appreciate the notion of getting better test data - this will
                          naturally accrue as I implement and test this solution. I will make
                          sure to compare performance as this project matures...

                          Thanks again...!


                          • octangle

                            #14
                            Re: Challenge: Can you optimize this? (summary)

                            /* Query attempt #4 (Hugo) */

                            SELECT RegJrnID, MAX(InvoiceDate) as "InvoiceDate", MAX(InvoiceNumber)
                            as "InvoiceNumber", MAX(TaxableAmount) as "TaxableAmount"
                            FROM t1
                            GROUP BY RegJrnID
                            HAVING COUNT(*) = 1

                            I talked to a few folks around the office and none of us had ever
                            thought to use MAX() to force values out of a query using a GROUP BY
                            clause...

                            e.g. if the query were changed to look like this:

                            SELECT *
                            FROM t1
                            GROUP BY RegJrnID
                            HAVING COUNT(*) = 1

                            The following error occurs for each column not mentioned in the GROUP
                            BY clause: "Column 't1.InvoiceDate' is invalid in the select list
                            because it is not contained in either an aggregate function or the
                            GROUP BY clause." So MAX() forces these values to participate in the
                            result set generated by this query...

                            My question with this is, "Is this technique safe for all major DBs
                            (Oracle, SQL Server, DB2 and MySQL) and will it work with all column
                            types?"


                            • Erland Sommarskog

                              #15
                              Re: Challenge: Can you optimize this? (summary)

                              octangle (idea.vortex@gmail.com) writes:
                              > /* Query attempt #4 (Hugo) */
                              >
                              > SELECT RegJrnID, MAX(InvoiceDate) as "InvoiceDate", MAX(InvoiceNumber)
                              > as "InvoiceNumber", MAX(TaxableAmount) as "TaxableAmount"
                              > FROM t1
                              > GROUP BY RegJrnID
                              > HAVING COUNT(*) = 1
                              >
                              > I talked to a few folks around the office and none of us had ever
                              > thought to use MAX() to force values out of a query using a GROUP BY
                              > clause...
                              >...
                              > My question with this is, "Is this technique safe for all major DBs
                              > (Oracle, SQL Server, DB2 and MySQL) and will it work with all column
                              > types?"

                              Yes, it should work on any RDBMS worth the name, as it is very plain
                              standard SQL.

                              Then again, MySQL has so many funny quirks, I suspect that one should
                              never take anything for granted with that engine.


                              --
                              Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se


