Challenge: Can you optimize this?

  • octangle

    Challenge: Can you optimize this?

    This code is attempting to find records that have a RegJrnID that does
    not occur more than one time in the table.

    The reason that I want to find records with non-duplicated RegJrnID
    values is to create "reversal" records for these such that the reversal
    record has identical values for every column except the TaxableAmount
    which will contain a negative amount. (see: example data below).

    /* Set up */

    CREATE TABLE t1(RegJrnID INTEGER, InvoiceDate VARCHAR(8), InvoiceNumber
    VARCHAR(20), TaxableAmount DECIMAL(32,8))

    /* Example data */

    INSERT INTO t1 VALUES (1, '20060101', '2321323', 100.00)
    INSERT INTO t1 VALUES (9, '20060213', '2130009', 40.01)
    INSERT INTO t1 VALUES (3, '20060101', '9402293', 512.44)
    INSERT INTO t1 VALUES (1, '20060104', '2321323', -100.00)
    INSERT INTO t1 VALUES (4, '20060105', '9302221', 612.12)
    INSERT INTO t1 VALUES (5, '20060105', '0003235', 18.11)
    INSERT INTO t1 VALUES (6, '20060111', '5953432', 2101.21)
    INSERT INTO t1 VALUES (3, '20060111', '9402293', -512.44)
    INSERT INTO t1 VALUES (7, '20060115', '4234444', 44.52)
    INSERT INTO t1 VALUES (8, '20060115', '0342222', 95.21)
    INSERT INTO t1 VALUES (6, '20060119', '5953432', -2101.21)
    INSERT INTO t1 VALUES (2, '20060101', '5440033', 231.01)

    /* Show what's in the table - just because */

    SELECT * FROM t1 ORDER BY RegJrnID, InvoiceDate

    /* Query for records to reverse */

    SELECT *
    FROM t1 a

    /* Ignore records that have already been reversed */

    WHERE a.RegJrnID != ALL

    /* This subselect finds reversed records (i.e. those that have a
    duplicate RegJrnID) */

    (
    SELECT b.RegJrnID
    FROM t1 b
    GROUP BY b.RegJrnID
    HAVING COUNT(*) > 1
    )

    /* User selection criteria are appended here */

    /* AND InvoiceNumber >= '5000000' AND InvoiceNumber <= '7500000' */

    /* Make the results look pretty (optional) */

    ORDER BY RegJrnID
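
    For reference, here is a hypothetical sketch of what the reversal insert
    itself might look like (an assumption on my part: every column except
    TaxableAmount is simply copied, although the example data suggests the
    reversal may carry a later InvoiceDate; the row selection reuses the
    query above):

    /* Hypothetical reversal insert (sketch only) */

    INSERT INTO t1 (RegJrnID, InvoiceDate, InvoiceNumber, TaxableAmount)
    SELECT a.RegJrnID, a.InvoiceDate, a.InvoiceNumber, -a.TaxableAmount
    FROM t1 a
    WHERE a.RegJrnID != ALL
    (
    SELECT b.RegJrnID
    FROM t1 b
    GROUP BY b.RegJrnID
    HAVING COUNT(*) > 1
    )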

    /* Housekeeping */

    DROP TABLE t1

  • Alexander Kuznetsov

    #2
    Re: Challenge: Can you optimize this?

    There are many ways to accomplish that. I would start with something
    like this (untested):

    select pos.RegJrnID
    from
    (
    select * from t1 where TaxableAmount > 0
    ) pos
    left outer join
    (
    select * from t1 where TaxableAmount < 0
    ) neg
    on pos.RegJrnID = neg.RegJrnID
    where neg.RegJrnID is null


    • octangle

      #3
      Re: Challenge: Can you optimize this?

      Here's the tested (and slightly modified) version of your code...

      SELECT pos.*

      FROM
      (
      SELECT * FROM t1 WHERE TaxableAmount > 0
      ) pos
      LEFT OUTER JOIN
      (
      SELECT * FROM t1 WHERE TaxableAmount < 0
      ) neg
      ON pos.RegJrnID = neg.RegJrnID

      WHERE neg.RegJrnID IS NULL

      /* Make the results look pretty (optional) */

      ORDER BY pos.RegJrnID


      • octangle

        #4
        Re: Challenge: Can you optimize this?

        According to the SQL Query Analyzer your query wins going head to head
        with the original: yours represents 43.43% of the batch and the original
        56.57%.

        A 13% improvement!

        Thanks!


        • Hugo Kornelis

          #5
          Re: Challenge: Can you optimize this?

          On 6 Jun 2006 13:24:54 -0700, octangle wrote:
          >This code is attempting to find records that have a RegJrnID that does
          >not occur more than one time in the table.
          >
          >The reason that I want to find records with non-duplicated RegJrnID
          >values is to create "reversal" records for these such that the reversal
          >record has identical values for every column except the TaxableAmount
          >which will contain a negative amount. (see: example data below).

          Hi octangle,

          Thanks for providing CREATE TABLE and INSERT statements. This made it
          very easy to set up a test DB and fun to find an answer.

          What worries me is that there's no primary key in your table. I hope
          that you just forgot to include it in the script and that your real
          table does have a key!

          Here's a much quicker way. Running both your version and my version with
          execution plan displayed, yours took 72% and mine 28%. Removing the
          ORDER BY changed this to 64% / 36%. Still a nice gain.

          SELECT RegJrnID, MAX(InvoiceDate),
          MAX(InvoiceNumber), MAX(TaxableAmount)
          FROM t1
          GROUP BY RegJrnID
          HAVING COUNT(*) = 1
          --ORDER BY RegJrnID

          And here's another one, but its correctness depends on some assumptions
          I had to make because you forgot to include the primary key. With ORDER
          BY, it's slightly more expensive than the previous version. With the
          ORDER BY commented out, it only costs half as much!

          SELECT RegJrnID, InvoiceDate,
          InvoiceNumber, TaxableAmount
          FROM t1 AS a
          WHERE NOT EXISTS
          (SELECT *
          FROM t1 AS b
          WHERE a.RegJrnID = b.RegJrnID
          AND a.InvoiceDate <> b.InvoiceDate)
          --ORDER BY RegJrnID

          (Note - I have compared these queries using the sample data you provided
          on a SQL Server 2005 database on my computer. Results will probably vary
          on your database, especially if your table has indexes, your data
          distribution is not like the sample data, and/or you are running another
          version of SQL Server. I recommend that you test out the various
          suggestions yourself before deciding.)

          --
          Hugo Kornelis, SQL Server MVP


          • octangle

            #6
            Re: Challenge: Can you optimize this?

            Upon further review...

            As written in the previous post, records with a TaxableAmount of 0 will
            not be found to be reversed... so in an attempt to remedy this I
            modified the LEFT OUTER JOIN as follows:

            FROM
            (
            SELECT * FROM t1 WHERE TaxableAmount >= 0
            ) pos
            LEFT OUTER JOIN
            (
            SELECT * FROM t1 WHERE TaxableAmount < 0
            ) neg
            ON pos.RegJrnID = neg.RegJrnID

            This succeeds at finding the 0 TaxableAmount records... but once a
            companion reversal record is inserted into the database, both records
            (the original and the reversal) are found by subsequent queries using
            this technique... since these records are retrieved by the query as
            records to reverse, they get reversed again (making a total of 4
            instances of the original record instead of the 2 that are needed).
            And if we repeat the process, a 0 TaxableAmount record will double its
            instances every time the process is run...
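
            To make the doubling concrete, here is a minimal repro sketch (the two
            0.00 rows are hypothetical additions for illustration; the same pair
            appears in the augmented test data later in this thread):

            /* Hypothetical demonstration of the doubling (sketch only) */
            INSERT INTO t1 VALUES (10, '20060421', '0000033', 0.00)  /* original */
            INSERT INTO t1 VALUES (10, '20060517', '0000033', 0.00)  /* reversal */

            /* The >= 0 / < 0 join now returns BOTH RegJrnID 10 rows as records to
               reverse, because neither row has a companion with TaxableAmount < 0 */
            SELECT pos.*
            FROM (SELECT * FROM t1 WHERE TaxableAmount >= 0) pos
            LEFT OUTER JOIN (SELECT * FROM t1 WHERE TaxableAmount < 0) neg
            ON pos.RegJrnID = neg.RegJrnID
            WHERE neg.RegJrnID IS NULL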

            Now the questions are...

            Can the above LEFT OUTER JOIN be fixed?

            OR

            Should 0 TaxableAmounts be processed in their own pass with their own
            query (yuck)?

            OR

            Is the original query really better because it works for all
            TaxableAmounts despite the fact that it's 13% slower...

            OR

            Is there another option????


            • Alexander Kuznetsov

              #7
              Re: Challenge: Can you optimize this?

              one more question: what if there is only one row with negative amount?

              INSERT INTO t1 VALUES (11, '20060101', '2321323', -100.00)

              and there is no corresponding row with positive amount? Nothing in the
              posted DDL prevents you from that. In fact, originally I was
              considering the query posted by Hugo, but realized it would return that
              single row with negative amount and assumed it incorrect. It looks like
              there might be no basis for my assumption.


              • Erland Sommarskog

                #8
                Re: Challenge: Can you optimize this?

                Alexander Kuznetsov (AK_TIREDOFSPAM@hotmail.COM) writes:
                > one more question: what if there is only one row with negative amount?
                >
                > INSERT INTO t1 VALUES (11, '20060101', '2321323', -100.00)
                >
                > and there is no corresponding row with positive amount? Nothing in the
                > posted DDL prevents you from that. In fact, originally I was
                > considering the query posted by Hugo, but realized it would return that
                > single row with negative amount and assumed it incorrect. It looks like
                > there might be no basis for my assumption.

                Having watched the thread from the sidelines, I think the real problem
                is that the data model needs improvement. I would add a bit column
                "isbalanceentry" or some such. And of course add a primary key.
                (InvoiceNumber, isbalanceentry) looks like a candidate.

                Better get the data model in order, before looking at smart queries.
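
                A minimal sketch of what that could look like (the constraint name is a
                placeholder, and this assumes the existing data would actually satisfy
                the new key):

                /* Hypothetical schema change (sketch only) */
                ALTER TABLE t1 ALTER COLUMN InvoiceNumber VARCHAR(20) NOT NULL
                ALTER TABLE t1 ADD isbalanceentry bit NOT NULL DEFAULT 0
                ALTER TABLE t1 ADD CONSTRAINT PK_t1
                    PRIMARY KEY (InvoiceNumber, isbalanceentry)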




                --
                Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se



                • Hugo Kornelis

                  #9
                  Re: Challenge: Can you optimize this?

                  On 6 Jun 2006 17:45:58 -0700, Alexander Kuznetsov wrote:
                  >one more question: what if there is only one row with negative amount?
                  >
                  >INSERT INTO t1 VALUES (11, '20060101', '2321323', -100.00)
                  >
                  >and there is no corresponding row with positive amount? Nothing in the
                  >posted DDL prevents you from that. In fact, originally I was
                  >considering the query posted by Hugo, but realized it would return that
                  >single row with negative amount and assumed it incorrect. It looks like
                  >there might be no basis for my assumption.

                  Hi Alexander,

                  I've seen nothing in the original post that justifies special treatment
                  of negative amounts. If these should be excluded, then my version can
                  still be used - just add AND MAX(TaxableAmount) >= 0 to the HAVING
                  clause.
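
                  For reference, that version would then read something like this
                  (untested, just the grouped query above with the extra condition):

                  SELECT RegJrnID, MAX(InvoiceDate),
                  MAX(InvoiceNumber), MAX(TaxableAmount)
                  FROM t1
                  GROUP BY RegJrnID
                  HAVING COUNT(*) = 1
                  AND MAX(TaxableAmount) >= 0
                  --ORDER BY RegJrnID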

                  However, I agree with Erland that a redesign might be a better choice if
                  something like that is the case.

                  --
                  Hugo Kornelis, SQL Server MVP


                  • Alexander Kuznetsov

                    #10
                    Re: Challenge: Can you optimize this?

                    Erland Sommarskog wrote:
                    >
                    > Having watched the thread from the sidelines, I think the real problem
                    > is that the data model needs improvement. I would add a bit column
                    > "isbalanceentry" or some such. And of course add a primary key.
                    > (InvoiceNumber, isbalanceentry) looks like a candidate.
                    >

                    I concur that the schema might need some work. I was thinking that a
                    single nullable column negated_date might be sufficient, so that
                    instead of inserting one more row one just needs to update negated_date.
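
                    A rough sketch of that idea (hypothetical column and values, untested):

                    /* Mark a row as negated in place instead of inserting a negative row */
                    ALTER TABLE t1 ADD negated_date VARCHAR(8) NULL

                    /* "Reversing" a transaction then becomes an update... */
                    UPDATE t1 SET negated_date = '20060620' WHERE RegJrnID = 5

                    /* ...and the records still to reverse are simply: */
                    SELECT * FROM t1 WHERE negated_date IS NULL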


                    • octangle

                      #11
                      Challenge: Can you optimize this? (summary)

                      Below is an aggregate script that includes everyone's suggested queries
                      so far...

                      Based upon feedback I have beefed up the test records to more
                      accurately reflect all of the potential scenarios that need to be
                      handled by this query.

                      The original query (Query attempt #1 (Octangle)) generates the desired
                      result set and therefore is the benchmark of correctness for my
                      purposes.

                      MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW

                      /* Set up */

                      CREATE TABLE t1(RegJrnID INTEGER, InvoiceDate VARCHAR(8), InvoiceNumber
                      VARCHAR(20), TaxableAmount DECIMAL(32,8))

                      INSERT INTO t1 VALUES (0, '20060120', '0000033', 0.00)
                      INSERT INTO t1 VALUES (1, '20060101', '2321323', 100.00)
                      INSERT INTO t1 VALUES (9, '20060213', '2130009', 40.01)
                      INSERT INTO t1 VALUES (11, '20060324', '3321110', -1200.16)
                      INSERT INTO t1 VALUES (3, '20060101', '9402293', 512.44)
                      INSERT INTO t1 VALUES (1, '20060104', '2321323', -100.00)
                      INSERT INTO t1 VALUES (13, '20051127', '1034501', -77.50)
                      INSERT INTO t1 VALUES (4, '20060105', '9302221', 612.12)
                      INSERT INTO t1 VALUES (5, '20060105', '0003235', 18.11)
                      INSERT INTO t1 VALUES (10, '20060421', '0000033', 0.00)
                      INSERT INTO t1 VALUES (6, '20060111', '5953432', 2101.21)
                      INSERT INTO t1 VALUES (3, '20060111', '9402293', -512.44)
                      INSERT INTO t1 VALUES (12, '20060606', '0000001', 4431.55)
                      INSERT INTO t1 VALUES (7, '20060115', '4234444', 44.52)
                      INSERT INTO t1 VALUES (8, '20060115', '0342222', 95.21)
                      INSERT INTO t1 VALUES (6, '20060119', '5953432', -2101.21)
                      INSERT INTO t1 VALUES (2, '20060101', '5440033', 231.01)
                      INSERT INTO t1 VALUES (10, '20060517', '0000033', 0.00)
                      INSERT INTO t1 VALUES (11, '20060324', '3321110', 1200.16)
                      INSERT INTO t1 VALUES (12, '20060606', '0000001', -4431.55)

                      /* Show what's in the table */

                      SELECT * FROM t1 ORDER BY RegJrnID, InvoiceDate

                      /* Query for records to reverse */

                      /* Query attempt #1 (Octangle) */

                      /* Pros: correct */
                      /* Cons: slow */

                      SELECT *
                      FROM t1 a

                      /* Ignore records that have already been reversed */

                      WHERE a.RegJrnID != ALL

                      /* This subselect finds reversed records (i.e. those that have a
                      duplicate RegJrnID) */

                      (
                      SELECT b.RegJrnID
                      FROM t1 b
                      GROUP BY b.RegJrnID
                      HAVING COUNT(*) > 1
                      )

                      /* User selection criteria are appended here */

                      /* AND InvoiceNumber >= '5000000' AND InvoiceNumber <= '7500000' */

                      /*ORDER BY RegJrnID; * Make the results look pretty (optional) */

                      /* Query attempt #2 (Alexander) */

                      /* Pros: faster */
                      /* Cons: misses 0 TaxableAmounts */

                      SELECT pos.*
                      FROM
                      (
                      SELECT * FROM t1 WHERE TaxableAmount > 0
                      ) pos
                      LEFT OUTER JOIN
                      (
                      SELECT * FROM t1 WHERE TaxableAmount < 0
                      ) neg
                      ON pos.RegJrnID = neg.RegJrnID
                      WHERE neg.RegJrnID IS NULL
                      /*ORDER BY pos.RegJrnID * Make the results look pretty (optional) */

                      /* Query attempt #3 (Alexander - tweaked by Octangle) */

                      /* Pros: faster */
                      /* Cons: finds too many 0 TaxableAmounts */

                      SELECT pos.*
                      FROM
                      (
                      SELECT * FROM t1 WHERE TaxableAmount >= 0
                      ) pos
                      LEFT OUTER JOIN
                      (
                      SELECT * FROM t1 WHERE TaxableAmount < 0
                      ) neg
                      ON pos.RegJrnID = neg.RegJrnID
                      WHERE neg.RegJrnID IS NULL
                      /*ORDER BY pos.RegJrnID * Make the results look pretty (optional) */

                      /* Query attempt #4 (Hugo) */

                      /* Pros: correct, fastest, returns results in RegJrnID order with
                      ORDER BY clause */

                      SELECT RegJrnID, MAX(InvoiceDate) as "InvoiceDate",
                      MAX(InvoiceNumber) as "InvoiceNumber", MAX(TaxableAmount) as
                      "TaxableAmount"
                      FROM t1
                      GROUP BY RegJrnID
                      HAVING COUNT(*) = 1

                      /* Query attempt #5 (Hugo) */

                      /* Pros: fast */
                      /* Cons: not correct */

                      SELECT RegJrnID, InvoiceDate, InvoiceNumber, TaxableAmount
                      FROM t1 AS a
                      WHERE NOT EXISTS
                      (
                      SELECT *
                      FROM t1 AS b
                      WHERE a.RegJrnID = b.RegJrnID
                      AND a.InvoiceDate <> b.InvoiceDate
                      )
                      /*ORDER BY RegJrnID * Make the results look pretty (optional) */

                      /* Housekeeping */

                      DROP TABLE t1

                      MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW

                      Queries as percent of batch (when just executing the queries in the
                      above script)

                      Query #1: 22.66% - Correct
                      Query #2: 19.77%
                      Query #3: 20.27%
                      Query #4: 12.63% - Correct
                      Query #5: 20.67%

                      Queries as percent when compared to only the original query (Query #1)

                      Query #1: 50.00% - Correct
                      Query #2: 42.58%
                      Query #3: 43.19%
                      Query #4: 32.14% - Correct
                      Query #5: 43.67%

                      At this point it looks like the clear winner is Query #4 by Hugo!

                      MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW

                      To address some of the observations/comments:

                      1. Negative transactions are possible - I augmented the test data to
                      include this case.
                      2. This is for a commercial product that has numerous existing
                      customers; I inherited the data model that this table is based upon...
                      my coding constraints are:
                      - I cannot add any columns (due to how we version, a column change
                      would force this release to be considered a major release and not a
                      minor release as desired)
                      - I should not add any indexes/primary keys/uniqueness constraints for
                      performance reasons (see below)

                      The purpose of this table is to store processed transaction results. It
                      needs to be as efficient as possible for insertions, so as to not slow
                      down the transaction processing engine. Reporting (and reversing groups
                      of transactions) are secondary concerns and it is acceptable for these
                      functions to be slower.

                      I sincerely want to thank everyone who chipped in a comment or
                      suggestion on this...


                      • Alexander Kuznetsov

                        #12
                        Re: Challenge: Can you optimize this? (summary)

                        1. If negative transactions are possible, then my query is incorrect.
                        2. You probably need a much larger set of test data to test different
                        approaches against. Also I would say there are at least 2 possible
                        situations:
                        - most transactions are already negated.
                        - most transactions have not been negated yet.
                        In different situations, different queries may be best. I would try
                        both and see whether one and the same query wins in both cases.
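
                        For what it's worth, here is one untested sketch of how a larger test
                        set could be generated (row counts, dates and the 1-in-10 reversal
                        ratio are made-up assumptions):

                        /* Hypothetical test-data generator (sketch only) */
                        DECLARE @i INT
                        SET @i = 100   /* start above the existing RegJrnIDs */
                        WHILE @i < 10000
                        BEGIN
                            INSERT INTO t1 VALUES (@i, '20060601',
                                RIGHT('0000000' + CONVERT(VARCHAR(7), @i), 7), 100.00)
                            /* roughly 1 in 10 transactions also gets its reversal */
                            IF @i % 10 = 0
                                INSERT INTO t1 VALUES (@i, '20060602',
                                    RIGHT('0000000' + CONVERT(VARCHAR(7), @i), 7), -100.00)
                            SET @i = @i + 1
                        END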

                        Good luck!


                        • octangle

                          #13
                          Re: Challenge: Can you optimize this? (summary)


                          Alexander Kuznetsov wrote:
                          > 1. If negative transactions are possible, then my query is incorrect.
                          > 2. You probably need a much larger set of test data to test different
                          > approaches against. Also I would say there are at least 2 possible
                          > situations:
                          > - most transactions are already negated.
                          > - most transactions have not been negated yet.
                          > In different situations, different queries may be best. I would try
                          > both and see whether one and the same query wins in both cases.
                          >
                          > Good luck!

                          FYI

                          The normal situation would be that most transactions are not negated in
                          this table. Negation would only occur if the billing system was
                          undoing a billing run for some technical or business reason...
                          Therefore negating transactions would be rare, theoretically...

                          I appreciate the notion of getting better test data - this will
                          naturally accrue as I implement and test this solution. I will make
                          sure to compare performance as this project matures...

                          Thanks again...!


                          • octangle

                            #14
                            Re: Challenge: Can you optimize this? (summary)

                            /* Query attempt #4 (Hugo) */

                            SELECT RegJrnID, MAX(InvoiceDate) as "InvoiceDate", MAX(InvoiceNumber)
                            as "InvoiceNumber", MAX(TaxableAmount) as "TaxableAmount"
                            FROM t1
                            GROUP BY RegJrnID
                            HAVING COUNT(*) = 1

                            I talked to a few folks around the office and none of us had ever
                            thought to use MAX() to force values out of a query using a GROUP BY
                            clause...

                            e.g. if the query were changed to look like this:

                            SELECT *
                            FROM t1
                            GROUP BY RegJrnID
                            HAVING COUNT(*) = 1

                            The following error occurs for each column not mentioned in the GROUP
                            BY clause: "Column 't1.InvoiceDate' is invalid in the select list
                            because it is not contained in either an aggregate function or the
                            GROUP BY clause." So MAX() forces these values to participate in the
                            result set generated by this query...

                            My question with this is, "Is this technique safe for all major DBs
                            (Oracle, SQL Server, DB2 and MySQL) and will it work with all column
                            types?"


                            • Erland Sommarskog

                              #15
                              Re: Challenge: Can you optimize this? (summary)

                              octangle (idea.vortex@gmail.com) writes:
                              > /* Query attempt #4 (Hugo) */
                              >
                              > SELECT RegJrnID, MAX(InvoiceDate) as "InvoiceDate", MAX(InvoiceNumber)
                              > as "InvoiceNumber", MAX(TaxableAmount) as "TaxableAmount"
                              > FROM t1
                              > GROUP BY RegJrnID
                              > HAVING COUNT(*) = 1
                              >
                              > I talked to a few folks around the office and none of us had ever
                              > thought to use MAX() to force values out of a query using a GROUP BY
                              > clause...
                              >...
                              > My question with this is, "Is this technique safe for all major DBs
                              > (Oracle, SQL Server, DB2 and MySQL) and will it work with all column
                              > types?"

                              Yes, it should work on any RDBMS worth the name, as it is very plain
                              standard SQL.

                              Then again, MySQL has so many funny quirks, I suspect that one should
                              never take anything for granted with that engine.


                              --
                              Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se


