How To: Optimize SQL Queries

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • Rabbit
    Recognized Expert MVP
    • Jan 2007
    • 12517

    How To: Optimize SQL Queries

    Introduction
    In this article, I discuss the basic methods I use to analyse and optimize SQL queries. A properly written SQL with appropriate indices can turn a query that runs in hours into one that runs in seconds. While the code in this article is rooted in SQL Server, most of the concepts will apply to other databases.

    We will go over:
    1. Analyzing Performance of Queries
    2. Optimizing SQL Syntax
    3. Query Execution Plans
    4. Creating Proper Indices


    Analyzing Performance of Queries
    SQL Server comes with a few options to analyse the performance of a query. My basic barrage of tests that I use on a query is this:
    Code:
    CHECKPOINT
    DBCC DROPCLEANBUFFERS
    DBCC FREEPROCCACHE
    
    SET STATISTICS IO ON
    SET STATISTICS TIME ON
    SET STATISTICS PROFILE ON
    GO
    
    -- SQL GOES HERE
    GO
    
    SET STATISTICS IO OFF
    SET STATISTICS TIME OFF
    SET STATISTICS PROFILE OFF
    GO
    The first thing I do is to clear all the caches so that I can the SQL can run fresh without help from previous runs.

    CHECKPOINT - writes any unwritten buffers.
    DBCC DROPCLEANBUFFER S - clears out the buffer cache.
    DBCC FREEPROCCACHE - clears out the plan cache.

    After that, I turn on some statistical options.

    SET STATISTICS IO - shows you information on the disk activity that occurs with a query. The lower the numbers, the faster it is.
    SET STATISTICS TIME ON - shows you how long, in milliseconds, it takes a query to run.
    SET STATISTICS PROFILE ON - shows you a text version of the query execution plan. This plan reveals every step the database uses to run the query and how long it takes to run that step.

    Optimizing SQL Syntax
    There many syntactical choices you can make that will speed up a query. I will discuss some of those choices here. The first eight are relatively self explanatory but the last three merit further discussion.
    • Avoid using views.
    • Use an INNER JOIN as opposed to one of the OUTER JOINs whenever possible.
    • Avoid the SELECT *, return only the fields that you need.
    • Don't SORT or ORDER BY unless it's needed.
    • Don't JOIN to tables that you don't need for the query.
    • Avoid NOT, i.e. NOT IN and NOT EXISTS.
    • EXISTS is quicker than IN.
    • Use UNION ALL rather than UNION if you don't need DISTINCT records.
    • Sometimes a UNION is quicker than OR. To know which one is quicker, try both and look at the statistics.
    • When filtering records in the WHERE clause, try to avoid calculations on fields. Move them to the constant side of the equation. If it's not possible to avoid, use an index that includes the calculation.
    • Ninety percent of the time, filtering using a subquery in the WHERE clause is slower than filtering by JOINing to a subquery. To know which one is quicker, try both and look at the statistics.


    In the WHERE clause, using an OR condition can sometimes slow down a query by a lot. If this happens, try a UNION or UNION ALL instead.
    Code:
    -- OR syntax
    SELECT someField
    FROM someTable
    WHERE otherField = 5 OR otherField = 6;
    
    -- UNION syntax
    SELECT someField
    FROM someTable
    WHERE otherField = 5
    
    UNION ALL
    
    SELECT someField
    FROM someTable
    WHERE otherField = 6;
    It makes for much longer code but the time savings could be huge. I once had a query that took hours using the OR syntax but under 5 seconds using the UNION. Test both and look at the statistics to see which would be faster in your case.

    When you do calculations on a field in the WHERE clause, the query can no longer just scan an index because it now must perform the calculation on every record to see if it should be returned. For example, if I had a date field and I wanted to know if it was less than seven days old from the current date, I have a couple of syntactical choices:
    Code:
    DATEDIFF(D, dateField, GETDATE()) <= 5
    
    dateField BETWEEN DATEADD(D, -5, GETDATE()) AND GETDATE()
    The first is slower because an index on dateField can't be used since it must calculate for every record. And in a similar vein but with drastically different syntax:
    Code:
    Left(LastName, 1) = 'T'
    
    LastName LIKE 'T%'
    More often than not, if you need to filter records based on another table or the same table, using a subquery in a JOIN is quicker than using a subquery in the WHERE clause. For example, if I wanted all the fields for a record but only wanted the most current record based on a category and date field, I could do it one of two ways:
    Code:
    SELECT *
    FROM someTable AS t
    WHERE dateField = (
       SELECT MAX(dateField)
       FROM someTable
       WHERE categoryField = t.categoryField
    );
    
    SELECT *
    FROM someTable AS t1
    INNER JOIN (
       SELECT categoryField, MAX(dateField) AS dateField
       FROM someTable
       GROUP BY categoryField
    ) AS t2
    ON t1.categoryField = t2.categoryField
       AND t1.dateField = t2.dateField;
    Test both and look at the statistics to see which would be faster in your case. I've had situations where a query is much faster using the WHERE clause and situations where a query is much faster in the JOIN. But the JOIN approach is correct most of the time.

    Query Execution Plans
    The query execution plan tells you the steps the database is going to take to run a query along with the estimated/actual time it took. If a query is running slowly, this will show which step it's getting hung up on. There are a lot of different operators that the database can use to run a query but the main culprits of a slow query are Table Scans and RID Lookups. Table scans read every row of a table to find the correct records to return. RID lookups are used when a non-clustered index is being utilized to find a matching row. It merely points to a location on a heap of data and has to go to the heap to look up the value rather than the data being stored in the index. If you see either of these, you should create the appropriate indices to get rid of them. However, there's a caveat regarding table scans, for tables with a small amount of rows, table scans are quicker than index scans.

    Creating Proper Indices
    Using a clustered index on a primary key field is preferable. Clustered indices significantly speed up searching on a field. A clustered index determines the physical order of data in a table. Therefore, you can only have one. However, you can have as many non-clustered indices as you want.

    When creating an index, order matters. Put the fields in the following order: WHERE clause, JOIN clause, ORDER BY clause, SELECT clause. So if I wanted to optimize the following query:
    Code:
    SELECT t1.someField
    FROM someTable t1
    INNER JOIN otherTable t2
    ON t1.PK = t2.FK
    WHERE t1.otherField = 15
    ORDER BY t1.dateField
    Then I would create an index on someTable with the fields in the following order: otherField, PK, dateField, someField. And an index on otherTable on the field FK.

    Incidentally, the previous example was also a fully covered query. What that means is the index contains all fields required for the query. This means that the database does not have to refer back to the data in the table to find additional data for the query. When feasible, you should create an index to fully cover a query you want to run often.

    However, it is important to note that an index can be "too large". Both in terms of technical limitations and usability. The more fields and the larger the fields included in an index, the longer it takes to run a query using the index.

    Conclusion
    A quick note about hardware. No matter how much optimization you do, sometimes you just need pure power. If you've optimized a query as much as you can but it's still too slow, you may have to upgrade your hardware.

    I hope this article has armed you with the basic tools you need to take a query that runs in hours and make it run in seconds. Remember, optimization is a very deep subject and these are just the basic tools that you can use to speed up your query. The next time you have a long running query, analyse the statistics and execution plans, create the proper indices with help from the execution plan, and test different versions of the SQL syntax.
    Last edited by Rabbit; Mar 13 '12, 05:19 AM.
  • r035198x
    MVP
    • Sep 2006
    • 13225

    #2
    This is very useful. Good work.

    Comment

    • ck9663
      Recognized Expert Specialist
      • Jun 2007
      • 2878

      #3
      Rabbit,

      Very nicely done and a very helpful article. I have some other ideas and will keep adding them on this post as I remember or found them.

      For now here are some of them.

      These are all based on my experience and could be different from other cases.

      If OR or AND is necessary
      If you you really need these logical operators, arrange your conditions properly.

      If you are using AND, pick the most probably False condition first. On the other hand, if you are using OR, pick the most probable True condition first.

      An AND operator will return False if any of the condition is false. While an OR operator will return True if any of the condition is True.

      There is actually no guarantee that SQL Server will not perform the rest of the conditions, but in some cases it stops checking the other tests when the condition is or not satisfied.

      Arrange the conditions in CASE WHEN
      In a CASE WHEN statement/function, place the most probable TRUE condition first so the command does not have to parse through the entire statement only to find the most use condition at the bottom. This might be a small improvement if it's a simple condition on the WHEN clause. But if you have a lot of WHEN clauses or just a few but the condition on those WHENs requires complex calculation or even referring to columns on other tables, you will experience improvement if conditions are properly arranged.

      Index Seek vs Index Scan
      Remember that not because you have an index on your table, SQL Server will always use it. SQL Server will figure out for itself if it's better to use Index Seek or Index Scan. Because there are times that statistics are not updated in the database, the server will use of Index Scan instead of Index Seek.

      Use the statistics and Execution Plan as describe by Rabbit above. As much as possible, try to eliminate Index Scan.

      Use CURSORs and RBARs as the last option
      RBAR (Row By Agonizing Row) is a process where the query is forcing the server to process the data by row and not by result set. Much like how a CURSOR is processed. Although these technique have their own use, you must, as much as possible, use a result based technique than a Row by Row process. If you have to, use temporary or permanent tables.

      A table must have at least one index
      Although it's not required, this mindset will help you make sure that all tables have an index before you use them. And if a table will only have one index, make it a CLUSTERED. Of course you still have to check how the table is being used and/or updated.

      More to come, work for now...

      Comment

      • Rabbit
        Recognized Expert MVP
        • Jan 2007
        • 12517

        #4
        Thanks for those additional ideas ck. We could make this thread the go to place for optimizing queries.

        If any one has additional optimizations, please feel free to add your two cents.

        Comment

        • Vikki McCormick
          New Member
          • Aug 2010
          • 46

          #5
          Rabbit, Great article. Anyone have any advice on how to improve performance on a non-deterministic computed column? I have some of those, and they usually are the most costly operation in the execution plan.

          Comment

          • ck9663
            Recognized Expert Specialist
            • Jun 2007
            • 2878

            #6
            Try to see if you can execute the function at run time and see how the execution plan looks. Depending on how you use the function, you might be able to refactor and restructure your query to utilize Index Seek during execution instead of Index Scan.

            Most of the time a non-deterministic function uses an Index Scan and not Index Seek.


            Good Luck!!!


            ~~ CK

            Comment

            • Rabbit
              Recognized Expert MVP
              • Jan 2007
              • 12517

              #7
              My first piece of advice would be to not use it. If we know what the calculation is doing, we may be able to substitute it with something else. If there's no way around it, make sure it's persistent and put it in an index.

              Comment

              • mafaisal
                New Member
                • Sep 2007
                • 142

                #8
                Very Good Article.
                Thanks Rabbit, CK

                By Faisal

                Comment

                Working...