index bloat?

  • sql_server_2000_user

    #16
    Re: index bloat?

    I updated statistics on the loan table and it had no effect.

    Was that a typo and you meant to say it sounds like a good plan?
    Because I forced this plan with the following query:

    select * from (
    select * from loan
    where loan.deal_no='ML4W1') as x
    inner join loan_history as lh
    on x.exloan_id=lh.exloan_id
    where time_period=196
    option (force order, loop join)

    and it is so fast it makes me cry - 1 second if cached, 5 seconds if
    not cached, even for a large deal. Here's the plan:

    StmtText
    |--Nested Loops(Inner Join, OUTER REFERENCES:([loan].[EXLOAN_ID]) WITH PREFETCH)
         |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([LoanPerformance].[dbo].[loan]))
         |    |--Index Seek(OBJECT:([LoanPerformance].[dbo].[loan].[IX_loan_deal_no]), SEEK:([loan].[DEAL_NO]='ML4W1') ORDERED FORWARD)
         |--Compute Scalar(DEFINE:([lh].[YYYYMM]=(floor([lh].[TIME_PERIOD]/12)+1989)*100+[lh].[TIME_PERIOD]%12+1))
              |--Clustered Index Seek(OBJECT:([LoanPerformance].[dbo].[loan_history].[PK_loan_history] AS [lh]), SEEK:([lh].[TIME_PERIOD]=196 AND [lh].[EXLOAN_ID]=[loan].[EXLOAN_ID]) ORDERED FORWARD)


    I tried this query:

    select * from loan
    inner join loan_history lh
    on loan.exloan_id = lh.exloan_id
    where loan.deal_no='ML4W1'
    and time_period='196'
    option (force order)

    and it takes a couple minutes if not cached, 5 seconds if cached -
    definitely better than a couple hours, but nowhere near my
    forced nested-loop plan. Here's the plan:

    StmtText
    |--Merge Join(Inner Join, MERGE:([loan].[EXLOAN_ID])=([lh].[EXLOAN_ID]), RESIDUAL:([lh].[EXLOAN_ID]=[loan].[EXLOAN_ID]))
         |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([LoanPerformance].[dbo].[loan]))
         |    |--Index Seek(OBJECT:([LoanPerformance].[dbo].[loan].[IX_loan_deal_no]), SEEK:([loan].[DEAL_NO]='ML4W1') ORDERED FORWARD)
         |--Compute Scalar(DEFINE:([lh].[YYYYMM]=(floor([lh].[TIME_PERIOD]/12)+1989)*100+[lh].[TIME_PERIOD]%12+1))
              |--Clustered Index Seek(OBJECT:([LoanPerformance].[dbo].[loan_history].[PK_loan_history] AS [lh]), SEEK:([lh].[TIME_PERIOD]=196) ORDERED FORWARD)

    And now for the obvious/hard question: now that I know what the right
    execution plan is, how can I get the query optimizer to generate it -
    *without the hints*? Because some users of the DB are using report
    designing software that will just generate the join query with no
    hints, and I don't want them sitting there for hours when they don't
    have to. Does the optimizer not choose the merge join because it
    requires a bookmark lookup? Does it not choose my nested loop join
    because the statistics are wrong somewhere?

    Thanks again,
    Seth


    • Erland Sommarskog

      #17
      Re: index bloat?

      sql_server_2000_user (sethpurcell@comcast.net) writes:
      > I updated statistics on the loan table and it had no effect.

      And you used FULLSCAN?

      Did you also update statistics (WITH FULLSCAN) on loan_history?
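
      For reference, the statements would be something like this (just a
      sketch; the full scan on loan_history will of course run for quite
      a while on 305 million rows):

      UPDATE STATISTICS dbo.loan WITH FULLSCAN
      UPDATE STATISTICS dbo.loan_history WITH FULLSCAN
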
      > Was that a typo and you meant to say it sounds like a good plan?

      Yes. Don't believe everything I say. :-)
      > Because I forced this plan with the following query:
      >
      > select * from (
      > select * from loan
      > where loan.deal_no='ML4W1') as x
      > inner join loan_history as lh
      > on x.exloan_id=lh.exloan_id
      > where time_period=196
      > option (force order, loop join)
      >
      > and it is so fast it makes me cry - 1 second if cached, 5 seconds if
      > not cached, even for a large deal. Here's the plan:

      And that is indeed the plan we are looking for!
      > and it takes a couple minutes if not cached, 5 seconds if cached -
      > definitely better than a couple hours, but nowhere near my
      > forced nested-loop plan. Here's the plan:

      Certainly better, but still scanning the entire time_period 196.
      > And now for the obvious/hard question: now that I know what the right
      > execution plan is, how can I get the query optimizer to generate it -
      > *without the hints*? Because some users of the DB are using report
      > designing software that will just generate the join query with no
      > hints, and I don't want them sitting there for hours when they don't
      > have to. Does the optimizer not choose the merge join because it
      > requires a bookmark lookup? Does it not choose my nested loop join
      > because the statistics are wrong somewhere?

      I have no idea. I have posted a question to our internal MVP forum
      to get some suggestions. OK, I do have a vague guess: time_period
      is a bit too unselective to be the first column in the index. The
      statistics for the index have a distribution histogram for this
      column only.
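
      You can look at what the optimizer has to work with by something
      like this (a sketch; the second argument is the index or statistics
      name):

      DBCC SHOW_STATISTICS ('loan_history', PK_loan_history)

      The last result set is the histogram, and its steps cover the
      leading column, TIME_PERIOD, only.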

      Thus, here is an idea: make the PK on loan_history nonclustered,
      and then add a clustered index on (exloan_id, time_period), that
      is, the reverse of the current key. But this is a very costly
      operation to do on a 305 million row table, and there are other
      queries that could take a hit. (Hmm, what if you add a
      non-clustered index on the reversed key only? Could that help?)
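
      If you were to go down that road, the DDL would be roughly along
      these lines (only a sketch - the new index name is made up, and at
      this table size each step is a major operation):

      -- drop the current clustered PK, recluster on the reversed key,
      -- then put the PK back as a nonclustered constraint
      ALTER TABLE dbo.loan_history DROP CONSTRAINT PK_loan_history
      CREATE CLUSTERED INDEX IX_loan_history_exloan_time
          ON dbo.loan_history (exloan_id, time_period)
      ALTER TABLE dbo.loan_history ADD CONSTRAINT PK_loan_history
          PRIMARY KEY NONCLUSTERED (time_period, exloan_id)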

      There is also the idea that I've mentioned before: make the index
      on loan.deal_no the clustered index on that table. That would at
      least be faster to implement. But that is also likely to have
      ramifications for other queries.



      --
      Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

      Books Online for SQL Server SP3 at



      • Steve Kass

        #18
        Re: index bloat?



        sql_server_2000_user wrote:
        > I updated statistics on the loan table and it had no effect.
        >
        >
        > And now for the obvious/hard question: now that I know what the right
        > execution plan is, how can I get the query optimizer to generate it -
        > *without the hints*? Because some users of the DB are using report
        > designing software that will just generate the join query with no
        > hints, and I don't want them sitting there for hours when they don't
        > have to. Does the optimizer not choose the merge join because it
        > requires a bookmark lookup? Does it not choose my nested loop join
        > because the statistics are wrong somewhere?

        This can be hard, but one place to start that I don't
        see done in this thread is to find the bad estimate, if
        there is one causing the problem. Usually this will be
        a bad rowcount estimate (rowcount estimates cannot
        always be improved with better statistics, so this can
        happen even after FULLSCAN) - sometimes it is a row size estimate.

        First, though, just one thought - you noted that the index is
        very fragmented, though the density is good. This could be a
        problem when costing the seek on time period 196. The optimizer might
        expect all the qualifying rows to be physically contiguous, but
        they might be scattered all over the place. If the optimizer
        expects 40-50 rows per page, and perhaps 1,000,000 qualifying
        rows, the I/O for the scan will be ~20,000 pages, or ~150MB.
        If instead, the index is so fragmented that to follow the
        logical order requires 10 times as many pages, you could move
        from an "all in memory" query to a disk-grinder.

        I'm not sure whether the optimizer considers fragmentation at
        all in planning - I assume not, but could be wrong.

        Can you find out how many pages contain rows with time period
        196? You could clear the buffers and run a simple select like
        select <some columns only in the clustered index>
        from loan_history
        where time_period = 196

        It would be good here also to see if the estimated and actual
        row count for that value are close in this simple query.
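
        One way to measure that (a sketch; only do the buffer flush on a
        server where that is acceptable):

        CHECKPOINT
        DBCC DROPCLEANBUFFERS    -- start from a cold cache
        SET STATISTICS IO ON

        select exloan_id, time_period   -- just the clustered key columns
        from loan_history
        where time_period = 196

        SET STATISTICS IO OFF

        The logical/physical read counts in the Messages pane then show
        how many pages were actually touched.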

        Then look at estimates in the actual slow query and see if any
        are bad. If you run one of the too-slow queries (the slowest
        you can stand to run) after typing CTRL-K for "show execution plan",
        you will see the estimated execution plan with a couple of
        extra actual values in addition to most of the estimates you can
        get with CTRL-L. In particular, you will see the actual number
        of rows (Row Count) and number of executions (Number of Executes)
        of each operator.
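
        If you prefer the text form, SET STATISTICS PROFILE ON returns
        the executed plan with actual Rows and Executes columns next to
        the estimates - for example (a sketch on the query from this
        thread):

        SET STATISTICS PROFILE ON

        select * from loan
        inner join loan_history lh on loan.exloan_id = lh.exloan_id
        where loan.deal_no = 'ML4W1' and lh.time_period = 196

        SET STATISTICS PROFILE OFF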

        Compare (Estimated Rowcount)*(Estimated number of executes) with
        (Row Count) for the key operators. You will have to get the
        estimated number of executes from CTRL-L, because it is not repeated
        in the actual plan. Note that the actual rowcount already accounts
        for the number of executes, so don't multiply it in.

        Also for the key operators, look at Estimated Row Size.
        It should be the average row size of an index row, and if it is
        very far off, it can cause the wrong plan to be chosen. I
        doubt this is your problem, but it can't hurt to look.

        If you see bad estimates, let us know.

        If this is the problem, and it can't be fixed with
        statistics, you might consider trying to change the estimate
        with extra predicates that don't affect the query result.
        Changing the estimate to a smaller number can be done by
        repeating or adding predicates that have no real effect but
        which the optimizer expects will restrict the number of rows.
        Making estimates larger in the hope of leading the optimizer
        away from an index that is not helping might be possible
        by adding superfluous OR conditions, or with a non-SARGable
        predicate that cannot use statistics and will use a generic
        estimate.
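
        Purely as a hypothetical illustration on the query in this thread
        (untested against your data):

        select * from loan
        inner join loan_history lh on loan.exloan_id = lh.exloan_id
        where loan.deal_no = 'ML4W1'
        and lh.time_period = 196
        and lh.time_period + 0 = 196   -- non-SARGable duplicate: same rows,
                                       -- but costed with a generic estimate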

        Perhaps some of this will help, in addition to the good advice
        Erland has been giving.

        Steve Kass
        Drew University

        >
        > Thanks again,
        > Seth
        >


        • sql_server_2000_user

          #19
          Re: index bloat?

          Yes, I updated statistics with a full scan, and no, I didn't update the
          loan_history statistics - this would be a really large job, and they
          haven't changed since before the behavior changed, so I didn't think
          that could be a problem.

          As an experiment, I built a table identical to loan_history but with
          the primary key reversed (still clustered). I populated the table with
          10% of the original loan_history table, and here is the simple join and
          its execution plan:

          select *
          from loan
          inner join loan_history lh
          on loan.exloan_id=lh.exloan_id
          where loan.deal_no='EQ604'
          and time_period=81

          StmtText
          |--Merge Join(Inner Join, MERGE:([loan].[EXLOAN_ID])=([lh].[EXLOAN_ID]), RESIDUAL:([lh].[EXLOAN_ID]=[loan].[EXLOAN_ID]))
               |--Bookmark Lookup(BOOKMARK:([Bmk1000]), OBJECT:([LoanPerformance].[dbo].[loan]))
               |    |--Index Seek(OBJECT:([LoanPerformance].[dbo].[loan].[IX_loan_deal_no]), SEEK:([loan].[DEAL_NO]='EQ604') ORDERED FORWARD)
               |--Compute Scalar(DEFINE:([lh].[YYYYMM]=(floor([lh].[TIME_PERIOD]/12)+1989)*100+[lh].[TIME_PERIOD]%12+1))
                    |--Clustered Index Seek(OBJECT:([LoanPerformance].[dbo].[loan_history].[PK_loan_history] AS [lh]), SEEK:([lh].[TIME_PERIOD]=81) ORDERED FORWARD)

          As you can see, it's identical to the plan generated on the original
          table when I use the option (force order) hint, which is good but still
          not as good as the nested loop join I came up with. Why can't it find
          the nested loop plan? What do I know about the table that the optimizer
          doesn't? Could this be due to the different scales of the tables?
          Perhaps somewhere between 10% and 100% it becomes more efficient to use
          my nested loop join as opposed to this merge join, and if the tables
          were the same size, I would see this?
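
          In case anyone wants to reproduce it, the experiment table was
          roughly like this (a sketch - the column types and the way I
          grabbed ~10% of the rows are simplifications, and the real
          table has many more columns):

          create table loan_history2 (
              exloan_id   int      not null,
              time_period smallint not null,
              yyyymm as (floor(time_period / 12) + 1989) * 100 + time_period % 12 + 1,
              constraint PK_loan_history2
                  primary key clustered (exloan_id, time_period)
          )

          insert into loan_history2 (exloan_id, time_period)
          select exloan_id, time_period
          from loan_history
          where exloan_id % 10 = 0   -- crude ~10% sample by key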

          I am working on the rowcount/rowsize comparisons.

          Thanks for all the great help,
          Seth


          • sql_server_2000_user

            #20
            Re: index bloat?

            Wow, this is really interesting:

            I made a spreadsheet with one row per time period, showing the join
            method, row estimates and actual row counts, execution time, etc., and
            discovered some interesting things. A key aspect of this query is how
            the number of rows in loan_history increases as the time_period
            increases, and how this causes the execution plan to change. This
            should give you some idea of how the number of records increases:

            time_period   record count
                     40            596
                     50           6178
                     60          12998
                     70          37802
                     80          65614
                     90         108234
                    100         227347
                    110         410063
                    120         801426
                    130        1404053
                    140        1957924
                    150        2487100
                    160        3473813
                    170        4435225
                    180        5622040
                    190        7571203

            Pretty much exponential. So now that you know that, here's what QA does
            with this query:

            select *
            from loan
            inner join loan_history lh
            on loan.exloan_id=lh.exloan_id
            where loan.deal_no=@deal_no
            and time_period=@time_period

            time_period join method
            39-40 loop; loan_history on top
            41-42 merge; loan on top
            43-44 merge; loan_history on top
            45-95 merge; loan on top
            96-194 loop; loan on top
            195-197 loop; loan_history on top

            It all makes a lot of sense and is very very fast *until* you hit time
            period 195, when it all goes to hell. Here are the row count estimates
            and actual values for various time periods in the above ranges (there
            are 596 rows in loan for the deal I used in my test queries):

            time_period   top est.   bottom est.   top actual   bottom actual
                     40        647             1          596             596
                     42       1089          1450          596             597
                     44       2785          1089          596             597
                     95       1089        145187          596           64429
                    194       1089          2265         2265               1
                    197          1             1    6,570,403            2265

            I'm sure you see the problem: the estimates for time_period 197 are
            nuts.

            Some of the other comparisons between estimates and actual values look
            a little funny to me, but there's one thing I noticed that left me
            completely perplexed: I thought the optimizer was basing its plan
            purely on the number of rows, but time period 197 currently has *fewer
            rows* than time period 194 (6,570,403 vs 8,260,954), yet time period
            197 uses the same join method as 195 & 196, which have 8,261,776 and
            8,273,743 rows, respectively. So how is it picking the broken plan, and
            why? Is there any way I can tell if the stats are screwed up besides
            rebuilding them and seeing if the problem goes away? It seems like
            stats were built that essentially say time periods 195, 196 and 197
            have almost no rows, since the plan is the same as that used in the
            earliest two time periods - and the rows for these time periods were
            added after the initial load of the table, when I was doing things to
            the table that almost certainly caused stats to be updated. I think I'm
            going to look into stats for loan_history, and see if I can get these
            plans back on track. But how can I avoid this in the future? Do I have
            to update stats every time I add a new time period? Is this normal?

            Thanks everybody for all your help!
            Seth


            • sql_server_2000_user

              #21
              Re: index bloat?

              Ok, well, I'm a dunce: dbcc show_statistics (loan_history,
              pk_loan_history) shows lovely stats for all values of time_period up to
              194, then nothing. I checked the stats a while ago, when I was under
              the impression that the optimizer was generating the wrong plan for all
              time periods, and I didn't notice this then.

              Updated              Rows        Rows Sampled   Steps   Density   Average key length
              Jul 1 2005 10:40AM   281035968   281035976      156     0.0       22.0

              I find it strange that the number of rows sampled is greater than the
              number of rows in the table.

              But why hasn't it updated stats? The index is clustered, so I can't
              even set statistics norecompute.

              Thanks,
              Seth


              • sql_server_2000_user

                #22
                Re: index bloat?

                Ok, I found what I was looking for: autostats is triggered after 20% of
                the rows in a table are updated, and this is why there are simply no
                stats for the last few time periods. About 56 million updates need to
                happen before autostats runs, and the rowcount is currently at 24
                million, so time to update statistics manually.
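
                For anyone curious, the counter that drives autostats
                shows up in sysindexes, so you can watch it (a sketch):

                select name, rowcnt, rowmodctr
                from sysindexes
                where id = object_id('loan_history')
                and indid in (0, 1)   -- the heap/clustered index row carries
                                      -- the table's row and modification counts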

                Thanks again,
                Seth


                • Erland Sommarskog

                  #23
                  Re: index bloat?

                  sql_server_2000_user (sethpurcell@comcast.net) writes:
                  > As an experiment, I built a table identical to loan_history but with
                  > the primary key reversed (still clustered). I populated the table with
                  > 10% of the original loan_history table, and here is the simple join and
                  > its execution plan:

                  Judging from that plan, you somehow failed to reverse the key. I mean:
                  > Seek(OBJECT:([LoanPerformance].[dbo].[loan_history].[PK_loan_history]
                  > AS [lh]), SEEK:([lh].[TIME_PERIOD]=81) ORDERED FORWARD)

                  would not be possible with a reverse key.

                  Anyway, it seems that you have performed one hell of a job to track
                  this down yourself, and found the answer. I must say that I'm full of
                  admiration for your feat. It's very unusual to see someone put so
                  much work into his problem - and also report back to the newsgroup.
                  Big thanks for doing this!
                  > It seems like stats were built that essentially say time periods 195,
                  > 196 and 197 have almost no rows, since the plan is the same as that used
                  > in the earliest two time periods - and the rows for these time periods
                  > were added after the initial load of the table, when I was doing things
                  > to the table that almost certainly caused stats to be updated. I think
                  > I'm going to look into stats for loan_history, and see if I can get
                  > these plans back on track. But how can I avoid this in the future? Do I
                  > have to update stats every time I add a new time period? Is this
                  > normal?

                  Yes, as you have found, this is kind of normal.

                  There is a problem with monotonically growing keys. In our
                  system, a colleague of mine has set up a job that reindexes
                  tables. But to keep this job from taking too long to run,
                  he uses SHOWCONTIG, and if fragmentation is moderate, he
                  skips the table.

                  Some time ago, one of our customers reported that a certain
                  function was slow, and I tracked it down to a stored
                  procedure. The customer had just gotten version 8.10 of our
                  system, and the procedure had indeed changed in that
                  version. But the strange thing was that the query that was
                  slow had not changed. And since it was a plain join of four
                  tables, I didn't really feel like tweaking it.

                  I found with the stats_date() function that the statistics
                  for one of the tables were a tad old, and an UPDATE
                  STATISTICS WITH FULLSCAN on that table and another resolved
                  the issue. These tables both have monotonically (or almost
                  monotonically) growing primary keys that are clustered. The
                  good thing with that is that you don't get fragmentation.
                  The bad thing is that statistics may not be updated if you
                  skip tables when you run your maintenance job. So my
                  colleague is now working out a strategy for how to run
                  UPDATE STATISTICS on the tables he does not reindex.
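
                  For reference, a quick way to check the statistics age per
                  index (a sketch against the SQL 2000 system tables):

                  SELECT name, STATS_DATE(id, indid) AS stats_last_updated
                  FROM sysindexes
                  WHERE id = OBJECT_ID('loan_history')
                    AND indid BETWEEN 1 AND 254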

                  But why had this happened with the new version? Well, since
                  the procedure had changed, the query plan was flushed from
                  the cache, a new plan was compiled - and, oops, bad plan.

                  And now I know why your new index opened Pandora's box:
                  creating it caused the cached plans for the query to be
                  flushed. So until you created the index, you had the old
                  statistics - and an old plan to go with them.

                  Thus, it seems that if you just don't do anything and don't
                  rock the boat, nothing will happen. But since something as
                  trivial as a server reboot flushes plans from the cache,
                  that is a fragile strategy.

                  So you should schedule UPDATE STATISTICS on the table with
                  some frequency. It does not have to be WITH FULLSCAN, but
                  you may need a higher sample percentage than the default.
                  Since you know the presumptions, you should be able to
                  monitor this and find out what works and what does not.
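
                  Something along these lines, with the percentage tuned to
                  what you observe (the 20 here is only a placeholder):

                  UPDATE STATISTICS dbo.loan_history PK_loan_history
                      WITH SAMPLE 20 PERCENT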


                  --
                  Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

                  Books Online for SQL Server SP3 at



                  • Steve Kass

                    #24
                    Re: index bloat?

                    I'm glad to hear you worked this out. Thanks
                    very much for posting so much detail, and the
                    resolution. That's nice for those of us reading
                    now and will also help anyone with a similar
                    problem who finds this thread in the future.

                    SK

                    sql_server_2000_user wrote:
                    > Ok, I found what I was looking for: autostats is triggered after 20% of
                    > the rows in a table are updated, and this is why there are simply no
                    > stats for the last few time periods. About 56 million updates need to
                    > happen before autostats runs, and the rowcount is currently at 24
                    > million, so time to update statistics manually.
                    >
                    > Thanks again,
                    > Seth
                    >


                    • sql_server_2000_user

                      #25
                      Re: index bloat?

                      Looks like you caught another one of my mistakes - when I cut and
                      pasted my query I forgot to change the table name to loan_history2. I
                      investigated this (with the correct query this time) and the join
                      method the optimizer picks is the nested loops with two index seeks,
                      independent of the chosen time period. This makes sense, because the
                      number of rows for any given value of the first column of the index
                      (now exloan_id) is now basically constant, not varying by four orders
                      of magnitude the way it does with the time_period-based index.

                      Ah, the query plan cache - this is another thing I poked around in
                      early on and then decided it wasn't the problem, hahaha. Thanks for
                      explaining this to me, it makes perfect sense now. It would have been
                      awful to have had this happen for the first time after some maintenance
                      reboot of the server - I'm glad it happened beforehand and I'm on top
                      of it now. I will consider where to go from here, as far as reindexing
                      to defragment, setting up a plan to update statistics after loading new
                      data, etc.

                      The statistics are updating as I write this - and have been for the
                      past few hours.

                      Thanks for everything,
                      Seth


                      • Erland Sommarskog

                        #26
                        Re: index bloat?

                        sql_server_2000_user (sethpurcell@comcast.net) writes:
                        > Ah, the query plan cache - this is another thing I poked around in
                        > early on and then decided it wasn't the problem, hahaha.

                        I also feel kind of stupid that I did not think of this earlier. That
                        would have led us faster to the outdated statistics.

                        Oh well, while it's a whole lot of work, exercises like this one
                        are good lessons for the future. (What I didn't say in my war story
                        was that a second customer ran into the same problem when they later
                        got 8.10. That time I could just tell our helpdesk folks "try this".)



                        --
                        Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

                        Books Online for SQL Server SP3 at


