How does DB2's estimate size tool calculate avg row size when the table is empty?

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • sql-db2-dba

    How does DB2's estimate size tool calculate avg row size when the table is empty?

    Does DB2 just fudge it when it is an empty table? Is there a "formula"
    for average row size when you have variable length records. Or you
    really have to know what your application is packing into those
    varchar columns.

    Bill Leung
    leungb@aptea.co m
  • Blair Adamache

    #2
    Re: How does DB2's estimate size tool calculate avg row size whenthe table is empty?

    Until you've run runstats, most of the stats will = -1 - which is saying
    DB2 does not know. So, no fudging or guessing - the optimizer just
    proceeds on assuming that the data is unavailable. Example: cardinality
    (number of rows in the table). For a new table, DB2 could assume 0 or 1
    - but the -1 value for cardinality says DB2 does not know - maybe the
    table was created with an import statement that populated the table with
    10,000 rows.

    sql-db2-dba wrote:
    [color=blue]
    > Does DB2 just fudge it when it is an empty table? Is there a "formula"
    > for average row size when you have variable length records. Or you
    > really have to know what your application is packing into those
    > varchar columns.
    >
    > Bill Leung
    > leungb@aptea.co m[/color]

    Comment

    • Amit

      #3
      Re: How does DB2's estimate size tool calculate avg row size when the table is empty?

      For most datatypes, this is calcuated by adding AVGCOLLEN in
      syscat.columns (which is updated when you RUNSTATS). Long columns
      space usage is based on the defined size of long columns. And on top
      of everything else a 10 byte row overhead is added. So even though the
      table may be empty, average row length will have some value.
      The query below calculates the avg row length for my SQL version of
      reorgchk - I use it in a common table expression in my SQL version of
      REORGCHK, the results it produces match those from reorgchk in DB2 v8
      fp 4.
      Btw, I've noticed in the past that some of the calculations for
      REORGCHK output are done differently from size estimator...loo ks like
      the two tools were developed independent of each other

      SELECT TABSCHEMA, TABNAME,
      CASE
      WHEN MIN (
      CASE
      WHEN A.LOGGED<>''
      THEN 0
      ELSE A.COLCARD
      END) >= 0
      THEN (SUM(
      CASE
      WHEN A.LOGGED = ''
      THEN A.AVGCOLLEN
      WHEN A.LENGTH <= 1024
      THEN 72
      WHEN A.LENGTH <= 8192
      THEN 96
      WHEN A.LENGTH <= 65536
      THEN 120
      WHEN A.LENGTH <= 524000
      THEN 144
      WHEN A.LENGTH <= 4190000
      THEN 168
      WHEN A.LENGTH <= 134000000
      THEN 200
      WHEN A.LENGTH <= 536000000
      THEN 224
      WHEN A.LENGTH <= 1070000000
      THEN 256
      WHEN A.LENGTH <= 1470000000
      THEN 280
      WHEN A.LENGTH <= 2147483647
      THEN 316
      END +
      CASE
      WHEN LOGGED <> ''
      AND NULLS = 'Y'
      THEN 1
      ELSE 0
      END ) + 10 )
      ELSE -1
      END AS AVGROWLEN
      FROM SYSCAT.COLUMNS A
      GROUP BY TABSCHEMA, TABNAME

      Comment

      • Pierre Saint-Jacques

        #4
        Re: How does DB2's estimate size tool calculate avg row size when the table is empty?

        Doen't the optimizer use "guesstimat es" in this case?
        If you run an explain on select * from tablename, if you have no data and no
        stats, visual explain will show you default values that it calculates.
        These are not stored anywhere but generated when you explain as the
        optimizer has nothing to work with. So I would guess in this case it does
        some fudging because, when looking at the output of show statistics, there
        are values!!
        Does it make sense, Pierre.

        "Blair Adamache" <badamache@2muc hspam.yahoo.com > a écrit dans le message de
        news:c0hfbq$9o9 $1@hanover.toro lab.ibm.com...[color=blue]
        > Until you've run runstats, most of the stats will = -1 - which is saying
        > DB2 does not know. So, no fudging or guessing - the optimizer just
        > proceeds on assuming that the data is unavailable. Example: cardinality
        > (number of rows in the table). For a new table, DB2 could assume 0 or 1
        > - but the -1 value for cardinality says DB2 does not know - maybe the
        > table was created with an import statement that populated the table with
        > 10,000 rows.
        >
        > sql-db2-dba wrote:
        >[color=green]
        > > Does DB2 just fudge it when it is an empty table? Is there a "formula"
        > > for average row size when you have variable length records. Or you
        > > really have to know what your application is packing into those
        > > varchar columns.
        > >
        > > Bill Leung
        > > leungb@aptea.co m[/color]
        >[/color]

        Comment

        • sql-db2-dba

          #5
          Re: How does DB2's estimate size tool calculate avg row size when the table is empty?

          How did you come up with those numbers in your case statement for long
          columns? I have a table with a CLOB ( 1M ) column. If I do a
          AVG(LENGTH(CLOB _COL)) I get about 2000. But if I were to use your
          formula, it would be 168.

          Bill Leung
          leungb@aptea.co m

          aamit@hotmail.c om (Amit) wrote in message news:<85e31e0f. 0402122150.58a6 433e@posting.go ogle.com>...[color=blue]
          > For most datatypes, this is calcuated by adding AVGCOLLEN in
          > syscat.columns (which is updated when you RUNSTATS). Long columns
          > space usage is based on the defined size of long columns. And on top
          > of everything else a 10 byte row overhead is added. So even though the
          > table may be empty, average row length will have some value.
          > The query below calculates the avg row length for my SQL version of
          > reorgchk - I use it in a common table expression in my SQL version of
          > REORGCHK, the results it produces match those from reorgchk in DB2 v8
          > fp 4.
          > Btw, I've noticed in the past that some of the calculations for
          > REORGCHK output are done differently from size estimator...loo ks like
          > the two tools were developed independent of each other
          >
          > SELECT TABSCHEMA, TABNAME,
          > CASE
          > WHEN MIN (
          > CASE
          > WHEN A.LOGGED<>''
          > THEN 0
          > ELSE A.COLCARD
          > END) >= 0
          > THEN (SUM(
          > CASE
          > WHEN A.LOGGED = ''
          > THEN A.AVGCOLLEN
          > WHEN A.LENGTH <= 1024
          > THEN 72
          > WHEN A.LENGTH <= 8192
          > THEN 96
          > WHEN A.LENGTH <= 65536
          > THEN 120
          > WHEN A.LENGTH <= 524000
          > THEN 144
          > WHEN A.LENGTH <= 4190000
          > THEN 168
          > WHEN A.LENGTH <= 134000000
          > THEN 200
          > WHEN A.LENGTH <= 536000000
          > THEN 224
          > WHEN A.LENGTH <= 1070000000
          > THEN 256
          > WHEN A.LENGTH <= 1470000000
          > THEN 280
          > WHEN A.LENGTH <= 2147483647
          > THEN 316
          > END +
          > CASE
          > WHEN LOGGED <> ''
          > AND NULLS = 'Y'
          > THEN 1
          > ELSE 0
          > END ) + 10 )
          > ELSE -1
          > END AS AVGROWLEN
          > FROM SYSCAT.COLUMNS A
          > GROUP BY TABSCHEMA, TABNAME[/color]

          Comment

          • Amit

            #6
            Re: How does DB2's estimate size tool calculate avg row size when the table is empty?

            The table only stores the descriptors for LOB columns. These
            descriptors point to the location of the LOB data. The size of the row
            includes the size of the descriptor and not the actual LOB data. The
            Descriptor sizes are given in DB2 SQL Reference under CREATE TABLE
            section.


            Maximum LOB Length LOB Descriptor Size
            1 024 72
            8 192 96
            65 536 120
            524 000 144
            4 190 000 168
            134 000 000 200
            536 000 000 224
            1 070 000 000 256
            1 470 000 000 280
            2 147 483 647 316




            leungb@aptea.co m (sql-db2-dba) wrote in message news:<c8b20941. 0402131237.1d77 91@posting.goog le.com>...[color=blue]
            > How did you come up with those numbers in your case statement for long
            > columns? I have a table with a CLOB ( 1M ) column. If I do a
            > AVG(LENGTH(CLOB _COL)) I get about 2000. But if I were to use your
            > formula, it would be 168.
            >
            > Bill Leung
            > leungb@aptea.co m
            >
            > aamit@hotmail.c om (Amit) wrote in message news:<85e31e0f. 0402122150.58a6 433e@posting.go ogle.com>...[color=green]
            > > For most datatypes, this is calcuated by adding AVGCOLLEN in
            > > syscat.columns (which is updated when you RUNSTATS). Long columns
            > > space usage is based on the defined size of long columns. And on top
            > > of everything else a 10 byte row overhead is added. So even though the
            > > table may be empty, average row length will have some value.
            > > The query below calculates the avg row length for my SQL version of
            > > reorgchk - I use it in a common table expression in my SQL version of
            > > REORGCHK, the results it produces match those from reorgchk in DB2 v8
            > > fp 4.
            > > Btw, I've noticed in the past that some of the calculations for
            > > REORGCHK output are done differently from size estimator...loo ks like
            > > the two tools were developed independent of each other
            > >
            > > SELECT TABSCHEMA, TABNAME,
            > > CASE
            > > WHEN MIN (
            > > CASE
            > > WHEN A.LOGGED<>''
            > > THEN 0
            > > ELSE A.COLCARD
            > > END) >= 0
            > > THEN (SUM(
            > > CASE
            > > WHEN A.LOGGED = ''
            > > THEN A.AVGCOLLEN
            > > WHEN A.LENGTH <= 1024
            > > THEN 72
            > > WHEN A.LENGTH <= 8192
            > > THEN 96
            > > WHEN A.LENGTH <= 65536
            > > THEN 120
            > > WHEN A.LENGTH <= 524000
            > > THEN 144
            > > WHEN A.LENGTH <= 4190000
            > > THEN 168
            > > WHEN A.LENGTH <= 134000000
            > > THEN 200
            > > WHEN A.LENGTH <= 536000000
            > > THEN 224
            > > WHEN A.LENGTH <= 1070000000
            > > THEN 256
            > > WHEN A.LENGTH <= 1470000000
            > > THEN 280
            > > WHEN A.LENGTH <= 2147483647
            > > THEN 316
            > > END +
            > > CASE
            > > WHEN LOGGED <> ''
            > > AND NULLS = 'Y'
            > > THEN 1
            > > ELSE 0
            > > END ) + 10 )
            > > ELSE -1
            > > END AS AVGROWLEN
            > > FROM SYSCAT.COLUMNS A
            > > GROUP BY TABSCHEMA, TABNAME[/color][/color]

            Comment

            • sql-db2-dba

              #7
              Re: How does DB2's estimate size tool calculate avg row size when the table is empty?

              Thanks Amit.

              Not to beat this to a dead horse, but IBM probably should have left it
              to be minus 1 or 0 when there is zero row in the table. I have
              descibed my CLOB table below. To use Amit's calculation, it would
              yeild a AVGROLLEN of 210 or 211 (13+4+168+10)+1 0. Control Center's
              estimator gives 193. I am beginning to think that it might have
              factored in the "Nulls" attribute. Perhaps nullable columns would
              account for 80% or 90% of the allowable col-length. Unless Toronto
              developers divulge their formula, I am not convinced.


              $ db2 "describe table TNX_DOC"

              Column Type
              name name Length Scale Nulls
              --------------------------- ------------------ -------- ----- ------
              TNX_ID VARCHAR 13 0 No
              TNX_DOC_CNT INTEGER 4 0 No
              TNX_PY_CLB CLOB 1048576 0 Yes
              TNX_INS_DTTIME TIMESTAMP 10 0 Yes

              4 record(s) selected.

              $ db2 "select SUBSTR(TABSCHEM A,1,8) as tabschema,
              SUBSTR(TABNAME, 1,16) as tabname, SUBSTR(COLNAME, 1,16) as colname,
              LOGGED, AVGCOLLEN from SYSCAT.COLUMNS WHERE TABSCHEMA='LDEV ' and
              TABNAME='TNX_DO C'" | pg

              TABSCHEMA TABNAME COLNAME LOGGED AVGCOLLEN
              --------- ------------ ---------------- ------ ---------
              LDEV TNX_DOC TNX_ID 17
              LDEV TNX_DOC TNX_INS_DTTIME 11
              LDEV TNX_DOC TNX_PY_CLB Y -1
              LDEV TNX_DOC TNX_DOC_CNT 4

              4 record(s) selected.




              aamit@hotmail.c om (Amit) wrote in message news:<85e31e0f. 0402160906.5e78 01de@posting.go ogle.com>...[color=blue]
              > The table only stores the descriptors for LOB columns. These
              > descriptors point to the location of the LOB data. The size of the row
              > includes the size of the descriptor and not the actual LOB data. The
              > Descriptor sizes are given in DB2 SQL Reference under CREATE TABLE
              > section.
              >
              >
              > Maximum LOB Length LOB Descriptor Size
              > 1 024 72
              > 8 192 96
              > 65 536 120
              > 524 000 144
              > 4 190 000 168
              > 134 000 000 200
              > 536 000 000 224
              > 1 070 000 000 256
              > 1 470 000 000 280
              > 2 147 483 647 316
              >
              >
              >
              >
              > leungb@aptea.co m (sql-db2-dba) wrote in message news:<c8b20941. 0402131237.1d77 91@posting.goog le.com>...[color=green]
              > > How did you come up with those numbers in your case statement for long
              > > columns? I have a table with a CLOB ( 1M ) column. If I do a
              > > AVG(LENGTH(CLOB _COL)) I get about 2000. But if I were to use your
              > > formula, it would be 168.
              > >
              > > Bill Leung
              > > leungb@aptea.co m
              > >
              > > aamit@hotmail.c om (Amit) wrote in message news:<85e31e0f. 0402122150.58a6 433e@posting.go ogle.com>...[color=darkred]
              > > > For most datatypes, this is calcuated by adding AVGCOLLEN in
              > > > syscat.columns (which is updated when you RUNSTATS). Long columns
              > > > space usage is based on the defined size of long columns. And on top
              > > > of everything else a 10 byte row overhead is added. So even though the
              > > > table may be empty, average row length will have some value.
              > > > The query below calculates the avg row length for my SQL version of
              > > > reorgchk - I use it in a common table expression in my SQL version of
              > > > REORGCHK, the results it produces match those from reorgchk in DB2 v8
              > > > fp 4.
              > > > Btw, I've noticed in the past that some of the calculations for
              > > > REORGCHK output are done differently from size estimator...loo ks like
              > > > the two tools were developed independent of each other
              > > >
              > > > SELECT TABSCHEMA, TABNAME,
              > > > CASE
              > > > WHEN MIN (
              > > > CASE
              > > > WHEN A.LOGGED<>''
              > > > THEN 0
              > > > ELSE A.COLCARD
              > > > END) >= 0
              > > > THEN (SUM(
              > > > CASE
              > > > WHEN A.LOGGED = ''
              > > > THEN A.AVGCOLLEN
              > > > WHEN A.LENGTH <= 1024
              > > > THEN 72
              > > > WHEN A.LENGTH <= 8192
              > > > THEN 96
              > > > WHEN A.LENGTH <= 65536
              > > > THEN 120
              > > > WHEN A.LENGTH <= 524000
              > > > THEN 144
              > > > WHEN A.LENGTH <= 4190000
              > > > THEN 168
              > > > WHEN A.LENGTH <= 134000000
              > > > THEN 200
              > > > WHEN A.LENGTH <= 536000000
              > > > THEN 224
              > > > WHEN A.LENGTH <= 1070000000
              > > > THEN 256
              > > > WHEN A.LENGTH <= 1470000000
              > > > THEN 280
              > > > WHEN A.LENGTH <= 2147483647
              > > > THEN 316
              > > > END +
              > > > CASE
              > > > WHEN LOGGED <> ''
              > > > AND NULLS = 'Y'
              > > > THEN 1
              > > > ELSE 0
              > > > END ) + 10 )
              > > > ELSE -1
              > > > END AS AVGROWLEN
              > > > FROM SYSCAT.COLUMNS A
              > > > GROUP BY TABSCHEMA, TABNAME[/color][/color][/color]

              Comment

              • Amit

                #8
                Re: How does DB2's estimate size tool calculate avg row size when the table is empty?

                For your example, DB2 RUNSTATS sets the avg col lengths to 4,4,-1,11
                even when the table is empty. My query is based on avg col len and
                will yield 198 ( 5 more than control center).
                My query is actually a subquery I use for doing REORGCHK via sql.
                Using the results the avg row length query gives, I can mimic the
                results of TSIZE(estimated table size = avg row length * cardinality)
                in DB2 REORGCHK utility. I have no clue why IBM chose to use different
                formulas for Control Center and REORGCHK. I used the REORGCHK formula
                since it is documented.


                leungb@aptea.co m (sql-db2-dba) wrote in message news:<c8b20941. 0402170708.610a f48@posting.goo gle.com>...[color=blue]
                > Thanks Amit.
                >
                > Not to beat this to a dead horse, but IBM probably should have left it
                > to be minus 1 or 0 when there is zero row in the table. I have
                > descibed my CLOB table below. To use Amit's calculation, it would
                > yeild a AVGROLLEN of 210 or 211 (13+4+168+10)+1 0. Control Center's
                > estimator gives 193. I am beginning to think that it might havereorgchk
                > factored in the "Nulls" attribute. Perhaps nullable columns would
                > account for 80% or 90% of the allowable col-length. Unless Toronto
                > developers divulge their formula, I am not convinced.
                >
                >
                > $ db2 "describe table TNX_DOC"
                >
                > Column Type
                > name name Length Scale Nulls
                > --------------------------- ------------------ -------- ----- ------
                > TNX_ID VARCHAR 13 0 No
                > TNX_DOC_CNT INTEGER 4 0 No
                > TNX_PY_CLB CLOB 1048576 0 Yes
                > TNX_INS_DTTIME TIMESTAMP 10 0 Yes
                >
                > 4 record(s) selected.
                >
                > $ db2 "select SUBSTR(TABSCHEM A,1,8) as tabschema,
                > SUBSTR(TABNAME, 1,16) as tabname, SUBSTR(COLNAME, 1,16) as colname,
                > LOGGED, AVGCOLLEN from SYSCAT.COLUMNS WHERE TABSCHEMA='LDEV ' and
                > TABNAME='TNX_DO C'" | pg
                >
                > TABSCHEMA TABNAME COLNAME LOGGED AVGCOLLEN
                > --------- ------------ ---------------- ------ ---------
                > LDEV TNX_DOC TNX_ID 17
                > LDEV TNX_DOC TNX_INS_DTTIME 11
                > LDEV TNX_DOC TNX_PY_CLB Y -1
                > LDEV TNX_DOC TNX_DOC_CNT 4
                >
                > 4 record(s) selected.
                >
                >
                >
                >
                > aamit@hotmail.c om (Amit) wrote in message news:<85e31e0f. 0402160906.5e78 01de@posting.go ogle.com>...[color=green]
                > > The table only stores the descriptors for LOB columns. These
                > > descriptors point to the location of the LOB data. The size of the row
                > > includes the size of the descriptor and not the actual LOB data. The
                > > Descriptor sizes are given in DB2 SQL Reference under CREATE TABLE
                > > section.
                > >
                > >
                > > Maximum LOB Length LOB Descriptor Size
                > > 1 024 72
                > > 8 192 96
                > > 65 536 120
                > > 524 000 144
                > > 4 190 000 168
                > > 134 000 000 200
                > > 536 000 000 224
                > > 1 070 000 000 256
                > > 1 470 000 000 280
                > > 2 147 483 647 316
                > >
                > >
                > >
                > >
                > > leungb@aptea.co m (sql-db2-dba) wrote in message news:<c8b20941. 0402131237.1d77 91@posting.goog le.com>...[color=darkred]
                > > > How did you come up with those numbers in your case statement for long
                > > > columns? I have a table with a CLOB ( 1M ) column. If I do a
                > > > AVG(LENGTH(CLOB _COL)) I get about 2000. But if I were to use your
                > > > formula, it would be 168.
                > > >
                > > > Bill Leung
                > > > leungb@aptea.co m
                > > >
                > > > aamit@hotmail.c om (Amit) wrote in message news:<85e31e0f. 0402122150.58a6 433e@posting.go ogle.com>...
                > > > > For most datatypes, this is calcuated by adding AVGCOLLEN in
                > > > > syscat.columns (which is updated when you RUNSTATS). Long columns
                > > > > space usage is based on the defined size of long columns. And on top
                > > > > of everything else a 10 byte row overhead is added. So even though the
                > > > > table may be empty, average row length will have some value.
                > > > > The query below calculates the avg row length for my SQL version of
                > > > > reorgchk - I use it in a common table expression in my SQL version of
                > > > > REORGCHK, the results it produces match those from reorgchk in DB2 v8
                > > > > fp 4.
                > > > > Btw, I've noticed in the past that some of the calculations for
                > > > > REORGCHK output are done differently from size estimator...loo ks like
                > > > > the two tools were developed independent of each other
                > > > >
                > > > > SELECT TABSCHEMA, TABNAME,
                > > > > CASE
                > > > > WHEN MIN (
                > > > > CASE
                > > > > WHEN A.LOGGED<>''
                > > > > THEN 0
                > > > > ELSE A.COLCARD
                > > > > END) >= 0
                > > > > THEN (SUM(
                > > > > CASE
                > > > > WHEN A.LOGGED = ''
                > > > > THEN A.AVGCOLLEN
                > > > > WHEN A.LENGTH <= 1024
                > > > > THEN 72
                > > > > WHEN A.LENGTH <= 8192
                > > > > THEN 96
                > > > > WHEN A.LENGTH <= 65536
                > > > > THEN 120
                > > > > WHEN A.LENGTH <= 524000
                > > > > THEN 144
                > > > > WHEN A.LENGTH <= 4190000
                > > > > THEN 168
                > > > > WHEN A.LENGTH <= 134000000
                > > > > THEN 200
                > > > > WHEN A.LENGTH <= 536000000
                > > > > THEN 224
                > > > > WHEN A.LENGTH <= 1070000000
                > > > > THEN 256
                > > > > WHEN A.LENGTH <= 1470000000
                > > > > THEN 280
                > > > > WHEN A.LENGTH <= 2147483647
                > > > > THEN 316
                > > > > END +
                > > > > CASE
                > > > > WHEN LOGGED <> ''
                > > > > AND NULLS = 'Y'
                > > > > THEN 1
                > > > > ELSE 0
                > > > > END ) + 10 )
                > > > > ELSE -1
                > > > > END AS AVGROWLEN
                > > > > FROM SYSCAT.COLUMNS A
                > > > > GROUP BY TABSCHEMA, TABNAME[/color][/color][/color]

                Comment

                Working...