DB2 Health Monitor v9 fixpack 4 still can be a BIG PROBLEM

Collapse
X
 
  • Time
  • Show
Clear All
new posts
  • jbenner
    New Member
    • Mar 2008
    • 2

    DB2 Health Monitor v9 fixpack 4 still can be a BIG PROBLEM

    I have opened a PMR for this with IBM, and am not asking for advice from the DB2 DBA community. I am posting this as an FYI that DB2 Health Monitor, even at the latest version of DB2, still can cause huge problems with slow connect times and heavy resource locking in high concurrency / high transaction volume environments. I have an OLTP with 30-90 transactions per second activity, and start of Health Monitor every 2 hours was crashing our application. We only resolved the situation by setting DBM CFG parameter HEALTH_MON = OFF.

    Note that other thescripts reports show problems with db2hmon going back to version 8.2.

    My advice for high transaction volume databases is to not even think about leaving Health Monitor active. (I also see other autonomous features such as STMM giving undesirable effects, so we have deactivated self tuning memory manager as well.).

    Here is more detail, from the doc I sent IBM on 2008.03.14:

    ======

    Summary

    We are a DB2 version 9 Fixpack 4 user, 64-bit, running on a Solaris 10 zone on a T2000. The production application is very high transaction volume, 20-90 transactions per second.

    We have experienced three separate errors with DB2 Health Monitor that have forced us to deactivate it entirely. To summarize they are:

    1. In high transaction environment, every time DB2 Health Monitor runs (every 2 hours) causes heavy new connection slowdown and heavy table locking.
    2. In one server, DB2 Health Monitor crashes every time it runs with a sqlerRemoveAllI PCforRow violation. When it crashes it causes heavy connection slowdowns and heavy table locking which result in errors for the heavy concurrency user application.
    3. DB2 Health Monitor CALL REORGCHK_IX_STA TS fails with a SQLCODE “-2306” even though the table exists.

    Detail

    Problem #1 Health Monitor causes heavy locking and new connections slowdown

    We originally opened this PMR to request IBM’s help in determining why our high transaction volume application was experiencing strong connection slowdowns every 2 hours. IBM's engineer said that it did not seem to be a DB2 process that was causing the problem. I did decide to fix MAXLOCKS and that made a marginal improvement, but the problem continued.

    Yesterday, realizing that DB2HMON runs every 2 hours, we deactivated DB2HMON ( UPDATE DBM CFG USING HEALTH_MON OFF ) and the connection slowdown/locking related application failures immediately ended. It is obvious that the problem was entirely caused by Health Monitor.

    Problem #2 Health Monitor crashes with IPC Resource Error

    On the staging server (NOT the production server), every time DB2HMON attempts to run, it crashes. As a result we have decided to completely prevent DB2HMON from starting, using IBM's advice ( http://www-1.ibm.com/support/docview...id=swg21259046 ) , by setting the DB2_FMP_COMM_HE APSZ to 0. Despite the fact that we are not using Health Monitor, we would like IBM to fix this crash behavior.

    Here is an example from today’s db2diag.log on the staging database:


    Code:
    2008-03-13-00.18.44.059066-300 E876A408           LEVEL: Warning
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, routine_infrastructure, sqlerReturnFmpToPool, probe:999
    DATA #1 : String, 22 bytes
    Removing FMP from pool
    DATA #2 : Hexdump, 16 bytes
    0xFFFFFFFF7FFFE160 : 0000 0000 0000 0000 0000 084F 0000 0000    ...........O....
    
    2008-03-13-00.18.44.076838-300 E1285A456          LEVEL: Error
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:10
    DATA #1 : String, 32 bytes
    Freeing IPC resource explicitly:
    DATA #2 : SQLO_PID, PD_TYPE_SQLO_PID, 4 bytes
    2127
    DATA #3 : Hexdump, 4 bytes
    0x0000000200447FE0 : 0000 0000                                  ....
    
    2008-03-13-00.18.44.077523-300 E1742A345          LEVEL: Error
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:20
    DATA #1 : String, 22 bytes
    IPC resources Address:
    DATA #2 : Pointer, 8 bytes
    0x0000000210010080
    
    2008-03-13-00.18.44.078221-300 E2088A1046         LEVEL: Error
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:30
    DATA #1 : String, 29 bytes
    Number of IPC resource found:
    DATA #2 : signed integer, 4 bytes
    1
    DATA #3 : String, 29 bytes
    Number of IPC resource freed:
    DATA #4 : signed integer, 4 bytes
    1
    CALLSTCK:
      [0] 0xFFFFFFFF7C88B804 __1cXsqlerRemoveAllIPCforRow6FpnLsqlerFmpRow_b_i_ + 0x89C
      [1] 0xFFFFFFFF7C881A68 __1cXsqlerRemoveFmpFromTable6FpnLsqlerFmpRow_b_i_ + 0x220
      [2] 0xFFFFFFFF7C884A6C __1cUsqlerReturnFmpToPool6FccpnOsqlerFmpHandle_pnNsqle_agent_cb__i_ + 0x15DC
      [3] 0x000000010000AD9C __1cOsqleRunSysCtlr6F_i_ + 0x3D4
      [4] 0x0000000100006A64 __1cLsqleSysCtlr6F_i_ + 0x384
      [5] 0xFFFFFFFF7AF75CDC __1cHDBGTerm6F_i_ + 0x1F94
      [6] 0xFFFFFFFF7AF76984 sqloRunInstance + 0x5E4
      [7] 0x0000000100006524 main + 0x924
      [8] 0x0000000100004E5C _start + 0x17C
      [9] 0x0000000000000000 ?unknown + 0x0
    
    2008-03-13-00.18.44.091581-300 I3135A433          LEVEL: Severe
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:15
    DATA #1 : Hexdump, 31 bytes
    0x000000010000DB20 : 4865 616C 7468 204D 6F6E 6974 6F72 2050    Health Monitor P
    0x000000010000DB30 : 726F 6365 7373 2063 7261 7368 6564 2E      rocess crashed.
    
    2008-03-13-00.18.44.092207-300 I3569A340          LEVEL: Severe
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:16
    DATA #1 : Hexdump, 4 bytes
    0xFFFFFFFF7FFFE170 : 0000 084F                                  ...O
    
    2008-03-13-00.18.44.092769-300 I3910A340          LEVEL: Severe
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:17
    DATA #1 : Hexdump, 4 bytes
    0xFFFFFFFF7FFFE174 : 0000 0101                                  ....
    
    2008-03-13-00.18.44.093349-300 I4251A340          LEVEL: Severe
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:18
    DATA #1 : Hexdump, 4 bytes
    0xFFFFFFFF7FFFE178 : FFFF FFFF                                  ....
    
    2008-03-13-00.18.44.110896-300 I4592A282          LEVEL: Warning
    PID     : 10499                TID  : 1           PROC : db2sysc 0
    INSTANCE: db2                  NODE : 000
    FUNCTION: DB2 UDB, base sys utilities, sqleRunSysCtlr, probe:63
    MESSAGE : Health Monitor Process restarted.
    Problem #3 Health Monitor fails incorrectly with SQLCODE “-2306”


    When I was running an event monitor to diagnose “Problem #1” above-referenced in this document, I noticed that Health Monitor was running CALL REORGCHK_IX_STA TS with a valid table name, for example:

    CALL REORGCHK_IX_STA TS ('T', '"NEAT "."LOCATION_SET S_LOG"')

    And returning SQLCODE -2306, “The table or index name does not exist” even though the table NEAT.LOCATION_S ETS_LOG does exist.

    Here is event monitor output illustrating the problem:

    Code:
    89214) Statement Event ...
      Appl Handle: 587
      Appl Id: *LOCAL.db2.080311150558
      Appl Seq number: 00006
    
      Record is the result of a flush: FALSE
      -------------------------------------------
      Type     : Dynamic
      Operation: Execute
      Section  : 6
      Creator  : NULLID
      Package  : SYSSH200
      Consistency Token  : SYSLVL01
      Package Version ID  :
      Cursor   : SQL_CURSH200C6
      Cursor was blocking: FALSE
      Text     : CALL REORGCHK_IX_STATS ('T', '"NEAT    "."FLIGHT_FARE_LOG"')
      -------------------------------------------
      Start Time: 03/11/2008 10:05:57.563826
      Stop Time:  03/11/2008 10:05:57.713239
      Elapsed Execution Time:  0.149413 seconds
      Number of Agents created: 1
      User CPU: 0.000000 seconds
      System CPU: 0.000000 seconds
      Fetch Count: 0
      Sorts: 1
      Total sort time: 0
      Sort overflows: 0
      Rows read: 0
      Rows written: 0
      Internal rows deleted: 0
      Internal rows updated: 0
      Internal rows inserted: 0
      Bufferpool data logical reads: 0
      Bufferpool data physical reads: 0
      Bufferpool temporary data logical reads: 0
      Bufferpool temporary data physical reads: 0
      Bufferpool index logical reads: 0
      Bufferpool index physical reads: 0
      Bufferpool temporary index logical reads: 0
      Bufferpool temporary index physical reads: 0
      Bufferpool xda logical page reads: 0
      Bufferpool xda physical page reads: 0
      Bufferpool temporary xda logical page reads: 0
      Bufferpool temporary xda physical page reads: 0
      SQLCA:
       sqlcode: -2306
       sqlstate:
    The matching db2diag.log output for this event was:

    Code:
    2008-03-11-10.05.57.713912-300 I276926A452        LEVEL: Error
    PID     : 18727                TID  : 26          PROC : db2acd 0
    INSTANCE: db2                  NODE : 000
    APPID   : *LOCAL.db2.080311150558
    FUNCTION: DB2 UDB, Automatic Table Maintenance, db2HmonEvalReorg, probe:515
    MESSAGE : Failed to evaluate table "NEAT    "."FLIGHT_FARE_LOG". Probe 400
    DATA #1 : Hexdump, 4 bytes
    0xFFFFFFFF6F53F470 : FFFF F6FE                                  ....
    Last edited by docdiesel; Mar 14 '08, 04:50 PM. Reason: Added code tags
  • docdiesel
    Recognized Expert Contributor
    • Aug 2007
    • 297

    #2
    Hi jbenner,

    thank you very much for sharing your experiences with the health monitor. I'm sure this may help some people out there. I for myself am still looking forward to migrate to v9.1. ;-) By the way, I took the liberty to add some code tags (to be set with the # button in the editor) to the logs for better readability.

    Regards,

    Bernd

    Comment

    • jbenner
      New Member
      • Mar 2008
      • 2

      #3
      You are welcome! Thanks for helping me with the post readability.

      Comment

      • dbpete
        New Member
        • Apr 2008
        • 1

        #4
        Thanks for this information on Health Monitor!

        We are currently migrating from DB2 v8 on Solaris 8 to DB2 v9.1 on Solaris 10 and at the same time exploit the virtualization features of Solaris 10 (Zone concept). We first wanted to directly use DB2 9.5 but we ran into several problems. Thus we decided to go for v9.1, Fixpack 4. Still quite a step!

        What makes things a bit more tricky is that for development and education platforms, we intend to use Sun T5120 servers. For these ones, we ran into nasty problems including intermittent instance crashes. It took us much time, firmware updates and fixes from Sund and IBM until things started running smoother. Please be aware of such problems if you intend to use this new Sun hardware for DB2.

        We encounter the same problems with health monitor to what concerns the db2HmonEvalReor g error on certain tables. We have no clue yet why health monitor can evaluate some of the tables and others not.

        We haven't used health monitor up to now, but intend to do so for the new version. Therefore it's good to know that it may cause such heavy performance problems - we can keep an eye on this.

        And of course I will appreciate any hint on which health indicators are critical in terms of resource consumption.

        Thanks,
        pete

        Comment

        • SuperKuper
          New Member
          • Apr 2008
          • 7

          #5
          While we are on the subject ... When in DB2 V9 a database is explicitly activated, two connections show up by default. One is STMM and the other I do not remember what. Do you know how to turn off both. You already mentioned that you turned off STMM. How did you do that? Thx

          Comment

          • Shashank1984
            New Member
            • Jul 2007
            • 26

            #6
            Superkuper,

            Regarding the new db2 agents...i.e.db 2stmm and db2taskd..
            New DB2 agents have been introduced in DB2 Version 9.1: db2stmm and db2taskd. The db2stmm agent is part of the new self-tuning memory feature added in DB2 Version 9.1. The db2taskd agent is an internal daemon that distributes background database tasks. These new agents remain connected to the database at all times and are not active if the database is activated in exclusive mode.
            Starting with Fix Pack 3, agents for active event monitors are also connected to new databases by default.

            Regarding disabling of STMM...
            Self tuning can be disabled for the entire database by setting self_tuning_mem to OFF. When self_tuning_mem is set to OFF, the memory configuration parameters and buffer pools that are set to AUTOMATIC remain AUTOMATIC and the memory areas remain at their current size.

            You can set self_tuning_mem to OFF using the UPDATE DATABASE CONFIGURATION command, the SQLFUPD API, or through the Change Database Configuration Parameter window in the Control Center.

            Self tuning can also be effectively deactivated for the entire database if only a single memory consumer is enabled for self tuning. This is because memory cannot be redistributed when only one memory area is enabled.

            Thanks,
            Shashank Kharche
            IBM.
            DB2 for Linux, Unix & Windows -Information Management Software

            Comment

            Working...