I have opened a PMR for this with IBM, and am not asking for advice from the DB2 DBA community. I am posting this as an FYI that DB2 Health Monitor, even at the latest version of DB2, still can cause huge problems with slow connect times and heavy resource locking in high concurrency / high transaction volume environments. I have an OLTP with 30-90 transactions per second activity, and start of Health Monitor every 2 hours was crashing our application. We only resolved the situation by setting DBM CFG parameter HEALTH_MON = OFF.
Note that other thescripts reports show problems with db2hmon going back to version 8.2.
My advice for high transaction volume databases is to not even think about leaving Health Monitor active. (I also see other autonomous features such as STMM giving undesirable effects, so we have deactivated self tuning memory manager as well.).
Here is more detail, from the doc I sent IBM on 2008.03.14:
======
Summary
We are a DB2 version 9 Fixpack 4 user, 64-bit, running on a Solaris 10 zone on a T2000. The production application is very high transaction volume, 20-90 transactions per second.
We have experienced three separate errors with DB2 Health Monitor that have forced us to deactivate it entirely. To summarize they are:
1. In high transaction environment, every time DB2 Health Monitor runs (every 2 hours) causes heavy new connection slowdown and heavy table locking.
2. In one server, DB2 Health Monitor crashes every time it runs with a sqlerRemoveAllI PCforRow violation. When it crashes it causes heavy connection slowdowns and heavy table locking which result in errors for the heavy concurrency user application.
3. DB2 Health Monitor CALL REORGCHK_IX_STA TS fails with a SQLCODE “-2306” even though the table exists.
Detail
Problem #1 Health Monitor causes heavy locking and new connections slowdown
We originally opened this PMR to request IBM’s help in determining why our high transaction volume application was experiencing strong connection slowdowns every 2 hours. IBM's engineer said that it did not seem to be a DB2 process that was causing the problem. I did decide to fix MAXLOCKS and that made a marginal improvement, but the problem continued.
Yesterday, realizing that DB2HMON runs every 2 hours, we deactivated DB2HMON ( UPDATE DBM CFG USING HEALTH_MON OFF ) and the connection slowdown/locking related application failures immediately ended. It is obvious that the problem was entirely caused by Health Monitor.
Problem #2 Health Monitor crashes with IPC Resource Error
On the staging server (NOT the production server), every time DB2HMON attempts to run, it crashes. As a result we have decided to completely prevent DB2HMON from starting, using IBM's advice ( http://www-1.ibm.com/support/docview...id=swg21259046 ) , by setting the DB2_FMP_COMM_HE APSZ to 0. Despite the fact that we are not using Health Monitor, we would like IBM to fix this crash behavior.
Here is an example from today’s db2diag.log on the staging database:
Problem #3 Health Monitor fails incorrectly with SQLCODE “-2306”
When I was running an event monitor to diagnose “Problem #1” above-referenced in this document, I noticed that Health Monitor was running CALL REORGCHK_IX_STA TS with a valid table name, for example:
CALL REORGCHK_IX_STA TS ('T', '"NEAT "."LOCATION_SET S_LOG"')
And returning SQLCODE -2306, “The table or index name does not exist” even though the table NEAT.LOCATION_S ETS_LOG does exist.
Here is event monitor output illustrating the problem:
The matching db2diag.log output for this event was:
Note that other thescripts reports show problems with db2hmon going back to version 8.2.
My advice for high transaction volume databases is to not even think about leaving Health Monitor active. (I also see other autonomous features such as STMM giving undesirable effects, so we have deactivated self tuning memory manager as well.).
Here is more detail, from the doc I sent IBM on 2008.03.14:
======
Summary
We are a DB2 version 9 Fixpack 4 user, 64-bit, running on a Solaris 10 zone on a T2000. The production application is very high transaction volume, 20-90 transactions per second.
We have experienced three separate errors with DB2 Health Monitor that have forced us to deactivate it entirely. To summarize they are:
1. In high transaction environment, every time DB2 Health Monitor runs (every 2 hours) causes heavy new connection slowdown and heavy table locking.
2. In one server, DB2 Health Monitor crashes every time it runs with a sqlerRemoveAllI PCforRow violation. When it crashes it causes heavy connection slowdowns and heavy table locking which result in errors for the heavy concurrency user application.
3. DB2 Health Monitor CALL REORGCHK_IX_STA TS fails with a SQLCODE “-2306” even though the table exists.
Detail
Problem #1 Health Monitor causes heavy locking and new connections slowdown
We originally opened this PMR to request IBM’s help in determining why our high transaction volume application was experiencing strong connection slowdowns every 2 hours. IBM's engineer said that it did not seem to be a DB2 process that was causing the problem. I did decide to fix MAXLOCKS and that made a marginal improvement, but the problem continued.
Yesterday, realizing that DB2HMON runs every 2 hours, we deactivated DB2HMON ( UPDATE DBM CFG USING HEALTH_MON OFF ) and the connection slowdown/locking related application failures immediately ended. It is obvious that the problem was entirely caused by Health Monitor.
Problem #2 Health Monitor crashes with IPC Resource Error
On the staging server (NOT the production server), every time DB2HMON attempts to run, it crashes. As a result we have decided to completely prevent DB2HMON from starting, using IBM's advice ( http://www-1.ibm.com/support/docview...id=swg21259046 ) , by setting the DB2_FMP_COMM_HE APSZ to 0. Despite the fact that we are not using Health Monitor, we would like IBM to fix this crash behavior.
Here is an example from today’s db2diag.log on the staging database:
Code:
2008-03-13-00.18.44.059066-300 E876A408 LEVEL: Warning PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, routine_infrastructure, sqlerReturnFmpToPool, probe:999 DATA #1 : String, 22 bytes Removing FMP from pool DATA #2 : Hexdump, 16 bytes 0xFFFFFFFF7FFFE160 : 0000 0000 0000 0000 0000 084F 0000 0000 ...........O.... 2008-03-13-00.18.44.076838-300 E1285A456 LEVEL: Error PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:10 DATA #1 : String, 32 bytes Freeing IPC resource explicitly: DATA #2 : SQLO_PID, PD_TYPE_SQLO_PID, 4 bytes 2127 DATA #3 : Hexdump, 4 bytes 0x0000000200447FE0 : 0000 0000 .... 2008-03-13-00.18.44.077523-300 E1742A345 LEVEL: Error PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:20 DATA #1 : String, 22 bytes IPC resources Address: DATA #2 : Pointer, 8 bytes 0x0000000210010080 2008-03-13-00.18.44.078221-300 E2088A1046 LEVEL: Error PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, routine_infrastructure, sqlerRemoveAllIPCforRow, probe:30 DATA #1 : String, 29 bytes Number of IPC resource found: DATA #2 : signed integer, 4 bytes 1 DATA #3 : String, 29 bytes Number of IPC resource freed: DATA #4 : signed integer, 4 bytes 1 CALLSTCK: [0] 0xFFFFFFFF7C88B804 __1cXsqlerRemoveAllIPCforRow6FpnLsqlerFmpRow_b_i_ + 0x89C [1] 0xFFFFFFFF7C881A68 __1cXsqlerRemoveFmpFromTable6FpnLsqlerFmpRow_b_i_ + 0x220 [2] 0xFFFFFFFF7C884A6C __1cUsqlerReturnFmpToPool6FccpnOsqlerFmpHandle_pnNsqle_agent_cb__i_ + 0x15DC [3] 0x000000010000AD9C __1cOsqleRunSysCtlr6F_i_ + 0x3D4 [4] 0x0000000100006A64 __1cLsqleSysCtlr6F_i_ + 0x384 [5] 0xFFFFFFFF7AF75CDC __1cHDBGTerm6F_i_ + 0x1F94 [6] 0xFFFFFFFF7AF76984 sqloRunInstance + 0x5E4 [7] 0x0000000100006524 main + 0x924 [8] 0x0000000100004E5C _start + 0x17C [9] 0x0000000000000000 ?unknown + 0x0 2008-03-13-00.18.44.091581-300 I3135A433 LEVEL: Severe PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:15 DATA #1 : Hexdump, 31 bytes 0x000000010000DB20 : 4865 616C 7468 204D 6F6E 6974 6F72 2050 Health Monitor P 0x000000010000DB30 : 726F 6365 7373 2063 7261 7368 6564 2E rocess crashed. 2008-03-13-00.18.44.092207-300 I3569A340 LEVEL: Severe PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:16 DATA #1 : Hexdump, 4 bytes 0xFFFFFFFF7FFFE170 : 0000 084F ...O 2008-03-13-00.18.44.092769-300 I3910A340 LEVEL: Severe PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:17 DATA #1 : Hexdump, 4 bytes 0xFFFFFFFF7FFFE174 : 0000 0101 .... 2008-03-13-00.18.44.093349-300 I4251A340 LEVEL: Severe PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, base sys utilities, sqleChildCrashHandler, probe:18 DATA #1 : Hexdump, 4 bytes 0xFFFFFFFF7FFFE178 : FFFF FFFF .... 2008-03-13-00.18.44.110896-300 I4592A282 LEVEL: Warning PID : 10499 TID : 1 PROC : db2sysc 0 INSTANCE: db2 NODE : 000 FUNCTION: DB2 UDB, base sys utilities, sqleRunSysCtlr, probe:63 MESSAGE : Health Monitor Process restarted.
When I was running an event monitor to diagnose “Problem #1” above-referenced in this document, I noticed that Health Monitor was running CALL REORGCHK_IX_STA TS with a valid table name, for example:
CALL REORGCHK_IX_STA TS ('T', '"NEAT "."LOCATION_SET S_LOG"')
And returning SQLCODE -2306, “The table or index name does not exist” even though the table NEAT.LOCATION_S ETS_LOG does exist.
Here is event monitor output illustrating the problem:
Code:
89214) Statement Event ...
Appl Handle: 587
Appl Id: *LOCAL.db2.080311150558
Appl Seq number: 00006
Record is the result of a flush: FALSE
-------------------------------------------
Type : Dynamic
Operation: Execute
Section : 6
Creator : NULLID
Package : SYSSH200
Consistency Token : SYSLVL01
Package Version ID :
Cursor : SQL_CURSH200C6
Cursor was blocking: FALSE
Text : CALL REORGCHK_IX_STATS ('T', '"NEAT "."FLIGHT_FARE_LOG"')
-------------------------------------------
Start Time: 03/11/2008 10:05:57.563826
Stop Time: 03/11/2008 10:05:57.713239
Elapsed Execution Time: 0.149413 seconds
Number of Agents created: 1
User CPU: 0.000000 seconds
System CPU: 0.000000 seconds
Fetch Count: 0
Sorts: 1
Total sort time: 0
Sort overflows: 0
Rows read: 0
Rows written: 0
Internal rows deleted: 0
Internal rows updated: 0
Internal rows inserted: 0
Bufferpool data logical reads: 0
Bufferpool data physical reads: 0
Bufferpool temporary data logical reads: 0
Bufferpool temporary data physical reads: 0
Bufferpool index logical reads: 0
Bufferpool index physical reads: 0
Bufferpool temporary index logical reads: 0
Bufferpool temporary index physical reads: 0
Bufferpool xda logical page reads: 0
Bufferpool xda physical page reads: 0
Bufferpool temporary xda logical page reads: 0
Bufferpool temporary xda physical page reads: 0
SQLCA:
sqlcode: -2306
sqlstate:
Code:
2008-03-11-10.05.57.713912-300 I276926A452 LEVEL: Error PID : 18727 TID : 26 PROC : db2acd 0 INSTANCE: db2 NODE : 000 APPID : *LOCAL.db2.080311150558 FUNCTION: DB2 UDB, Automatic Table Maintenance, db2HmonEvalReorg, probe:515 MESSAGE : Failed to evaluate table "NEAT "."FLIGHT_FARE_LOG". Probe 400 DATA #1 : Hexdump, 4 bytes 0xFFFFFFFF6F53F470 : FFFF F6FE ....
Comment