During a manual service group failover, one of the DB2 resources faults.

Article: 100027272
Last Published: 2012-08-08
Product(s): InfoScale & Storage Foundation

Problem

The customer manually switched the service group to the other node.
One of the DB2 instance resources faulted after VCS took the grandparent
resource offline.

Error Message

>>>>>>> INSTANCE NAMES CONFIGURED ON THE SYSTEM

** "FIRST instance"
PROD_rsacmdb_instance
** "SECOND instance"
PROD_rsadb_instance
** "Third instance"
PROD_rsatldb_instance


>>>>>> FAILURE SEQUENCE

15:07:27  Initiating Offline of Resource PROD_RSADB2_IP
15:07:27  Initiating Offline of Resource PROD_db2sysc_process
15:07:28  PROD_RSADB2_IP:offline:/usr/sbin/ifconfig
15:07:29  Resource PROD_RSADB2_IP is offline on adb2rib32
15:07:31  Resource PROD_db2sysc_process is offline on adb2rib32
*
* >>> Initiating offline FIRST instance
15:07:31  Initiating Offline of Resource PROD_rsacmdb_instance on System adb2rib32
*
>>> SECOND instance reports an unexpected OFFLINE although no offline command was issued
15:07:45 VCS INFO V-16-2-13075 (adb2rib32) Resource(PROD_rsadb_instance) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
*
>>> DB2 output
15:07:58 Db2udb:PROD_rsacmdb_instance:offline:DB2 shutdown returned the output: 03/14/2012 15:07:56     0   0   SQL1032N  No start database manager command was issued.SQL1032N  No start database manager command was issued.  SQLSTATE=57019
*
>>> THIRD instance reports an unexpected OFFLINE although no offline command was issued
15:08:26 Resource(PROD_rsatldb_instance) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
*
>>> FIRST instance is offline
15:08:29 VCS INFO V-16-1-10305 Resource PROD_rsacmdb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
*
* >>> Initiating offline THIRD instance
15:08:29 Initiating Offline of Resource PROD_rsatldb_instance on System adb2rib32
*
>>> Agent calls clean for SECOND instance, which went offline on its own
15:08:44 VCS ERROR V-16-2-13067 (adb2rib32) Agent is calling clean for resource(PROD_rsadb_instance) because the resource became OFFLINE unexpectedly, on its own.
*
>>> DB2 output
15:08:53     0   0   SQL1032N  No start database manager command was issued.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019
*
>>> THIRD instance is offline
15:09:26 VCS INFO V-16-1-10305 Resource PROD_rsatldb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib00001032 (VCS initiated)
*
* >>> Initiating offline SECOND instance
2012/03/14 15:09:26 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
*
>>> DB2 output from clean
2012/03/14 15:10:51 VCS INFO V-16-2-13716 (adb2rib32) Resource(PROD_rsadb_instance): Output of the completed operation (clean)
==============================================
03/14/2012 15:08:45     0   0   SQL1032N  No start database manager command was issued.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019
/db2home/rsa01ppr/sqllib/bin/ipclean: Removing DB2 engine and client's IPC resources for rsa01ppr.
==============================================
*
>>> Clean completes
2012/03/14 15:10:51 VCS INFO V-16-2-13068 (adb2rib32) Resource(PROD_rsadb_instance) - clean completed successfully.
*
>>> SECOND instance  OFFLINE unexpectedly
2012/03/14 15:10:51 VCS ERROR V-16-2-13073 (adb2rib32) Resource(PROD_rsadb_instance) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 3) the resource.
*
>>> DB2 restarts
2012/03/14 15:11:07 VCS NOTICE V-16-20004-72 (adb2rib32) Db2udb:PROD_rsadb_instance:online:DB2 startup returned the output:
Instance  : rsa01ppr
DB2 Start : Success
Partition 0 : Success
*
>>> VCS detects SECOND instance restart
2012/03/14 15:11:39 VCS NOTICE V-16-2-13076 (adb2rib32) Agent has successfully restarted resource(PROD_rsadb_instance).
*
>>> DB2 returns SECOND instance has stopped
2012/03/14 15:11:40 VCS NOTICE V-16-20004-74 (adb2rib32) Db2udb:PROD_rsadb_instance:offline:DB2 shutdown returned the output: 03/14/2012 15:11:40     0   0   SQL1064N  DB2STOP processing was successful.
SQL1064N  DB2STOP processing was successful.
*
>>> VCS sees SECOND instance as offline
2012/03/14 15:12:11 VCS INFO V-16-1-10305 Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
*
>>> VCS now continues on with offlining dependencies
2012/03/14 15:12:11 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_db2data1vol_mount (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/03/14 15:12:11 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_db2data2vol_mount (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/03/14 15:12:11 VCS NOTICE V-16-1-10300 Initiating Offline of Resource
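
The unexpected-OFFLINE events in a sequence like the one above can be pulled out of an engine log with a small filter. The following is a minimal Python sketch, not a supported tool. It assumes log lines in the full "YYYY/MM/DD HH:MM:SS VCS <severity> <message ID>" form and uses only the three message IDs that appear in this article; the function and variable names are hypothetical.

```python
import re

# Message IDs taken from the log excerpts in this article.
UNEXPECTED_OFFLINE_IDS = {
    "V-16-2-13075": "unexpected OFFLINE within ToleranceLimit",
    "V-16-2-13067": "agent calling clean",
    "V-16-2-13073": "agent restarting resource",
}

# Timestamp, severity, message ID, then the resource name in "Resource(...)"
# or "resource(...)" as printed by the agent framework messages.
LINE_RE = re.compile(
    r"(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) VCS \w+ "
    r"(?P<msgid>V-\d+-\d+-\d+) .*?[Rr]esource\((?P<res>[^)]+)\)"
)


def unexpected_offline_events(lines):
    """Return (timestamp, resource, description) for each matching log line."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("msgid") in UNEXPECTED_OFFLINE_IDS:
            events.append(
                (
                    match.group("ts"),
                    match.group("res"),
                    UNEXPECTED_OFFLINE_IDS[match.group("msgid")],
                )
            )
    return events
```

Passing the lines of an engine log to unexpected_offline_events() yields one tuple per event, which makes it easier to see which resource went offline without an offline command and when the agent called clean.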


 

Cause

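The DB2 agent provides HA at the instance level (not at the database level), that is, for the running "db2sysc" process. Irrespective of the number of databases created in the instance, the agent monitors only the instance-specific "db2sysc" process. When several Db2udb resources are configured for the same instance, they all monitor the same process, so once that process stops, any resource that VCS has not yet taken offline reports an unexpected OFFLINE.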

 

****

Observations:

CONFIGURATION

The customer has configured the following dependency among the DB2 resources:

// Process PROD_db2sysc_process
// {
// Db2udb PROD_rsacmdb_instance
// {
// Db2udb PROD_rsatldb_instance
// {
// Db2udb PROD_rsadb_instance

Process PROD_db2sysc_process (
ResourceOwner = rsa01ppr
PathName = db2sysc
Arguments = 0
)

Db2udb PROD_rsacmdb_instance (
DB2InstOwner = rsa01ppr
DB2InstHome = "/db2home/rsa01ppr"
DatabaseName = rsacmdb
)

Db2udb PROD_rsadb_instance (
DB2InstOwner = rsa01ppr
DB2InstHome = "/db2home/rsa01ppr"
DatabaseName = rsadb
)

Db2udb PROD_rsatldb_instance (
DB2InstOwner = rsa01ppr
DB2InstHome = "/db2home/rsa01ppr"
DatabaseName = rsatldb
)

The DB2 instance home (DB2InstHome = "/db2home/rsa01ppr") and the DB2 instance owner (DB2InstOwner = rsa01ppr) are the same for all three DB2 resources. Only the database name differs for each resource. This means that the customer has created only one DB2 instance, which contains three different databases (rsacmdb, rsadb, rsatldb). There is also a single partition/node in this configuration (node number 0).

A DB2 instance has a single "db2sysc" process associated with it. Because the DB2 agent provides HA at the instance level and monitors only that instance-specific process, all three Db2udb resources above monitor the same "db2sysc" process, irrespective of the number of databases created in the instance.
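
The observation above (several Db2udb resources pointing at one instance) can be checked mechanically. The following Python sketch groups resources by instance owner and instance home and reports any instance that is referenced by more than one resource. The attribute values are copied from the configuration shown above; the dictionary layout and function names are hypothetical and are not part of VCS.

```python
from collections import defaultdict

# Attribute values copied from the Db2udb resources shown in this article.
db2udb_resources = {
    "PROD_rsacmdb_instance": {
        "DB2InstOwner": "rsa01ppr",
        "DB2InstHome": "/db2home/rsa01ppr",
        "DatabaseName": "rsacmdb",
    },
    "PROD_rsadb_instance": {
        "DB2InstOwner": "rsa01ppr",
        "DB2InstHome": "/db2home/rsa01ppr",
        "DatabaseName": "rsadb",
    },
    "PROD_rsatldb_instance": {
        "DB2InstOwner": "rsa01ppr",
        "DB2InstHome": "/db2home/rsa01ppr",
        "DatabaseName": "rsatldb",
    },
}


def shared_instances(resources):
    """Map (owner, home) to the resources that share that DB2 instance.

    Only instances referenced by more than one resource are returned.
    """
    groups = defaultdict(list)
    for name, attrs in resources.items():
        groups[(attrs["DB2InstOwner"], attrs["DB2InstHome"])].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}
```

For the configuration in this article, shared_instances(db2udb_resources) returns a single entry for the instance owned by rsa01ppr, listing all three resources.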

LOGS:

2012/04/06 16:00:09 VCS INFO V-16-1-50135 User admin fired command: hagrp -switch RSAPASSMARK_DB2PROD_DATASG adb2rib33 localclus from ::ffff:10.10.149.100
2012/04/06 16:00:09 VCS NOTICE V-16-1-10208 Initiating switch of group RSAPASSMARK_DB2PROD_DATASG from system adb2rib32 to system adb2rib33

2012/04/06 16:00:13 VCS DBG_TRACE V-16-50-0 PROD_db2sysc_process::state transition from ONLINE to OFFLINE << Indicates that the db2sysc process has gone offline (initiated by the offline of the PROD_db2sysc_process resource).

Since the single (instance-specific) "db2sysc" process has already gone offline, the offline entry point for PROD_rsacmdb_instance prints the messages below 20 times as it repeatedly tries to stop DB2 conventionally (each attempt fails because there is no running db2sysc process to stop).
================================================
2012/04/06 16:00:13 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_rsacmdb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/04/06 16:00:14 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 19 RetryLimit left:
2012/04/06 16:00:14 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:00:13 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+
2012/04/06 16:00:15 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 18 RetryLimit left:
2012/04/06 16:00:16 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:00:15 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+
.
.
.
2012/04/06 16:00:57 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:00:57 VCS INFO V-16-2-13075 (adb2rib32) Resource(PROD_rsadb_instance) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1). <<< The monitor is scheduled for PROD_rsadb_instance. Since no db2sysc process is running, the monitor does not find db2sysc and this message is correctly printed.
.
.
2012/04/06 16:00:58 VCS DBG_3 V-16-20004-0 (adb2rib33) Db2udb:PROD_rsatldb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:17 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:18 VCS DBG_TRACE V-16-50-0 internal_is_offline res =PROD_rsacmdb_instance, sys =adb2rib32
2012/04/06 16:01:18 VCS INFO V-16-1-10305 Resource PROD_rsacmdb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
2012/04/06 16:01:18 VCS DBG_TRACE V-16-50-0 PROD_rsacmdb_instance::state transition from ONLINE to OFFLINE << Correctly printed, since the monitor does not detect the db2sysc process after the offline entry point for this resource completes. Offline is now initiated for the PROD_rsatldb_instance resource, and the same sequence repeats (20 attempts to stop DB2 conventionally).
.
.
.
2012/04/06 16:01:18 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 19 RetryLimit left:
2012/04/06 16:01:18 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:01:18 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+

2012/04/06 16:01:20 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 18 RetryLimit left:
2012/04/06 16:01:20 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:01:20 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+
.
.
.
2012/04/06 16:01:55 VCS DBG_3 V-16-20004-0 (adb2rib33) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:57 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:58 VCS ERROR V-16-2-13067 (adb2rib32) Agent is calling clean for resource(PROD_rsadb_instance) because the resource became OFFLINE unexpectedly, on its own. << Correctly printed: the clean entry point is called now that the ToleranceLimit has been exceeded.
2012/04/06 16:01:58 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:In ValidNodeNum(): NodeNum=0, NumPartitions=0
2012/04/06 16:01:58 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:In ValidNodeNum(): Partition_arr[0]=0
2012/04/06 16:01:58 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:In ValidNodeNum(): 0 is OK
2012/04/06 16:02:04 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:No db2sysc process found.
2012/04/06 16:02:22 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:monitor:No db2sysc process found.
2012/04/06 16:02:23 VCS DBG_TRACE V-16-50-0 internal_is_offline res =PROD_rsatldb_instance, sys =adb2rib32
2012/04/06 16:02:23 VCS INFO V-16-1-10305 Resource PROD_rsatldb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
2012/04/06 16:02:23 VCS DBG_TRACE V-16-50-0 PROD_rsatldb_instance::state transition from ONLINE to OFFLINE << Correctly gets printed since monitor doesn’t detect db2sysc process after completion of the offline entry point for this resource.
2012/04/06 16:02:23 VCS DBG_TRACE V-16-50-0 PROD_rsatldb_instance::istate transition from RESOURCE_W_OFFLINE_PROPAGATE to RESOURCE_W_NONE
.
.
.
2012/04/06 16:02:23 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/04/06 16:04:05 VCS INFO V-16-2-13716 (adb2rib32) Resource(PROD_rsadb_instance): Output of the completed operation (clean)
==============================================
04/06/2012 16:01:59 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
/db2home/rsa01ppr/sqllib/bin/ipclean: Removing DB2 engine and client's IPC resources for rsa01ppr. << Messages get correctly printed since the clean entry point has now executed successfully.
==============================================

2012/04/06 16:04:05 VCS INFO V-16-2-13068 (adb2rib32) Resource(PROD_rsadb_instance) - clean completed successfully.
.
.
.
2012/04/06 16:04:05 VCS ERROR V-16-2-13073 (adb2rib32) Resource(PROD_rsadb_instance) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 3) the resource. << Because RestartLimit for this resource is non-zero (the message shows attempt 1 of 3), the agent tries to bring PROD_rsadb_instance online again.
2012/04/06 16:04:05 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:
MPPSupport flag is set to 1, total number of partitions is 1
2012/04/06 16:04:05 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:In ValidNodeNum(): NodeNum=0, NumPartitions=0
2012/04/06 16:04:05 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:In ValidNodeNum(): Partition_arr[0]=0
2012/04/06 16:04:06 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:In ValidNodeNum(): 0 is OK
2012/04/06 16:04:20 VCS NOTICE V-16-20004-72 (adb2rib32) Db2udb:PROD_rsadb_instance:online:DB2 startup returned the output:
Instance : rsa01ppr
DB2 Start : Success
Partition 0 : Success
.
.
.
2012/04/06 16:04:52 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:In ValidNodeNum(): NodeNum=0, NumPartitions=0
2012/04/06 16:04:52 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:In ValidNodeNum(): Partition_arr[0]=0
2012/04/06 16:04:52 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:In ValidNodeNum(): 0 is OK
2012/04/06 16:04:53 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:get_db2gcf_status() returns a return code = 0
2012/04/06 16:04:53 VCS NOTICE V-16-2-13076 (adb2rib32) Agent has successfully restarted resource(PROD_rsadb_instance).
2012/04/06 16:04:55 VCS NOTICE V-16-20004-74 (adb2rib32) Db2udb:PROD_rsadb_instance:offline:DB2 shutdown returned the output: 04/06/2012 16:04:54 0 0 SQL1064N DB2STOP processing was successful.
SQL1064N DB2STOP processing was successful. << Finally after cleaning and restarting, it shuts down.
.
.
.
2012/04/06 16:05:01 VCS DBG_3 V-16-20004-0 (adb2rib33) Db2udb:PROD_rsacmdb_instance:monitor:No db2sysc process found.
2012/04/06 16:05:26 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:05:27 VCS DBG_TRACE V-16-50-0 internal_is_offline res =PROD_rsadb_instance, sys =adb2rib32

internal.C:[4113]
2012/04/06 16:05:27 VCS INFO V-16-1-10305 Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
2012/04/06 16:05:27 VCS DBG_TRACE V-16-50-0 PROD_rsadb_instance::state transition from ONLINE to OFFLINE << Finally the third Db2udb resource goes offline.
.
.
.
2012/04/06 16:05:35 VCS NOTICE V-16-1-10446 Group RSAPASSMARK_DB2PROD_DATASG is offline on system adb2rib32

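The logs above show why the switchover appears fine with only two Db2udb resources, or when there is no dependency among the Db2udb resources: in those cases VCS initiates the offline of every Db2udb resource before its monitor reports a second unexpected OFFLINE and exceeds ToleranceLimit(1). With three resources in a dependency chain, the last resource (PROD_rsadb_instance) must wait for two offline operations of roughly 65 seconds each, so its monitor exceeds the ToleranceLimit first, and the agent calls clean and restarts the resource.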
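
The timing argument can be illustrated with a simplified model. The Python sketch below is an assumption-laden illustration, not the agent's actual scheduling logic: it assumes the monitor reports OFFLINE at a fixed interval, that clean is called on the report after ToleranceLimit has been used up, and that the resources in a dependency chain are taken offline one after another. The numbers are read from the 2012/04/06 logs above (db2sysc offline at 16:00:13, first unexpected OFFLINE report at 16:00:57, next monitor at 16:01:57, each offline lasting about 65 seconds); the function names are hypothetical.

```python
def offline_start_times(offline_durations):
    """Seconds (after db2sysc stops) at which each chained offline begins."""
    starts = []
    elapsed = 0
    for duration in offline_durations:
        starts.append(elapsed)
        elapsed += duration
    return starts


def clean_called_before_offline(first_report, monitor_interval,
                                tolerance_limit, offline_start):
    """True if clean would be called before VCS initiates the offline.

    Assumption: the monitor reports OFFLINE at first_report and then every
    monitor_interval seconds; clean is called on report number
    tolerance_limit + 1.
    """
    clean_time = first_report + tolerance_limit * monitor_interval
    return clean_time < offline_start


# Values read from the logs in this article, in seconds after 16:00:13.
FIRST_REPORT = 44        # 16:00:57, first unexpected OFFLINE report
MONITOR_INTERVAL = 60    # 16:00:57 -> 16:01:57
TOLERANCE_LIMIT = 1      # ToleranceLimit(1) in the log message
OFFLINE_DURATION = 65    # 16:00:13 -> 16:01:18 and 16:01:18 -> 16:02:23

starts = offline_start_times([OFFLINE_DURATION] * 3)   # [0, 65, 130]
third_in_chain = clean_called_before_offline(
    FIRST_REPORT, MONITOR_INTERVAL, TOLERANCE_LIMIT, starts[2])
second_in_chain = clean_called_before_offline(
    FIRST_REPORT, MONITOR_INTERVAL, TOLERANCE_LIMIT, starts[1])
```

In this model, the third resource in the chain reaches clean (at 104 seconds) before its offline starts (at 130 seconds), while a resource that is second in the chain is taken offline at 65 seconds, before the ToleranceLimit is exceeded. This matches the behavior seen in the logs.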

Solution

Because HA is provided only at the instance level, the customer should configure a single Db2udb resource for the instance, using any one of the databases as the DatabaseName. The service group can then be switched over cleanly.

VCS starts multiple partitions simultaneously, which can lead to a race condition.
The agent's RestartLimit attribute is set to a value of three to help avoid this
condition. You can alleviate the potential for this condition by building resource
dependencies for each partition. For example, within a service group you can have
the Db2udb resource 4 (where nodenum=1) depend on Db2udb resource 3 (where
nodenum=2) etc. With the partitions built in a dependency tree, you can set the
value of the RestartLimit to zero.
