During a manual service group failover, one of the DB2 resources faults.

Article: 100027272
Last Published: 2012-08-08
Product(s): InfoScale & Storage Foundation

Problem

The customer manually switches the service group to the other node. One of the Db2udb resources faults after VCS takes the grandparent resource offline.

Error Message

>>>>>>> INSTANCE NAMES CONFIGURED ON THE SYSTEM

** "FIRST instance"
PROD_rsacmdb_instance
** "SECOND instance"
PROD_rsadb_instance
** "THIRD instance"
PROD_rsatldb_instance


>>>>>> FAILURE SEQUENCE

15:07:27  Initiating Offline of Resource PROD_RSADB2_IP
15:07:27  Initiating Offline of Resource PROD_db2sysc_process
15:07:28  PROD_RSADB2_IP:offline:/usr/sbin/ifconfig
15:07:29  Resource PROD_RSADB2_IP is offline on adb2rib32
15:07:31  Resource PROD_db2sysc_process is offline on adb2rib32

*
* >>> Initiating offline FIRST instance
15:07:31  Initiating Offline of Resource PROD_rsacmdb_instance on System adb2rib32
*
>>> SECOND instance reports an error although no offline command was issued
15:07:45 VCS INFO V-16-2-13075 (adb2rib32) Resource(PROD_rsadb_instance) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
*
>>> DB2 output
15:07:58 Db2udb:PROD_rsacmdb_instance:offline:DB2 shutdown returned the output: 03/14/2012 15:07:56     0   0   SQL1032N  No start database manager command was issued.SQL1032N  No start database manager command was issued.  SQLSTATE=57019
*
>>> THIRD instance reports an error although no offline command was issued
15:08:26 Resource(PROD_rsatldb_instance) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
*
>>> FIRST instance is offline
15:08:29 VCS INFO V-16-1-10305 Resource PROD_rsacmdb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
*
* >>> Initiating offline THIRD instance
15:08:29 Initiating Offline of Resource PROD_rsatldb_instance on System adb2rib32
*
>>> Calling clean for SECOND instance - offline on its own
15:08:44 VCS ERROR V-16-2-13067 (adb2rib32) Agent is calling clean for resource(PROD_rsadb_instance) because the resource became OFFLINE unexpectedly, on its own.
*
>>> DB2 output
15:08:53     0   0   SQL1032N  No start database manager command was issued.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019

*
>>> THIRD instance is offline
15:09:26 VCS INFO V-16-1-10305 Resource PROD_rsatldb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib00001032 (VCS initiated)
*
* >>> Initiating offline SECOND instance
2012/03/14 15:09:26 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
*
>>> DB2 output from clean
2012/03/14 15:10:51 VCS INFO V-16-2-13716 (adb2rib32) Resource(PROD_rsadb_instance): Output of the completed operation (clean)
==============================================
03/14/2012 15:08:45     0   0   SQL1032N  No start database manager command was issued.
SQL1032N  No start database manager command was issued.  SQLSTATE=57019
/db2home/rsa01ppr/sqllib/bin/ipclean: Removing DB2 engine and client's IPC resources for rsa01ppr.
==============================================
*

>>> Clean completes
2012/03/14 15:10:51 VCS INFO V-16-2-13068 (adb2rib32) Resource(PROD_rsadb_instance) - clean completed successfully.
*
>>> SECOND instance  OFFLINE unexpectedly
2012/03/14 15:10:51 VCS ERROR V-16-2-13073 (adb2rib32) Resource(PROD_rsadb_instance) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 3) the resource.
*
>>> DB2 restarts
2012/03/14 15:11:07 VCS NOTICE V-16-20004-72 (adb2rib32) Db2udb:PROD_rsadb_instance:online:DB2 startup returned the output:
Instance  : rsa01ppr
DB2 Start : Success
Partition 0 : Success

*
>>> VCS detects SECOND instance restart
2012/03/14 15:11:39 VCS NOTICE V-16-2-13076 (adb2rib32) Agent has successfully restarted resource(PROD_rsadb_instance).
*
>>> DB2 returns SECOND instance has stopped
2012/03/14 15:11:40 VCS NOTICE V-16-20004-74 (adb2rib32) Db2udb:PROD_rsadb_instance:offline:DB2 shutdown returned the output: 03/14/2012 15:11:40     0   0   SQL1064N  DB2STOP processing was successful.
SQL1064N  DB2STOP processing was successful.

*
>>> VCS sees SECOND instance as offline
2012/03/14 15:12:11 VCS INFO V-16-1-10305 Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
*
>>> VCS now continues on with offlining dependencies
2012/03/14 15:12:11 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_db2data1vol_mount (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/03/14 15:12:11 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_db2data2vol_mount (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/03/14 15:12:11 VCS NOTICE V-16-1-10300 Initiating Offline of Resource
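
The failure sequence above can be pulled out of a longer engine log with a small filter. The following is a minimal Python sketch, not a Veritas tool: the function name `summarize` and the short descriptions are illustrative assumptions, while the message IDs are the ones that appear in the excerpts above.

```python
import re

# Message IDs copied from the log excerpts in this article.
# The descriptions are informal paraphrases, not official message text.
UNEXPECTED_IDS = {
    "V-16-2-13075": "unexpected OFFLINE, still within ToleranceLimit",
    "V-16-2-13067": "clean called after unexpected OFFLINE",
    "V-16-2-13073": "agent is restarting the resource",
    "V-16-2-13076": "agent restarted the resource",
}

# Agent framework messages print the name as Resource(NAME) or resource(NAME).
RESOURCE_RE = re.compile(r"[Rr]esource\(([^)]+)\)")
MSGID_RE = re.compile(r"V-16-\d+-\d+")


def summarize(lines):
    """Return (resource, message_id, meaning) tuples in log order."""
    events = []
    for line in lines:
        msg = MSGID_RE.search(line)
        res = RESOURCE_RE.search(line)
        if msg and res and msg.group(0) in UNEXPECTED_IDS:
            msg_id = msg.group(0)
            events.append((res.group(1), msg_id, UNEXPECTED_IDS[msg_id]))
    return events
```

Lines that carry no message ID, or that name the resource without parentheses (such as the V-16-1-10305 "is offline" lines), are skipped by this sketch.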


 

Cause

The DB2 agent provides HA at the instance level, not at the database level: it monitors the running "db2sysc" process of the instance. Irrespective of the number of databases created in the instance, the agent monitors only that instance-specific "db2sysc" process, so several Db2udb resources that point at the same instance all track the same process.
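
To illustrate what instance-level monitoring implies, here is a minimal Python sketch. It is not the agent's actual monitor code; the helper names `instance_is_running` and `resource_states` are hypothetical, and the input is assumed to be "USER COMMAND" pairs such as `ps -eo user,comm` prints.

```python
def instance_is_running(ps_output, inst_owner):
    """Return True if a db2sysc process owned by inst_owner is listed.

    ps_output is assumed to contain one "USER COMMAND" pair per line.
    """
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == inst_owner and fields[1] == "db2sysc":
            return True
    return False


def resource_states(resources, ps_output):
    """Map each resource name to ONLINE or OFFLINE.

    resources maps a resource name to its instance owner. The database
    name plays no part in the check, which is the point of the sketch.
    """
    return {
        name: "ONLINE" if instance_is_running(ps_output, owner) else "OFFLINE"
        for name, owner in resources.items()
    }
```

With three resources that share the owner rsa01ppr, stopping the single db2sysc process makes all three report OFFLINE at once, whichever resource the stop was issued for.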

 


 

Observations:

CONFIGURATION

The customer has the following dependency among the DB2 resources:

// Process PROD_db2sysc_process
// {
// Db2udb PROD_rsacmdb_instance
// {
// Db2udb PROD_rsatldb_instance
// {
// Db2udb PROD_rsadb_instance

Process PROD_db2sysc_process (
ResourceOwner = rsa01ppr
PathName = db2sysc
Arguments = 0
)

Db2udb PROD_rsacmdb_instance (
DB2InstOwner = rsa01ppr
DB2InstHome = "/db2home/rsa01ppr"
DatabaseName = rsacmdb
)

Db2udb PROD_rsadb_instance (
DB2InstOwner = rsa01ppr
DB2InstHome = "/db2home/rsa01ppr"
DatabaseName = rsadb
)

Db2udb PROD_rsatldb_instance (
DB2InstOwner = rsa01ppr
DB2InstHome = "/db2home/rsa01ppr"
DatabaseName = rsatldb
)

The DB2 instance home (DB2InstHome = "/db2home/rsa01ppr") and the DB2 instance owner (DB2InstOwner = rsa01ppr) are the same for all three Db2udb resources; only DatabaseName differs. This means that the customer has created a single DB2 instance that contains three databases (rsacmdb, rsadb, rsatldb). The configuration also has a single partition / node (node number 0).
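
The observation above can be checked mechanically. The following Python sketch groups Db2udb resources by instance owner and home; it is a simplified reader for excerpts formatted like the one above, not a full main.cf parser, and the function names are illustrative.

```python
from collections import defaultdict


def parse_db2udb(main_cf_text):
    """Collect the attributes of each Db2udb resource block."""
    resources = {}
    current = None
    for raw in main_cf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and commented dependency trees
        if current is None:
            parts = line.split()
            if len(parts) >= 3 and parts[0] == "Db2udb" and parts[2] == "(":
                current = parts[1]
                resources[current] = {}
        elif line == ")":
            current = None
        elif "=" in line:
            key, value = line.split("=", 1)
            resources[current][key.strip()] = value.strip().strip('"')
    return resources


def shared_instances(resources):
    """Return {(owner, home): [resource names]} for instances used more than once."""
    groups = defaultdict(list)
    for name, attrs in resources.items():
        groups[(attrs.get("DB2InstOwner"), attrs.get("DB2InstHome"))].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Run against the configuration above, `shared_instances` reports one group containing all three Db2udb resources, which confirms that they represent a single instance.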

A DB2 instance has one "db2sysc" process associated with it. Because the DB2 agent provides HA at the instance level (not at the database level), it monitors only that instance-specific "db2sysc" process, irrespective of how many databases the instance contains.

LOGS:

2012/04/06 16:00:09 VCS INFO V-16-1-50135 User admin fired command: hagrp -switch RSAPASSMARK_DB2PROD_DATASG adb2rib33 localclus from ::ffff:10.10.149.100
2012/04/06 16:00:09 VCS NOTICE V-16-1-10208 Initiating switch of group RSAPASSMARK_DB2PROD_DATASG from system adb2rib32 to system adb2rib33

2012/04/06 16:00:13 VCS DBG_TRACE V-16-50-0 PROD_db2sysc_process::state transition from ONLINE to OFFLINE << Indicates that the db2sysc process has gone offline (initiated by the PROD_db2sysc_process resource).


Because the single, instance-specific "db2sysc" process has already gone offline, the offline entry point for PROD_rsacmdb_instance correctly prints the messages below 20 times: every conventional attempt to stop DB2 (db2stop force), from the first one on, fails with SQL1032N because there is no running db2sysc process to stop.

================================================
2012/04/06 16:00:13 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_rsacmdb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/04/06 16:00:14 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 19 RetryLimit left:
2012/04/06 16:00:14 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:00:13 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+
2012/04/06 16:00:15 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 18 RetryLimit left:
2012/04/06 16:00:16 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:00:15 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+

.
.
.
2012/04/06 16:00:57 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:00:57 VCS INFO V-16-2-13075 (adb2rib32) Resource(PROD_rsadb_instance) has reported unexpected OFFLINE 1 times, which is still within the ToleranceLimit(1).
<<< The monitor is scheduled for PROD_rsadb_instance. Since no db2sysc process is running, this message is printed correctly when the monitor does not find db2sysc.
.
.
2012/04/06 16:00:58 VCS DBG_3 V-16-20004-0 (adb2rib33) Db2udb:PROD_rsatldb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:17 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsacmdb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:18 VCS DBG_TRACE V-16-50-0 internal_is_offline res =PROD_rsacmdb_instance, sys =adb2rib32
2012/04/06 16:01:18 VCS INFO V-16-1-10305 Resource PROD_rsacmdb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
2012/04/06 16:01:18 VCS DBG_TRACE V-16-50-0 PROD_rsacmdb_instance::state transition from ONLINE to OFFLINE
<< Printed correctly, since the monitor does not detect the db2sysc process after the offline entry point completes for this resource. Offline is now initiated for the PROD_rsatldb_instance resource, and the same sequence repeats (20 conventional attempts to stop DB2).
.
.
.
2012/04/06 16:01:18 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 19 RetryLimit left:
2012/04/06 16:01:18 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:01:18 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+

2012/04/06 16:01:20 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Offline call /db2home/rsa01ppr/sqllib/adm/db2stop force nodenum 0 with 18 RetryLimit left:
2012/04/06 16:01:20 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:offline:Output returned :
+--------------------------------------------------------------------+
04/06/2012 16:01:20 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
+====================================================================+
.
.
.
2012/04/06 16:01:55 VCS DBG_3 V-16-20004-0 (adb2rib33) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:57 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:01:58 VCS ERROR V-16-2-13067 (adb2rib32) Agent is calling clean for resource(PROD_rsadb_instance) because the resource became OFFLINE unexpectedly, on its own.
<< This message is printed correctly: the clean entry point is called now that the ToleranceLimit has been exceeded.
2012/04/06 16:01:58 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:In ValidNodeNum(): NodeNum=0, NumPartitions=0
2012/04/06 16:01:58 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:In ValidNodeNum(): Partition_arr[0]=0
2012/04/06 16:01:58 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:In ValidNodeNum(): 0 is OK
2012/04/06 16:02:04 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:clean:No db2sysc process found.
2012/04/06 16:02:22 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsatldb_instance:monitor:No db2sysc process found.
2012/04/06 16:02:23 VCS DBG_TRACE V-16-50-0 internal_is_offline res =PROD_rsatldb_instance, sys =adb2rib32
2012/04/06 16:02:23 VCS INFO V-16-1-10305 Resource PROD_rsatldb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
2012/04/06 16:02:23 VCS DBG_TRACE V-16-50-0 PROD_rsatldb_instance::state transition from ONLINE to OFFLINE
<< Printed correctly, since the monitor does not detect the db2sysc process after the offline entry point completes for this resource.
2012/04/06 16:02:23 VCS DBG_TRACE V-16-50-0 PROD_rsatldb_instance::istate transition from RESOURCE_W_OFFLINE_PROPAGATE to RESOURCE_W_NONE
.
.
.
2012/04/06 16:02:23 VCS NOTICE V-16-1-10300 Initiating Offline of Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) on System adb2rib32
2012/04/06 16:04:05 VCS INFO V-16-2-13716 (adb2rib32) Resource(PROD_rsadb_instance): Output of the completed operation (clean)
==============================================
04/06/2012 16:01:59 0 0 SQL1032N No start database manager command was issued.
SQL1032N No start database manager command was issued. SQLSTATE=57019
/db2home/rsa01ppr/sqllib/bin/ipclean: Removing DB2 engine and client's IPC resources for rsa01ppr.
<< Messages get correctly printed since the clean entry point has now executed successfully.
==============================================

2012/04/06 16:04:05 VCS INFO V-16-2-13068 (adb2rib32) Resource(PROD_rsadb_instance) - clean completed successfully.
.
.
.
2012/04/06 16:04:05 VCS ERROR V-16-2-13073 (adb2rib32) Resource(PROD_rsadb_instance) became OFFLINE unexpectedly on its own. Agent is restarting (attempt number 1 of 3) the resource.
<< Since the RestartLimit for this resource is set to 3 (the message above reads "attempt number 1 of 3"), the agent tries to bring the resource PROD_rsadb_instance online again.
2012/04/06 16:04:05 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:
MPPSupport flag is set to 1, total number of partitions is 1
2012/04/06 16:04:05 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:In ValidNodeNum(): NodeNum=0, NumPartitions=0
2012/04/06 16:04:05 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:In ValidNodeNum(): Partition_arr[0]=0
2012/04/06 16:04:06 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:online:In ValidNodeNum(): 0 is OK
2012/04/06 16:04:20 VCS NOTICE V-16-20004-72 (adb2rib32) Db2udb:PROD_rsadb_instance:online:DB2 startup returned the output:
Instance : rsa01ppr
DB2 Start : Success
Partition 0 : Success
.
.
.
2012/04/06 16:04:52 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:In ValidNodeNum(): NodeNum=0, NumPartitions=0
2012/04/06 16:04:52 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:In ValidNodeNum(): Partition_arr[0]=0
2012/04/06 16:04:52 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:In ValidNodeNum(): 0 is OK
2012/04/06 16:04:53 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:get_db2gcf_status() returns a return code = 0
2012/04/06 16:04:53 VCS NOTICE V-16-2-13076 (adb2rib32) Agent has successfully restarted resource(PROD_rsadb_instance).
2012/04/06 16:04:55 VCS NOTICE V-16-20004-74 (adb2rib32) Db2udb:PROD_rsadb_instance:offline:DB2 shutdown returned the output: 04/06/2012 16:04:54 0 0 SQL1064N DB2STOP processing was successful.
SQL1064N DB2STOP processing was successful.
<< Finally, after the clean and the restart, the resource shuts down.
.
.
.
2012/04/06 16:05:01 VCS DBG_3 V-16-20004-0 (adb2rib33) Db2udb:PROD_rsacmdb_instance:monitor:No db2sysc process found.
2012/04/06 16:05:26 VCS DBG_3 V-16-20004-0 (adb2rib32) Db2udb:PROD_rsadb_instance:monitor:No db2sysc process found.
2012/04/06 16:05:27 VCS DBG_TRACE V-16-50-0 internal_is_offline res =PROD_rsadb_instance, sys =adb2rib32

internal.C:[4113]
2012/04/06 16:05:27 VCS INFO V-16-1-10305 Resource PROD_rsadb_instance (Owner: Unspecified, Group: RSAPASSMARK_DB2PROD_DATASG) is offline on adb2rib32 (VCS initiated)
2012/04/06 16:05:27 VCS DBG_TRACE V-16-50-0 PROD_rsadb_instance::state transition from ONLINE to OFFLINE
<< Finally, the third Db2udb resource goes offline.
.
.
.
2012/04/06 16:05:35 VCS NOTICE V-16-1-10446 Group RSAPASSMARK_DB2PROD_DATASG is offline on system adb2rib32


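Thus, the logs above show why the switch-over appears fine when only two Db2udb resources are configured, or when there is no dependency among them: in both cases VCS can initiate the offline of every Db2udb resource before that resource's monitor has reported an unexpected OFFLINE more often than the ToleranceLimit (1) allows, so clean is not called.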
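
The timing argument can be sketched as a small Python model. It is a simplification under stated assumptions, not a description of VCS internals: the default values are read off the 2012/04/06 log (each chained offline took about 65 seconds, the PROD_rsadb_instance monitor ran about 44 seconds after db2sysc stopped and again about 60 seconds later, and the ToleranceLimit is 1), and the function name is hypothetical.

```python
def resources_hitting_clean(offline_order, offline_duration=65,
                            monitor_interval=60, first_monitor=44,
                            tolerance_limit=1):
    """Return the resources whose monitor exceeds ToleranceLimit.

    Time 0 is the moment db2sysc stops. Resources are taken offline one
    after another in offline_order; each offline lasts offline_duration
    seconds. All timings are assumptions taken from the log above.
    """
    flagged = []
    for position, name in enumerate(offline_order):
        offline_start = position * offline_duration
        # Count monitor cycles that find no db2sysc before VCS
        # initiates the offline of this resource.
        misses = 0
        t = first_monitor
        while t < offline_start:
            misses += 1
            t += monitor_interval
        if misses > tolerance_limit:
            flagged.append(name)
    return flagged
```

With the three resources in the order seen in the log, only the last one (PROD_rsadb_instance) is flagged. With two resources nothing is flagged, and passing `offline_duration=0` models resources without dependencies, whose offlines all start at once.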

Solution

 

Because HA is provided only at the instance level, the customer can configure a single Db2udb resource for the instance, using any one of the databases, and switch over the service group with that single resource.

VCS starts multiple partitions simultaneously, which can lead to a race condition. The agent's RestartLimit attribute is set to a value of three to help avoid this condition. You can alleviate the potential for this condition by building resource dependencies for each partition. For example, within a service group you can have the Db2udb resource 4 (where nodenum=1) depend on Db2udb resource 3 (where nodenum=2), and so on. With the partitions built in a dependency tree, you can set the value of the RestartLimit to zero.
