MB Garbage Collection starts on node1 (mbe,cr) and runs for days without doing anything (from what we can tell)
The MBGC on the other mbe,cr nodes does not start since this one never completes.
Need assistance to figure out why and what to do about it.
Recommend to determine what is happening with MB garbage collection.
PDDO data removal, cr garbage collection and queue processing are all running successfully.
MB garbage collection has not run for over 1 month, and the last time it was killed it had run for 170 hours.
The job gets stuck on node pdnode03 and does not move on to the other nodes.
We can see that the Storage/tmp/workflow.1993677 file is created, but it is not updated after the DEREF line.
Running lsof from Storage/tmp:
Storage/tmp # lsof | grep workflow.1993677
shows 4 open handles on that file, held by 2 processes (pdagent 29915 and php 31801).
pdagent 29915 root 9u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
php 31801 root 1u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
php 31801 root 2u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
php 31801 root 9u REG 199,65534 141 2976 /Storage/tmp/workflow.1993677
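The "created but not updating" observation can be checked with a small script rather than by eye. The following is a minimal sketch, not part of the product: the function name file_changed is made up for illustration, and it assumes GNU stat (the -c option) is available on the node.

```shell
#!/bin/sh
# file_changed <path> <seconds>
# Hypothetical helper: compares size and mtime before and after a wait.
# Exit status: 0 = file changed, 1 = file unchanged, 2 = stat failed.
file_changed() {
    path=$1
    wait_secs=$2
    # %s = size in bytes, %Y = last modification time (seconds since epoch)
    before=$(stat -c '%s %Y' "$path") || return 2
    sleep "$wait_secs"
    after=$(stat -c '%s %Y' "$path") || return 2
    [ "$before" != "$after" ]
}

# Example call (commented out; path taken from the lsof output above):
# file_changed /Storage/tmp/workflow.1993677 60 && echo "still updating"
```

A status of 1 over a reasonably long interval only says the workflow file is idle. It does not say the job is dead, since the work may be happening inside the database.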
top (or an equivalent) shows postmaster with high CPU utilization, so the database appears to be doing something.
Rebuild the MBE database on the problem mbe,cr node.
---- Further troubleshooting:
From the SPA node, you can do the following:
- Put pdagent on server nodes in debug (pdagent --debug)
- Run MBGC
- After it fails, find the correct pd_jobstep_<jobstepid>.php
(a "grep MBGarbage /Storage/tmp/pd_jobstep*" will get you this)
- Run the script manually: /opt/pdag/bin/php /Storage/tmp/pd_jobstep_xyz.php
(preferably in screen)
If this works, the issue is with the pdagent/pdwfe. Check those logs first.
If this fails as well, but without a decent error:
- Find Application::start(); line in jobstep PHP file
- Add just after this line the following 2 lines:
Debug::$debug = true;
Debug::$debug2Screen = true;
- Rerun and check for error in debug output (might be very messy)
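The find-and-edit steps above can be sketched as a pair of shell helpers. This is only an illustration under stated assumptions: the function names find_mbgc_jobsteps and make_debug_copy are hypothetical, and writing the debug lines to a copy instead of editing the original jobstep file is a precaution added here. Whether the jobstep runs correctly from a renamed copy has not been verified.

```shell
#!/bin/sh
# List jobstep scripts that mention MBGarbage (the grep from the steps
# above, with -l so only file names are printed).
find_mbgc_jobsteps() {
    dir=${1:-/Storage/tmp}
    grep -l MBGarbage "$dir"/pd_jobstep* 2>/dev/null
}

# Write a copy of a jobstep script with the two Debug lines inserted
# directly after the line containing Application::start();
make_debug_copy() {
    src=$1
    dst=$2
    awk '
        { print }
        index($0, "Application::start();") {
            print "Debug::$debug = true;"
            print "Debug::$debug2Screen = true;"
        }
    ' "$src" > "$dst"
}

# Example (commented out; pd_jobstep_xyz.php is the placeholder name
# used in the steps above, not a real file):
# find_mbgc_jobsteps
# make_debug_copy /Storage/tmp/pd_jobstep_xyz.php /Storage/tmp/pd_jobstep_xyz.debug.php
# /opt/pdag/bin/php /Storage/tmp/pd_jobstep_xyz.debug.php
```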
PureDisk 6-node VCS cluster
PureDisk 22.214.171.124 EEB20 with latest rollup2 version is installed.
There are 5 active nodes (1 SPA and 4 mbe,cr nodes) and 1 spare node.
There are two backup environments running 6.5.6 and one running 126.96.36.199; these are sending PDDO backups to this storage pool.
Problem is with node1 (mbe,cr)
MB Garbage collection appears to be running but not doing anything
Node 1 "pdnode03" (mbe,cr)
db Name       db ID       db Size Bytes
------------  ---------   ---------------   -----------
crdb          479618193    33,661,723,124   ( 31.3 Gb)
mb                16387    13,476,079,092   ( 12.6 Gb)
postgres          10819         4,001,268   (  3.8 Mb)
------------  ---------   ---------------   -----------
Total                      47,141,803,484   ( 43.9 Gb)
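As a sanity check on the table, the total and the rounded sizes can be recomputed from the byte counts. This is a small sketch: the helper name to_gb is made up for illustration, and it assumes the "Gb" column means bytes divided by 2^30, which is consistent with the figures shown.

```shell
#!/bin/sh
# Database sizes in bytes, copied from the table above.
crdb=33661723124
mb=13476079092
postgres=4001268

# Sum with shell integer arithmetic (assumes 64-bit shell arithmetic).
total=$((crdb + mb + postgres))

# Convert a byte count to units of 2^30 bytes, one decimal place.
to_gb() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

echo "total bytes: $total"
echo "crdb: $(to_gb "$crdb") Gb, mb: $(to_gb "$mb") Gb, total: $(to_gb "$total") Gb"
```

The recomputed values agree with the table, so the mb database on this node is about 12.6 Gb of the 43.9 Gb total.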