Node is not able to join cluster, HAD daemon getting killed

Problem

A node is unable to join the cluster, and the HAD daemon is killed on nodes that are already cluster members.

Solution

In Linux environments, nodes sometimes fail to join the cluster. For example, in a multinode cluster, a couple of nodes form the cluster, but when other nodes try to join, the "had" daemon is killed on the nodes that were already part of the cluster.
 
Nodes that try to join later can also get stuck in the "REMOTE_BUILD" state. They get stuck because the node that was providing them with the snapshot of main.cf leaves cluster membership. In this case, the following error message appears in engine_A.log:
 
V-16-1-10468 Node providing snapshot has left the cluster
 
The following GAB messages also appear in engine_A.log:
 
Jun 24 16:40:13 <hostname> Had[16151]: VCS ERROR V-16-1-10119 GabHandle::push returned = 12, gh_src = 0, gh_gen = 0, gh_size = 16358
 
Jun 24 16:40:13 <hostname> Had[16151]: VCS ERROR V-16-1-11103 VCS exited. It will restart
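
To check a log for the messages quoted above, a short script can help. The following is a minimal Python sketch, not a vendor tool. The function name scan_engine_log and the short descriptions are hypothetical. It also assumes that the value after "GabHandle::push returned =" is a standard errno number, which the article does not confirm.

```python
import errno
import re

# Message IDs quoted in this article; the descriptions are paraphrases.
MESSAGE_IDS = {
    "V-16-1-10468": "Node providing snapshot has left the cluster",
    "V-16-1-10119": "GabHandle::push returned an error",
    "V-16-1-11103": "VCS exited and will restart",
}

_ID_RE = re.compile(r"V-16-1-\d+")
_PUSH_RE = re.compile(r"GabHandle::push returned = (\d+)")


def scan_engine_log(lines):
    """Return (message_id, description, errno_name_or_None) per matching line."""
    hits = []
    for line in lines:
        match = _ID_RE.search(line)
        if not match or match.group(0) not in MESSAGE_IDS:
            continue
        msg_id = match.group(0)
        err_name = None
        push = _PUSH_RE.search(line)
        if push:
            # Assumption: the return value is a standard errno number.
            err_name = errno.errorcode.get(int(push.group(1)))
        hits.append((msg_id, MESSAGE_IDS[msg_id], err_name))
    return hits
```

The function accepts any iterable of lines, so an open file object for engine_A.log can be passed directly. Under the errno assumption, a return value of 12 decodes to ENOMEM.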
 
Stopping and restarting "had", GAB, and LLT does not help.
 
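The main cause found for this behavior is a lack of kernel memory. The return value of 12 in the GabHandle::push message is consistent with this, because 12 is the Linux errno value for ENOMEM (out of memory).
As a workaround, reboot the server. This frees kernel memory for the time being, but the issue is likely to recur.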
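
Before rebooting, it can be useful to record how much memory the kernel reports. The following is a minimal Python sketch that reads /proc/meminfo on Linux. The function names and the choice of fields (MemFree, LowFree, Slab, SUnreclaim) are assumptions for illustration. The article does not state which counters indicate the shortage, and LowFree is present only on 32-bit kernels with high memory.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def kernel_memory_summary(meminfo,
                          fields=("MemFree", "LowFree", "Slab", "SUnreclaim")):
    """Keep only the selected fields that exist on this kernel."""
    return {field: meminfo[field] for field in fields if field in meminfo}


def read_kernel_memory(path="/proc/meminfo"):
    """Read the live counters from the running Linux system."""
    with open(path) as handle:
        return kernel_memory_summary(parse_meminfo(handle.read()))
```

Calling read_kernel_memory() before and after a reboot gives a simple comparison of the selected counters.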
 
 

 

 
