Node is unable to join the cluster; HAD daemon is killed
In Linux environments, a node sometimes fails to join the cluster. For example, in a multi-node cluster, a couple of nodes form the cluster, but when other nodes try to join, the "had" daemon is killed on the nodes that were already cluster members.
Nodes that try to join later can also get stuck in the "REMOTE_BUILD" state. This happens because the node that was providing them with the snapshot of main.cf leaves cluster membership. In this case, the following error message appears in engine_A.log:
V-16-1-10468 Node providing snapshot has left the cluster
The following GAB messages also appear in engine_A.log:
Jun 24 16:40:13 <hostname> Had: VCS ERROR V-16-1-10119 GabHandle::push returned = 12, gh_src = 0, gh_gen = 0, gh_size = 16358
Jun 24 16:40:13 <hostname> Had: VCS ERROR V-16-1-11103 VCS exited. It will restart
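To confirm that a node is showing this symptom, the log can be scanned for the message IDs quoted above. The following is a minimal sketch, not a supported tool: the function names and the short descriptions in KNOWN_IDS are illustrative, and the log lines are supplied by the caller. It also looks up the errno name for the value reported by GabHandle::push. On Linux, errno 12 is ENOMEM (out of memory); that GAB's return value follows errno numbering is an assumption here, not something the log states.

```python
import errno
import re

# Message IDs quoted in this article; the descriptions are paraphrased.
KNOWN_IDS = {
    "V-16-1-10468": "node providing snapshot has left the cluster",
    "V-16-1-10119": "GabHandle::push returned an error",
    "V-16-1-11103": "VCS exited and will restart",
}

MSG_ID = re.compile(r"\bV-16-1-\d+\b")
PUSH_RC = re.compile(r"GabHandle::push returned = (-?\d+)")


def scan_engine_log(lines):
    """Return (line_number, message_id, description) for each known ID found."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for msg_id in MSG_ID.findall(line):
            if msg_id in KNOWN_IDS:
                hits.append((number, msg_id, KNOWN_IDS[msg_id]))
    return hits


def gab_push_return_code(line):
    """Extract the integer after 'GabHandle::push returned =', or None."""
    match = PUSH_RC.search(line)
    return int(match.group(1)) if match else None


def errno_name(code):
    """Map a numeric code to its errno name, or None if unknown.

    Assumption: the GAB return value uses errno numbering.
    """
    return errno.errorcode.get(code)
```

For example, passing an open file handle for engine_A.log to scan_engine_log() lists every occurrence of the three messages with its line number. The path to the log is left to the caller.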
Stopping and restarting "had", GAB, and LLT does not help.
The major cause found for this behavior is a lack of kernel memory.
As a workaround, reboot the server. This frees kernel memory for the time being, but the issue is likely to reappear.
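To check whether kernel memory is under pressure before resorting to a reboot, the counters in /proc/meminfo can be inspected. The sketch below is illustrative only: the function names are hypothetical, the choice of fields is an assumption about what is useful to look at, and no thresholds are implied. LowTotal and LowFree are present only on kernels built with high-memory support (typically 32-bit), so the summary includes them only when they exist.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_as_reported}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def kernel_memory_summary(meminfo):
    """Pick out the fields most relevant to kernel memory, if present."""
    # Assumption: these fields are a reasonable starting point; adjust as needed.
    fields = ("MemTotal", "MemFree", "Slab", "SUnreclaim", "LowTotal", "LowFree")
    return {field: meminfo[field] for field in fields if field in meminfo}
```

A typical call would read the file once and summarize it: kernel_memory_summary(parse_meminfo(open("/proc/meminfo").read())). Comparing the output taken while the problem is occurring with the output taken after a reboot may help show whether kernel memory is being exhausted over time.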