This is a general setup (my very first) of Corosync + Pacemaker on 4 virtual servers with a virtual IP. The private network between them is built with OpenVPN.
    corosync-2.4.3-4.el7.x86_64
    corosynclib-2.4.3-4.el7.x86_64
    pacemaker-1.1.19-8.el7_6.4.x86_64
    pacemaker-cli-1.1.19-8.el7_6.4.x86_64
    pacemaker-cluster-libs-1.1.19-8.el7_6.4.x86_64
    pacemaker-libs-1.1.19-8.el7_6.4.x86_64
    pcs-0.9.165-6.el7.centos.1.x86_64
So I have 4 VPSes with CentOS 7, all running OpenVPN. The cluster status in its normal state:
    # pcs status
    Cluster name: hacluster
    Stack: corosync
    Current DC: node2 (version 1.1.19-8.el7_6.4-c3c624ea3d) - partition with quorum
    Last updated: Sat Jun 15 14:00:36 2019
    Last change: Sat Jun 15 02:25:39 2019 by hacluster via crmd on platinum

    4 nodes configured
    1 resource configured

    Online: [ node1 node2 node3 master ]

    Full list of resources:

     virtualIP (ocf::heartbeat:IPaddr2): Started node1

    Daemon Status:
      corosync: active/enabled
      pacemaker: active/enabled
      pcsd: active/enabled
Everything is running fine. Each node can ping the other nodes using their 172.16.172.0/24 addresses.
If I reboot a VPS, the cluster stays online and the virtual IP moves to another active node. So far so good.
Yesterday, due to a DDoS attack, the IP address of one of the nodes was black-holed. The other 3 nodes could no longer connect to it, and from that point corosync started to consume all available CPU. I had to run

    killall -9 corosync

to bring the servers back to life.
After that, the cluster showed all nodes as offline, even the local one. Nothing helped. I tried

    pcs cluster localnode remove node1

and also restarted the daemons, stopped and started the cluster, etc. Each time corosync started, its CPU consumption began to grow again.
I guess I missed something very obvious, but I am still not sure what exactly it is.
The cluster recovered only when the failed node came back online after 4 hours of downtime.
Kindly let me know what I need to tune to keep the cluster online even if one or two nodes are unreachable.
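
For reference, here is a minimal sketch of the quorum arithmetic as I understand it. It assumes corosync's default votequorum behaviour (one vote per node, quorum = floor(N/2) + 1) and no extra options such as last_man_standing. The function names are hypothetical helpers for illustration, not part of pcs or corosync.

```shell
#!/bin/sh
# Quorum arithmetic sketch.
# Assumption: one vote per node, default votequorum settings.

# Minimum number of votes needed for a partition to be quorate.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can be lost while the rest stay quorate.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
    echo "nodes=$n quorum=$(quorum_threshold "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

If this is right, a 4-node cluster needs 3 votes, so it should survive one unreachable node but not two. That would mean the single black-holed node alone should not have cost the remaining three nodes their quorum.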