Problem
The initial cluster configuration fails because the nodes cannot synchronize with the chosen NTP server:
Error Message
In /opt/VRTS/install/logs/ this error is seen:

ntpd: no servers found
Time syncronized with NTP server.
2023-03-30T08:18:08.451404-0700 2 NTP configuration return : 1
2023-03-30T08:18:08.452149-0700 2 CPI ERROR V-9-40-2702 NTP configuration failed.
Running the same command that the configuration runs, the output is:
access-appliance:~ # /opt/VRTSnas/pysnas/system/ntp.py install_ntp xx.xx.xx.xx 2>&1
Stopping ntpd service.
ntpd service stopped.
Disabling ntpd service.
ntpd service disabled.
Setting defaults in ntp.conf.
Querying NTP server.
server xx.xx.xx.xx, stratum 4, offset 10.049736, delay 0.04684
31 Mar 02:15:14 ntpdate[115412]: step time server xx.xx.xx.xx offset 10.049736 sec
Adding NTP server to ntp.conf.
Synchronizing time.
ntpd: no servers found
Time syncronized with NTP server.
Failed to synchronize time.
Cause
The problem is high network latency between the Access nodes and the NTP server.
Troubleshooting steps:
Use ntpq to check synchronization status:
access-appliance:~ # ntpq -c assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  3898  961a   yes  yes   none  sys.peer   sys_peer   1
Note: If NTP is working correctly, the result shows reach=yes and condition=sys.peer.
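As a quick sanity check, that healthy reach/condition pair can be grepped out of the `ntpq -c assoc` output. `check_assoc` below is a hypothetical helper, shown here against a sample healthy line:

```shell
# Hypothetical helper: succeeds when the assoc listing contains a peer with
# reach=yes and condition=sys.peer (the healthy state described above).
check_assoc() {
  grep -q 'yes[[:space:]]*yes.*sys\.peer'
}

# Sample healthy line from `ntpq -c assoc`:
sample='  1  3898  961a   yes  yes   none  sys.peer   sys_peer   1'
printf '%s\n' "$sample" | check_assoc && echo "NTP peer healthy"
```

On the appliance, pipe the live output instead: `ntpq -c assoc | check_assoc`.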
ntpq> rv 3898
associd=3898 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer,
srcadr=10.XX.1XX.1X0, srcport=123, dstadr=10.XX.1XX.1X1, dstport=123,
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=64.575,
refid=10.62.68.236,
reftime=e0d00ab8.2af01902 Wed, Jul 10 2019 6:56:56.167,
rec=e0d00c5e.d78d706e Wed, Jul 10 2019 7:03:58.842, reach=377,
If "reach" is NOT "yes" and "condition" is NOT "sys.peer" (meaning time synchronization has a problem), compare the local time with the NTP server time. If the local time is more than 1000 seconds ahead of or behind the server, ntpd will not set the clock; the time must be set manually.
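The threshold check above can be sketched as follows, assuming the offset (in seconds) has already been read from `ntpdate -q` or `ntpq`; `needs_manual_step` is a hypothetical helper, not part of the product:

```shell
# Hypothetical helper: ntpd refuses to step the clock when the offset exceeds
# its panic threshold (1000 s by default); beyond that, set the time manually
# before starting ntpd.
needs_manual_step() {
  awk -v off="$1" 'BEGIN { exit (off > 1000 || off < -1000) ? 0 : 1 }'
}

needs_manual_step 10.049736 && echo "set the time manually" || echo "ntpd can step the clock"
# If a manual step is needed (sketch; xx.xx.xx.xx is the NTP server):
#   systemctl stop ntpd && ntpdate -b xx.xx.xx.xx && systemctl start ntpd
```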
The following output shows an abnormal synchronization status:
access-appliance:~ # ntpq -c assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 58280  8011   yes  no    none  reject     mobilize   1
Here "reach=no" means the NTP server does not respond to requests or the network is unavailable. Troubleshoot the network and the NTP server.
Scenario 1: Network issue:
Use ping to check whether the NTP server is reachable, and follow standard network troubleshooting. Once a network issue is confirmed, ask the user to engage their network team and verify the issue is fixed.
Scenario 2: Wrong NTP IP or Service issue:
If the NTP server is pingable, the user may have entered the wrong NTP IP, or the NTP service itself may have a problem. Confirm with the user that the NTP IP address is correct, use another NTP server if one is available, and ask the user to engage their admin team to check the service. Sometimes rebooting the NTP server fixes the issue, so that can be tried if acceptable to the user.
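The checks for Scenarios 1 and 2 can be sketched as below; xx.xx.xx.xx stands for the customer's NTP server:

```shell
# Scenario 1: is the NTP server reachable at all on the network?
ping -c 3 xx.xx.xx.xx

# Scenario 2: is the NTP service actually answering?
# -q queries the server without setting the local clock.
ntpdate -q xx.xx.xx.xx
```

If ping succeeds but `ntpdate -q` times out, the problem is the NTP service (or the configured IP), not the network path.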
Scenario 3: Windows NTP server:
The Windows Time service does not implement full-featured NTP. If the user uses a Windows Server as the NTP server, rootdisp may be higher than 1000. In that case, configure the Windows NTP server to synchronize with a reliable external NTP server.
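On the Windows side, pointing the Windows Time service at an external source can be outlined with w32tm; this is a sketch for the Windows admin, and pool.ntp.org is a placeholder for whatever reliable upstream they choose:

```shell
# Run on the Windows NTP server (cmd/PowerShell), not on the Access nodes.
w32tm /config /manualpeerlist:"pool.ntp.org" /syncfromflags:manual /reliable:yes /update
w32tm /resync
w32tm /query /status
```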
If reach=yes but condition=reject, use ntpq with assoc and rv to check the flash code, dispersion, and rootdisp.
vrm:~ # ntpq -c assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  3898  9014   yes  yes   none  reject     reachable  1
Note: The assoc option can show the assid which is needed for rv later.
Use the rv command to get the flash code, dispersion, and rootdisp.
Run the ntpq command to enter the ntpq shell, then use rv assid to get the detailed information.
access-appliance:~ # ntpq
ntpq> rv 3898
associd=3898 status=9014 conf, reach, sel_reject, 1 event, reachable,
srcadr=10.XX.1XX.1X0, srcport=123, dstadr=10.XX.1XX.1X1, dstport=123,
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=1814.209,
refid=10.XX.XX.2X6,
reftime=e0cff348.12fb407d Wed, Jul 10 2019 5:16:56.074,
rec=e0cff42b.60680b73 Wed, Jul 10 2019 5:20:43.376, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=6, ppoll=6, headway=50,
flash=400 peer_dist, keyid=0, offset=-2536.264, delay=0.354,
dispersion=16.515, jitter=4.414, xleave=0.038,
filtdelay= 0.35 0.29 0.32 0.26 0.28 3.22 0.28 0.35,
filtoffset= -2536.2 -2538.2 -2529.4 -2536.2 -2541.6 -2530.0 -2532.5 -2538.1,
filtdisp= 15.63 16.63 17.59 18.55 19.53 20.53 21.52 22.50
flash=400 peer_dist >>>>> reject reason
dispersion=16.515 >>>>> the error/variance between that NTP server and the client
rootdisp=1814.209 >>>>> the total error/variance from the root NTP server down to the client
flash=400 peer_dist
indicates that the distance to the root NTP server is too long, so the server is unfit to synchronize with. The "flash" status codes are:
Code Message Description
0001 pkt_dup duplicate packet
0002 pkt_bogus bogus packet
0004 pkt_unsync server not synchronized
0008 pkt_denied access denied
0010 pkt_auth authentication failure
0020 pkt_stratum invalid leap or stratum
0040 pkt_header header distance exceeded
0080 pkt_autokey Autokey sequence error
0100 pkt_crypto Autokey protocol error
0200 peer_stratum invalid header or stratum
0400 peer_dist distance threshold exceeded
0800 peer_loop synchronization loop
1000 peer_unreach unreachable or nonselect
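The flash value is a hexadecimal bit mask, so several conditions can be set at once. A small sketch (a hypothetical helper, not a product tool) that decodes a flash value against the table above:

```shell
# Sketch: decode an ntpq "flash" value (a hex bit mask) into the named
# conditions from the table above. Several bits can be set at once.
decode_flash() {
  v=$((0x${1#0x}))                              # parse the hex value
  [ $((v & 0x0001)) -ne 0 ] && echo pkt_dup
  [ $((v & 0x0002)) -ne 0 ] && echo pkt_bogus
  [ $((v & 0x0004)) -ne 0 ] && echo pkt_unsync
  [ $((v & 0x0008)) -ne 0 ] && echo pkt_denied
  [ $((v & 0x0010)) -ne 0 ] && echo pkt_auth
  [ $((v & 0x0020)) -ne 0 ] && echo pkt_stratum
  [ $((v & 0x0040)) -ne 0 ] && echo pkt_header
  [ $((v & 0x0080)) -ne 0 ] && echo pkt_autokey
  [ $((v & 0x0100)) -ne 0 ] && echo pkt_crypto
  [ $((v & 0x0200)) -ne 0 ] && echo peer_stratum
  [ $((v & 0x0400)) -ne 0 ] && echo peer_dist
  [ $((v & 0x0800)) -ne 0 ] && echo peer_loop
  [ $((v & 0x1000)) -ne 0 ] && echo peer_unreach
  return 0
}

decode_flash 400    # the value from the rv output above -> peer_dist
```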
Solution
The workaround to allow the Access nodes to synchronize properly with the NTP server is to add:
tos maxdist 20
to /etc/ntp.conf on both nodes and restart the ntpd OS service.
The default threshold is 1.5 seconds; this setting increases it to 20 seconds.
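A minimal sketch of the workaround, run here against a scratch copy of the file so it can be tried safely; on the appliance the real target is /etc/ntp.conf, edited on both nodes:

```shell
# Work on a scratch copy here; on each node the real target is /etc/ntp.conf.
conf=./ntp.conf.demo
printf 'server xx.xx.xx.xx iburst\n' > "$conf"   # stand-in for the existing config

# Append the workaround only if it is not already present (idempotent).
grep -q '^tos maxdist' "$conf" || echo 'tos maxdist 20' >> "$conf"

cat "$conf"
# Then restart the service on both nodes:
#   systemctl restart ntpd
```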
The customer should still engage their network team to find the root cause of the delays to the NTP server and fix the underlying issue.