Access 3340/3350 initial cluster configuration fails on NTP step

Article: 100055559
Last Published: 2023-03-31
Product(s): Appliances

Problem

The initial cluster configuration fails because the nodes cannot synchronize with the chosen NTP server.



Error Message

The following error appears in the logs under /opt/VRTS/install/logs/:

ntpd: no servers found
Time syncronized with NTP server.
2023-03-30T08:18:08.451404-0700 2 NTP configuration return : 1
2023-03-30T08:18:08.452149-0700 2 CPI ERROR V-9-40-2702 NTP configuration failed.


Running the same command that the configuration runs produces the following output:

access-appliance:~ # /opt/VRTSnas/pysnas/system/ntp.py install_ntp xx.xx.xx.xx 2>&1

Stopping ntpd service.

ntpd service stopped.

Disabling ntpd service.

ntpd service disabled.

Setting defaults in ntp.conf.

Querying NTP server.

server xx.xx.xx.xx, stratum 4, offset 10.049736, delay 0.04684
31 Mar 02:15:14 ntpdate[115412]: step time server xx.xx.xx.xx offset 10.049736 sec

Adding NTP server to ntp.conf.
Synchronizing time.
ntpd: no servers found

Time syncronized with NTP server.

Failed to synchronize time.

Cause

The cause is network latency between the Access nodes and the NTP server. The distance to the server exceeds the threshold that ntpd accepts, so ntpd rejects the server.

Troubleshooting steps:

Use ntpq to check synchronization status:     

access-appliance:~ # ntpq -c assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  3898  961a   yes   yes  none  sys.peer    sys_peer  1


Note: If NTP is working correctly, the output shows reach=yes and condition=sys.peer.


ntpq> rv 3898
associd=3898 status=961a conf, reach, sel_sys.peer, 1 event, sys_peer,
srcadr=10.XX.1XX.1X0, srcport=123, dstadr=10.XX.1XX.1X1, dstport=123,
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=64.575,
refid=10.62.68.236,
reftime=e0d00ab8.2af01902  Wed, Jul 10 2019  6:56:56.167,
rec=e0d00c5e.d78d706e  Wed, Jul 10 2019  7:03:58.842, reach=377,


If reach is not yes or condition is not sys.peer, time synchronization is not working. Compare the local time with the NTP server time. If the two differ by more than 1000 seconds, ntpd will not set the clock, and the time must be set manually.
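The 1000-second limit can be checked from the ntpdate line that the install command prints. The following Python sketch extracts the offset from a line like the one in the Error Message section and compares it with the limit. The function names and the regular expression are illustrative assumptions and are not part of the appliance tooling.

```python
import re

PANIC_THRESHOLD_S = 1000.0  # limit described above, in seconds


def parse_ntpdate_offset(line):
    """Return the offset in seconds from an ntpdate 'time server' line, or None."""
    match = re.search(r"offset\s+(-?\d+(?:\.\d+)?)\s+sec", line)
    return float(match.group(1)) if match else None


def needs_manual_set(offset_s, threshold_s=PANIC_THRESHOLD_S):
    """True when the clock is too far off for ntpd to correct it."""
    return abs(offset_s) > threshold_s


line = "31 Mar 02:15:14 ntpdate[115412]: step time server xx.xx.xx.xx offset 10.049736 sec"
offset = parse_ntpdate_offset(line)
print(offset, needs_manual_set(offset))
```

For the sample line, the offset is about 10 seconds, so the clock does not need to be set manually.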

The following output shows an abnormal synchronization status:
 

access-appliance:~ # ntpq -c assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 58280  8011   yes    no  none    reject    mobilize  1


reach=no means that the NTP server is not responding to requests or that the network is unavailable. Troubleshoot the network and the NTP server.

Scenario 1: Network issue:
Use ping to check whether the NTP server is reachable, and follow the usual network troubleshooting steps. If a network issue is confirmed, ask the user to engage the network team and to confirm when the issue is fixed.

Scenario 2: Wrong NTP IP or Service issue:
If the NTP server responds to ping, the user may have entered the wrong NTP IP address, or the NTP service itself may have a problem. Confirm with the user that the NTP IP address is correct, or use another NTP server if one is available, and ask the user to engage their admin team to check the service. Rebooting the NTP server sometimes resolves the issue; try this if it is acceptable to the user.

Scenario 3: Windows NTP server:
The Windows Time service does not implement the full NTP feature set. If the user runs a Windows server as the NTP server, rootdisp may be higher than 1000. In that case, configure the Windows NTP server to synchronize with a reliable external NTP server.
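A successful ping in Scenarios 1 and 2 does not prove that the NTP service answers on UDP port 123. The following Python sketch sends a single SNTP client request and reads the stratum and server time from the reply. It is a minimal illustration that assumes IPv4 and an unauthenticated server; the function names are hypothetical and the sketch does not replace ntpq or ntpdate.

```python
import socket
import struct

NTP_PORT = 123
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request():
    """Build a 48-byte SNTP client request (LI=0, VN=3, Mode=3)."""
    return b"\x1b" + 47 * b"\0"


def parse_response(data):
    """Return (stratum, server_unix_time) from an NTP reply packet."""
    if len(data) < 48:
        raise ValueError("short NTP packet")
    stratum = data[1]
    # Bytes 40-47 hold the transmit timestamp: seconds and fraction since 1900.
    seconds, fraction = struct.unpack("!II", data[40:48])
    return stratum, seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def probe(server, timeout=3.0):
    """Send one request; return (stratum, server_time), or None if no reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_request(), (server, NTP_PORT))
            data, _ = sock.recvfrom(512)
        except OSError:  # timeout, unreachable network, or name resolution failure
            return None
    return parse_response(data)
```

Calling probe() with the NTP server address returns None when nothing answers within the timeout, which corresponds to the reach=no case above.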

If reach=yes but condition=reject, use the ntpq assoc and rv commands to check the flash code, dispersion, and rootdisp.
 

vrm:~ # ntpq -c assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  3898  9014   yes   yes  none    reject   reachable  1


Note: The assoc command shows the assid, which the rv command needs later.
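The three states described so far (synchronized, unreachable, and reachable but rejected) can be told apart from the assoc table alone. The following Python sketch parses text in the format shown above and classifies each association. The function names and the state labels are illustrative assumptions, and the parser assumes the nine-column layout shown in this article.

```python
def parse_assoc(text):
    """Parse `ntpq -c assoc` output into a list of dicts, one per association."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have nine columns and start with a numeric index.
        if len(fields) == 9 and fields[0].isdigit():
            rows.append({
                "assid": int(fields[1]),
                "status": fields[2],
                "reach": fields[4],
                "condition": fields[6],
                "last_event": fields[7],
            })
    return rows


def classify(row):
    """Map one association to the cases discussed in this article."""
    if row["reach"] != "yes":
        return "unreachable"
    if row["condition"] == "sys.peer":
        return "synchronized"
    if row["condition"] == "reject":
        return "rejected"
    return "other"


sample = """\
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  3898  9014   yes   yes  none    reject   reachable  1
"""
for row in parse_assoc(sample):
    print(row["assid"], classify(row))
```

The assid printed for a rejected association is the value to pass to rv.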

Use the rv command to get the flash code, dispersion, and rootdisp. Run ntpq without arguments to enter the ntpq shell, then run rv with the assid to display the detailed information.
 

access-appliance:~ # ntpq 
ntpq> rv 3898
associd=3898 status=9014 conf, reach, sel_reject, 1 event, reachable,
srcadr=10.XX.1XX.1X0, srcport=123, dstadr=10.XX.1XX.1X1, dstport=123,
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=1814.209,
refid=10.XX.XX.2X6,
reftime=e0cff348.12fb407d  Wed, Jul 10 2019  5:16:56.074,
rec=e0cff42b.60680b73  Wed, Jul 10 2019  5:20:43.376, reach=377,
unreach=0, hmode=3, pmode=4, hpoll=6, ppoll=6, headway=50,
flash=400 peer_dist, keyid=0, offset=-2536.264, delay=0.354,
dispersion=16.515, jitter=4.414, xleave=0.038,
filtdelay=     0.35    0.29    0.32    0.26    0.28    3.22    0.28    0.35,
filtoffset= -2536.2 -2538.2 -2529.4 -2536.2 -2541.6 -2530.0 -2532.5 -2538.1,
filtdisp=     15.63   16.63   17.59   18.55   19.53   20.53   21.52   22.50

flash=400 peer_dist  >>>>> reason for the rejection
dispersion=16.515    >>>>> error/variance between this NTP server and the client
rootdisp=1814.209    >>>>> total error/variance from the root NTP server to the client


flash=400 peer_dist indicates that the distance to the root NTP server exceeds the threshold, so the server is considered unfit for synchronization.

The flash status codes are:

Code    Message         Description
0001    pkt_dup         duplicate packet
0002    pkt_bogus       bogus packet
0004    pkt_unsync      server not synchronized
0008    pkt_denied      access denied
0010    pkt_auth        authentication failure
0020    pkt_stratum     invalid leap or stratum
0040    pkt_header      header distance exceeded
0080    pkt_autokey     Autokey sequence error
0100    pkt_crypto      Autokey protocol error
0200    peer_stratum    invalid header or stratum
0400    peer_dist       distance threshold exceeded
0800    peer_loop       synchronization loop
1000    peer_unreach    unreachable or nonselect
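The flash value is a hexadecimal bit mask, so more than one code from the table can be set at once. The following Python sketch decodes a flash value into the message names listed above. The function name decode_flash is an illustrative assumption.

```python
FLASH_BITS = {
    0x0001: "pkt_dup",
    0x0002: "pkt_bogus",
    0x0004: "pkt_unsync",
    0x0008: "pkt_denied",
    0x0010: "pkt_auth",
    0x0020: "pkt_stratum",
    0x0040: "pkt_header",
    0x0080: "pkt_autokey",
    0x0100: "pkt_crypto",
    0x0200: "peer_stratum",
    0x0400: "peer_dist",
    0x0800: "peer_loop",
    0x1000: "peer_unreach",
}


def decode_flash(code):
    """Return the names of all bits set in a hexadecimal flash string such as '400'."""
    value = int(code, 16)
    return [name for bit, name in FLASH_BITS.items() if value & bit]


print(decode_flash("400"))   # ['peer_dist']
print(decode_flash("1400"))  # ['peer_dist', 'peer_unreach']
```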

Solution

As a workaround that allows the Access nodes to synchronize with the NTP server, add the following line to /etc/ntp.conf on both nodes:
tos maxdist 20

Then restart the ntpd service on both nodes.
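When the change has to be applied to several nodes, it helps to make the edit repeatable. The following Python sketch works on the text of an ntp.conf file: it updates an existing tos maxdist value or appends the directive if it is missing. The function name set_maxdist and the sample server line are hypothetical. Reading and writing /etc/ntp.conf, taking a backup first, and restarting ntpd are left to the administrator.

```python
def set_maxdist(conf_text, seconds=20):
    """Return conf_text with `tos maxdist <seconds>` set, adding the line if absent."""
    lines = conf_text.splitlines()
    for i, line in enumerate(lines):
        fields = line.split()
        # Match an uncommented tos line where maxdist is followed by a value.
        if fields[:1] == ["tos"] and "maxdist" in fields[1:-1]:
            pos = fields.index("maxdist")
            fields[pos + 1] = str(seconds)
            lines[i] = " ".join(fields)
            break
    else:
        lines.append(f"tos maxdist {seconds}")
    return "\n".join(lines) + "\n"


original = "server ntp.example.com iburst\n"  # hypothetical ntp.conf content
updated = set_maxdist(original)
print(updated, end="")
```

Running the function again on its own output leaves the text unchanged, so the edit is safe to repeat.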

The default threshold is 1.5 seconds; this setting raises it to 20 seconds.
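The effect of the threshold can be illustrated with the rv sample from the Cause section. The following Python sketch approximates the distance that ntpd compares with maxdist as (rootdelay + delay)/2 + rootdisp + dispersion + jitter. This follows the root distance definition in RFC 5905 but omits the small term that grows with the time since the last update, so treat the result as an estimate. ntpq reports these values in milliseconds, while tos maxdist is given in seconds. The function names are illustrative assumptions.

```python
import re


def parse_rv(text):
    """Collect name=value pairs with plain numeric values from ntpq `rv` output."""
    pattern = r"(\w+)=\s*(-?\d+(?:\.\d+)?)(?=,|\s|$)"
    return {name: float(value) for name, value in re.findall(pattern, text)}


def root_distance_ms(rv):
    """Approximate root distance in milliseconds (aging term omitted)."""
    return ((rv["rootdelay"] + rv["delay"]) / 2
            + rv["rootdisp"] + rv["dispersion"] + rv["jitter"])


sample = """\
leap=00, stratum=12, precision=-6, rootdelay=31.250, rootdisp=1814.209,
flash=400 peer_dist, keyid=0, offset=-2536.264, delay=0.354,
dispersion=16.515, jitter=4.414, xleave=0.038,
"""
distance_s = root_distance_ms(parse_rv(sample)) / 1000.0
print(round(distance_s, 3))                  # 1.851
print(distance_s > 1.5, distance_s > 20.0)   # True False
```

For the sample, the estimated distance is about 1.85 seconds: above the default threshold of 1.5 seconds, which matches the peer_dist rejection, and well below the raised threshold of 20 seconds.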

The customer should still work with the network team to find and fix the root cause of the delay between the nodes and the NTP server.
