Problems with ntpd under 6.3

Just wondering if anyone has seen any problems with CPU usage of ntpd on 6.3. We have an intermittent problem where ntpd grabs up to 98% of the CPU. The only solution seems to be to slay the daemon and restart it. We haven’t seen this happen on 6.2.1. Any ideas?

Peter

Hi,
I haven’t encountered this problem on my system.
It would help if you could explain a bit more, so that I can try to reproduce it on my system.
Maybe you could post your ntp.conf file.

BR,
-Dheeraj

Pjy,
Did anything ever come of this? We have had the same problem since starting with QNX6, and QSSL reckon they’ve never heard of a problem with ntpd.
Ned

I have seen this problem, but I don’t have a good solution yet. I would appreciate any information others can offer. I wonder if ntpd needs to be run at higher priority.

-Jim

How often do you see this problem occur, and does the daemon eventually fix itself if left alone?

When it does occur, what’s happening on your system (is it active, just sitting there idle, etc.)?

Is the ntpd daemon sending out a lot of packets when this occurs (you can sniff from another PC, or use tcpdump at a higher priority on the machine)? Is the daemon updating its disk file at this time?

If you want to test the priority theory, the next time it occurs, just log in as root and use the renice command to up the priority of the daemon to see if that fixes things; something like the line below.
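
(I’m going from memory on the syntax, so check “use renice” on your box; the PID here is just an example, and under the Unix nice convention a negative increment is what raises priority.)

===============================
# hypothetical ntpd PID; -n takes an increment in the POSIX form,
# and a negative value requests a higher priority -- run as root
renice -n -10 -p 204814
===============================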

Answers to these might help determine what’s going on.

If the daemon doesn’t fix itself, then it sounds like it’s waiting for something and stuck in an endless poll that it can’t get out of (maybe its DNS server or the remote time server, etc., which is why I asked if it’s sending out packets). For example, under QNX 4 the printer daemon (I forget the name, let’s call it Dev.par) used to consume 98% of the CPU at what appeared to be random times. After tracking this for a while, I figured out it happened when the printer ran out of paper: the daemon was being told the printer was out of paper, but it had no way to display that to the user, so it kept polling to see if the printer had paper in it.

Tim

Thanks for the ideas, Tim.

The problem of excessive CPU usage by ntpd occurs within about 15 minutes to an hour of the QNX console going idle. The system is also running some simple I/O processes (at priority 16) that use two or three percent of the CPU. I am using my desktop PC, running Windows XP, as the NTP server; it is running NTP3 while the QNX system is running NTP4. My PC has two NICs: one for the company net, and one (with a switch) dedicated to the QNX systems. I have up to two QNX systems at a time on that network. I am not using a drift file. My PowerPC system has a CompactFlash card for a disk drive, and nothing writes to it during normal operation.

The problem doesn’t seem to occur while I am using the serial console. It seems to be the result of allowing the console to be idle for some minutes. Sometimes I will see the console become sluggish if I haven’t issued a command for a while. This sluggishness seems to go away (or become barely noticeable) while I am actively using the console, but will return if I leave it idle.

In the early stages of sluggishness, I can slay ntpd and cure the sluggishness, though it may take some minutes for the command to be processed. If the problem has gotten too advanced, both the console and telnet sessions are completely unresponsive and I can’t issue any commands. I have never seen the system correct itself from that state. I have had situations where the keyboard latency reached ten minutes. In those cases, the latency did not decrease with usage, but kept growing until it reached at least days, at which point I gave up waiting.

If I run ntpd at priority 9 or priority 17, the sluggishness problem does not develop and I have not been able to catch ntpd using more than a percent or two of the CPU under those conditions. I will try using renice to see if that can correct the problem once it has started.

When running at priority 9 or 17, over a long period of time I will see (using ntpq) that the QNX system loses synchronization sporadically (there is a space instead of an asterisk in front of the NTP server’s IP address). It will usually regain synchronization, but that may take hours. It can stay unsynchronized for days, even though Wireshark (running on my PC) shows that NTP queries and replies are occurring. My PC is the only NTP server included in the QNX system’s ntp.conf file.

I’m going to experiment further with juggling priorities. Please let me know if you have any other ideas.

-Jim

Tim/Jim

Our experience doesn’t seem to have any parallels with Jim’s. Our application is an airport baggage handling system. We have two QNX 6.3.0 servers in a hot-standby configuration. The duty server has TCP/IP connections to 20+ Siemens PLCs, and communicates with 10-15 Windows clients over UDP. The standby server has only a couple of TCP connections open. The whole application runs at default priority. ntpd is used in server mode to synchronise the clients, using the hardware RTC as the time source.

If ntpd is running on the duty server, at some seemingly random time it will start consuming more and more of the CPU, until connections start to drop out due to timeouts, etc. It can run for days, or even weeks, before it goes bananas. It never “fixes” itself until ntpd is killed, at which time everything goes back to normal. I have never had it happen on the standby, which points to the problem being related to the number of connections/network activity on the machine.

One time only, I had it happen on my development machine, and I was able to get the debugger on it. This seemed to indicate that it was not the ntpd process itself that was consuming the CPU, but something the OS was doing on its behalf, because the machine didn’t speed up when the debugger was attached (stopping the process). It was taking on the order of a minute to single-step one instruction in the debugger.

Needless to say, I currently run ntpd only on the standby :-)

-Ned

Ned,

Your experience with the debugger is interesting. That does seem to point to it not being ntpd itself, or to ntpd being only a part of the problem.

The next time it happens maybe you/Jim can instead run hogs as root to see who is really consuming the CPU time.

hogs -n -p25 -s1 -%1

This will run it at priority 25 (you can try higher if that’s not high enough), update once a second, and show all processes consuming more than 1% of the CPU. That should point out the culprit(s).

Jim,

You might also try running the hogs command, especially since the problem seems to be a lot easier for you to repeat than it is for Ned.

Also, I am not sure why you ran the daemon at priority 9 and 17. Were you just trying to see if running lower than the console and lower than your I/O tasks cured the issue? (It seemed to, since you said the system did not become sluggish, but instead let the time drift.)

Another thing to try is to disconnect the ethernet cable when the sluggishness occurs to see if that cures the problem (or makes it worse). Or try slaying your I/O processes to see if that fixes the problem.

Also, when the problem starts, you might try running sloginfo to see if ntpd or something else is logging system errors.

But I’m curious to see what hogs reports.

Tim

Here is a hogs report for a situation with ntpd running at priority 10:

===============================
[/]# nice -n-1 hogs -n -%1

PID      NAME                MSEC  PIDS  SYSTEM
1                             329   10%     10%
204814   ntpd                2544   83%     84%
299025   command_processor    140    4%      4%

204814   ntpd                2849   94%     94%
299025   command_processor    151    4%      5%

204814   ntpd                2850   94%     95%
299025   command_processor    151    4%      5%
===============================

command_processor is my application. I verified that running without my application did not change the symptoms.

I tried priority 9 so that I could use the console to get information about the system’s state. When that kept ntpd from hogging the CPU, I tried 17 (higher than my application) just to see what would happen; I wasn’t sure if it would suck up 99% or behave itself. Since it behaved itself, I would be satisfied at this point (assuming I can run my I/O thread at 18, ntpd at 17, and the rest of my threads between 11 and 16), except that ntpd is unable to maintain synchronization for more than several hours no matter what priority I use. I have verified that it gets into a state where ntpq shows it is not in sync (no asterisk, and the times differ by seconds) while Wireshark shows apparently normal NTP packets being exchanged. I’m going to look at the upstream time references for my PC; perhaps they are of poor quality.
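
In case it’s useful to anyone, I pin the priority at startup with QNX’s on utility (going from memory on the option; check “use on”, but I believe -p sets the starting priority):

===============================
# start ntpd at priority 17, below my I/O thread at 18
on -p 17 ntpd
===============================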

-Jim

Jim,

It most certainly looks as if something’s broken in ntpd, despite QNX’s insistence otherwise (they are probably just using an open source version of ntpd).

Was there anything in the system log when this occurred (use the sloginfo command)?

You verified that slaying your app doesn’t fix it. But does disconnecting the Ethernet cable help, or cause the daemon to fix itself?

It’s almost like the internal algorithms that ntpd uses to correct for time discrepancies have gone haywire at some point, and are using all the CPU to try to adjust for either a bad timer chip on your CPU or serious network latency (hence the suggestion to disconnect the Ethernet cable). The helpviewer info says it takes 15 minutes to get in sync if you don’t have a drift file; you don’t have one, and your problems start soon after that.
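
If you want to rule that out, giving it a drift file is a one-line addition to the config (the path is just an example; it has to be writable by ntpd):

===============================
# in ntp.conf
driftfile /etc/ntp.drift
===============================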

Tim

I wasn’t able to do much today, and won’t be back in the office until Monday. I will try the cable test then. I did find that ntpd did not leave anything in the system log as it went in and out of synchronization, but that was not with ntpd hogging the CPU.

-Jim

Tim,

Sorry for the delay - I’m only in the office two days/week.

Hogs is the first thing I run when I get a report from operations that the system seems to be slowing down. It shows ntpd as the culprit, but the percentages aren’t as high as Jim’s, usually in the 40-60% range. This may be because our system is doing more than Jim’s.

Given my experience with the debugger, I still think that if anything is broken in ntpd, it is not the code per se, but the way in which it uses kernel facilities. It is indeed the open source ntpd, version 4.1.1. Our local support compiled and supplied 4.2 when trying to solve this problem, but it made no difference.

The code does things which seem unusual to me. It uses both async signals (SIGIO) and select to do its socket I/O. We see errors in syslog showing “Interrupted function call” on sendto and select operations. Unless the code for 4.1 is very different from 4.2, these shouldn’t be logged, as the code seems to ignore that particular errno.
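
For what it’s worth, the defensive pattern I’d expect around that errno (EINTR) is an explicit retry. A minimal sketch, not ntpd’s actual code:

===============================
/* minimal sketch of an EINTR-tolerant select() loop -- not taken
   from ntpd, just the pattern its socket I/O presumably needs */
#include <errno.h>
#include <sys/select.h>

int wait_readable(int fd)
{
    fd_set rfds;
    int rc;

    do {
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        rc = select(fd + 1, &rfds, NULL, NULL, NULL);
    } while (rc == -1 && errno == EINTR);  /* a signal (e.g. SIGIO) woke us */

    return rc;  /* -1 here is a real error, not an interruption */
}
===============================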

So it’s all very strange. I look forward to progress being made with Jim, as his problem seems much more reproducible than mine. I would be very interested to see if he could put the debugger on his ntpd process, to see if his experience parallels mine.

Regards,
Ned

Ned,

I’m surprised you were able to see anything in the debugger. When I do a ‘gdb ntpd’, the debugger tells me there are no debugging symbols available, so I can’t see how much can be gleaned by attaching the debugger when the problem occurs.

The 4.2 version you compiled internally can of course have debug symbols, and is a much better candidate for attaching the debugger to when the problem occurs.

I am wondering about the select call being the culprit. I assume (without seeing the code) it’s meant to wait on data from the sockets connected to the remote time server. If the select starts returning -1 for some reason, the code may end up more or less running ready instead of blocking. I say that because I seem to recall (dim memory here) having this issue with select before: I got a -1 from one of the things I was waiting on, and somehow that error couldn’t be cleared, so select endlessly returned -1 and caused my code to run ready. I had to do something (again, I can’t remember what) to clear the condition and allow select to start functioning properly again. I wonder if this is the issue, and whether you could add printf’s to your 4.2 code to see if select ever gets to that stage; something along the lines of the sketch below.
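
===============================
/* hypothetical instrumentation (names are made up, not ntpd's):
   count back-to-back select() failures and log the errno -- a
   counter that never resets would explain a run-ready spin */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>

static int counted_select(int maxfd, fd_set *rfds)
{
    static int consecutive_errors;
    int rc = select(maxfd + 1, rfds, NULL, NULL, NULL);

    if (rc == -1) {
        if (++consecutive_errors % 1000 == 0)
            printf("select: %d straight failures, errno=%d (%s)\n",
                   consecutive_errors, errno, strerror(errno));
    } else {
        consecutive_errors = 0;
    }
    return rc;
}
===============================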

Tim

Tim,

I think when I did this, it was the 4.2 version, which was compiled with debugging information. However, that is not necessary to see the problem. When the debugger is attached to a running process, it stops the process; in the case of ntpd, this is usually in sigsuspend, where it spends most of its time. If the CPU were being consumed by code within ntpd, one would expect the rest of the system to return to normal when this is done. It does not.
Having attached the debugger, one can then type “nexti” to execute the next machine instruction. This is what was taking significant amounts of time in the particular case I mentioned. When the debugger is attached, the process is not running, so it cannot consume CPU.

This is why I think that something in the way ntpd uses kernel facilities causes the kernel to “get its panties in a knot”. The clean-up done when the process is killed allows it to recover. Simply halting the process does not.

Ned,

Something doesn’t quite ring true with what you just wrote and with your prior comment about hogs.

As you say, attaching the debugger halts the process. So at that point the system should return to normal.

If it’s not, then something else must now be consuming those CPU cycles. I wonder what it is; I’d be curious to know.

The one thing about ntpd that is a mystery to me (since I don’t have the code) is that it gradually adjusts the system time to match what the external source says. Since this is probably an iterative process, it has to play havoc with the internal timers in the system, because they have to handle all kinds of micro adjustments instead of one large time adjustment.
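
(On QNX, a gradual adjustment presumably lands on something like the ClockAdjust() kernel call, which skews the clock by tweaking how much time each tick adds. This is an illustrative sketch assuming the documented struct _clockadjust interface, not anything from ntpd’s code.)

===============================
/* sketch: slew the clock forward 100 ms over the next 10000 ticks
   by adding 10 us per tick (10000 * 10000 ns = 100 ms) */
#include <time.h>
#include <sys/neutrino.h>

int slew_forward_100ms(void)
{
    struct _clockadjust adj;

    adj.tick_count    = 10000;  /* apply the adjustment over this many ticks */
    adj.tick_nsec_inc = 10000;  /* add 10 us of extra time to each tick */
    return ClockAdjust(CLOCK_REALTIME, &adj, NULL);
}
===============================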

I wonder if the problem occurs when ntpd is trying to make all those micro adjustments. You might be able to induce it by letting your system run for a bit, then manually changing the time from a terminal to be a few seconds (or minutes) off from what the NTP server is reporting, and seeing if that causes the system to go haywire. It may be some case of the QNX system getting behind or ahead of the NTP server that causes the issue.

Tim

I can’t help but agree that it’s strange, but I can only call 'em as I see 'em. That’s why I would be very interested to hear what happens in Jim’s system if he puts the debugger on the ntpd process.

If you’re interested in the code, you can get it from ntp.org. However, remember that ntpd is acting as the server in our system; there is no “server” that it is trying to synchronise with.

Regards,
Ned

I’m sorry for the long delay in replying.

I tried disconnecting the Ethernet cable when ntpd was hogging the CPU, and it had no effect. Before that, I checked the Ethernet traffic using Wireshark, and it showed no NTP traffic during that time.

In all of the following tests, I was running ntpd at priority 9, so that I could usually use the console to check the system. There have been times when the console became unresponsive even with ntpd at priority 9, and at those times the system would not respond to a telnet connection request, either. I don’t recall whether I mentioned in my earlier posts that I am using “tinker panic 0” in the ntp.conf file, so that ntpd does not exit even if the synchronization error is large. I am not running the debugger; I have never used it.

I tried various combinations of options in the ntp.conf file, such as changing iburst, burst, maxpoll and minpoll. None were effective.
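
For the record, the relevant lines look roughly like this (the server address is an example, not my real one):

===============================
# example ntp.conf
server 192.168.1.10 iburst minpoll 4 maxpoll 6
# keep running even when the offset is large
tinker panic 0
===============================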

I tried changing the QNX tick size. It had been set to 50 us (the clock interrupt is used for some I/O). I found that changing the tick size on the fly could trigger or cure the problem, but this apparently depended on some internal state of the system. For instance, on an overnight run with a tick size of 250 us, I found one of two boards had ntpd using 95% of the CPU. In setting up that run, I started ntpd after setting the tick size.
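
(In case it matters, a tick-size change boils down to the ClockPeriod() kernel call; a minimal sketch, not necessarily exactly what I run:)

===============================
/* set a 250 us tick via QNX's ClockPeriod() kernel call */
#include <time.h>
#include <sys/neutrino.h>

int set_tick_250us(void)
{
    struct _clockperiod period;

    period.nsec  = 250000;  /* tick length in nanoseconds */
    period.fract = 0;
    return ClockPeriod(CLOCK_REALTIME, &period, NULL, 0);
}
===============================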

Last night I tried setting up the network to use broadcast NTP packets instead of query-response. I used one MPC-5200 system as the broadcaster (it was synchronized to my desktop PC) and two others as broadcast clients. One of the clients is running a 1 ms tick, the other a 50 us tick. Over the course of this morning, the system with the 50 us tick has been experiencing transient episodes of ntpd hogging the CPU, but it recovers. That system is also losing and regaining synchronization, according to ntpq (it shows a space or an asterisk in its report). Loss of sync is correlated with ntpd hogging the CPU. The system I used as the broadcaster did not show any evidence of hogging the CPU, but it is hard to be sure. I may run this again tonight with hogs running, to log the usage on all three systems.
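
The broadcast setup itself is just a few ntp.conf lines; roughly this, with made-up addresses:

===============================
# on the broadcaster (synchronised to my PC)
server 192.168.1.10 iburst
broadcast 192.168.1.255

# on each broadcast client
broadcastclient
disable auth
===============================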

At this point I am considering two solutions. One is to use a longer tick (250 us to 1 ms will work for me) and set up a separate timer interrupt to run my I/O. I will have to devise a means of verifying that this really works, as I have had some hogging incidents with both of those tick sizes. The faint hope here is that the I/O interrupt (which takes at most a few microseconds, but triggers I/O processing threads) is interfering with ntpd somehow, and that separating it from the tick interrupt will effect a cure. The other is to hack into ntpd, cutting out the stuff I don’t need (encryption, variable polling rate, selection of the best time server), and simplifying the algorithm so that it cannot hog the CPU. That is an undesirable choice, but it seems like the option with the best chance of success.

-Jim

Jim,

Did I just read your post right in terms of tick size? You’re using 50 microseconds, and have increased it to somewhere between 250 microseconds and 1 millisecond (vs the default 1 millisecond tick)? What exactly needs that kind of fidelity in your system? Are you polling some hardware at 50 us intervals?

If that’s the case, I can’t say I’m surprised you’re seeing the problem so regularly in a polled setup.

I haven’t looked at the ntpd code, but I do know that when the OS tick size is set that small, strange things can and will occur, because the timeslice is SO small that thrashing can easily result from such frequent OS scheduling, especially once ntpd starts adjusting the time on the system.

Hopefully a slightly longer tick size will help you. But I won’t be surprised if you end up having to modify ntpd code. At least broadcast mode seems to be saving CPU cycles vs polling mode.

Tim

Thanks for those observations, Tim. We are using the timer tick to manage an analog input system. Using the tick interrupt was an expedient that avoided setting up a separate hardware timer interrupt. It worked fine (just a few percent of CPU overhead), so we never got around to changing it. I guess I will have to add a separate timer now; the plan is roughly the sketch below. I think we can tolerate a 1 ms tick in terms of timing precision; 250 us would be plenty of resolution.
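
(The IRQ number here is hypothetical, and the MPC-5200-specific timer programming is omitted; this is just the interrupt-handling skeleton.)

===============================
/* sketch: run the analog I/O off a dedicated hardware timer IRQ
   instead of the OS tick -- the timer itself must be programmed
   separately, which is board-specific and omitted here */
#include <sys/neutrino.h>

#define IO_TIMER_IRQ 9  /* hypothetical: whatever IRQ the spare timer uses */

void io_timer_loop(void)
{
    struct sigevent ev;
    int id;

    ThreadCtl(_NTO_TCTL_IO, 0);  /* need I/O privileges to attach */

    SIGEV_INTR_INIT(&ev);
    id = InterruptAttachEvent(IO_TIMER_IRQ, &ev, _NTO_INTR_FLAGS_TRK_MSK);
    if (id == -1)
        return;

    for (;;) {
        InterruptWait(0, NULL);        /* block until the timer fires */
        /* ... kick the analog input processing threads here ... */
        InterruptUnmask(IO_TIMER_IRQ, id);
    }
}
===============================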

-Jim