Application hangs on ConnectAttach()

I’m running an application which hangs after about 2 hours, always at the same place: a call to name_open(). The application is multithreaded and this function is called many times. Could this be caused by a stack overflow, or do you have any other suggestions?

(gdb) bt
#0 0xb032c0cf in ConnectAttach ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
#1 0xb0347b89 in _connect_request ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
#2 0xb03478ff in _connect_request ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
#3 0xb0348058 in _connect_ctrl ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
#4 0xb03471fd in _connect_entry ()
from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
#5 0xb0347028 in _connect () from /usr/qnx630/target/qnx6/x86/lib/libc.so.2
#6 0xb0332213 in name_open () from /usr/qnx630/target/qnx6/x86/lib/libc.so.2

By “hang”, do you mean it never returns?

The other process, the one you are opening a connection to, must be able to receive and reply to an _IO_CONNECT message. The problem is probably on the other end of the connection.

Since the function is called “a lot of times”, I’m going to assume that you’ve remembered to close the connection each time before re-opening it.

Otherwise you might be running out of file descriptors.
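A minimal sketch of keeping every name_open() paired with a name_close(), so that a failure in between cannot leak a connection. The helper with_connection() and the callback noop_work() are hypothetical names, not part of QNX; the stand-ins in the #else branch exist only so the sketch builds off-target and are not the real library calls.

```c
#include <errno.h>

#ifdef __QNXNTO__
#include <sys/iofunc.h>
#include <sys/dispatch.h>   /* name_open(), name_close() */
#else
/* Hypothetical stand-ins so the sketch builds off-target; they only
 * count how many connections are currently open. */
static int open_connections = 0;
static int name_open(const char *name, int flags)
{
    (void)name; (void)flags;
    return ++open_connections;
}
static int name_close(int coid)
{
    (void)coid;
    --open_connections;
    return 0;
}
#endif

/* Hypothetical callback: a real one would MsgSend() on the connection. */
int noop_work(int coid)
{
    return coid >= 0 ? 0 : -1;
}

/* Hypothetical helper: open a connection to `name`, run `work` on it,
 * and close it again on every path. Returns work()'s result, or -1 if
 * the open itself failed. */
int with_connection(const char *name, int (*work)(int coid))
{
    int coid = name_open(name, 0);
    if (coid == -1) {
        return -1;              /* errno was set by name_open() */
    }
    int rc = work(coid);
    int saved = errno;          /* keep work()'s errno across the close */
    name_close(coid);
    errno = saved;
    return rc;
}
```

Calling with_connection("myname", noop_work) in a loop should leave the number of open connections unchanged, which is the property to confirm with ‘sin fd’ on the target.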

Tim

Mario: Yes, name_open() never returns. I believe the server side is correct: it replies EOK to the received message with MsgReply(). And since it can run for a couple of hours, there shouldn’t be any problem with the connection…

Tim: Yes I’m closing it before re-opening it.

Really confused about this.

Mr_Hq,

You can definitely check the number of open fds by running the ‘sin fd’ command when the problem occurs.

Anyway, I was looking again at your original posting and I think I see a potential problem. Your backtrace from GDB indicates you’re stuck in ConnectAttach(). But you indicated you’re running a multi-threaded app. name_open() should be using ConnectAttach_r() (the thread-safe version).

Are you by any chance doing name_open() calls from various threads? If so, you may be running into a threading problem. There doesn’t seem to be any option to make name_open() use a thread-safe version, so you may have to alter your design so that only one thread does the name_open() call, protect the name_open() call with a mutex, or ask QNX if there is a thread-safe version (one would have hoped they would have compiled all their libraries to use the thread-safe versions by default).

Tim

name_open() is thread safe according to QNX, check the bottom of this page.

qssl.com/developers/docs/6.3 … _open.html

Mr_Hq,

Yes, I know it is, as I looked at the same docs. But it’s clearly calling routines which aren’t thread safe. The reason the _r family of routines exists for a lot of calls is thread safety.

Someone goofed, it seems, either in the docs or in the calls inside name_open().

Since you’re stuck and have nothing else to try, I’d recommend putting a mutex around your name_open() call so that only one thread at a time can be calling it, and seeing if that makes the problem go away.
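A minimal sketch of that mutex wrapper, assuming POSIX threads. locked_name_open() and name_open_lock are hypothetical names; the stand-in name_open() in the #else branch exists only so the sketch builds off-target.

```c
#include <errno.h>
#include <pthread.h>

#ifdef __QNXNTO__
#include <sys/iofunc.h>
#include <sys/dispatch.h>   /* name_open() */
#else
/* Hypothetical stand-in so the sketch builds off-target. */
static int name_open(const char *name, int flags)
{
    (void)name; (void)flags;
    return 3;
}
#endif

/* One process-wide lock shared by every thread that opens the name. */
static pthread_mutex_t name_open_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wrapper: lets only one thread at a time be inside
 * name_open(). Returns whatever name_open() returned. */
int locked_name_open(const char *name, int flags)
{
    pthread_mutex_lock(&name_open_lock);
    int coid = name_open(name, flags);
    int saved = errno;                  /* preserve name_open()'s errno */
    pthread_mutex_unlock(&name_open_lock);
    errno = saved;
    return coid;
}
```

Note that if name_open() itself blocks while the lock is held, every other thread calling locked_name_open() will wait behind it, so this serializes the calls but does not by itself cure a hang.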

The other options are that the problem is in the server someplace (which you believe it’s not) or that you are leaking fds in some manner (which you can easily check for using ‘sin fd’ when you get stuck).

Note that unless your server is doing multi-threaded receives on the other end, the following could be happening:

1. Thread 1 in your app does a send call to the server.
2. The server’s receive call gets the send and starts to do the work (i.e. no immediate reply).
3. Thread 2 in your app does the name_open() call to the server (which does a connect).
4. The server can’t respond with the EOK because its receive thread is busy, so thread 2 appears hung on your side.

Tim

I don’t think that’s true; _r is mostly about how a function returns its error code or data. In some cases it does make the function thread safe, because the function has extra arguments that point to a buffer to store results in instead of using a static buffer.

Check with pidin what your process is doing; I bet it’s REPLY BLOCKED on the server.

- Mario

I call name_open() in a few threads, but each one calls it only once, at program start, and I have never had any errors with it. If you call name_open() and name_close() many times within a short period, and the thread making those calls has a higher priority than the gns server, then gns might not get the time to release resources (after name_close()), and you would end up with the problem you have.

Tim:
I’ve tried with a mutex around the name_open() call now and the error is still there. Unfortunately I cannot run sin when the error occurs; I can see that the process has started but it just hangs, probably because the CPU load has risen to max. My main application goes from a couple of percent to over 40 when it hangs. I’ve run sin during normal program execution and can’t see that the number of fds has increased.

Mario:
No, it’s not REPLY BLOCKED, it’s in the READY state.

Mr_Hq,

What priority is your process running at? Is your shell running directly on your target or are you using Momentics to view what’s going on? I’d suggest that however you’re connecting, you need to increase the priority of your shell (use the renice command) or connection so you can see what’s going on when this occurs.

I’d also be interested in seeing what ‘pidin’ returns in addition to the sin command. As qnxloader mentioned, the connect actually happens through a process called gns. So I’d want to know what state that process is in too (you might want to start gns manually in another window with the -v option to see if it reports anything).

One more thing. Are you forced to reboot when this happens, or can you eventually get control? If you can get control, I’d like to know if there are any errors in the system log (sloginfo command) that occur when this problem starts.

Tim

P.S. Also, what version of the O/S and compiler are you using?

Priority is set to 10, and I’m not running gns. It’s written in QNX 4 style.

I can get control over the machine again; I just have to kill the main app process, which uses a lot of the CPU. I tried sloginfo and it returns lots of:

Sev Major Minor Args
2 8 0 Service cmd 0x22 returned errno 22

but it does that all the time, not just when the error occurs. Do you know what causes this?

O/S is Neutrino 6.3.0
and an old compiler 2.95.3

Mr_Hq,

I’ve never used name_open() without gns running. I wasn’t even aware it would work without gns in 6.3.

It’s entirely possible there is a bug someplace without gns running.
Of course, without seeing your code (driver and client) or doing some tests it’s impossible to know for sure. Is there a reason why you’re not using gns? Maybe just start it and see if this fixes your problem (you don’t need any special code).

As for the error 22 (ERANGE) I have no idea. I don’t have it on my system (and I use name_open) so I am not sure who/what’s generating it.

Tim

Error code 22 is in decimal, so it’s EINVAL.

My bad. I saw the ‘Service cmd 0x22 returned errno 22’ and had a dyslexic moment :)

Tim

Maybe cmd=0x22, errno=22? :)
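For reference, the two readings of “22” can be checked with a few lines of C. This is only a sketch: errno_name() and print_errno() are hypothetical helpers, and it assumes the usual values EINVAL = 22 and ERANGE = 34 (0x22), which should be confirmed against the target’s own <errno.h>.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: name the two codes discussed in this thread. */
const char *errno_name(int code)
{
    switch (code) {
    case EINVAL: return "EINVAL";
    case ERANGE: return "ERANGE";
    default:     return "other";
    }
}

/* Print a code in decimal and hex, with its name and system message. */
void print_errno(int code)
{
    printf("%d (0x%x) = %s: %s\n",
           code, (unsigned)code, errno_name(code), strerror(code));
}
```

Calling print_errno(22) and print_errno(0x22) side by side shows the difference: sloginfo’s “errno 22” is decimal, while the “cmd 0x22” on the same line is hexadecimal.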

Hm…

Are you sure name_open() never returned? Or is it just your program calling name_open() again and again in a loop?

Did you check the return value of name_open()? If name_open() reports an error, what is it?
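One way to tell the two cases apart is to log before and after each call. A sketch, assuming a single calling thread (the counter is not synchronised); traced_name_open() and open_attempts are hypothetical names, and the stand-in name_open() in the #else branch, which always fails with ENOENT, exists only so the sketch builds off-target.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#ifdef __QNXNTO__
#include <sys/iofunc.h>
#include <sys/dispatch.h>   /* name_open() */
#else
/* Hypothetical stand-in so the sketch builds off-target: always fails. */
static int name_open(const char *name, int flags)
{
    (void)name; (void)flags;
    errno = ENOENT;
    return -1;
}
#endif

/* Number of calls made so far; not synchronised between threads. */
static unsigned long open_attempts = 0;

/* Hypothetical wrapper: logs before and after name_open(). A "calling"
 * line with no result line after it means the call never returned;
 * a fast-growing attempt count means the caller is looping. */
int traced_name_open(const char *name, int flags, FILE *log)
{
    ++open_attempts;
    fprintf(log, "name_open(%s) attempt %lu: calling\n", name, open_attempts);
    fflush(log);                /* make sure the line survives a hang */

    int coid = name_open(name, flags);
    int saved = errno;

    if (coid == -1) {
        fprintf(log, "  failed: errno %d (%s)\n", saved, strerror(saved));
    } else {
        fprintf(log, "  returned coid %d\n", coid);
    }
    fflush(log);
    errno = saved;              /* logging must not clobber errno */
    return coid;
}
```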

Yeah, I’m pretty sure that it never returns (unless there is an extremely long delay…). I haven’t got any error report from name_open(), and I logged the result. The thing is that it doesn’t give any return value at all; it’s just blocked in some way.

Mr_Hq,

The next time it happens (assuming you’re regularly trying to make it happen in order to diagnose the issue further), try slaying (or exiting from) the server task that did the name open.

I wonder if that would let your app finally ‘return’ from the name_open call.

Also, did you try starting gns to see if that affected things one way or another?

Tim

Hi,

On the server side, you said that you correctly handle the _IO_CONNECT message, but you must also handle _PULSE_CODE_DISCONNECT, otherwise you run out of file descriptors after some time. See the example in the documentation of name_attach():

[code]
/* Receive loop; attach is the pointer returned by name_attach() */
while (1) {
    rcvid = MsgReceive(attach->chid, &msg, sizeof(msg), NULL);

    if (rcvid == -1) { /* Error condition, exit */
        break;
    }

    if (rcvid == 0) { /* Pulse received */
        switch (msg.hdr.code) {
        case _PULSE_CODE_DISCONNECT:
            /*
             * A client disconnected all its connections (called
             * name_close() for each name_open() of our name) or
             * terminated
             */
            ConnectDetach(msg.hdr.scoid);
            break;
        case _PULSE_CODE_UNBLOCK:
            /*
             * REPLY blocked client wants to unblock (was hit by
             * a signal or timed out).  It's up to you if you
             * reply now or later.
             */
            break;
        default:
            /*
             * A pulse sent by one of your processes or a
             * _PULSE_CODE_COIDDEATH or _PULSE_CODE_THREADDEATH
             * from the kernel?
             */
            break;
        }
        continue;
    }

    /* name_open() sends a connect message, must EOK this */
    if (msg.hdr.type == _IO_CONNECT) {
        MsgReply(rcvid, EOK, NULL, 0);
        continue;
    }

    /* ... handle the remaining message types here ... */
}[/code]

Regards,
Albrecht