QNET - MsgReplyv() blocked forever

I’m using Qnet to pass messages between two nodes, A and B.
At the moment this is only in one direction.
On the Rx side, node B uses a resource manager to handle the incoming packets.
On the Tx side, node A uses open() and write().
At some point (last time it was after 250,000 messages, sometimes after fewer than 1,000), the Rx thread on node B blocks forever; see the call stack below.
It appears a deadlock has occurred.

NODE B:-
8 MsgReplyv() 0xfe33ebb4 - blocked by its own process in the reply state.
7 resmgr_msgreplyv() 0xfe34be34
6 _resmgr_handler() 0xfe34db08
5 _resmgr_msg_handler() 0xfe337508
4 _message_handler() 0xfe336680
3 dispatch_handler() 0xfe3352d0
2 RxThread() 0x4804d824

sloginfo also reveals a qnet error:
qnet(kif): pulse_done_id(): MsgReply(327714) failed (No such process)

MsgReply() should be talking to the process ID on Node A - which is not the process ID reported in sloginfo.

It appears the context block (including the process ID) received by the resource manager and used by MsgReplyv() has been corrupted.
Can anyone confirm that this is the most likely cause of failure?

Thanks.

While you’ve said quite a bit here, there is still not much to go on. The IDs shown below all start with 0xfe, which suggests to me that corruption has not taken place. If corruption had taken place, I would expect an abort to be more likely, or some strange error code.

In a very simple interaction where one process/thread sends a message and another receives it, things are hard to get wrong. The one place they often do go wrong is when the receiver receives but doesn’t reply.

In a more complex interaction with multiple threads in a process and the use of pulses, one must be careful that one hasn’t created the blockage.

My best guess, which is all it is, is that you have a race condition that causes this. Since it doesn’t happen very often, your best line of attack is a careful review of what the code does at each related message passing point.

Thanks for the input.
Yes, I agree it looks like a race condition.
What I have not figured out is the number that qnet reports in the sloginfo log, i.e.
qnet(kif): pulse_done_id(): MsgReply(327714) failed (No such process)

This looks like a pid, but no such pid exists on the local or remote node, which is why I also suspected corruption.
The qnet log reports a different number on each failure.
I tried to get the qnet source to find out exactly what is reported in the log when this error occurs, but it appears that the software is no longer open source.

Is your application multithreaded? (I see there is a function called RxThread.) Are the dpp and channel used by some other thread? Is it possible the message has already been replied to?

The resource manager is single-threaded at this point, since we only have a single peer-to-peer connection.
However, this may change in the future and we’ll modify the resource manager to be multi-threaded.

I have managed to stop the problem from occurring but I don’t fully understand why.

Originally, when io_write() was invoked by _resmgr_io_handler() in the context of the RxThread, io_write() would do some minimal validation before passing the message upward via mq_timedsend().

8 io_write() 0x4804d364
7 _resmgr_io_handler() 0xfe34e248
6 _resmgr_handler() 0xfe34dd88
5 _resmgr_msg_handler() 0xfe337508
4 _message_handler() 0xfe336680
3 dispatch_handler() 0xfe3352d0
2 NodeMessageRxThread() 0x4804da28

_resmgr_handler() would then occasionally block on the reply, as seen in the earlier post, i.e.

8 MsgReplyv() 0xfe33ebb4 - blocked by its own process in the reply state.
7 resmgr_msgreplyv() 0xfe34be34
6 _resmgr_handler() 0xfe34db08
5 _resmgr_msg_handler() 0xfe337508
4 _message_handler() 0xfe336680
3 dispatch_handler() 0xfe3352d0
2 RxThread() 0x4804d824

The change I made was to introduce a worker thread.
io_write() (still in the context of RxThread) now copies the incoming message into a protected global buffer and returns.

The worker thread copies the message, does the validation and calls mq_timedsend().

This code now works perfectly and I have no more issues.
But I’m still at a loss as to why I need the extra copy and the extra thread.