I’m using Qnet to pass messages between two nodes, A and B.
At the moment messages flow in one direction only, from A to B.
On the Rx side node B uses a resource manager to handle the incoming packets.
On the Tx side node A uses open() and write().
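A minimal sketch of a Tx path of this kind is shown here, for illustration only. It is not the actual sender code: the function name send_to_path() and the example path are hypothetical, and it assumes the resource manager on node B is reachable through the /net namespace that Qnet provides. It opens and closes the path for every message to stay self-contained; a real sender would more likely keep the descriptor open. The point is that every write() result is checked, so a failure on the Tx side is not silently lost.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path`, send one message with write(), close.
 * Returns 0 if the whole message was accepted, -1 otherwise (errno is set).
 * On Qnet the path would be something like "/net/nodeB/dev/rxmgr"
 * (an assumed name, not taken from the original setup).
 */
int send_to_path(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;                 /* errno set by open() */

    ssize_t n = write(fd, buf, len);
    int saved = (n < 0) ? errno : 0;
    close(fd);

    if (n < 0) {
        errno = saved;             /* report the write() failure, not close() */
        return -1;
    }
    if ((size_t)n != len) {
        errno = EIO;               /* short write: message was truncated */
        return -1;
    }
    return 0;
}
```

A call would look like send_to_path("/net/nodeB/dev/rxmgr", msg, sizeof msg), with the return value and errno logged on failure so Tx-side errors can be matched against the sloginfo output on node B.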
At some point (last time it was after 250,000 messages, sometimes fewer than 1,000), the Rx thread (node B) blocks forever; see the call stack below.
It appears that a deadlock has occurred.
8 MsgReplyv() 0xfe33ebb4 - blocked by its own process in the reply state.
7 resmgr_msgreplyv() 0xfe34be34
6 _resmgr_handler() 0xfe34db08
5 _resmgr_msg_handler() 0xfe337508
4 _message_handler() 0xfe336680
3 dispatch_handler() 0xfe3352d0
2 RxThread() 0x4804d824
sloginfo also reveals a qnet error:
qnet(kif): pulse_done_id(): MsgReply(327714) failed (No such process)
MsgReply() should be replying to the process ID on node A, which is not the process ID reported in sloginfo.
It appears that the context block (including the process ID) received by the resource manager and used by MsgReplyv() has been corrupted.
Can anyone confirm that this is the most likely cause of the failure?