TCP sending FIN out of sync

I would like some insight into why a QNX box (acting as the server) sometimes sends a TCP FIN out of sequence. Below are two Ethereal captures, one of a good transaction and one of a bad transaction. The host is a Windows application that runs exactly the same code in both cases; only sometimes does the QNX box misbehave.

GOOD:

No. Time Source Destination Protocol Info
3879 260.485094 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [SYN] Seq=0 Ack=0 Win=65535 Len=0 MSS=1460
3880 260.485247 172.16.9.52 172.16.9.73 TCP 24000 > 4188 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460
3881 260.485271 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=0
3882 260.485533 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [PSH, ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=33
3883 260.502788 172.16.9.52 172.16.9.73 TCP 24000 > 4188 [PSH, ACK] Seq=1 Ack=34 Win=17520 Len=19
3884 260.709418 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [ACK] Seq=34 Ack=20 Win=65516 [CHECKSUM INCORRECT] Len=0
3885 262.534701 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [FIN, ACK] Seq=34 Ack=20 Win=65516 [CHECKSUM INCORRECT] Len=0
3886 262.534856 172.16.9.52 172.16.9.73 TCP 24000 > 4188 [ACK] Seq=20 Ack=35 Win=17520 Len=0
3887 262.536383 172.16.9.52 172.16.9.73 TCP 24000 > 4188 [FIN, ACK] Seq=20 Ack=35 Win=17520 Len=0
3888 262.536411 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [ACK] Seq=35 Ack=21 Win=65516 [CHECKSUM INCORRECT] Len=0

BAD:

No. Time Source Destination Protocol Info
3871 258.423211 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [SYN] Seq=0 Ack=0 Win=65535 Len=0 MSS=1460
3872 258.423364 172.16.9.52 172.16.9.73 TCP 24000 > 4187 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460
3873 258.423388 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=0
3874 258.423657 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [PSH, ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=33
3875 258.427683 172.16.9.52 172.16.9.73 TCP 24000 > 4187 [FIN, ACK] Seq=1 Ack=34 Win=17520 Len=0
3876 258.427729 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [ACK] Seq=34 Ack=2 Win=65535 [CHECKSUM INCORRECT] Len=0
3877 260.442615 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [FIN, ACK] Seq=34 Ack=2 Win=65535 [CHECKSUM INCORRECT] Len=0
3878 260.442799 172.16.9.52 172.16.9.73 TCP 24000 > 4187 [ACK] Seq=2 Ack=35 Win=17520 Len=0
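
One way to compare the two captures is to check which side sends the first FIN and how many payload bytes the server had sent by then. The sketch below does this for the capture lines above (only the frames up to the first FIN are copied in). The regular expression, the SERVER constant, and the function name first_fin are illustrative choices, not part of Ethereal or any other tool.

```python
import re

# Frames up to and including the first FIN, copied from the captures above.
GOOD = """\
3879 260.485094 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [SYN] Seq=0 Ack=0 Win=65535 Len=0 MSS=1460
3880 260.485247 172.16.9.52 172.16.9.73 TCP 24000 > 4188 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460
3881 260.485271 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=0
3882 260.485533 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [PSH, ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=33
3883 260.502788 172.16.9.52 172.16.9.73 TCP 24000 > 4188 [PSH, ACK] Seq=1 Ack=34 Win=17520 Len=19
3884 260.709418 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [ACK] Seq=34 Ack=20 Win=65516 [CHECKSUM INCORRECT] Len=0
3885 262.534701 172.16.9.73 172.16.9.52 TCP 4188 > 24000 [FIN, ACK] Seq=34 Ack=20 Win=65516 [CHECKSUM INCORRECT] Len=0
"""

BAD = """\
3871 258.423211 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [SYN] Seq=0 Ack=0 Win=65535 Len=0 MSS=1460
3872 258.423364 172.16.9.52 172.16.9.73 TCP 24000 > 4187 [SYN, ACK] Seq=0 Ack=1 Win=16384 Len=0 MSS=1460
3873 258.423388 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=0
3874 258.423657 172.16.9.73 172.16.9.52 TCP 4187 > 24000 [PSH, ACK] Seq=1 Ack=1 Win=65535 [CHECKSUM INCORRECT] Len=33
3875 258.427683 172.16.9.52 172.16.9.73 TCP 24000 > 4187 [FIN, ACK] Seq=1 Ack=34 Win=17520 Len=0
"""

SERVER = "172.16.9.52"  # the QNX box, listening on port 24000

# frame, time, source, destination, then the TCP flags in the first [...] group
# and the payload length from the Len= field.
LINE = re.compile(
    r"^(?P<frame>\d+) (?P<time>\S+) (?P<src>\S+) (?P<dst>\S+) TCP "
    r"\d+ > \d+ \[(?P<flags>[^\]]+)\].*Len=(?P<len>\d+)"
)


def first_fin(trace):
    """Return (frame, sender, payload bytes the server had sent) at the first FIN."""
    server_bytes = 0
    for line in trace.splitlines():
        m = LINE.match(line)
        if m is None:
            continue
        if "FIN" in m["flags"].split(", "):
            return int(m["frame"]), m["src"], server_bytes
        if m["src"] == SERVER:
            server_bytes += int(m["len"])
    return None


print(first_fin(GOOD))  # (3885, '172.16.9.73', 19)
print(first_fin(BAD))   # (3875, '172.16.9.52', 0)
```

In the good capture the client closes first, after the server's 19-byte reply. In the bad capture the server sends its FIN about 4 ms after the 33-byte request, without having sent any reply.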

Did your server call close() or shutdown(, SHUT_WR) on the
socket?

-seanb
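
For reference, here is a minimal loopback sketch of what that question is getting at: when the server side calls shutdown() with SHUT_WR (or close()), the stack sends a FIN and the client sees end-of-stream. It is written in Python for brevity; the same BSD socket calls exist in C. The function name, the loopback address, and the payload contents are made up for illustration; only the 33- and 19-byte sizes come from the captures.

```python
import socket


def reply_then_half_close(close_early):
    """Run one request/response over loopback; return what the client receives."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()

    client.sendall(b"x" * 33)            # 33-byte request, as in the captures
    server.recv(1024)
    if not close_early:
        server.sendall(b"y" * 19)        # 19-byte reply, as in the GOOD capture
    server.shutdown(socket.SHUT_WR)      # FIN goes out here

    received = b""
    while True:
        chunk = client.recv(1024)
        if not chunk:                    # b"" means the peer's FIN has arrived
            break
        received += chunk

    for s in (client, server, listener):
        s.close()
    return received


print(len(reply_then_half_close(close_early=False)))  # 19
print(len(reply_then_half_close(close_early=True)))   # 0
```

The second call mirrors the bad capture: the FIN arrives with no reply in front of it, which is what you would expect if something on the server closed the socket before the reply was written.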

Hmm, my server in this case is the QNX box. From some of the debug prints I have, I can see that at this point in time a ConnectAttach fails with an invalid-process error. However, the ConnectAttach is called with zero for the process ID parameter and with the return value of a ChannelCreate that was called in a thread spawned from the same function as the failing ConnectAttach.

-Eric

I have a new question after some more research:

I think the ConnectAttach is dying because the channel ID created by ChannelCreate is no longer valid. That channel exists only so that the thread responsible for transmitting messages out of a TCP socket can receive those messages from other tasks/processes. So I am curious what the connection would be between creating a channel for internal QNX messaging and the TCP stack seemingly closing the socket, since the two happen at the same time.

To be more specific, the ConnectAttach is failing with this message:
“No such process”

-eric

These sound like different issues but they may have the same
root cause. Could your server be dying / exiting? Look in
/var/dumps for a core file. If it’s dying / exiting the tcp/ip stack
will be sent a disconnect pulse and close its sockets and any
channels it created will no longer be valid.

-seanb
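
The effect described above can be reproduced on an ordinary Unix system, as a rough analogy only: a process that exits without ever calling close() or shutdown() still has its sockets closed for it, so the peer sees a FIN. On QNX the stack is a separate process that is notified when a client dies, whereas in this sketch the kernel does the cleanup, and the use of os.fork() and the function name are assumptions made for the demonstration.

```python
import os
import socket


def client_view_after_server_exit():
    """Return what the client reads after the 'server' process exits abruptly."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()

    pid = os.fork()
    if pid == 0:
        # Child plays the server process: it exits immediately and never
        # calls close() or shutdown() on the accepted socket.
        os._exit(0)

    # Parent plays the client. Drop its copies of the server-side sockets so
    # that the child is their only owner.
    server.close()
    listener.close()

    data = client.recv(1024)             # returns b"" once the FIN arrives
    os.waitpid(pid, 0)
    client.close()
    return data


print(client_view_after_server_exit())   # b''
```

From the client's side this looks the same as an explicit close(), which is why a capture alone cannot tell the two apart.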

I thought of that; there are no new core files in /var/dumps for my tasks.

If I look at the problems separately: I have a main that calls ChannelCreate in a thread; the call doesn’t fail and gives me a channel ID of 10. That “CreateChannel” thread then spawns a receive thread, which does a ConnectAttach to the main parent (which doesn’t fail either) and then does a MsgSend to the parent to hand over the ChannelCreate return ID of 10. The main parent then does a ConnectAttach on that channel ID (10 in this case) so it can communicate with a transmit thread, which the “CreateChannel” thread creates after it has spawned the receive thread.
When a socket disconnects, the receive thread gets the pulse and does a MsgSend to the parent to let it know (the receive thread then exits), and the parent does a MsgSendPulse to the transmit thread to tell it to close.

This all works fine MOST OF THE TIME, but once in a while the ConnectAttach in the parent thread on the channel created by the “CreateChannel” thread (10 in the example above) fails with the errno message “No such process”.

Can I somehow monitor or query the state of the channel number returned by ChannelCreate?

-eric

I FIXED IT!!

The issue was that QNX’s ChannelCreate was sometimes returning a channel number that had not been destroyed yet. When my other threads, which were still cleaning up from the last disconnect, lagged behind the new incoming connection, they would destroy a channel that happened to have the same number as the newly returned one.

Eric
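
The hazard here is that the same number ends up naming both the old and the new channel, so cleanup code that holds only the number cannot tell them apart. Below is a toy model of one common guard: pair each ID with a generation counter and refuse to destroy through a stale handle. ChannelTable and all of its methods are hypothetical and do not call the QNX API. The model also assumes the allocator hands out the lowest free number and that the old channel had already been released once before the late cleanup ran.

```python
class ChannelTable:
    """Toy ID allocator that reuses the lowest free number (an assumption)."""

    def __init__(self):
        self._live = {}          # channel id -> generation of its current owner
        self._next_gen = 0

    def create(self):
        chid = 1
        while chid in self._live:
            chid += 1
        self._next_gen += 1
        self._live[chid] = self._next_gen
        return chid, self._next_gen          # handle = (id, generation)

    def destroy_by_number(self, chid):
        """What a lagging cleanup thread does if it only remembers the number."""
        return self._live.pop(chid, None) is not None

    def destroy(self, handle):
        """Destroy only if the handle still names the channel it was issued for."""
        chid, gen = handle
        if self._live.get(chid) != gen:
            return False                     # stale handle: the id was reused
        del self._live[chid]
        return True

    def is_live(self, handle):
        chid, gen = handle
        return self._live.get(chid) == gen


def late_cleanup(guarded):
    """Replay: old connection torn down, new one arrives, old cleanup runs late."""
    table = ChannelTable()
    old = table.create()                     # channel for connection 1
    table.destroy(old)                       # first teardown of connection 1
    new = table.create()                     # connection 2 gets the same number
    if guarded:
        table.destroy(old)                   # rejected: generation mismatch
    else:
        table.destroy_by_number(old[0])      # destroys connection 2's channel
    return old[0] == new[0], table.is_live(new)


print(late_cleanup(guarded=False))  # (True, False): new channel was destroyed
print(late_cleanup(guarded=True))   # (True, True): new channel survives
```

In the real program the equivalent fix is either to make sure the old connection's cleanup has finished before a new channel is created, or to have the cleanup threads verify that the channel they are about to destroy is still the one they own.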