Detecting loss of a TCP connection when other end has restarted

John_Nagle1 · May 4, 2004, 11:55pm

[QNX 6.21]

I have a program that opens a TCP connection to another
device, a Galil motion controller. This device can reset,
in which case it loses all its TCP connections but will open
new ones.

When this happens, my program gets ordinary “timeout” returns
from “select”. I don’t get any indication that the socket
has been disconnected, even though I check for “socket
exceptions”… In addition, if I write to the
closed connection with “send”, I don’t get an error return.

When the other end of the connection has reset, the next
TCP message should be rejected with an RST, so the protocol
does understand a reset at the other end. If the entire
machine at the other end has reset, the TCP protocol
should reply to any message with an RST. So at the protocol
level, it’s known that the connection has closed.

Understand, this is not the case where the other end
“went away”. That has to time out, which takes a long time.
This is the case where the other end restarted, which is
detectable and recoverable.

Is there some socket option or ioctl command applicable here?

John Nagle
Team Overbot

Warren_Deitch1 · May 5, 2004, 1:21am

John,
I have had a similar problem. select() will return as though data was
present, but a read() returns 0 bytes (until socket timeout) instead of
EBADF.
Since I am in a loop once I have been select()ed (big messages, small
buffers), I implemented a counter: any select() that resulted in a read
of 0 bytes where the total for that ‘read session’ was also 0 bytes (not
just a single read of 0 bytes, which is legal) was interpreted as a
‘tear down’ situation.

A clumsy workaround but …

Bill_Caroselli1 · May 5, 2004, 12:53pm

Hi Warren

I guess a read on a connected socket can return 0 bytes, but if you wake
up from a select and are told that there is data on a socket I don’t think
that it should be allowed to return 0 bytes, unless the connection is now
closed.

–
Bill Caroselli – Q-TPS Consulting
1-(708) 308-4956 <== Note: New Number
qtps@earthlink.net

Wojtek_Lerch1 · May 5, 2004, 2:51pm

Bill Caroselli wrote:

> I guess a read on a connected socket can return 0 bytes, but if you wake
> up from a select and are told that there is data on a socket I don’t think
> that it should be allowed to return 0 bytes, unless the connection is now
> closed.

No, the return from select() doesn’t promise that there’s a non-zero
amount of data, only that a read() call would not block. For terminals,
sockets and pipes, that includes a pending EOF. For regular files, it’s
always true.

This is how POSIX explains it:

A descriptor shall be considered ready for reading when a call to an
input function with O_NONBLOCK clear would not block, whether or not the
function would transfer data successfully. (The function might return
data, an end-of-file indication, or an error other than one indicating
that it is blocked, and in each of these cases the descriptor shall be
considered ready for reading.)

(http://www.opengroup.org/onlinepubs/009695399/functions/pselect.html)
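
The POSIX wording above can be illustrated with a short C sketch. It assumes a connected stream socket; `wait_and_read` and the `READ_*` constants are hypothetical names, not part of any library. Once select() reports the descriptor readable, a recv() result of 0 is treated as end-of-file (the peer closed the connection), which is kept distinct from a select() timeout.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical result codes; positive return values are byte counts. */
#define READ_EOF      (0L)   /* readable, but recv() returned 0: peer closed */
#define READ_ERROR    (-1L)  /* select() or recv() failed; errno is set */
#define READ_TIMEOUT  (-2L)  /* nothing became readable within the timeout */

/* Wait up to timeout_ms for fd to become readable, then read once. */
static long wait_and_read(int fd, void *buf, size_t len, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    /* len must be non-zero, otherwise a 0 return would be ambiguous. */
    assert(buf != NULL && len > 0);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready < 0)
        return READ_ERROR;
    if (ready == 0)
        return READ_TIMEOUT;

    /* "Readable" only means recv() will not block; 0 here is a pending EOF. */
    return (long)recv(fd, buf, len, 0);
}
```

A caller would typically treat `READ_EOF` as "tear down and reconnect" and `READ_TIMEOUT` as "no reply yet". This only helps once the local stack has actually seen the peer close or reset the connection.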

Sean_Boudreau1 · May 5, 2004, 4:40pm

John Nagle <nagle@overbot.com> wrote:

> When the other end of the connection has reset, the next
> TCP message should be rejected with an RST, so the protocol
> does understand a reset at the other end. If the entire
> machine at the other end has reset, the TCP protocol
> should reply to any message with an RST. So at the protocol
> level, it’s known that the connection has closed.

If there’s no traffic, we’ve no indication to unblock the select().
The reception of a RST will unblock select().

The first write the other end receives after reset should poke
it to send the RST. ie the first write will succeed. After
we’ve received a RST from the other end, subsequent writes will
generate a SIGPIPE and fail with EPIPE.

> Understand, this is not the case where the other end
> “went away”. That has to time out, which takes a long time.
> This is the case where the other end restarted, which is
> detectable and recoverable.

Provided there’s traffic.

-seanb
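
On the socket-option question, one standard option that addresses the "no traffic" case is SO_KEEPALIVE, which asks the stack to probe an idle connection so that a restarted peer has something to answer with an RST. The sketch below is an assumption about what might help, not something confirmed in this thread; `enable_keepalive` is a hypothetical helper name. Default keepalive idle times are commonly on the order of two hours, and tuning them is platform-specific, so this alone may be too slow for a control application.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: turn on TCP keepalive probes for an open socket.
   Returns 0 on success, -1 on failure with errno set. */
static int enable_keepalive(int fd)
{
    int on = 1;

    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}
```

Whether the probes actually provoke an RST still depends on the peer's TCP stack behaving correctly after a restart.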

John_Nagle1 · May 6, 2004, 2:58am

Sean Boudreau wrote:

> If there’s no traffic, we’ve no indication to unblock the select().
> The reception of a RST will unblock select().
>
> The first write the other end receives after reset should poke
> it to send the RST. ie the first write will succeed. After
> we’ve received a RST from the other end, subsequent writes will
> generate a SIGPIPE and fail with EPIPE.

That sounds reasonable, but we don’t see it working that way.
We seem to be able to “send” to the dead socket repeatedly
without any errors. On the receive side, we just get select
timeouts.

We’re definitely sending; it’s a “send message, wait for reply”
situation. And we never get a SIGPIPE signal. Even telnet
doesn’t seem to notice when we power-cycle the Galil controller.

The device we’re talking to is definitely back up; we can
establish new connections to it. It has a fixed IP address
on the same LAN as the QNX machines, so there’s no dynamic
addressing issue.

Of course, there’s always the possibility that the Galil
controller’s TCP stack doesn’t do RST properly.

John Nagle
Team Overbot

Sean_Boudreau1 · May 6, 2004, 12:57pm

John Nagle <nagle@overbot.com> wrote:

> That sounds reasonable, but we don’t see it working that way.
> We seem to be able to “send” to the dead socket repeatedly
> without any errors. On the receive side, we just get select
> timeouts.
>
> We’re definitely sending; it’s a “send message, wait for reply”
> situation. And we never get a SIGPIPE signal. Even telnet
> doesn’t seem to notice when we power-cycle the Galil controller.

If you’ve got SIGPIPE blocked the write will fail with -1
and errno of EPIPE.
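
A minimal C sketch of checking for this, under some assumptions: it ignores SIGPIPE rather than blocking it (either way the failure shows up as a return value), and it treats ECONNRESET the same as EPIPE, which is an assumption about how a given stack reports a received RST. `ignore_sigpipe`, `checked_send` and the `SEND_*` constants are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical result codes for checked_send(). */
#define SEND_OK         0     /* data was accepted by the local stack */
#define SEND_PEER_GONE  1     /* EPIPE or ECONNRESET: connection is dead */
#define SEND_ERROR      (-1)  /* any other failure; errno is set */

/* Process-wide setting: call once at startup so that a write to a dead
   connection returns -1 with errno set instead of raising SIGPIPE. */
static void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

static int checked_send(int fd, const void *msg, size_t len)
{
    ssize_t n;

    assert(msg != NULL);
    n = send(fd, msg, len, 0);
    if (n >= 0)
        return SEND_OK;  /* queued locally; says nothing about delivery */
    if (errno == EPIPE || errno == ECONNRESET)
        return SEND_PEER_GONE;
    return SEND_ERROR;
}
```

Note that `SEND_OK` on the first write after the peer restarts is expected, as described earlier in the thread; the error can only appear on a later call, after an RST has come back.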

system · May 20, 2004, 2:19pm

nagle@overbot.com sed in <c7c7fr$rod$1@inn.qnx.com>:

> situation. And we never get a SIGPIPE signal. Even telnet
> doesn’t seem to notice when we power-cycle the Galil controller.

My wild guess is that the controller isn’t implementing TCP in full.

kabe