sockets getting stuck in CLOSED state

I have been struggling to figure out why my sockets wind up stuck in the CLOSED state. When this happens, all the file descriptors eventually get used up and my resource manager fails.

What I have is a resource manager that connects to a device on my TCP/IP network via a number of different sockets.

To make it fault tolerant, when the remote device stops responding I close the socket and create a new one.

To recreate the problem I simply disconnect the device from the network. In Wireshark I can see QNX resetting the connection as it tries to close the socket, and according to the “pidin fds” command the socket cycles between the SYN_SENT and CLOSED states. That is basically what I would expect, but after a while the connections start getting stuck, first in SYN_SENT and then in CLOSED, until all the sockets are used up.

I have SO_LINGER turned off.
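
Turning it off looks roughly like this (a minimal sketch; sock stands for the socket descriptor):

#include <sys/socket.h>
#include <stdio.h>

struct linger lg = { 0, 0 };  /* l_onoff = 0 disables lingering */
/* sock: the socket descriptor (illustrative name) */
if (setsockopt(sock, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg)) == -1)
    perror("setsockopt SO_LINGER");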

Any idea what could be going on here?

Below is a capture of the output of pidin fds as the issue is starting to occur. Note that the SYN_SENT entries change to CLOSED after a short while and get stuck.

pidin -p devc-serbhn fds

 pid name

232685593 86/o-g/devc-serbhn
0 4103 rw 0 /dev/con1
1 4103 rw 0 /dev/con1
2 4103 rw 0 /dev/con1
3 232685593
4 81935 rw 0 I4TCP 206.130.75.35.62911 206.130.75.93.49154 ESTABLISHED
5 81935 rw 0 I4TCP CLOSED
6 81935 rw 0 I4TCP CLOSED
7 81935 rw 0 I4TCP CLOSED
8 81935 rw 0 I4TCP 206.130.75.35.64782 206.130.75.93.49168 ESTABLISHED
9 81935 rw 0 I4TCP 206.130.75.35.64781 206.130.75.93.49169 ESTABLISHED
10 81935 rw 0 I4TCP CLOSED
11 81935 rw 0 I4TCP CLOSED
12 81935 rw 0 I4TCP CLOSED
13 81935 rw 0 I4TCP CLOSED
14 81935 rw 0 I4TCP 206.130.75.35.63041 206.130.75.126.49154 SYN_SENT
15 81935 rw 0 I4TCP CLOSED
16 81935 rw 0 I4TCP CLOSED
17 81935 rw 0 I4TCP CLOSED
18 81935 rw 0 I4TCP CLOSED
19 81935 rw 0 I4TCP CLOSED
20 81935 rw 0 I4TCP CLOSED
21 81935 rw 0 I4TCP CLOSED
22 81935 rw 0 I4TCP 206.130.75.35.63045 206.130.75.126.49154 SYN_SENT
23 81935 rw 0 I4TCP 206.130.75.35.63033 206.130.75.126.49154 SYN_SENT
24 81935 rw 0 I4TCP 206.130.75.35.63037 206.130.75.126.49154 SYN_SENT
25 81935 rw 0 I4TCP 206.130.75.35.63001 206.130.75.126.49154 SYN_SENT
26 81935 rw 0 I4TCP 206.130.75.35.63029 206.130.75.126.49154 SYN_SENT
27 81935 rw 0 I4TCP 206.130.75.35.63008 206.130.75.126.49154 SYN_SENT
28 81935 rw 0 I4TCP 206.130.75.35.63005 206.130.75.126.49154 SYN_SENT
29 81935 rw 0 I4TCP 206.130.75.35.62993 206.130.75.126.49154 SYN_SENT
30 81935 rw 0 I4TCP 206.130.75.35.62997 206.130.75.126.49154 SYN_SENT
31 81935 rw 0 I4TCP 206.130.75.35.62926 206.130.75.126.49154 SYN_SENT
32 81935 rw 0 I4TCP 206.130.75.35.62989 206.130.75.126.49154 SYN_SENT
33 81935 rw 0 I4TCP 206.130.75.35.62986 206.130.75.126.49154 SYN_SENT
34 81935 rw 0 I4TCP 206.130.75.35.62981 206.130.75.126.49154 SYN_SENT
35 81935 rw 0 I4TCP 206.130.75.35.62978 206.130.75.126.49154 SYN_SENT
36 81935 rw 0 I4TCP 206.130.75.35.62973 206.130.75.126.49154 SYN_SENT
37 81935 rw 0 I4TCP 206.130.75.35.62970 206.130.75.126.49154 SYN_SENT
38 81935 rw 0 I4TCP 206.130.75.35.62949 206.130.75.126.49154 SYN_SENT
39 81935 rw 0 I4TCP 206.130.75.35.62946 206.130.75.126.49154 SYN_SENT
40 81935 rw 0 I4TCP 206.130.75.35.62941 206.130.75.126.49154 SYN_SENT
41 81935 rw 0 I4TCP 206.130.75.35.62938 206.130.75.126.49154 SYN_SENT
42 81935 rw 0 I4TCP 206.130.75.35.62933 206.130.75.126.49154 SYN_SENT
43 81935 rw 0 I4TCP 206.130.75.35.62930 206.130.75.126.49154 SYN_SENT
44 81935 rw 0 I4TCP 206.130.75.35.62918 206.130.75.126.49154 SYN_SENT
45 81935 rw 0 I4TCP 206.130.75.35.62922 206.130.75.126.49154 SYN_SENT
46 81935 rw 0 I4TCP 206.130.75.35.62910 206.130.75.93.49155 ESTABLISHED
47 81935 rw 0 I4TCP 206.130.75.35.62914 206.130.75.126.49154 SYN_SENT
0s 1
2s 232685593
4s 1 MP 0 /dev/serbhn0-1
5s 1 MP 0 /dev/serbhn0-2
6s 1 MP 0 /dev/serbhn0-3
7s 1 MP 0 /dev/serbhn0-4
8s 1 MP 0 /dev/serbhn0-5
9s 1 MP 0 /dev/serbhn0-6
10s 1 MP 0 /dev/serbhn0-7
11s 1 MP 0 /dev/serbhn0-8
12s 1 MP 0 /dev/serbhn1-1
13s 1 MP 0 /dev/serbhn1-2
14s 1 MP 0 /dev/serbhn1-3
15s 1 MP 0 /dev/serbhn1-4
16s 1 MP 0 /dev/serbhn1-5
17s 1 MP 0 /dev/serbhn1-6
18s 1 MP 0 /dev/serbhn1-7
19s 1 MP 0 /dev/serbhn1-8
21s 1

Any ideas would be appreciated.

Actually, I do have an idea. What I don’t have is an understanding. You aren’t the first person to run into this. I ran headlong into the fd limits with closed sockets way back in 2003. I’m afraid I wasn’t a lot of help when the topic came up in October of last year here on openqnx.com. See the thread openqnx.com/index.php?name=P … ic&p=39877

Here’s what I said, slightly edited:
There is a limited number of available sockets. No big deal, except that when you close a socket it doesn’t disappear right away. There’s a “virtual” socket hanging around until a timeout, which I think is 60 seconds. So even when closing properly, if you create sockets fast enough you will eventually run out. It’s like filling a bathtub: if the water comes in faster than the drain can take it out, the tub will overflow.

You can run “netstat” at the command prompt to monitor socket states.
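
For example, something like this (assuming a stock netstat; -a lists all sockets, -n keeps the addresses numeric):

netstat -a -n | grep TIME_WAIT

That shows just the connections lingering after close.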

How fast are you creating sockets?

-James Ingraham
Sage Automation, Inc.

OK, thanks for the tip, but does this apply if a socket fails to connect at all, i.e. when the remote IP is actually not reachable?

I was trying to connect every 3 seconds, and the problem showed up after about 40 minutes. I could actually watch the socket numbers slowly climb.

I just tried a test with the retry timeout set to 61 seconds, and there was no sign of any problem after an hour. However, I may have just made the issue take 20 times longer to show up, in which case an hour-long test wouldn’t reveal it.

60 seconds is too long, so I am going to run the test again with a timeout of 20 seconds and see if it happens.

The one thing that bothers me about this suggested cause is that I don’t see any TIME_WAITs in netstat for this connection, which I would expect to see if the socket were being left open for a while after close.

Is there some other hidden resource being used up when a connection fails completely because the remote IP isn’t there?

AES,

Set the SO_REUSEADDR socket option. This is the option that tells the TCP/IP stack to release the sockets right away instead of holding on to them for 60 seconds or so.
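
Something like this, set before the bind()/connect() (a minimal sketch; sock is an illustrative name):

#include <sys/socket.h>
#include <stdio.h>

int on = 1;  /* enable the option */
if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) == -1)
    perror("setsockopt SO_REUSEADDR");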

Tim

I think what Tim’s describing is the TIME_WAIT state.

It sounds like your connect() is timing out. Check the return from connect(), and if it’s -1, close() the socket.
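
A minimal sketch of that pattern (the function name try_connect and the addr argument are illustrative):

#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Returns a connected socket, or -1. The key line is the close()
 * on a failed connect(): without it every retry leaks one fd,
 * which then shows up stuck as CLOSED in pidin/netstat. */
int try_connect(const struct sockaddr_in *addr)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);

    if (sock == -1)
        return -1;
    if (connect(sock, (const struct sockaddr *)addr, sizeof(*addr)) == -1) {
        close(sock);   /* release the descriptor on failure */
        return -1;
    }
    return sock;       /* caller close()s it when done */
}

Every path that fails after socket() succeeds needs a matching close(); otherwise the descriptor stays allocated even though the connection state reads CLOSED.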

-seanb

Thanks seanb, that was exactly my problem.

I was missing the close() on a failed connect, which in my case was happening constantly because the connection was broken.

-Al