Hitting limit of socket connections? Help needed please.

I am running QNX 6.2.0 on an embedded CPU. Recently I have run into an issue with socket connections. The program I am working with can be configured to create an ethernet connection with multiple devices. The connection is created as follows:
[i].
.
While()
{
  sock = socket( AF_INET, SOCK_STREAM, 0 );

  if( sock < 0 )
  {
    perror( "opening stream socket" );
  }

  fprintf( stderr, "Sock is %d\r\n", sock ); // TO REMOVE

  if( NULL != ( host = gethostbyname( name )))
  {
    memset( &server, 0, sizeof( server ));
    server.sin_family = AF_INET;
    memcpy( (char *)&server.sin_addr, host->h_addr, host->h_length);
    server.sin_port = htons( port );

    if( connect( sock, (struct sockaddr *)&server, sizeof(server) ))
    {
      fprintf(stderr, "Connect failed on sock %d.  Errno is %d\r\n", sock, errno ); // FAILS

      close( sock );

      sock = -1;
    }

.
.
SocketLoop()
.
.
sleep()
}[/i]

This works when fewer than ~20 of these processes are running, but fails for any processes beyond that number. It fails at the second fprintf with the following output:

Connect failed on sock 11. Errno is 261

or

Connect failed on sock 11. Errno is 260

The 261 (Connection refused) error will come first, the 260 (Connection timed out) error will come on following iterations.
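The numeric errno values are platform-specific (260 and 261 are what this QNX system reports), so logging the text from strerror() alongside the number can make the output easier to read. A minimal sketch, assuming a hypothetical helper named `format_connect_error` that is not part of the original program:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format a connect() failure with both the numeric
 * errno and its text description, so the log does not depend on one
 * platform's numbering. Returns whatever snprintf() returns. */
int format_connect_error(char *buf, size_t len, int sock, int err)
{
    return snprintf(buf, len, "Connect failed on sock %d: errno %d (%s)",
                    sock, err, strerror(err));
}
```

The value passed as `err` should be a copy of errno taken immediately after connect() fails, before any other library call has a chance to overwrite it.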
Unfortunately, my experience with socket programming is limited and far from recent. I suspect that I am hitting a limit on the number of connections; however, the limit doesn’t seem to be constant. What would cause this limit? Can it be increased?

Not sure if it will help or not but the ethernet driver is called as follows:
io-net -drtl -ptcpip
I have also tried with no effect the following:
io-net -drtl receive=50 -ptcpip

I welcome any suggestions and will be more than happy to provide more information if needed.

Thank you for your time,

Blaine

Blaine,

What happens in SocketLoop? Is that a function that never returns?

Looking at the code, I’d say you’re leaking sockets by not closing them.

After your final sleep(), and before you open another socket, you need to close the one you currently have open. It doesn’t matter whether the ‘connect’ call was successful or not. You already allocated a socket in the initial socket() call. You have to release that.

You can check this by printing the sock value. It normally starts at 3 (after stdin, stdout, and stderr), and if it keeps increasing, it means you’re leaking sockets.
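The descriptor check described above can be sketched as a small experiment. This is only an illustration, and `highest_fd_with_close` is a made-up name: it opens and closes a socket repeatedly and reports the largest descriptor number it saw. When every socket is closed, the lowest free descriptor is handed out again and the number stays flat; if the close() were missing, the number would climb on each pass.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Open and close a TCP socket `rounds` times and return the highest
 * descriptor number seen, or -1 if no socket could be opened. */
int highest_fd_with_close(int rounds)
{
    int i;
    int highest = -1;

    for (i = 0; i < rounds; i++) {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        if (sock < 0)
            break;          /* out of descriptors or no TCP/IP stack */
        if (sock > highest)
            highest = sock;
        close(sock);        /* release the descriptor on every pass */
    }
    return highest;
}
```

Calling it with 1 round and with 50 rounds should give the same result in a single-threaded program, which is the "not leaking" signature.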

Tim

Another possible reason: In 6.2 there was a limited number of available sockets. No big deal, except when you close the socket it doesn’t disappear right away. There’s a “virtual” socket hanging around until a timeout, which I think is 60 seconds. So (even when properly closing) if you create sockets fast enough you will eventually run out. It’s like filling a bathtub; if the water is coming in faster than the drain can take it out, your tub will overflow. 6.3 increased this limit, and we haven’t run into it since.

You can see the sockets without needing printfs; just run “netstat” at the command prompt.

One further point: not properly closing the socket is definitely where you should look first. Always blame your own code before the system.

-James Ingraham
Sage Automation, Inc.

Thanks for the reply Tim.

The socket is closed as soon as the failure on connect is caught. Looking at the code sample I posted now, I realize that this isn’t at all clear, but SocketLoop() isn’t hit unless the initialization is successful. There is also a close( sock ) call after SocketLoop() to close the socket once the connection is finished. The corrected code sample is below to clarify:

[i].
.
While()
{
  sock = socket( AF_INET, SOCK_STREAM, 0 );

  if( sock < 0 )
  {
    perror( "opening stream socket" );
  }

  fprintf( stderr, "Sock is %d\r\n", sock ); // TO REMOVE

  if( NULL != ( host = gethostbyname( name )))
  {
    memset( &server, 0, sizeof( server ));
    server.sin_family = AF_INET;
    memcpy( (char *)&server.sin_addr, host->h_addr, host->h_length);
    server.sin_port = htons( port );

    if( connect( sock, (struct sockaddr *)&server, sizeof(server) ))
    {
      fprintf(stderr, "Connect failed on sock %d. Errno is %d\r\n", sock, errno ); // FAILS

      close( sock );

      sock = -1;
    }
    else
    {
      .
      .
      SocketLoop()
      close( sock );
      .
      .
    }
  }
  sleep()
} [/i]
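For readers who want something compilable, the connect attempt can be sketched as a standalone function that never leaves a descriptor open on failure. The name `try_connect` and its return convention are assumptions made for illustration, not the original program's code. One deliberate difference: this sketch resolves the host name before opening the socket, so a failed lookup has nothing to clean up.

```c
#include <errno.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns a connected descriptor, or -1 with the
 * socket already closed and errno describing the connect() failure. */
int try_connect(const char *name, unsigned short port)
{
    struct sockaddr_in server;
    struct hostent *host;
    int sock;
    int saved;

    host = gethostbyname(name);
    if (host == NULL || host->h_addrtype != AF_INET)
        return -1;                  /* nothing has been opened yet */

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    memcpy(&server.sin_addr, host->h_addr_list[0], (size_t)host->h_length);
    server.sin_port = htons(port);

    if (connect(sock, (struct sockaddr *)&server, sizeof(server)) != 0) {
        saved = errno;              /* close() may overwrite errno */
        close(sock);
        errno = saved;
        return -1;
    }
    return sock;                    /* caller must close() this */
}
```

The caller would run its SocketLoop() equivalent only when the return value is non-negative, and close the descriptor afterwards.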

Thank you for the reply James.

The image our embedded system is running does not include netstat, but I have now placed it on the primary test machine. The output follows:

netstat

Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        State
tcp        0      0  192.168.0.43.65503     192.168.0.39.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65504     192.168.0.38.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65505     192.168.0.37.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65506     192.168.0.36.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65507     192.168.0.35.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65508     192.168.0.34.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65509     192.168.0.33.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65510     192.168.0.32.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65511     192.168.0.31.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65512     192.168.0.30.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65513     192.168.0.29.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65514     192.168.0.96.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65515     192.168.0.97.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65516     192.168.0.94.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65517     192.168.0.93.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65518     192.168.0.92.telnet    SYN_SENT
tcp        0      0  192.168.0.43.65519     192.168.0.91.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65520     192.168.0.48.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65521     192.168.0.47.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65522     192.168.0.24.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65523     192.168.0.52.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65524     192.168.0.51.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65525     192.168.0.50.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65526     192.168.0.69.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65527     192.168.0.60.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65528     192.168.0.68.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65529     192.168.0.67.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65530     192.168.0.21.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65531     192.168.0.95.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65532     192.168.0.72.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65533     192.168.0.78.telnet    ESTABLISHED
tcp        0      0  192.168.0.43.65534     192.168.0.44.telnet    ESTABLISHED

Obviously this is only a snapshot, but the command gives similar output every time. Thank you also for the information on the “virtual” sockets and 60 second timeout. I tried increasing the sleep() duration from 30 to 180 but there was no change.

I think that I might actually be hitting the limit. Basically what my application should do is maintain up to 32 connections at one time. Under normal circumstances these connections would be constant. Currently it seems to be able to maintain only 20 and will not allow a connect() on anything more. Is the limit of Socket connections for 6.2 only 20, or am I running into some other issue? Also if the limit is indeed 20, is there anything I can do to increase it short of upgrading to 6.3?

Thank you again for your help. Your time is greatly appreciated.

Blaine

Blaine,

OK, with the clarified code, you’re not leaking sockets.

I re-read your initial post. I must have been half asleep this morning before I had my coffee.

Error 261, connection refused, means the remote side is refusing to allow you to connect. This has nothing to do with the number of sockets on the QNX side.

It simply means the other end isn’t accepting a connection from you. A couple of questions:

Are you trying to connect to the same device over and over again? Some operating systems limit the rate at which you can reconnect to them (to deter probing by hackers).

Any chance the remote device isn’t ready to accept a connection when you try to connect?

I’d suggest printing out the IP address of the side that refuses the connection. Then you at least know who’s refusing you. See if it’s the same device every time.
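A minimal sketch of that suggestion, assuming the address is already in a `struct sockaddr_in` as in the posted code. `format_peer` is a hypothetical helper name; it turns the target address into a printable "a.b.c.d:port" string that can go into the failure message.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

/* Hypothetical helper: write the IPv4 address and port of `peer` into
 * buf as "a.b.c.d:port". Returns whatever snprintf() returns. */
int format_peer(char *buf, size_t len, const struct sockaddr_in *peer)
{
    /* inet_ntoa() returns a pointer to a static buffer, so the text is
     * copied into the caller's buffer right away. */
    return snprintf(buf, len, "%s:%u", inet_ntoa(peer->sin_addr),
                    (unsigned)ntohs(peer->sin_port));
}
```

In the failure branch, calling `format_peer(buf, sizeof buf, &server)` and printing `buf` next to errno would show which device each 260 or 261 belongs to.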

Tim

Thanks again Tim.

Your most recent post got me to reconsider some assumptions that I had been confident in. Specifically, I took a very close look at our internal network. Sure enough, it seems that specific devices work in some locations but not in others. I have now verified the problem from a Windows PC as well. So I can say that this is definitely not a QNX problem (boy is my face red), nor is it a problem with my code.

Thank you both for the insightful posts and for helping me get pointed in the right direction. I still have an issue, but now at least I know where to look to resolve it.

Blaine