Hub goes... everything goes

I have a network (many networks, actually) of QNX nodes. Everything
communicates well until a network disturbance occurs. This could be a power
flicker, an inadvertent unplugging, or whatever. I have reproduced it simply
by unplugging the power from the hub and then reconnecting it, with the time
unplugged ranging from very short (< 1 sec) to long (> 10 sec).

In the network(s), I have processes talking to other processes, which has
created virtual circuits. These vcs are no longer present afterwards, even
if the disturbance is very short.

I have noticed that when the network is powered back up, I have to refresh
the netmap file on each node before normal networking resumes. I have also
noticed that two or more separate networks will start up, with nodes
1, 2, 5 talking to each other and nodes 3, 4, 6 talking to each other.

Is there a method to detect these disturbances and maintain a reliable
network? Any suggestions or comments would be greatly appreciated.

Doug

When Send() returns -1, call qnx_name_locate() again to get a new virtual
circuit.

Also, try experimenting with the options of your network driver
to increase resistance to transient errors.
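
A minimal sketch of the first suggestion, written as portable C so it can be compiled off-target. The names `vc_link`, `locate_fn`, `send_fn` and `send_with_relocate` are hypothetical; on a QNX 4 node the two function pointers would wrap qnx_name_locate() and Send(), whose exact signatures should be checked against the library reference. The fake transport at the end is demo-only and simulates a single dead vc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two calls named in the post. */
typedef int (*locate_fn)(const char *name);   /* returns a vc pid, or -1 */
typedef int (*send_fn)(int pid, const void *msg, void *reply,
                       unsigned msg_len, unsigned reply_len); /* 0 or -1 */

struct vc_link {
    const char *name;   /* registered name of the server process */
    int pid;            /* current vc pid, or -1 if none is held */
    locate_fn locate;
    send_fn send;
};

/* Try to send; on failure drop the stale vc, locate the name again and
   retry. Returns 0 on success, -1 once max_attempts are used up. */
int send_with_relocate(struct vc_link *l, const void *msg, void *reply,
                       unsigned msg_len, unsigned reply_len, int max_attempts)
{
    int attempt;

    assert(l != NULL && l->locate != NULL && l->send != NULL);
    for (attempt = 0; attempt < max_attempts; attempt++) {
        if (l->pid == -1) {
            l->pid = l->locate(l->name);
            if (l->pid == -1)
                continue;   /* server not reachable yet; counts as an attempt */
        }
        if (l->send(l->pid, msg, reply, msg_len, reply_len) == 0)
            return 0;
        l->pid = -1;        /* assume the vc is dead and force a new locate */
    }
    return -1;
}

/* Demo-only fake transport: the first vc (pid 100) is stale,
   every vc located after it works. */
static int next_pid = 100;

static int fake_locate(const char *name)
{
    (void)name;
    return next_pid++;
}

static int fake_send(int pid, const void *msg, void *reply,
                     unsigned msg_len, unsigned reply_len)
{
    (void)msg; (void)reply; (void)msg_len; (void)reply_len;
    return pid == 100 ? -1 : 0;
}
```

In a real application a short delay between attempts would probably be wanted, so that a hub that is still powering up is not hammered with locate requests.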

Andy

“Doug Rixmann” <rixmannd@rdsdata.com> wrote in message
news:ach1l7$ek3$1@inn.qnx.com

> I have noticed that when the network is powered, it is required to refresh
> the netmap file on each node before normal networking resumes.

That doesn’t make sense, unless your netmap file/licenses aren’t identical
on every machine.


That’s what happens… some additional information:

  1. nameloc is running on all devices (6 devices)
  • if I don’t start nameloc on a device, it complains about licensing. Can I
    start it to get the device up and then kill the nameloc process on all
    devices but the 2 servers?
  2. netpoll is running on all devices (within our application) as
    netpoll -i1 -p1 -r1
  • not sure of the value of this


“Doug Rixmann” <rixmannd@rdsdata.com> wrote in message
news:actdvi$eid$1@inn.qnx.com

> That’s what happens… some additional information:
>
>   1. nameloc is running on all devices (6 devices)

That’s usually not a good idea; too many nameloc instances can create
hard-to-find problems.

>   • if I don’t start nameloc on a device, it complains about licensing. Can I
>     start it to get the device up and then kill the nameloc process on all
>     devices but the 2 servers?

Yes, but look at nameloc -k x

>   2. netpoll is running on all devices (within our application) as
>     netpoll -i1 -p1 -r1
>   • not sure of the value of this

Me neither; I have never played with these values.


Doug Rixmann <rixmannd@rdsdata.com> wrote:

> That’s what happens… some additional information:
>
>   1. nameloc is running on all devices (6 devices)
>   • if I don’t start nameloc on a device, it complains about licensing. Can I
>     start it to get the device up and then kill the nameloc process on all
>     devices but the 2 servers?

Use nameloc -k on the non-server nodes to make them aware of the
license information from the server nodes.

>   2. netpoll is running on all devices (within our application) as
>     netpoll -i1 -p1 -r1
>   • not sure of the value of this

This is probably the source of your problem with the hub going,
and everything going.

With this command to netpoll, you’ve said (basically) “I’ve got a
perfect network, treat the slightest failure as a real failure.”

There is, basically, a tradeoff between ability to handle/ignore
transient failures, and the ability to quickly detect real failures.

netpoll basically controls how long before you look for a failure,
how often you retry, and how long between retries. You’ve taken
ALL of these options to a minimum value, basically saying: give
up (tear down the connection, i.e. the vc) after the slightest failure.

The default values are: netpoll -i10 -p10 -r6 – which will take about
10 minutes to finally give up on another node.

Your numbers will give up on another node after about 2 seconds.

You may want to choose some values in between.

There are other parameters that can affect network resiliency, some
in the command line to Net, some in the driver command line.

On the how-long-before-failure side:
Net.driver:
  -n tx_num_retries   max number of tx retries after timeout (default 3)
  -t tx_retry_ticks   number of 50ms ticks between tx retries (default 20)

And on the recovery side:
Net:
  -t tx_fail_time     time in seconds before retry failed network for node (40)

Net.driver:
  -f tx_forget_time   seconds until rxd nack is forgotten about for txing
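
To get a feel for the driver retry options above, here is a rough worked calculation in C. It assumes that retries are evenly spaced, so the retry window is simply retries × ticks × 50 ms; that is a simplification for illustration, not a statement of how the driver actually schedules retries. The function name `tx_retry_window_ms` is hypothetical.

```c
#include <assert.h>

#define TICK_MS 50  /* tick length given in the driver option text */

/* Approximate time in milliseconds spent on tx retries before the
   driver gives up, ASSUMING evenly spaced retries. */
long tx_retry_window_ms(int tx_num_retries, int tx_retry_ticks)
{
    assert(tx_num_retries >= 0 && tx_retry_ticks >= 0);
    return (long)tx_num_retries * tx_retry_ticks * TICK_MS;
}
```

With the listed defaults (3 retries, 20 ticks) this comes to 3000 ms, so under this assumption a hub outage longer than about three seconds would outlast the driver-level retries.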

-David

QNX Training Services
http://www.qnx.com/support/training/
Please followup in this newsgroup if you have further questions.