Network slowdown after reboot

We have a fast-ethernet network with 32 PCs running QNX 4.25E.
26 nodes are using RTL8139-based NICs (Net.rtl), while 6 nodes are using Intel 82559-based NICs (Net.ether82552).

The network is completely switched (i.e. no collisions) and it
is normally moderately fast. Copying a big file between two nodes
using “cp -V” gives a transfer rate of about 1300 Kb/sec (almost
all CPUs are old Pentium 90 or 120 MHz).

The problem is that when we reboot one or two nodes, sometimes
the whole network becomes very slow. The “sin net” command takes 10 to
20 seconds to complete (while it normally takes about 1-2 sec);
copying the same file as above results in a transfer rate of about 150-200 Kb/sec.

Even commands affecting a single node (such as “sin args”) are
slowed down.

The only way I found to bring the network back to a sane state is
to reboot all the nodes.

Has anyone experienced similar problems?

Thanks,
Alessandro Sala


Gemmo Impianti S.p.A.  |
Divisione Sistemi      |  Alessandro Sala
Viale Tunisia, 39      |  Software and Systems Manager
20124 Milano           |

“Alessandro Sala” <alex@romeo.gemmo-sistemi.it> wrote in message
news:as85om$80p$1@inn.qnx.com

> We have a fast-ethernet network with 32 PCs running QNX 4.25E.
> 26 nodes are using RTL8139-based NICs (Net.rtl), while 6 nodes are using
> Intel 82559-based NICs (Net.ether82552).
>
> The network is completely switched (i.e. no collisions) and it
> is normally moderately fast. Copying a big file between two nodes
> using “cp -V” gives a transfer rate of about 1300 Kb/sec (almost
> all CPUs are old Pentium 90 or 120 MHz).
>
> The problem is that when we reboot one or two nodes, sometimes
> the whole network becomes very slow. The “sin net” command takes 10 to
> 20 seconds to complete (while it normally takes about 1-2 sec);
> copying the same file as above results in a transfer rate of about
> 150-200 Kb/sec.
>
> Even commands affecting a single node (such as “sin args”) are
> slowed down.
>
> The only way I found to bring the network back to a sane state is
> to reboot all the nodes.
>
> Has anyone experienced similar problems?

Are you running a nameloc process on one or more nodes? How is node mapping done
on each node? Is there some primary “server” node (usually node 1) which
never reboots?
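For example, something like this should show whether a given node is running a
nameloc and what its current netmap looks like (the node number is just an
example):

    sin -n 7 -P nameloc    # is nameloc running on node 7 ?
    netmap                 # print the network map installed on this node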

// wbr

Ian Zagorskih <ianzag@megasignal.com> wrote:

> Are you running a nameloc process on one or more nodes? How is node mapping done
> on each node?

I’m running 10 nameloc processes, on nodes 7, 9, 11, 13, 16, 19, 21, 25, 30
and 33 (this is because the network is very sparse, with groups of 2-4 PCs
connected by long fiber-optic links, and I want every group to keep working
even if the F.O. links go down, so I have a nameloc for every group).

Node numbers range from 1 to 35. Numbers 1, 2 and 3 are not used; they are
reserved for connecting notebooks when I need to: 1 and 2 are masked, while 3
is simply deleted from the netmap, so I can connect a notebook numbered 3 from
any place on the network and have it automatically inserted in the netmap.

Every nameloc uses the same ‘-e 35’ argument to avoid polling nodes
beyond node 35 (I have other licenses installed which I don’t use at the
moment).
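(Starting them is nothing special: on each of those ten nodes it boils down to
one line roughly like the following in the node’s sysinit; where exactly it
sits in the script is a detail.)

    nameloc -e 35 &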

Node mapping is done statically using /etc/config/netmap which is the same
on all nodes.
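For reference, each line of that file maps a logical node number and a logical
network number to the physical (MAC) address of the card, so it looks roughly
like the sketch below (if I recall the column order correctly; the addresses
here are invented):

    # format: node  lan  physical-address
    4    1    00e018123404
    5    1    00e018123405
    35   1    00a0c9123435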

> Is there some primary “server” node (usually node 1) which
> never reboots?

All nodes are up 24 hours a day (it’s a supervisory system). When one or
two nodes are switched off for maintenance, the problem sometimes arises,
and the slowdown persists even after the missing nodes are up again.

Now that you make me think about it, perhaps the slowdown is
more likely to happen if one of the switched-off nodes was running a
nameloc, but I’m not sure.


Thanks,
Alessandro



“Alessandro Sala” <alex@romeo.gemmo-sistemi.it> wrote in message
news:asf79u$t7k$1@inn.qnx.com

> I’m running 10 nameloc processes, on nodes 7, 9, 11, 13, 16, 19, 21, 25, 30
> and 33 (this is because the network is very sparse, with groups of 2-4 PCs
> connected by long fiber-optic links, and I want every group to keep working
> even if the F.O. links go down, so I have a nameloc for every group).
>
> [...]
>
> Now that you make me think about it, perhaps the slowdown is
> more likely to happen if one of the switched-off nodes was running a
> nameloc, but I’m not sure.

Hm… Sorry, at the moment I have no specific ideas. Try running “sin vc” on
the slowed-down nodes while the others are rebooting and check for existing
virtual circuits.
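For example, something like this on one of the slow nodes (“alive” and
“netinfo” are only extra checks; the exact counters netinfo shows depend on
the driver):

    sin vc      # virtual circuits this node currently holds
    alive       # which nodes currently answer
    netinfo     # per-driver packet/error counters kept by Net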

// wbr

Ian Zagorskih <ianzag@megasignal.com> wrote:

> Hm… Sorry, at the moment I have no specific ideas. Try running “sin vc” on
> the slowed-down nodes while the others are rebooting and check for existing
> virtual circuits.

Today it happened again. This time I ran ‘sin format tina’ and saw that
the Net process, on almost all nodes, had a very high UTIME (above 10000):
I think this explains the PC and network slowness.
I then slayed Net on node 4 and restarted it, together with Net.rtl
(the NIC driver), and after a few seconds the performance of PCs and
network came back to the usual level.
I noticed that the UTIME of Net on node 4 dropped to about 80, and on
most nodes it dropped below 10000.

It seems the problem is in Net or in something related to it.
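(In practice the restart on node 4 was just a handful of commands, roughly as
sketched below; whether the netmap reload at the end is really needed I’m not
sure, it is listed only for completeness.)

    slay Net.rtl
    slay Net
    Net &
    Net.rtl &
    netmap -f     # reload /etc/config/netmap into the freshly started Net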

I’ll follow your suggestion next time, though I hope not to have
a “next time” :-(

Alessandro


Alessandro Sala <alex@romeo.gemmo-sistemi.it> wrote:

> Today it happened again. This time I ran ‘sin format tina’ and saw that
> the Net process, on almost all nodes, had a very high UTIME (above 10000):
> I think this explains the PC and network slowness.
> I then slayed Net on node 4 and restarted it, together with Net.rtl
> (the NIC driver), and after a few seconds the performance of PCs and
> network came back to the usual level.
> I noticed that the UTIME of Net on node 4 dropped to about 80, and on
> most nodes it dropped below 10000.

Oops! I completely misunderstood the output of sin. Of course UTIME is not
the CPU utilization of the process!
Anyway, slaying and restarting Net did in fact solve the problem.
Yesterday I rebooted a group of 4 nodes and, after some time, another group
of 6 nodes, and no slowdown happened.

Alessandro

