Multiple io-net crashes

We’ve been experiencing multiple io-net crashes on
QNX 6.2.1PE. We’ve now seen this on three different
sets of hardware. Bug reports, with dumps, have been
submitted. Earlier, we thought this was a versioning
problem, but we put full installs of 6.2.1PE on the
relevant machines and still have problems.

We’re running QNET over Ethernet, not over IP.
All machines use ordinary Ethernet interfaces, but
the LAN is bridged with wireless bridges. Operating
over hard-wired 100baseT seems to work fine.

Operating over a path with a slow 802.11b bridge seems to cause QNET
serious problems, including io-net crashes.

Spawning programs and messaging using QNET across
the 802.11b bridge seems to get io-net into bad states.
At the user level, we get messages like
“ls: readdir of ‘/net/gcrear0’ failed (Bad file descriptor)”
In syslog, we see

Nov 17 21:26:36 7 15 0 npm-qnet(stats): kif_client@82
kif_net_client Underflow(4294967295)

====

Nov 17 21:43:42 7 15 0 npm-qnet(kif): nd(00010004, 00010004),
server_id (40000029, 4000001f), client_id (0000002b, 0000002b),
v->buffer 0 at kif_client.c:705
(Bad file descriptor)

Nov 17 21:43:42 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad
file descriptor)

Nov 17 21:43:42 7 15 0 npm-qnet(kif): nd(00010004, 00010004),
server_id (40000029, 4000001f), client_id (0000002b, 0000002b),
v->buffer 0 at kif_client.c:705
(Bad file descriptor)

Nov 17 21:43:42 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad
file descriptor)

Nov 17 21:43:43 7 15 0 npm-qnet(kif): nd(00010004, 00010004),
server_id (40000029, 4000001f), client_id (0000002b, 0000002b),
v->buffer 0 at kif_client.c:705
(Bad file descriptor)

Nov 17 21:43:43 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad
file descriptor)

What does it mean?

We really need QNX messaging to work reliably. Our whole architecture
is based on it.

John Nagle
Team Overbot

Hello John,

Would you please try this: when you mount QNET, start it with these
options:

io-net -d driver -p qnet ticksize=200,sstimer=0x00140014

See if this makes it work better on slow links.

I will think about other alternatives.
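
If you start qnet by mounting npm-qnet.so instead of on the io-net command
line, the same options should go in the -o string, something like:

mount -T io-net -o "ticksize=200,sstimer=0x00140014" /lib/dll/npm-qnet.so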

-xtang


We will try that as a debugging effort. But fixing
a fundamental reliability problem by adjusting time
delays is not a solution.

We will also send in some crash dumps of io-net.

Neither of those options is documented in the Helpviewer
database, incidentally.

John Nagle
Team Overbot


I believe that you’re adjusting a time-out, not a delay.

The driver assumes that something is wrong even though everything
is still working fine, just slowly.



We have also recently seen a couple of io-net crashes.

This is on standard machines running 6.2.1A with dual Intel 82557 NICs. The
network is hardwired - there is no wireless LAN anywhere.

Coincidentally (or not?) the crashes have only occurred after we started
implementing inter-node native IPC over Ethernet.

We haven’t spent any effort to get to the bottom of this yet (as it is only
very intermittent and we have more pressing bugs to fix) but I thought I
would mention it as possibly relevant to this thread.

Rob Rutherford


To get better response for “QNET over Ethernet (LAN)”, QNET is tuned
for use on Ethernet by default. The aggressive timeouts then affect links
that have a higher packet-loss rate. (QNET does recognize when the interface
under it is a PPP link, and adjusts the timeouts automatically, but
unfortunately the wireless bridges claim to be “Ethernet”.)

But you are right that this should never core.

-xtang


Inter-node spawn seems to have at least the following
clear problems, which may or may not be relevant to the crashes.

  1. If you spawn a process on another node, it’s a child of io-net
    on the destination node. When it dies, it becomes a
    zombie under io-net. io-net needs to check for dead children,
    but apparently does not do so. The undocumented “no
    zombies” flag on spawn seems to help (sketched after this list).
    This probably should be the default on remote spawns, since the
    parent/child relationship doesn’t work across node boundaries.

  2. The “maproot” option to QNET seems to affect all UIDs, not just
    root. If we set “maproot=99”, but don’t specify “mapany”,
    and user 99 is “nobody”, we can only use the “on” command across nodes if
    running as user “nobody”. If we don’t specify
    “maproot=99”, inter-node “on” works.
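
For reference, our remote spawn is roughly the sketch below (error handling
trimmed; the “no zombies” flag shows up as SPAWN_NOZOMBIE in <spawn.h>, which
is the one we mean above):

#include <spawn.h>
#include <string.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/netmgr.h>

extern char **environ;

/* Sketch: spawn argv[] on the node named 'node', without leaving a
   zombie under io-net on the remote side when the child exits. */
pid_t spawn_on_node(const char *node, const char *path, char * const argv[])
{
    struct inheritance inh;
    int nd;

    memset(&inh, 0, sizeof(inh));

    nd = netmgr_strtond(node, NULL);     /* node name -> node descriptor */
    if (nd == -1) {
        perror("netmgr_strtond");
        return -1;
    }
    inh.nd = nd;
    inh.flags = SPAWN_SETND | SPAWN_NOZOMBIE;

    /* fd_count 0 and fd_map NULL: the child inherits our open descriptors. */
    return spawn(path, 0, NULL, &inh, argv, environ);
}

With SPAWN_NOZOMBIE there is nothing to waitpid() for, which fits the fact
that the real parent is on another node anyway.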

Our sysadmin should be trying the suggested timing tweaks. Where
do we send the io-net dumps?

John Nagle
Team Overbot




My $0.02:

I happen to think that the parent/child relationship should extend
across node boundaries. I guess the problem comes when the network
is severed and the child later terminates: who would do the waitpid?

I think that it is OK to change ownership of the child to io-net
if the virtual circuit (or other bookkeeping entity) that represents
the connection between the remote parent and local child is destroyed
due to a network failure; and yes, io-net should be able to find out
when the child that it adopted in this way terminates, and perform
the waitpid.

Rennie Allen <rallen@csical.com> wrote:


In this case, I think the child should be re-parented to Proc. This
is consistent with the local case, where if the parent of a child
exits/terminates, the child gets reparented to Proc.

-David

QNX Training Services
http://www.qnx.com/support/training/
Please followup in this newsgroup if you have further questions.

That would be nice, but it would require an API change.
“getppid()”, etc. would have to return a node ID as well
as a process ID. That’s not unreasonable, considering that
QNX provides a form of “kill” that accepts a node ID.
QNX already supports things like creating a pipe and
passing one end to a spawned process, so you can create
parent/child pipe connections. We’ve found that to be a useful
means of monitoring child death. When the child dies, the
pipe breaks.
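
In sketch form (not our production code), the pattern is: create a pipe, map
the write end into the child as an extra descriptor via spawn()’s fd_map,
close our own copy, and treat read() returning 0 on the other end as the
child-death notification:

#include <spawn.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

extern char **environ;

/* Sketch: start 'path' and return a descriptor that hits EOF when the
   child exits. For a remote child, also set SPAWN_SETND and inh.nd as in
   the earlier spawn sketch. */
int spawn_with_death_pipe(const char *path, char * const argv[], pid_t *pid)
{
    struct inheritance inh;
    int pfd[2];
    int fd_map[4];

    if (pipe(pfd) == -1)
        return -1;

    memset(&inh, 0, sizeof(inh));

    /* Child fds 0..2 are our stdin/stdout/stderr; child fd 3 is the pipe's
       write end. Nothing else is inherited. */
    fd_map[0] = 0;
    fd_map[1] = 1;
    fd_map[2] = 2;
    fd_map[3] = pfd[1];

    *pid = spawn(path, 4, fd_map, &inh, argv, environ);

    close(pfd[1]);                   /* keep only the read end locally */
    if (*pid == -1) {
        close(pfd[0]);
        return -1;
    }
    return pfd[0];                   /* read() returns 0 when the child dies */
}

The parent then just reads or selects on the returned descriptor; EOF or an
error means the child is gone.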

I’m more concerned about the unkillable zombies piling up
under io-net, which is clearly a defect. But even for that
we have a workaround.

The io-net crashes are the serious problem.

Remember, we’re putting all this on a robot vehicle.
If io-net crashes, the hardware watchdog timer slams on
the brakes and kills the engine in about 200ms. The words
“QNX NET FAILED” appear in a big LED sign. Then other
watchdog timers reboot all the computers, and the
vehicle starts up again after a minute or so.

John Nagle
Team Overbot


Hi John

Just curious, is your robotic vehicle just for research, or does it
have a practical reason for being?



Hi Xiaodan,

I am with John Nagle. I tried your changes on our setup; it still produces the
same sloginfo errors. Could you please have a look at the rc.local below and
see if I did anything wrong implementing your changes?
The issue really does seem to be latency dependent, as you said. Taking
out WEP encryption and the hubs in between the links occasionally made the
sloginfo errors go away (latency?). Are there any other fixes you recommend?

My setup:

/etc/rc.d/rc.local:

mount -T io-net -o busvendor=0x8086,busdevice=0x103a /lib/dll/devn-speedo.so

# Restart TCP/IP networking so that the new Ethernet driver is attached to it.
netmanager -r all

# Start QNX native networking.
# map user to vehicle
mount -T io-net -o "ticksize=200,sstimer=0x00140014" /lib/dll/npm-qnet.so


Between node0 and node1 (node0 calls spawn with node1’s nd):

Linksys WET11 bridge and Linksys WAP11 access point with 128-bit encryption,
between 2 hops of hubs.


qnet and io-net are both Jan 18 2003 versions

sloginfo output after internode spawning:

Nov 19 11:34:58 7 15 0 npm-qnet(stats): kif_client@82
kif_net_client Underflow(4294967295)

Nov 19 11:34:58 7 15 0 npm-qnet(stats): kif_client@82
kif_net_client Underflow(4294967295)

Nov 19 11:34:58 7 15 0 npm-qnet(kif): nd(00010003, 00010003),
server_id (40000026, 4000001f), client_id (00000020, 00000020), v->buffer 0
at kif_client.c:705
(Bad file descriptor)

Nov 19 11:34:58 7 15 0 npm-qnet(L4): trans_input.c:438 (Bad file
descriptor)

Nov 19 11:34:58 7 15 0 npm-qnet(kif): nd(00010003, 00010003),
server_id (40000026, 4000001f), client_id (00000020, 00000020), v->buffer 0
at kif_client.c:705
(Bad file descriptor)


After internode spawning, I got the following:

node1$ ls /net/node0
ls: readdir of ‘/net/node0’ failed (Bad file descriptor)

Khian Hao Lim





Xiaodan Tang has identified a buffer overflow in io-net which is
causing some of our problems.

Currently, inter-node spawn is unreliable. Sometimes it
works, sometimes it fails, and on rare occasions, io-net
on the destination machine gets a segmentation fault.
Some sequences of chroot/spawn/exec seem to bring out the defect.
Details have been provided to Xiaodan by our Khian Hao.

Because we designed our system assuming this QNX feature
works, we have a serious problem. Who do we need to talk to
to get this fixed quickly? We’ve tried various workarounds,
but nothing really satisfactory or that we can trust has
emerged.

John Nagle
Team Overbot
650-326-9109


Hmm,
I’ve experienced some zombies and, more rarely, io-net crashes as recently
as last week, caused by smbd.
Could be the same problem.

Alain


Here is an email I sent to Khian to explain the difference between
“on -n” and “on -f”. I thought it would be useful to post it here so
people could get a better idea of why they saw “weird” behavior.

-xtang

--------------- Email starts from here ---------------------------------
OK :-)

gcrear1$ on -n bobcat /bin/ls /

This will have the binary running on “bobcat”, with ROOT=/net/gcrear1, or,
if we spell out all the network paths, the command above translates to:

gcrear1$ /net/gcrear1/bin/on -n bobcat /net/gcrear1/bin/ls /net/gcrear1

The libc.so.2 that “ls” needs also comes from /net/gcrear1/lib/libc.so.2.

On the other hand, “on -f bobcat /bin/ls /” would have
ROOT=/net/bobcat, which means:

gcrear1$ /net/gcrear1/bin/on -f bobcat /net/bobcat/bin/ls /net/bobcat

So “on -f …” is more like telnetting into the box and running a
program there. This is usually what people expect a
“remote spawn” to do.

The “on -n” form shows a lot of “weird behavior”. For example, if you
“on -n bobcat ping yahoo.com” and tcpdump the traffic, you
can see the packet source address is “gcrear1”, not “bobcat” as
most people would expect. Why?

Because “ping” is running with ROOT=/net/gcrear1. So when it calls
“socket()”, that is an open("/dev/socket/2", …) underneath, which,
because of the ROOT, procnto translates to
open("/net/gcrear1/dev/socket/2", …),
so it uses the TCP/IP stack back on gcrear1.

Almost everything on QNX involves a pathname resolve and message
passing. So if you “on -n bobcat application” and your application
opens a pipe, it will call open("/dev/pipe", …), which translates back to
"/net/gcrear1/dev/pipe"; it then uses the pipe manager on gcrear1 to create
the pipe. If your application calls
“mq_open()”, it goes all the way back to "/net/gcrear1/dev/mqueue"…
Some managers may refuse to be accessed from a remote node, which means
some calls will fail.

So, in short, “on -f bobcat” gives the behavior everybody usually
expects :-) Hope this clears the mud a little bit.

-Xiaodan Tang

We understand that. But that isn’t the major problem.
io-net crashing is the major problem. Most of the other
problems we encounter stem from trying to find workarounds
for io-net crashing. If we can get that fixed, we can
probably deal with the rest of the issues.

Whom do we need to talk to to get this fixed? Thanks.


John Nagle
