io-net process crash at npm-tcpip-v6.so

We’re currently in the middle of developing a distributed equipment
control system involving several dozen nodes, all running QNX 6.2.1 Patch
B.

Control processes communicate via QNet; user GUI connects to the network via
several TCP-to-QNet bridges (just processes that perform file operations and
return results via sockets). Most nodes are EmCORE-v611 PC104 cards, though
there are several dedicated ordinary PCs, most equipped with two RTL8139
network adapters.

io-net is started as “io-net -drtl -pqnet -ptcpip”, where npm-tcpip.so is a
link to npm-tcpip-v6.so.

Quite often, at least once a day, the io-net process dumps core.

Before we installed Patch B it crashed in npm-qnet.so. The release notes for
Patch B say that it fixes “QNET fault in certain but unpredictable situations.
(Ref# 15719)”, and I must agree: it always crashed in the same place,
but in unpredictable situations. We hoped that Patch B would fix the problem
completely. Unfortunately it has not, though the nature of the crash has
changed. I’m enclosing some information here; I can send a coredump as well.

=== Cut here ===

thread 5 SIGNALLED-SIGSEGV code=1 MAPERR refaddr=14 fltno=11
ip=0xb825845d sp=0x7f5f9c8 stkbase=0x7f73000 stksize=135168
state=STOPPED flags=84000000 last_cpu=1 timeout=00000000

pri=10 realpri=10 policy=OTHER

0xb825845a <tcpip_write+66>: mov %eax,0xffffff94(%ebp)
0xb825845d <tcpip_write+69>: mov 0x14(%eax),%eax

(gdb) i r
eax 0x0 0

#0 0xb825845d in tcpip_write () from /x86/lib/dll/npm-tcpip-v6.so

(gdb) i thr
9 process 11 0xb03294c1 in MsgReceive () from /x86/lib/libc.so.2
8 process 10 0xb032a209 in SyncCondvarWait_r () from /x86/lib/libc.so.2
7 process 9 0xb03294c1 in MsgReceive () from /x86/lib/libc.so.2
6 process 8 0xb03294c1 in MsgReceive () from /x86/lib/libc.so.2
5 process 7 0xb0329321 in MsgReceivev () from /x86/lib/libc.so.2

* 4 process 5 0xb825845d in tcpip_write () from /x86/lib/dll/npm-tcpip-v6.so
3 process 4 0xb03294c1 in MsgReceive () from /x86/lib/libc.so.2
2 process 2 0xb03294c1 in MsgReceive () from /x86/lib/libc.so.2
1 process 1 0xb032a0e5 in SignalWaitinfo () from /x86/lib/libc.so.2

(gdb) thr 4
[Switching to thread 4 (process 5)]#0 0xb825845d in tcpip_write ()
from /x86/lib/dll/npm-tcpip-v6.so

(gdb) bt
#0 0xb825845d in tcpip_write () from /x86/lib/dll/npm-tcpip-v6.so
#1 0xb0331d08 in _resmgr_io_handler () from /x86/lib/libc.so.2
#2 0xb033136e in _resmgr_handler () from /x86/lib/libc.so.2
#3 0xb825fed1 in pcreat () from /x86/lib/dll/npm-tcpip-v6.so
#4 0xb825ff58 in pcreat () from /x86/lib/dll/npm-tcpip-v6.so
#5 0xb031c02e in _flist_first_fit () from /x86/lib/libc.so.2
#6 0x08159fc8 in ?? ()
#7 0x080e5fc8 in ?? ()
#8 0x08159fc8 in ?? ()
#9 0x080e5fc8 in ?? ()
#10 0x080e5fc8 in ?? ()
#11 0x08159fc8 in ?? ()
#12 0x08159fc8 in ?? ()
#13 0x080e5fc8 in ?? ()
#14 0x080e5fc8 in ?? ()
#15 0x080e5fc8 in ?? ()

=== Cut here ===
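
A note on reading the dump above: with eax equal to 0, the instruction `mov 0x14(%eax),%eax` reads from address 0x14, which matches `refaddr=14` if that field is printed in hexadecimal. This is the usual signature of reading a member through a null structure pointer. The sketch below reproduces the arithmetic in C. The structure `hypothetical_ctx` and both helper functions are invented for illustration only; the real layout inside npm-tcpip-v6.so is not known from the dump.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the real structure used by npm-tcpip-v6.so.
 * It only exists to place a member at offset 0x14, as in the dump. */
struct hypothetical_ctx {
    uint32_t pad[5];   /* 5 * 4 bytes = 20 bytes = 0x14 */
    uint32_t field;    /* member at offset 0x14 */
};

/* Address that reading p->field would touch, computed without
 * actually dereferencing p. */
static uintptr_t field_address(const struct hypothetical_ctx *p)
{
    return (uintptr_t)p + offsetof(struct hypothetical_ctx, field);
}

/* Guarded read: returns -1 instead of faulting when p is NULL,
 * 0 on success with the value stored in *out. */
static int read_field(const struct hypothetical_ctx *p, uint32_t *out)
{
    if (p == NULL)
        return -1;
    *out = p->field;
    return 0;
}
```

Calling `field_address(NULL)` yields 0x14, the faulting address. Whatever produced the pointer in `tcpip_write` apparently returned null without being checked, though only the source or a debug build could confirm that.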

At the moment we do not have a clear picture of the systems that crash, but the
fact that the same crash occurs on different hardware most likely indicates
an algorithmic error.

What would you suggest to resolve the problem, which is extremely serious?

  1. Use the tiny TCP stack?
    Probably it’s a good option, but it’ll require more testing, and some
    software would have to be altered to work with the tiny TCP stack.

  1a. Use npm-tcpip-v4.so?
    Is there any difference?

  2. Restart io-net when it crashes?
    Run it under a guardian process, catch SIGCHLD… It’s dirty, but can be
    used as a last resort… and only to give us time to migrate to a more
    stable OS if QNX networking cannot be made to work.

I would prefer the following, if possible:

  1. If this is a known problem, I would appreciate a patch
  2. If it’s not, we could install debug builds and send debug coredumps (or
    release coredumps with symbols)
  3. Source code would be appreciated
  4. I do not think it’s a good idea at all to load many DLLs in one process,
    because they can corrupt each other; I tried to run the tcpip and qnet
    modules in two separate io-net processes, but it looks like it just does
    not work that way. Then again, the crash is caused by dereferencing a zero
    address, so this might not be the issue.

We need the network subsystem to work; we have enough of our own bugs to fix :)

Any help is appreciated,
Roman

I don’t see offhand how this crash can occur. Does
it always happen in the same spot? Is it possible to
simplify the system? I.e., does it happen without qnet?
Does it happen with a different driver?

-seanb

We reported some serious QNET problems a few months back, when
we were running QNET over a wireless Ethernet bridge. There
were some buffering bugs in 6.2.1 that appeared when you ran QNET
over what the node thought was a fast Ethernet but which
actually contained a slower link in the middle. This
could exercise buffering situations not encountered when
QNET was run entirely over a local LAN. io-net would crash
almost every time we did this.

Search my previous postings (someone does archive the
QNX groups somewhere, I hope) for details.

John Nagle
Team Overbot
