Send/Reply problem over virtual circuit

Hi

Let me start by giving the essential software and hardware versions. I’m
running QNX 4.25 as installed from the May 2001 CD. I’m running the software
on Compaq DP4000 6233MMX PCs with 3Com 3C509 cards or on an Advantech PCM-5824
Geode PC104 card with an RTL 8139 Ethernet chip.

We are developing a medical device that, for safety reasons, needs
two CPUs. In our product we have devised a token ring plus a CPU handshaking
mechanism to make sure that everything is alive and well. The CPU
handshaking runs over virtual circuits set up with qnx_name_attach() and
qnx_name_locate(), and the CPUs take turns initiating the handshake after the
local token has done a full lap on a CPU. This happens once per second in our
system.

The handshake is an integer. The receiving side takes the integer and
computes “~(handshake+3)”, then sends the result back as the reply. This
seemed to work fine, but during longer test runs the handshake reply was
suddenly scrambled. For some reason it seems that the Reply() data is lost
and the data from the next Send() is delivered twice instead!
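
As a minimal sketch of the transform described above, assuming plain int arithmetic: the function names are hypothetical and not taken from the attached program, and the value pairs are the ones that appear in the debug printouts.

```c
#include <assert.h>

/* Reply value the receiving side is expected to compute for a given
   handshake: add 3, then invert every bit. */
int expected_reply(int handshake)
{
    return ~(handshake + 3);
}

/* Hypothetical self-check using handshake/reply pairs from the printouts. */
void check_logged_pairs(void)
{
    assert(expected_reply(858) == -862);
    assert(expected_reply(839) == -843);
}
```

On a two's complement machine ~x equals -x - 1, so a handshake h should always come back as -(h + 4).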

The following is perhaps not that easy to follow, but hopefully the debug
printouts illustrate the problem:


Protective CPU:
sending = 1
Sending: sendHshk.theHndshk: 858 counter: 474
Reply handshake OK: -862 counter 475
sending = 0
Now we’re waiting for other CPU…
Received 839 , old counter 421 will reply newhndshk: -843, new counter: 422
sending = 1
Sending: sendHshk.theHndshk: 860 counter: 475
Unable to send handshake, errno: No such process
Resetting… PROCESSING_ERROR Child process died

Control CPU:
sending = 0
Now we’re waiting for other CPU…
Received 858 , old counter 474 will reply newhndshk: -862, new counter: 475
sending = 1
Sending: sendHshk.theHndshk: 839 counter: 421 recvHshk: 0
Reply handshake was wrong: s/b -843, is 860 counter: 475
Resetting… HANDSHAKE_SCRAMBLED Child process died

The last handshake that Control sends is 839 with counter 421. The
printout from Protective says that it has calculated the new values -843 and
422 and that this is what it will reply. The reply never reaches Control,
though, as the printout shows. If I remove the code that kills the handshake
process and just keep running, the next handshake that Protective sends
(handshake 860, counter 475) both replaces the “real” Reply() data and
also arrives at Control as the new handshake.
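
The check that trips in the Control printout can be sketched as follows. This is an illustration only: the struct, enum and function names are hypothetical, and validating the counter as well as the handshake is an assumption based on the printed "new counter" values, not necessarily what the real software does.

```c
#include <assert.h>

/* Hypothetical mirror of the two-field handshake message. */
struct handshake_msg {
    int handshake;
    int counter;
};

enum reply_status { REPLY_OK, REPLY_SCRAMBLED };

/* Compare a reply against what the partner should have computed from the
   message we sent: handshake ~(h + 3) and counter + 1 (assumed rule). */
enum reply_status check_reply(const struct handshake_msg *sent,
                              const struct handshake_msg *reply)
{
    assert(sent && reply);
    if (reply->handshake != ~(sent->handshake + 3))
        return REPLY_SCRAMBLED;
    if (reply->counter != sent->counter + 1)
        return REPLY_SCRAMBLED;
    return REPLY_OK;
}
```

With the values above, a message of {839, 421} answered by {-843, 422} passes, while the {860, 475} that Control actually received is reported as scrambled.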

The problem can occur after 5 seconds or after a day. I’m running Phindows
on NT (one session for each CPU), and I can provoke the crash by
dragging a Windows window over the Phindows windows so that Phindows has to
redraw. If I look at the netinfo output from around the time of the crash, I
can find messages such as these:


17:24:09 1 (3175) rtl ( tx) Heartbeat Fail
17:24:09 1 (3175) rtl ( tx) Heartbeat Fail
17:24:09 1 (3175) rtl ( tx) Heartbeat Fail
17:24:09 1 (3175) rtl ( tx) Heartbeat Fail

I searched QSSL’s website and found these two items:

http://qdn.qnx.com/support/bok/solution.qnx?9499
http://qdn.qnx.com/support/bok/solution.qnx?9556

At first I suspected Pentium cache problems, but after trying my code
on different types of hardware I’ve now dismissed that suspicion.
Virtual circuits and networking now seem to be the most likely cause of the
problem.

Has someone else experienced something like this?

I’m attaching a stripped version of our handshaking software (running at
full speed instead of once per second). If anyone wants to test it, compile
the two executables with:

cc -o masterCtrl miniPFS.c
cc -DPROTECTIVE -o masterProt miniPFS.c

Start the executables on two QNX nodes and watch what happens (of course, a
bug never shows up when you want it to). :-)

Thanks in advance,
P-O Håkansson




/* miniPFS.c */
#include <stdlib.h>
#include <string.h>      /* strerror() */
#include <unistd.h>      /* sleep() */
#include <sys/kernel.h>
#include <sys/name.h>
#include <errno.h>
#include <stdio.h>
#include <time.h>

/* If we are compiling for a Protective processor, then the handshaking
   partner is the ControlMasterPFS, and vice-versa for the Control processor */
#ifdef PROTECTIVE
#define HNDSHK_PARTNER   "/ControlMasterPFS"
#define MY_REG_NAME      "/ProtectiveMasterPFS"
#else
#define HNDSHK_PARTNER   "/ProtectiveMasterPFS"
#define MY_REG_NAME      "/ControlMasterPFS"
#endif

int main(void)
{
   pid_t   rmtPid;
   int     errvalue;
   int     hndshk;

   typedef struct hshk_str {
      int theHndshk;
      int theCounter;
   } HshkType;
   HshkType sendHshk, recvHshk;
   int     newhndshk = 0, oldhndshk, oldcounter;
   int     sending;
   int     found = 0;

#ifdef PROTECTIVE
   sending = 0;
   hndshk = 22;
   sendHshk.theCounter = 55;
#else
   sending = 1;
   hndshk = 1;
   sendHshk.theCounter = 1;
#endif

   /* Attach this process's name */
   errvalue = qnx_name_attach(0, MY_REG_NAME);

   printf("Finding handshake partner's qnx name\n");

   /* Locate the remote name.  Since the other process may not be active,
      a loop is required.  However, we only loop a number of times, which
      is governed by DELAY_TIME_SEC.  Each time through delays approximately
      one second */
   while (!found)
   {
      if ((rmtPid = qnx_name_locate(0, HNDSHK_PARTNER, 0, NULL)) == -1)
      {
         sleep(1);
      }
      else found = 1;
   }

   printf("Found handshake partner's qnx name\n");

   while (1)
   {
      if (sending)   /* This flag controls who sends the handshake first */
      {
         /* Send the handshake message to the other masterPFS */
         recvHshk.theHndshk = newhndshk;
         sendHshk.theHndshk = hndshk;
         sendHshk.theCounter++;
         printf("Sending: sendHshk.theHndshk: %d counter: %d recvHshk: %d\n",
                sendHshk.theHndshk, sendHshk.theCounter, recvHshk.theHndshk);
         /* errvalue is 1 if Send() failed; it is not checked below */
         errvalue = (Send(rmtPid, (char *)&sendHshk,
                          &recvHshk,
                          sizeof(HshkType),
                          sizeof(HshkType)) == -1);

         /* Check message validity */
         if (recvHshk.theHndshk != ~(hndshk + 3))
         {
            time_t time_of_day;
            char buf[26];

            time_of_day = time(NULL);
            printf("Reply handshake was wrong: s/b %d, is %d counter: %d\nTime: %s",
                   ~(hndshk + 3), recvHshk.theHndshk,
                   recvHshk.theCounter, _ctime(&time_of_day, buf));
            exit(1);
         }
         else
         {
            printf("Reply handshake OK: %d counter %d\n",
                   recvHshk.theHndshk,
                   recvHshk.theCounter);
         }
         /* Increment handshake, handling wraparound */
         if (hndshk > 65000)
            hndshk = 1;
         else
            hndshk += 2;
      }
      else   /* Not first sender */
      {
         /* Wait for message from other process containing new
            handshake message */
         printf("Now we're waiting for other CPU...\n");
         /* Receive() returns the sender's process ID; it overwrites rmtPid
            and is therefore also the target of the next Send() */
         rmtPid = Receive(0, &recvHshk,
                          sizeof(HshkType));

         /* Change handshake and send it back */
         oldcounter = recvHshk.theCounter++;
         oldhndshk = recvHshk.theHndshk;
         recvHshk.theHndshk =
            ~(recvHshk.theHndshk + 3);

         printf("Received %d , old counter %d will reply newhndshk: %d, new counter: %d\n",
                oldhndshk, oldcounter, recvHshk.theHndshk, recvHshk.theCounter);
         if (Reply(rmtPid, &recvHshk,
                   sizeof(HshkType)) == -1)
         {
            printf("Unable to reply to hndshk, errno: %s\n", strerror(errno));
            exit(1);
         }
      }   /* End If-Else */

      newhndshk = 0;
      sending = (sending == 1) ? 0 : 1;
      printf("sending = %d\n", sending);
   }  /* End while */
}

“P-O Håkansson” <par-olof.hakansson@gambro.com> wrote in message
news:9q1q3v$jj0$1@inn.qnx.com

> Has someone else experienced something like this?

I’ve seen messages get mixed up when two Reply() calls are made for the same
message. I think this was fixed (the second Reply() would return an error).

> I’m attaching a stripped version of our handshaking software (running at
> full speed instead of once per second). If anyone wants to test it, compile
> the two executables with:
>
> cc -o masterCtrl miniPFS.c
> cc -DPROTECTIVE -o masterProt miniPFS.c

It’s been running for over an hour on a 100 Mbit/s network without any error
(I think). It would be nice if the test program stopped on an error.
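
One way to make a test program stop on a failed call is sketched below, assuming only the convention used in the attachment that Send() and Reply() return -1 on failure and set errno. The helper name call_failed is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and reports errno when a call signalled failure with -1,
   so the caller can stop the test run; returns 0 otherwise. */
int call_failed(int rc, const char *what)
{
    assert(what);
    if (rc == -1) {
        fprintf(stderr, "%s failed, errno: %s\n", what, strerror(errno));
        return 1;
    }
    return 0;
}
```

In the test program this could wrap each message-passing call, for example by passing the return value of Send() to call_failed() and calling exit(1) when it returns 1.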


Hi Mario

Thank you for taking the time to test the file I attached. In
qdn.public.qnx4.devtools I found the following in a post from June 6th by
Ray Threadgould (I’m taking the liberty of CC’ing Ray Threadgould on this, since
it seems he never got an answer to his post; the current thread is in
qdn.public.qnx4):

> I have been experiencing problems with virtual circuits failing when more
> than one node is “sending” to a process on another node, the problem seems
> to be worse if a process or processes on the remote node are also talking
> to the remote process.

This is exactly my problem too. Mario, if you have the time, could you
try my example in a simple ditto session to see if that makes the problem
occur? Something like this, perhaps:

On Node 1:
Don’t run Photon, just boot up into a login shell. Start masterCtrl.

On Node 2:
Create two pterms in Photon. In the first pterm run masterProt. In the
second pterm use ditto to view masterCtrl on Node 1, e.g. ditto -k -t1
//1/dev/con1

When I run masterCtrl/masterProt between any two of my eight nodes, one of
them stops because Send() returned a faulty handshake reply.

Thanks,
P-O Håkansson