Hi
Let me start by giving the essential software and hardware versions. I’m
running QNX 4.25 as installed from the May 2001 CD. I’m running the software
on Compaq DP4000 6233MMX PCs with 3Com509 cards or on an Advantech PCM-5824
Geode PC104 card with an RTL 8139 Ethernet chip.
We are developing a medical device where for safety reasons we need to have
two CPUs. In our product we have devised a token ring plus a CPU handshaking
mechanism to make sure that everything is alive and well. The CPU
handshaking is done over virtual circuits: each side registers a name with
qnx_name_attach() and locates its partner with qnx_name_locate(). The CPUs
take turns initiating the handshake after the local token has done a full
lap on a CPU. This happens once per second in our system.
The handshake is an integer; the receiving side computes “~(handshake+3)”
and sends the result back as the reply handshake. This seemed to work fine,
but during longer test runs the handshake reply was suddenly scrambled. For
some reason it seems that the Reply() data is lost and instead the data from
the next Send() is delivered twice!
The following is perhaps not that easy to follow, but hopefully the debug
printouts illustrate the problem:
Protective CPU:
sending = 1
Sending: sendHshk.theHndshk: 858 counter: 474
Reply handshake OK: -862 counter 475
sending = 0
Now we’re waiting for other CPU…
Received 839 , old counter 421 will reply newhndshk: -843, new counter: 422
sending = 1
Sending: sendHshk.theHndshk: 860 counter: 475
Unable to send handshake, errno: No such process
Resetting… PROCESSING_ERROR Child process died
Control CPU:
sending = 0
Now we’re waiting for other CPU…
Received 858 , old counter 474 will reply newhndshk: -862, new counter: 475
sending = 1
Sending: sendHshk.theHndshk: 839 counter: 421 recvHshk: 0
Reply handshake was wrong: s/b -843, is 860 counter: 475
Resetting… HANDSHAKE_SCRAMBLED Child process died
The last handshake that Control sends is 839 and the counter is 421. The
printout from Protective says that it has calculated the new values -843 and
422 and that’s what it will reply. The reply never reaches Control though,
as the printout shows. If I remove the code that kills the handshake process
and just keep running, the next handshake that Protective sends (handshake
860, counter 475) both replaces the “real” Reply() data and arrives at
Control as the new handshake.
The problem can occur after 5 seconds or after a day. I’m running Phindows
on NT (one session for each CPU) and I can provoke the system to crash by
dragging a Windows window over the Phindows windows to cause Phindows to do
redraws. If I look at the netinfo output at around the time of the crash I
can find messages such as these:
…
17:24:09 1 (3175) rtl ( tx) Heartbeat Fail
17:24:09 1 (3175) rtl ( tx) Heartbeat Fail
17:24:09 1 (3175) rtl ( tx) Heartbeat Fail
17:24:09 1 (3175) rtl ( tx) Heartbeat Fail
…
I searched on QSSL’s website and found these two items:
http://qdn.qnx.com/support/bok/solution.qnx?9499
http://qdn.qnx.com/support/bok/solution.qnx?9556
At first I suspected Pentium cache problems, but after trying my code on
different types of hardware I’ve now dismissed that suspicion. Virtual
circuits and networking now seem to be the most likely cause of the
problem.
Has anyone else experienced something like this?
I’m attaching a stripped version of our handshaking software (running at
full speed instead of once per second). If anyone wants to test it, compile
the two executables with:
cc -o masterCtrl miniPFS.c
cc -DPROTECTIVE -o masterProt miniPFS.c
Start the executables on two QNX nodes and watch what happens (of course a
bug will never show up when you want it to).
Thanks in advance,
P-O Håkansson
[Attachment: miniPFS.c]