Data corruption with Send()/Receive()

Very occasionally I’m seeing that data sent with Send() is not correctly
received. I have a hard time believing this is the OS’s fault, but I
verified the data on the line just before the Send(). Hopefully someone
has experienced something like this (stack corruption, etc.).

I have noticed this problem with 3 separate programs but they need to run
for long periods of time (> 4 hours) before the error occurs. The messages
are sent frequently (200 Hz) but always to processes on the same processor.
I am using QNX 4.23-4.25

I defined the first 4 bytes of the message to be a long integer identifier
for the message type. When the problem occurs, this identifier is set to
zero, which is not a type I have defined. Because it is not defined, my
program doesn’t reply to it, which means the sender stays reply-blocked and
never runs again. I could easily detect this and reply NACK, but I don’t
want to go back and add a retry protocol to all my sends, since Send/Receive
is supposed to be reliable. Also, I don’t want to hide a problem.

Why is my data (long integer) being set to zero?

“ian zagorskih” <ianzag@novosoft-us.com> replied (news:9sf7ot$91o$1@inn.qnx.com):

  1. Do you receive system messages? You will receive them if any of the
    process flags such as _PPF_SERVER, _PPF_INFORM or _PPF_SIGCATCH is set.
    System messages are described in sys/sys_msg.h. In such a message the
    first two bytes are the message type and the next two are the message
    subtype, so the pair _SYSMSG/_SYSMSG_SUBTYPE_DEATH gives you a zero first
    long word. Is _PPF_INFORM set? If it were, though, you would get a
    “strange” message after the first shutdown of any local process. It
    depends on your system design, of course, but I guess processes are
    started and stopped more often than once every 4 hours.

  2. Do you check the value returned by Receive()? I believe you do;
    otherwise anything could be in the buffer.

  3. I guess that before calling Receive(0,&MyMsg,sizeof(MyMsg)) or the like
    you clear the MyMsg buffer with zeros. Note that if the sender sends less
    data than you are expecting, only the data actually sent is placed into
    the buffer given to Receive(). So if the sender sends a zero-length
    message and you have cleared the receive buffer, you get exactly the
    result you describe. It is a good idea not to give Receive() a buffer at
    all, and to read the received message with Readmsg()/Readmsgmx()
    instead. These functions return the actual number of bytes transferred
    into the receive buffer.

  4. Can you share the piece of code where you catch this problem? It would
    help a lot.
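
Point 1 depends on how a 16-bit type and a 16-bit subtype overlay the first 32-bit word of a message. A minimal, portable sketch of that overlay is below. The struct `msg_header`, the function `first_long_word()` and the two `DEMO_` constants are stand-ins invented for illustration; the real definitions live in sys/sys_msg.h and are only assumed here to be zero, as the post implies.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the values in sys/sys_msg.h; both are
 * ASSUMED to be zero here, as the post above implies. */
#define DEMO_SYSMSG               0u
#define DEMO_SYSMSG_SUBTYPE_DEATH 0u

/* Illustrative header layout: 16-bit type followed by 16-bit subtype. */
struct msg_header {
    uint16_t type;
    uint16_t subtype;
};

/* Return the first 32 bits of a header the way a receiver sees them when
 * it treats the start of the message as a single long identifier. */
uint32_t first_long_word(uint16_t type, uint16_t subtype)
{
    struct msg_header hdr;
    uint32_t word;

    assert(sizeof hdr == sizeof word);  /* two 16-bit fields, no padding */
    hdr.type = type;
    hdr.subtype = subtype;
    memcpy(&word, &hdr, sizeof word);   /* copy instead of pointer casting */
    return word;
}
```

With both fields zero the receiver's identifier reads as 0, which is indistinguishable from a cleared buffer; any nonzero type or subtype gives a nonzero word.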

// wbr

“Curt Stienstra” <curt.stienstra.nospam@bepco.com> replied (news:9sh55m$j87$1@inn.qnx.com):

Re 1 (system messages): Only _PPF_32BIT, _PPF_FLAT, _PPF_NOZOMBIE and
_PPF_SLEADER are set. The processes have no children.

Re 2 (Receive() result): Yes, I do check the return code from Receive(), and
it matches the pid of the sender.

Re 3 (cleared receive buffer): Yes, I do clear the first 4 bytes of the
message buffer before I call Receive(). I will try Readmsg(), as it would
indicate whether no data is being received or whether the data has been
corrupted.
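
The distinction between "no data received" and "data corrupted" comes down to looking at a byte count rather than at the buffer contents. The sketch below is a portable simulation, not the QNX API: `copy_from_message()` and `type_field_is_short()` are hypothetical helpers that only mimic the behaviour described above (a short message leaves the rest of the destination untouched, and the copy reports how many bytes it moved).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Portable stand-in for a kernel message copy (NOT the QNX API): copy at
 * most dst_len bytes of the sender's message, starting at offset, into dst.
 * Returns the number of bytes actually copied. Bytes of dst beyond that
 * count are left untouched. */
size_t copy_from_message(const void *msg, size_t msg_len, size_t offset,
                         void *dst, size_t dst_len)
{
    size_t avail;
    size_t n;

    assert(dst != NULL);
    if (offset >= msg_len)
        return 0;                       /* nothing to read at this offset */
    avail = msg_len - offset;
    n = avail < dst_len ? avail : dst_len;
    memcpy(dst, (const unsigned char *)msg + offset, n);
    return n;
}

/* Returns 1 if fewer bytes than a full type field arrived (short or empty
 * message), 0 if the whole type field was copied. */
int type_field_is_short(size_t bytes_copied)
{
    return bytes_copied < sizeof(long);
}
```

A zero-length message copied into a pre-cleared `long` leaves it at 0 and reports 0 bytes, so the receiver can tell an empty message apart from a full-size message whose type field really is zero.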

This is the code that is receiving the bad messages. It is similar to the
other programs that I have had the problem with. Of course, the default
reply was added for debugging and testing.

#define INPUT_BUF_LEN 1200
void main(int argc, char * argv[])
{
    pid_t pid;
    long rcv_buf[INPUT_BUF_LEN/4];

    while(!shut_down)
    {
        rcv_buf[0] = 0;
        pid = Receive(0, rcv_buf, INPUT_BUF_LEN);
        process_msg(pid, rcv_buf);
    }
}

char process_msg(pid_t pid, long *r_buf)
{
    long rbuf[40];

    if (r_buf[0] == REG_SPD_ACC_TRQ)
    {
        // do stuff
        rbuf[0] = ACK;
        Reply(pid, rbuf, sizeof(long));
    }
    else
    {
        printf("replying to pid %d despite bad command %ld\n",
               pid,r_buf[0]);
        rbuf[0] = NACK;
        Reply(pid, rbuf, sizeof(long));
    }
    return(1);
}


This code does the sending and retrying, it seems to only ever take 1
retry (limited testing.)

void update_log_reg(void)
{
    long s_buf[INPUT_BUF_LEN/sizeof(long)],
         r_buf[INPUT_BUF_LEN/sizeof(long)];
    float *fs_buf;
    int ii;

    for (ii=0;ii<3;ii++)
    {
        fs_buf = (float *)(s_buf);
        s_buf[0] = REG_SPD_ACC_TRQ;
        fs_buf[1] = flt_spd;
        fs_buf[2] = flt_accel;
        fs_buf[3] = flt_force;
        r_buf[0] = 0;
        Send(Regulator_pid, s_buf, r_buf, 4 * sizeof(float),
             INPUT_BUF_LEN);

        if (r_buf[0] == ACK) break;
        else printf("retry #%d sending to reg\n",ii+1);
    }
}

“ian zagorskih” <ianzag@novosoft-us.com> replied:

I would do it something like this:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/kernel.h>   /* Receive(), Readmsg(), Reply() on QNX 4 */

/* Read exactly DataSize bytes of the message from InputPid, starting at
   Offset. Returns 0 on success, -1 on error or on a short read. */
static int ReadData(pid_t InputPid, unsigned Offset, void *DataBuf,
                    size_t DataSize) {
    int ReadSize;
    int result;

    result = -1;
    ReadSize = Readmsg(InputPid, Offset, DataBuf, DataSize);
    if (ReadSize != -1) {
        if ((size_t)ReadSize == DataSize) {
            result = 0;
        } else {
            fprintf(stderr, "Read message error: read %d of %d\n",
                    ReadSize, (int)DataSize);
        }
    } else {
        fprintf(stderr, "Read message error %s\n", strerror(errno));
    }
    return result;
} /* ReadData */

static long ProcessMessage(pid_t InputPid) {
    /* ReadData() succeeds only when exactly sizeof(RecBuf) bytes are read,
       so size RecBuf to the payload this message type really carries. */
    long RecBuf[(INPUT_BUF_LEN - 1)/sizeof(long)];
    long MsgStatus;

    MsgStatus = NACK;
    /* The payload starts right after the long message type. */
    if (ReadData(InputPid, sizeof(long), RecBuf, sizeof(RecBuf)) != -1) {
        /* Process data here */
        MsgStatus = ACK;
    }
    return MsgStatus;
} /* ProcessMessage */

int main(int argc, char *argv[]) {
    pid_t InputPid;
    long MsgType;
    long MsgStatus;

    while (ShutDown == 0) {
        /* Receive without a buffer; the data is read with Readmsg(). */
        InputPid = Receive(0, NULL, 0);
        if (InputPid != -1) {
            MsgStatus = NACK;
            if (ReadData(InputPid, 0, &MsgType, sizeof(MsgType)) != -1) {
                switch (MsgType) {
                case REG_SPD_ACC_TRQ:
                    MsgStatus = ProcessMessage(InputPid);
                    break;
                default:
                    fprintf(stderr, "Unknown message type %ld\n", MsgType);
                    break;
                }
            }
            Reply(InputPid, &MsgStatus, sizeof(MsgStatus));
        } else {
            fprintf(stderr, "Receive message error %s\n", strerror(errno));
        }
    }
    return 0;
} /* main */

// wbr

“Norton Allen” <allen@huarp.harvard.edu> wrote (news:3BFA5152.C0416AB4@huarp.harvard.edu):

For what it’s worth, it is possible to get corrupted data in a
network-sent message if the hardware is bad. We have seen this
on a number of occasions with particular network cards.
Replacing the network card has resolved the problem. We diagnosed
the problem by writing a little relay program which you run on
two different nodes. The relays pass random packets back and forth and
verify their contents after the round trip.
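
The round-trip check described above can be sketched without any network at all. This is not the original relay program: `transport_fn`, `fill_packet()`, `check_round_trip()` and the two loopback transports are hypothetical names, and a real test would replace the loopback with a send to the relay on the other node.

```c
#include <stddef.h>
#include <string.h>

#define PKT_MAX 1200   /* illustrative packet size limit */

/* A transport takes len bytes in `out` and delivers the echo in `in`.
 * In a real test this would go to the relay on the other node and back. */
typedef void (*transport_fn)(const unsigned char *out, unsigned char *in,
                             size_t len);

/* Small deterministic pseudo-random generator (an LCG), so a failing
 * packet can be regenerated from its seed. */
static unsigned long next_random(unsigned long *state)
{
    *state = (*state * 1103515245UL + 12345UL) & 0x7fffffffUL;
    return *state;
}

/* Fill buf with len pseudo-random bytes derived from seed. */
void fill_packet(unsigned char *buf, size_t len, unsigned long seed)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(next_random(&seed) >> 16);
}

/* Send one random packet through the transport and compare the echo.
 * Returns -1 if the round trip was clean, otherwise the offset of the
 * first corrupted byte. */
long check_round_trip(transport_fn transport, size_t len, unsigned long seed)
{
    unsigned char out[PKT_MAX];
    unsigned char in[PKT_MAX];
    size_t i;

    if (len > PKT_MAX)
        len = PKT_MAX;
    fill_packet(out, len, seed);
    memset(in, 0, sizeof in);
    transport(out, in, len);
    for (i = 0; i < len; i++)
        if (in[i] != out[i])
            return (long)i;
    return -1;
}

/* Two example transports for trying the checker without a network. */
void loopback_clean(const unsigned char *out, unsigned char *in, size_t len)
{
    memcpy(in, out, len);
}

void loopback_flip_byte5(const unsigned char *out, unsigned char *in,
                         size_t len)
{
    memcpy(in, out, len);
    if (len > 5)
        in[5] ^= 0x01;   /* simulate a single corrupted bit */
}
```

Running `check_round_trip()` in a loop with a changing seed and logging the seed and offset of every failure gives a record that can be compared before and after swapping the suspect card.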

“Norton Allen” <allen@huarp.harvard.edu> wrote in message
news:3BFA5152.C0416AB4@huarp.harvard.edu

For what it’s worth, it is possible to get corrupted data in a
network-sent message if the hardware is bad. We have seen this
on a number of occasions with particular network cards.

Yes. Although Ethernet packets have a checksum, FLEET itself doesn’t.
If the data gets corrupted going from the network card to system
memory, it’s impossible to detect.

I have seen this myself, with an ISA card.
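
If the transport cannot detect this kind of corruption, one option is for the application to carry its own checksum in each message. The sketch below uses Fletcher-16 purely as an example of a cheap checksum; it says nothing about how FLEET works internally, and `fletcher16()` and `checksum_matches()` are illustrative helpers that the sender and receiver would both have to adopt.

```c
#include <stddef.h>
#include <stdint.h>

/* Fletcher-16 checksum over len bytes of data. */
uint16_t fletcher16(const unsigned char *data, size_t len)
{
    uint16_t sum1 = 0;
    uint16_t sum2 = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        sum1 = (uint16_t)((sum1 + data[i]) % 255);
        sum2 = (uint16_t)((sum2 + sum1) % 255);
    }
    return (uint16_t)((sum2 << 8) | sum1);
}

/* Receiver side: returns 1 if the payload still matches the checksum the
 * sender computed before sending, 0 if it was altered on the way. */
int checksum_matches(const unsigned char *data, size_t len,
                     uint16_t sent_checksum)
{
    return fletcher16(data, len) == sent_checksum;
}
```

The sender would compute `fletcher16()` over the payload and place the result in a field of the message; the receiver recomputes it and replies NACK on a mismatch. Fletcher-16 cannot tell a 0x00 byte from a 0xFF byte, so a CRC is the stronger choice where that matters.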
