data deleted during backup

I need some help figuring out (and please excuse me for cross-posting)
why, during a backup from my main server to a hard drive on another
system (the backup system), most of the data on my server got deleted.

My setup: the main server runs QNX 4.24 with a Perceptive Solutions,
Inc. PCI-2000A RAID controller configured for RAID 5, on a P4B
motherboard with a 2.0 GHz CPU.

The backup system has an Adaptec 2940 and a P4 2.0 GHz CPU; we were
using the aha7scsi SCSI driver. That system has two hard drives. One,
a 4 GB drive mounted as /DATABKUP, is the one we back up onto. The
other hard drive, mounted as /hard, gets a copy of the data whenever
changes are made on the main server’s hard drive, through a
custom-written program.

For a few days “/bin/Fsys terminated (SIGSEGV)” had been displayed on
the backup system. When we then tried to access its drives, they were
not accessible. After rebooting the backup server all was good, and
most of the time we didn’t see that message again for a day or so.

However, yesterday, while the manual backup that happens each morning
was running on the slave server, machine language started showing up
where file names would normally be. A message was displayed on the
backup system saying “//1/rm terminated (SIGSEGV)”. At first I thought
the backup system’s adaptor driver had failed, so I rebooted it. I
then realised that the main server was also stuck, and I rebooted it
too. The main server came up, and after the OS showed the RAID
controller (after sinit) it displayed the message “no such file or
directory”. We then tried to boot off the backup server, which
displayed the same error message (important OS boot files were gone).

We were able to find another hard drive and boot off it; what we
observed was that most of the directories on the main server’s hard
disk (RAID) no longer exist. The disk on the slave server that accepts
the automatic copy looked like the server’s hard drive. The hard drive
that receives the manual backup had files on it, but many of them were
corrupt (an ls showed machine language where file names should be). I
figure the driver for the SCSI card on the slave died, but the rest I
can’t explain. Can anyone offer me some insight? After each backup we
also run chkfsys, and we have not seen any errors in months. The
manual backup script is listed below:

The backup script:
echo Starting backup to Hard Drive.

if test -d /DATABKUP/home.old
then
    mv /DATABKUP/home.old /DATABKUP/home.rm
fi

if test -d /DATABKUP/home
then
    mv /DATABKUP/home /DATABKUP/home.old
fi

if test -d /DATABKUP/home.rm
then
    rm -Rf /DATABKUP/home.rm &
fi

if test -d /home
then
    cp -Rvp /home /DATABKUP/home
fi

echo Done.
freeze -cdz /etc/logo.F
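For reference, a more defensive variant of the rotation above might look like this. This is only a sketch: the function name, the foreground rm, and the error handling are my additions for illustration, not part of the original script.

```shell
#!/bin/sh
# Sketch: the same home -> home.old -> home.rm rotation, hardened.

rotate_and_copy() {
    src=$1    # directory to back up, e.g. /home
    dest=$2   # backup mount point, e.g. /DATABKUP
    name=`basename "$src"`

    if test -d "$dest/$name.old"; then
        mv "$dest/$name.old" "$dest/$name.rm"
    fi

    if test -d "$dest/$name"; then
        mv "$dest/$name" "$dest/$name.old"
    fi

    # Remove the oldest copy in the foreground. The original script
    # backgrounded this rm, letting it still run while cp was writing.
    if test -d "$dest/$name.rm"; then
        rm -Rf "$dest/$name.rm"
    fi

    # Fail loudly if the copy itself fails, instead of printing Done.
    if test -d "$src"; then
        cp -Rp "$src" "$dest/$name" || {
            echo "backup copy of $src failed" >&2
            return 1
        }
    fi

    echo Done.
}
```

Called as `rotate_and_copy /home /DATABKUP`, this keeps the same three-generation rotation but never deletes anything while the new copy is still being written.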

Is node 1 the slave? If not, then “//1/rm terminated (SIGSEGV)” is
rather frightening unless you spawn the script to run on node 1.

The only time I ever saw “/bin/Fsys terminated (SIGSEGV)” was with a
beta version, and on defective hardware at that.

How do you know that the script is the problem?

If you are absolutely sure that the files had not been deleted by hand,
consider replacing hardware. If you can’t replace the whole machine(s),
at least replace the memory and power supply, if this is a desktop PC.

If the script is running on the slave, then the only way it can delete
files on the server is if its picture of the network & filesystems is
screwed up - say, by trashed memory. You can try things like “sin in” and
“sin -n1 in” to get an idea of what each node thinks is where.

I’m not sure this will provide insight, but you seem desperate.

Richard

“C . hughes” wrote:

[snip]

This sounds like very severe filesystem corruption. I have never seen
behavior like that from QNX, but that does not mean it is impossible. Cold
comfort for you of course :wink:

My suspicion would be the RAID system. I have no experience with your RAID
model, but in general RAID systems are tricky. Normally, on a supported OS,
a RAID setup includes some kind of ‘health monitor’ which allows RAID status
to be seen from remote hosts over TCP/IP, alarms to be sent when something
goes wrong, reconfiguration to be done, etc. If Perceptive Solutions did not
provide such functionality for QNX, then the only way to see that something
is wrong is to watch the RAID BIOS during boot (that is, if there are RAID
BIOS diagnostics). When the first hard drive fails with RAID 5, you don’t
notice it other than by slowness or by an alarm from the ‘health monitor’.
At that point you’re supposed to take immediate recovery action. When the
second one fails (or the controller/cable malfunctions), then you notice
it - either by an alarm or by filesystem corruption…
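The single-drive tolerance Igor describes comes from parity: each stripe stores the XOR of its data blocks, so any one missing block can be recomputed from the survivors. A toy sketch, with small numbers standing in for disk blocks (generic RAID 5 arithmetic, nothing specific to the PCI-2000A):

```shell
#!/bin/sh
# Toy RAID 5 parity: three "data blocks" plus their XOR parity block.
d1=37 d2=142 d3=91
parity=$(( d1 ^ d2 ^ d3 ))

# If one block (say d2) is lost, XOR of the survivors recovers it.
recovered=$(( d1 ^ d3 ^ parity ))
echo "$recovered"   # prints 142, the same value as d2
```

Lose a second block before rebuilding and the XOR no longer has enough information, which is exactly the point at which corruption becomes visible.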

The moral is, you should not have ignored the early warnings you had (the
Fsys crashes). Fsys is the filesystem manager, and of course you can’t
access any data if it has crashed. You should have contacted your hardware
vendor for support right then, and as soon as the reason was isolated the
OS should have been reinstalled and the data recovered from backup. It is
good you have some backups; at least you have a recovery path. Some people
neglect doing backups of RAID systems because they consider them
infallible :wink:

– igor

“C . hughes” <chughes@playlegal.com> wrote in message
news:amfae6$gk4$1@inn.qnx.com

[snip]

The Fsys error message showed up on the slave system, not the server with
the PSI RAID card. My immediate response was to change the Adaptec 2940
card on the slave system, but this did not solve the problem. One
particular time when I got this error message, I ran the disk file usage
(on the slave system) and it did not list any of the drives that should
have been mounted on the slave system. In fact, even the RAM for the slave
system was not listed. The hard disks on the main server are fine; the
RAID card on bootup reported they were fine. The only issue was that data
was missing. I ran chkfsys on the server (the system with the RAID) and it
found no problems, just that the important data files no longer existed. I
found an undel utility (on the QNX web site) and was able to undelete some
system files, but not the important data files. The utility did see the
directories but could not restore them.

I just need to figure out what went wrong, so that this nightmare cannot
happen again.
“Igor Kovalenko” <kovalenko@attbi.com> wrote in message
news:amicf9$l0k$1@inn.qnx.com

[snip]

“C . hughes” wrote:

[snip]

If your system has been stable, I would suspect either the memory or the
hard disk controller. The only time I have ever seen QNX 4’s Fsys die was
when I had some bad memory (and a beta version, which is excusable). The
Fsys manager is very reliable - I would suggest that you have a hardware
problem (or two).

Rick…

Rick Duff Internet: rick@astranetwork.com
Astra Network QUICS: rgduff
QNX Consulting and Custom Programming URL:
http://www.astranetwork.com
+1 (204) 987-7475 Fax: +1 (204) 987-7479

But could this have caused data from my main server to be deleted during a
backup? I have included the backup script for reference:


if test -d /DATABKUP/home.old
then
    mv /DATABKUP/home.old /DATABKUP/home.rm
fi

if test -d /DATABKUP/home
then
    mv /DATABKUP/home /DATABKUP/home.old
fi

if test -d /DATABKUP/home.rm
then
    rm -Rf /DATABKUP/home.rm &
fi

if test -d /home
then
    cp -Rvp /home /DATABKUP/home
fi

echo Done.
freeze -cdz /etc/logo.F


“Rick Duff” <rick@astranetwork.com> wrote in message
news:3D8F2116.66B79004@astranetwork.com

[snip]

Cornel Hughes wrote:

[snip]

It is still not very clear to me which machine is which. I thought you
said you had a main server which used a custom app to write to the
backup server. If that is the case, are we looking at the wrong place
with this script, as it runs on the backup machine?

I think you are still not supplying enough info. Are you using any QNX
networking features? i.e., are you running the script on one machine and
accessing files on another? Can you explain the setup again, using node
numbers rather than names? And include which machine each script is run
on (and how they are run - cron?).

Thanks,
Rick…




Rick Duff Internet: rick@astranetwork.com
Astra Network QUICS: rgduff
QNX Consulting and Custom Programming URL: http://www.astranetwork.com
+1 (204) 987-7475 Fax: +1 (204) 987-7479

Node 1 is the server.
Node 2 is the backup system / slave.
The backup script is executed on the slave system. It copies data from the
slave system to the backup system’s hard drive.


My setup: I have a main server (Node 1) running QNX 4.24 with a
Perceptive Solutions, Inc. PCI-2000A RAID controller configured for
RAID 5, on a P4B motherboard with a 2.0 GHz CPU.

The backup system (Node 2) has an Adaptec 2940 and a P4 2.0 GHz CPU; we
were using the aha7scsi SCSI driver. That system has two hard drives.
One, a 4 GB drive mounted as /DATABKUP, is the one we back up onto. The
other hard drive, mounted as /hard, gets a copy of the data whenever
changes are made on the main server’s (node 1) hard drive, through a
custom-written program.

For a few days “/bin/Fsys terminated (SIGSEGV)” had been displayed on
the backup system (node 2). When we then tried to access its drives,
they were not accessible (running df, nothing for node 2 is listed).
What’s also strange is that when we run a balance report, which obtains
all its data from node 1, we see records missing. We then reboot node 2
and all is good (reports are fine); most of the time we don’t see that
message again for a day or so.

However, yesterday, while the manual backup that happens each morning
was running on node 2 to its hard drive, machine language started
showing up where file names would normally be. A message was displayed
on the backup system saying “//1/rm terminated (SIGSEGV)”. At first I
thought the backup system’s adaptor driver had failed, so I rebooted
it. I then realised that the main server was also stuck, and I rebooted
it too. The main server came up, and after the OS showed the RAID
controller (after sinit) it displayed the message “no such file or
directory”. We then tried to boot off the backup server, which
displayed the same error message (important OS boot files were gone).

We were able to find another hard drive and boot off it; what we
observed was that most of the directories on the main server’s hard
disk (RAID) no longer exist. The disk on the slave server that accepts
the automatic copy looked like the server’s hard drive. The hard drive
that receives the manual backup had files on it, but many of them were
corrupt (an ls showed machine language where file names should be). I
figure the driver for the SCSI card on the slave died, but the rest I
can’t explain. Can anyone offer me some insight? After each backup we
also run chkfsys, and we have not seen any errors in months.

“Rick Duff” <rick@astranetwork.com> wrote in message
news:3D8F6F04.1030408@astranetwork.com

[snip]

C . hughes wrote:

[snip]

For a few days “/bin/Fsys terminated (SIGSEGV)” has been displayed on
the backup system (node 2). When we try to access its drives we then see
that they are not accessible (using df, nothing for node 2 is listed).
What’s also strange is that when we run a balance report which obtains
all its data from node 1, we see records missing. We then reboot node 2
and all is good (reports are fine); most of the time we don’t see that
message again for a day or so.

It makes sense that you can’t see the files on Node 2 when Fsys is not
running. The fact that Fsys died points (to me) to bad hardware on Node 2.
I would suspect bad RAM first (having seen this problem most often) and a
bad controller second. I don’t think a bad hard drive would cause Fsys to
die.


However, yesterday when a manual backup was being run at node 2 (happens
each morning) to its hard drive, machine language started showing up
where normally file names would. A message was displayed on the backup
system saying “//1/rm terminated (SIGSEGV)”. At first I thought the
backup system’s adaptor driver had failed, so I rebooted it. I then
realised that the main server was also stuck and I then rebooted it. The
main server came up and after the OS showed the RAID controller (after
sinit) it displayed a message “no such file or directory”. We then tried
to boot off of the backup server which displayed the same error message
(important OS boot files were gone).

Again, the corruption points to bad RAM - or perhaps the motherboard itself.


We were able to find another hard drive and boot up off of it; however,
what we observed was that most of the directories on the main server’s
hard disk (RAID) no longer exist. The disk on the slave server that
accepts the automatic copy looked like the server’s hard drive. The
hard drive that receives the manual backup had files on it but many of
them were corrupt (an ls showed machine language where files should be).
I figure that the driver for the SCSI card on the slave died but the
rest I can’t explain. Can anyone offer me some insight? After the
backup we also run chkfsys and have not seen any errors in months.

If you are having hardware problems, a corrupt file system is not
surprising. I suspect you may also have problems with your primary
system (node 1), but without understanding how your custom program
works, it is hard to help out much.

Rick…



Rick Duff Internet: rick@astranetwork.com
Astra Network QUICS: rgduff
QNX Consulting and Custom Programming URL: http://www.astranetwork.com
+1 (204) 987-7475 Fax: +1 (204) 987-7479

You are probably right about bad RAM or the motherboard on node 2. As far
as a bad SCSI controller goes, I had changed the controller but the
problem continued.

If you are having hardware problems, a corrupt file system is not
surprising. I suspect you may also have problems with your primary
system (node 1), but without understanding how your custom program
works, it is hard to help out much.

The program automatically copies any files that have changed to one of
the backup server’s (node 2) hard drives. But it only copies files from
our database directories (all files and directories in /home). That’s why
I can’t understand why other directories and files were deleted, e.g.
/etc/config.
“Rick Duff” <rick@astranetwork.com> wrote in message
news:3D907A45.2070903@astranetwork.com

[snip]

“C . hughes” wrote:

Node 1 is the server
Node 2 is the backup system / slave
The backup script is executed on the slave system. It copies data from the
[snip]
slave system to the backup systems hard drive.
A message was
displayed on the backup system saying “//1/rm terminated (SIGSEGV)”.

What //1/rm means is that rm was running on the server. I don’t see how
your script, running on node 2, can cause this unless you start the
script like “on -f1 script”. If you didn’t do that, then
someone/something else is doing rm on node 1.
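The node-prefix convention Richard refers to (a path like //1/bin/rm names a file, or a process’s executable, on node 1) can be illustrated with a small QNX-independent sketch; the function name is my own invention:

```shell
#!/bin/sh
# Extract the node number from a QNX 4 network pathname of the form
# //NODE/path. Prints the node number, or "local" if there is no
# prefix. (Illustration only; real QNX resolves these in the kernel.)

node_of() {
    case "$1" in
        //[0-9]*)
            rest=${1#//}          # drop the leading //
            echo "${rest%%/*}"    # keep only the node number
            ;;
        *)
            echo local
            ;;
    esac
}
```

`node_of //1/bin/rm` prints 1, which is why “//1/rm terminated” means the rm process was running on node 1 (the server) even though the script was started on node 2.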

Richard

Part of the backup routine is to shut down all other systems except for
nodes 1 and 2. The backup routine is then manually run on node 2. My
guess is that node 2, for some strange reason, became node 1 or thought
it was. To me that explains why, when I try to run a report after I get
“/bin/Fsys terminated (SIGSEGV)” on node 2, it only shows some of the
records: instead of trying to pull the report from node 1 (where the
database resides), it’s trying to pull the report from itself because it
thinks it’s node 1.
“Richard Kramer” <rrkramer@kramer-smilko.com> wrote in message
news:3D90E794.D484A892@kramer-smilko.com

[snip]

Another possibility (with which I have no experience) is to set up a QNX
network so there appears to be only one root dir. Perhaps yours is set
up this way. In that case, you should look for suggestions from someone
familiar with the problems that can arise.

Richard


“C . hughes” wrote:

[snip]