Diagnosing Server Problems?

I am getting very frustrated, and the developers are getting very angry
with me because our QNX server crashes in one of several different ways,
several times a day.

Most of the time, the server is completely locked up. I have to switch
off the main power in order to reboot it. After reboot, there are no log
messages, no diagnostics, no core dumps, nothing to indicate that
there’s a problem with anything. I have spent countless hours watching
“spin” output, but there is no indication of what’s causing the problem
there, either. spin simply stops updating shortly before the server crashes.

QNX seems to completely lack any meaningful/useful logging facility. It
has syslog, and I was logging EVERYTHING for a time, but what little I
got was completely irrelevant. slogger/sloginfo output is equally
useless; according to QNX support, the messages I’m seeing there are
completely normal.

This is a simple system, a motherboard, a handful of disks, and two
network cards. It’s running Momentics 6.2.0 PE, and runs samba as its
main server application. Clients are other Momentics 6.2.0 PE
workstations, and Windows samba clients. We use QNET for file sharing,
because it’s the only thing QNX has that works at all, and keeping 27GB
of data mirrored among 30+ workstations is an impossible task. For the
most part, users just get their home directories over QNET. Development
is done on the local hard drives of the QNX workstations. There’s no way
to measure QNET traffic, so I can’t tell how hard we’re hitting the server.

I believe it’s load related. The server never crashes overnight or on
weekends, when few, if any, people are here.

How do you diagnose a server with no diagnostics? The hardware is fine;
otherwise we’d see problems at all hours of the day.

Mathew,

A couple of ideas:

Did you try switching to another type of network card, so that QNX uses a
different network driver? You would need to make sure that the new NICs
use a different chipset, and thereby a different driver.

I don’t know what kind of data you need to share between developers,
but did you consider using CVS?

Would it be possible for you to run without some of your services, such as
Samba, and see what happens?


Jens


“Mathew Kirsch” <mkirsch@ocdus.jnj.com> wrote in message
news:b5am23$po9$1@inn.qnx.com


I assume you are running the QNX machine in text mode on Console 1, so that
if there are any kernel diagnostics you can see the output?

You say you think it is load related and that you are watching the system
load in Spin. What is the load normally?

How much memory do you have in the system?

What hardware are you using (network cards)?

chris


Chris McKillop <cdm@qnx.com>
Software Engineer, QSSL
“The faster I go, the behinder I get.” – Lewis Carroll
http://qnx.wox.org/

Chris McKillop wrote:

I assume you are running the QNX machine in text mode on Console 1, so that
if there are any kernel diagnostics you can see the output?

I just recently started doing this, but so far in a dozen crashes,
nothing has appeared on the screen.

You say you think it is load related and that you are watching the system
load in Spin. What is the load normally?

I can’t tell. That’s the problem. When the system is idle, spin updates
fine, but as soon as any load is placed on the system, spin stops
updating until the load has passed. So, I don’t know what put the load
on the system, or how much load was placed on the system.

How much memory do you have in the system?

2GB. pidin info shows 1600+MB free at any given moment.

What hardware are you using (network cards)?

One 3COM 906C and one SMC9432

Jens H Jorgensen wrote:

Did you try switching to another type of network card, so that QNX uses a
different network driver? You would need to make sure that the new NICs
use a different chipset, and thereby a different driver.

Why do you suspect a network problem?

I don’t know what kind of data you need to share between developers,
but did you consider using CVS?

Yes, they already use CVS, but they need the ability to log into
multiple systems and maintain the same environment. That means their
home directories need to be accessible from a central location.

Would it be possible for you to run without some of your services, such as
Samba, and see what happens?

No. If I disable samba, the users can’t work. If the users can’t work,
then no load will be placed on the system. If no load is placed on the
system, it won’t crash. The test will prove nothing except my ability to
update my resume, because I will be hustled out the door quicker than
you can say “Momentics.”

“Mathew Kirsch” <mkirsch@ocdus.jnj.com> wrote in message
news:b5chgp$sru$1@inn.qnx.com

Jens H Jorgensen wrote:
Did you try switching to another type of network card, so that QNX uses a
different network driver? You would need to make sure that the new NICs
use a different chipset, and thereby a different driver.

Why do you suspect a network problem?

Well, a couple of things:

  • you mentioned it is load related - it could be network load that causes
    the driver to lock up the system
  • I have experienced problems in earlier versions of QNX with certain
    network drivers

What are the NICs you have in the system?

I normally have good experiences with Intel NIC chipsets and bad experiences
with 3Com, and also sometimes Netgear/Linksys (can’t remember which one).

Please note that I am in no way sure that it is NIC related, but if I were in
your situation I would try to swap out the NICs.

I don’t know what kind of data you need to share between developers,
but did you consider using CVS?

Yes, they already use CVS, but they need the ability to log into
multiple systems and maintain the same environment. That means their
home directories need to be accessible from a central location.

Could you use NFS and/or CIFS instead, and then perhaps also change the
server over to Win2K?

We are running Win2K as our server with MS Services for Unix (SFU). The
Win2K server runs as:

  • NFS server
  • CVS server
  • Windows drive share server (can be mounted via CIFS)

and then the QNX machines mount NFS drives and do source code version control
through CVS.


Jens

FWIW, I have had better luck with NFS than with QNET from both the
reliability and performance point of view. Presumably everywhere
you currently use QNET for file sharing, you can substitute NFS.

dB



Mathew,

Mathew Kirsch wrote:

You say you think it is load related and that you are watching the system
load in Spin. What is the load normally?

I can’t tell. That’s the problem. When the system is idle, spin updates
fine, but as soon as any load is placed on the system, spin stops
updating until the load has passed. So, I don’t know what put the load
on the system, or how much load was placed on the system.

You could run spin at a higher priority so it gets a look in.

Try
nice -n -2 spin

to raise it above the rest (increase the 2 if that doesn’t work).
Hope that helps
William

David Bacon wrote:

FWIW, I have had better luck with NFS than with QNET from both the
reliability and performance point of view. Presumably everywhere
you currently use QNET for file sharing, you can substitute NFS.

My experience with NFS has been quite the opposite. In 6.1, NFS didn’t
work at all. You’d open file X with vi, and get file Y. It couldn’t keep
the links straight. In 6.2 PE, the NFS daemon tends to die every time
you try to move a large multi-megabyte file across. We have two people
on QNX workstations on the other end of the building that have to access
it via NFS, and it’s a royal pain.

I wouldn’t substitute NFS for QNET if you paid me. We have wrapper
scripts around the CVS commands to make it simple for users.
Additionally, these scripts maintain a build account with the latest
up-to-the-minute versions of every checked-in file. With QNET, there’s
no problem checking a new version of a file in if you don’t own it. With
NFS, even if the file is 777, it won’t let you overwrite a file unless
you own it. I’ve tried setting the “root=” option in /etc/exports, but
nfsd seems to ignore it.

William Morris wrote:

You could run spin at a higher priority so it gets a look in.

Try
nice -n -2 spin

to raise it above the rest (increase the 2 if that doesn’t work).

Thanks, but I’ve already tried that. I’ve tried running spin with the
highest priority, but every time the system sees any significant load,
spin stops updating until the load is gone.

My experience with NFS has been quite the opposite. In 6.1, NFS didn’t
work at all. You’d open file X with vi, and get file Y. It couldn’t keep
the links straight. In 6.2 PE, the NFS daemon tends to die every time
you try to move a large multi-megabyte file across. We have two people
on QNX workstations on the other end of the building that have to access
it via NFS, and it’s a royal pain.

NFS under QNX seems to work fine when you are the client. As I mentioned
earlier, we are running with Win2k/SFU as the NFS server and QNX 6.1 as NFS
clients, and it works fine.

Could you move your server to Win2k or Linux?


Jens

Mathew Kirsch <mkirsch@ocdus.jnj.com> wrote:

Most of the time, the server is completely locked up. I have to switch
off the main power in order to reboot it. After reboot, there are no log

Is this Update Ticket 49158? If so, I have traced this from your core
file to a libc/resmgr bug with file locking. Someone (a client,
perhaps the NFS/CIFS server) is requesting a file lock for a region
>2GB in length, which triggers the libc bug when that lock comes to be
released (perhaps during close()), and that thread in the filesystem will
run ready with that file locked (any other access to that file will then
block, and so on). This will need a new libc to fix (I have
already made the fix, but I am not sure when this can be released).

p.s. If this isn’t you, then ignore this irrelevant rambling … :-)

Mathew Kirsch <mkirsch@ocdus.jnj.com> wrote:

Chris McKillop wrote:
I assume you are running the QNX machine in text mode on Console 1, so that
if there are any kernel diagnostics you can see the output?

I just recently started doing this, but so far in a dozen crashes,
nothing has appeared on the screen.

When it “crashes”, does your PC fail the numlock test? (i.e., does NumLock
stop turning on and off?)

I can’t tell. That’s the problem. When the system is idle, spin updates
fine, but as soon as any load is placed on the system, spin stops
updating until the load has passed. So, I don’t know what put the load
on the system, or how much load was placed on the system.

Even if you start it as “on -p63 spin”?


One 3COM 906C and one SMC9432

So devn-el900 and devn-tulip are the network drivers you are using?

Running “pidin -p io-net mem” will show which devn’s are in use.

chris


Chris McKillop <cdm@qnx.com>
Software Engineer, QSSL
“The faster I go, the behinder I get.” – Lewis Carroll
http://qnx.wox.org/

John Garvey wrote:

p.s. If this isn’t you, then ignore this irrelevant rambling … :-)

Yes, that’s my ticket, but that has to do with the SCSI driver spinning
out of control, not complete lockups of the system. You think that may
be the cause of ALL the problems on this particular system?

Now, why would it be trying to lock a region >2GB in length? Should I
look for 2GB files, directories with more than 2GB in them?

Mathew Kirsch <mkirsch@ocdus.jnj.com> wrote:

John Garvey wrote:
p.s. If this isn’t you, then ignore this irrelevant rambling … :-)
Yes, that’s my ticket, but that has to do with the SCSI driver spinning
out of control, not complete lockups of the system. You think that may
be the cause of ALL the problems on this particular system?

Well, the situation in the core file could be described as “the libc
spinning out of control”, the SCSI/filesystem is just the unfortunate
host … while trying to endlessly manipulate the negative lock range
it has the file involved locked. Suppose another process tries to
open/access that file, it must hold the parent directory locked while
it does so (to ensure the file is not being deleted) and it blocks. Now
any process trying to access any file in that directory will also block
with the higher directory locked, and so on and so on, eventually
rippling up to the filesystem root and rendering all files on your disk
inaccessible (indeed your core file shows 30 filesystem threads all
blocked in that manner). Not being able to access any disk file could
certainly look like a “complete lockup of the system” (not knowing what
ALL your problems are, this situation certainly can’t help). Anyway, I
consider this a serious bug in libc, and it will be fixed (I am also
making sure it gets into a 6.2.1 patch as well) … thanks for the
dump/core file …

Now, why would it be trying to lock a region >2GB in length? Should I
look for 2GB files, directories with more than 2GB in them?

QNX6 currently only has 32-bit filesystem formats, so no disk-hosted
file/directory will be >2GB, so nothing to look for there. But a
client can place an advisory lock on any file range. As to why one
is trying to lock this range: either it is a misunderstanding (trying
to lock the entire file using 0 … ULONG_MAX, when POSIX defines a
length of 0 as meaning the entire file), or it is some 32- vs 64-bit
sign-extension/truncation bug (as was the internal libc problem), or
it is a build issue (_FILE_OFFSET_BITS definition, for example).
You could check any of your code for fcntl(F_SETLK) requests, or
perhaps it is coming from something like Samba (although that does have
code in its locking subroutines that tries to address the 32/64 issue).

John Garvey wrote:


BTW, POSIX mmap() is also broken … at least with 6.2.1 NC!

Every resource manager using mmap() will crash because mmap() is
returning an invalid address!

Armin


Armin Steinhoff <a-steinhoff@web.de> wrote:

BTW, POSIX mmap() is also broken … at least with 6.2.1 NC!

Please don’t attach unrelated issues to others’ postings. If you
have a legitimate problem, then please start a new thread.

Every resource manager using mmap() will crash because mmap() is
returning an invalid address!

I don’t fully understand what you are saying, but this seems unlikely, as
all executables are mmap’d out of the disk/network filesystem by procnto,
so you’d not be able to run anything at all if mmap() was broken …

John Garvey wrote:

I don’t fully understand what you are saying; but this seems unlikely, as
all executables are mmap’d out of the disk/network filesystem by procnto,
so you’d not be able to run anything at all if mmap() was broken …

Correct, if the mapping has been done by mmap() …

Armin


Correct, if the mapping has been done by mmap() …

And what exactly is broken, Armin? The only thing that broke was the
mapping of /dev/mem to bring in a device memory space. Do not make general
statements when the bug is very specific.

chris


Chris McKillop <cdm@qnx.com>
Software Engineer, QSSL
“The faster I go, the behinder I get.” – Lewis Carroll
http://qnx.wox.org/

And what exactly is broken, Armin? The only thing that broke was the
mapping of /dev/mem to bring in a device memory space.

The only ‘thing’?? Half of the functionality of mmap() is broken!

Care to give another example of what is busted, Armin? As John said, all
binaries and shared libs are loaded from various resource managers using
mmap() (io-blk, fs-nfs[23], fs-cifs, …).

chris


Chris McKillop <cdm@qnx.com>
Software Engineer, QSSL
“The faster I go, the behinder I get.” – Lewis Carroll
http://qnx.wox.org/