detecting resource manager exit

Doug_Owens · October 7, 2004, 3:07pm

Is there a way for a client to know when a resource manager has exited? I
have read about _NTO_CHF_COID_DISCONNECT, but it is not clear how a client
that does an open() on a resource manager can take advantage of it.

Thanks,
Doug Owens

John_Murphy1 · October 7, 2004, 3:20pm

I use select_attach() with the SELECT_FLAG_SRVEXCEPT flag.

Murf
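
For readers who want to try this, here is a minimal sketch of the select_attach() approach. It is not Murf's code. The names watched_server, mark_server_gone, on_server_exception and watch_resmgr are made up for illustration, the path passed in is whatever device your resource manager registers, and the QNX-specific part is only compiled when __QNXNTO__ is defined. It assumes the standard dispatch-layer signatures from <sys/dispatch.h>.

```c
#include <stddef.h>

/* Hypothetical per-connection state handed to the callback as `handle`. */
struct watched_server {
    int fd;      /* fd returned by open() on the resource manager */
    int alive;   /* cleared once the server is reported gone */
};

/* Record the loss once. Returns the fd to close, or -1 if already handled. */
int mark_server_gone(struct watched_server *w)
{
    if (!w->alive)
        return -1;
    w->alive = 0;
    return w->fd;
}

#ifdef __QNXNTO__
#include <fcntl.h>
#include <unistd.h>
#include <sys/iofunc.h>
#include <sys/dispatch.h>

/* Called by the dispatch layer when the server behind fd goes away. */
static int on_server_exception(select_context_t *ctp, int fd,
                               unsigned flags, void *handle)
{
    (void)ctp;
    (void)fd;
    if (flags & SELECT_FLAG_SRVEXCEPT) {
        int dead = mark_server_gone(handle);
        if (dead != -1)
            close(dead);          /* release the dead connection */
    }
    return 0;
}

/* Open path, register the callback, and block until the server exits. */
int watch_resmgr(const char *path)
{
    struct watched_server w;
    dispatch_context_t *ctp;
    dispatch_t *dpp = dispatch_create();

    if (dpp == NULL)
        return -1;
    w.fd = open(path, O_RDONLY);
    if (w.fd == -1)
        return -1;
    w.alive = 1;

    if (select_attach(dpp, NULL, w.fd, SELECT_FLAG_SRVEXCEPT,
                      on_server_exception, &w) == -1)
        return -1;

    ctp = dispatch_context_alloc(dpp);
    while (ctp != NULL && w.alive) {
        ctp = dispatch_block(ctp);
        if (ctp != NULL)
            dispatch_handler(ctp);
    }
    return 0;
}
#endif
```

A real client would normally run the dispatch loop in its own thread and do its work elsewhere; the sketch blocks only to keep the example short.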

David_Gibbs1 · October 7, 2004, 4:38pm

Doug Owens <owens2@llnl.gov> wrote:

> Is there a way for a client to know when a resource manager has exited? I
> have read about _NTO_CHF_COID_DISCONNECT, but it is not clear how a client
> that does an open() on a resource manager can take advantage of it.

The client must have a channel for blocking/notification pulses.

The client must set that flag when it creates the channel.

If that is set, and a server goes away, a pulse will be delivered
for each fd that is lost. I think the pulse value will be the coid
(fd) that has gone away.

Save your fds, look up which one went away, and clean up appropriately.

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com
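
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not code from the thread. fd_table, forget_fd, create_death_channel and watch_servers are hypothetical names. The QNX part is compiled only when __QNXNTO__ is defined, and it relies on the documented behaviour that the pulse code is _PULSE_CODE_COIDDEATH and the pulse value carries the coid.

```c
#include <stddef.h>

#define MAX_TRACKED 16

/* Hypothetical bookkeeping: fds this client has open on resource managers. */
struct fd_table {
    int    fds[MAX_TRACKED];
    size_t count;
};

/* Remove coid from the table. Returns 1 if it was tracked, 0 otherwise. */
int forget_fd(struct fd_table *t, int coid)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->fds[i] == coid) {
            t->fds[i] = t->fds[--t->count];   /* fill the hole with the last entry */
            return 1;
        }
    }
    return 0;
}

#ifdef __QNXNTO__
#include <stdio.h>
#include <unistd.h>
#include <sys/neutrino.h>

/* Create the notification channel with the flag set at creation time. */
int create_death_channel(void)
{
    return ChannelCreate(_NTO_CHF_COID_DISCONNECT);
}

/* Block for pulses and clean up any tracked fd whose server went away. */
void watch_servers(int chid, struct fd_table *t)
{
    struct _pulse pulse;

    while (MsgReceivePulse(chid, &pulse, sizeof(pulse), NULL) == 0) {
        int coid;

        if (pulse.code != _PULSE_CODE_COIDDEATH)
            continue;                         /* some other pulse, not ours */
        coid = pulse.value.sival_int;         /* the connection (fd) that died */
        if (forget_fd(t, coid)) {
            fprintf(stderr, "server behind fd %d is gone\n", coid);
            close(coid);                      /* release the dead connection */
        }
    }
}
#endif
```

The client adds each fd to the table right after open() succeeds and removes it again on a normal close(), so a pulse for an fd it no longer tracks is simply ignored.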

Wojtek_Lerch1 · October 7, 2004, 6:20pm

David Gibbs wrote:

> The client must have a channel for blocking/notification pulses.
>
> The client must set that flag when it creates the channel.
>
> If that is set, and a server goes away, a pulse will be delivered
> for each fd that is lost. I think the pulse value will be the coid
> (fd) that has gone away.
>
> Save your fds, look up which went away, cleanup appropriately.

There’s a catch though. If the resmgr exits just before you close your
fd (or, in general, the channel you’re connected to is closed when
you’re about to close your connection), you will still receive the
pulse. This could happen, for instance, if a bug in the resource
manager causes it to crash while handling your close message. Or when
one application sends a message to another application asking it to exit.

If there’s a chance that you could open a new, perhaps completely
unrelated fd (or connection) while the pulse is still in your queue, you
may end up interpreting the pulse as referring to the new fd. In most
setups, this is practically impossible, but there are some scenarios
(especially involving messages between applications rather than resource
managers) where it can actually happen.
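
One possible way to guard against this, sketched here as an assumption rather than something proposed in the thread, is to re-check the connection's state when the pulse is handled instead of trusting the coid number alone. should_clean_up and probe_coid_dead are hypothetical names. The probe assumes that ConnectServerInfo() with a pid of 0 queries the calling process and that _NTO_COF_DEAD in the returned flags marks a connection whose server is gone; check both against the documentation for your release.

```c
#include <stddef.h>

/*
 * Decide whether a COIDDEATH pulse should trigger cleanup.
 * Act only if the fd is still tracked AND the connection is
 * confirmed dead at the moment the pulse is handled.
 */
int should_clean_up(const int *tracked, size_t n, int coid, int coid_is_dead)
{
    for (size_t i = 0; i < n; i++) {
        if (tracked[i] == coid)
            return coid_is_dead != 0;
    }
    return 0;   /* we already closed this fd: stale pulse, ignore it */
}

#ifdef __QNXNTO__
#include <sys/types.h>
#include <sys/neutrino.h>

/* Assumed probe: ask the kernel about the connection's current state. */
int probe_coid_dead(int coid)
{
    struct _server_info info;

    /* ConnectServerInfo() returns the next higher coid if this one is
     * not in use, so compare the result with the coid we asked about. */
    if (ConnectServerInfo(0, coid, &info) != coid)
        return 1;                               /* no such connection */
    return (info.flags & _NTO_COF_DEAD) != 0;   /* server side is gone */
}
#endif
```

With this check, a pulse left over from an old connection does not tear down a freshly opened fd that happens to reuse the same number, because the new connection is reported as alive.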

John_A_Murphy1 · December 2, 2004, 7:58pm

David Gibbs wrote:

> The client must have a channel for blocking/notification pulses.
>
> The client must set that flag when it creates the channel.
>
> If that is set, and a server goes away, a pulse will be delivered
> for each fd that is lost. I think the pulse value will be the coid
> (fd) that has gone away.
>
> Save your fds, look up which went away, cleanup appropriately.

What are the tradeoffs between simply calling pulse_attach() for
_PULSE_CODE_DISCONNECT, and calling select_attach() with the
SELECT_FLAG_EXCEPT flag set? The select_attach() method seems to be
better documented, but buggy. The pulse_attach method is simpler, but
if it works, why does the select_attach()/SELECT_FLAG_EXCEPT method even
exist?

From the documentation, we concluded that select_attach() was the
“proper” way to accomplish this, but we had so many problems with it
that we finally made our own version of dispatch_select.c. We’ve
recently tried the pulse_attach() method, and it seems to work fine, but
some of our people are worried that it isn’t sufficiently documented,
and therefore not safe to use.

Your comments?

Murf

David_Gibbs1 · December 3, 2004, 4:46pm

John A. Murphy <murf@perftech.com> wrote:

> What are the tradeoffs between simply calling pulse_attach() for
> _PULSE_CODE_DISCONNECT,

Do you mean _PULSE_CODE_COIDDEATH?

The framework should, already, be handling _PULSE_CODE_DISCONNECT
to track clients of the resource manager going away.

_PULSE_CODE_COIDDEATH is the pulse for a server you are connected
to going away.

> and calling select_attach() with the
> SELECT_FLAG_EXCEPT flag set? The select_attach() method seems to be
> better documented, but buggy. The pulse_attach method is simpler, but
> if it works, why does the select_attach()/SELECT_FLAG_EXCEPT method even
> exist?

SELECT_FLAG_EXCEPT is for “exceptional conditions” – that may or
may not include the server going away, and that may include other
situations than the server going away. SELECT_FLAG_EXCEPT exists
to provide the equivalent behaviour to select(… except_fds…).
(For instance, someone transmitting OOB data on a socket would cause
the handler for exceptional data to be invoked, but the server,
io-net, would still be alive.)

> From the documentation, we concluded that select_attach() was the
> “proper” way to accomplish this, but we had so many problems with it
> that we finally made our own version of dispatch_select.c. We’ve
> recently tried the pulse_attach() method, and it seems to work fine, but
> some of our people are worried that it isn’t sufficiently documented,
> and therefore not safe to use.

The notification pulses are not going away.

I am concerned about you talking about _PULSE_CODE_DISCONNECT vs
select_attach(), as they are opposite sides of the client-server
transaction, and the resmgr framework should already be tracking
clients through the close/close_ocb callbacks.

(Since the library specifically grabs DISCONNECT [and UNBLOCK]
before checking your pulse_attach(), I assume you’re not actually
using that one.)

So, I think you’re ok with the pulse_attach() method.

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com

John_A_Murphy1 · December 3, 2004, 5:13pm

David Gibbs wrote:

> John A. Murphy <murf@perftech.com> wrote:
>
> > What are the tradeoffs between simply calling pulse_attach() for
> > _PULSE_CODE_DISCONNECT,
>
> Do you mean _PULSE_CODE_COIDDEATH?

Yes, of course I meant to say _PULSE_CODE_COIDDEATH.

> The framework should, already, be handling _PULSE_CODE_DISCONNECT
> to track clients of the resource manager going away.
>
> _PULSE_CODE_COIDDEATH is the pulse for a server you are connected
> to going away.
>
> > and calling select_attach() with the
> > SELECT_FLAG_EXCEPT flag set? The select_attach() method seems to be
> > better documented, but buggy. The pulse_attach method is simpler, but
> > if it works, why does the select_attach()/SELECT_FLAG_EXCEPT method even
> > exist?
>
> SELECT_FLAG_EXCEPT is for “exceptional conditions” – that may or
> may not include the server going away, and that may include other
> situations than the server going away. SELECT_FLAG_EXCEPT exists
> to provide the equivalent behaviour to select(… except_fds…).
> (For instance, someone transmitting OOB data on a socket would cause
> the handler for exceptional data to be invoked, but the server,
> io-net, would still be alive.)

Arg, I did it again!!! I meant to say SELECT_FLAG_SRVEXCEPT, which
detects the death of a server.

> > From the documentation, we concluded that select_attach() was the
> > “proper” way to accomplish this, but we had so many problems with it
> > that we finally made our own version of dispatch_select.c. We’ve
> > recently tried the pulse_attach() method, and it seems to work fine, but
> > some of our people are worried that it isn’t sufficiently documented,
> > and therefore not safe to use.
>
> The notification pulses are not going away.
>
> I am concerned about you talking about _PULSE_CODE_DISCONNECT vs
> select_attach(), as they are opposite sides of the client-server
> transaction, and the resmgr framework should already be tracking
> clients through the close/close_ocb callbacks.
>
> (Since the library specifically grabs DISCONNECT [and UNBLOCK]
> before checking your pulse_attach(), I assume you’re not actually
> using that one.)

As mentioned above, I messed up just about all of the important words
in my message! The real question concerned the tradeoffs between
pulse_attach()/_PULSE_CODE_COIDDEATH and
select_attach()/SELECT_FLAG_SRVEXCEPT. Maybe I should get some sleep
before posting…

> So, I think you’re ok with the pulse_attach() method.

David_Gibbs1 · December 3, 2004, 6:28pm

John A. Murphy <murf@perftech.com> wrote:

> David Gibbs wrote:
>
> > SELECT_FLAG_EXCEPT is for “exceptional conditions” – that may or
> > may not include the server going away, and that may include other
> > situations than the server going away. SELECT_FLAG_EXCEPT exists
> > to provide the equivalent behaviour to select(… except_fds…).
> > (For instance, someone transmitting OOB data on a socket would cause
> > the handler for exceptional data to be invoked, but the server,
> > io-net, would still be alive.)
>
> Arg, I did it again!!! I meant to say SELECT_FLAG_SRVEXCEPT, which
> detects the death of a server.

select_attach(… SELECT_FLAG_SRVEXCEPT…) does a:
pulse_attach(… _PULSE_CODE_COIDDEATH…)

> As mentioned above, I messed up just about all of the important words
> in my message! The real question concerned the tradeoffs between
> pulse_attach()/_PULSE_CODE_COIDDEATH and
> select_attach()/SELECT_FLAG_SRVEXCEPT. Maybe I should get some sleep
> before posting…

They both go through the pulse_attach() path. select_attach() is going
to be heavier: it is overloading select for something else, and it
does more checking, including automatic removal/rearm. But it does
hide the knowledge of the underlying OS pulse.

Hm…select_attach() would appear to allow you multiple handlers for this,
while pulse_attach() would only allow one. (Not likely a concern.)

-David

David Gibbs
QNX Training Services
dagibbs@qnx.com
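
If the single-handler limit of pulse_attach() does matter, one handler can fan out to several per-fd callbacks. The sketch below is only an illustration. death_registry, registry_add, registry_fire, count_loss, on_coid_death and create_client_dispatch are hypothetical names. The QNX part is compiled only when __QNXNTO__ is defined, and it assumes a release that provides dispatch_create_channel(), so that the dispatch layer can run on a channel created with _NTO_CHF_COID_DISCONNECT.

```c
#include <stddef.h>

#define MAX_WATCHERS 8

typedef void (*death_cb)(int coid, void *arg);

struct watcher {
    int      coid;
    death_cb cb;
    void    *arg;
};

struct death_registry {
    struct watcher slots[MAX_WATCHERS];
    size_t         count;
};

/* Register a callback for one fd. Returns 0 on success, -1 if full. */
int registry_add(struct death_registry *r, int coid, death_cb cb, void *arg)
{
    if (r->count == MAX_WATCHERS)
        return -1;
    r->slots[r->count].coid = coid;
    r->slots[r->count].cb = cb;
    r->slots[r->count].arg = arg;
    r->count++;
    return 0;
}

/* Run and drop every callback registered for coid. Returns how many ran. */
int registry_fire(struct death_registry *r, int coid)
{
    int fired = 0;
    size_t i = 0;

    while (i < r->count) {
        if (r->slots[i].coid == coid) {
            struct watcher w = r->slots[i];
            r->slots[i] = r->slots[--r->count];   /* refill slot, recheck it */
            w.cb(w.coid, w.arg);
            fired++;
        } else {
            i++;
        }
    }
    return fired;
}

/* Example callback: bump a counter supplied through arg. */
void count_loss(int coid, void *arg)
{
    (void)coid;
    ++*(int *)arg;
}

#ifdef __QNXNTO__
#include <sys/neutrino.h>
#include <sys/iofunc.h>
#include <sys/dispatch.h>

/* The one pulse_attach() handler: forward the dead coid to the registry. */
static int on_coid_death(message_context_t *ctp, int code,
                         unsigned flags, void *handle)
{
    (void)code;
    (void)flags;
    registry_fire(handle, ctp->msg->pulse.value.sival_int);
    return 0;
}

/* Build a dispatch handle on a channel that receives COIDDEATH pulses. */
dispatch_t *create_client_dispatch(struct death_registry *r)
{
    dispatch_t *dpp;
    int chid = ChannelCreate(_NTO_CHF_COID_DISCONNECT);

    if (chid == -1)
        return NULL;
    dpp = dispatch_create_channel(chid, DISPATCH_FLAG_NOLOCK);
    if (dpp == NULL)
        return NULL;
    if (pulse_attach(dpp, 0, _PULSE_CODE_COIDDEATH, on_coid_death, r) == -1)
        return NULL;
    return dpp;
}
#endif
```

If the same process is also a resource manager, the channel would additionally need the flags the resmgr library expects for tracking its own clients, which this sketch leaves out.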