select() doesn't return when file descriptor becomes invalid

Hello, I created a resource manager and an application that uses it. It opens the device alright and then uses select() to wait for any interesting things from the driver. When the driver terminates (due to a crash or a manual command) select doesn’t return - I would expect it to return with errno==EBADF.

Is there a way to achieve this behaviour? I read Xiaodan’s article at http://sendreceivereply.wordpress.com/2008/05/12/daddy-the-network-is-down/
but this means setting up a dedicated watcher thread and then raising a signal against the thread blocked in select() (because, from seeing SIGWAITINFO, I conclude that select() is implemented using signals), or using pthread_cancel(), which should work in this state.

The other possibility is to use the timeout facility of select() and call it in a loop, but this is really not what I want.
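For reference, the timeout loop would be something like the following sketch. The function name wait_readable_polling() is hypothetical, and probing the descriptor with fcntl(fd, F_GETFL) after each timeout is an assumption: it presumes that such a call fails once the resource manager behind the descriptor has gone, which should be verified on the target system.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait for fd to become readable, waking up every
 * timeout_ms milliseconds to check that the descriptor is still usable.
 * Returns 1 if fd is readable, 0 if max_rounds timeouts passed,
 * -1 if select() or the probe failed (errno set by the failing call). */
int wait_readable_polling(int fd, int timeout_ms, int max_rounds)
{
    assert(fd >= 0 && fd < FD_SETSIZE);
    for (int round = 0; round < max_rounds; round++) {
        fd_set rfds;
        struct timeval tv;

        /* Both the set and the timeout must be rebuilt on every pass. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;

        int n = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (n > 0)
            return 1;
        if (n < 0 && errno != EINTR)
            return -1;

        /* Timeout or EINTR: probe the descriptor (assumed to fail,
         * e.g. with EBADF, when the server side has disappeared). */
        if (fcntl(fd, F_GETFL) == -1)
            return -1;
    }
    return 0;
}
```

The cost is the one already noted: detection latency is bounded by the timeout, and the thread wakes up periodically even when nothing has happened.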

Thanks for your advice.

Regards,
Albrecht

Does your resource manager handle io_unblock, specifically replying with MsgError() and EINTR to client threads that unexpectedly unblock?

No, it doesn’t. My understanding of this callback is that it notifies the resource manager when a client that is REPLY-blocked on it wishes to terminate. The resource manager could then answer the queued message with whatever is appropriate and clean up any client-specific resources. Correct?

The case I have is the other way round: the client sits waiting in select() on a file descriptor pointing to a resource manager, and that resource manager dies. I want the client to return from the select() call. Any ideas?

Regards,
Albrecht

You’re right. I think in QNX4 select() was built on messaging, but in QNX6 it’s built around io_notify.

Some rambling follows.

The client library code is blocked waiting for an io_notify message from the resmgr, which doesn’t come if the driver dies unexpectedly.

Maybe the driver should have clean-up code that runs to notify the client of input or output conditions (whichever it is waiting on); then, when the client attempts to read or write, the dead driver will be discovered. But that won’t work if the driver dies in a way that the cleanup code never runs. So the client should probably have a timeout anyway.

Interesting problem.

What “the driver dies” means could be important. I suppose you mean that the driver had an error and it sigsegv’d, and not that it decided on its own to exit. So the situation you are describing is already one where the driver is a less reliable piece of software than the application.

When faced with this situation, I would expect some supervisory program to detect the driver dying which would be responsible for restarting the driver and the application.

It is true that if an application were reading a device and waiting for input, and the driver died, it would become unblocked, but it is not completely clear to me that this is the right behavior for a select. When you select, you are asking to be awoken when one of a number of events occurs on possibly more than one device. The driver crashing is not one of these events. An application might want to continue waiting for another event related to another driver, so it is not clear to me that this behavior should be changed, even if it could be.

The driver crashing could of course be an event that select returns on, but this would be an enhancement to select’s standard behavior. For another OS you might expect that a crashing driver would result in the whole system crashing.

Of course there is nothing to prevent you from creating a mechanism outside of select to detect this condition, and to relay it to your application. Alternatively, you could create a function like select that operates the way you prefer.
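One way such an outside mechanism might be wired in is to give select() a second descriptor that a supervisor or watcher writes to when it detects the driver's death. The sketch below is plain POSIX and only shows the waiting side; wait_device_or_notice() and the notification pipe are hypothetical, and how the watcher detects the crash is left open.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait on the device descriptor and on a second
 * descriptor (for example the read end of a pipe) that a watcher
 * writes to when it decides the driver has died.
 * Returns 1 if dev_fd is readable, 2 if the watcher sent a notice
 * (the notice takes priority when both are ready), -1 on error. */
int wait_device_or_notice(int dev_fd, int notify_fd)
{
    assert(dev_fd >= 0 && dev_fd < FD_SETSIZE);
    assert(notify_fd >= 0 && notify_fd < FD_SETSIZE);
    int maxfd = dev_fd > notify_fd ? dev_fd : notify_fd;

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(dev_fd, &rfds);
        FD_SET(notify_fd, &rfds);

        int n = select(maxfd + 1, &rfds, NULL, NULL, NULL);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* unrelated signal: wait again */
            return -1;
        }
        if (FD_ISSET(notify_fd, &rfds))
            return 2;           /* watcher reported the driver gone */
        if (FD_ISSET(dev_fd, &rfds))
            return 1;           /* normal device event */
    }
}
```

This keeps select()'s standard behavior intact: the application decides for itself whether a dead driver should end the wait, and it can keep waiting on other devices if that is what it wants.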

That’s why the standard doesn’t define a behaviour for select() after a driver crash - there is normally no return within that application. But we are in QNX, right, and one could argue that the third set of file descriptors passed to select() - the ones that report error conditions - could well be the one that is filled with the fd of the dead driver, as an extension to the definition of “error condition”. So the fact that an fd becomes invalid because the underlying driver is dead could be thought of as an error condition on that fd, and should be reported by select().

Of course, a supervisory program like HAM offers several other advantages, like restarting the dead driver, logging and so on. However, there is still the following implementation detail: a thread sitting in select() is blocked in the SIGWAITINFO state. From my understanding, the kernel delivers a signal to the thread whenever any of the requested conditions is met. If HAM did that in the case of a driver crash, would the signal be directed to the correct thread?

Regards,
Albrecht

Hi,
I have also come across a problem using the resource reclaim mechanism provided by HAM.

Requirement:
I need to reclaim a file descriptor that has become invalid while the program is running. The program is written using HAM client library functions.

Problem:
The problem is that the handler function is not invoked when the file descriptor becomes invalid (EBADF). The file being accessed is located on a remote machine and is accessed using Samba. To test whether the handler function is invoked when the descriptor becomes invalid, I killed the Samba client process (to make the descriptor invalid). As expected, the system call (read/write) returns EBADF, but the handler function is not invoked!

I also tried with a Null resource manager, i.e. a program written using HAM client library functions that continuously executes read and write calls on the file descriptor returned by ha_open( /dev/Null ,…). To test whether the handler function is invoked when the descriptor becomes invalid, I killed the Null resource manager (to make the descriptor invalid). As expected, the system call (read/write) returns EBADF, but the handler function is not invoked! Why is that? Please help.

Regards,
Princi