resource manager misses 'close dup' when lsof runs?

Hello,

I have a strange behaviour in a resource manager that seems to be
related to the lsof program, and as far as I can tell, occurs totally
within the resource manager library. The symptom is that if I run
lsof while there are open OCBs on my resource manager, then when a
program with an open OCB exits, the OCB does not close. By looking at
the sequence of events in the resource manager, it appears that the
resource manager misses a “close dup” when the lsof program exits.

The sequence of events is this:

  1. the resource manager, nserve, starts.
  2. the client program, datahub (OCB 8057b00), starts and performs
    _IO_CONNECT, _IO_DEVCTL, _IO_WRITE, _IO_READ
  3. the client program, gamma (OCB 8057b88), starts and performs
    _IO_CONNECT, _IO_WRITE, _IO_READ, _IO_WRITE, _IO_READ
  4. datahub performs _IO_WRITE, _IO_READ
  5. lsof starts and does many things
  6. lsof stops
  7. gamma stops, but no close OCB occurs
    :sunglasses: datahub stops, and close OCB occurs for datahub only

Gamma and datahub maintain an open OCB on nserve until they stop
running. This OCB is associated with a filename in nserve’s virtual
file system. The “close ocb” event is used for cleanup within nserve.
The problem is that when gamma stops, there is no “close ocb” message
generated. It appears to be related to the fact that lsof seems to
create a dup on gamma’s OCB, but never closes that dup.

This only occurs when I run lsof while gamma is running.

Since I cannot put hooks in for dup() to be able to generate a debug
message, I can only assume that the dup is being done by lsof, and
never being cleaned up. Is there a way to intercept the dup and
generate a message, then perform the default processing? Why does
iofunc_dup_default not exist?

In any case, does anybody have an idea why this is occurring? Is
there something (stupid) that I could be doing to create these
symptoms?

I have included a run trace from nserve, the resource manager.
Everything within square brackets, [], is my annotation. You can
search on [ to find where significant events occur. Every message
arriving from resmgr_block() is printed with “**** Message:”. The
message “reference attr” and “dereference attr” are places where I
manually manipulate the count on the iofunc_attr_t associated with
files and directories in the virtual file system of nserve. The OCB
and attr numbers are simply the memory addresses of the structures.

If there is anything that I can add to this trace to make things
clearer, please let me know. At the moment I am stumped as to how to
debug this. Any help would be appreciated.

Thanks,
Andrew


[Start of program nserve]
AVF_Mkdir: /dev/nserve
AVF_Mkfile: /dev/nserve
Create attr 805798c
attr->io_attr.mode = 0x81b6
AVF_Mkfile: .control
Create attr 8057538
Reference attr 8057538 to 1 from AVF_Mkfile
attr->io_attr.mode = 0x81b6
AVF_Mkfile: .inform
Create attr 80570c4
Reference attr 80570c4 to 1 from AVF_Mkfile
attr->io_attr.mode = 0x81b6
Root attribute is 805798c
Reference attr 805798c to 1 from main
Reference attr 8057538 to 2 from main
Reference attr 80570c4 to 2 from main

[Start of program datahub]
**** Message: rcvid=3, id=-1, size=49 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 8057538 to 3 from nserve_ocb_calloc
Create OCB 8057b00 (size 56 base 32) with owner 0
[the OCB for datahub is 8057b00]

**** Message: rcvid=3, id=-1, size=16 type=_IO_DEVCTL
I got a lock ocb: 8057b00 (count 4)
Devctl: type 0x106, class 0x3, command 0x18, len 0, data 0
I got a unlock ocb: 8057b00 (count 4)

**** Message: rcvid=3, id=-1, size=77 type=_IO_WRITE
I got a lock ocb: 8057b00 (count 4)
Control: (add “/dh/default” “default” “/dh/default” “0” 0 17420339 1)
Create path /dh
create terminal component dh
AVF_Mkdir: dh
AVF_Mkfile: dh
Create attr 8056358
Reference attr 8056358 to 1 from AVF_Mkfile
attr->io_attr.mode = 0x81b6
Reference attr 8056358 to 1 from AVF_Mkdir
AVF_Mkfile: default
Create attr 80562b0
Reference attr 80562b0 to 1 from AVF_Mkfile
attr->io_attr.mode = 0x81b6
Refcount on /dh/default ->> 1 (CmdAdd)
Reference attr 80562b0 to 2 from CmdAdd
I got a unlock ocb: 8057b00 (count 4)

**** Message: rcvid=3, id=-1, size=16 type=_IO_READ
I got a lock ocb: 8057b00 (count 4)
I got a read ocb: 8057b00, on attr 0x08057538
Read control: 2: t
I got a unlock ocb: 8057b00 (count 4)

[Start of program gamma]
**** Message: rcvid=4, id=-1, size=49 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 8057538 to 5 from nserve_ocb_calloc
Create OCB 8057b88 (size 56 base 32) with owner 0
[the OCB for gamma is 8057b88]

**** Message: rcvid=4, id=-1, size=67 type=_IO_WRITE
I got a lock ocb: 8057b88 (count 6)
Control: (add “dbtest” “default” “dbtest” “0” 0 17424431 1)
AVF_Mkfile: dbtest
Create attr 80561b0
Reference attr 80561b0 to 1 from AVF_Mkfile
attr->io_attr.mode = 0x81b6
Refcount on dbtest ->> 1 (CmdAdd)
Tell a task: 0x1
Send async: 0
Reference attr 80561b0 to 2 from CmdAdd
I got a unlock ocb: 8057b88 (count 6)

**** Message: rcvid=4, id=-1, size=16 type=_IO_READ
I got a lock ocb: 8057b88 (count 6)
I got a read ocb: 8057b88, on attr 0x08057538
Read control: 2: t
I got a unlock ocb: 8057b88 (count 6)

**** Message: rcvid=4, id=-1, size=37 type=_IO_WRITE
I got a lock ocb: 8057b88 (count 6)
Control: (find “/dh/default”)
I got a unlock ocb: 8057b88 (count 6)

**** Message: rcvid=4, id=-1, size=16 type=_IO_READ
I got a lock ocb: 8057b88 (count 6)
I got a read ocb: 8057b88, on attr 0x08057538
Read control: 77: ("/dh/default" “default” “/dh/default” “andrewhome.cogent.ca” 0 17420339 1)

I got a unlock ocb: 8057b88 (count 6)

**** Message: rcvid=3, id=-1, size=38 type=_IO_WRITE
I got a lock ocb: 8057b00 (count 6)
Control: (findid 0 17424431 1)
I got a unlock ocb: 8057b00 (count 6)

**** Message: rcvid=3, id=-1, size=16 type=_IO_READ
I got a lock ocb: 8057b00 (count 6)
I got a read ocb: 8057b00, on attr 0x08057538
Read control: 67: (“dbtest” “default” “dbtest” “andrewhome.cogent.ca” 0 17424431 1)

I got a unlock ocb: 8057b00 (count 6)

[start of program lsof]
**** Message: rcvid=9, id=-1, size=56 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 805798c to 2 from nserve_ocb_calloc
Create OCB 8057b44 (size 56 base 32) with owner 0
I got a lock ocb: 8057b44 (count 3)
I got a stat: 8057b44 (count 3)
I got a unlock ocb: 8057b44 (count 3)
I got a close dup: 8057b44 (count 3)
I got a close ocb: 8057b44 (count 3)
Free OCB 8057b44
Dereference attr 805798c to 1 from nserve_ocb_free

**** Message: rcvid=0, id=-1, size=0 type=Unknown: 0x0


**** Message: rcvid=10, id=-1, size=41 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 805798c to 2 from nserve_ocb_calloc
Create OCB 8057b44 (size 56 base 32) with owner 0

**** Message: rcvid=10, id=-1, size=16 type=_IO_DEVCTL
I got a lock ocb: 8057b44 (count 3)
Devctl: type 0x106, class 0x2, command 0x10, len 256, data 0
I got a unlock ocb: 8057b44 (count 3)

**** Message: rcvid=10, id=-1, size=16 type=_IO_DEVCTL
I got a lock ocb: 8057b44 (count 3)
Devctl: type 0x106, class 0x2, command 0xd, len 168, data 0
I got a unlock ocb: 8057b44 (count 3)

**** Message: rcvid=10, id=-1, size=56 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 805798c to 4 from nserve_ocb_calloc
Create OCB 805c8d4 (size 56 base 32) with owner 0
I got a lock ocb: 805c8d4 (count 5)
I got a stat: 805c8d4 (count 5)
I got a unlock ocb: 805c8d4 (count 5)
I got a close dup: 805c8d4 (count 5)
I got a close ocb: 805c8d4 (count 5)
Free OCB 805c8d4
Dereference attr 805798c to 3 from nserve_ocb_free

**** Message: rcvid=10, id=-1, size=76 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 805798c to 4 from nserve_ocb_calloc
Create OCB 805c8d4 (size 56 base 32) with owner 0
I got a lock ocb: 805c8d4 (count 5)
Devctl: type 0x106, class 0x1, command 0x2, len 4, data 0
I got a stat: 805c8d4 (count 5)
I got a unlock ocb: 805c8d4 (count 5)

**** Message: rcvid=10, id=-1, size=16 type=_IO_READ
I got a lock ocb: 805c8d4 (count 5)
I got a read ocb: 805c8d4, on attr 0x0805798c
I got a unlock ocb: 805c8d4 (count 5)

**** Message: rcvid=10, id=-1, size=64 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 8057538 to 7 from nserve_ocb_calloc
Create OCB 805c918 (size 56 base 32) with owner 0
I got a lock ocb: 805c918 (count :sunglasses:
I got a stat: 805c918 (count :sunglasses:
I got a unlock ocb: 805c918 (count :sunglasses:
I got a close dup: 805c918 (count :sunglasses:
I got a close ocb: 805c918 (count :sunglasses:
Free OCB 805c918
Dereference attr 8057538 to 6 from nserve_ocb_free

**** Message: rcvid=10, id=-1, size=16 type=_IO_READ
I got a lock ocb: 805c8d4 (count 5)
I got a read ocb: 805c8d4, on attr 0x0805798c
I got a unlock ocb: 805c8d4 (count 5)

**** Message: rcvid=10, id=-1, size=56 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 80570c4 to 3 from nserve_ocb_calloc
Create OCB 805c918 (size 56 base 32) with owner 0
I got a lock ocb: 805c918 (count 4)
I got a stat: 805c918 (count 4)
I got a unlock ocb: 805c918 (count 4)
I got a close dup: 805c918 (count 4)
I got a close ocb: 805c918 (count 4)
Free OCB 805c918
Dereference attr 80570c4 to 2 from nserve_ocb_free

**** Message: rcvid=10, id=-1, size=16 type=_IO_READ
I got a lock ocb: 805c8d4 (count 5)
I got a read ocb: 805c8d4, on attr 0x0805798c
I got a unlock ocb: 805c8d4 (count 5)

**** Message: rcvid=10, id=-1, size=56 type=_IO_CONNECT
I got a lock ocb: 805c5d8 (count 2)
I got a stat: 805c5d8 (count 2)
I got a unlock ocb: 805c5d8 (count 2)
I got a close dup: 805c5d8 (count 2)
I got a close ocb: 805c5d8 (count 2)

**** Message: rcvid=10, id=-1, size=76 type=_IO_CONNECT
I got a lock ocb: 805c5d8 (count 2)
Devctl: type 0x106, class 0x1, command 0x2, len 4, data 0
I got a stat: 805c5d8 (count 2)
I got a unlock ocb: 805c5d8 (count 2)

**** Message: rcvid=10, id=-1, size=16 type=_IO_READ
I got a lock ocb: 805c5d8 (count 2)
I got a read ocb: 805c5d8, on attr 0x08056358
I got a unlock ocb: 805c5d8 (count 2)

**** Message: rcvid=10, id=-1, size=64 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 80562b0 to 3 from nserve_ocb_calloc
Create OCB 805c918 (size 56 base 32) with owner 8057b00
I got a lock ocb: 805c918 (count 4)
I got a stat: 805c918 (count 4)
I got a unlock ocb: 805c918 (count 4)
I got a close dup: 805c918 (count 4)
I got a close ocb: 805c918 (count 4)
Free OCB 805c918
Dereference attr 80562b0 to 2 from nserve_ocb_free

**** Message: rcvid=10, id=-1, size=16 type=_IO_READ
I got a lock ocb: 805c5d8 (count 2)
I got a read ocb: 805c5d8, on attr 0x08056358
I got a unlock ocb: 805c5d8 (count 2)

**** Message: rcvid=10, id=-1, size=4 type=_IO_CLOSE
I got a close dup: 805c5d8 (count 2)
I got a close ocb: 805c5d8 (count 2)

**** Message: rcvid=10, id=-1, size=16 type=_IO_READ
I got a lock ocb: 805c8d4 (count 5)
I got a read ocb: 805c8d4, on attr 0x0805798c
I got a unlock ocb: 805c8d4 (count 5)

**** Message: rcvid=10, id=-1, size=56 type=_IO_CONNECT
Allocate ocb of 56 bytes
Reference attr 80561b0 to 3 from nserve_ocb_calloc
Create OCB 805c918 (size 56 base 32) with owner 8057b88
I got a lock ocb: 805c918 (count 4)
I got a stat: 805c918 (count 4)
I got a unlock ocb: 805c918 (count 4)
I got a close dup: 805c918 (count 4)
I got a close ocb: 805c918 (count 4)
Free OCB 805c918
Dereference attr 80561b0 to 2 from nserve_ocb_free

**** Message: rcvid=10, id=-1, size=16 type=_IO_READ
I got a lock ocb: 805c8d4 (count 5)
I got a read ocb: 805c8d4, on attr 0x0805798c
I got a unlock ocb: 805c8d4 (count 5)

**** Message: rcvid=10, id=-1, size=4 type=_IO_CLOSE
I got a close dup: 805c8d4 (count 5)
I got a close ocb: 805c8d4 (count 5)
Free OCB 805c8d4
Dereference attr 805798c to 3 from nserve_ocb_free

**** Message: rcvid=10, id=-1, size=60 type=_IO_DUP

**** Message: rcvid=10, id=-1, size=8 type=_IO_STAT
I got a lock ocb: 8057b88 (count 6)
I got a stat: 8057b88 (count 6)
I got a unlock ocb: 8057b88 (count 6)

**** Message: rcvid=10, id=-1, size=60 type=_IO_DUP

**** Message: rcvid=10, id=-1, size=8 type=_IO_STAT
I got a lock ocb: 8057b00 (count 6)
I got a stat: 8057b00 (count 6)
I got a unlock ocb: 8057b00 (count 6)

**** Message: rcvid=10, id=-1, size=8 type=_IO_STAT
I got a lock ocb: 8057b44 (count 3)
I got a stat: 8057b44 (count 3)
I got a unlock ocb: 8057b44 (count 3)

**** Message: rcvid=10, id=-1, size=4 type=_IO_CLOSE
I got a close dup: 8057b44 (count 3)
I got a close ocb: 8057b44 (count 3)
Free OCB 8057b44
Dereference attr 805798c to 1 from nserve_ocb_free

**** Message: rcvid=0, id=-1, size=0 type=Unknown: 0x0

I got a close dup: 8057b00 (count 6)
[end of processing on termination of program lsof]

[It appears that an open dup on datahub successfully closes (but only
after a mystery event). Symmetry would suggest that a close dup
should also occur on gamma, but it does not.]

[end of program gamma]
**** Message: rcvid=4, id=-1, size=4 type=_IO_CLOSE
I got a close dup: 8057b88 (count 6)

**** Message: rcvid=0, id=-1, size=0 type=Unknown: 0x0
[end of processing on termination of program gamma]

[end of program datahub]
**** Message: rcvid=3, id=-1, size=4 type=_IO_CLOSE
I got a close dup: 8057b00 (count 6)
I got a close ocb: 8057b00 (count 6)
Free 99 byte OCB output buffer with 0 bytes remaining
Free related dirent: default
Dereference attr 80562b0 to 1 from nserve_ocb_free
Dereference attr 80562b0 to 0 from avf_rmchild
Free attr 80562b0
avf_free_attr_data on /dh/default from avf_attr_free
Removed
Tell a task: 0x2
Send async: 0
Refcount on /dh/default → 0 (avf_free_attr_data)
Destroy name: /dh/default
Dereference attr 8056358 to 0 from avf_rmchild
Free attr 8056358
avf_free_attr_data on NULL from avf_attr_free
Free OCB 8057b00
Dereference attr 8057538 to 4 from nserve_ocb_free

**** Message: rcvid=0, id=-1, size=0 type=Unknown: 0x0
[end of processing on termination of program datahub]

\

Andrew Thomas, President, Cogent Real-Time Systems Inc.
2430 Meadowpine Boulevard, Suite 105, Mississauga, Ontario, Canada L5N 6S2
Email: andrew@cogent.ca WWW: http://www.cogent.ca

“Andrew Thomas” <andrew@cogent.ca> wrote in message
news:x7y9o8w5wi.fsf@cogent.ca

[snip]

Since I cannot put hooks in for dup() to be able to generate a debug
message, I can only assume that the dup is being done by lsof, and
never being cleaned up. Is there a way to intercept the dup and
generate a message, then perform the default processing?

There’s linker option (starting from binutils 2.95.x) allowing to substitute
all references to some symbol for your own function. Linker takes care of
renaming real function so you can call it from your own. I don’t remember
the option right now but i’m sure you can find it in ld man page.

  • igor