Fix epoll_ctl dropping registrations in multi-threaded guests#73
Conversation
18ea38e to
5fe3f6a
Compare
jserv
left a comment
There was a problem hiding this comment.
Solid fix for a real bug -- swapping the dup-as-ident for the persistent host fd from fd_table is the right call. A few follow-ups before this lands.
P1: the snapshot pattern leaves a residual TOCTOU around the kqueue ident, and the FIOCLEX/FIONCLEX path has the same ABA shape on linux_flags. fd_entry_t.generation already exists in the codebase for exactly this race; consider using it.
P2: FIOCLEX/FIONCLEX should accept O_PATH fds; Linux's fcntl(F_SETFD) does.
P3 (must-fix for coverage): the new regression test is not listed in tests/manifest.txt, so tests/driver.sh won't run it under make check or tests/test-matrix.sh.
| * validated. Result mapping uses udata (the guest fd), so the ident only | ||
| * needs to stay open and refer to the same open file description. */ | ||
| fd_entry_t target_snap; | ||
| if (!fd_snapshot(fd, &target_snap)) { |
There was a problem hiding this comment.
After fd_snapshot() releases fd_lock, target_snap.host_fd is no longer guaranteed to match fd_table[fd] by the time kevent() runs. A sibling can close(fd) and the host can reuse that fd number before EV_ADD posts -- the kqueue then holds a knote keyed on a different open file than the guest fd believed it registered. A later EPOLL_CTL_DEL after a close+reopen ABA would EV_DELETE the wrong knote.
fd_entry_t already carries a generation counter for this class of race (abi.h:708; used in net.c:311,323). Consider stamping the generation into epoll_reg_t at ADD and rejecting DEL/MOD when fd_table[fd].generation no longer matches.
Adjacent (pre-existing, but more observable after this change): sys_close does not clear inst->regs[fd].active for epoll instances holding the fd, so a closed-then-reopened guest fd still looks active to epoll. Worth filing separately.
| * fd_lock and re-check the slot is still open so a concurrent | ||
| * close/reuse cannot flip CLOEXEC on a different fd that took the slot. | ||
| */ | ||
| pthread_mutex_lock(&fd_lock); |
There was a problem hiding this comment.
Two issues here:
-
ABA race despite the
FD_CLOSEDre-check.sys_ioctlvalidates the fd viahost_fd_ref_open_regular_io()at the top of the function without holdingfd_lock; between that validation and the lock taken here, a sibling canclose(fd)and another open can land a new file in the same slot. The re-check sees a non-CLOSED slot but it belongs to a different file, so the CLOEXEC flip lands on the wrong fd. Same shape asF_SETFDinsys_fcntl(fs.ccase 2). Usefd_entry_t.generationto close it, or document the divergence. -
host_fd_ref_open_regular_io()(the dispatch atio.c:1480) rejects FD_PATH with-EBADF. Linux allowsFIOCLEX/FIONCLEX/fcntl(F_SETFD)on O_PATH descriptors. Dispatch the two cloexec ioctls before the host-fd-ref open so O_PATH fds work.
| @@ -0,0 +1,147 @@ | |||
| /* Multi-threaded epoll registration regression test | |||
There was a problem hiding this comment.
The binary builds via the wildcard rule in mk/config.mk:35 but tests/driver.sh:90 reads the test list strictly from tests/manifest.txt, which still only lists test-epoll and test-epoll-edge. Without an entry there, this regression will not run under make check or tests/test-matrix.sh, and the next host_fd_ref refactor can silently re-break the multi-threaded epoll path. Add test-epoll-mt to the I/O subsystem section next to the other two.
15305a6 to
521e032
Compare
In a multi-threaded guest host_fd_ref_open() hands back a dup of the target fd that host_fd_ref_close() closes when the syscall returns. sys_epoll_ctl() used that transient dup as the kqueue knote ident, and the kernel drops a knote the moment its fd is closed -- so every epoll registration made while multi-threaded was torn down the instant epoll_ctl() returned, and epoll_pwait() never reported readiness again. Single-threaded guests borrow the raw fd (no dup, no close) and never hit it. Node's libuv DelayedTaskScheduler (eventfd + epoll backing uv_async_send) relied on this path and hung forever at process exit: the main thread blocked in WorkerThreadsTaskRunner::Shutdown -> uv_thread_join on a scheduler thread that could no longer be woken. Key the knote on the persistent host fd from the fd table. Take it from the same atomic fd_snapshot() that validates the fd, so the ident comes from the entry that was validated rather than a second fd_to_host() lookup that could race a concurrent close/reopen. Result mapping already uses udata (the guest fd), so the ident only needs to stay open and refer to the same open file description. Guard the close+reopen ABA with a per-slot generation counter. fd_table entries now carry a monotonic generation bumped on every allocation; epoll registrations stamp it at ADD/MOD. If the guest closes a watched fd and reopens it (reusing the guest fd number), the kernel has already dropped the original knote, yet reg->active still looks live -- a later DEL/MOD would EV_DELETE the wrong knote on the reused host fd. A mismatched generation now marks the registration gone, so DEL/MOD report ENOENT (matching Linux's auto-removal on close) and ADD starts fresh. Also implement the FIONBIO / FIOCLEX / FIONCLEX ioctls, which were falling through to ENOTTY. libuv's uv_pipe_open() sets non-blocking via FIONBIO, so Node's console.log() to a pipe threw "open ENOTTY". FIONBIO maps to F_SETFL O_NONBLOCK (status flag, shared across the dup). FIOCLEX/FIONCLEX mirror F_SETFD by toggling the fd_table cloexec bit rather than the host fd's FD_CLOEXEC, which is per-descriptor and lost on the dup. They need no host fd, so they dispatch before host_fd_ref_open_regular_io() -- which rejects O_PATH (FD_PATH) with EBADF, while Linux allows these ioctls (like F_SETFD) on O_PATH fds -- and validate the slot and flip the flag in a single fd_lock section, so there is no validate-then-mutate window for a concurrent close/reuse to flip cloexec on a different file. Add tests/test-epoll-mt.c: a CLONE_THREAD sibling keeps the guest multi-threaded across epoll_ctl, then asserts a registered eventfd and pipe still deliver an EPOLLIN edge. It fails without the poll.c fix. Add tests/test-ioctl-cloexec.c covering FIOCLEX/FIONCLEX round-trip on both a regular and an O_PATH fd. Both are listed in tests/manifest.txt so the driver runs them under make check. With these, node:alpine (node v26.3.0) runs JavaScript, timers, the libuv threadpool, and promises, and exits cleanly.
521e032 to
b8569ab
Compare
|
Thank @Max042004 for contributing! |
In a multi-threaded guest host_fd_ref_open() hands back a dup of the
target fd that host_fd_ref_close() closes when the syscall returns.
sys_epoll_ctl() used that transient dup as the kqueue knote ident, and
the kernel drops a knote the moment its fd is closed -- so every epoll
registration made while multi-threaded was torn down the instant
epoll_ctl() returned, and epoll_pwait() never reported readiness again.
Single-threaded guests borrow the raw fd (no dup, no close) and never
hit it. Node's libuv DelayedTaskScheduler (eventfd + epoll backing
uv_async_send) relied on this path and hung forever at process exit:
the main thread blocked in WorkerThreadsTaskRunner::Shutdown ->
uv_thread_join on a scheduler thread that could no longer be woken.
Key the knote on the persistent host fd from the fd table. Take it from
the same atomic fd_snapshot() that validates the fd, so the ident comes
from the entry that was validated rather than a second fd_to_host()
lookup that could race a concurrent close/reopen. Result mapping already
uses udata (the guest fd), so the ident only needs to stay open and
refer to the same open file description.
Also implement the FIONBIO / FIOCLEX / FIONCLEX ioctls, which were
falling through to ENOTTY. libuv's uv_pipe_open() sets non-blocking via
FIONBIO, so Node's console.log() to a pipe threw "open ENOTTY". FIONBIO
maps to F_SETFL O_NONBLOCK (status flag, shared across the dup);
FIOCLEX/FIONCLEX mirror F_SETFD by toggling the fd_table cloexec bit
under fd_lock (re-checking the slot is still open) rather than the host
fd's FD_CLOEXEC, which is per-descriptor and lost on the dup.
Add tests/test-epoll-mt.c: a CLONE_THREAD sibling keeps the guest
multi-threaded across epoll_ctl, then asserts a registered eventfd and
pipe still deliver an EPOLLIN edge. It fails without the poll.c fix.
This surfaced while bringing up node:alpine via
oci runfrom the OCIimage support PR (#34); the fix lives in the core syscall layer and is
independent of that PR (based on main), but it also unblocks running
multi-threaded guests through
oci run. With these, node:alpine(node v26.3.0) runs JavaScript, timers, the libuv threadpool, and
promises, and exits cleanly.