Skip to content

Fix epoll_ctl dropping registrations in multi-threaded guests#73

Merged
jserv merged 1 commit into
sysprog21:mainfrom
Max042004:fix-epoll-mt-dup
Jun 6, 2026
Merged

Fix epoll_ctl dropping registrations in multi-threaded guests#73
jserv merged 1 commit into
sysprog21:mainfrom
Max042004:fix-epoll-mt-dup

Conversation

@Max042004
Copy link
Copy Markdown
Collaborator

@Max042004 Max042004 commented Jun 5, 2026

In a multi-threaded guest host_fd_ref_open() hands back a dup of the
target fd that host_fd_ref_close() closes when the syscall returns.
sys_epoll_ctl() used that transient dup as the kqueue knote ident, and
the kernel drops a knote the moment its fd is closed -- so every epoll
registration made while multi-threaded was torn down the instant
epoll_ctl() returned, and epoll_pwait() never reported readiness again.
Single-threaded guests borrow the raw fd (no dup, no close) and never
hit it. Node's libuv DelayedTaskScheduler (eventfd + epoll backing
uv_async_send) relied on this path and hung forever at process exit:
the main thread blocked in WorkerThreadsTaskRunner::Shutdown ->
uv_thread_join on a scheduler thread that could no longer be woken.

Key the knote on the persistent host fd from the fd table. Take it from
the same atomic fd_snapshot() that validates the fd, so the ident comes
from the entry that was validated rather than a second fd_to_host()
lookup that could race a concurrent close/reopen. Result mapping already
uses udata (the guest fd), so the ident only needs to stay open and
refer to the same open file description.

Also implement the FIONBIO / FIOCLEX / FIONCLEX ioctls, which were
falling through to ENOTTY. libuv's uv_pipe_open() sets non-blocking via
FIONBIO, so Node's console.log() to a pipe threw "open ENOTTY". FIONBIO
maps to F_SETFL O_NONBLOCK (status flag, shared across the dup);
FIOCLEX/FIONCLEX mirror F_SETFD by toggling the fd_table cloexec bit
under fd_lock (re-checking the slot is still open) rather than the host
fd's FD_CLOEXEC, which is per-descriptor and lost on the dup.

Add tests/test-epoll-mt.c: a CLONE_THREAD sibling keeps the guest
multi-threaded across epoll_ctl, then asserts a registered eventfd and
pipe still deliver an EPOLLIN edge. It fails without the poll.c fix.

This surfaced while bringing up node:alpine via oci run from the OCI
image support PR (#34); the fix lives in the core syscall layer and is
independent of that PR (based on main), but it also unblocks running
multi-threaded guests through oci run. With these, node:alpine
(node v26.3.0) runs JavaScript, timers, the libuv threadpool, and
promises, and exits cleanly.

cubic-dev-ai[bot]

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix for a real bug -- swapping the dup-as-ident for the persistent host fd from fd_table is the right call. A few follow-ups before this lands.

P1: the snapshot pattern leaves a residual TOCTOU around the kqueue ident, and the FIOCLEX/FIONCLEX path has the same ABA shape on linux_flags. fd_entry_t.generation already exists in the codebase for exactly this race; consider using it.

P2: FIOCLEX/FIONCLEX should accept O_PATH fds; Linux's fcntl(F_SETFD) does.

P3 (must-fix for coverage): the new regression test is not listed in tests/manifest.txt, so tests/driver.sh won't run it under make check or tests/test-matrix.sh.

Comment thread src/syscall/poll.c
* validated. Result mapping uses udata (the guest fd), so the ident only
* needs to stay open and refer to the same open file description. */
fd_entry_t target_snap;
if (!fd_snapshot(fd, &target_snap)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After fd_snapshot() releases fd_lock, target_snap.host_fd is no longer guaranteed to match fd_table[fd] by the time kevent() runs. A sibling can close(fd) and the host can reuse that fd number before EV_ADD posts -- the kqueue then holds a knote keyed on a different open file than the guest fd believed it registered. A later EPOLL_CTL_DEL after a close+reopen ABA would EV_DELETE the wrong knote.

fd_entry_t already carries a generation counter for this class of race (abi.h:708; used in net.c:311,323). Consider stamping the generation into epoll_reg_t at ADD and rejecting DEL/MOD when fd_table[fd].generation no longer matches.

Adjacent (pre-existing, but more observable after this change): sys_close does not clear inst->regs[fd].active for epoll instances holding the fd, so a closed-then-reopened guest fd still looks active to epoll. Worth filing separately.

Comment thread src/syscall/io.c Outdated
* fd_lock and re-check the slot is still open so a concurrent
* close/reuse cannot flip CLOEXEC on a different fd that took the slot.
*/
pthread_mutex_lock(&fd_lock);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues here:

  1. ABA race despite the FD_CLOSED re-check. sys_ioctl validates the fd via host_fd_ref_open_regular_io() at the top of the function without holding fd_lock; between that validation and the lock taken here, a sibling can close(fd) and another open can land a new file in the same slot. The re-check sees a non-CLOSED slot but it belongs to a different file, so the CLOEXEC flip lands on the wrong fd. Same shape as F_SETFD in sys_fcntl (fs.c case 2). Use fd_entry_t.generation to close it, or document the divergence.

  2. host_fd_ref_open_regular_io() (the dispatch at io.c:1480) rejects FD_PATH with -EBADF. Linux allows FIOCLEX/FIONCLEX/fcntl(F_SETFD) on O_PATH descriptors. Dispatch the two cloexec ioctls before the host-fd-ref open so O_PATH fds work.

Comment thread tests/test-epoll-mt.c
@@ -0,0 +1,147 @@
/* Multi-threaded epoll registration regression test
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The binary builds via the wildcard rule in mk/config.mk:35 but tests/driver.sh:90 reads the test list strictly from tests/manifest.txt, which still only lists test-epoll and test-epoll-edge. Without an entry there, this regression will not run under make check or tests/test-matrix.sh, and the next host_fd_ref refactor can silently re-break the multi-threaded epoll path. Add test-epoll-mt to the I/O subsystem section next to the other two.

In a multi-threaded guest host_fd_ref_open() hands back a dup of the
target fd that host_fd_ref_close() closes when the syscall returns.
sys_epoll_ctl() used that transient dup as the kqueue knote ident, and
the kernel drops a knote the moment its fd is closed -- so every epoll
registration made while multi-threaded was torn down the instant
epoll_ctl() returned, and epoll_pwait() never reported readiness again.
Single-threaded guests borrow the raw fd (no dup, no close) and never
hit it. Node's libuv DelayedTaskScheduler (eventfd + epoll backing
uv_async_send) relied on this path and hung forever at process exit:
the main thread blocked in WorkerThreadsTaskRunner::Shutdown ->
uv_thread_join on a scheduler thread that could no longer be woken.

Key the knote on the persistent host fd from the fd table. Take it from
the same atomic fd_snapshot() that validates the fd, so the ident comes
from the entry that was validated rather than a second fd_to_host()
lookup that could race a concurrent close/reopen. Result mapping already
uses udata (the guest fd), so the ident only needs to stay open and
refer to the same open file description.

Guard the close+reopen ABA with a per-slot generation counter. fd_table
entries now carry a monotonic generation bumped on every allocation;
epoll registrations stamp it at ADD/MOD. If the guest closes a watched
fd and reopens it (reusing the guest fd number), the kernel has already
dropped the original knote, yet reg->active still looks live -- a later
DEL/MOD would EV_DELETE the wrong knote on the reused host fd. A
mismatched generation now marks the registration gone, so DEL/MOD report
ENOENT (matching Linux's auto-removal on close) and ADD starts fresh.

Also implement the FIONBIO / FIOCLEX / FIONCLEX ioctls, which were
falling through to ENOTTY. libuv's uv_pipe_open() sets non-blocking via
FIONBIO, so Node's console.log() to a pipe threw "open ENOTTY". FIONBIO
maps to F_SETFL O_NONBLOCK (status flag, shared across the dup).
FIOCLEX/FIONCLEX mirror F_SETFD by toggling the fd_table cloexec bit
rather than the host fd's FD_CLOEXEC, which is per-descriptor and lost on
the dup. They need no host fd, so they dispatch before
host_fd_ref_open_regular_io() -- which rejects O_PATH (FD_PATH) with
EBADF, while Linux allows these ioctls (like F_SETFD) on O_PATH fds --
and validate the slot and flip the flag in a single fd_lock section, so
there is no validate-then-mutate window for a concurrent close/reuse to
flip cloexec on a different file.

Add tests/test-epoll-mt.c: a CLONE_THREAD sibling keeps the guest
multi-threaded across epoll_ctl, then asserts a registered eventfd and
pipe still deliver an EPOLLIN edge. It fails without the poll.c fix.
Add tests/test-ioctl-cloexec.c covering FIOCLEX/FIONCLEX round-trip on
both a regular and an O_PATH fd. Both are listed in tests/manifest.txt
so the driver runs them under make check.

With these, node:alpine (node v26.3.0) runs JavaScript, timers, the
libuv threadpool, and promises, and exits cleanly.
@jserv jserv merged commit 8410040 into sysprog21:main Jun 6, 2026
4 checks passed
@jserv
Copy link
Copy Markdown
Contributor

jserv commented Jun 6, 2026

Thank @Max042004 for contributing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants