Switch ppoll/pselect6 to host_fd_ref_t and tighten #72

Merged: jserv merged 1 commit into main from hotpath-cleanup on Jun 5, 2026

jserv commented on Jun 5, 2026

sys_ppoll previously walked a hand-rolled FD lookup cache to translate guest fds to host fds, with no protection against a concurrent close(2) on another thread retiring the fd between lookup and poll(). The replacement is host_fd_ref_open_io plus a per-call host_fd_ref_t array that keeps each host fd alive until poll() returns and the refs are released. sys_pselect6 follows the same pattern.

For invalid guest fds, the bad-fd slot is no longer suppressed by an early return. Force a non-blocking poll() when any slot is invalid, re-stamp POLLNVAL on those entries after the call (POSIX poll resets revents to 0 for fd < 0), and credit them to the return count. Linux reports POLLNVAL on bad fds alongside revents on the good ones in the same call; the prior early-return dropped ready events on valid fds.
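The bad-fd handling described above can be sketched roughly as follows. This is a minimal illustration, not the project's `sys_ppoll`: the helper name `poll_with_invalid_slots` and the `invalid[]` array (marking guest fds that failed translation) are assumptions made for the example.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>

/* Hypothetical helper, not the project's code. invalid[i] marks slots
 * whose guest fd could not be translated to a host fd. */
int poll_with_invalid_slots(struct pollfd *fds, const bool *invalid,
                            nfds_t nfds, int timeout_ms)
{
    assert(nfds == 0 || (fds && invalid));

    bool any_invalid = false;
    for (nfds_t i = 0; i < nfds; i++) {
        if (invalid[i]) {
            fds[i].fd = -1; /* poll() ignores negative fds */
            any_invalid = true;
        }
    }

    /* A bad slot already counts as an event, so the call must not block. */
    int ready = poll(fds, nfds, any_invalid ? 0 : timeout_ms);
    if (ready < 0)
        return -1; /* errno was set by poll() */

    for (nfds_t i = 0; i < nfds; i++) {
        if (invalid[i]) {
            fds[i].revents = POLLNVAL; /* poll() left this at 0 */
            ready++;                   /* credit the slot to the count */
        }
    }
    return ready;
}
```

With two invalid slots and an infinite timeout, this helper returns 2 immediately instead of blocking, and valid slots polled in the same call keep whatever revents the host reported.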

ppoll and pselect6 also stop silently dropping a bad sigmask pointer: guest_read_small failures now return -EFAULT, and pselect6 returns -EINVAL when ss_len does not match sizeof(sigset_t).
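A rough sketch of the sigmask validation, under stated assumptions: the toy `struct guest_mem`, `guest_read_sketch` (a stand-in for `guest_read_small`), `load_guest_sigmask`, and the 8-byte `GUEST_SIGSET_SIZE` are all hypothetical, and checking the length before the read is borrowed from Linux's ordering rather than confirmed by this PR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUEST_SIGSET_SIZE 8 /* assumption: 64-bit kernel sigset */

/* Toy guest memory: a flat buffer addressed from 0 (hypothetical). */
struct guest_mem {
    const uint8_t *base;
    size_t size;
};

/* Stand-in for guest_read_small: 0 on success, -1 if out of range. */
static int guest_read_sketch(const struct guest_mem *g, uint64_t addr,
                             void *dst, size_t len)
{
    if (addr > g->size || len > g->size - addr)
        return -1;
    memcpy(dst, g->base + addr, len);
    return 0;
}

/* Returns 1 if a mask was loaded, 0 if the guest passed no mask,
 * or a negative errno value. */
int load_guest_sigmask(const struct guest_mem *g, uint64_t ss_addr,
                       size_t ss_len, uint64_t *mask_out)
{
    assert(g && mask_out);
    if (ss_addr == 0)
        return 0;       /* NULL mask: leave the signal mask alone */
    if (ss_len != GUEST_SIGSET_SIZE)
        return -EINVAL; /* wrong sigset size */
    if (guest_read_sketch(g, ss_addr, mask_out, sizeof *mask_out) < 0)
        return -EFAULT; /* bad pointer is reported, not ignored */
    return 1;
}
```

The point of the sketch is only that a failed read surfaces as `-EFAULT` to the guest rather than being treated as "no mask".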

pselect6 grows a poll(2) fallback for cases where a host fd or the wakeup pipe exceeds FD_SETSIZE. select cannot represent those fds, so the fallback drives a struct pollfd array and maps poll revents back to read/write/except fd_sets (POLLIN|POLLHUP|POLLERR for read, POLLOUT|POLLHUP|POLLERR for write, POLLPRI for except).
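The revents-to-fd_set mapping listed above might look like the sketch below. The function name `revents_to_fdsets` and the `want_*` flags (whether the guest asked about the fd in each set) are assumptions for illustration, as is the requirement that the guest fd itself fits in an `fd_set`.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/select.h>

/* Hypothetical helper: translate one poll() result into select()-style
 * sets. Returns the number of bits set, which is how select() counts. */
int revents_to_fdsets(int guest_fd, short revents,
                      bool want_read, bool want_write, bool want_except,
                      fd_set *rset, fd_set *wset, fd_set *xset)
{
    assert(guest_fd >= 0 && guest_fd < FD_SETSIZE);
    int bits = 0;

    if (want_read && (revents & (POLLIN | POLLHUP | POLLERR))) {
        FD_SET(guest_fd, rset);
        bits++;
    }
    if (want_write && (revents & (POLLOUT | POLLHUP | POLLERR))) {
        FD_SET(guest_fd, wset);
        bits++;
    }
    if (want_except && (revents & POLLPRI)) {
        FD_SET(guest_fd, xset);
        bits++;
    }
    return bits;
}
```

A hang-up on an fd present in both the read and write sets therefore contributes two bits to the result, matching select's habit of counting set bits rather than descriptors.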

Add the Linux IP small-int sockopts to socket_opt_uses_small_int: IP_TOS, IP_TTL, IP_HDRINCL, IP_PKTINFO, IP_RECVTTL, IP_RECVTOS. macOS rejects setsockopt with optlen < sizeof(int) for these, so the host call always forwards a zero-extended int regardless of guest optlen. getsockopt mirrors Linux ip_sockglue copyval: when the caller buffer is shorter than int and the value fits in a byte, report and write a single byte. Factor ip_copyval_clamp to share the check between the cached fast path and the post-host path.
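As a sketch of both directions, assuming behavior modeled on Linux's IP sockopt handling: `widen_guest_optval` and `copyval_clamp_sketch` are hypothetical names (the real helper is `ip_copyval_clamp`, whose signature is not shown in this PR), and reading only the first byte of a short setsockopt buffer is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* setsockopt side: widen the guest value to a full int so the host
 * always receives sizeof(int) bytes. Short buffers contribute their
 * first byte, zero-extended (assumption). */
int widen_guest_optval(const void *optval, size_t optlen)
{
    assert(optval && optlen > 0);
    if (optlen >= sizeof(int)) {
        int val;
        memcpy(&val, optval, sizeof val);
        return val;
    }
    return *(const unsigned char *)optval;
}

/* getsockopt side: returns the length to report back to the guest
 * after writing that many bytes into buf. */
size_t copyval_clamp_sketch(int val, void *buf, size_t buflen)
{
    assert(buf || buflen == 0);
    if (buflen > 0 && buflen < sizeof(int) && val >= 0 && val <= 255) {
        unsigned char byte = (unsigned char)val;
        memcpy(buf, &byte, 1); /* short buffer, value fits in a byte */
        return 1;
    }
    size_t n = buflen < sizeof(int) ? buflen : sizeof(int);
    if (n > 0)
        memcpy(buf, &val, n);  /* otherwise copy up to sizeof(int) */
    return n;
}
```

For example, a TTL of 64 read into a 2-byte buffer is reported as a single byte, while the same value read into an 8-byte buffer is reported as `sizeof(int)` bytes.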

Normalize SO_PASSCRED and the four IP boolean toggles (IP_HDRINCL, IP_PKTINFO, IP_RECVTTL, IP_RECVTOS) to 0 or 1 in socket_small_int_normalize so setsockopt(IP_PKTINFO, 5) caches 1 and getsockopt returns 1, matching the kernel's !!val convention. SO_PASSCRED also joins the small-int set so its cached round-trip survives the gate change from "guest_optlen <= sizeof(int)" to socket_opt_uses_small_int(level, optname).
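The normalization step can be illustrated with a small sketch. The function `normalize_small_int_sketch` and the `GUEST_*` constants are hypothetical; the numeric values are the guest-side Linux option numbers on common architectures, spelled out here as an assumption so the example does not depend on host headers.

```c
#include <stdbool.h>

/* Guest-side (Linux) option numbers, assumed for illustration. */
enum {
    GUEST_SOL_IP = 0,
    GUEST_SOL_SOCKET = 1,
    GUEST_IP_TOS = 1,
    GUEST_IP_TTL = 2,
    GUEST_IP_HDRINCL = 3,
    GUEST_IP_PKTINFO = 8,
    GUEST_IP_RECVTTL = 12,
    GUEST_IP_RECVTOS = 13,
    GUEST_SO_PASSCRED = 16,
};

/* True for options the kernel stores as an on/off flag. */
static bool opt_is_boolean(int level, int optname)
{
    if (level == GUEST_SOL_SOCKET)
        return optname == GUEST_SO_PASSCRED;
    if (level == GUEST_SOL_IP) {
        switch (optname) {
        case GUEST_IP_HDRINCL:
        case GUEST_IP_PKTINFO:
        case GUEST_IP_RECVTTL:
        case GUEST_IP_RECVTOS:
            return true;
        default:
            return false;
        }
    }
    return false;
}

/* Collapse boolean toggles to 0 or 1; pass other values through. */
int normalize_small_int_sketch(int level, int optname, int val)
{
    return opt_is_boolean(level, optname) ? !!val : val;
}
```

Under these assumptions, storing 5 for IP_PKTINFO caches 1, while a non-boolean option such as IP_TTL keeps its value unchanged.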

rt_sigreturn switches its rt_sigframe read to guest_read_small to pick up the bounded direct-mapping fast path; guest_read_small falls back to guest_read when the bounded mapping fails so the frame size stays correct.


Summary by cubic

Make ppoll/pselect6 race-free and Linux-correct, and align IP and SO_PASSCRED sockopts with Linux semantics on macOS. Adds a poll fallback for oversized FDs and tightens small-int handling.

  • Bug Fixes

    • Match Linux on invalid FDs and masks: no early return in ppoll; restamp POLLNVAL and include it in the count; return -EFAULT for bad sigmask pointers; pselect6 returns -EINVAL if ss_len != sizeof(sigset_t).
    • Add pselect6 poll(2) fallback when any fd or the wakeup pipe exceeds FD_SETSIZE; map poll revents back to read/write/except sets.
    • Socket options: treat IP_TOS, IP_TTL, IP_HDRINCL, IP_PKTINFO, IP_RECVTTL, IP_RECVTOS, and SO_PASSCRED as small-int; require optlen > 0; normalize booleans to 0/1; always pass sizeof(int) to the host; clamp IP getsockopt to 1 byte when the buffer is short and the value fits; accept() inherits SO_PASSCRED from the listener (including while accept blocks); only inject SCM_CREDENTIALS on AF_UNIX sockets.
  • Refactors

    • Switch ppoll/pselect6 to host_fd_ref_t to keep host FDs alive during the call and avoid close(2) races.
    • Track per-fd generation and guard socket-opt reads to avoid stale cache (used by accept inheritance); use guest_read_small in rt_sigreturn for the bounded fast path.

Written for commit d369caf. Summary will update on new commits.

sysprog21 deleted a comment from cubic-dev-ai[bot] on Jun 5, 2026
jserv force-pushed the hotpath-cleanup branch from fde4313 to d369caf on June 5, 2026 13:12
jserv merged commit 23b8300 into main on Jun 5, 2026
4 checks passed
jserv deleted the hotpath-cleanup branch on June 5, 2026 13:15