Deliver synchronous fault signals to the faulting thread#86
Open
Max042004 wants to merge 2 commits into
Open
Conversation
test-fork's phase-2 signal child spins in `while (!got_usr1) usleep()` waiting for a SIGUSR1 sent cross-process by its parent. The signal was delivered only ~35% of the time and lost the rest, so `make check` hung at test-fork until the 60s per-test timeout -- often longer, as leaked `elfuse --fork-child` orphans kept the driver's stdout pipe open. Two complementary defects: 1. vcpu_run_loop treated HV_EXIT_REASON_UNKNOWN as fatal. A host SIGUSR2 (the cross-process guest-signal transport) that interrupts hv_vcpu_run mid-execution aborts the run with UNKNOWN rather than the clean CANCELED that hv_vcpus_exit() produces for a vCPU caught between runs. Route UNKNOWN through the same cancellation handling so the already-queued guest signal is delivered instead of crashing the child. 2. signal_deliver redirected to the handler only via ELR_EL1, which takes effect solely on an ERET from EL1 (the syscall-return path, gated by the shim's X8==2 exec_drop_frame marker). When the signal is delivered from the cancellation branch -- i.e. the vCPU was preempted while running EL0 code (cross-process SIGUSR2, or SIGALRM in a tight loop) -- there is no pending ERET, the resume uses HV_REG_PC, and the ELR_EL1 write is a no-op: the handler never runs and only the X0=signum clobber lands, re-running the interrupted nanosleep with a bogus arg and spinning forever. Detect EL0 preemption from the live PSTATE (CPSR M[3:0]==0), save the interrupted PC from HV_REG_PC instead of the stale ELR_EL1, and redirect HV_REG_PC/CPSR directly; skip the X8==2 marker since there is no shim frame to drop. test-fork now passes 20/20 (was ~7/20); `make check` is green with no hang.
A guest exception (SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGTRAP) was delivered by calling signal_set_fault_info() (which records si_code/si_addr in a _Thread_local slot) followed by signal_queue() + signal_deliver(). But signal_queue() sets a bit in the process-wide pending mask, and every vCPU thread runs signal_deliver() at its own poll points. So a fault raised by thread A could be dequeued and delivered by thread B, whose thread-local fault info is empty -- the siginfo then degrades to si_code=SI_USER with no si_addr instead of SEGV_MAPERR/si_addr. A JVM treats a SIGSEGV with si_code=SI_USER as a fatal external signal rather than a recoverable implicit null check, so javac (heavily multi-threaded, many null-check faults) crashed with a fatal error report. Two threads faulting on the same signal also collapsed into a single bitmask bit, dropping one fault entirely. Add signal_deliver_fault(), which delivers a given signal directly to the current thread using its thread-local fault info and never touches the process-wide pending set. signal_deliver()'s frame-building tail is factored into deliver_signal_locked() and shared. Route all four synchronous-fault sites in the vCPU loop (SEGV_ACCERR, BRK->SIGTRAP, SIGILL, data-abort SIGSEGV) through it. Add test-fault-signal-mt: N threads each take recoverable SIGSEGVs in a loop and assert every delivery is SEGV_MAPERR with the right si_addr on the faulting thread. It fails deterministically (rc=2, fault delivered to a non-faulting thread) on the old queue-based path and passes on the fix. (cherry picked from commit 201b4fa0d2ffb2218e252be71930e1cfb9e360a5)
jserv
reviewed
Jun 6, 2026
| */ | ||
| pthread_mutex_lock(&sig_lock); | ||
| signal_rt_info_t rt_info = signal_default_info(signum); | ||
| return deliver_signal_locked(vcpu, g, signum, rt_info, exit_code); |
Contributor
There was a problem hiding this comment.
The comment two lines up calls this "intentionally ignored: a synchronous fault cannot be postponed" -- that's only half the Linux contract. force_sig_info_to_task in kernel/signal.c resets the disposition to SIG_DFL and unblocks the signum before applying default, so a SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGTRAP that's blocked OR set to SIG_IGN terminates the process. This PR's behavior diverges in two ways:
- SIG_IGN on synchronous fault (preexisting):
deliver_signal_lockedat signal.c:1352 returns 0 for SIG_IGN. The vCPU loop resumes, PC unchanged, the same instruction re-faults forever. Linux would terminate. - Blocked synchronous fault (new in this PR): pre-PR,
signal_deliver'sdeliverable = sig_state.pending & ~*blockedshort-circuited and the thread re-faulted infinitely. Post-PR,signal_deliver_faultbypasses the blocked mask entirely and runs the user handler despite the block. Linux would reset to SIG_DFL + unblock + terminate.
JVM works either way because it never blocks SIGSEGV, but the contract drift is real. A five-line precheck closes both cases with one shape:
int signal_deliver_fault(hv_vcpu_t vcpu, guest_t *g, int signum, int *exit_code)
{
pthread_mutex_lock(&sig_lock);
uint64_t *blocked = thread_blocked_ptr();
linux_sigaction_t *act = &sig_state.actions[signum - 1];
if (act->sa_handler == LINUX_SIG_IGN ||
(*blocked & sig_bit(signum))) {
/* Linux force_sig_info_to_task: forced synchronous faults reset
* disposition to SIG_DFL, unblock the signum, then apply default. */
act->sa_handler = LINUX_SIG_DFL;
*blocked &= ~sig_bit(signum);
}
signal_rt_info_t rt_info = signal_default_info(signum);
return deliver_signal_locked(vcpu, g, signum, rt_info, exit_code);
}Update the "intentionally ignored" comment accordingly.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note
Stacked on #76. This branch is based on
fix-cross-process-signal-el0(#76),whose commit appears here until #76 merges; afterwards this PR auto-reduces to the
single fault-delivery commit. Review #76 first.
A guest exception (SIGSEGV/SIGBUS/SIGILL/SIGFPE/SIGTRAP) was delivered via
signal_set_fault_info()(which records si_code/si_addr in a_Thread_localslot)followed by
signal_queue()+signal_deliver(). Butsignal_queue()sets a bit inthe process-wide pending mask, and every vCPU thread runs
signal_deliver()at its ownpoll points, so a fault raised by thread A could be dequeued and delivered by thread B —
whose thread-local fault info is empty. The siginfo then degraded to
si_code=SI_USERwith no
si_addrinstead ofSEGV_MAPERR/si_addr. A JVM treats such a SIGSEGV as afatal external signal rather than a recoverable implicit null check, so
javaccrashed.Two threads faulting on the same signal also collapsed into one bitmask bit, dropping
one fault.
Add
signal_deliver_fault(), which delivers a given signal directly to the currentthread using its thread-local fault info and never touches the process-wide pending set.
signal_deliver()'s frame-building tail is factored intodeliver_signal_locked()andshared. All four synchronous-fault sites in the vCPU loop (SEGV_ACCERR, BRK→SIGTRAP,
SIGILL, data-abort SIGSEGV) route through it.
Adds
test-fault-signal-mt: N threads each take recoverable SIGSEGVs and assert everydelivery is
SEGV_MAPERRwith the rightsi_addron the faulting thread. It failsdeterministically (rc=2) on the old queue-based path and passes on the fix.
Summary by cubic
Deliver synchronous fault signals to the faulting vCPU thread to keep correct
si_code/si_addrand avoid lost faults. Fixes JVM crashes caused bySIGSEGVdelivered asSI_USERand makes multi-threaded faults reliable.signal_deliver_fault()and routed all synchronous guest exceptions through it; bypasses the process-wide pending mask.deliver_signal_locked(); supports EL0 preemption by redirecting viaHV_REG_PC/CPSR.vcpu_run_loopto use the new path and to treatHV_EXIT_REASON_UNKNOWNlikeCANCELEDso preempted runs resume cleanly.test-fault-signal-mtto verify per-threadSIGSEGVdelivery; Makefile and manifest updated.Written for commit 23b2307. Summary will update on new commits.