Fix cross-process signal delivery to EL0-preempted guests by Max042004 · Pull Request #76 · sysprog21/elfuse

Max042004 · 2026-06-06T04:14:30Z

To fix #67. test-fork's phase-2 signal child spins in while (!got_usr1) usleep() waiting for a SIGUSR1 sent cross-process by its parent. The signal was delivered only ~35% of the time and lost the rest, so make check hung at test-fork until the 60s per-test timeout -- often longer, as leaked elfuse --fork-child orphans kept the driver's stdout pipe open.

Two complementary defects:

vcpu_run_loop treated HV_EXIT_REASON_UNKNOWN as fatal. A host SIGUSR2 (the cross-process guest-signal transport) that interrupts hv_vcpu_run mid-execution aborts the run with UNKNOWN rather than the clean CANCELED that hv_vcpus_exit() produces for a vCPU caught between runs. Route UNKNOWN through the same cancellation handling so the already-queued guest signal is delivered instead of crashing the child.
signal_deliver redirected to the handler only via ELR_EL1, which takes effect solely on an ERET from EL1 (the syscall-return path, gated by the shim's X8==2 exec_drop_frame marker). When the signal is delivered from the cancellation branch -- i.e. the vCPU was preempted while running EL0 code (cross-process SIGUSR2, or SIGALRM in a tight loop) -- there is no pending ERET, the resume uses HV_REG_PC, and the ELR_EL1 write is a no-op: the handler never runs and only the X0=signum clobber lands, re-running the interrupted nanosleep with a bogus arg and spinning forever. Detect EL0 preemption from the live PSTATE (CPSR M[3:0]==0), save the interrupted PC from HV_REG_PC instead of the stale ELR_EL1, and redirect HV_REG_PC/CPSR directly; skip the X8==2 marker since there is no shim frame to drop.

test-fork now passes 20/20 (was ~7/20); make check is green with no hang.

test-fork's phase-2 signal child spins in `while (!got_usr1) usleep()` waiting for a SIGUSR1 sent cross-process by its parent. The signal was delivered only ~35% of the time and lost the rest, so `make check` hung at test-fork until the 60s per-test timeout -- often longer, as leaked `elfuse --fork-child` orphans kept the driver's stdout pipe open. Two complementary defects: 1. vcpu_run_loop treated HV_EXIT_REASON_UNKNOWN as fatal. A host SIGUSR2 (the cross-process guest-signal transport) that interrupts hv_vcpu_run mid-execution aborts the run with UNKNOWN rather than the clean CANCELED that hv_vcpus_exit() produces for a vCPU caught between runs. Route UNKNOWN through the same cancellation handling so the already-queued guest signal is delivered instead of crashing the child. 2. signal_deliver redirected to the handler only via ELR_EL1, which takes effect solely on an ERET from EL1 (the syscall-return path, gated by the shim's X8==2 exec_drop_frame marker). When the signal is delivered from the cancellation branch -- i.e. the vCPU was preempted while running EL0 code (cross-process SIGUSR2, or SIGALRM in a tight loop) -- there is no pending ERET, the resume uses HV_REG_PC, and the ELR_EL1 write is a no-op: the handler never runs and only the X0=signum clobber lands, re-running the interrupted nanosleep with a bogus arg and spinning forever. Detect EL0 preemption from the live PSTATE (CPSR M[3:0]==0), save the interrupted PC from HV_REG_PC instead of the stale ELR_EL1, and redirect HV_REG_PC/CPSR directly; skip the X8==2 marker since there is no shim frame to drop. test-fork now passes 20/20 (was ~7/20); `make check` is green with no hang.

cubic-dev-ai

1 issue found across 2 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/syscall/proc.c">

<violation number="1" location="src/syscall/proc.c:1945">
P2: Treating every `HV_EXIT_REASON_UNKNOWN` as cancelation can mask real hypervisor failures and cause silent retry loops when no signal is pending.</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

cubic-dev-ai · 2026-06-06T04:17:53Z

            }
-        } else if (vexit->reason == HV_EXIT_REASON_CANCELED) {
+        } else if (vexit->reason == HV_EXIT_REASON_CANCELED ||
+                   vexit->reason == HV_EXIT_REASON_UNKNOWN) {


P2: Treating every HV_EXIT_REASON_UNKNOWN as cancelation can mask real hypervisor failures and cause silent retry loops when no signal is pending.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/proc.c, line 1945: <comment>Treating every `HV_EXIT_REASON_UNKNOWN` as cancelation can mask real hypervisor failures and cause silent retry loops when no signal is pending.</comment> <file context> @@ -1941,11 +1941,22 @@ int vcpu_run_loop(hv_vcpu_t vcpu, } - } else if (vexit->reason == HV_EXIT_REASON_CANCELED) { + } else if (vexit->reason == HV_EXIT_REASON_CANCELED || + vexit->reason == HV_EXIT_REASON_UNKNOWN) { /* Canceled by hv_vcpus_exit(). Can be: alarm timeout, * exit_group from another thread, or signal preemption </file context>

jserv · 2026-06-06T15:33:26Z

+        } else if (vexit->reason == HV_EXIT_REASON_CANCELED ||
+                   vexit->reason == HV_EXIT_REASON_UNKNOWN) {


Add proper comments.

jserv

Core fix looks right -- the UNKNOWN/CANCELED unification and the EL0-preempt redirect via HV_REG_PC/CPSR both match the bug description. One follow-up worth folding in:

src/syscall/proc.c:1995 -- rseq IP fixup has the same root cause this PR fixes in signal.c. The rseq_try_abort call in the CANCELED branch reads ELR_EL1 unconditionally, so for an EL0-preempted vCPU it sees the stale syscall-return PC and won't fire the IP fixup for a critical section actually interrupted at EL0. Partial cover: when signal_pending() is also true on the same exit, signal_deliver re-runs rseq_try_abort with the el0_preempt-corrected PC, so the abort still happens. The gap is exits with no queued signal (fork-barrier wakeups, future ptrace wake paths). Mirror the CPSR-based PC selection here for symmetry:

uint64_t cur_pc, cur_cpsr = 0;
hv_vcpu_get_reg(vcpu, HV_REG_CPSR, &cur_cpsr);
bool el0 = (cur_cpsr & 0xfULL) == 0;
if (el0)
    hv_vcpu_get_reg(vcpu, HV_REG_PC, &cur_pc);
else
    hv_vcpu_get_sys_reg(vcpu, HV_SYS_REG_ELR_EL1, &cur_pc);
int rseq_rc = rseq_try_abort(g, current_thread->rseq_gva,
                             current_thread->rseq_signature, &cur_pc);
if (rseq_rc == 1) {
    if (el0)
        hv_vcpu_set_reg(vcpu, HV_REG_PC, cur_pc);
    else
        hv_vcpu_set_sys_reg(vcpu, HV_SYS_REG_ELR_EL1, cur_pc);
}

See inline note on the UNKNOWN routing.

jserv · 2026-06-06T15:43:11Z

            }
-        } else if (vexit->reason == HV_EXIT_REASON_CANCELED) {
+        } else if (vexit->reason == HV_EXIT_REASON_CANCELED ||
+                   vexit->reason == HV_EXIT_REASON_UNKNOWN) {


Routing UNKNOWN through the same fall-through as CANCELED is broad: if HVF ever returns UNKNOWN for a genuine fault (not the SIGUSR2 race), the loop will silently retry instead of taking the unexpected exit reason crash path at the end of this switch. Consider gating it on something actionable being present after the drain, e.g.:

} else if (vexit->reason == HV_EXIT_REASON_CANCELED || (vexit->reason == HV_EXIT_REASON_UNKNOWN && (signal_pending() || proc_exit_group_requested() || (is_main && g_timed_out)))) {

Per the PR description the queued guest signal is already drained before we reach the switch, so signal_pending() is true for the cross-process SIGUSR2 race this is targeting -- but a genuine HVF error with no signal queued would still crash visibly instead of looping.

cubic-dev-ai Bot reviewed Jun 6, 2026

View reviewed changes

This was referenced Jun 6, 2026

Route vCPU preemption signals through a dedicated sigwait thread to eliminate HV_EXIT_REASON_UNKNOWN #77

Open

Deliver synchronous fault signals to the faulting thread #86

Open

jserv reviewed Jun 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix cross-process signal delivery to EL0-preempted guests#76

Fix cross-process signal delivery to EL0-preempted guests#76
Max042004 wants to merge 1 commit into
sysprog21:mainfrom
Max042004:fix-cross-process-signal-el0

Max042004 commented Jun 6, 2026 •

edited

Loading

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

cubic-dev-ai Bot Jun 6, 2026

Uh oh!

jserv Jun 6, 2026

Uh oh!

jserv left a comment

Uh oh!

jserv Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		} else if (vexit->reason == HV_EXIT_REASON_CANCELED \|\|
		vexit->reason == HV_EXIT_REASON_UNKNOWN) {

Conversation

Max042004 commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

jserv Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

jserv left a comment

Choose a reason for hiding this comment

Uh oh!

jserv Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Max042004 commented Jun 6, 2026 •

edited

Loading