Fix madvise(MADV_DONTNEED) on high-VA mmap regions#83
Conversation
sys_madvise was primary-window-only: it computed off = addr - ipa_base (GUEST_IPA_BASE = 0) and rejected anything past guest_size with ENOMEM. High-VA mmap regions -- used by Rosetta's own slab/JIT and by guest JITs such as V8 -- therefore failed every MADV_DONTNEED, even though sys_mprotect already handles the same high-VA range. V8's page allocator decommits guard/code pages with mprotect(PROT_NONE)+madvise(MADV_DONTNEED) and asserts CHECK_EQ(0, ret) on the madvise return, so the ENOMEM aborted x86_64 Node.js the moment its JIT initialized. Accept high-VA ranges the region tracker records as mapped, and resolve the zero target through host_ptr_for_gpa(gpa_base + ...) instead of host_base + off so the reset hits the real backing (overlays included). Identity regions are unchanged (gpa_base == start). High-VA file-backed restore is left as zero-fill, which no current JIT guest exercises.
Covers the sys_madvise high-VA fix: a vendored x86_64 static ELF whose anonymous mmap(NULL) lands in the high-VA window, then exercises MADV_DONTNEED on a writable page, on a PROT_NONE guard page (the V8 decommit pattern), and across a multi-page span. Each must return 0 and zero-fill; before the fix every madvise returned -ENOMEM. Wired up as the `test-rosetta-madvise` make target (also in test-rosetta-all) and the rosetta-madvise suite in tests/test-matrix.sh, and documented in tests/fixtures/rosetta/README.md. The fixture is an x86_64 Linux ELF, not built in-tree; rebuild out of tree per the README.
There was a problem hiding this comment.
4 issues found across 7 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="tests/test-rosetta-madvise.sh">
<violation number="1" location="tests/test-rosetta-madvise.sh:66">
P2: Script exits 0 even when tests fail — missing `if [ "$fail" -gt 0 ]; then exit 1; fi` after `report_summary`. This masks test failures in standalone runs and weakens the matrix runner's belt-and-suspenders `|| rc=1` check.</violation>
</file>
<file name="src/syscall/mem.c">
<violation number="1" location="src/syscall/mem.c:3005">
P1: Missing `off + length` overflow guard can make invalid high-VA madvise ranges pass mapping checks and return success.</violation>
<violation number="2" location="src/syscall/mem.c:3005">
P2: The high-VA admission check is too loose: it can approve ranges based only on region coverage even though the rest of `sys_madvise()` still uses the low-IPA `off` layout.</violation>
<violation number="3" location="src/syscall/mem.c:3065">
P2: `MADV_DONTNEED` on high-VA file-backed regions now zero-fills without restoring file contents, producing incorrect data after a successful call.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| * slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window | ||
| * via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with | ||
| * ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */ | ||
| bool in_primary = (off <= g->guest_size && length <= g->guest_size - off); |
There was a problem hiding this comment.
P1: Missing off + length overflow guard can make invalid high-VA madvise ranges pass mapping checks and return success.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/mem.c, line 3005:
<comment>Missing `off + length` overflow guard can make invalid high-VA madvise ranges pass mapping checks and return success.</comment>
<file context>
@@ -2997,7 +2997,13 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice)
+ * slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window
+ * via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with
+ * ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */
+ bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);
+ if (!in_primary && !madvise_range_mapped(g, off, length))
return -LINUX_ENOMEM;
</file context>
| bool in_primary = (off <= g->guest_size && length <= g->guest_size - off); | |
| if (off > UINT64_MAX - length) | |
| return -LINUX_ENOMEM; | |
| bool in_primary = (off <= g->guest_size && length <= g->guest_size - off); |
| printf '%s\n' "$madv_out" >&2 | ||
| fi | ||
|
|
||
| report_summary "$total" |
There was a problem hiding this comment.
P2: Script exits 0 even when tests fail — missing if [ "$fail" -gt 0 ]; then exit 1; fi after report_summary. This masks test failures in standalone runs and weakens the matrix runner's belt-and-suspenders || rc=1 check.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/test-rosetta-madvise.sh, line 66:
<comment>Script exits 0 even when tests fail — missing `if [ "$fail" -gt 0 ]; then exit 1; fi` after `report_summary`. This masks test failures in standalone runs and weakens the matrix runner's belt-and-suspenders `|| rc=1` check.</comment>
<file context>
@@ -0,0 +1,66 @@
+ printf '%s\n' "$madv_out" >&2
+fi
+
+report_summary "$total"
</file context>
| * identity regions gpa_base == start, so this is unchanged. */ | ||
| uint64_t rgpa = r->gpa_base + (zstart - r->start); | ||
| memset(host_ptr_for_gpa(g, rgpa), 0, zend - zstart); | ||
| if (!(r->flags & LINUX_MAP_ANONYMOUS) && r->gpa_base == r->start) { |
There was a problem hiding this comment.
P2: MADV_DONTNEED on high-VA file-backed regions now zero-fills without restoring file contents, producing incorrect data after a successful call.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/mem.c, line 3065:
<comment>`MADV_DONTNEED` on high-VA file-backed regions now zero-fills without restoring file contents, producing incorrect data after a successful call.</comment>
<file context>
@@ -3050,11 +3056,18 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice)
+ * identity regions gpa_base == start, so this is unchanged. */
+ uint64_t rgpa = r->gpa_base + (zstart - r->start);
+ memset(host_ptr_for_gpa(g, rgpa), 0, zend - zstart);
+ if (!(r->flags & LINUX_MAP_ANONYMOUS) && r->gpa_base == r->start) {
/* EOF leaves the tail zero per mmap rules; the helper
* already returns 0 in that case after stopping the
</file context>
| * slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window | ||
| * via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with | ||
| * ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */ | ||
| bool in_primary = (off <= g->guest_size && length <= g->guest_size - off); |
There was a problem hiding this comment.
P2: The high-VA admission check is too loose: it can approve ranges based only on region coverage even though the rest of sys_madvise() still uses the low-IPA off layout.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/mem.c, line 3005:
<comment>The high-VA admission check is too loose: it can approve ranges based only on region coverage even though the rest of `sys_madvise()` still uses the low-IPA `off` layout.</comment>
<file context>
@@ -2997,7 +2997,13 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice)
+ * slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window
+ * via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with
+ * ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */
+ bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);
+ if (!in_primary && !madvise_range_mapped(g, off, length))
return -LINUX_ENOMEM;
</file context>
Problem
sys_madvisewas primary-window-only: it computedoff = addr - ipa_base(
GUEST_IPA_BASE == 0) and rejected any range pastguest_sizewith-ENOMEM. High-VA mmap regions -- used by Rosetta's own slab/JIT and by guestJITs such as V8 -- therefore failed every
MADV_DONTNEED, even thoughsys_mprotectalready handles the same high-VA range. That asymmetry is thebug.
V8's page allocator decommits guard/code pages with
mprotect(PROT_NONE) + madvise(MADV_DONTNEED)andCHECK_EQ(0, ret)s themadvise return, so the spurious
-ENOMEMaborted x86_64 Node.js the moment itsJIT initialized:
A
--verbosetrace pins it to the last syscall before the abort:madvise(advice=4 MADV_DONTNEED) -> -12 (ENOMEM)on a page in the high-VAwindow (
mmap(NULL)under Rosetta lands at e.g.0x7fffff7fd000).Fix
src/syscall/mem.c: accept high-VA ranges the region tracker records as mapped(
madvise_range_mapped), and resolve the zero target throughhost_ptr_for_gpa(gpa_base + ...)instead ofhost_base + offso the resethits the real backing (overlays included). Identity regions are unchanged
(
gpa_base == start). High-VA file-backed restore is left as zero-fill, whichno current JIT guest exercises.
Test
tests/x86_64-rosetta-madvise.c(vendored x86_64 static ELF, run via Rosetta):mmap(NULL)lands high-VA, thenMADV_DONTNEEDis exercised on a writablepage, on a
PROT_NONEguard page (the exact V8 decommit pattern), and across amulti-page span. Each must return 0 and zero-fill. Wired up as
make test-rosetta-madvise(andtest-rosetta-all) plus therosetta-madvisesuite in
tests/test-matrix.sh.madvise rc=-1 errno=12.Results: 1 passed, 0 failed.Validation (local, Apple M4 + RosettaLinux)
The macOS CI runners can't run guest tests (no HVF entitlement), so this was
verified locally:
make test-rosetta-madvise-> OK (and fails as expected on a pre-fix build)make check-> 92/93; the one failure is the pre-existingtest-forkflake(the EL0-preempt fix is not yet on
main), unrelated to this change and notreproducible on re-run
scan all clean
Summary by cubic
Fixes
madvise(MADV_DONTNEED)on high-VA mmap regions by accepting mapped ranges outside the primary IPA window and zeroing via GPA-backed host pointers. This unblocks JITs under Rosetta (e.g., V8/Node.js) and bringssys_madvisein line withsys_mprotect.Bug Fixes
src/syscall/mem.c: allow high-VA ranges ifmadvise_range_mappedreturns true; detect primary vs high-VA window.host_ptr_for_gpa(r->gpa_base + ...)so resets hit the real backing (overlays included).New Features
x86_64-rosetta-madviseandtests/test-rosetta-madvise.sh; newmake test-rosetta-madvise, included intest-rosetta-allandtests/test-matrix.sh.Written for commit 38c26c5. Summary will update on new commits.