Skip to content

Fix madvise(MADV_DONTNEED) on high-VA mmap regions#83

Open
Max042004 wants to merge 2 commits into
sysprog21:mainfrom
Max042004:fix-madvise-high-va
Open

Fix madvise(MADV_DONTNEED) on high-VA mmap regions#83
Max042004 wants to merge 2 commits into
sysprog21:mainfrom
Max042004:fix-madvise-high-va

Conversation

@Max042004
Copy link
Copy Markdown
Collaborator

@Max042004 Max042004 commented Jun 6, 2026

Problem

sys_madvise was primary-window-only: it computed off = addr - ipa_base
(GUEST_IPA_BASE == 0) and rejected any range past guest_size with
-ENOMEM. High-VA mmap regions -- used by Rosetta's own slab/JIT and by guest
JITs such as V8 -- therefore failed every MADV_DONTNEED, even though
sys_mprotect already handles the same high-VA range. That asymmetry is the
bug.

V8's page allocator decommits guard/code pages with
mprotect(PROT_NONE) + madvise(MADV_DONTNEED) and CHECK_EQ(0, ret)s the
madvise return, so the spurious -ENOMEM aborted x86_64 Node.js the moment its
JIT initialized:

# Fatal error in , line 0
# Check failed: 0 == ret.

A --verbose trace pins it to the last syscall before the abort:
madvise(advice=4 MADV_DONTNEED) -> -12 (ENOMEM) on a page in the high-VA
window (mmap(NULL) under Rosetta lands at e.g. 0x7fffff7fd000).

Fix

src/syscall/mem.c: accept high-VA ranges the region tracker records as mapped
(madvise_range_mapped), and resolve the zero target through
host_ptr_for_gpa(gpa_base + ...) instead of host_base + off so the reset
hits the real backing (overlays included). Identity regions are unchanged
(gpa_base == start). High-VA file-backed restore is left as zero-fill, which
no current JIT guest exercises.

Test

tests/x86_64-rosetta-madvise.c (vendored x86_64 static ELF, run via Rosetta):
mmap(NULL) lands high-VA, then MADV_DONTNEED is exercised on a writable
page, on a PROT_NONE guard page (the exact V8 decommit pattern), and across a
multi-page span. Each must return 0 and zero-fill. Wired up as
make test-rosetta-madvise (and test-rosetta-all) plus the rosetta-madvise
suite in tests/test-matrix.sh.

  • Before the fix: every subtest fails with madvise rc=-1 errno=12.
  • After the fix: Results: 1 passed, 0 failed.

Validation (local, Apple M4 + RosettaLinux)

The macOS CI runners can't run guest tests (no HVF entitlement), so this was
verified locally:

  • make test-rosetta-madvise -> OK (and fails as expected on a pre-fix build)
  • make check -> 92/93; the one failure is the pre-existing test-fork flake
    (the EL0-preempt fix is not yet on main), unrelated to this change and not
    reproducible on re-run
  • clang-format (v22), trailing-newline, cppcheck, and the banned-API security
    scan all clean

Summary by cubic

Fixes madvise(MADV_DONTNEED) on high-VA mmap regions by accepting mapped ranges outside the primary IPA window and zeroing via GPA-backed host pointers. This unblocks JITs under Rosetta (e.g., V8/Node.js) and brings sys_madvise in line with sys_mprotect.

  • Bug Fixes

    • src/syscall/mem.c: allow high-VA ranges if madvise_range_mapped returns true; detect primary vs high-VA window.
    • Zero pages via host_ptr_for_gpa(r->gpa_base + ...) so resets hit the real backing (overlays included).
    • Keep file-backed restore via host-offset only for identity regions; high-VA file-backed remains zero-fill.
  • New Features

    • Added Rosetta regression test: vendored x86_64-rosetta-madvise and tests/test-rosetta-madvise.sh; new make test-rosetta-madvise, included in test-rosetta-all and tests/test-matrix.sh.

Written for commit 38c26c5. Summary will update on new commits.

Review in cubic

Max042004 added 2 commits June 6, 2026 21:28
sys_madvise was primary-window-only: it computed off = addr - ipa_base
(GUEST_IPA_BASE = 0) and rejected anything past guest_size with ENOMEM.
High-VA mmap regions -- used by Rosetta's own slab/JIT and by guest JITs
such as V8 -- therefore failed every MADV_DONTNEED, even though
sys_mprotect already handles the same high-VA range. V8's page allocator
decommits guard/code pages with mprotect(PROT_NONE)+madvise(MADV_DONTNEED)
and asserts CHECK_EQ(0, ret) on the madvise return, so the ENOMEM aborted
x86_64 Node.js the moment its JIT initialized.

Accept high-VA ranges the region tracker records as mapped, and resolve
the zero target through host_ptr_for_gpa(gpa_base + ...) instead of
host_base + off so the reset hits the real backing (overlays included).
Identity regions are unchanged (gpa_base == start). High-VA file-backed
restore is left as zero-fill, which no current JIT guest exercises.
Covers the sys_madvise high-VA fix: a vendored x86_64 static ELF whose
anonymous mmap(NULL) lands in the high-VA window, then exercises
MADV_DONTNEED on a writable page, on a PROT_NONE guard page (the V8
decommit pattern), and across a multi-page span. Each must return 0 and
zero-fill; before the fix every madvise returned -ENOMEM.

Wired up as the `test-rosetta-madvise` make target (also in
test-rosetta-all) and the rosetta-madvise suite in tests/test-matrix.sh,
and documented in tests/fixtures/rosetta/README.md. The fixture is an
x86_64 Linux ELF, not built in-tree; rebuild out of tree per the README.
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 7 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="tests/test-rosetta-madvise.sh">

<violation number="1" location="tests/test-rosetta-madvise.sh:66">
P2: Script exits 0 even when tests fail — missing `if [ "$fail" -gt 0 ]; then exit 1; fi` after `report_summary`. This masks test failures in standalone runs and weakens the matrix runner's belt-and-suspenders `|| rc=1` check.</violation>
</file>

<file name="src/syscall/mem.c">

<violation number="1" location="src/syscall/mem.c:3005">
P1: Missing `off + length` overflow guard can make invalid high-VA madvise ranges pass mapping checks and return success.</violation>

<violation number="2" location="src/syscall/mem.c:3005">
P2: The high-VA admission check is too loose: it can approve ranges based only on region coverage even though the rest of `sys_madvise()` still uses the low-IPA `off` layout.</violation>

<violation number="3" location="src/syscall/mem.c:3065">
P2: `MADV_DONTNEED` on high-VA file-backed regions now zero-fills without restoring file contents, producing incorrect data after a successful call.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread src/syscall/mem.c
* slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window
* via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with
* ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */
bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Missing off + length overflow guard can make invalid high-VA madvise ranges pass mapping checks and return success.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/mem.c, line 3005:

<comment>Missing `off + length` overflow guard can make invalid high-VA madvise ranges pass mapping checks and return success.</comment>

<file context>
@@ -2997,7 +2997,13 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice)
+     * slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window
+     * via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with
+     * ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */
+    bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);
+    if (!in_primary && !madvise_range_mapped(g, off, length))
         return -LINUX_ENOMEM;
</file context>
Suggested change
bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);
if (off > UINT64_MAX - length)
return -LINUX_ENOMEM;
bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);

printf '%s\n' "$madv_out" >&2
fi

report_summary "$total"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Script exits 0 even when tests fail — missing if [ "$fail" -gt 0 ]; then exit 1; fi after report_summary. This masks test failures in standalone runs and weakens the matrix runner's belt-and-suspenders || rc=1 check.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/test-rosetta-madvise.sh, line 66:

<comment>Script exits 0 even when tests fail — missing `if [ "$fail" -gt 0 ]; then exit 1; fi` after `report_summary`. This masks test failures in standalone runs and weakens the matrix runner's belt-and-suspenders `|| rc=1` check.</comment>

<file context>
@@ -0,0 +1,66 @@
+    printf '%s\n' "$madv_out" >&2
+fi
+
+report_summary "$total"
</file context>

Comment thread src/syscall/mem.c
* identity regions gpa_base == start, so this is unchanged. */
uint64_t rgpa = r->gpa_base + (zstart - r->start);
memset(host_ptr_for_gpa(g, rgpa), 0, zend - zstart);
if (!(r->flags & LINUX_MAP_ANONYMOUS) && r->gpa_base == r->start) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: MADV_DONTNEED on high-VA file-backed regions now zero-fills without restoring file contents, producing incorrect data after a successful call.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/mem.c, line 3065:

<comment>`MADV_DONTNEED` on high-VA file-backed regions now zero-fills without restoring file contents, producing incorrect data after a successful call.</comment>

<file context>
@@ -3050,11 +3056,18 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice)
+             * identity regions gpa_base == start, so this is unchanged. */
+            uint64_t rgpa = r->gpa_base + (zstart - r->start);
+            memset(host_ptr_for_gpa(g, rgpa), 0, zend - zstart);
+            if (!(r->flags & LINUX_MAP_ANONYMOUS) && r->gpa_base == r->start) {
                 /* EOF leaves the tail zero per mmap rules; the helper
                  * already returns 0 in that case after stopping the
</file context>

Comment thread src/syscall/mem.c
* slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window
* via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with
* ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */
bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: The high-VA admission check is too loose: it can approve ranges based only on region coverage even though the rest of sys_madvise() still uses the low-IPA off layout.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/mem.c, line 3005:

<comment>The high-VA admission check is too loose: it can approve ranges based only on region coverage even though the rest of `sys_madvise()` still uses the low-IPA `off` layout.</comment>

<file context>
@@ -2997,7 +2997,13 @@ int64_t sys_madvise(guest_t *g, uint64_t addr, uint64_t length, int advice)
+     * slab/JIT and guest JITs (e.g. V8) decommit pages in the high-VA window
+     * via mprotect(PROT_NONE)+madvise(MADV_DONTNEED); rejecting those with
+     * ENOMEM trips the guest's CHECK_EQ(0, ret) on the madvise return. */
+    bool in_primary = (off <= g->guest_size && length <= g->guest_size - off);
+    if (!in_primary && !madvise_range_mapped(g, off, length))
         return -LINUX_ENOMEM;
</file context>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant