Back read-only MAP_SHARED file mappings with MAP_PRIVATE #84
A MAP_SHARED, PROT_READ mapping of a file opened O_RDONLY could never be
installed. hvf_apply_file_overlay_quiesced() always mmap'd the host page
PROT_READ|PROT_WRITE and mapped the HVF segment RWX. On a read-only fd the
host mmap fails with EACCES (writable mapping of an O_RDONLY fd); forcing
PROT_READ then trips hv_vm_map(), because a MAP_SHARED mapping of an
O_RDONLY fd has macOS max_protection=READ and HVF cannot grant stage-2
rights (RWX) beyond the host region's max_protection (HV_ERROR).
This blocked every workload that maps a read-only file MAP_SHARED -- most
visibly the JVM, which maps its ~135 MiB lib/modules image exactly this
way and crashed on startup.
Choose the host backing from what the fd and the guest actually need:
- guest wants PROT_WRITE: MAP_SHARED PROT_READ|PROT_WRITE (writes reach
  the file; an O_RDONLY fd still yields EACCES, matching Linux).
- guest read-only on a writable fd: MAP_SHARED PROT_READ (max_protection
  is RWX, so the segment maps and cross-mapping coherence is preserved).
- guest read-only on an O_RDONLY fd: MAP_PRIVATE PROT_READ. Its
  max_protection is RWX so the segment maps; the pages still show file
  content, and the guest's stage-1 tables keep the region read-only so
  the private copy is never dirtied -- no observable MAP_SHARED
  divergence for a read-only mapping.
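The three cases above can be sketched as a small pure function. This is an illustration only, not the code in the patch: host_backing_t and choose_host_backing() are hypothetical names, and the caller is assumed to have already checked that fcntl(fd, F_GETFL) succeeded.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical result type: the host mmap() arguments for one overlay. */
typedef struct {
    int prot;  /* PROT_* bits passed to the host mmap() */
    int flags; /* MAP_SHARED or MAP_PRIVATE */
} host_backing_t;

/* Pick the host backing from the fd's access mode and the guest's request.
 * fd_flags is the (non-negative) value returned by fcntl(fd, F_GETFL). */
static host_backing_t choose_host_backing(int fd_flags, bool want_write)
{
    assert(fd_flags >= 0); /* assumption: fcntl failure handled by the caller */

    int mode = fd_flags & O_ACCMODE;
    bool fd_writable = (mode == O_RDWR || mode == O_WRONLY);
    host_backing_t b;

    /* Only ask the host for write access when the guest asked for it. */
    b.prot = want_write ? (PROT_READ | PROT_WRITE) : PROT_READ;
    /* MAP_PRIVATE is used only for a read-only request on a read-only fd. */
    b.flags = (want_write || fd_writable) ? MAP_SHARED : MAP_PRIVATE;
    return b;
}
```

With this shape, a writable request on an O_RDONLY fd still ends up as MAP_SHARED with PROT_READ|PROT_WRITE, so the host mmap() itself reports EACCES and no separate check is needed.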
The guest-requested prot is threaded through hvf_apply_file_overlay(),
hvf_apply_file_overlay_quiesced(), and restore_file_overlay_range() so
every overlay install/restore site picks the correct backing.
Add test-mmap-shared-ro covering the O_RDONLY read path, a second
concurrent read-only mapping, EACCES on a writable request, and the
read-only-mapping-on-O_RDWR-fd branch.
(cherry picked from commit 337d39a4313109884112a86a0c4147bddfe18fa1)
bool fd_writable = acc >= 0 && ((acc & O_ACCMODE) == O_RDWR ||
                                (acc & O_ACCMODE) == O_WRONLY);
int host_prot = want_write ? (PROT_READ | PROT_WRITE) : PROT_READ;
int share = (want_write || fd_writable) ? MAP_SHARED : MAP_PRIVATE;
The MAP_PRIVATE substitution is sound while the mapping stays read-only, but sys_mprotect at mem.c:3275 doesn't know about the backing decision -- it just calls guest_region_set_prot + guest_update_perms(prot_to_perms(prot)). A guest that does:
int fd = open(path, O_RDONLY);
char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
mprotect(p, len, PROT_READ | PROT_WRITE); // Linux: EACCES; here: succeeds
*p = 0xff; // writes to the COW copy, not the file
will silently upgrade stage-1 to RW and write into the COW copy. Linux returns EACCES because the mapping remembers max_prot=READ from the O_RDONLY fd. Before this PR the upgrade was unreachable (the initial mmap failed); the MAP_PRIVATE fallback exposes it.
The cleanest fix is to track a host_backing_kind (or max_prot) on guest_region_t, set it when this branch is taken, carry it through snapshots/splits/merges/mremap, and have sys_mprotect return -LINUX_EACCES when PROT_WRITE would exceed it. That also closes the downstream gap where sys_msync at mem.c:3560 skips its pwrite-refresh path for overlay_active=true regions on the assumption "the page cache already keeps them coherent with the file" -- false for a MAP_PRIVATE backing.
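A minimal sketch of the max_prot idea, assuming a stand-in struct rather than the real guest_region_t. The names region_sketch_t, region_init(), and region_mprotect() are hypothetical, and host EACCES stands in for LINUX_EACCES:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the per-region bookkeeping; the real
 * guest_region_t has other fields that are not shown here. */
typedef struct {
    int prot;     /* current guest protection (PROT_* bits) */
    int max_prot; /* ceiling fixed at mmap time from the fd's access mode */
} region_sketch_t;

/* Record the ceiling when the mapping is created. A MAP_SHARED mapping of a
 * non-writable fd may never gain PROT_WRITE; a private mapping may. */
static void region_init(region_sketch_t *r, int prot, bool shared,
                        bool fd_writable)
{
    assert(r != NULL);
    r->prot = prot;
    r->max_prot = PROT_READ | PROT_WRITE | PROT_EXEC;
    if (shared && !fd_writable)
        r->max_prot &= ~PROT_WRITE;
}

/* Returns 0 and updates the region, or -EACCES (standing in for
 * -LINUX_EACCES) when the request exceeds the recorded ceiling. */
static int region_mprotect(region_sketch_t *r, int new_prot)
{
    assert(r != NULL);
    if (new_prot & ~r->max_prot)
        return -EACCES;
    r->prot = new_prot;
    return 0;
}
```

The ceiling is derived from what the guest asked for (MAP_SHARED on a read-only fd), not from the host backing that was actually chosen, so the MAP_PRIVATE substitution stays invisible to the guest. The field would still have to be copied through snapshots, splits, merges, and mremap, which this sketch does not attempt.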
 * divergence for a read-only mapping).
 */
bool want_write = (prot & LINUX_PROT_WRITE) != 0;
int acc = fcntl(fd, F_GETFL);
Treating fcntl failure as "not writable" silently picks MAP_PRIVATE for a valid writable fd whose F_GETFL transiently failed. Vanishingly rare on a host fd elfuse already holds via host_fd_ref_open, but the failure mode (losing MAP_SHARED coherence) is silent rather than surfaced.
Two options: return -linux_errno() on acc < 0, or hoist fd-writability detection up to where host_backing_fd is resolved and thread a plain bool fd_writable through. The latter also eliminates the per-install fcntl on the hot mmap path.
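The first option could look roughly like the helper below. It is a sketch under assumptions: fd_writable_checked() is a hypothetical name, and it returns a negative host errno where the real code would presumably return -linux_errno().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: report whether fd was opened for writing.
 * Returns 0 and sets *writable on success, or a negative host errno
 * when F_GETFL fails, rather than treating the failure as "read-only". */
static int fd_writable_checked(int fd, bool *writable)
{
    assert(writable != NULL);

    int acc = fcntl(fd, F_GETFL);
    if (acc < 0)
        return -errno; /* surface the failure to the caller */

    int mode = acc & O_ACCMODE;
    *writable = (mode == O_RDWR || mode == O_WRONLY);
    return 0;
}
```

Calling this once where host_backing_fd is resolved and passing the resulting bool down would also cover the second option, since the overlay install path would no longer need its own fcntl call.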
/* Several guest pages so the overlay spans more than one host page and the
 * containing 2 MiB segment is split and remapped over a realistic range. */
#define NPAGES 64
64 x 4 KiB = 256 KiB stays entirely within one 2 MiB segment, so hvf_segment_split's multi-block path isn't exercised. JVM lib/modules is ~135 MiB and crosses many. Bump to at least NPAGES 768 (3 MiB, two segments) so the segment-split + per-page-marker check catches a misaligned split.
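The arithmetic behind that suggestion, as a small sketch. It assumes 4 KiB guest pages and 2 MiB segments as stated above; segments_spanned() and the two macros are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K ((uint64_t)4096)
#define SEGMENT_SIZE ((uint64_t)2 * 1024 * 1024)

/* Number of 2 MiB segments touched by npages 4 KiB pages starting at addr. */
static uint64_t segments_spanned(uint64_t addr, uint64_t npages)
{
    assert(npages > 0);
    uint64_t first = addr / SEGMENT_SIZE;
    uint64_t last = (addr + npages * PAGE_SIZE_4K - 1) / SEGMENT_SIZE;
    return last - first + 1;
}
```

For a segment-aligned start, segments_spanned(0, 64) is 1 and segments_spanned(0, 768) is 2. The count depends on where the mapping lands: 64 pages starting one page before a segment boundary also touch two segments, so a test that needs the multi-segment path should size the mapping so that it crosses a boundary at any placement.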
A further test would lock in the corner this PR introduces:
// Linux returns EACCES; with the MAP_PRIVATE fallback in place but no
// backing-kind tracking, elfuse currently lets this through silently.
static void test_rdonly_mprotect_write_rejected(const char *path) {
int fd = open(path, O_RDONLY);
char *p = mmap(NULL, FILE_LEN, PROT_READ, MAP_SHARED, fd, 0);
EXPECT_EQ(mprotect(p, FILE_LEN, PROT_READ | PROT_WRITE), -1, "must reject");
EXPECT_EQ(errno, EACCES, "errno must be EACCES");
munmap(p, FILE_LEN); close(fd);
}
Summary by cubic
Fixes read-only MAP_SHARED mappings of O_RDONLY files by backing them with MAP_PRIVATE when needed. This unblocks common workloads (e.g., JVM lib/modules) and restores Linux-compatible behavior.
- Guest wants write: MAP_SHARED | PROT_READ|PROT_WRITE (returns EACCES on O_RDONLY, matching Linux).
- Guest read-only on a writable fd: MAP_SHARED | PROT_READ.
- Guest read-only on an O_RDONLY fd: MAP_PRIVATE | PROT_READ (segment maps; no divergence since guest pages stay read-only).
- Threads prot through overlay paths (apply/restore, sys_mmap, mremap, and fork install/restore) so each site picks the correct backing.
- A writable request on an O_RDONLY fd yields EACCES.
- Adds test-mmap-shared-ro and a manifest entry covering:
  - read-only MAP_SHARED on O_RDONLY.
  - a second concurrent MAP_SHARED on O_RDONLY.
  - a read-only mapping on an O_RDWR fd.
Written for commit ace1dd6.