fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed#51
Merged
Conversation
PR #50's socket-probe tiebreaker only catches orphans whose api.sock is still listening. If pidfile and api.sock are both pre-removed before the VMM exits (observed on GKE prod after vk-cocoon rapid restart + CH InvalidStateTransition), DeleteAll's pidfile-based stop returns ErrNotRunning, the probe returns ENOENT, and the VMM survives as a PPID=1 orphan with no rundir. Add utils.FindVMMByCmdline as a /proc scan fallback keyed on the api-socket path (already unique per VM). Wire it into: - WithRunningVM: recover the live pid when pidfile/socket are gone - DeleteAll: second-pass after socket probe to catch sibling/worker pids Repro: sleep + rm pidfile + rm api.sock + cocoon vm rm --force leaves a CH orphan. With the fix, the cmdline scan recovers and SIGKILLs it.
- Reorder utils/process_*.go so FindVMMByCmdline sits above verifyProcessCmdline (matches sparse_linux.go / reflink_linux.go). - Rename the FindVMMByCmdline marker param to expectArg for consistency with VerifyProcessCmdline / TerminateProcess / pidfd_linux.go.
There was a problem hiding this comment.
Pull request overview
This PR hardens VM teardown to prevent orphaned VMM processes when runtime artifacts (pidfile and/or api.sock) are removed before the VMM exits, by adding a /proc/<pid>/cmdline scan fallback and using it to recover/kill still-running VMMs.
Changes:
- Add
utils.FindVMMByCmdlineto locate candidate VMM PIDs by scanning/proc/*/cmdline(Linux) with a no-op stub on non-Linux. - Update
Backend.WithRunningVMto recover a live PID via cmdline scan when the pidfile path can’t validate a running process. - Update
Backend.DeleteAllto do a post-stop/procsweep and kill any remaining worker/sibling VMM processes.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| utils/process_linux.go | Implements /proc cmdline scanning helper to find candidate VMM PIDs on Linux. |
| utils/process_other.go | Adds non-Linux stub for the new cmdline scan helper. |
| utils/process_test.go | Adds a unit test intended to validate the cmdline scan behavior. |
| hypervisor/state.go | Uses cmdline scan fallback in WithRunningVM to recover live PID(s) when pidfile validation fails. |
| hypervisor/stop.go | Adds post-stop cmdline scan sweep in DeleteAll to kill leftover VMM processes before deleting runtime dirs/DB record. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- utils/process_linux.go: slices.Sort the returned pids so callers get a deterministic smallest-pid choice (Copilot caught the /proc lexicographic ordering trap, e.g. "100" < "11"). - hypervisor/state.go: fail-closed when /proc scan errors after pidfile-based check fails; previously returned ErrNotRunning on inconclusive state, which could let start/delete proceed against a still-running VM. - hypervisor/stop.go: fail-closed in DeleteAll second-pass when /proc scan errors; previously dropped scanErr and risked re-introducing the orphan leak the PR is trying to fix. - utils/process_test.go: replace flaky "sleep marker 60" (sleep rejects non-numeric arg and exits immediately) with "sh -c 'sleep 60 && :' marker" (compound prevents sh tail-exec into sleep). Gate on runtime.GOOS == "linux".
…ding state.go + stop.go: add "(resolve the host issue and retry)" actionable-hint clause so the new scan-error wraps match the existing socket-probe error format.
… closed Copilot round-3 finding: verifyProcessCmdline returned (false, available=false) on permission/IO errors reading /proc/<pid>/cmdline (e.g. hidepid/EPERM), and FindVMMByCmdline silently dropped that signal — a hidepid environment could mask the real VMM and reintroduce the orphan leak. Refactor verifyProcessCmdline to return (bool, error); FindVMMByCmdline now distinguishes ENOENT (transient race, safe to skip) from any other read error (fail-closed, return wrapped first error). VerifyProcessCmdline wrapper preserves the "fall back to IsProcessAlive on error" semantic.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
utils.FindVMMByCmdline/proc scan as a fallback when the pidfile-based stop path can't see a live VMM.WithRunningVM(recover the pid) andDeleteAll(post-stop sweep for workers/siblings).Background
PR #50 added the socket-probe tiebreaker, which only catches orphans whose
api.sockis still listening. The GKE prod scan today turned up 3 orphan VMMs whose api.sock had been wiped along with the rundir — the chain was:InvalidStateTransition(Paused, Paused)is a separate upstream CH bug (non-idempotentVm::pause()invmm/src/vm.rs:3239—valid_transitionrejectsPaused → Paused); investigating separately. This PR closes the orphan-leak hole on the cocoon side regardless of what triggers it.Reproduction (testbed
35.240.182.52)Test plan
make lintpasses on bothGOOS=linuxandGOOS=darwin(0 issues)go test ./utils/ ./hypervisor/passes; newTestFindVMMByCmdlinecovers the happy path + binary/marker negativesvm rmpath still works