Skip to content

fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed#51

Merged
CMGS merged 5 commits into
masterfrom
fix/orphan-vmm-cmdline-fallback
May 15, 2026
Merged

fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed#51
CMGS merged 5 commits into
masterfrom
fix/orphan-vmm-cmdline-fallback

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 15, 2026

Summary

  • Adds utils.FindVMMByCmdline /proc scan as a fallback when the pidfile-based stop path can't see a live VMM.
  • Hardens WithRunningVM (recover the pid) and DeleteAll (post-stop sweep for workers/siblings).
  • Repro: pidfile + api.sock removed before the VMM exits — current code declares the VM dead, wipes the rundir, deletes the DB record, but the VMM keeps running as a PPID=1 orphan.

Background

PR #50 added the socket-probe tiebreaker, which only catches orphans whose api.sock is still listening. The GKE prod scan today turned up 3 orphan VMMs whose api.sock had been wiped along with the rundir — the chain was:

CH snapshot save → InvalidStateTransition(Paused, Paused)
  → vm rm --force triggered
  → WithRunningVM (pidfile gone) → ErrNotRunning
  → socket-probe (api.sock also gone) → false
  → RemoveVMDirs + DB delete
  → VMM process still alive, now PPID=1 orphan

InvalidStateTransition(Paused, Paused) is a separate upstream CH bug (non-idempotent Vm::pause() in vmm/src/vm.rs:3239valid_transition rejects Paused → Paused); investigating separately. This PR closes the orphan-leak hole on the cocoon side regardless of what triggers it.

Reproduction (testbed 35.240.182.52)

cocoon vm run --nics 0 --memory 512M ghcr.io/cocoonstack/cocoon/ubuntu:24.04
rm /var/lib/cocoon/run/cloudhypervisor/$VMID/ch.pid
rm /var/lib/cocoon/run/cloudhypervisor/$VMID/api.sock
cocoon vm rm --force $VMID
# Before fix: deleted VM, but `ps | grep cloud-hypervisor` still shows the process
# After fix:  WARN "recovered live pids ... via cmdline scan", process is gone

Test plan

  • make lint passes on both GOOS=linux and GOOS=darwin (0 issues)
  • go test ./utils/ ./hypervisor/ passes; new TestFindVMMByCmdline covers the happy path + binary/marker negatives
  • testbed repro: pidfile-only sabotage → killed (socket-probe path) + pidfile+socket sabotage → killed (cmdline-scan fallback) + normal vm rm path still works
  • deploy to GKE prod, re-scan for orphans after next e2e run

CMGS added 2 commits May 15, 2026 16:05
PR #50's socket-probe tiebreaker only catches orphans whose api.sock is still
listening. If pidfile and api.sock are both pre-removed before the VMM exits
(observed on GKE prod after vk-cocoon rapid restart + CH InvalidStateTransition),
DeleteAll's pidfile-based stop returns ErrNotRunning, the probe returns
ENOENT, and the VMM survives as a PPID=1 orphan with no rundir.

Add utils.FindVMMByCmdline as a /proc scan fallback keyed on the api-socket
path (already unique per VM). Wire it into:
- WithRunningVM: recover the live pid when pidfile/socket are gone
- DeleteAll: second-pass after socket probe to catch sibling/worker pids

Repro: sleep + rm pidfile + rm api.sock + cocoon vm rm --force leaves a CH
orphan. With the fix, the cmdline scan recovers and SIGKILLs it.
- Reorder utils/process_*.go so FindVMMByCmdline sits above verifyProcessCmdline (matches sparse_linux.go / reflink_linux.go).
- Rename the FindVMMByCmdline marker param to expectArg for consistency with VerifyProcessCmdline / TerminateProcess / pidfd_linux.go.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens VM teardown to prevent orphaned VMM processes when runtime artifacts (pidfile and/or api.sock) are removed before the VMM exits, by adding a /proc/<pid>/cmdline scan fallback and using it to recover/kill still-running VMMs.

Changes:

  • Add utils.FindVMMByCmdline to locate candidate VMM PIDs by scanning /proc/*/cmdline (Linux) with a no-op stub on non-Linux.
  • Update Backend.WithRunningVM to recover a live PID via cmdline scan when the pidfile path can’t validate a running process.
  • Update Backend.DeleteAll to do a post-stop /proc sweep and kill any remaining worker/sibling VMM processes.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
utils/process_linux.go Implements /proc cmdline scanning helper to find candidate VMM PIDs on Linux.
utils/process_other.go Adds non-Linux stub for the new cmdline scan helper.
utils/process_test.go Adds a unit test intended to validate the cmdline scan behavior.
hypervisor/state.go Uses cmdline scan fallback in WithRunningVM to recover live PID(s) when pidfile validation fails.
hypervisor/stop.go Adds post-stop cmdline scan sweep in DeleteAll to kill leftover VMM processes before deleting runtime dirs/DB record.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread utils/process_test.go
Comment thread utils/process_linux.go
Comment thread hypervisor/state.go
Comment thread hypervisor/stop.go Outdated
- utils/process_linux.go: slices.Sort the returned pids so callers get a deterministic smallest-pid choice (Copilot caught the /proc lexicographic ordering trap, e.g. "100" < "11").
- hypervisor/state.go: fail-closed when /proc scan errors after pidfile-based check fails; previously returned ErrNotRunning on inconclusive state, which could let start/delete proceed against a still-running VM.
- hypervisor/stop.go: fail-closed in DeleteAll second-pass when /proc scan errors; previously dropped scanErr and risked re-introducing the orphan leak the PR is trying to fix.
- utils/process_test.go: replace flaky "sleep marker 60" (sleep rejects non-numeric arg and exits immediately) with "sh -c 'sleep 60 && :' marker" (compound prevents sh tail-exec into sleep). Gate on runtime.GOOS == "linux".
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Copilot finished work on behalf of CMGS May 15, 2026 08:27
Copilot finished work on behalf of CMGS May 15, 2026 08:28
Copilot finished work on behalf of CMGS May 15, 2026 08:30
…ding

state.go + stop.go: add "(resolve the host issue and retry)" actionable-hint clause so the new scan-error wraps match the existing socket-probe error format.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread utils/process_linux.go
… closed

Copilot round-3 finding: verifyProcessCmdline returned (false, available=false) on permission/IO errors reading /proc/<pid>/cmdline (e.g. hidepid/EPERM), and FindVMMByCmdline silently dropped that signal — a hidepid environment could mask the real VMM and reintroduce the orphan leak.

Refactor verifyProcessCmdline to return (bool, error); FindVMMByCmdline now distinguishes ENOENT (transient race, safe to skip) from any other read error (fail-closed, return wrapped first error). VerifyProcessCmdline wrapper preserves the "fall back to IsProcessAlive on error" semantic.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.

@CMGS CMGS merged commit c86938c into master May 15, 2026
9 checks passed
@CMGS CMGS deleted the fix/orphan-vmm-cmdline-fallback branch May 15, 2026 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants