fix(loopback): conflict canary also probes the HTTP port to catch lingering workers#20
Conversation
…gering workers The conflict canary only probed the operations port (main-thread, non-reusePort). That catches a fully-alive node, but misses the most common post-kill/post-teardown shape: the main thread has exited (freeing the operations port) while the HTTP worker threads still hold the address via SO_REUSEPORT. The ops-only probe reports such an address as free and hands it out, producing a silent SO_REUSEPORT co-bind on the HTTP port — connections split between two nodes, ECONNREFUSED storms, and cross-test contamination under CI sharding (observed in harper#1427). Extend the canary to probe BOTH the operations port and the HTTP port. A plain (non-reusePort) bind fails with EADDRINUSE against a lingering SO_REUSEPORT worker socket on Linux, so the HTTP-port probe detects exactly the lingering-worker case the ops-port probe blind-spots. Probe set is de-duplicated and both ports are overridable. Validated locally: with 127.0.0.2:9926 occupied, allocation now skips to 127.0.0.3 with '[loopback-pool] ... (port 9926 bound); skipping'. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a second conflict-canary port (HTTP_CONFLICT_PROBE_PORT) to detect lingering HTTP worker threads that might still hold a loopback address via SO_REUSEPORT after the main thread has exited. It refactors the address validation logic to probe multiple ports (CONFLICT_PROBE_PORTS) and returns the specific conflicting port if found. There are no review comments, and I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…acOS caveat) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
I am going to go ahead and merge and publish this, since this bug is causing a lot of frustration for parallel QA agents that I have running over the weekend. |
|
🎉 This PR is included in version 0.6.2 🎉 The release is available on: Your semantic-release bot 📦🚀 |
| const HTTP_CONFLICT_PROBE_PORT = (() => { | ||
| const parsed = parseInt(process.env.HARPER_INTEGRATION_TEST_HTTP_CONFLICT_PROBE_PORT || '', 10); | ||
| return Number.isNaN(parsed) || parsed < 0 || parsed > 65535 ? 9926 : parsed; | ||
| })(); |
There was a problem hiding this comment.
Why an IIFE ?
I guess its just copying CONFLICT_PROBE_PORT, which I have the same question for.
Problem
Under CI sharding, the harper integration suite (and harper-pro) hit cascading
ECONNREFUSED :9926failures where one test's instance becomes unreachable and the failure spreads to sibling tests in the same shard (observed concretely on harper#1427). Root cause is an SO_REUSEPORT co-bind that the conflict canary doesn't currently catch.Root cause
The conflict canary (added in #16/#17) probes only the operations port — main-thread, bound without SO_REUSEPORT — so a plain bind fails with
EADDRINUSEif a stale node still holds the address. That correctly catches a fully-alive node.But the most common post-kill / post-teardown shape is different: the main thread has already exited (freeing the operations port) while the HTTP worker threads still linger holding the address via
SO_REUSEPORT. Teardown waits only ~5s for ports to free and then recycles the address anyway, so under CI load the slot is handed to the next test while a worker still owns:9926. The ops-port-only canary reports that address as free → the next node co-binds the HTTP port via SO_REUSEPORT → connections split between two nodes →ECONNREFUSEDstorms.Fix
Extend the canary to probe both the operations port and the HTTP port. On Linux a plain (non-reusePort) bind still fails with
EADDRINUSEagainst a lingering SO_REUSEPORT socket, so the HTTP-port probe detects exactly the lingering-worker case the ops-port probe blind-spots.isLoopbackAddressInUse→ refactored intoisPortInUseOnAddress(port, addr)+findConflictingPort(addr)which probes each port inCONFLICT_PROBE_PORTSand returns the first conflicting port (ornull).HTTP_CONFLICT_PROBE_PORT(default9926, overrideHARPER_INTEGRATION_TEST_HTTP_CONFLICT_PROBE_PORT); probe set de-duplicated if the two overrides collide.This is the acquire-side safety net for the inherently best-effort, lazy post-kill cleanup: even when teardown recycles an address while a worker lingers, the next allocation detects it and skips.
Validation
Local repro: occupy
127.0.0.2:9926with a plain (non-reusePort) server (a strict subset of the lingering-worker case), then allocate:Before this change the ops-only probe finds
:9925free and hands out127.0.0.2→ co-bind.tscclean; not a behavior change for the healthy path (free address still allocated immediately).Consumption note
mainis on the 0.6.x line; harper currently pins^0.5.1(resolves to 0.5.x). To pick this up, harper needs a follow-up range bump to^0.6.xafter this releases. (harper-pro tracks^0.5.2.)🤖 Generated with Claude Code