Skip to content

fix(loopback): conflict canary also probes the HTTP port to catch lingering workers#20

Merged
kriszyp merged 2 commits into
mainfrom
kris/loopback-http-canary
Jun 20, 2026
Merged

fix(loopback): conflict canary also probes the HTTP port to catch lingering workers#20
kriszyp merged 2 commits into
mainfrom
kris/loopback-http-canary

Conversation

@kriszyp

@kriszyp kriszyp commented Jun 20, 2026

Copy link
Copy Markdown
Member

Problem

Under CI sharding, the harper integration suite (and harper-pro) hit cascading ECONNREFUSED :9926 failures where one test's instance becomes unreachable and the failure spreads to sibling tests in the same shard (observed concretely on harper#1427). Root cause is an SO_REUSEPORT co-bind that the conflict canary doesn't currently catch.

Root cause

The conflict canary (added in #16/#17) probes only the operations port — main-thread, bound without SO_REUSEPORT — so a plain bind fails with EADDRINUSE if a stale node still holds the address. That correctly catches a fully-alive node.

But the most common post-kill / post-teardown shape is different: the main thread has already exited (freeing the operations port) while the HTTP worker threads still linger holding the address via SO_REUSEPORT. Teardown waits only ~5s for ports to free and then recycles the address anyway, so under CI load the slot is handed to the next test while a worker still owns :9926. The ops-port-only canary reports that address as free → the next node co-binds the HTTP port via SO_REUSEPORT → connections split between two nodes → ECONNREFUSED storms.

Fix

Extend the canary to probe both the operations port and the HTTP port. On Linux a plain (non-reusePort) bind still fails with EADDRINUSE against a lingering SO_REUSEPORT socket, so the HTTP-port probe detects exactly the lingering-worker case the ops-port probe blind-spots.

  • isLoopbackAddressInUse → refactored into isPortInUseOnAddress(port, addr) + findConflictingPort(addr) which probes each port in CONFLICT_PROBE_PORTS and returns the first conflicting port (or null).
  • New HTTP_CONFLICT_PROBE_PORT (default 9926, override HARPER_INTEGRATION_TEST_HTTP_CONFLICT_PROBE_PORT); probe set de-duplicated if the two overrides collide.
  • Warning now names the actual conflicting port.

This is the acquire-side safety net for the inherently best-effort, lazy post-kill cleanup: even when teardown recycles an address while a worker lingers, the next allocation detects it and skips.

Validation

Local repro: occupy 127.0.0.2:9926 with a plain (non-reusePort) server (a strict subset of the lingering-worker case), then allocate:

[loopback-pool] 127.0.0.2 is still in use by another Harper node (port 9926 bound); skipping to avoid an SO_REUSEPORT co-bind.
allocated: 127.0.0.3
RESULT: PASS

Before this change the ops-only probe finds :9925 free and hands out 127.0.0.2 → co-bind. tsc clean; not a behavior change for the healthy path (free address still allocated immediately).

Consumption note

main is on the 0.6.x line; harper currently pins ^0.5.1 (resolves to 0.5.x). To pick this up, harper needs a follow-up range bump to ^0.6.x after this releases. (harper-pro tracks ^0.5.2.)

🤖 Generated with Claude Code

…gering workers

The conflict canary only probed the operations port (main-thread, non-reusePort).
That catches a fully-alive node, but misses the most common post-kill/post-teardown
shape: the main thread has exited (freeing the operations port) while the HTTP
worker threads still hold the address via SO_REUSEPORT. The ops-only probe reports
such an address as free and hands it out, producing a silent SO_REUSEPORT co-bind on
the HTTP port — connections split between two nodes, ECONNREFUSED storms, and
cross-test contamination under CI sharding (observed in harper#1427).

Extend the canary to probe BOTH the operations port and the HTTP port. A plain
(non-reusePort) bind fails with EADDRINUSE against a lingering SO_REUSEPORT worker
socket on Linux, so the HTTP-port probe detects exactly the lingering-worker case the
ops-port probe blind-spots. Probe set is de-duplicated and both ports are overridable.

Validated locally: with 127.0.0.2:9926 occupied, allocation now skips to 127.0.0.3
with '[loopback-pool] ... (port 9926 bound); skipping'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a second conflict-canary port (HTTP_CONFLICT_PROBE_PORT) to detect lingering HTTP worker threads that might still hold a loopback address via SO_REUSEPORT after the main thread has exited. It refactors the address validation logic to probe multiple ports (CONFLICT_PROBE_PORTS) and returns the specific conflicting port if found. There are no review comments, and I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@kriszyp kriszyp requested review from Ethan-Arrowood and heskew June 20, 2026 20:21
…acOS caveat)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kriszyp

kriszyp commented Jun 20, 2026

Copy link
Copy Markdown
Member Author

I am going to go ahead and merge and publish this, since this bug is causing a lot of frustration for parallel QA agents that I have running over the weekend.

@kriszyp kriszyp merged commit 5675699 into main Jun 20, 2026
@kriszyp kriszyp deleted the kris/loopback-http-canary branch June 20, 2026 22:16
@github-actions

Copy link
Copy Markdown

🎉 This PR is included in version 0.6.2 🎉

The release is available on:

Your semantic-release bot 📦🚀

Comment on lines +61 to +64
const HTTP_CONFLICT_PROBE_PORT = (() => {
const parsed = parseInt(process.env.HARPER_INTEGRATION_TEST_HTTP_CONFLICT_PROBE_PORT || '', 10);
return Number.isNaN(parsed) || parsed < 0 || parsed > 65535 ? 9926 : parsed;
})();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why an IIFE ?

I guess its just copying CONFLICT_PROBE_PORT, which I have the same question for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants