
doublezerod: gate multicast publisher heartbeat on BGP session up #3902

Open: ben-dz wants to merge 1 commit into main from bdz/mcast-publisher-readiness

ben-dz commented Jun 15, 2026

Contributor

Summary

  • Gate the multicast publisher heartbeat (the packet stream that registers the source at the DZD and builds the (S,G)) on the BGP session reaching Up, instead of starting it inline in MulticastService.Setup. This closes the reprovision race that leaves a wedged (S,G) on the publisher's DZD — present but missing the MSDP notify (N) flag and with no OIF — after a host-side network reload.
  • The gate rides the existing BGP session lifecycle (BGPSessionTimeout = 30s) and the 10s reconcile loop — no new timeout or tunable is introduced.
  • Teardown cancels and awaits the readiness watcher before closing the heartbeat, so a pending start can't race with or follow the close.
  • Incremental group updates (add/remove groups without a disconnect/reconnect) now respect readiness: UpdateGroups records the new group set and updates the running sender in place, or defers to the watcher when the heartbeat hasn't started yet — avoiding a nil-conn panic in the pre-Up window.

Why

The underlying wedge is an EOS/DZD MSDP bug, and DZD/EOS upgrades depend on independent contributors. This is the client-side mitigation we can ship unilaterally: by registering the source only after the tunnel/BGP plumbing is up, we stop creating the overlap that produces the wedge in the first place, rather than relying on operators to notice and reconnect.

Testing Verification

  • TestMulticastService_HeartbeatGatedOnBGPUp — heartbeat stays silent while the session is Pending, then starts once it reaches Up.
  • TestMulticastService_HeartbeatGroupChangeBeforeBGPUp — a group change before BGP Up does not touch the not-yet-started sender (0 UpdateGroups calls) and the eventual start uses the updated group set (regression guard for the no-disconnect group-update path).
  • Reworked TestMulticastService_UpdateGroups_AddPubGroup to bring BGP up first, exercising the running-sender in-place update path.
  • go test -race green across internal/services and internal/manager; golangci-lint clean.

ben-dz force-pushed the bdz/mcast-publisher-readiness branch from 5443ae4 to 44d33bf on June 15, 2026 21:39
ben-dz requested a review from bgm-malbeclabs on June 15, 2026 23:44

The publisher heartbeat is what registers the multicast source at the DZD and builds the (S,G). It was started inline in MulticastService.Setup before the BGP session was established, so on a reprovision after a host network reload the new source register could overlap the DZD tearing down the prior PIM-register state — the window that leaves a wedged (S,G) (no N/notify-MSDP flag, no OIF) and stalls delivery until a manual publisher reconnect.

Defer the heartbeat behind a readiness watcher that starts it only once the BGP session reports Up, riding the existing 30s session lifecycle and 10s reconcile (no new timeout). Teardown cancels and awaits the watcher before closing the heartbeat.

UpdateGroups (incremental add/remove without a disconnect/reconnect) now respects readiness: it records the new group set and updates the running sender in place, or defers to the watcher when the heartbeat has not started yet, avoiding a nil-conn panic in that window.

ben-dz force-pushed the bdz/mcast-publisher-readiness branch from 44d33bf to 0eeae11 on June 16, 2026 00:39