doublezerod: gate multicast publisher heartbeat on BGP session up#3902
Open
ben-dz wants to merge 1 commit into
Open
doublezerod: gate multicast publisher heartbeat on BGP session up#3902ben-dz wants to merge 1 commit into
ben-dz wants to merge 1 commit into
Conversation
5443ae4 to
44d33bf
Compare
The publisher heartbeat is what registers the multicast source at the DZD and builds the (S,G). It was started inline in MulticastService.Setup before the BGP session was established, so on a reprovision after a host network reload the new source register could overlap the DZD tearing down the prior PIM-register state — the window that leaves a wedged (S,G) (no N/notify-MSDP flag, no OIF) and stalls delivery until a manual publisher reconnect. Defer the heartbeat behind a readiness watcher that starts it only once the BGP session reports Up, riding the existing 30s session lifecycle and 10s reconcile (no new timeout). Teardown cancels and awaits the watcher before closing the heartbeat. UpdateGroups (incremental add/remove without a disconnect/reconnect) now respects readiness: it records the new group set and updates the running sender in place, or defers to the watcher when the heartbeat has not started yet, avoiding a nil-conn panic in that window.
44d33bf to
0eeae11
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
(S,G)) on the BGP session reachingUp, instead of starting it inline inMulticastService.Setup. This closes the reprovision race that leaves a wedged(S,G)on the publisher's DZD — present but missing the MSDP notify (N) flag and with no OIF — after a host-side network reload.BGPSessionTimeout = 30s) and the 10s reconcile loop — no new timeout or tunable is introduced.Teardowncancels and awaits the readiness watcher before closing the heartbeat, so a pending start can't race with or follow the close.UpdateGroupsrecords the new group set and updates the running sender in place, or defers to the watcher when the heartbeat hasn't started yet — avoiding a nil-conn panic in the pre-Upwindow.Why
The underlying wedge is an EOS/DZD MSDP bug, and DZD/EOS upgrades depend on independent contributors. This is the client-side mitigation we can ship unilaterally: by registering the source only after the tunnel/BGP plumbing is up, we stop creating the overlap that produces the wedge in the first place, rather than relying on operators to notice and reconnect.
Testing Verification
TestMulticastService_HeartbeatGatedOnBGPUp— heartbeat stays silent while the session isPending, then starts once it reachesUp.TestMulticastService_HeartbeatGroupChangeBeforeBGPUp— a group change before BGPUpdoes not touch the not-yet-started sender (0UpdateGroupscalls) and the eventual start uses the updated group set (regression guard for the no-disconnect group-update path).TestMulticastService_UpdateGroups_AddPubGroupto bring BGP up first, exercising the running-sender in-place update path.go test -racegreen acrossinternal/servicesandinternal/manager;golangci-lintclean.