doublezerod: add periodic kernel route reconciliation#3672
Conversation
f31a780 to
bd203a8
Compare
Add a reconciliation loop to the liveness manager that periodically scans the kernel routing table for missing BGP routes and reinstalls them, mitigating connectivity loss caused by external processes removing routes. Also promote liveness session down logs from DEBUG to INFO for passive/peer-passive modes so operators can see the full up/down lifecycle.
Increment RouteInstallFailures counter when a reconciliation reinstall fails, matching the observability pattern in onSessionUp. Also pre-allocate the toCheck slice.
- Re-check installed state under lock before RouteAdd to prevent resurrecting routes intentionally withdrawn by onSessionDown - Add SrcIP to kernel route lookup key for tighter matching in multi-interface setups - Reject negative RouteReconcileInterval in Validate() - Use named const for reconcile interval flag default - Log when route reconciliation is enabled at startup
bd203a8 to
99d373a
Compare
Route Reconciliation Performance Analysis
CPU cost per reconciliation cycle
Estimation methodologyLock hold (step 1): Map iteration over Netlink dump (step 2): Map build + diff (step 3): For each kernel route, we call Amortized CPU: Lock contention with HandleRxThe lock is not held during the expensive netlink syscall (step 2). The snapshot in step 1 holds Practical impact on doublezerod CPU usageGiven a ~3% baseline CPU on a modern x86 core, this change adds effectively zero overhead at realistic route counts (low hundreds). The 1M route case is pathological for a doublezerod client and would have other scaling bottlenecks (BGP convergence, session state memory, netlink install throughput) long before reconciliation matters. |
There was a problem hiding this comment.
- The
0 disableskill switch doesn't work —Validate()rewrites 0 to the 30s default, so there's no way to turn this new periodic dataplane writer off during staged rollout. - Excluded routes churn forever — on any host using a route exclude list, the reinstall counter and a Warn log fire every tick, permanently, defeating the metric's purpose.
Details inline.
PR-description corrections (not line-anchored):
- The description claims a unit test for "incrementing the install failure metric on RouteAdd error". No such test exists — the PR adds three (ReinstallsMissing, SkipsPresent, SkipsUninstalled). Either add it (a few lines with the existing mock) or drop the claim.
- It claims to add
doublezero_liveness_route_install_failures_total. That metric already existed; this PR only adds a new increment site (the new metric isdoublezero_liveness_route_reinstalls_total). - "Prevent TOCTOU race" overstates it — see the inline comment; the race is narrowed, not closed.
| if c.RouteReconcileInterval < 0 { | ||
| return errors.New("routeReconcileInterval must be non-negative") | ||
| } | ||
| if c.RouteReconcileInterval == 0 { | ||
| c.RouteReconcileInterval = defaultRouteReconcileInterval | ||
| } |
There was a problem hiding this comment.
Should fix — the documented 0 disables kill switch doesn't work. The flag help in main.go, the ManagerConfig comment, and the if cfg.RouteReconcileInterval > 0 guard in NewManager all promise that 0 disables reconciliation. But this block rewrites 0 → 30s before that guard runs, and negative values are rejected just above. So --route-liveness-reconcile-interval=0 actually enables reconciliation at 30s — there is no off switch.
This matters: the surrounding liveness flags exist specifically for staged rollout, and this adds a brand-new periodic dataplane writer. Fix: let 0 pass through Validate() unchanged — the operational default already comes from the flag default in main.go. That also lets you drop the duplicate default constant (defaultRouteLivenessReconcileInterval in main.go vs defaultRouteReconcileInterval here) and makes the > 0 guard and the "route reconciliation enabled" log meaningful again.
| kernelSet[kernelKey{Table: kr.Table, DstIP: dstIP, NextHop: nhIP, SrcIP: srcIP}] = struct{}{} | ||
| } | ||
|
|
||
| for _, ir := range toCheck { |
There was a problem hiding this comment.
Should fix — excluded routes churn forever. When a route config has exclude entries, the manager's Netlinker is ConfiguredRouteReaderWriter, whose RouteAdd returns nil without touching the kernel for excluded destinations (routing/config.go). But RegisterRoute (passive) and onSessionUp (active) still set installed[rk] = true.
Result: every tick, for every excluded route, this loop finds it "missing", logs reinstalling missing route at Warn, calls RouteAdd (a silent no-op that itself logs at Info), and increments doublezero_liveness_route_reinstalls_total. The counter is meant to signal external deletion; on any host using excludes it becomes a steadily climbing false positive plus two log lines per route per tick.
Fix: skip destinations in m.cr.GetExcluded() here — the manager already holds m.cr and NewManager already iterates it for AdminDown.
| stillInstalled := m.installed[ir.rk] | ||
| m.mu.Unlock() | ||
| if !stillInstalled { | ||
| continue | ||
| } | ||
| m.log.Warn("liveness: reinstalling missing route", | ||
| "route", ir.route.String(), | ||
| "iface", ir.rk.Interface, | ||
| ) | ||
| if err := m.cfg.Netlinker.RouteAdd(&ir.route.Route); err != nil { |
There was a problem hiding this comment.
Fix or reword — the TOCTOU guard narrows the race but doesn't close it. The lock is dropped between the stillInstalled re-check and RouteAdd. onSessionDown can set installed[rk] = false and delete the kernel route in that window; reconcile then resurrects a route the manager believes is gone. The next tick won't correct it (no longer in toCheck), and a WithdrawRoute while the session is still down may skip the kernel delete, so the stale route can outlive the withdraw. It self-heals only on the next session-up.
Holding m.mu across the re-check + RouteAdd closes it, because onSessionDown flips the flag under the lock before issuing RouteDelete: either reconcile sees false and skips, or its add completes before the delete lands and the end state is consistent. Cost is one netlink call under the lock per genuinely-missing route, which is rare by definition. If you'd rather not hold the lock across a syscall, at minimum soften the PR's "Prevent TOCTOU race" claim — it's narrowed, not prevented.
| return | ||
| } | ||
|
|
||
| kernelRoutes, err := m.cfg.Netlinker.RouteByProtocol(unix.RTPROT_BGP) |
There was a problem hiding this comment.
**Latent foot-gun ** routing.Netlink.RouteByProtocol calls RouteListFiltered(FAMILY_V4, {Protocol: BGP}, RT_FILTER_PROTOCOL). In vishvananda/netlink v1.3.1, routes outside RT_TABLE_MAIN are skipped unless RT_FILTER_TABLE is also set (route_linux.go:1266). Safe today because liveness only attaches in IBRL mode (main table), but:
- The new tests use routes in table 100 (
newTestRoutedefault). Against the real netlink backend those routes would never be returned, soSkipsPresentpasses only because the mock doesn't reproduce kernel filtering — worth knowing the test is partly fictional. - If liveness is ever enabled for edge filtering (tables 100/101), every installed route would be declared missing every tick: harmless dataplane-wise (
RouteAddisRouteReplace) but permanent Warn spam + a useless counter.
Suggest a comment documenting the main-table assumption, or listing per distinct desired table with RT_FILTER_TABLE.
| } | ||
|
|
||
| for _, ir := range toCheck { | ||
| kk := kernelKey{Table: ir.route.Table, DstIP: ir.rk.DstPrefix, NextHop: ir.rk.NextHop, SrcIP: ir.rk.SrcIP} |
There was a problem hiding this comment.
Nit — prefix length is ignored in the kernel match. DstIP uses only the destination IP (mirroring RouteKey.DstPrefix, which despite the name carries no mask). So a kernel route 10.0.0.0/16 would satisfy a desired 10.0.0.0/24, and a deleted /24 wouldn't be reinstalled while a same-IP route with a different mask exists. Pre-existing map-key semantics, but here both sides have the full *net.IPNet (kr.Dst and ir.route.Dst), so the kernel-set key could use Dst.String() for an exact match at no cost.
Resolves: #3669
Summary of Changes
--route-liveness-reconcile-interval), detects BGP routes that should already be installed but are missing, and reinstalls themdoublezero_liveness_route_reinstalls_totalanddoublezero_liveness_route_install_failures_totalPrometheus metrics to track reinstalls and failuresinstalledstate under lock before each reinstall soreconcileRoutescannot resurrect a route thatonSessionDownintentionally withdrew between snapshot and reinstall(table, dst, nexthop)but different source IPs are matched independently in multi-interface setupsDiff Breakdown
Bulk of the change is the reconciliation loop and its tests.
Key files (click to expand)
client/doublezerod/internal/liveness/manager.go—reconcileRoutes()implementation with TOCTOU guard and src-aware kernel key, config field + validation, goroutine launch, startup log, Debug→Info log level changeclient/doublezerod/internal/liveness/manager_test.go— unit tests for route reconciliation (missing route reinstall, present route skip, uninstalled route skip, install failure metric)client/doublezerod/internal/liveness/metrics.go—RouteReinstallscounter androuteReinstallhelperclient/doublezerod/cmd/doublezerod/main.go—--route-liveness-reconcile-intervalflag wiring with named default constTesting Verification
RouteAdderrorNetlinkerto simulate kernel route state; reconciliation ticker set totime.Hourin tests to prevent background interference while callingreconcileRoutes()directlygo vetandgo buildclean