Skip to content

doublezerod: add periodic kernel route reconciliation#3672

Open
nikw9944 wants to merge 4 commits into
mainfrom
nikw9944/doublezero-3669
Open

doublezerod: add periodic kernel route reconciliation#3672
nikw9944 wants to merge 4 commits into
mainfrom
nikw9944/doublezero-3669

Conversation

@nikw9944

@nikw9944 nikw9944 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Resolves: #3669

Summary of Changes

  • Add a periodic route reconciliation goroutine to the liveness manager that scans the kernel routing table every 30s (configurable via --route-liveness-reconcile-interval), detects BGP routes that should already be installed but are missing, and reinstalls them
  • This mitigates the case where another process or an administrator mistakenly deletes a doublezero route from the kernel routing table
  • Add doublezero_liveness_route_reinstalls_total and doublezero_liveness_route_install_failures_total Prometheus metrics to track reinstalls and failures
  • Prevent TOCTOU race in active mode: re-check installed state under lock before each reinstall so reconcileRoutes cannot resurrect a route that onSessionDown intentionally withdrew between snapshot and reinstall
  • Include source IP in the kernel route lookup key so routes with the same (table, dst, nexthop) but different source IPs are matched independently in multi-interface setups
  • Promote "session down (passive; keeping route)" log messages from Debug to Info, otherwise it's possible to log multiple 'liveness: session up' messages in a row

Diff Breakdown

Category Files Lines (+/-) Net
Core logic 1 +116 / -2 +114
Scaffolding 2 +30 / -13 +17
Tests 1 +124 / -0 +124
Docs 1 +2 / -0 +2
Total 5 +272 / -15 +257

Bulk of the change is the reconciliation loop and its tests.

Key files (click to expand)

Testing Verification

  • Unit tests cover: reinstalling a missing route, skipping a route present in kernel, skipping an uninstalled route (active mode, session never went Up), and incrementing the install failure metric on RouteAdd error
  • Tests use mock Netlinker to simulate kernel route state; reconciliation ticker set to time.Hour in tests to prevent background interference while calling reconcileRoutes() directly
  • Full liveness test suite passes (40s, all existing tests unaffected)
  • go vet and go build clean

@nikw9944 nikw9944 marked this pull request as ready for review May 5, 2026 21:07
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from f31a780 to bd203a8 Compare May 5, 2026 21:37
nikw9944 added 4 commits May 5, 2026 21:49
Add a reconciliation loop to the liveness manager that periodically
scans the kernel routing table for missing BGP routes and reinstalls
them, mitigating connectivity loss caused by external processes
removing routes.

Also promote liveness session down logs from DEBUG to INFO for
passive/peer-passive modes so operators can see the full up/down
lifecycle.
Increment RouteInstallFailures counter when a reconciliation reinstall
fails, matching the observability pattern in onSessionUp. Also
pre-allocate the toCheck slice.
- Re-check installed state under lock before RouteAdd to prevent
  resurrecting routes intentionally withdrawn by onSessionDown
- Add SrcIP to kernel route lookup key for tighter matching in
  multi-interface setups
- Reject negative RouteReconcileInterval in Validate()
- Use named const for reconcile interval flag default
- Log when route reconciliation is enabled at startup
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3669 branch from bd203a8 to 99d373a Compare May 5, 2026 21:50
@nikw9944

nikw9944 commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Route Reconciliation Performance Analysis

reconcileRoutes() runs every 30s with 4 phases:

  1. Lock → snapshot → unlock — iterates installed ∩ desired into a local slice
  2. Netlink dump (no lock held)RouteListFiltered(FAMILY_V4, RTPROT_BGP) dumps all kernel BGP routes
  3. Build hash set + diff — builds map[kernelKey]struct{} from kernel routes, checks each installed route against it
  4. Reinstall missing — brief re-lock per missing route, then RouteAdd

CPU cost per reconciliation cycle

Routes Lock hold (step 1) Netlink dump (step 2) Map build + diff (step 3) Total per cycle Amortized over 30s
100 ~1 μs ~0.5 ms ~50 μs ~0.6 ms 0.002%
1,000 ~5 μs ~3 ms ~0.5 ms ~4 ms 0.013%
1,000,000 ~5 ms ~1-2 s ~300-500 ms ~2-3 s ~7-10%

Estimation methodology

Lock hold (step 1): Map iteration over installed checking membership in desired, appending matches to a slice. Go map iteration is ~10ns/entry. 100 entries × 10ns = 1μs; 1000 × 10ns = 10μs (rounded to ~5μs accounting for the append being fast). At 1M entries, iteration + slice growth + memory allocation dominates: ~5ms.

Netlink dump (step 2): RouteListFiltered issues a single RTM_GETROUTE netlink dump; the kernel filters by protocol server-side and streams matching routes back. Each route message is ~100-200 bytes of netlink payload. At 100 routes that's ~10-20KB parsed by vishvananda/netlink — well under 1ms. At 1,000 routes, ~100-200KB, ~3ms. At 1M routes, ~100-200MB of netlink data to receive and deserialize — estimated 1-2s based on netlink socket throughput (~100-200MB/s for dump operations).

Map build + diff (step 3): For each kernel route, we call IP.To4().String() on 3 fields (dst, nexthop, src) creating heap-allocated strings, then insert into map[kernelKey]struct{}. String conversion + map insert is ~200-500ns/route. At 100 routes: ~50μs. At 1,000: ~500μs. At 1M: ~300-500ms (dominated by allocations and GC pressure from ~3M short-lived strings). The diff itself is O(installed) map lookups at O(1) each — negligible relative to the build.

Amortized CPU: total_per_cycle / 30s. E.g., at 1,000 routes: 4ms / 30,000ms = 0.013%.

Lock contention with HandleRx

The lock is not held during the expensive netlink syscall (step 2). The snapshot in step 1 holds m.mu only during map iteration. HandleRx (the hot path) also holds m.mu for its duration, so these can contend — but at 100-1000 routes, a 1-5μs delay is negligible compared to the 50ms-3s liveness TX intervals. At 1M routes the ~5ms lock hold becomes measurable but is still small relative to liveness timers.

Practical impact on doublezerod CPU usage

Given a ~3% baseline CPU on a modern x86 core, this change adds effectively zero overhead at realistic route counts (low hundreds). The 1M route case is pathological for a doublezerod client and would have other scaling bottlenecks (BGP convergence, session state memory, netlink install throughput) long before reconciliation matters.

@ben-dz ben-dz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The 0 disables kill switch doesn't work — Validate() rewrites 0 to the 30s default, so there's no way to turn this new periodic dataplane writer off during staged rollout.
  2. Excluded routes churn forever — on any host using a route exclude list, the reinstall counter and a Warn log fire every tick, permanently, defeating the metric's purpose.

Details inline.

PR-description corrections (not line-anchored):

  • The description claims a unit test for "incrementing the install failure metric on RouteAdd error". No such test exists — the PR adds three (ReinstallsMissing, SkipsPresent, SkipsUninstalled). Either add it (a few lines with the existing mock) or drop the claim.
  • It claims to add doublezero_liveness_route_install_failures_total. That metric already existed; this PR only adds a new increment site (the new metric is doublezero_liveness_route_reinstalls_total).
  • "Prevent TOCTOU race" overstates it — see the inline comment; the race is narrowed, not closed.

Comment on lines +162 to +167
if c.RouteReconcileInterval < 0 {
return errors.New("routeReconcileInterval must be non-negative")
}
if c.RouteReconcileInterval == 0 {
c.RouteReconcileInterval = defaultRouteReconcileInterval
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should fix — the documented 0 disables kill switch doesn't work. The flag help in main.go, the ManagerConfig comment, and the if cfg.RouteReconcileInterval > 0 guard in NewManager all promise that 0 disables reconciliation. But this block rewrites 0 → 30s before that guard runs, and negative values are rejected just above. So --route-liveness-reconcile-interval=0 actually enables reconciliation at 30s — there is no off switch.

This matters: the surrounding liveness flags exist specifically for staged rollout, and this adds a brand-new periodic dataplane writer. Fix: let 0 pass through Validate() unchanged — the operational default already comes from the flag default in main.go. That also lets you drop the duplicate default constant (defaultRouteLivenessReconcileInterval in main.go vs defaultRouteReconcileInterval here) and makes the > 0 guard and the "route reconciliation enabled" log meaningful again.

kernelSet[kernelKey{Table: kr.Table, DstIP: dstIP, NextHop: nhIP, SrcIP: srcIP}] = struct{}{}
}

for _, ir := range toCheck {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should fix — excluded routes churn forever. When a route config has exclude entries, the manager's Netlinker is ConfiguredRouteReaderWriter, whose RouteAdd returns nil without touching the kernel for excluded destinations (routing/config.go). But RegisterRoute (passive) and onSessionUp (active) still set installed[rk] = true.

Result: every tick, for every excluded route, this loop finds it "missing", logs reinstalling missing route at Warn, calls RouteAdd (a silent no-op that itself logs at Info), and increments doublezero_liveness_route_reinstalls_total. The counter is meant to signal external deletion; on any host using excludes it becomes a steadily climbing false positive plus two log lines per route per tick.

Fix: skip destinations in m.cr.GetExcluded() here — the manager already holds m.cr and NewManager already iterates it for AdminDown.

Comment on lines +973 to +982
stillInstalled := m.installed[ir.rk]
m.mu.Unlock()
if !stillInstalled {
continue
}
m.log.Warn("liveness: reinstalling missing route",
"route", ir.route.String(),
"iface", ir.rk.Interface,
)
if err := m.cfg.Netlinker.RouteAdd(&ir.route.Route); err != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix or reword — the TOCTOU guard narrows the race but doesn't close it. The lock is dropped between the stillInstalled re-check and RouteAdd. onSessionDown can set installed[rk] = false and delete the kernel route in that window; reconcile then resurrects a route the manager believes is gone. The next tick won't correct it (no longer in toCheck), and a WithdrawRoute while the session is still down may skip the kernel delete, so the stale route can outlive the withdraw. It self-heals only on the next session-up.

Holding m.mu across the re-check + RouteAdd closes it, because onSessionDown flips the flag under the lock before issuing RouteDelete: either reconcile sees false and skips, or its add completes before the delete lands and the end state is consistent. Cost is one netlink call under the lock per genuinely-missing route, which is rare by definition. If you'd rather not hold the lock across a syscall, at minimum soften the PR's "Prevent TOCTOU race" claim — it's narrowed, not prevented.

return
}

kernelRoutes, err := m.cfg.Netlinker.RouteByProtocol(unix.RTPROT_BGP)

@ben-dz ben-dz Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

**Latent foot-gun ** routing.Netlink.RouteByProtocol calls RouteListFiltered(FAMILY_V4, {Protocol: BGP}, RT_FILTER_PROTOCOL). In vishvananda/netlink v1.3.1, routes outside RT_TABLE_MAIN are skipped unless RT_FILTER_TABLE is also set (route_linux.go:1266). Safe today because liveness only attaches in IBRL mode (main table), but:

  • The new tests use routes in table 100 (newTestRoute default). Against the real netlink backend those routes would never be returned, so SkipsPresent passes only because the mock doesn't reproduce kernel filtering — worth knowing the test is partly fictional.
  • If liveness is ever enabled for edge filtering (tables 100/101), every installed route would be declared missing every tick: harmless dataplane-wise (RouteAdd is RouteReplace) but permanent Warn spam + a useless counter.

Suggest a comment documenting the main-table assumption, or listing per distinct desired table with RT_FILTER_TABLE.

}

for _, ir := range toCheck {
kk := kernelKey{Table: ir.route.Table, DstIP: ir.rk.DstPrefix, NextHop: ir.rk.NextHop, SrcIP: ir.rk.SrcIP}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit — prefix length is ignored in the kernel match. DstIP uses only the destination IP (mirroring RouteKey.DstPrefix, which despite the name carries no mask). So a kernel route 10.0.0.0/16 would satisfy a desired 10.0.0.0/24, and a deleted /24 wouldn't be reinstalled while a same-IP route with a different mask exists. Pre-existing map-key semantics, but here both sides have the full *net.IPNet (kr.Dst and ir.route.Dst), so the kernel-set key could use Dst.String() for an exact match at no cost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Route installed by doublezerod removed by unknown process

2 participants