From db38331163bc85ca23e5ba77d7e59d9e55ab8867 Mon Sep 17 00:00:00 2001 From: bdchatham Date: Sat, 20 Jun 2026 10:26:17 -0700 Subject: [PATCH] fix(monitoring): scope ControllerHighReconcileLatency to sei-k8s-controller's own controllers MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The #417 SeiNetwork clean-break deleted SeiNodeDeployment, retiring the sei_controller_seinodedeployment_reconcile_substep_duration_seconds metric the alert was scoped to. The rewrite to the generic controller_runtime_reconcile_time_seconds bucket dropped the implicit scope, so the alert now matches EVERY controller-runtime controller in the cluster — Karpenter, Flux, cert-manager, aws-lbc — not just ours. That fires falsely on Karpenter's `interruption` controller, whose reconcile blocks on a ~20s SQS ReceiveMessage long-poll (WaitTimeSeconds=20) against a permanently-empty queue on on-demand nodes. p50≈mean≈p99≈20s (the whole distribution is the idle long-poll, not a tail), 100% result=requeue_after, workqueue_depth=0 — benign by design, fleet-wide. The generic 10s p99 threshold is meaningless for a long-poller, and `kustomization` (Flux) also trips it intermittently. Restore the original intent: only alert on sei-k8s-controller's own reconcile latency (job="sei-k8s-controller"). Karpenter and Flux own their own alerting. Co-Authored-By: Claude Opus 4.8 --- config/monitoring/prometheus-rule.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/config/monitoring/prometheus-rule.yaml b/config/monitoring/prometheus-rule.yaml index edcf2de0..c892b7ca 100644 --- a/config/monitoring/prometheus-rule.yaml +++ b/config/monitoring/prometheus-rule.yaml @@ -96,7 +96,7 @@ spec: expr: | histogram_quantile(0.99, sum by (controller, le) ( - rate(controller_runtime_reconcile_time_seconds_bucket[5m]) + rate(controller_runtime_reconcile_time_seconds_bucket{job="sei-k8s-controller"}[5m]) ) ) > 10 for: 10m