From db38331163bc85ca23e5ba77d7e59d9e55ab8867 Mon Sep 17 00:00:00 2001
From: bdchatham <bdchatham@gmail.com>
Date: Sat, 20 Jun 2026 10:26:17 -0700
Subject: [PATCH] fix(monitoring): scope ControllerHighReconcileLatency to
 sei-k8s-controller's own controllers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The #417 SeiNetwork clean-break deleted SeiNodeDeployment, retiring the
sei_controller_seinodedeployment_reconcile_substep_duration_seconds metric the
alert was scoped to. The rewrite to the generic controller_runtime_reconcile_time_seconds
bucket dropped the implicit scope, so the alert now matches EVERY controller-runtime
controller in the cluster — Karpenter, Flux, cert-manager, aws-lbc — not just ours.

The unscoped alert fires falsely on Karpenter's `interruption` controller, whose reconcile
blocks on a ~20s SQS ReceiveMessage long-poll (WaitTimeSeconds=20) against a queue that is
permanently empty for on-demand nodes. p50≈mean≈p99≈20s (the whole distribution is the idle
long-poll, not a tail), 100% result=requeue_after, workqueue_depth=0: benign by design,
fleet-wide. A generic 10s p99 threshold is meaningless for a long-poller, and Flux's
`kustomization` controller also trips it intermittently.

Restore the original intent: only alert on sei-k8s-controller's own reconcile latency
(job="sei-k8s-controller"). Karpenter and Flux own their own alerting.
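
To see which controllers the unscoped expression matched, a query along these lines can be
run ad hoc in Prometheus. It is a sketch, not part of the rule: it assumes every scraped
series carries a `job` label, as the new selector in this patch does, and it adds `job` to
the grouping only to make the owner of each offending controller visible.

```promql
histogram_quantile(0.99,                 # p99 per (job, controller)
  sum by (job, controller, le) (         # keep `le` so the quantile can be computed
    rate(controller_runtime_reconcile_time_seconds_bucket[5m])
  )
) > 10                                   # same 10s threshold as the alert
```

With the scoped rule, only series with job="sei-k8s-controller" should remain in the result.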

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 config/monitoring/prometheus-rule.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/config/monitoring/prometheus-rule.yaml b/config/monitoring/prometheus-rule.yaml
index edcf2de0..c892b7ca 100644
--- a/config/monitoring/prometheus-rule.yaml
+++ b/config/monitoring/prometheus-rule.yaml
@@ -96,7 +96,7 @@ spec:
           expr: |
             histogram_quantile(0.99,
               sum by (controller, le) (
-                rate(controller_runtime_reconcile_time_seconds_bucket[5m])
+                rate(controller_runtime_reconcile_time_seconds_bucket{job="sei-k8s-controller"}[5m])
               )
             ) > 10
           for: 10m