feat(ops): internal observability dashboard at /ops #2075
Merged
Adds the persistence + read path for an internal observability dashboard:
two append-only tables (request_logs, outbound_call_logs), an
ObservabilitySink that buffers and batch-flushes (5s interval, 100-row
threshold) so writes never block requests, and the layered read pipeline
RouteOpsRouter -> OpsController -> GetOpsMetricsUseCase ->
ObservabilityQueryService -> ObservabilityRepository returning bucketed
JSON for /ops.
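A minimal sketch of the buffer-and-batch-flush idea described above, assuming a caller-supplied `flushFn` that performs the actual insert. Only the 5s interval and 100-row threshold come from this PR; the class name `BufferedSink`, its method names, and the `Row` shape are hypothetical and are not the real `ObservabilitySink` API.

```typescript
// Hypothetical stand-in for ObservabilitySink; names and shapes are assumptions.
type Row = Record<string, unknown>;
type FlushFn = (rows: Row[]) => Promise<void>;

class BufferedSink {
  private buffer: Row[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly flushFn: FlushFn;
  private readonly maxRows: number;
  private readonly intervalMs: number;

  constructor(flushFn: FlushFn, maxRows = 100, intervalMs = 5000) {
    this.flushFn = flushFn;
    this.maxRows = maxRows;
    this.intervalMs = intervalMs;
  }

  // Periodic flush so a quiet period still drains the buffer.
  start(): void {
    this.timer = setInterval(() => void this.flush(), this.intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }

  // Called on the request path: an array push, plus a fire-and-forget
  // flush once the row threshold is reached. Never awaited by the caller.
  record(row: Row): void {
    this.buffer.push(row);
    if (this.buffer.length >= this.maxRows) void this.flush();
  }

  // Swap the buffer first so rows recorded during the insert go to the next batch.
  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    try {
      await this.flushFn(batch);
    } catch (err) {
      // Instrumentation must not threaten the request path: log and move on.
      console.error('observability flush failed', err);
    }
  }

  get pending(): number {
    return this.buffer.length;
  }
}
```

Swapping the buffer before awaiting the insert keeps `record` safe to call while a flush is in flight; dropping a failed batch rather than retrying is the trade-off this sketch assumes.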
The inbound `requestLoggingMiddleware` is mounted globally before
routers, captures method/route-template/status/duration on
res.on('finish'), and skips /ops/* itself.
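The capture-on-finish pattern can be sketched as follows. This is an illustration, not the actual `requestLoggingMiddleware`: `ReqLike` and `ResLike` are minimal structural stand-ins for the Express types, and the `record` callback, the injectable `now`, and the `'unmatched'` fallback label are assumptions.

```typescript
// Minimal stand-ins for Express's Request/Response (assumed subset of fields).
interface ReqLike {
  method: string;
  path: string;
  route?: { path: string };
}
interface ResLike {
  statusCode: number;
  on(event: 'finish', cb: () => void): void;
}
interface RequestLogRow {
  method: string;
  route: string;
  status: number;
  durationMs: number;
}

function makeRequestLogger(
  record: (row: RequestLogRow) => void,
  now: () => number = Date.now
) {
  return (req: ReqLike, res: ResLike, next: () => void): void => {
    // The dashboard must not log its own traffic.
    if (req.path === '/ops' || req.path.startsWith('/ops/')) {
      next();
      return;
    }
    const startedAt = now();
    res.on('finish', () => {
      record({
        method: req.method,
        // The route template is only known after a router matched, so it is
        // read inside the callback. 'unmatched' is a hypothetical fallback.
        route: req.route?.path ?? 'unmatched',
        status: res.statusCode,
        durationMs: now() - startedAt,
      });
    });
    next();
  };
}
```

Reading the route template inside the `finish` callback matters because the middleware is mounted before the routers, so nothing has matched yet when it first runs.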
`/ops/api/metrics` is gated by RequireOpsAccess (404 for non-Al, not 403
- we do not reveal the dashboard's existence). The matching feature
flag `features.ops` is exposed via /api/users/debug/locals so the
frontend can render the navbar entry only for Al.
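A rough sketch of the 404-instead-of-403 gate and the matching feature flag. How the owner is identified is not stated in this PR; comparing a numeric `user.id` against an `ownerId` is an assumption, as are the `OpsReq`/`OpsRes` stand-in types and the function names.

```typescript
// Assumed minimal shapes; the real guard uses the app's own request/user types.
interface OpsReq {
  user?: { id: number };
}
interface OpsRes {
  status(code: number): OpsRes;
  end(): void;
}

function requireOpsAccess(ownerId: number) {
  return (req: OpsReq, res: OpsRes, next: () => void): void => {
    if (req.user?.id !== ownerId) {
      // 404 rather than 403: a 403 would confirm that the route exists.
      res.status(404).end();
      return;
    }
    next();
  };
}

// Same predicate drives the flag the frontend reads to show the navbar entry.
function featuresFor(user: { id: number } | undefined, ownerId: number): { ops: boolean } {
  return { ops: user?.id === ownerId };
}
```

Deriving the flag and the guard from the same predicate keeps the navbar entry and the backend gate from drifting apart.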
Spec + design notes live in Documentation/ops-observability/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Routes the six existing axios call sites that target our allowlisted
external services through `instrumentedAxios` so each call records its
service label, host+pathname (query stripped), status code, and
duration into outbound_call_logs:
- NotionService.getAccessData -> notion
- NotionService.helpers.downloadMediaOrSkip -> notion
- NotionService.helpers.renderIcon -> notion
- AuthenticationService.loginWithGoogle -> google_drive
- handleDropbox upload download -> dropbox
- handleGoogleDrive upload download -> google_drive
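The recorded fields can be illustrated with a small wrapper around any promise-returning call. This is a sketch under assumptions, not the real `instrumentedAxios`: `instrumentedCall`, `endpointLabel`, the row shape, and the `record` callback are hypothetical names; only the four service labels and the "host+pathname, query stripped" rule come from the text above.

```typescript
// Closed allowlist of service labels mentioned in this PR.
type ServiceLabel = 'notion' | 'google_drive' | 'dropbox' | 'patreon';

interface OutboundCallRow {
  service: ServiceLabel;
  endpoint: string;
  status: number | null;
  durationMs: number;
}

// host + pathname only: query strings and fragments can carry tokens.
function endpointLabel(rawUrl: string): string {
  const u = new URL(rawUrl);
  return `${u.host}${u.pathname}`;
}

async function instrumentedCall<T extends { status: number }>(
  service: ServiceLabel,
  url: string,
  call: () => Promise<T>,
  record: (row: OutboundCallRow) => void,
  now: () => number = Date.now
): Promise<T> {
  const startedAt = now();
  let status: number | null = null;
  try {
    const response = await call();
    status = response.status;
    return response;
  } catch (err) {
    // Assumption: HTTP client errors expose err.response.status when the
    // server answered; anything else is recorded with a null status.
    const maybe = (err as { response?: { status?: number } } | null)?.response?.status;
    status = typeof maybe === 'number' ? maybe : null;
    throw err;
  } finally {
    record({ service, endpoint: endpointLabel(url), status, durationMs: now() - startedAt });
  }
}
```

The error is always rethrown, so wrapping a call site changes what gets recorded but not what the caller observes.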
Skipped on purpose:
- BlockBookmark.useMetadata fetches arbitrary user-supplied URLs and
doesn't fit the closed allowlist; instrumenting it would require
expanding the allowlist with no useful service label. Leaving it
un-instrumented matches the spec ("if a caller doesn't fit, skip
it").
- Patreon has no live HTTP caller in the codebase today; the label is
reserved for the future Patreon webhook ingest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renders the four charts spec'd in Documentation/ops-observability/DESIGN.md
from the aggregated /ops/api/metrics payload: inbound volume stacked by
status class, top-15 routes by avg/p95 latency, outbound calls per service
over time, and side-by-side error-rate bars for routes and services.
- Recharts is added as a dependency and only ships in the lazy-loaded
/ops chunk - no impact on the upload/Ankify hot paths.
- Window is a URL query param (?window=1h|24h|7d, default 24h),
reload-safe.
- React Query handles the 30-second background refetch; refetch is
paused while the tab is hidden via refetchIntervalInBackground: false.
- The visible chart data falls back to the last successful snapshot on
fetch error so a transient 5xx doesn't blank out the dashboard - it just
renders the alert banner above the panels.
- All numeric formatting (% color thresholds, status grouping, bucket
labels) is covered by unit tests in opsHelpers.test.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
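Two of the behaviours listed above, the reload-safe window param and the last-good-snapshot fallback, reduce to small pure functions. The sketch below is illustrative only; `parseWindow`, `applyFetchResult`, and the `ViewState` shape are hypothetical names, not the helpers in opsHelpers.

```typescript
const WINDOWS = ['1h', '24h', '7d'] as const;
type OpsWindow = (typeof WINDOWS)[number];
const DEFAULT_WINDOW: OpsWindow = '24h';

function isOpsWindow(value: string | null): value is OpsWindow {
  return value !== null && (WINDOWS as readonly string[]).includes(value);
}

// Reads ?window=... from a location.search-style string; anything
// unrecognised falls back to the default instead of failing.
function parseWindow(search: string): OpsWindow {
  const raw = new URLSearchParams(search).get('window');
  return isOpsWindow(raw) ? raw : DEFAULT_WINDOW;
}

interface ViewState<T> {
  data: T | undefined;
  error: string | undefined;
}
type FetchResult<T> = { ok: true; data: T } | { ok: false; error: string };

// On failure keep the previous data and surface only the error, so the
// charts stay rendered underneath the alert banner.
function applyFetchResult<T>(prev: ViewState<T>, result: FetchResult<T>): ViewState<T> {
  return result.ok
    ? { data: result.data, error: undefined }
    : { data: prev.data, error: result.error };
}
```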
Adds an "Ops" entry to the desktop and mobile navbar that's gated on `features.ops` from /api/users/debug/locals - hidden for everyone except the ops owner, who is also the only user the backend will serve /ops/api/metrics for. A small uppercase "admin" tag sits next to the label so it reads as an internal tool, not a regular feature.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
    verb: 'get' | 'delete',
    url: string,
    config: AxiosRequestConfig | undefined
    ): Promise<AxiosResponse<T>> => axios[verb]<T>(url, config);
URL.pathname is already normalized by the URL parser, so the `/\/+$/` regex was doing nothing useful for endpoint labels. Removing it also clears Sonar S5852 (slow-regex hotspot); the regex is linear, but reviewing it adds noise we don't need.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dashboard was fetching `/ops/api/metrics`, but vite's dev proxy only forwards `/api/*` to the backend, so the SPA's index.html came back as `<!DOCTYPE...` and the page rendered an empty state. The production server's DefaultRouter catch-all `^/(?!api).*` would have hit the same issue once deployed.
Moving the JSON endpoint to `/api/ops/metrics` matches the codebase convention (`/api/*` for JSON, everything else for SPA), so vite proxies it automatically and DefaultRouter's regex naturally lets it through. The dashboard page itself still lives at `/ops`.
requestLoggingMiddleware now skips both `/ops/*` (the page) and `/api/ops/*` (its own polling) so the dashboard never logs itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
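The routing problem is easy to reproduce with the catch-all regex quoted above. The snippet below is a small check, not code from the PR; `servedBySpa` and `isDashboardTraffic` are hypothetical helper names.

```typescript
// The DefaultRouter catch-all quoted in the commit message.
const spaCatchAll = /^\/(?!api).*/;

// True when the SPA's index.html would answer instead of a JSON handler.
function servedBySpa(path: string): boolean {
  return spaCatchAll.test(path);
}

// Paths the request logger should ignore after the move: the page itself
// and its polling endpoint. Prefix match on whole segments only.
function isDashboardTraffic(path: string): boolean {
  return ['/ops', '/api/ops'].some(
    (prefix) => path === prefix || path.startsWith(`${prefix}/`)
  );
}

console.log(servedBySpa('/ops/api/metrics')); // old endpoint: swallowed by the SPA
console.log(servedBySpa('/api/ops/metrics')); // new endpoint: falls through to the API
```

The negative lookahead only inspects what immediately follows the leading slash, which is why an `api` segment deeper in the path did not help the old endpoint.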
Adds /api/ops/business/metrics returning six business metrics (MRR, net new MRR MTD, active paying subs, 30d churn, 7d failed payments, 7d new paid conversions). Five come from the local subscriptions table via the new SubscriptionsAnalyticsRepository; failed payments hit Stripe with a 15-min in-memory cache. Owner-only via existing RequireOpsAccess.
The Engineering view moves into a sibling EngineeringTab under a shared OpsLayout, with /ops/business rendering raw JSON in a <pre> as a sanity check before the card grid in PR B.
Hourly Stripe sync added to ScheduleCleanup so the local subscriptions table stays fresh enough for the 15-min response cache.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the local-first approach (subscriptions table read) in favour of
calling Stripe directly with a 15-min server-side cache. Decouples the
dashboard from updateStripeSubscriptions, which Al kept manual-only.
- Delete SubscriptionsAnalyticsRepository + test
- BusinessMetricsService now owns all six metric queries via Stripe SDK
- mrr_usd and active_paying_subs share one paginated walk of
subscriptions.list({status: 'active'}) — assert in tests
- net_new_mrr_mtd_usd: subscriptions.list({created.gte: month_start})
- new_paid_conversions_7d: subscriptions.list({created.gte: 7d_ago})
- churn_30d_pct: subscriptions.search('canceled_at>:30d') / active count
- failed_payments_7d: invoices.list filtered to open + attempts > 0
- MRR normalization in TS: per-item unit_amount × quantity × factor
where factor is {month:1, year:1/12, week:4.33, day:30}/interval_count
- OpsRouter drops repository wiring; service constructed empty
- Tests now mock the Stripe SDK (external dep), cover yearly/weekly
normalization, multi-item subs, trialing exclusion, partial-failure
errors[], cache hit/expiry, single active-list walk
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
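The MRR normalization rule above (per-item unit_amount × quantity × factor / interval_count) works out as in this sketch. The factor table is taken from the commit message; the `PriceItem` shape, the camelCase field names, and the function names are assumptions and do not mirror the Stripe SDK types.

```typescript
type Interval = 'day' | 'week' | 'month' | 'year';

// Assumed flattened view of a subscription item's price.
interface PriceItem {
  unitAmount: number; // in cents
  quantity: number;
  interval: Interval;
  intervalCount: number;
}

// Factors from the commit message: how many billing periods fit in a month.
const MONTHLY_FACTOR: Record<Interval, number> = {
  month: 1,
  year: 1 / 12,
  week: 4.33,
  day: 30,
};

function monthlyCents(item: PriceItem): number {
  return (item.unitAmount * item.quantity * MONTHLY_FACTOR[item.interval]) / item.intervalCount;
}

// One inner array per subscription, so multi-item subs are summed per item.
function mrrUsd(subscriptions: PriceItem[][]): number {
  let cents = 0;
  for (const items of subscriptions) {
    for (const item of items) cents += monthlyCents(item);
  }
  return cents / 100;
}
```

For example, a yearly price of 12000 cents contributes about 1000 cents per month, and a price billed every 3 months contributes a third of its amount.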
Stripe Search API uses range operators without the colon (`canceled_at>1234`), not `canceled_at>:1234`. The colon form is rejected with 'Ensure you have properly quoted values'. Adds a regex assertion on the search call so the syntax can't regress.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
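A query builder plus the kind of regex guard described above might look like this. It is a sketch: `canceledSinceQuery` and `VALID_CHURN_QUERY` are hypothetical names, and the exact regex used in the PR's test is not shown in this conversation.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Builds the Stripe Search query for subscriptions canceled in the last
// `days` days. Range operators take the bare value: no colon after '>'.
function canceledSinceQuery(nowMs: number, days = 30): string {
  const sinceSec = Math.floor(nowMs / 1000) - days * SECONDS_PER_DAY;
  return `canceled_at>${sinceSec}`;
}

// Guard a test can assert against so the colon form cannot come back.
const VALID_CHURN_QUERY = /^canceled_at>\d+$/;

console.log(canceledSinceQuery(1_700_000_000_000)); // canceled_at>1697408000
```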
Reconstructs four historical series from a single paginated walk of
subscriptions.list({status:'all'}) plus the existing 12-week invoices
walk: mrr_timeseries (90d), active_subs_timeseries (90d),
conversions_vs_churn_weekly (12w), failed_payments_weekly (12w).
Today snapshots and time-series share one walk per refresh; each metric
is independently cached for 15 min and partial failures still surface
via response.errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
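One way to reconstruct an active-subscriptions series from a single list walk is to replay each subscription's lifetime against a grid of sample times. The sketch below assumes daily sampling, Unix-second timestamps, and an `endedAt ?? canceledAt` "active until" boundary with currently trialing subscriptions excluded (both described in the PR summary); `SubLike` and `activeSubsSeries` are hypothetical names.

```typescript
// Assumed flattened subset of a Stripe subscription (timestamps in seconds).
interface SubLike {
  created: number;
  canceledAt: number | null;
  endedAt: number | null;
  status: string;
}

const DAY_SEC = 86_400;

// Returns one point per day, oldest first, ending at endSec.
function activeSubsSeries(
  subs: SubLike[],
  endSec: number,
  days: number
): { t: number; active: number }[] {
  const points: { t: number; active: number }[] = [];
  for (let i = days - 1; i >= 0; i--) {
    const t = endSec - i * DAY_SEC;
    let active = 0;
    for (const s of subs) {
      // Trial-to-active transitions are not reconstructed in this sketch.
      if (s.status === 'trialing') continue;
      const until = s.endedAt ?? s.canceledAt;
      if (s.created <= t && (until === null || until > t)) active += 1;
    }
    points.push({ t, active });
  }
  return points;
}
```

The same replay loop could accumulate normalized MRR per point instead of a count, which is how one walk can feed several series.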
Six big-number cards (MRR, Net new MRR MTD, Active paying subs, Churn 30d, Failed payments 7d, New paid conversions 7d) sit above a 2x2 grid of Recharts panels: MRR area, active-subs line, new-vs-churned paired bars, failed-payments bars. Mirrors the Engineering tab's visual treatment (ChartPanel wrapper, 220px chart frame, 30s auto-refresh, last-snapshot retention during refetches, pageWide container). The previous <pre> JSON fallback is gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aalemayhu
added a commit
that referenced
this pull request
May 9, 2026
PR #2075 introduced four new Recharts components (MRR, ActiveSubs, ConversionsChurn, FailedPaymentsWeekly) that each duplicated chart margins, axis tick props, axis stroke, grid stroke, tooltip cursor fill, and the tooltip wrapper markup. Sonar measured the new code at 4.1% duplication, failing the <=3% quality gate.
Extract the shared tokens into timeSeriesChartHelpers.ts and the tooltip shell + row into TimeSeriesTooltipShell.tsx, then route the four new charts through both. Also pass tooltip components by reference to <Tooltip content={X}> instead of an inline arrow, addressing the typescript:S6478 "component-definition-in-parent" finding on those four files at the same time.
Engineering charts (InboundVolume, LatencyByRoute, OutboundByService, ErrorRate) are left untouched per scope discipline; they were not the source of the new-code duplication delta.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aalemayhu
added a commit
that referenced
this pull request
May 9, 2026
PR #2075 merged with the SonarCloud Code Analysis check at conclusion=FAILURE because that check is not on the GitHub branch protection required-checks list, and neither safety.py nor check-commit-message.py inspect PR check status before `gh pr merge`. Sonar's quality-gate failure (4.1% duplication on new code, C security rating) went uncaught.
Add a third PreToolUse hook, check-merge-status.py, that:
* Detects `gh pr merge` invocations in several command shapes (positional number, --rebase first, full PR URL, no PR -> let gh resolve from the current branch).
* Calls `gh pr view <ref> --json statusCheckRollup` and inspects every entry, not just named-required ones.
* If any conclusion == "FAILURE", denies the tool call with a bullet list of the failing check names and a hint to bypass via CLAUDE_SKIP_SAFETY=1.
* On gh / network / parse errors, prints a one-line warning to stderr and exits 0; never breaks a legitimate merge over a transient API issue.
Wired into .claude/settings.json alongside safety.py and check-commit-message.py under the same Bash PreToolUse matcher.
Manual verification:
* PR 2075 (FAILURE on SonarCloud) -> deny.
* PR 2074 (all green) -> allow.
* Non-merge command `ls` -> allow.
* --rebase 2075 / URL form -> deny (same).
* CLAUDE_SKIP_SAFETY=1 bypass -> allow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
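The deny/allow decision at the heart of the hook can be sketched as a pure function. The real hook is a Python script (check-merge-status.py); the TypeScript below only illustrates the logic and is not a port of it. The `RollupEntry` fields are an assumed subset of what `gh pr view --json statusCheckRollup` returns, and `decideMerge` is a hypothetical name.

```typescript
// Assumed subset of a statusCheckRollup entry: check runs carry `name`,
// commit statuses carry `context`.
interface RollupEntry {
  name?: string;
  context?: string;
  conclusion?: string | null;
}

type Decision = { allow: true } | { allow: false; failing: string[] };

function decideMerge(rollup: RollupEntry[], skipSafety: boolean): Decision {
  // Explicit bypass, mirroring CLAUDE_SKIP_SAFETY=1.
  if (skipSafety) return { allow: true };
  // Inspect every entry, not only the ones branch protection requires.
  const failing = rollup
    .filter((entry) => entry.conclusion === 'FAILURE')
    .map((entry) => entry.name ?? entry.context ?? '(unnamed check)');
  return failing.length > 0 ? { allow: false, failing } : { allow: true };
}
```

Keeping the decision pure leaves the fail-open behaviour (warn and exit 0 on gh or parse errors) to the thin layer that shells out to `gh`.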
Summary
- `/ops` dashboard for Al with two tabs:
  - Engineering: two append-only tables (`request_logs`, `outbound_call_logs`) populated by a global Express middleware and an opt-in `instrumentedAxios` wrapper.
  - Business: Stripe metrics built from a single paginated `subscriptions.list({status:'all'})` walk plus a 12-week `invoices.list` walk, cached for 15 min per metric.
- Writes go through an `ObservabilitySink` that batches every 5s or 100 rows; the request path takes a single `Date.now()` and a `res.on('finish')` callback: no blocking work, no impact on user-facing latency.
- Recharts ships only in the lazy-loaded chunk (`/ops`); the dashboard auto-refreshes every 30s, pauses while the tab is hidden, and falls back to the last good snapshot on a transient API error so the charts don't blank out. The Business tab mirrors the same visual treatment: `ChartPanel` wrapper, 220px chart frame, snapshot-during-refetch, no `<pre>` JSON anywhere.
- Gated by `RequireOpsAccess` returning 404 (not 403; we don't reveal the dashboard exists). The matching `features.ops` flag drives the navbar entry, hidden for everyone except the ops owner.

Spec / design
- `Documentation/ops-observability/SPEC.md`
- `Documentation/ops-observability/DESIGN.md`

Test plan
- `pnpm test` for the new files (41/41 green for engineering side; 14/14 for `BusinessMetricsService` including time-series cases; 5/5 for `BusinessTab.test.tsx`).
- `pnpm tsc -p .` clean.
- `pnpm --filter 2anki-web typecheck` clean.
- `pnpm --filter 2anki-web test` clean (72/72).
- `pnpm --filter 2anki-web lint` clean (Biome).
- `pnpm --filter 2anki-web build` produces a separate `OpsPage-*.js` chunk (~113KB gzip) so Recharts is not paid for outside `/ops`.
- `/ops` (Engineering): four charts, window dropdown, 30s tick.
- `/ops/business`: eyeball the six cards, then cross-check MRR and active-paying-subs against the Stripe dashboard. Don't declare success until those two numbers line up to within rounding.

Risks
- Both tables are indexed on `(created_at desc)` and `(<key>, created_at desc)`; size grows linearly with traffic. Without a retention/rotation cron we'll need to revisit once we have a week of real volume (see follow-ups).
- A failed sink flush goes to `console.error` rather than retry, which is the right call (instrumentation must never threaten the request path).
- `canceled_at` is set when a subscription is requested canceled (potentially before period end), while `ended_at` is the actual termination time. We treat `endedAt ?? canceledAt` as the historical "active until" boundary, and we explicitly do not try to reconstruct trial→active transitions (current `status === 'trialing'` excludes the sub from time series; a v3 problem).
- Rollback: `migrations/<...>_observability.js` down. The middleware and `instrumentedAxios` wrapper become no-ops if the tables vanish (sink swallows insert errors and continues). Reverting also drops the Business tab; nothing else in the app depends on it.

Future work (intentionally not in this PR)
- … `trialing`. Reconstructing the moment a trial converted requires walking subscription events, not just the sub list.

Goal alignment
Scaling toward 300K users requires we know (1) what's slow and what's broken on the engineering side and (2) what's happening to revenue on the business side, before users or the bank tell us. This PR is the foundation that makes every later perf/reliability and pricing bet data-driven instead of vibes-driven, with zero user-facing surface area added.

🤖 Generated with Claude Code