Skip to content

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802

Open
nollymar wants to merge 5 commits into
mainfrom
worktree-qa-stuck-prs-automation
Open

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802
nollymar wants to merge 5 commits into
mainfrom
worktree-qa-stuck-prs-automation

Conversation

@nollymar
Copy link
Copy Markdown
Member

Summary

  • New scheduled workflow cicd_scheduled_qa-stuck-check.yml runs every 3 days at 13:00 UTC and pings each team's Slack channel about issues stuck in QA.
  • An issue is considered stuck when, on the dotCMS - Product Planning project (testing assign #7), it has Status QA or Done, lacks both QA : Passed and QA : Not Needed labels, and the project item has not changed in 3+ days.
  • Team is resolved from the issue's Team : <name> label only. Team→channel mapping was added to .claude/triage-config.json.

Behavior

  • No issues stuck → the notify job is skipped entirely; no Slack messages are sent.
  • Manual dispatch → defaults to dry_run=true; rendered messages are printed to the job summary instead of posting. Override stuck_days to test other thresholds.
  • Scheduled run → posts to each team channel via chat.postMessage using SLACK_BOT_TOKEN. Per-team posts run in a matrix so one Slack failure doesn't block the others.

Auth note

The workflow queries an org-owned ProjectV2 board, which the default GITHUB_TOKEN cannot read. It uses secrets.CI_MACHINE_TOKEN (already used by cicd_scheduled_notify-seated-prs.yml and the slack-channel-resolver). If that token lacks the read:project scope, the first dry-run will surface a 401 and we'll need to grant it.

Test plan

  • node --check on the script
  • actionlint on the workflow
  • Local smoke test (node .github/scripts/qa-stuck-check/test-find-stuck-issues.js) with a fake project payload — verifies Status / label / staleness / team-resolution filters and the per-team grouping
  • Trigger workflow_dispatch with dry_run=true on this branch and review the job summary against the actual project state
  • Confirm Status field options on project testing assign #7 are exactly QA and Done (case-insensitive match in the script, but spelling must agree)
  • Flip dry_run=false once for a live end-to-end Slack test against the team channels before letting the schedule run unattended

🤖 Generated with Claude Code

Adds a workflow that runs every 3 days at 13:00 UTC, queries the
dotCMS - Product Planning project (#7) for issues in QA/Done status
that lack a QA : Passed / QA : Not Needed label and have not changed
in 3+ days, and pings the responsible team's Slack channel.

Team is resolved from the issue's "Team : <name>" label; team→channel
mapping lives in .claude/triage-config.json. Manual dispatch supports
dry-run (default) so the rendered messages can be reviewed before any
real Slack post. If no issues qualify, the notify job is skipped and
no message is sent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

❌ Issue Linking Required

This PR could not be linked to an issue. All PRs must be linked to an issue for tracking purposes.

How to fix this:

Option 1: Add keyword to PR body (Recommended - auto-removes this comment)
Edit this PR description and add one of these lines:

  • This PR fixes #123 or Fixes: #123

  • This PR closes #123 or Closes: #123

  • This PR resolves #123 or Resolves: #123

  • Other supported keywords: fix, fixed, close, closed, resolve, resolved
    Option 2: Link via GitHub UI (Note: won't clear the failed check)

  1. Go to the PR → Development section (right sidebar)

  2. Click "Link issue" and select an existing issue

  3. Push a new commit or re-run the workflow to clear the failed check
    Option 3: Use branch naming
    Create a new branch with one of these patterns:

  • 123-feature-description (number at start)

  • issue-123-feature-description (issue-number at start)

  • feature-issue-123 (issue-number anywhere)

Why is this required?

Issue linking ensures proper tracking, documentation, and helps maintain project history. It connects your code changes to the problem they solve.---

This comment was automatically generated by the issue linking workflow

@github-actions github-actions Bot added the Area : CI/CD PR changes GitHub Actions/workflows label May 21, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 21, 2026

Claude finished @nollymar's task in 3m 6s —— View job


Review

  • Read the 5 changed files
  • Review find-stuck-issues.js
  • Review cicd_scheduled_qa-stuck-check.yml
  • Review cicd_pr_qa-stuck-check-validate.yml
  • Review test-find-stuck-issues.js
  • Review triage-config.json
  • Post review

Overall this is in good shape — the iterative review rounds have already burned down most of the obvious issues. A few things still look risky or worth a second look before this runs unattended.


🟠 Things to fix before live posting

1. Hidden coupling between TARGET_STATUSES and the CLOSED filterfind-stuck-issues.js:87

if (issue.state === 'CLOSED' && status.toLowerCase() !== 'done') continue;

'done' is hardcoded here, but TARGET_STATUSES is configurable via env. If someone ever sets TARGET_STATUSES='QA,Verified' (or renames the Done column), this gate silently drops every CLOSED issue in the new status — including the ones that should be flagged. Two options: (a) drop the env config and just hardcode the whole set since both ends are hardcoded anyway, or (b) introduce something like DONE_STATUS_NAMES and derive the gate from it.

Fix this →

2. The stuck_days != '' workaround may not actually catch 0cicd_scheduled_qa-stuck-check.yml:40

STUCK_DAYS: ${{ inputs.stuck_days != '' && inputs.stuck_days || '3' }}

The comment claims this lets 0 through to trip validateConfig. With GH Actions expression semantics:

  • If inputs.stuck_days arrives as the number 0: 0 != '' is true → true && 000 || '3''3' (because 0 is falsy in ||). Coerced back to 3.
  • If it arrives as the string '0': '0' != '' is true → '0' is truthy → returns '0'. Works.

Whether type: number inputs surface as numeric 0 or string '0' in the expression engine is exactly the ambiguity this was meant to remove — and the test suite stubs the script with STUCK_DAYS='0' directly, so it never exercises the YAML path. Worth one explicit workflow_dispatch with stuck_days=0 to confirm the script actually receives '0' (or change the expression to something unambiguous like format('{0}', inputs.stuck_days) != '' && inputs.stuck_days || '3').

3. fieldValues(first: 50) has no pagination guardfind-stuck-issues.js:255

Unlike the labels query (which returns pageInfo and skips overflowing issues with a warning), fieldValues is fetched with first: 50 and no pageInfo. If the Product Planning board ever crosses 50 fields per item (or even fewer if some return arrays of options), Status falls off and readStatus returns null. The STATUS_PRESENCE_MIN_RATIO=0.5 heuristic catches a global cliff but won't catch a tail — e.g., 70% of items expose Status and 30% don't, you fall under the ratio quietly via the warning path. Either fetch pageInfo and skip overflowing items the same way you do for labels, or bump the cap to GraphQL's hard ceiling (100) and call out the boundary in a comment.


🟡 Worth a second look

4. Slack mrkdwn escaping is incompletecicd_scheduled_qa-stuck-check.yml:105-108

esc() covers & < >. Slack's formatting rules say those three are the only ones that need HTML-escaping, so this is technically correct for raw text. But the message embeds titles inside <URL|text> link syntax and inside _..._ italics. A title containing a literal | will break the link target, and titles with * / _ / backticks render with unintended formatting. The author flagged this as judgment-call territory ("display-quality not correctness") and I agree — but issue titles in the wild do contain pipes (Fix X | Y is common). If you've never seen a broken Slack link, fine; if you have, swap || or use block kit which doesn't require this escaping dance.

5. notify-teams matrix expansion when groups_json is empty

The if: gate uses has_results == 'true', but strategy.matrix.group is also evaluated unconditionally. fromJSON('[]') produces a zero-element matrix which expands to zero jobs — that does work in practice and the if: makes it defensive, but I'd add a one-line comment noting the dual-gate intent so a future editor doesn't drop the if: thinking it's redundant.

6. daysStuck in the test fixtures is fragile near midnight UTC

testMainScenario does exact equality checks like byTeam['Team : Scout'].issues[0].number === 107 — that's a sort order assertion (107 has 6d, 100 has 5d). It works today, but daysAgo(5) computed near a DST boundary (and then Math.floord against an integer-days threshold) can flip the ordering occasionally on CI. Not a correctness issue with the script, but PR-validate could flake. Consider asserting on Set-equality or stuckDays >= threshold rather than exact ordering.

7. assertStatusFieldVisible runs on the full item set, after content typename filter, but before the target-status filterfind-stuck-issues.js:71

That's intentional and correct (it's a sanity gate, not a result filter), but it means a board with mostly closed-Done items will still trigger the warning path when their Status field shows up normally. No bug — just flagging because if you ever filter the query at the GraphQL level (status: QA), the assertion premise changes.


🟢 Nits / observed, not flagging

  • assignees(first: 20) — display-only, already acknowledged.
  • PROJECT_NUMBER not validateConfig-checked — hardcoded in YAML, low likelihood.
  • 4-day Mon→Thu gap vs claimed "every 3 days" — acknowledged in comments.
  • Slack channel IDs in job summary on dry-run — internal info, not secrets.
  • No 429 retry on Slack — low cadence, fine.
  • set -euo pipefail in the post step — the default shell already gives -eo pipefail; adding -u would only help if a future edit referenced an unset var, marginal.

Test coverage looks solid

11 cases including the edge ones (multi-team, CLOSED+QA, label overflow, non-Issue content not tripping the ratio, null project, invalid STUCK_DAYS). Two coverage gaps worth adding:

  • A test that goes through the YAML coalesce path for stuck_days=0 (would require workflow-level testing, or just a comment in the YAML clarifying that the test suite doesn't cover it).
  • A test for TARGET_STATUSES override interacting with the CLOSED filter (would surface issue update with latest SVN #1).
    · branch

- Cron now '0 13 * * 1,4' (Mon/Thu) instead of '*/3' day-of-month,
  which produced uneven cadence across month boundaries.
- Bumped GraphQL caps: fieldValues 20->50, labels 30->50, assignees
  10->20, so the Status / Team labels can't silently fall off the page.
- Added a Status-field presence assertion: fails the run if fewer than
  half of items expose a readable Status (surfaces schema mismatches in
  dry-run instead of returning empty results).
- Slack message wording now says "no project activity for Nd" / "last
  project update Nd ago" rather than "stuck for Nd", reflecting that
  ProjectV2Item.updatedAt is a proxy and any project-field edit resets
  the clock. Same caveat in the job summary.
- Dropped "(cc @user)" framing: Slack rendered it as plain text, not
  an actual mention. Assignees are still listed for context.
- Multiple "Team : X" labels on one issue now report the issue under
  each matched team (previously only the first label-order match got
  notified).
- Added a guard for projectV2 == null (renamed project / token without
  read:project scope) so dry-runs fail with a clear error rather than
  a "Cannot read properties of null" stack.
- stuck_days is now embedded on each group; matrix builder no longer
  needs to re-resolve the env var.
- Test suite expanded to cover: empty project, multi-team labels,
  missing Status field (fail loudly), partial-missing Status (warn,
  continue), null projectV2, and a team configured without a slack
  channel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nollymar
Copy link
Copy Markdown
Member Author

Addressed in 93afc3f:

  • 🔴 Cron0 13 */3 * *0 13 * * 1,4 (Mon/Thu 13:00 UTC). Even cadence, no month-boundary glitches.
  • 🟠 fieldValues(first: 20) — bumped to 50; labels 30→50; assignees 10→20. Added a presence assertion that fails the run when fewer than 50% of items expose a readable Status (surfaces "wrong field name" or "still capped" in dry-run rather than silently returning empty).
  • 🟠 updatedAt semantics — Slack header now reads "N QA/Done issue(s) with no project activity for Xd+ days" and each line "last project update Nd ago"; trailing paragraph spells out the caveat: "Metric: time since the project item last changed (any field). May undercount if other fields were edited recently." Same wording in the job-summary.
  • 🟡 Slack @-cc — dropped, since @name was plain text not a mention. Assignees still listed (· assignees: foo, bar).
  • 🟡 Multi Team label — issue now reports under every matched team, not just first-in-label-order. Test case added.
  • 🟡 STUCK_DAYS propagation — embedded on each group (stuck_days: N); matrix job reads group.stuck_days instead of re-resolving via workflow env.
  • 🟢 Null-project guard — explicit setFailed with a "token may lack read:project scope" hint instead of Cannot read properties of null. Test case added.
  • 🟢 Fragile channel deref — channel is now stored on the team accumulator at insert time, removing the teamChannels[team] re-lookup in grouping.
  • 🟢 Tests — expanded from 1 scenario to 7: main, empty project, multi-team labels, missing-Status loud failure, partial-Status warn-and-continue, null project, team-without-slack-channel ignored. node test-find-stuck-issues.js runs them all; actionlint clean.

Second pass on review feedback.

- Validate STUCK_DAYS up front: must be a finite integer >= 1 or the
  run fails. Without this, parseInt('abc') -> NaN -> cutoff invalid ->
  every QA/Done item passes the staleness gate (false positive
  storm). Same risk for 0 / -1, which slid cutoff to now/future.
- workflow_dispatch.inputs.stuck_days is now type: number so GH
  rejects non-numeric input at dispatch time.
- Labels page-size: bumped to 100 and the query now returns pageInfo.
  Issues whose labels overflow page 1 are skipped with a warning
  rather than evaluated against an incomplete label set (the
  QA : Passed gate could otherwise produce false positives).
- Concurrency group "qa-stuck-check" with cancel-in-progress: false
  so a manual dry-run can't overlap a scheduled live post.
- Escape Slack mrkdwn-sensitive chars (&, <, >) in team / status /
  title / assignee strings so a title like "Fix <X> & <Y>" doesn't
  break the <url|text> link syntax. jq --arg still handles JSON
  escaping; this is the orthogonal mrkdwn layer.
- Tests: added invalid-STUCK_DAYS path and labels-overflow path.
  Suite is now 9 cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nollymar
Copy link
Copy Markdown
Member Author

Second-review pass landed in 18197c7:

Bugs / risks

  1. parseInt(STUCK_DAYS) unguarded — added validateConfig: finite integer ≥ 1 or the run fails up front. workflow_dispatch.inputs.stuck_days switched from string to number so GH rejects bad input at dispatch time. Test asserts abc / 0 / -1 all throw (empty stays valid since it's how scheduled triggers signal "use default").
  2. Labels pagination — bumped labels(first: 50)labels(first: 100), query now returns pageInfo, and issues with hasNextPage: true are skipped with a warning rather than evaluated against an incomplete label set. Test asserts the skip + warning.
  3. Concurrency — added concurrency: { group: qa-stuck-check, cancel-in-progress: false }. Manual dry-run can't race the scheduled run.
  4. updatedAt proxy false-negative shape (Priority/Iteration bumps reset the clock) — acknowledged limitation; no signal change short of a different data source. Open to revisiting cadence (weekly × 7-day vs Mon/Thu × 3-day) but leaving as-is for now since the wording already calls out "any project-field edit resets the clock".
  5. Slack mrkdwn escaping&, <, > now escaped in team / status / title / assignee text via an esc helper. jq --arg continues to handle JSON-level escaping; this fixes the orthogonal mrkdwn layer (e.g. titles like Fix <X> & <Y>).

Smaller stuff
6. ✅ inputs.stuck_days type → number.
7. ⏸ STATUS_PRESENCE_MIN_RATIO left hardcoded — one knob, internal heuristic, not worth widening the env surface.
8. ⏸ DST drift in daysStuck — acknowledged, no action.
9-10. ⏸ No action (existing convention).
11. ✅ Tests for NaN/invalid stuck_days and >100 labels both added. Suite is now 9 cases, all green.

actionlint and node --check clean; full test suite passes.

Addresses the two items the second review flagged as fix-before-merge.

- find-stuck-issues now drops items where state == 'CLOSED' and the
  project Status is anything other than Done. A closed issue parked
  in Status=QA is board-cleanup (closed-as-completed without moving
  the card, or closed-as-not-planned without a QA : Not Needed label)
  — pinging the team about it is noise. Added a test case asserting
  CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept.
- New workflow cicd_pr_qa-stuck-check-validate.yml runs
  `node .github/scripts/qa-stuck-check/test-find-stuck-issues.js`
  on PRs that touch the script, scheduled workflow, this validator
  workflow, or the triage config. Sub-second job (zero npm deps).
  Without this, regressions would only surface on the next Mon/Thu
  13:00 UTC scheduled run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nollymar
Copy link
Copy Markdown
Member Author

Addressed the two items called out as "actually fix before merging" in b0f674cb07:

2. ✅ Issue.state drift — find-stuck-issues now drops items where state == 'CLOSED' and project Status is anything other than Done. Rationale: a closed issue parked in Status=QA is board-cleanup (closed-as-completed without moving the card, or closed-as-not-planned without QA : Not Needed), not pending QA. Test case asserts CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept.

3. ✅ Tests not in CI — added cicd_pr_qa-stuck-check-validate.yml. Path-filtered pull_request trigger on the script/workflow/config; checks out the relevant files, sets up Node from .nvmrc, runs the smoke test. Zero npm deps, sub-second job. Concurrency group cancels in-progress on synchronize.

The 🟡/🟢 items I'm leaving as judgment calls (happy to address any if you want):

  • (1) YAML comment clarifying schedule-doesn't-dry-run — fair, can add.
  • (4) assignees first: 20 no-overflow guard — display-only impact.
  • (5) parseInt(PROJECT_NUMBER) validation — env is hard-coded in the YAML, lower likelihood of a typo than user input.
  • (6) set -euo pipefail in the Slack post step — defensive, harmless.
  • (8) Number.isInteger(STUCK_DAYS) to reject decimals — minor.
  • (9) Project-not-found error fires after retries: 3 — known limitation of the github-script wrapper; acceptable since the result is "fail with clear message after a few seconds."

Test suite is now 10 cases, all green; actionlint clean on both workflows.

…uck_days=0

Addresses the two highest-impact findings from the latest review.

- assertStatusFieldVisible now filters to content.__typename === 'Issue'
  before computing the Status-presence ratio. The previous version
  included PullRequest / DraftIssue items in the denominator, which can
  legitimately lack a Status value on this board — a project with even
  a handful of PRs added to it would have failed the ratio guard with
  a misleading "field falls outside fieldValues page" error. Test added
  asserting that a single Issue+Status alongside several DraftIssues
  passes cleanly.
- STUCK_DAYS env in the workflow now uses an explicit empty-string check
  instead of `||`. GH Actions evaluates `0 || '3'` to `'3'`, so a manual
  dispatch with stuck_days=0 was silently coerced to the default 3
  instead of failing validateConfig with a clear error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nollymar
Copy link
Copy Markdown
Member Author

Addressed the two highest-impact findings from the latest review in 6a91742913:

🟠 assertStatusFieldVisible scoped to Issue items — denominator now filters to content.__typename === 'Issue'. PRs / DraftIssues no longer dilute the ratio. Test testNonIssueContentDoesNotTripRatio asserts a board with 1 Issue + 4 DraftIssues passes cleanly (previously would have failed with a misleading "field outside fieldValues page" error).

🟡 stuck_days=0 YAML coalesce bypass${{ inputs.stuck_days || '3' }}${{ inputs.stuck_days != '' && inputs.stuck_days || '3' }}. Explicit empty-string check means 0 propagates to the script and trips validateConfig with the clear "must be >= 1" message, rather than being silently coerced to 3.

Left as judgment calls:

  • 🟡 esc only covering & < > — Slack's spec is satisfied; */_/backtick are display-quality not correctness.
  • 🟡 Job-summary markdown not escaped — internal-only render, lower stakes.
  • 🟢 set -euo pipefail in the Slack post step — already acknowledged, marginal benefit.
  • 🟢 429 retry — single-shot is fine at ~4 teams Mon/Thu cadence.

Suite is 11 cases, all green; actionlint clean.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI: Safe To Rollback Area : CI/CD PR changes GitHub Actions/workflows

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant