feat(ci): scheduled QA-stuck issue check with Slack notifications by nollymar · Pull Request #35802 · dotCMS/core

nollymar · 2026-05-21T22:51:45Z

Summary

New scheduled workflow cicd_scheduled_qa-stuck-check.yml runs every 3 days at 13:00 UTC and pings each team's Slack channel about issues stuck in QA.
An issue is considered stuck when, on the dotCMS - Product Planning project (testing assign #7), it has Status QA or Done, lacks both QA : Passed and QA : Not Needed labels, and the project item has not changed in 3+ days.
Team is resolved from the issue's Team : <name> label only. Team→channel mapping was added to .claude/triage-config.json.

Behavior

No issues stuck → the notify job is skipped entirely; no Slack messages are sent.
Manual dispatch → defaults to dry_run=true; rendered messages are printed to the job summary instead of posting. Override stuck_days to test other thresholds.
Scheduled run → posts to each team channel via chat.postMessage using SLACK_BOT_TOKEN. Per-team posts run in a matrix so one Slack failure doesn't block the others.

Auth note

The workflow queries an org-owned ProjectV2 board, which the default GITHUB_TOKEN cannot read. It uses secrets.CI_MACHINE_TOKEN (already used by cicd_scheduled_notify-seated-prs.yml and the slack-channel-resolver). If that token lacks the read:project scope, the first dry-run will surface a 401 and we'll need to grant it.

Test plan

node --check on the script
actionlint on the workflow
Local smoke test (node .github/scripts/qa-stuck-check/test-find-stuck-issues.js) with a fake project payload — verifies Status / label / staleness / team-resolution filters and the per-team grouping
Trigger workflow_dispatch with dry_run=true on this branch and review the job summary against the actual project state
Confirm Status field options on project testing assign #7 are exactly QA and Done (case-insensitive match in the script, but spelling must agree)
Flip dry_run=false once for a live end-to-end Slack test against the team channels before letting the schedule run unattended

🤖 Generated with Claude Code

Adds a workflow that runs every 3 days at 13:00 UTC, queries the dotCMS - Product Planning project (#7) for issues in QA/Done status that lack a QA : Passed / QA : Not Needed label and have not changed in 3+ days, and pings the responsible team's Slack channel. Team is resolved from the issue's "Team : <name>" label; team→channel mapping lives in .claude/triage-config.json. Manual dispatch supports dry-run (default) so the rendered messages can be reviewed before any real Slack post. If no issues qualify, the notify job is skipped and no message is sent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T22:51:55Z

❌ Issue Linking Required

This PR could not be linked to an issue. All PRs must be linked to an issue for tracking purposes.

How to fix this:

Option 1: Add keyword to PR body (Recommended - auto-removes this comment)
Edit this PR description and add one of these lines:

This PR fixes #123 or Fixes: #123
This PR closes #123 or Closes: #123
This PR resolves #123 or Resolves: #123
Other supported keywords: fix, fixed, close, closed, resolve, resolved
Option 2: Link via GitHub UI (Note: won't clear the failed check)

Go to the PR → Development section (right sidebar)
Click "Link issue" and select an existing issue
Push a new commit or re-run the workflow to clear the failed check
Option 3: Use branch naming
Create a new branch with one of these patterns:

123-feature-description (number at start)
issue-123-feature-description (issue-number at start)
feature-issue-123 (issue-number anywhere)

Why is this required?

Issue linking ensures proper tracking, documentation, and helps maintain project history. It connects your code changes to the problem they solve.---

This comment was automatically generated by the issue linking workflow

claude · 2026-05-21T22:52:33Z

Claude finished @nollymar's task in 3m 6s —— View job

Review

Read the 5 changed files
Review find-stuck-issues.js
Review cicd_scheduled_qa-stuck-check.yml
Review cicd_pr_qa-stuck-check-validate.yml
Review test-find-stuck-issues.js
Review triage-config.json
Post review

Overall this is in good shape — the iterative review rounds have already burned down most of the obvious issues. A few things still look risky or worth a second look before this runs unattended.

🟠 Things to fix before live posting

1. Hidden coupling between TARGET_STATUSES and the CLOSED filter — find-stuck-issues.js:87

if (issue.state === 'CLOSED' && status.toLowerCase() !== 'done') continue;

'done' is hardcoded here, but TARGET_STATUSES is configurable via env. If someone ever sets TARGET_STATUSES='QA,Verified' (or renames the Done column), this gate silently drops every CLOSED issue in the new status — including the ones that should be flagged. Two options: (a) drop the env config and just hardcode the whole set since both ends are hardcoded anyway, or (b) introduce something like DONE_STATUS_NAMES and derive the gate from it.

Fix this →

2. The stuck_days != '' workaround may not actually catch 0 — cicd_scheduled_qa-stuck-check.yml:40

STUCK_DAYS: ${{ inputs.stuck_days != '' && inputs.stuck_days || '3' }}

The comment claims this lets 0 through to trip validateConfig. With GH Actions expression semantics:

If inputs.stuck_days arrives as the number 0: 0 != '' is true → true && 0 → 0 → 0 || '3' → '3' (because 0 is falsy in ||). Coerced back to 3.
If it arrives as the string '0': '0' != '' is true → '0' is truthy → returns '0'. Works.

Whether type: number inputs surface as numeric 0 or string '0' in the expression engine is exactly the ambiguity this was meant to remove — and the test suite stubs the script with STUCK_DAYS='0' directly, so it never exercises the YAML path. Worth one explicit workflow_dispatch with stuck_days=0 to confirm the script actually receives '0' (or change the expression to something unambiguous like format('{0}', inputs.stuck_days) != '' && inputs.stuck_days || '3').

3. fieldValues(first: 50) has no pagination guard — find-stuck-issues.js:255

Unlike the labels query (which returns pageInfo and skips overflowing issues with a warning), fieldValues is fetched with first: 50 and no pageInfo. If the Product Planning board ever crosses 50 fields per item (or even fewer if some return arrays of options), Status falls off and readStatus returns null. The STATUS_PRESENCE_MIN_RATIO=0.5 heuristic catches a global cliff but won't catch a tail — e.g., 70% of items expose Status and 30% don't, you fall under the ratio quietly via the warning path. Either fetch pageInfo and skip overflowing items the same way you do for labels, or bump the cap to GraphQL's hard ceiling (100) and call out the boundary in a comment.

🟡 Worth a second look

4. Slack mrkdwn escaping is incomplete — cicd_scheduled_qa-stuck-check.yml:105-108

esc() covers & < >. Slack's formatting rules say those three are the only ones that need HTML-escaping, so this is technically correct for raw text. But the message embeds titles inside <URL|text> link syntax and inside _..._ italics. A title containing a literal | will break the link target, and titles with * / _ / backticks render with unintended formatting. The author flagged this as judgment-call territory ("display-quality not correctness") and I agree — but issue titles in the wild do contain pipes (Fix X | Y is common). If you've never seen a broken Slack link, fine; if you have, swap | → | or use block kit which doesn't require this escaping dance.

5. notify-teams matrix expansion when groups_json is empty

The if: gate uses has_results == 'true', but strategy.matrix.group is also evaluated unconditionally. fromJSON('[]') produces a zero-element matrix which expands to zero jobs — that does work in practice and the if: makes it defensive, but I'd add a one-line comment noting the dual-gate intent so a future editor doesn't drop the if: thinking it's redundant.

6. daysStuck in the test fixtures is fragile near midnight UTC

testMainScenario does exact equality checks like byTeam['Team : Scout'].issues[0].number === 107 — that's a sort order assertion (107 has 6d, 100 has 5d). It works today, but daysAgo(5) computed near a DST boundary (and then Math.floord against an integer-days threshold) can flip the ordering occasionally on CI. Not a correctness issue with the script, but PR-validate could flake. Consider asserting on Set-equality or stuckDays >= threshold rather than exact ordering.

7. assertStatusFieldVisible runs on the full item set, after content typename filter, but before the target-status filter — find-stuck-issues.js:71

That's intentional and correct (it's a sanity gate, not a result filter), but it means a board with mostly closed-Done items will still trigger the warning path when their Status field shows up normally. No bug — just flagging because if you ever filter the query at the GraphQL level (status: QA), the assertion premise changes.

🟢 Nits / observed, not flagging

assignees(first: 20) — display-only, already acknowledged.
PROJECT_NUMBER not validateConfig-checked — hardcoded in YAML, low likelihood.
4-day Mon→Thu gap vs claimed "every 3 days" — acknowledged in comments.
Slack channel IDs in job summary on dry-run — internal info, not secrets.
No 429 retry on Slack — low cadence, fine.
set -euo pipefail in the post step — the default shell already gives -eo pipefail; adding -u would only help if a future edit referenced an unset var, marginal.

Test coverage looks solid

11 cases including the edge ones (multi-team, CLOSED+QA, label overflow, non-Issue content not tripping the ratio, null project, invalid STUCK_DAYS). Two coverage gaps worth adding:

A test that goes through the YAML coalesce path for stuck_days=0 (would require workflow-level testing, or just a comment in the YAML clarifying that the test suite doesn't cover it).
A test for TARGET_STATUSES override interacting with the CLOSED filter (would surface issue update with latest SVN #1).
· branch

@user

- Cron now '0 13 * * 1,4' (Mon/Thu) instead of '*/3' day-of-month, which produced uneven cadence across month boundaries. - Bumped GraphQL caps: fieldValues 20->50, labels 30->50, assignees 10->20, so the Status / Team labels can't silently fall off the page. - Added a Status-field presence assertion: fails the run if fewer than half of items expose a readable Status (surfaces schema mismatches in dry-run instead of returning empty results). - Slack message wording now says "no project activity for Nd" / "last project update Nd ago" rather than "stuck for Nd", reflecting that ProjectV2Item.updatedAt is a proxy and any project-field edit resets the clock. Same caveat in the job summary. - Dropped "(cc @user)" framing: Slack rendered it as plain text, not an actual mention. Assignees are still listed for context. - Multiple "Team : X" labels on one issue now report the issue under each matched team (previously only the first label-order match got notified). - Added a guard for projectV2 == null (renamed project / token without read:project scope) so dry-runs fail with a clear error rather than a "Cannot read properties of null" stack. - stuck_days is now embedded on each group; matrix builder no longer needs to re-resolve the env var. - Test suite expanded to cover: empty project, multi-team labels, missing Status field (fail loudly), partial-missing Status (warn, continue), null projectV2, and a team configured without a slack channel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nollymar · 2026-05-21T23:07:56Z

Addressed in 93afc3f:

🔴 Cron — 0 13 */3 * * → 0 13 * * 1,4 (Mon/Thu 13:00 UTC). Even cadence, no month-boundary glitches.
🟠 fieldValues(first: 20) — bumped to 50; labels 30→50; assignees 10→20. Added a presence assertion that fails the run when fewer than 50% of items expose a readable Status (surfaces "wrong field name" or "still capped" in dry-run rather than silently returning empty).
🟠 updatedAt semantics — Slack header now reads "N QA/Done issue(s) with no project activity for Xd+ days" and each line "last project update Nd ago"; trailing paragraph spells out the caveat: "Metric: time since the project item last changed (any field). May undercount if other fields were edited recently." Same wording in the job-summary.
🟡 Slack @-cc — dropped, since @name was plain text not a mention. Assignees still listed (· assignees: foo, bar).
🟡 Multi Team label — issue now reports under every matched team, not just first-in-label-order. Test case added.
🟡 STUCK_DAYS propagation — embedded on each group (stuck_days: N); matrix job reads group.stuck_days instead of re-resolving via workflow env.
🟢 Null-project guard — explicit setFailed with a "token may lack read:project scope" hint instead of Cannot read properties of null. Test case added.
🟢 Fragile channel deref — channel is now stored on the team accumulator at insert time, removing the teamChannels[team] re-lookup in grouping.
🟢 Tests — expanded from 1 scenario to 7: main, empty project, multi-team labels, missing-Status loud failure, partial-Status warn-and-continue, null project, team-without-slack-channel ignored. node test-find-stuck-issues.js runs them all; actionlint clean.

Second pass on review feedback. - Validate STUCK_DAYS up front: must be a finite integer >= 1 or the run fails. Without this, parseInt('abc') -> NaN -> cutoff invalid -> every QA/Done item passes the staleness gate (false positive storm). Same risk for 0 / -1, which slid cutoff to now/future. - workflow_dispatch.inputs.stuck_days is now type: number so GH rejects non-numeric input at dispatch time. - Labels page-size: bumped to 100 and the query now returns pageInfo. Issues whose labels overflow page 1 are skipped with a warning rather than evaluated against an incomplete label set (the QA : Passed gate could otherwise produce false positives). - Concurrency group "qa-stuck-check" with cancel-in-progress: false so a manual dry-run can't overlap a scheduled live post. - Escape Slack mrkdwn-sensitive chars (&, <, >) in team / status / title / assignee strings so a title like "Fix <X> & <Y>" doesn't break the <url|text> link syntax. jq --arg still handles JSON escaping; this is the orthogonal mrkdwn layer. - Tests: added invalid-STUCK_DAYS path and labels-overflow path. Suite is now 9 cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nollymar · 2026-05-21T23:44:15Z

Second-review pass landed in 18197c7:

Bugs / risks

✅ parseInt(STUCK_DAYS) unguarded — added validateConfig: finite integer ≥ 1 or the run fails up front. workflow_dispatch.inputs.stuck_days switched from string to number so GH rejects bad input at dispatch time. Test asserts abc / 0 / -1 all throw (empty stays valid since it's how scheduled triggers signal "use default").
✅ Labels pagination — bumped labels(first: 50) → labels(first: 100), query now returns pageInfo, and issues with hasNextPage: true are skipped with a warning rather than evaluated against an incomplete label set. Test asserts the skip + warning.
✅ Concurrency — added concurrency: { group: qa-stuck-check, cancel-in-progress: false }. Manual dry-run can't race the scheduled run.
⏸ updatedAt proxy false-negative shape (Priority/Iteration bumps reset the clock) — acknowledged limitation; no signal change short of a different data source. Open to revisiting cadence (weekly × 7-day vs Mon/Thu × 3-day) but leaving as-is for now since the wording already calls out "any project-field edit resets the clock".
✅ Slack mrkdwn escaping — &, <, > now escaped in team / status / title / assignee text via an esc helper. jq --arg continues to handle JSON-level escaping; this fixes the orthogonal mrkdwn layer (e.g. titles like Fix <X> & <Y>).

Smaller stuff
6. ✅ inputs.stuck_days type → number.
7. ⏸ STATUS_PRESENCE_MIN_RATIO left hardcoded — one knob, internal heuristic, not worth widening the env surface.
8. ⏸ DST drift in daysStuck — acknowledged, no action.
9-10. ⏸ No action (existing convention).
11. ✅ Tests for NaN/invalid stuck_days and >100 labels both added. Suite is now 9 cases, all green.

actionlint and node --check clean; full test suite passes.

Addresses the two items the second review flagged as fix-before-merge. - find-stuck-issues now drops items where state == 'CLOSED' and the project Status is anything other than Done. A closed issue parked in Status=QA is board-cleanup (closed-as-completed without moving the card, or closed-as-not-planned without a QA : Not Needed label) — pinging the team about it is noise. Added a test case asserting CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept. - New workflow cicd_pr_qa-stuck-check-validate.yml runs `node .github/scripts/qa-stuck-check/test-find-stuck-issues.js` on PRs that touch the script, scheduled workflow, this validator workflow, or the triage config. Sub-second job (zero npm deps). Without this, regressions would only surface on the next Mon/Thu 13:00 UTC scheduled run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nollymar · 2026-05-21T23:58:32Z

Addressed the two items called out as "actually fix before merging" in b0f674cb07:

2. ✅ Issue.state drift — find-stuck-issues now drops items where state == 'CLOSED' and project Status is anything other than Done. Rationale: a closed issue parked in Status=QA is board-cleanup (closed-as-completed without moving the card, or closed-as-not-planned without QA : Not Needed), not pending QA. Test case asserts CLOSED+QA is dropped while OPEN+QA and CLOSED+Done are kept.

3. ✅ Tests not in CI — added cicd_pr_qa-stuck-check-validate.yml. Path-filtered pull_request trigger on the script/workflow/config; checks out the relevant files, sets up Node from .nvmrc, runs the smoke test. Zero npm deps, sub-second job. Concurrency group cancels in-progress on synchronize.

The 🟡/🟢 items I'm leaving as judgment calls (happy to address any if you want):

(1) YAML comment clarifying schedule-doesn't-dry-run — fair, can add.
(4) assignees first: 20 no-overflow guard — display-only impact.
(5) parseInt(PROJECT_NUMBER) validation — env is hard-coded in the YAML, lower likelihood of a typo than user input.
(6) set -euo pipefail in the Slack post step — defensive, harmless.
(8) Number.isInteger(STUCK_DAYS) to reject decimals — minor.
(9) Project-not-found error fires after retries: 3 — known limitation of the github-script wrapper; acceptable since the result is "fail with clear message after a few seconds."

Test suite is now 10 cases, all green; actionlint clean on both workflows.

…uck_days=0 Addresses the two highest-impact findings from the latest review. - assertStatusFieldVisible now filters to content.__typename === 'Issue' before computing the Status-presence ratio. The previous version included PullRequest / DraftIssue items in the denominator, which can legitimately lack a Status value on this board — a project with even a handful of PRs added to it would have failed the ratio guard with a misleading "field falls outside fieldValues page" error. Test added asserting that a single Issue+Status alongside several DraftIssues passes cleanly. - STUCK_DAYS env in the workflow now uses an explicit empty-string check instead of `||`. GH Actions evaluates `0 || '3'` to `'3'`, so a manual dispatch with stuck_days=0 was silently coerced to the default 3 instead of failing validateConfig with a clear error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nollymar · 2026-05-22T20:25:15Z

Addressed the two highest-impact findings from the latest review in 6a91742913:

🟠 assertStatusFieldVisible scoped to Issue items — denominator now filters to content.__typename === 'Issue'. PRs / DraftIssues no longer dilute the ratio. Test testNonIssueContentDoesNotTripRatio asserts a board with 1 Issue + 4 DraftIssues passes cleanly (previously would have failed with a misleading "field outside fieldValues page" error).

🟡 stuck_days=0 YAML coalesce bypass — ${{ inputs.stuck_days || '3' }} → ${{ inputs.stuck_days != '' && inputs.stuck_days || '3' }}. Explicit empty-string check means 0 propagates to the script and trips validateConfig with the clear "must be >= 1" message, rather than being silently coerced to 3.

Left as judgment calls:

🟡 esc only covering & < > — Slack's spec is satisfied; */_/backtick are display-quality not correctness.
🟡 Job-summary markdown not escaped — internal-only render, lower stakes.
🟢 set -euo pipefail in the Slack post step — already acknowledged, marginal benefit.
🟢 429 retry — single-shot is fine at ~4 teams Mon/Thu cadence.

Suite is 11 cases, all green; actionlint clean.

nollymar requested a review from a team as a code owner May 21, 2026 22:51

github-project-automation Bot added this to dotCMS - Product Planning May 21, 2026

github-actions Bot added the Area : CI/CD PR changes GitHub Actions/workflows label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 21, 2026

github-actions Bot removed the AI: Safe To Rollback label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 21, 2026

github-actions Bot removed the AI: Safe To Rollback label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 21, 2026

github-actions Bot removed the AI: Safe To Rollback label May 21, 2026

claude Bot added the AI: Safe To Rollback label May 22, 2026

github-actions Bot removed the AI: Safe To Rollback label May 22, 2026

claude Bot added the AI: Safe To Rollback label May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802

feat(ci): scheduled QA-stuck issue check with Slack notifications#35802
nollymar wants to merge 5 commits into
mainfrom
worktree-qa-stuck-prs-automation

nollymar commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

claude Bot commented May 21, 2026 •

edited

Loading

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

nollymar commented May 21, 2026

Summary

Behavior

Auth note

Test plan

Uh oh!

github-actions Bot commented May 21, 2026

❌ Issue Linking Required

How to fix this:

Why is this required?

Uh oh!

claude Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

🟠 Things to fix before live posting

🟡 Worth a second look

🟢 Nits / observed, not flagging

Test coverage looks solid

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 21, 2026

Uh oh!

nollymar commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 21, 2026 •

edited

Loading