docs: fix broken links — redirects, GitHub source links, drift#1938

Open
lbliii wants to merge 6 commits into NVIDIA-NeMo:main from lbliii:lbliii/broken-links-review

@lbliii lbliii commented May 6, 2026

Summary

Hand-verified link fixes from the 2026-05-04 broken-link report. Scoped to changes I'm confident in; unverified scripted rewrites and the autodoc workaround were dropped from earlier iterations of this branch (see history).

  • Version-root index.html redirects: /nemo/curator/{latest,v26.04,v26.02,v25.09,…}/index.html were 404ing because the existing :path* redirect rule does not match the empty-path case. Added 7 explicit rules in fern/docs.yml, mirroring the existing /nemo/curator/index.html carve-out. Slotted before the :path* rules so they fire first.
  • Stale GitHub source links: nemo_curator/tasks/audio.py was renamed to audio_task.py; nemo_curator/backends/experimental/ was removed. Updated the View source on GitHub links on the affected committed API reference pages in v25.09, v26.02, and v26.04.
  • Committed-page drift — 4 pages had /api/reference/api-reference/ paths that needed to be /reference/api-reference/. This rewrite is idempotent and safe.
  • Skill doc refresh (.claude/skills/nemo-curator-docs/SKILL.md) — current train updated v26.02 → v26.04; new sections on holding a version back from publish, the Fern Python library generator cross-ref bug (no in-repo workaround currently — track upstream), and redirect quirks (:path* empty-path gotcha, ordering, slug forms); DCO sign-off note added.
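
For reviewers who want to confirm the idempotency claim on the drift rewrite, a minimal Python sketch follows. It assumes the rewrite is a plain substring replacement (the PR does not show the exact mechanism), and the `Executors` card is a made-up sample line, not copied from the repo.

```python
# Assumption: the drift fix is a plain substring replacement of the stale prefix.
OLD_PREFIX = "/api/reference/api-reference/"
NEW_PREFIX = "/reference/api-reference/"


def fix_drift(text: str) -> str:
    """Replace the stale API-reference prefix with the current one."""
    return text.replace(OLD_PREFIX, NEW_PREFIX)


# Hypothetical sample line; the card title and tail path are illustrative only.
page = '<Card title="Executors" href="/api/reference/api-reference/executors">'

once = fix_drift(page)
twice = fix_drift(once)

# The replacement text does not contain the search text, so a second pass
# finds nothing left to rewrite on this input.
assert OLD_PREFIX not in NEW_PREFIX
assert once == twice
```

The property that matters is that the output of one pass contains no new match for the rule. That holds here as long as the href is not already preceded by a stray `/api`.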

Out of scope (deliberate)

  • Autodoc cross-ref rewrite (the 541 broken links from fern docs md generate). The Fern Python library generator emits cross-refs that miss the /nemo/curator basepath and tacks on Sphinx-style #nemo_curator-… fragments that don't match any rendered anchor. Filed upstream with Fern; revisit in a separate PR if the upstream fix doesn't land soon.
  • Existing _fix_broken_links.py rule cleanup. That script has non-idempotent substring rules (("/deployment/requirements", "/admin/deployment/requirements") and similar) that add the prefix on every run. Do not run the script until those rules are tightened (anchored regexes / negative lookbehinds) — running it produced doubled-path regressions like /admin/admin/deployment/requirements in earlier commits on this branch, all reverted. Tightening the rules is its own PR.
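
The doubled-path failure can be reproduced in a few lines. The rule pair below is the one quoted above; `apply_naive`, `apply_tight`, and the lookbehind pattern are illustrative names and one possible tightening, not the contents of `_fix_broken_links.py`.

```python
import re

# Rule pair quoted in the PR description: plain substring replacement.
NAIVE_RULE = ("/deployment/requirements", "/admin/deployment/requirements")


def apply_naive(text: str) -> str:
    """Substring rule: also matches the tail of an already-correct path."""
    old, new = NAIVE_RULE
    return text.replace(old, new)


# Hypothetical tightened rule: skip matches already preceded by "/admin".
TIGHT_PATTERN = re.compile(r"(?<!/admin)/deployment/requirements")


def apply_tight(text: str) -> str:
    """Regex rule with a negative lookbehind, so re-running is a no-op."""
    return TIGHT_PATTERN.sub("/admin/deployment/requirements", text)


correct = "[Requirements](/admin/deployment/requirements)"
stale = "[Requirements](/deployment/requirements)"

# The naive rule corrupts a link that was already correct.
assert apply_naive(correct) == "[Requirements](/admin/admin/deployment/requirements)"

# The tightened rule fixes the stale link and leaves the correct one alone.
assert apply_tight(stale) == correct
assert apply_tight(correct) == correct
```

A lookbehind is only one option. Anchoring the pattern to the opening `(` or `"` of the link target would work as well, at the cost of one rule per link syntax.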

Test plan

  • fern check passes locally
  • Preview build (auto-posted by fern-docs-preview-comment.yml) — verify pages render and the affected GitHub source links resolve
  • After deploy: verify /nemo/curator/latest/index.html, /nemo/curator/v26.04/index.html, etc., redirect to the version landing instead of 404ing
  • After deploy: verify https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/tasks/audio_task.py and https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/backends resolve
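
One way to script the post-deploy checks is sketched below with the Python standard library only. `http_status` and `broken` are hypothetical helper names. The docs-site host is not stated in this PR, so only the two GitHub URLs from the test plan are listed.

```python
import urllib.error
import urllib.request

# URLs taken from the test plan above.
GITHUB_URLS = [
    "https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/tasks/audio_task.py",
    "https://github.com/NVIDIA-NeMo/Curator/tree/main/nemo_curator/backends",
]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status for url (urlopen follows redirects)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; report the code.
        return err.code


def broken(urls, status_of=http_status):
    """Return the URLs whose final status is outside the 2xx range."""
    return [url for url in urls if not 200 <= status_of(url) < 300]


# Needs network access, so it is left commented out here:
# print(broken(GITHUB_URLS))
```

Passing `status_of` as a parameter keeps the check testable offline with a stub lookup.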

🤖 Generated with Claude Code

@lbliii lbliii requested a review from a team as a code owner May 6, 2026 19:59
@lbliii lbliii requested review from praateekmahajan and removed request for a team May 6, 2026 19:59
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps Bot commented May 6, 2026

Greptile Summary

This PR fixes a focused set of hand-verified broken links across the NeMo Curator docs: 7 explicit version-root index.html redirects to close the :path* empty-path gap, stale GitHub source links for the renamed audio_task.py and removed backends/experimental/ path, and /api/reference/api-reference/ → /reference/api-reference/ prefix drift on committed API reference pages in v25.09 and v26.02. The CI workflow is also simplified to fern check (config-only validation) so it runs cleanly on fork PRs without secrets.

  • Redirect additions (fern/docs.yml): 7 new rules for {latest,v26.04,v26.02,v25.09,26.04,26.02,25.09}/index.html are slotted before the :path* catch-alls, correctly addressing the empty-path match gap that caused version-root 404s.
  • GitHub source link fixes: audio.py → audio_task.py in v25.09 and v26.02 audio-batch.mdx; backends/experimental/ → backends tree in all three versions' experimental.mdx.
  • CI refactor (fern-docs-ci.yml + fern/package.json): replaces the secrets-dependent fern docs md generate job with a lightweight fern check npm script, trading autodocs generation validation for fork-PR compatibility.

Confidence Score: 5/5

The changes are narrow, hand-verified link fixes with no logic changes; the redirect additions are correctly ordered and scoped.

Each change in this PR targets a specific, confirmed broken link category. The redirect rules are additive and correctly placed before the :path* catch-alls. The GitHub source link updates point to existing paths. The href prefix fixes were applied consistently within the files touched by this PR. The CI change is a deliberate, documented scope reduction. No new regressions were introduced.

No files in this PR require special attention. Two pre-existing gaps flagged in earlier review rounds — the v26.04 api-reference/index.mdx Card hrefs and the v25.09/reference/index.mdx Tasks card — remain open but are tracked separately.

Important Files Changed

| Filename | Overview |
| --- | --- |
| fern/docs.yml | Adds 7 explicit version-root index.html redirect rules before the :path* catch-alls to fix the empty-path gap; redirect ordering, sources, and destinations are all correct. |
| fern/versions/v25.09/pages/api-reference/index.mdx | All 8 Card hrefs updated from stale /api/reference/api-reference/ prefix to /reference/api-reference/; fix is complete and consistent. |
| fern/versions/v26.02/pages/api-reference/index.mdx | All 8 Card hrefs updated from stale /api/reference/api-reference/ prefix to /reference/api-reference/; fix mirrors the v25.09 change correctly. |
| fern/versions/v25.09/pages/reference/index.mdx | Five of six Card hrefs updated to drop the stale /api/ prefix; the Tasks card fix was flagged separately in a prior review round. |
| fern/versions/v25.09/pages/api-reference/tasks/audio-batch.mdx | GitHub source link corrected from the deleted audio.py to the renamed audio_task.py; same fix applied identically to v26.02. |
| fern/versions/v26.04/pages/api-reference/executors/experimental.mdx | GitHub source link updated from the removed backends/experimental/ path to the live backends tree; fix applied across all three versions. |
| .github/workflows/fern-docs-ci.yml | CI job switched from fern docs md generate (requires FERN_TOKEN, fails on fork PRs) to fern check (config-only validation, no secrets needed); deliberate trade-off documented in header comment. |
| fern/package.json | New package.json wraps Fern CLI commands as npm scripts; the check script used by CI uses npx -y fern-api@latest, consistent with the old unversioned global install pattern. |
| .claude/skills/nemo-curator-docs/SKILL.md | Documentation updated to current train v26.04; adds guidance on holding a version back, the Fern cross-ref bug, DCO sign-off requirement, and redirect quirks including the :path* empty-path gotcha. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Incoming request] --> B{Matches version-root\nindex.html?\ne.g. /nemo/curator/latest/index.html}
    B -->|Yes - NEW rules| C[Redirect to version root\ne.g. /nemo/curator/latest]
    B -->|No| D{Matches :path*/index.html?}
    D -->|Yes - existing rules| E[Redirect: strip index.html\ne.g. /nemo/curator/latest/foo/index.html\n to /nemo/curator/latest/foo]
    D -->|No| F{Matches calendar-train\nslug? e.g. 26.04/:path*}
    F -->|Yes| G[Redirect to v-prefixed slug\ne.g. /nemo/curator/v26.04/:path*]
    F -->|No| H[Serve page normally]

Reviews (7): Last reviewed commit: "Merge branch 'main' into lbliii/broken-l..."
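
The redirect order in the flowchart above can be modeled with a toy first-match-wins resolver. This is a sketch, not Fern's matcher: it assumes `:path*` must capture at least one character, which is the behavior this PR describes, and the rule tuples mirror the flowchart examples. `compile_source` and `resolve` are hypothetical names.

```python
import re


def compile_source(source: str) -> re.Pattern:
    """Toy matcher: ':path*' becomes a group that needs at least one character."""
    escaped = re.escape(source).replace(re.escape(":path*"), r"(?P<path>.+)")
    return re.compile(f"^{escaped}$")


# Ordered rules; the first match wins.
RULES = [
    # Explicit version-root rule (the kind this PR adds).
    ("/nemo/curator/latest/index.html", "/nemo/curator/latest"),
    # Catch-all that strips a trailing index.html.
    ("/nemo/curator/latest/:path*/index.html", "/nemo/curator/latest/:path*"),
    # Calendar-train slug mapped to the v-prefixed slug.
    ("/nemo/curator/26.04/:path*", "/nemo/curator/v26.04/:path*"),
]


def resolve(url: str, rules=RULES):
    """Return the redirect target for url, or None when no rule matches."""
    for source, destination in rules:
        match = compile_source(source).match(url)
        if match:
            path = match.groupdict().get("path", "")
            return destination.replace(":path*", path)
    return None


root = "/nemo/curator/latest/index.html"

# With the explicit rule, the version-root URL redirects to the version root.
assert resolve(root) == "/nemo/curator/latest"

# Without it, the catch-all cannot match an empty path, so nothing fires.
assert resolve(root, RULES[1:]) is None
```

Under these assumptions the explicit rule is the only one that can handle the version-root URL, so its position relative to the catch-alls matters only for making the intent obvious.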

This major release represents a fundamental architecture shift from [Dask](https://www.dask.org/) to [Ray](https://www.ray.io/), expanding NeMo Curator to support multimodal data curation with new [video](/curate-video) and [audio](/curate-audio) capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.

- **Migrating from a previous version of NeMo Curator?** Refer to the [Migration Guide](/about/release-notes/migration-guide) for step-by-step instructions and the [Migration FAQ](/about/release-notes/migration-faq) for common questions.
+ **Migrating from a previous version of NeMo Curator?** Refer to the [Migration Guide](/about/release-notes/about/release-notes/migration-guide) for step-by-step instructions and the [Migration FAQ](/about/release-notes/about/release-notes/migration-faq) for common questions.

P1 Broken links introduced by script substitution

Both migration-guide and migration-faq links were rewritten from their correct absolute paths to doubly-nested paths like /about/release-notes/about/release-notes/migration-guide that do not exist. The same pattern repeats across migration-faq.mdx, migration-guide.mdx, and v26.02 equivalents — all need to be reverted to the original single-prefix paths.

### System Requirements

- For comprehensive system requirements and production deployment specifications, see [Production Deployment Requirements](/admin/deployment/requirements).
+ For comprehensive system requirements and production deployment specifications, see [Production Deployment Requirements](/admin/admin/deployment/requirements).

P1 /admin/admin/deployment/requirements is a non-existent path

The link to Production Deployment Requirements was correct (/admin/deployment/requirements) before the script ran. The /deployment/requirements → /admin/deployment/requirements replacement rule matched the tail of the already-correct path and prepended /admin again. The same issue appears in admin/deployment/index.mdx, about/release-notes/migration-faq.mdx, and their v26.02 counterparts.

</Card>

- <Card title="Duration Filtering" href="/curate-audio/process-data/quality-assessment/duration-filtering">
+ <Card title="Duration Filtering" href="/curate-audio/process-data/curate-audio/process-data/quality-assessment/curate-audio/process-data/quality-assessment/duration-filtering">

P1 Triple-nested duration-filtering and format-validation hrefs will 404

The /quality-assessment/duration-filtering rule fires first and doubles the path; then the /duration-filtering rule fires on the now-present suffix and triples it, producing /curate-audio/process-data/curate-audio/process-data/quality-assessment/curate-audio/process-data/quality-assessment/duration-filtering. The same corruption affects every Duration Filtering and Format Validation link across all curate-audio pages in both v25.09 and v26.02.
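
The chained corruption can be replayed step by step. The two rule pairs below are reconstructed from the comment above and are an assumption; the actual entries in `_fix_broken_links.py` are not shown in this thread. They do reproduce the exact href flagged in the diff.

```python
# Assumed rule pairs, applied in order as plain substring replacements.
RULES = [
    (
        "/quality-assessment/duration-filtering",
        "/curate-audio/process-data/quality-assessment/duration-filtering",
    ),
    (
        "/duration-filtering",
        "/curate-audio/process-data/quality-assessment/duration-filtering",
    ),
]


def apply_rules(href: str) -> str:
    """Apply every rule in sequence, as a naive rewrite script would."""
    for old, new in RULES:
        href = href.replace(old, new)
    return href


correct = "/curate-audio/process-data/quality-assessment/duration-filtering"
corrupted = apply_rules(correct)

# Rule 1 doubles the prefix; rule 2 then matches the new suffix and adds a third.
assert corrupted == (
    "/curate-audio/process-data"
    "/curate-audio/process-data/quality-assessment"
    "/curate-audio/process-data/quality-assessment/duration-filtering"
)
```

Each rule's replacement contains its own search string, and the first rule's output also contains the second rule's search string, so an already-correct href is rewritten twice in a single run.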

…Hub source links

- Extend `fern/_fix_broken_links.py` with `fix_autodoc_file()` to rewrite Fern
  Python library generator output (`fern docs md generate`). Two issues in the
  generator: cross-refs miss the `/nemo/curator` basepath, and Sphinx-style
  `#nemo_curator-…` fragments don't match any rendered anchor. Script walks the
  generated `product-docs/nemo-curator/Full-Library-Reference/**/*.mdx` and
  rewrites both. Skips cleanly when the gitignored dir isn't present locally.
  Filed upstream with Fern; remove the workaround once fixed.
  Accounts for 541 of 543 flagged links.

- Update 2 stale GitHub source links on committed API reference pages:
  `nemo_curator/tasks/audio.py` → `audio_task.py` (file renamed),
  `nemo_curator/backends/experimental/` → `tree/main/nemo_curator/backends`
  (dir removed). Applied in v25.09, v26.02, v26.04.

- Add explicit redirects for `/nemo/curator/{latest,v26.04,v26.02,v25.09,…}/index.html`
  in `fern/docs.yml`. The existing `:path*` rule doesn't match the empty-path
  case (mirrors the existing carve-out for `/nemo/curator/index.html`), so
  bare version-root index.html URLs were 404ing.

- Apply pending rewrites from existing `_fix_broken_links.py` rules to 21
  committed pages (CI doesn't run the script; pages had drifted).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@lbliii lbliii force-pushed the lbliii/broken-links-review branch from 8dc38c3 to 598cb8c Compare May 7, 2026 14:38
Hides v26.04 from the published site so older-version fixes (in this PR)
can ship without v26.04 going out alongside them.

- `fern/docs.yml`: comment out the `latest` (26.04) and `v26.04` entries
  in the `versions:` block; restore instructions inline. Repoint `latest`
  display-name to v26.02 so the dropdown is consistent.
- `fern/versions/latest.yml`: repoint symlink → `v26.02.yml`. `/latest/`
  now serves v26.02 content during the hold-back.
- `.claude/skills/nemo-curator-docs/SKILL.md`: refresh stale references
  (current train v26.02 → v26.04; corrected URL→version mapping); add
  three new sections covering version hold-back/audiences, the Fern
  autodoc cross-ref bug + workaround, and redirect quirks (`:path*`
  empty-path gotcha). Also note the DCO sign-off requirement.

To restore v26.04: uncomment the two entries in `fern/docs.yml`, run
`ln -sf v26.04.yml fern/versions/latest.yml`, update the Latest
display-name back to `(26.04)`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Comment thread fern/docs.yml Outdated
Comment on lines 153 to 165
# path: versions/latest.yml
# slug: latest
# availability: stable
# - display-name: "26.04 · v1.1.2"
# path: versions/v26.04.yml
# slug: v26.04
# availability: stable
- display-name: "Latest · v1.1.0 (26.02)"
path: versions/latest.yml
slug: latest
availability: stable
- display-name: "26.04 · v1.1.2"
path: versions/v26.04.yml
slug: v26.04
availability: stable
- display-name: "26.02 · v1.1.0"
path: versions/v26.02.yml

P1 Undisclosed version rollback bundles into a link-fix PR

The PR title and description cover only broken-link fixes, but this diff also comments out the v26.04 entries and re-points latest.yml to v26.02.yml. A reviewer merging based on the PR description would not realise that "Latest · v1.1.2 (26.04)" is being replaced with "Latest · v1.1.0 (26.02)" for every user on the live docs site. The code comment labels it temporary, but that context is invisible in the PR description or title. Even if intentional, bundling a user-visible version rollback into a 543-link-fix PR without calling it out is a merge risk — the two changes should be in separate PRs, or the description should explicitly disclose the version hold-back so approvers can make an informed decision.

lbliii and others added 2 commits May 7, 2026 11:06
Scoping this PR down to fixes verified end-to-end. The previous commit
included two unverified changes that are now reverted:

1. `fern/_fix_broken_links.py`: remove `fix_autodoc_file()` and the
   `product-docs` traversal. The Fern Python library generator bug
   stays filed upstream — no in-repo workaround until the upstream
   fix lands or we wire the rewrite into CI deliberately.

2. Revert 18 MDX files that the existing script rewrote with
   non-idempotent rules (`/admin/deployment/...` → `/admin/admin/...`,
   `/about/release-notes/migration-*` → doubled-prefix variants,
   `/curate-audio/process-data/quality-assessment/...` → triple-prefix
   variants). These were drift fixes in name but regressions in fact.
   The buggy rules in `_fix_broken_links.py` are unchanged here; treat
   the script as broken until those rules are tightened.

Skill update: rephrase the autodoc section to reflect that there's no
in-repo workaround currently — track the upstream Fern fix.

What ships in this PR (all hand-verified):
- 2 GitHub source-link fixes (`audio.py` → `audio_task.py`, removed
  `backends/experimental/` → parent `backends/`) across v25.09/v26.02/v26.04.
- 4 committed pages with `/api/reference/api-reference/` → `/reference/api-reference/`
  rewrites (drift fix that IS idempotent).
- 7 explicit `*/index.html` redirects in `fern/docs.yml` for the empty-
  path case the existing `:path*` rule misses.
- v26.04 hold-back (versions block + latest.yml symlink → v26.02).
- Skill doc refresh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Restore the `latest` and `v26.04` entries in `fern/docs.yml` `versions:`
and repoint `fern/versions/latest.yml` symlink back to `v26.04.yml`.
v26.04 publishes alongside the link fixes in this PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@lbliii lbliii changed the title docs: fix 543 broken links from autodoc, redirect gaps, and stale GitHub source links docs: fix broken links — redirects, GitHub source links, drift May 7, 2026
The PR-trigger Fern CI was running `fern docs md generate` which requires
`FERN_TOKEN`. GitHub Actions does not expose repo secrets to `pull_request`
runs from forks (security default), so every fork PR failed with
"Authentication required."

Mirror the pattern used by NVIDIA-NeMo/Gym: split Fern CLI invocations
into npm scripts and have PR CI run only the config-validation step.

- Add `fern/package.json` with `check`, `dev`, `generate`, and
  `generate:library` scripts (matches Gym's setup).
- `.github/workflows/fern-docs-ci.yml`: replace `fern docs md generate`
  with `npm run check`. Renamed job to "Fern docs (check)". Drops the
  separate `npm install -g fern-api` step (npx pulls it inline).

Auth-requiring steps (`fern docs md generate`, `fern generate --docs`)
remain in the publish and preview-comment workflows where secrets are
available — preview via the `workflow_run` two-stage pattern, publish
via the `docs/v*` tag trigger on the upstream repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@lbliii lbliii requested a review from a team as a code owner May 7, 2026 15:22
Comment on lines 30 to 32
<Card title="Tasks" href="/api/reference/api-reference">
Task types for text, image, video, and audio processing
</Card>

P1 The "Tasks" card was skipped when the /api/reference/api-reference/ → /reference/api-reference/ fix was applied to this page. All five other cards were updated in this PR, but this one still carries the stale prefix; it will 404 just like the others would have without the fix.

Suggested change
- <Card title="Tasks" href="/api/reference/api-reference">
- Task types for text, image, video, and audio processing
- </Card>
+ <Card title="Tasks" href="/reference/api-reference">
+ Task types for text, image, video, and audio processing
+ </Card>

@lbliii lbliii requested a review from sarahyurick May 8, 2026 14:09
lbliii added a commit to NVIDIA-NeMo/Automodel that referenced this pull request May 8, 2026
Fern's :path* parameter does not match the empty-path case, so a rule
like /<basepath>/v0.4/:path*/index.html does not match
/<basepath>/v0.4/index.html (where :path* would have to be empty).
Result: visiting docs.nvidia.com/nemo/automodel/v0.4/index.html (or
the latest/, nightly/, or 0.4/ legacy variants) 404s. NeMo Curator
ran into this and fixed it the same way; see NVIDIA-NeMo/Curator#1938.

Adds 8 explicit version-root rules — for {latest, v0.4, nightly, 0.4}
x {.html, extension-less} — slotted before the :path*/index.html
catch-alls so they fire first. Documents the gotcha in
fern/README.md and skills/fern-docs/SKILL.md so the next person to
add a version slug remembers to add the four matching rules.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
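
The rule set described in this commit message could be enumerated as sketched below. Several details are assumptions the commit message does not confirm: the `source`/`destination` key names, that "extension-less" means a trailing `/index`, and that the legacy `0.4` slug points at `v0.4`. `version_root_rules` is a hypothetical helper name.

```python
from itertools import product


def version_root_rules(basepath, slugs, suffixes=("index.html", "index"), canonical=None):
    """Build one explicit redirect per (slug, suffix) pair.

    canonical maps a legacy slug to the slug it should land on; slugs that
    are not listed redirect to their own version root.
    """
    canonical = canonical or {}
    rules = []
    for slug, suffix in product(slugs, suffixes):
        target = canonical.get(slug, slug)
        rules.append(
            {
                "source": f"{basepath}/{slug}/{suffix}",
                "destination": f"{basepath}/{target}",
            }
        )
    return rules


rules = version_root_rules(
    "/nemo/automodel",
    ["latest", "v0.4", "nightly", "0.4"],
    canonical={"0.4": "v0.4"},  # assumption: legacy slug lands on the v-prefixed one
)

# 4 slugs x 2 suffixes gives the 8 rules mentioned in the commit message.
assert len(rules) == 8
```

Generating the list makes it harder to forget a variant when a new version slug is added, though the output still has to be placed before the `:path*/index.html` catch-alls by hand.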