
feat: replace Elasticsearch search with Pagefind #5106

Merged
DeepDiver1975 merged 2 commits into master from feature/pagefind-search
Jun 16, 2026

@DeepDiver1975

Member

Summary

Migrates documentation search from a self-hosted Elasticsearch index to Pagefind, a static search library. Pagefind crawls the already-built HTML in public/ and writes a chunked static index into public/pagefind/ — no server, container, CORS workaround, or ELASTICSEARCH_* secrets required.

Changes

  • package.json — drop @elastic/elasticsearch, cheerio, html-entities, lodash; add pagefind dev dependency and a pagefind npm script
  • site.yml / docs/extensions.md — remove the generate-index.js Antora extension
  • Deleted ext-antora/generate-index.js (192-line ES indexer) and es-docker-compose.yml (local ES container)
  • .github/workflows/ci.yml — drop ELASTICSEARCH_* env/secrets; run npm run pagefind after the Antora build so the deployed site always ships a fresh index
  • docs/build-the-docs.md — rewrite the Search section for the Pagefind workflow

Verification

Built and indexed locally:

  • npm run antora builds cleanly with no reference to the removed extension
  • npm run pagefind indexes 1359 pages / 21916 words and writes public/pagefind/ including the search UI bundle

Note: the front-end search bar integration (loading Pagefind's UI bundle) lives in the separate docs-ui repo and needs a matching change there for end-to-end search to work.

🤖 Generated with Claude Code

Migrate documentation search from a self-hosted Elasticsearch index to
Pagefind, a static search library that crawls the built HTML in public/
and writes a chunked index into public/pagefind/. This removes the need
for an Elasticsearch server, local docker container, CORS workarounds,
and the ELASTICSEARCH_* CI secrets.

- package.json: drop @elastic/elasticsearch, cheerio, html-entities and
  lodash; add pagefind dev dependency and a `pagefind` npm script
- site.yml / extensions.md: remove the generate-index.js Antora extension
- delete ext-antora/generate-index.js and es-docker-compose.yml
- ci.yml: drop ELASTICSEARCH_* env/secrets, run `npm run pagefind` after
  the Antora build so the deployed site always ships a fresh index
- build-the-docs.md: rewrite the Search section for the Pagefind workflow

Verified locally: `npm run antora` builds cleanly with no reference to
the removed extension, and `npm run pagefind` indexes 1359 pages.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Thomas Müller <1005065+DeepDiver1975@users.noreply.github.com>
@DeepDiver1975 force-pushed the feature/pagefind-search branch from f4ebd0b to 49de010 on June 15, 2026 15:15
@DeepDiver1975 changed the title from "Replace Elasticsearch search with Pagefind" to "feat: replace Elasticsearch search with Pagefind" on Jun 15, 2026
@DeepDiver1975

Member Author

frontend part: owncloud/docs-ui#1018

@DeepDiver1975

Member Author
image

@phil-davis phil-davis left a comment

Contributor


LGTM, the search indexing will be all done as part of the publishing.

@DeepDiver1975

Member Author

Code review

Pipeline half of the Elasticsearch → Pagefind migration: removes the ES index-generation extension + service and runs the pagefind CLI over the built site. Pairs with owncloud/docs-ui#1018; both must land together (this produces an index that only the new docs-ui markup consumes).

Strengths

  • 🔒 Security win: removes a hardcoded credential that was committed in build-the-docs.md (ELASTICSEARCH_READ_AUTH=docs:cADL...) and drops 5 ELASTICSEARCH_*/UPDATE_SEARCH_INDEX secrets from CI. This alone justifies the change.
  • Thorough dependency cleanup: @elastic/elasticsearch, cheerio, html-entities, lodash and their transitive trees all removed from the lockfile. pagefind correctly placed in devDependencies (build-time only).
  • Architecture is a clear net simplification (−661 lines): no live cluster, no browser-exposed credentials, no CORS workaround. Pagefind crawls the already-built public/ so the index automatically covers every component/version in the build.
  • Docs rewrite is excellent: the old 8-step ES/Docker/CORS dance becomes a clean 4-step local flow, with the CI behavior explained. CI ordering (antora → pagefind → CNAME) is correct.

Issues & suggestions

🟡 No guard against an empty index. npm run pagefind runs unconditionally after npm run antora. If a future Antora change alters the output dir or emits zero indexable pages, Pagefind exits 0 with an empty index and the site deploys with silently broken search. With search now having no automated test coverage, this is the highest-value safety net:

npm run pagefind | tee /tmp/pf.log
grep -qE "Indexed [1-9][0-9]+ pages" /tmp/pf.log || { echo "Pagefind index suspiciously small"; exit 1; }
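
The pattern above can be exercised against a few summary lines to see where the threshold falls. This is a minimal sketch: the function name `index_looks_populated` and the sample lines are hypothetical illustrations, not captured Pagefind output.

```shell
# Succeeds when the log text on stdin reports a page count of two or more digits.
# The pattern is the one quoted in the review comment above.
index_looks_populated() {
  grep -qE 'Indexed [1-9][0-9]+ pages'
}

# Hypothetical sample summary lines (assumed wording, not real tool output).
for line in 'Indexed 0 pages' 'Indexed 1 page' 'Indexed 9 pages' \
            'Indexed 10 pages' 'Indexed 1359 pages'; do
  if printf '%s\n' "$line" | index_looks_populated; then
    echo "pass: $line"
  else
    echo "fail: $line"
  fi
done
```

In a pipeline such as `npm run pagefind | tee /tmp/pf.log`, the exit status is that of `tee` unless `pipefail` is enabled, so depending on how the CI shell is configured, a guard like this may be the only step that notices a failed indexing run.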

🟢 Loose version pin. pagefind: ^1.3.0 resolves to 1.5.2 in the lockfile, but the docs-ui markup targets the 1.5.x Component UI web components (<pagefind-modal> etc.), which don't exist in 1.3.x. A fresh install honoring only ^1.3.0 could resolve an older minor → broken search. Recommend bumping the floor to ^1.5.0 to match the UI requirement.
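
To make the range argument concrete, the sketch below approximates npm's caret rule for plain MAJOR.MINOR.PATCH versions with a major of 1 or higher: the major must match and the version must not be below the floor. The helper `caret_allows` is hypothetical, and it deliberately ignores pre-release tags and the special 0.x rules.

```shell
# Succeeds if VERSION ($2) satisfies the caret range ^FLOOR ($1).
# Simplified assumption: both arguments are MAJOR.MINOR.PATCH with MAJOR >= 1.
caret_allows() {
  old_ifs=$IFS
  IFS=.
  # Split both versions on dots into six positional parameters.
  set -- $1 $2
  IFS=$old_ifs
  f_major=$1 f_minor=$2 f_patch=$3
  v_major=$4 v_minor=$5 v_patch=$6
  [ "$v_major" -eq "$f_major" ] || return 1
  [ "$v_minor" -gt "$f_minor" ] && return 0
  [ "$v_minor" -eq "$f_minor" ] && [ "$v_patch" -ge "$f_patch" ]
}

for floor in 1.3.0 1.5.0; do
  for version in 1.3.0 1.5.2; do
    if caret_allows "$floor" "$version"; then
      echo "$version satisfies ^$floor"
    else
      echo "$version does not satisfy ^$floor"
    fi
  done
done
```

Under this rule 1.3.0 is acceptable for `^1.3.0` but not for `^1.5.0`, which is the gap the suggested floor bump closes.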

🟢 Nit: site.yml still carries a commented #- ./ext-antora/comp-version.js (pre-existing). While editing the extensions: block, a good moment to drop the dead comment.

Verdict

Approve, with one pre-merge ask: bump pagefind to ^1.5.0 (the UI depends on it). The empty-index CI guard is a strong follow-up but needn't block. The removed credential + 5 secrets and the large net reduction make this high-value. Merge in lockstep with owncloud/docs-ui#1018.

Address review feedback on the Elasticsearch -> Pagefind migration:

- CI ran `npm run pagefind` unconditionally. pagefind exits 0 even when it
  indexes nothing (e.g. the output dir moves or no indexable pages are
  emitted), so a broken index would deploy with silently empty search. Tee
  the output and fail the build unless at least 10 pages were indexed.
- Bump the pagefind floor from ^1.3.0 to ^1.5.0. The docs-ui search markup
  uses the Component UI web components (pagefind-modal, pagefind-modal-trigger),
  which were introduced in Pagefind 1.5.0; ^1.3.0 declared compatibility with
  releases that lack them. The lockfile already resolved 1.5.2, so this only
  corrects the declared range (no dependency change).

Signed-off-by: Thomas Müller <1005065+DeepDiver1975@users.noreply.github.com>
@DeepDiver1975

Member Author

Addressed the review findings in c2fb874a:

  • 🟡 Empty-index CI guard: npm run pagefind ran unconditionally and exits 0 even on an empty index, which would deploy silently broken search. CI now tees the output and fails the build unless ≥10 pages were indexed. Tested the regex both ways: real index (1360 pages) passes; 0–9 pages block the build.
  • 🟢 Version floor: bumped pagefind ^1.3.0 → ^1.5.0. The docs-ui Component UI web components (pagefind-modal, pagefind-modal-trigger) were introduced in Pagefind 1.5.0, so ^1.3.0 declared compatibility with releases that lack them. Lockfile already resolved 1.5.2, so the resolved version is unchanged.

Retracting the comp-version.js nit from my earlier review — it was wrong. That extension is active in site.yml (line 82), not a commented-out dead line; I'd confused it with a similarly-named entry in another repo. No change made there.

@DeepDiver1975 DeepDiver1975 merged commit 7d3ceba into master Jun 16, 2026
2 checks passed
@DeepDiver1975 DeepDiver1975 deleted the feature/pagefind-search branch June 16, 2026 11:16

@DeepDiver1975 DeepDiver1975 left a comment

Member Author


🤖 Automated review by Claude Code review agent.

Overview

This PR migrates documentation search from a self-hosted Elasticsearch index to Pagefind, a static search library that crawls the built HTML in public/ and emits a chunked static index into public/pagefind/. The change is clean and well-scoped: it removes the 192-line generate-index.js Antora extension, the local ES docker-compose file, the ELASTICSEARCH_* CI secrets/env, and four runtime dependencies (@elastic/elasticsearch, cheerio, html-entities, lodash), replacing them with a single pagefind dev dependency and an npm script. The docs and CI build are updated accordingly. This is a strong net reduction in operational complexity (no server, container, CORS workaround, or secrets) and a sensible architectural direction.

Code quality / style

  • The CI guard around npm run pagefind is a nice touch — failing the build when the index is empty avoids silently shipping broken search, and the inline comment explains the rationale well.
  • Dependency cleanup is thorough: the package-lock.json correctly drops the entire transitive tree (cheerio/parse5/htmlparser2/undici/etc.) and the shared deps debug/ms/iconv-lite/safer-buffer are correctly re-flagged "dev": true.
  • The rewritten Search section in build-the-docs.md is much simpler and accurate to the new workflow.

Specific suggestions

  1. package-lock.json name changed "docs" → "orchestrator" (likely accidental). Line ~2 of the lockfile diff changes the top-level "name" field from docs to orchestrator, while package.json is not shown changing its name. This looks like an artifact of regenerating the lockfile in the wrong directory or from a different package.json. It should be reverted so the lockfile name matches package.json; otherwise npm may warn, and the mismatch is confusing. Please confirm that package.json's name and the lockfile agree.

  2. CI grep regex vs. Pagefind output wording. Pagefind's summary line uses the template Indexed {N} page{s} (singular "page" for 1, "pages" for >1). The guard grep -qE 'Indexed [1-9][0-9]+ pages' requires a 2+-digit count immediately followed by the literal pages, i.e. it passes only at >=10 pages — which matches the intent. Two caveats worth a quick check:

    • If Pagefind ever emits the count with a thousands separator (e.g. 1,359) or with ANSI color codes around the number, the grep would fail to match a perfectly good index and break the build. Consider stripping ANSI (sed 's/\x1b\[[0-9;]*m//g') before grepping, or loosening the pattern. Given the PR reports ~1359 pages, a separator-free large integer is the current reality, but it is a brittle coupling to log formatting.
    • The error message says "fewer than 10 pages" while the comment above says the concern is indexing nothing; consider aligning the wording so the threshold intent is unambiguous.
  3. pagefind.log handling. npm run pagefind | tee pagefind.log then rm -f pagefind.log. Because of the && / || {} block and set -e semantics in a run: step, the rm only executes on the success path; on failure the script exit 1s first, leaving pagefind.log behind. That is harmless (ephemeral CI runner) but means the cleanup is effectively only cosmetic. Minor.

  4. npm run pagefind script. pagefind --site public is correct. Note newer Pagefind versions prefer --site (kept) over the deprecated --source; good.
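
Suggestions 2 and 3 could be combined into a guard that normalizes the log before matching and removes its temporary file on every path. The sketch below is one possible shape, not the workflow's actual step: the function names are hypothetical, and the summary wording is assumed to follow the `Indexed {N} page{s}` template described above.

```shell
# Print the page count found in a Pagefind-style log on stdin (nothing if absent).
# ANSI color sequences and comma thousands separators are stripped first.
indexed_page_count() {
  esc=$(printf '\033')
  sed -e "s/${esc}\[[0-9;]*m//g" -e 's/\([0-9]\),\([0-9]\)/\1\2/g' |
    sed -n 's/.*Indexed \([0-9][0-9]*\) pages\{0,1\}.*/\1/p' |
    head -n 1
}

# Succeeds only if the log file $1 reports at least $2 pages.
require_min_pages() {
  count=$(indexed_page_count < "$1")
  [ -n "$count" ] && [ "$count" -ge "$2" ]
}

# Run an indexing command, tee its output to a temp log, and enforce the floor.
# Usage: run_guarded MIN_PAGES COMMAND [ARGS...]
run_guarded() {
  min=$1
  shift
  log=$(mktemp) || return 1
  "$@" 2>&1 | tee "$log"
  status=0
  require_min_pages "$log" "$min" || status=1
  rm -f "$log"   # cleanup happens on success and failure alike
  if [ "$status" -ne 0 ]; then
    echo "Pagefind indexed fewer than $min pages" >&2
  fi
  return "$status"
}
```

In CI this might be invoked as `run_guarded 10 npm run pagefind`. Extracting the number and comparing it numerically avoids tying the threshold to the digit count of the pattern.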

Potential issues / risks

  • End-to-end search depends on docs-ui. The PR body correctly notes that loading Pagefind's UI bundle lives in the separate docs-ui repo and needs a matching change. Until that lands, the deployed site will ship a valid public/pagefind/ index but the search bar will not be wired to it — i.e. search is effectively non-functional end-to-end after this merges alone. Recommend coordinating the merge with the docs-ui change (or noting the rollout order in the PR) to avoid a window of broken search in production.

  • Removed functionality — per-heading anchored results. The old indexer captured each h1..h6 with its anchor (url#id) so results could deep-link to a specific section. Pagefind supports sub-results/anchored headings too, but only if the UI is configured for it. Worth confirming the docs-ui integration preserves the heading-level deep-linking that users currently get, otherwise this is a subtle UX regression.

  • Excluded pages. The old extension explicitly skipped pages with a leading _ (no page.pub.url). Pagefind indexes whatever HTML Antora actually publishes, so behavior should be equivalent (unpublished partials are not emitted to public/), but worth a sanity check that no internal/partial pages leak into the index.

  • No automated test of the index. Beyond the CI page-count floor, there is no assertion that public/pagefind/ is structurally valid or that a known term returns a hit. The page-count guard is a reasonable, pragmatic smoke test for a docs repo.
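
A slightly stronger smoke test than the page count alone could check that the output directory contains the files a browser would load. The sketch below assumes a Pagefind 1.x bundle includes `pagefind.js` and `pagefind-entry.json`; that file list is an assumption to verify against the actual build output, and `bundle_looks_valid` is a hypothetical helper.

```shell
# Succeeds if the directory $1 looks like a populated Pagefind bundle.
# Assumption: the bundle contains non-empty pagefind.js and pagefind-entry.json.
bundle_looks_valid() {
  dir=$1
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  for f in pagefind.js pagefind-entry.json; do
    [ -s "$dir/$f" ] || { echo "missing or empty: $dir/$f" >&2; return 1; }
  done
}
```

A CI step could call `bundle_looks_valid public/pagefind` after indexing. It does not prove that a known term returns a hit, only that the bundle is present and non-empty.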

Overall: a well-executed, simplifying migration. The main blocker to verify before merge is the package-lock.json name change (item 1); the docs-ui coordination (first risk) is the key functional dependency to sequence correctly.
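
The name agreement raised in item 1 can be checked mechanically. The sketch below is a crude text match on the first `"name"` field of each file, assuming the conventional npm layout where that field appears near the top. A JSON-aware tool would be more robust, and both helper names are hypothetical.

```shell
# Print the value of the first "name" field found in a JSON file (text match only).
first_json_name() {
  sed -n 's/^[[:space:]]*"name"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" |
    head -n 1
}

# Succeeds if both files declare the same non-empty name.
names_agree() {
  a=$(first_json_name "$1")
  b=$(first_json_name "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example, `names_agree package.json package-lock.json || echo "name mismatch"` would flag a lockfile regenerated from a different package.json.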


2 participants