Add an AI "Reformat" option for manuscript sections #1277
Merged
…ed by regex
The deterministic Format button can't resolve genuinely scrambled PDF-paste text (duplicated/misplaced quotes, ambiguous wraps before a capitalized word). Add a second per-section "Reformat (AI)" button that sends the text to the selected AI provider with a strict "fix only whitespace, line breaks, drop-caps, and quote placement; change no words" prompt (new stage `manuscript-reformat`), then persists the result (snapshotted, revertible).
Trustworthy by construction: a server-side integrity guard compares the letter/digit skeleton of the text before and after and rejects (400, nothing saved) any result that rewrote, added, or dropped words. The model may only move whitespace and re-attach quotes, plus delete a tiny budget of artifact characters (e.g. a duplicated drop-cap).
The core reformatManuscriptText() is exported so the importer can reuse it (PLAN item) to clean prose at ingest.
- prompt: data.reference/prompts/stages/manuscript-reformat.md (+ stage-config)
- migration 087 seeds the stage into existing installs (boot runs migrations, not setup-data)
- service: reformatManuscriptText / reformatManuscriptSection in manuscriptFix.js
- route: POST /pipeline/series/:id/manuscript/sections/:issueId/reformat
- client: reformatPipelineManuscriptSection + the second header button
…hange path
- the proportional deletion budget let a large section silently absorb a dropped contiguous clause (a clause is a valid subsequence deletion). Replaced the run-length heuristic (which under-reported when deleted chars coincidentally matched later text) with a small ABSOLUTE total-deletion budget: for a subsequence the net length drop is the exact total deleted, so an 8-char cap rejects any dropped clause/sentence regardless of document size while still allowing a few scattered artifact glyphs (duplicated drop-cap)
- the no-change reformat path no longer flashes a 'saved' badge or rewrites baseline; it only toasts 'Already well-formatted'
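The absolute-budget check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in manuscriptFix.js: the names `skeleton`, `isSubsequence`, and `checkDeletionBudget` are hypothetical, and the skeleton here is case-preserving for simplicity. Only the 8-character cap and the subsequence requirement come from the commit message.

```javascript
// Hypothetical sketch; the 8-character cap is the figure quoted in the commit message.
const MAX_DELETED_CHARS = 8;

// Letter/digit skeleton: every character that is not a Unicode letter or digit is dropped.
function skeleton(text) {
  return text.replace(/[^\p{L}\p{N}]/gu, '');
}

// True when `shorter` can be produced from `longer` by deleting characters only.
function isSubsequence(shorter, longer) {
  let i = 0;
  for (let j = 0; j < longer.length && i < shorter.length; j += 1) {
    if (shorter[i] === longer[j]) i += 1;
  }
  return i === shorter.length;
}

// For a subsequence, the net length drop equals the total number of deleted characters,
// so a single absolute cap bounds the deletion regardless of document size.
function checkDeletionBudget(before, after) {
  const a = skeleton(before);
  const b = skeleton(after);
  const deleted = a.length - b.length;
  if (deleted < 0) return { ok: false, reason: 'added characters' };
  if (!isSubsequence(b, a)) return { ok: false, reason: 'substituted characters' };
  if (deleted > MAX_DELETED_CHARS) return { ok: false, reason: 'deleted too much' };
  return { ok: true, deleted };
}

// A duplicated drop-cap costs one character and passes.
console.log(checkDeletionBudget('T\nThe cat sat.', 'The cat sat.'));
// A dropped clause removes 18 skeleton characters and is rejected.
console.log(checkDeletionBudget('He left, and she stayed behind.', 'He left.'));
```

An absolute cap is simple to reason about, but it still admits any deletion of 8 skeleton characters or fewer, including a short word.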
…, edit-safe
Three correctness fixes from codex review of the AI reformat path:
1. Unsaved edits were ignored: the endpoint loaded the SAVED stage text, so reformatting with unsaved edits in the textarea reformatted stale text and the response overwrote the edits. The endpoint is now compute-only and takes the client's live `content`; the client sends its current (possibly unsaved) text.
2. Stale overwrite during the slow call: the endpoint no longer persists; the client owns the save and, before applying, checks the section's live content still equals what it sent (via a live-sections ref), discarding the result instead of clobbering a mid-call edit. Cross-tab concurrent saves are last-write-wins, identical to the Save button.
3. Integrity guard accepted short word deletions: `do not go` → `do go` (3 skeleton chars) slipped past the deletion budget yet inverts meaning. Replaced the budget/subsequence allowance with an EXACT skeleton match: the model may only move whitespace and quotation-mark/punctuation glyphs, never delete a letter or word. A duplicated word stays put (the deterministic Format button owns that dedup). Prompt updated to match.
Route moves to POST /series/:id/manuscript/reformat (compute-only, no issue mutation). reformatManuscriptText stays exported for the importer.
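The client-side stale check in fix 2 might look roughly like the sketch below. Every name here (`reformatSection`, `liveSectionsRef`, `requestReformat`, `applyAndSave`) is hypothetical and stands in for whatever the real client uses; the only behavior taken from the commit message is that the client sends its live text, compares the live text again after the call, and discards the result on a mismatch.

```javascript
// Hypothetical client helper; none of these names are the project's actual API.
// `liveSectionsRef` mimics a React-style ref: { current: Map<sectionId, text> }.
async function reformatSection({ sectionId, liveSectionsRef, requestReformat, applyAndSave }) {
  // Send the live (possibly unsaved) text, not the last saved version.
  const sent = liveSectionsRef.current.get(sectionId);
  const result = await requestReformat(sent); // slow, compute-only call

  // The user may have edited the section while the call was in flight.
  if (liveSectionsRef.current.get(sectionId) !== sent) {
    return { status: 'discarded' };
  }
  // Nothing to apply or save when the text came back unchanged.
  if (result === sent) {
    return { status: 'unchanged' };
  }
  // The client owns the save; the endpoint itself persisted nothing.
  await applyAndSave(sectionId, result);
  return { status: 'applied' };
}
```

Reading through a ref rather than a captured variable matters here: a value captured before the `await` would always equal what was sent, so the comparison would never detect a mid-call edit.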
…sitive
The skeleton was lowercased before comparing, so a case-only rewrite (US → us, a de-capitalized name/heading) passed the guard despite the 'every letter preserved exactly' contract. No reformat operation changes letter case, so compare case-sensitively and reject case-only changes too.
…y skeleton
The skeleton ignored all non-alphanumerics, so a mangled contraction (don't → dont) passed. Keep apostrophes (curly normalized to straight, so a benign smart-quote pass isn't flagged); they're never touched by a legitimate reformat. Documented the residual cases that genuinely can't be guarded without breaking reformatting: a mid-word hyphen removal (X-ray → Xray) is indistinguishable from de-hyphenating a wrap-split word, and a removed inter-word boundary (now here → nowhere) is indistinguishable from a drop-cap rejoin (T\nhe → The). Changelog scoped to the actual guarantee.
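Taken together, the last three commits describe an exact, case-sensitive skeleton comparison that keeps apostrophes. A minimal sketch under those assumptions is shown below; `integritySkeleton` and `preservesSkeleton` are hypothetical names, and mapping both U+2018 and U+2019 to a straight apostrophe is an assumption about what "curly normalized to straight" covers.

```javascript
// Hypothetical sketch of the final guard; not the code in manuscriptFix.js.
function integritySkeleton(text) {
  return text
    .replace(/[\u2018\u2019]/g, "'")   // assumption: both curly single quotes become straight
    .replace(/[^\p{L}\p{N}']/gu, '');  // keep letters, digits, apostrophes; case is preserved
}

// Exact match: no deletion budget and no subsequence allowance.
function preservesSkeleton(before, after) {
  return integritySkeleton(before) === integritySkeleton(after);
}

// Accepted: legitimate reformat operations.
console.log(preservesSkeleton('T\nhe cat', 'The cat'));                  // true (drop-cap rejoin)
console.log(preservesSkeleton('implemen-\ntation', 'implementation'));   // true (de-hyphenation)
console.log(preservesSkeleton('don\u2019t', "don't"));                   // true (smart-quote pass)

// Rejected: the cases the commits set out to catch.
console.log(preservesSkeleton('do not go', 'do go'));                    // false (dropped word)
console.log(preservesSkeleton('US', 'us'));                              // false (case change)
console.log(preservesSkeleton("don't", 'dont'));                         // false (mangled contraction)

// Residual cases the commit documents as unguardable.
console.log(preservesSkeleton('X-ray', 'Xray'));                         // true
console.log(preservesSkeleton('now here', 'nowhere'));                   // true
```

The last two calls return true for the same reason the first two do: hyphens and whitespace are outside the skeleton, so the guard cannot tell a harmful removal from a legitimate rejoin.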
Summary
The deterministic Format button (regex) keeps hitting PDF-paste cases it fundamentally can't resolve: genuinely scrambled text such as duplicated/misplaced quotes, a dropped `"I`, or ambiguous wraps before a capitalized word. This adds a second per-section Reformat (AI) button (both Live and Review modes) that resolves these semantically.
It sends the section to the AI provider selected in the sidebar with a strict prompt (new `manuscript-reformat` stage): fix only whitespace, line breaks, drop-caps, and quotation-mark placement; change no words. The result is persisted server-side (snapshotted, so History ▸ Revert undoes it).
Trustworthy by construction: the integrity guard. Reformatting only moves whitespace and re-attaches quotes, so the letter/digit "skeleton" of the text is (near-)identical before and after. The service compares skeletons and rejects (400, nothing saved) any result that rewrote, added, or dropped words. The model is only ever allowed to move whitespace/quotes plus delete a tiny bounded budget of artifact characters (e.g. a duplicated drop-cap), verified to form a subsequence with no substitutions. So the AI can never silently alter your prose.
The core `reformatManuscriptText()` is exported so the importer can reuse it to clean prose at ingest (tracked as a PLAN item; it needs opt-in UX because it's one LLM call per issue).
What's here
- `data.reference/prompts/stages/manuscript-reformat.md` + a `stage-config.json` entry; migration 087 seeds both into existing installs (boot runs migrations, not `setup-data.js`).
- `reformatManuscriptText` / `reformatManuscriptSection` + the integrity guard in `manuscriptFix.js`.
- `POST /pipeline/series/:id/manuscript/sections/:issueId/reformat` (`providerOverride`/`modelOverride`, errors → 400/404 via the shared mapper).
- `reformatPipelineManuscriptSection` API wrapper + the second header button, wired through both section components with the editor's provider override.
Test plan
- `server/services/pipeline/manuscriptReformat.test.js`: 10 tests on the integrity guard: pure-whitespace change accepted, de-hyphenation accepted, word substitution / inserted sentence / large deletion rejected, tiny artifact deletion accepted, code-fence + marker stripping, empty-input no-op, and the format-label/source pass-through. All green.
- `eslint` clean (client + server); existing manuscript suites pass.
Note
The first AI render on a section costs one LLM call (uses the sidebar provider). The plain Format button stays for instant/offline cleanup.