feat: add client.parse() for the Data Extraction API (/extraction/parse) by nickwinder · Pull Request #47 · PSPDFKit/nutrient-dws-client-python

nickwinder · 2026-05-27T07:31:50Z

Why

The Data Extraction API (/extraction/parse) is now generally available. This PR adds first-class client support so users can call it directly from NutrientClient without constructing raw HTTP requests.

Summary

- New client.parse() method covering all four processing modes (text, structure, understand, agentic) and both output formats (spatial element list, whole-document markdown).
- Typed ParseResponse envelope with a discriminated union of element variants (paragraph, table, formula, picture, keyValueRegion, handwriting), so that `if element["type"] == "table": ...` narrows correctly via the type discriminator.
- New ExtractionCredits type module to surface the extraction-credit billing bucket, which is separate from the processor-credit bucket consumed by existing endpoints. README, changelog, and method docstring all make the distinction explicit so callers do not conflate the two.
- 16 new unit tests covering request shape (per mode), response handling for both output formats, and error propagation (401 / 400 / 402 / 500).
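
To illustrate the narrowing described in the summary, here is a minimal sketch of a discriminated union of TypedDicts. Only the `type` key and the variant names come from this PR; the `text` and `rows` fields and the `describe` helper are hypothetical, and only two of the six variants are shown.

```python
from typing import Literal, TypedDict, Union


class ParagraphElement(TypedDict):
    type: Literal["paragraph"]
    text: str  # hypothetical field, for illustration only


class TableElement(TypedDict):
    type: Literal["table"]
    rows: list[list[str]]  # hypothetical field, for illustration only


# The union is discriminated by the literal value stored under "type".
ParseElement = Union[ParagraphElement, TableElement]


def describe(element: ParseElement) -> str:
    if element["type"] == "table":
        # A type checker treats `element` as TableElement inside this branch.
        return f"table with {len(element['rows'])} rows"
    # Only ParagraphElement remains here.
    return f"paragraph: {element['text']}"
```

Because each variant pins `type` to a distinct `Literal`, a type checker such as mypy can rule out the other variants after the comparison, so variant-specific keys can be read without a cast.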

Verification — static

- mypy clean on src/ (strict)
- ruff check clean on touched files
- pytest tests/unit: 263 / 263 passing (16 new in tests/unit/test_parse.py)

Verification — live (prod)

A full sweep against the prod API using tests/data/sample.pdf (6 pages) covered every documented (mode, output_format) combination plus the spec-rejected case, both alternative input shapes (bytes, file-like), and both error paths. All 12 calls behaved as expected:

| # | Mode | Format | Input | Result | Cost (credits) | Latency |
|---|---|---|---|---|---|---|
| 1 | `text` | `markdown` | path | 1922-char markdown returned | 6.0 | 2.9 s |
| 2 | `text` | `spatial` | path | `ValidationError` HTTP 400 (rejected per spec) | n/a | 2.4 s |
| 3 | `structure` | `spatial` | path | 72 elements over 6 pages | 9.0 | 3.0 s |
| 4 | `structure` | `markdown` | path | 2560-char markdown returned | 1.5 | 2.6 s |
| 5 | `understand` | `spatial` | path | 124 elements over 6 pages | 54.0 | 14.9 s |
| 6 | `understand` | `markdown` | path | 5608-char markdown returned | 9.0 | 14.4 s |
| 7 | `agentic` | `spatial` | path | 122 elements over 6 pages | 108.0 | 36.4 s |
| 8 | `agentic` | `markdown` | path | 7176-char markdown returned | 18.0 | 37.7 s |
| 9 | `structure` | `spatial` | `bytes` | 72 elements over 6 pages | 9.0 | 2.9 s |
| 10 | `structure` | `spatial` | file-like | 72 elements over 6 pages | 9.0 | 2.8 s |
| 11 | n/a | n/a | invalid path | `FileNotFoundError` raised | n/a | n/a |
| 12 | `structure` | `spatial` | bad API key | `AuthenticationError` HTTP 401 raised | n/a | 1.6 s |

Adds first-class support for the Data Extraction API on NutrientClient. Covers all four processing modes (text, structure, understand, agentic) and both output shapes (spatial elements and whole-document Markdown). The response surface is a fully typed ParseResponse TypedDict with a discriminated union of element variants (paragraph, table, formula, picture, keyValueRegion, handwriting) so callers can narrow on `type`.

The Data Extraction API is billed against extraction credits, which are a separate billing bucket from the processor API credits consumed by the other endpoints used by this client (Build, sign, OCR, watermarking, etc.). Docstrings, README, and changelog make that distinction explicit so callers do not conflate the two buckets.

Verification:
- 16 new unit tests in tests/unit/test_parse.py (request shape per mode, response parsing, error propagation for 401 / 400 / 402 / 500).
- mypy strict and ruff clean on src/.

Endpoint surface (httpx-multipart): POST /extraction/parse with a 'file' part and an optional 'instructions' part carrying the JSON {mode, output:{format}} body. Extends the existing send_request infra (RequestConfig + TypeGuard + overload) without churn to existing endpoint paths.
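
The endpoint surface described in this commit message can be sketched roughly as follows. This is not the client's code: `build_parse_parts` is a hypothetical helper, and omitting the `instructions` part when no option is set is an assumption drawn from the word "optional". Only the part names and the `{mode, output:{format}}` JSON shape come from the commit message.

```python
import json
from typing import Dict, Optional, Union


def build_parse_parts(
    file_bytes: bytes,
    mode: Optional[str] = None,
    output_format: Optional[str] = None,
) -> Dict[str, Union[bytes, str]]:
    """Assemble the multipart parts for POST /extraction/parse (sketch)."""
    parts: Dict[str, Union[bytes, str]] = {"file": file_bytes}

    # Build the {mode, output: {format}} body from whichever options were given.
    instructions: Dict[str, object] = {}
    if mode is not None:
        instructions["mode"] = mode
    if output_format is not None:
        instructions["output"] = {"format": output_format}

    # Assumption: the optional part is left out entirely when it would be empty.
    if instructions:
        parts["instructions"] = json.dumps(instructions)
    return parts
```

The returned mapping would then be handed to whatever HTTP layer performs the multipart upload.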

The extraction-credits accounting shape (cost + remainingCredits) will surface on every future endpoint billed against the extraction-credits bucket, not just /extraction/parse. Factor it out of types/parse.py into its own module so other endpoints can import it without pulling in the whole parse type tree.

Also clarify ParseBounds: document that (x, y) is the top-left corner and that bounds share a coordinate space with the page dimensions in ParsePageRef.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
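
As a rough illustration of the two shapes this commit touches, the sketch below declares stand-in TypedDicts and converts a bounding box into page-relative fractions. The `cost` and `remainingCredits` keys and the top-left meaning of `(x, y)` come from the commit message; the numeric types, the `width`/`height` keys on the bounds, and the `bounds_as_fractions` helper are assumptions.

```python
from typing import Tuple, TypedDict


class ExtractionCredits(TypedDict):
    cost: float              # credits charged for this call (type assumed)
    remainingCredits: float  # credits left in the extraction bucket (type assumed)


class ParseBounds(TypedDict):
    x: float       # left edge; (x, y) is the top-left corner
    y: float       # top edge
    width: float   # assumed key name
    height: float  # assumed key name


def bounds_as_fractions(
    bounds: ParseBounds, page_width: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) as fractions of the page size.

    This works only because bounds and page dimensions share one coordinate space.
    """
    left = bounds["x"] / page_width
    top = bounds["y"] / page_height
    right = (bounds["x"] + bounds["width"]) / page_width
    bottom = (bounds["y"] + bounds["height"]) / page_height
    return (left, top, right, bottom)
```

Normalizing this way makes element positions comparable across pages of different sizes.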

Three small style nits surfaced in code review against the patterns set by sign() and the other raw-send_request methods (get_account_info, create_token, delete_token):

- Drop the redundant inner cast("ParseOutput", {"format": output_format}). ParseOutput is a single-key TypedDict with total=False; the literal already satisfies it structurally via the surrounding ParseInstructions annotation. No other call site in client.py casts an inner literal this way.
- Replace the RequestConfig(...) constructor call with an inline dict literal at the send_request boundary, matching sign / create_token / delete_token / get_account_info. RequestConfig is a generic TypedDict; the constructor form is the outlier.
- Broaden the file parameter docstring to call out that the endpoint accepts PDFs, Office documents, and images. Unlike sign(), parsing is not PDF-only, and the previous docstring implicitly invited readers to transplant sign()'s PDF-only mental model.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
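
The first nit can be illustrated with a small sketch. The TypedDicts below are stand-ins reconstructed from the `{mode, output:{format}}` shape, not copies of the module; `total=False` on ParseOutput comes from the commit message, while the totality and literal values of ParseInstructions are assumptions.

```python
from typing import Literal, TypedDict, cast


class ParseOutput(TypedDict, total=False):
    format: Literal["spatial", "markdown"]


class ParseInstructions(TypedDict, total=False):  # totality assumed
    mode: Literal["text", "structure", "understand", "agentic"]
    output: ParseOutput


# With the inner cast: accepted by a type checker, but the cast adds nothing.
with_cast: ParseInstructions = {
    "mode": "structure",
    "output": cast("ParseOutput", {"format": "markdown"}),
}

# Without it: the annotation on the outer dict already tells the checker
# that the inner literal must be a ParseOutput.
without_cast: ParseInstructions = {
    "mode": "structure",
    "output": {"format": "markdown"},
}
```

At runtime `cast` simply returns its argument, so both dictionaries are equal; dropping it only removes noise for readers.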

The README's Data Extraction section previously described WHAT parse() does (modes, output formats, billing) without explaining WHY a user would reach for it over the existing extract_* helpers. Rework so the positioning leads:

- New "designed for" bullets up top: RAG ingestion, search indexing, content migration, form/invoice extraction, layout-aware document understanding.
- New output-format selector table mapping each format to its primary use case (markdown → RAG/search; spatial → form/layout).
- Modes table reworded so each row says when to pick it, not just what it technically does (text = born-digital only; structure = OCR for scanned input; understand = AI-augmented for complex layouts; agentic = + VLM for image-heavy content).
- Two worked recipes: RAG ingestion (PDF → markdown → embed) and form extraction (PDF → spatial elements → structured dict).

Also adds a parse() entry to docs/METHODS.md (it was missing entirely) and a "Designed for" preamble to the parse() docstring so the method's positioning is visible in IDE hover popups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
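
The README recipes themselves are not reproduced on this page. As a loose sketch of the middle step of the RAG recipe (markdown to chunks ready for embedding), the hypothetical `chunk_markdown` helper below splits whole-document Markdown at heading lines and caps the chunk size. It is an illustration, not the recipe from the README, and it naively treats any line starting with `#` as a heading, including comment lines inside fenced code.

```python
from typing import List


def chunk_markdown(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split Markdown into sections at heading lines, then cap each chunk's length."""
    sections: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        # A new heading closes the section collected so far.
        if line.startswith("#") and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # Cut oversized sections into fixed-size windows.
    chunks: List[str] = []
    for section in sections:
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return [chunk for chunk in chunks if chunk]
```

Each returned chunk could then be passed to an embedding model of the caller's choice.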

HungKNguyen

The main blocker is that DWS Extract actually requires a different API key from DWS Processor. Maybe the client can be initialized with multiple API keys for different products.

DWS Extract is a separate product from DWS Processor with its own API key and credit pool. Calling /extraction/parse with the Processor key returns 403. Add an optional extract_api_key constructor parameter (str or async callable) that parse() prefers over api_key when set; non-parse methods keep using api_key. Falling back to api_key keeps a single-key setup working once tenants get global DWS keys.

Also reject mode='text' + output_format='spatial' before the request goes out: the text mode only produces markdown, so the combination would 502 on the server side. Surface it as a ValidationError with guidance.

Addresses PR #47 review feedback from HungKNguyen.
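
The key-selection rule and the client-side guard described above might look roughly like this. It is a sketch under stated assumptions: `resolve_parse_key`, `is_supported`, and `check_mode_and_format` are hypothetical names, the `ValidationError` class here is a local stand-in for the client's own exception, and "when set" is interpreted as "not None".

```python
import asyncio
from typing import Awaitable, Callable, Optional, Union

# A key is either a plain string or an async callable that returns one.
ApiKey = Union[str, Callable[[], Awaitable[str]]]


class ValidationError(ValueError):
    """Local stand-in for the client's ValidationError."""


async def resolve_parse_key(api_key: ApiKey, extract_api_key: Optional[ApiKey] = None) -> str:
    """Prefer extract_api_key for parse(); fall back to api_key when it is not set."""
    chosen = extract_api_key if extract_api_key is not None else api_key
    if callable(chosen):
        return await chosen()
    return chosen


def is_supported(mode: str, output_format: str) -> bool:
    # The text mode only produces markdown, so text + spatial is the rejected pair.
    return not (mode == "text" and output_format == "spatial")


def check_mode_and_format(mode: str, output_format: str) -> None:
    """Raise before any HTTP round-trip if the combination cannot succeed."""
    if not is_supported(mode, output_format):
        raise ValidationError(
            "mode='text' only produces markdown; use output_format='markdown' "
            "or pick another mode for spatial output"
        )
```

Failing locally gives the caller an actionable message without spending a network round-trip on a request the server would reject.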

The docstring promises pageIndex/width/height are always populated and only pageNumber may be absent, but the class was declared `total=False`, which contradicts that and forces type-strict callers to guard every subscript access on guaranteed-present fields. Switch to the default (`total=True`) shape with pageNumber explicitly `NotRequired`, matching the precedent set by ParseBounds in the same module. No runtime impact — the wire already populates these fields.
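
A minimal sketch of the shape this commit describes: a TypedDict that is total by default with one key marked `NotRequired`. The key names come from the commit message; the value types and the `page_label` helper are assumptions made for illustration.

```python
from typing import TypedDict

try:
    from typing import NotRequired  # Python 3.11+
except ImportError:  # older interpreters
    from typing_extensions import NotRequired


class ParsePageRef(TypedDict):
    # Required keys: strict callers may subscript these without a guard.
    pageIndex: int  # type assumed
    width: float    # type assumed
    height: float   # type assumed
    # The only key that may be absent.
    pageNumber: NotRequired[int]  # type assumed


def page_label(page: ParsePageRef) -> str:
    """Use pageNumber when present, otherwise fall back to the index."""
    number = page.get("pageNumber")
    if number is not None:
        return str(number)
    return f"index {page['pageIndex']}"
```

With `total=False`, a strict type checker would flag `page['pageIndex']` as a possibly missing key; with the shape above, only `pageNumber` needs the `.get()` guard.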

nickwinder · 2026-05-27T09:45:39Z

Live smoke against the DWS APIs (commit `64a3159`)

Verified the dual-key flow end-to-end against the real /extraction/parse and Processor surfaces, plus the new client-side validation.

| Case | Status | Detail |
|---|---|---|
| P1 processor: `extract_text` | OK | 6 pages |
| P2 processor: `get_account_info` | OK | subscription=enterprise |
| E1 parse text + markdown | OK | cost=6.0, pages=6, md_len=1922 |
| E2 parse structure + spatial | OK | cost=9.0, pages=6, elements=72 |
| E3 parse structure + markdown | OK | cost=1.5, pages=1, md_len=2560 |
| E4 parse understand + spatial | OK | cost=54.0, pages=6, elements=124 |
| E5 parse understand + markdown | OK | cost=9.0, pages=1, md_len=5607 |
| E6 parse agentic + markdown | OK | cost=18.0, pages=1, md_len=6770, ~54s elapsed (first run hit a transient `NetworkError`; passed on retry with a longer timeout) |
| V1 parse text + spatial | OK | `ValidationError` raised pre-network; the new client-side guard works |
| V2 Processor key against `/extraction/parse` | OK | 403: the exact failure called out in review, now provably gated |
| B1 parse with `bytes` input | OK | elements=72, cost=9.0 |

What this verifies

- extract_api_key routes /extraction/parse through the Extract key end-to-end (E1-E6, B1 all billed against data_extraction_credits).
- Existing Processor methods on the same client keep using the Processor key (P1, P2).
- Passing only api_key (Processor) to parse() is correctly rejected with 403, i.e. the documented restriction holds (V2).
- mode='text' + output_format='spatial' raises ValidationError before any HTTP round-trip (V1).
- bytes input behaves identically to a file-path input (B1).

nickwinder and others added 3 commits May 27, 2026 19:24

nickwinder added enhancement New feature or request claude-code-assisted labels May 27, 2026

nickwinder self-assigned this May 27, 2026

nickwinder marked this pull request as ready for review May 27, 2026 07:38

nickwinder requested a review from HungKNguyen May 27, 2026 07:38

HungKNguyen reviewed May 27, 2026

Comment thread src/nutrient_dws/client.py

HungKNguyen reviewed May 27, 2026

Comment thread README.md Outdated

HungKNguyen requested changes May 27, 2026

nickwinder mentioned this pull request May 27, 2026

feat: add client.parse() for the Data Extraction API (/extraction/parse) PSPDFKit-labs/nutrient-dws-client-typescript#12

Draft

16 tasks

HungKNguyen approved these changes May 27, 2026

nickwinder merged commit 2705176 into main May 27, 2026
14 checks passed

nickwinder deleted the feat/task-134-parse-ga branch May 27, 2026 09:48

nickwinder mentioned this pull request May 27, 2026

Release/3.1.0 #48

Merged

nickwinder merged 6 commits into main from feat/task-134-parse-ga
