
feat: add client.parse() for the Data Extraction API (/extraction/parse) #47

Merged: nickwinder merged 6 commits into main from feat/task-134-parse-ga on May 27, 2026


@nickwinder nickwinder commented May 27, 2026

Why

The Data Extraction API (/extraction/parse) is now generally available. This PR adds first-class client support so users can call it directly from NutrientClient without constructing raw HTTP requests.

Summary

  • New client.parse() method covering all four processing modes (text, structure, understand, agentic) and both output formats (spatial element list, whole-document markdown).
  • Typed ParseResponse envelope with a discriminated union of element variants (paragraph, table, formula, picture, keyValueRegion, handwriting) — if element["type"] == "table": ... narrows correctly via the type discriminator.
  • New ExtractionCredits type module to surface the extraction-credit billing bucket, which is separate from the processor-credit bucket consumed by existing endpoints. README, changelog, and method docstring all make the distinction explicit so callers do not conflate the two.
  • 16 new unit tests covering request shape (per mode), response handling for both output formats, and error propagation (401 / 400 / 402 / 500).
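
The narrowing described in the summary can be sketched as follows. This is a minimal illustration, not the PR's actual type definitions: the class names and the `text` and `rows` fields are assumptions, and only the `type` discriminator and the variant names come from the description above.

```python
from typing import List, Literal, TypedDict, Union


class ParagraphElement(TypedDict):
    # Hypothetical variant; only the "type" discriminator is taken from the PR text.
    type: Literal["paragraph"]
    text: str  # assumed field


class TableElement(TypedDict):
    # Hypothetical variant; the "rows" field is an assumption for illustration.
    type: Literal["table"]
    rows: List[List[str]]


Element = Union[ParagraphElement, TableElement]


def count_table_cells(elements: List[Element]) -> int:
    """Sum the cells of every table element, skipping other variants."""
    total = 0
    for element in elements:
        if element["type"] == "table":
            # A type checker narrows `element` to TableElement in this branch,
            # so accessing "rows" needs no cast.
            total += sum(len(row) for row in element["rows"])
    return total


sample: List[Element] = [
    {"type": "paragraph", "text": "Invoice 42"},
    {"type": "table", "rows": [["a", "b"], ["c", "d"]]},
]
```

Calling `count_table_cells(sample)` visits only the table variant; the paragraph is skipped without any `isinstance` check, which is the point of keying the union on a literal `type` field.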

Verification — static

  • mypy clean on src/ (strict)
  • ruff check clean on touched files
  • pytest tests/unit — 263 / 263 passing (16 new in tests/unit/test_parse.py)

Verification — live (prod)

A full sweep against the prod API using tests/data/sample.pdf (6 pages) covered every documented (mode, output_format) combination plus the spec-rejected case, both alternative input shapes (bytes, file-like), and both error paths. All 12 calls behaved as expected:

| # | Mode | Format | Input | Result | Cost (credits) | Latency |
|---|------|--------|-------|--------|----------------|---------|
| 1 | text | markdown | path | 1922-char markdown returned | 6.0 | 2.9 s |
| 2 | text | spatial | path | ValidationError HTTP 400 (rejected per spec) | | 2.4 s |
| 3 | structure | spatial | path | 72 elements over 6 pages | 9.0 | 3.0 s |
| 4 | structure | markdown | path | 2560-char markdown returned | 1.5 | 2.6 s |
| 5 | understand | spatial | path | 124 elements over 6 pages | 54.0 | 14.9 s |
| 6 | understand | markdown | path | 5608-char markdown returned | 9.0 | 14.4 s |
| 7 | agentic | spatial | path | 122 elements over 6 pages | 108.0 | 36.4 s |
| 8 | agentic | markdown | path | 7176-char markdown returned | 18.0 | 37.7 s |
| 9 | structure | spatial | bytes | 72 elements over 6 pages | 9.0 | 2.9 s |
| 10 | structure | spatial | file-like | 72 elements over 6 pages | 9.0 | 2.8 s |
| 11 | | | invalid path | FileNotFoundError raised | | |
| 12 | structure | spatial | bad API key | AuthenticationError HTTP 401 raised | | 1.6 s |

nickwinder and others added 3 commits May 27, 2026 19:24
Adds first-class support for the Data Extraction API on NutrientClient.
Covers all four processing modes (text, structure, understand, agentic)
and both output shapes (spatial elements and whole-document Markdown).

The response surface is a fully typed ParseResponse TypedDict with a
discriminated union of element variants (paragraph, table, formula,
picture, keyValueRegion, handwriting) so callers can narrow on `type`.

The Data Extraction API is billed against extraction credits, which are
a separate billing bucket from the processor API credits consumed by the
other endpoints used by this client (Build, sign, OCR, watermarking,
etc.). Docstrings, README, and changelog make that distinction explicit
so callers do not conflate the two buckets.

Verification:
- 16 new unit tests in tests/unit/test_parse.py (request shape per mode,
  response parsing, error propagation for 401 / 400 / 402 / 500).
- mypy strict and ruff clean on src/.

Endpoint surface (httpx-multipart): POST /extraction/parse with a
'file' part and an optional 'instructions' part carrying the JSON
{mode, output:{format}} body. Extends the existing send_request infra
(RequestConfig + TypeGuard + overload) without churn to existing
endpoint paths.
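
The request shape described above can be sketched without any HTTP library. The helper name `build_parse_parts` and its parameters are hypothetical; only the `file` part, the optional `instructions` part, and the `{mode, output:{format}}` JSON layout come from the commit message. The client's actual serialization may differ.

```python
import json
from typing import Dict, Optional, Tuple, Union

# Each multipart part is either a (filename, content) pair or a plain string.
Part = Union[Tuple[str, bytes], str]


def build_parse_parts(
    file_name: str,
    file_bytes: bytes,
    mode: Optional[str] = None,
    output_format: Optional[str] = None,
) -> Dict[str, Part]:
    """Assemble the multipart parts for POST /extraction/parse (sketch)."""
    parts: Dict[str, Part] = {"file": (file_name, file_bytes)}

    instructions: Dict[str, object] = {}
    if mode is not None:
        instructions["mode"] = mode
    if output_format is not None:
        instructions["output"] = {"format": output_format}

    # The 'instructions' part is optional, so omit it when nothing was requested.
    if instructions:
        parts["instructions"] = json.dumps(instructions)
    return parts
```

With `mode="structure"` and `output_format="spatial"`, the `instructions` part decodes to `{"mode": "structure", "output": {"format": "spatial"}}`; with neither argument, only the `file` part is sent.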
The extraction-credits accounting shape (cost + remainingCredits) will
surface on every future endpoint billed against the extraction-credits
bucket, not just /extraction/parse. Factor it out of types/parse.py into
its own module so other endpoints can import it without pulling in the
whole parse type tree.

Also clarify ParseBounds: document that (x, y) is the top-left corner
and that bounds share a coordinate space with the page dimensions in
ParsePageRef.
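
Because bounds and page dimensions share one coordinate space, an element's box can be converted to page-relative fractions with plain division. The sketch below assumes the bounds carry `width` and `height` next to `x` and `y`, and that `y` grows downward from the top edge; neither detail is confirmed by the commit message, and the type names are stand-ins.

```python
from typing import Tuple, TypedDict


class Bounds(TypedDict):
    # (x, y) is the top-left corner, per the commit message.
    x: float
    y: float
    width: float   # assumed field
    height: float  # assumed field


class PageRef(TypedDict):
    pageIndex: int
    width: float
    height: float


def normalize_bounds(bounds: Bounds, page: PageRef) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) as fractions of the page size."""
    left = bounds["x"] / page["width"]
    top = bounds["y"] / page["height"]
    right = (bounds["x"] + bounds["width"]) / page["width"]
    bottom = (bounds["y"] + bounds["height"]) / page["height"]
    return (left, top, right, bottom)
```

For a 400 x 800 page, a box at (100, 200) that is 200 wide and 400 tall maps to (0.25, 0.25, 0.75, 0.75).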

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three small style nits surfaced in code review against the patterns set
by sign() and the other raw-send_request methods (get_account_info,
create_token, delete_token):

- Drop the redundant inner cast("ParseOutput", {"format": output_format}).
  ParseOutput is a single-key TypedDict with total=False; the literal
  already satisfies it structurally via the surrounding ParseInstructions
  annotation. No other call site in client.py casts an inner literal
  this way.

- Replace the RequestConfig(...) constructor call with an inline dict
  literal at the send_request boundary, matching sign / create_token /
  delete_token / get_account_info. RequestConfig is a generic TypedDict;
  the constructor form is the outlier.

- Broaden the file parameter docstring to call out that the endpoint
  accepts PDFs, Office documents, and images. Unlike sign(), parsing is
  not PDF-only, and the previous docstring implicitly invited readers
  to transplant sign()'s PDF-only mental model.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nickwinder nickwinder self-assigned this May 27, 2026
@nickwinder nickwinder marked this pull request as ready for review May 27, 2026 07:38
@nickwinder nickwinder requested a review from HungKNguyen May 27, 2026 07:38

The README's Data Extraction section previously described WHAT parse()
does (modes, output formats, billing) without explaining WHY a user
would reach for it over the existing extract_* helpers. Rework so the
positioning leads:

- New "designed for" bullets up top — RAG ingestion, search indexing,
  content migration, form/invoice extraction, layout-aware document
  understanding.
- New output-format selector table mapping each format to its primary
  use case (markdown → RAG/search; spatial → form/layout).
- Modes table reworded so each row says when to pick it, not just what
  it technically does (text = born-digital only; structure = OCR for
  scanned input; understand = AI-augmented for complex layouts; agentic
  = + VLM for image-heavy content).
- Two worked recipes: RAG ingestion (PDF → markdown → embed) and form
  extraction (PDF → spatial elements → structured dict).

Also adds a parse() entry to docs/METHODS.md (it was missing entirely)
and a "Designed for" preamble to the parse() docstring so the method's
positioning is visible in IDE hover popups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/nutrient_dws/client.py
Comment thread README.md Outdated

@HungKNguyen HungKNguyen left a comment


The main blocker is that DWS Extract actually requires a different API key from DWS Processor. Maybe the client can be initialized with multiple API keys for different products.

DWS Extract is a separate product from DWS Processor with its own API key
and credit pool. Calling /extraction/parse with the Processor key returns
403. Add an optional extract_api_key constructor parameter (str or async
callable) that parse() prefers over api_key when set; non-parse methods
keep using api_key. Falling back to api_key keeps a single-key setup
working once tenants get global DWS keys.
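
The key-selection rule in this commit message can be sketched as below. The function names are hypothetical and the real client may resolve keys differently; the sketch only shows the stated precedence (the Extract key wins when set, otherwise fall back to `api_key`) and the two accepted key shapes (string or async callable).

```python
import asyncio
from typing import Awaitable, Callable, Optional, Union

# A key is either a literal string or an async callable that returns one.
ApiKey = Union[str, Callable[[], Awaitable[str]]]


async def resolve_key(key: ApiKey) -> str:
    """Turn either key shape into a plain string."""
    if isinstance(key, str):
        return key
    return await key()


async def key_for_parse(api_key: ApiKey, extract_api_key: Optional[ApiKey] = None) -> str:
    """Pick the key parse() would send: the Extract key if set, else api_key."""
    chosen = extract_api_key if extract_api_key is not None else api_key
    return await resolve_key(chosen)


async def fetch_extract_key() -> str:
    # Stand-in for a token provider; a real one might call a secrets service.
    return "extract-key"
```

Running `asyncio.run(key_for_parse("processor-key", fetch_extract_key))` resolves the callable and selects the Extract key, while omitting the second argument keeps the single-key behavior.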

Also reject mode='text' + output_format='spatial' before the request goes
out — the text mode only produces markdown, so the combination would 502
on the server side. Surface it as a ValidationError with guidance.
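
A minimal sketch of such a pre-request guard, assuming a stand-in `ValidationError` class and a hypothetical helper name; the client's real exception type and message wording are not shown in this PR text.

```python
class ValidationError(ValueError):
    """Stand-in for the client's ValidationError (hypothetical definition)."""


def check_parse_options(mode: str, output_format: str) -> None:
    """Reject the one combination the commit message calls out, before any I/O."""
    if mode == "text" and output_format == "spatial":
        raise ValidationError(
            "mode='text' only produces markdown; "
            "use output_format='markdown' or pick another mode such as 'structure'."
        )
```

Raising before the request is built means no upload happens and the caller sees actionable guidance instead of a server-side failure.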

Addresses PR #47 review feedback from HungKNguyen.

The docstring promises pageIndex/width/height are always populated and
only pageNumber may be absent, but the class was declared `total=False`,
which contradicts that and forces type-strict callers to guard every
subscript access on guaranteed-present fields. Switch to the default
(`total=True`) shape with pageNumber explicitly `NotRequired`, matching
the precedent set by ParseBounds in the same module.

No runtime impact — the wire already populates these fields.
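
The shape change can be illustrated with a stand-in class. The field types below are assumptions; only the field names and the required/optional split come from the commit message.

```python
try:
    from typing import NotRequired, TypedDict  # Python 3.11+
except ImportError:  # older interpreters, assuming typing_extensions is installed
    from typing_extensions import NotRequired, TypedDict


class PageRef(TypedDict):
    # Default total=True: these three keys are guaranteed present.
    pageIndex: int
    width: float
    height: float
    # Only this key may be absent.
    pageNumber: NotRequired[int]


def page_label(page: PageRef) -> str:
    """Describe a page, guarding only the key that can be missing."""
    if "pageNumber" in page:
        return f"page {page['pageNumber']}"
    # pageIndex needs no guard because it is a required key.
    return f"page index {page['pageIndex']}"
```

With `total=False`, a strict type checker would flag `page['pageIndex']` as a possibly missing key; under this shape only `pageNumber` needs the `in` check.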
nickwinder commented May 27, 2026

Live smoke against the DWS APIs (commit 64a3159)

Verified the dual-key flow end-to-end against the real /extraction/parse and Processor surfaces, plus the new client-side validation.

| Case | Status | Detail |
|------|--------|--------|
| P1 processor: extract_text | OK | 6 pages |
| P2 processor: get_account_info | OK | subscription=enterprise |
| E1 parse text + markdown | OK | cost=6.0, pages=6, md_len=1922 |
| E2 parse structure + spatial | OK | cost=9.0, pages=6, elements=72 |
| E3 parse structure + markdown | OK | cost=1.5, pages=1, md_len=2560 |
| E4 parse understand + spatial | OK | cost=54.0, pages=6, elements=124 |
| E5 parse understand + markdown | OK | cost=9.0, pages=1, md_len=5607 |
| E6 parse agentic + markdown | OK | cost=18.0, pages=1, md_len=6770, ~54s elapsed (first run hit a transient NetworkError; passed on retry with a longer timeout) |
| V1 parse text + spatial | OK | ValidationError raised pre-network; the new client-side guard works |
| V2 Processor key against /extraction/parse | OK | 403: the exact failure called out in review, now provably gated |
| B1 parse with bytes input | OK | elements=72, cost=9.0 |

What this verifies

  • extract_api_key routes /extraction/parse through the Extract key end-to-end (E1-E6, B1 all billed against data_extraction_credits).
  • Existing Processor methods on the same client keep using the Processor key (P1, P2).
  • Passing only api_key (Processor) to parse() is correctly rejected with 403 — i.e. the documented restriction holds (V2).
  • mode='text' + output_format='spatial' raises ValidationError before any HTTP round-trip (V1).
  • bytes input behaves identically to a file-path input (B1).

@nickwinder nickwinder merged commit 2705176 into main May 27, 2026
14 checks passed
@nickwinder nickwinder deleted the feat/task-134-parse-ga branch May 27, 2026 09:48
@nickwinder nickwinder mentioned this pull request May 27, 2026

2 participants