feat: unify rewrite domain metadata into a single source by memadi-nv · Pull Request #143 · NVIDIA-NeMo/Anonymizer

memadi-nv · 2026-04-29T23:19:21Z

Summary

Refactors rewrite-pipeline domain metadata into a single source of truth and updates the domain taxonomy.

Refactor

Collapse three parallel structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into one DomainMetadata dataclass + one DOMAIN_METADATA tuple keyed by Domain.
New _build_domain_index raises RuntimeError at import time on duplicate or missing entries — drift fails fast instead of surfacing later as a mid-pipeline KeyError.
Rename rewrite_supplement → quality_supplement (the field has always been quality guidance, never rewrite-specific).
privacy_supplement is str | None and defaults to quality_supplement via __post_init__, so domains without dedicated privacy guidance can omit it. Today only LEGAL has a distinct privacy supplement.

Taxonomy

24 prior Domain values → 21 new ones:

The Domain enum's serialized values have changed. Pipelines or downstream
consumers holding rows with prior values will fail validation. Migration:

Old value	New value
`SECURITY_INFOSEC`	`SECURITY_INFOSEC`
`FINANCIAL`	`FINANCIAL`
`LEGAL`	`LEGAL`
`META_TEXT`	`META_TEXT`
`ENTERTAINMENT_MEDIA`	`ENTERTAINMENT_MEDIA`
`OTHER`	`OTHER`
`BIOGRAPHY`	`BIOGRAPHY_PROFILE`
`CLINICAL_EHR_MEDICAL`	`MEDICAL_CLINICAL`
`HR_PEOPLE_OPS`	`HR_EMPLOYMENT`
`MANAGEMENT_OPERATIONS`	`BUSINESS_OPERATIONS`
`NEWS_JOURNALISM`	`NEWS_PUBLIC_AFFAIRS`
`SCIENTIFIC_ACADEMIC`	`RESEARCH_SCIENTIFIC`
`TECHNICAL_ENGINEERING_SOFTWARE`	`TECHNICAL_SOFTWARE_ENGINEERING`
`EDUCATIONAL_PEDAGOGICAL`	`EDUCATION`
`FICTION_CREATIVE`	`CREATIVE_FICTION`
`ECONOMIC`	`ECONOMIC_ANALYSIS`
`POLICY_REGULATORY_COMPLIANCE`	`POLICY_REGULATORY`
`MARKETING_ADVERTISING`, `PRODUCT_REVIEW`	`MARKETING_COMMERCIAL`
`SOCIAL_CULTURAL_OPED`, `SOCIAL_MEDIA`	`SOCIAL_COMMENTARY`
`CHAT_EMAIL_CSAT`	dropped (re-classify)
`PROCEDURAL_INSTRUCTIONAL`	dropped (re-classify)
`TRANSCRIPTS_INTERVIEWS`	dropped (re-classify)
`None`	`INSURANCE`
`None`	`GOVERNMENT_PUBLIC_RECORDS`
`CHAT_EMAIL_CSAT`	dropped
`PROCEDURAL_INSTRUCTIONAL`	dropped
`TRANSCRIPTS_INTERVIEWS`	dropped

Tests

Refreshed existing tests for the new field name and renamed identifiers.
Added two error-path tests for _build_domain_index covering duplicate entries and missing enum coverage.
make test — 647/647 pass; no new typecheck diagnostics in edited files.

Related Issues

Closes #55

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps · 2026-05-07T00:59:21Z

Greptile Summary

This PR unifies three previously parallel domain data structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into a single DomainMetadata dataclass, and adds an import-time completeness guard via _build_domain_index. The PR goes beyond pure structural refactoring: it also renames 14 Domain enum members, removes three (PROCEDURAL_INSTRUCTIONAL, SOCIAL_MEDIA, TRANSCRIPTS_INTERVIEWS), and adds two new ones (INSURANCE, GOVERNMENT_PUBLIC_RECORDS).

DomainMetadata + _build_domain_index: New frozen dataclass centralises classification description, quality supplement, and privacy supplement per domain; _build_domain_index raises RuntimeError at import time on any duplicate or missing enum coverage.
Domain enum overhaul: 14 values renamed to cleaner, more semantic names; net count drops from 24 to 21.
Tests updated and extended: enrichment tests point to new API; new tests cover _build_domain_index error paths and privacy_supplement defaulting behaviour.

Confidence Score: 4/5

Safe to merge once the domain rename's impact on existing persisted data is confirmed to be handled (migration or clean-slate deployment).

The Domain enum string values are the serialized form stored in data rows. Any existing row carrying an old value (e.g. "BIOGRAPHY") will trigger a Pydantic ValidationError in _enrich_domain/_enrich_domain_privacy after deployment. The PR description labels this "no behavior change/refactoring", but the enum renames and removals are a hard compatibility break for existing data. Whether this is safe depends on whether there is persisted data from the old schema — if the pipeline only processes freshly classified data, the impact is zero; if not, a migration layer is needed.

src/anonymizer/engine/schemas/rewrite.py — the Domain enum renames and removals are the main deployment risk.

Important Files Changed

Filename	Overview
src/anonymizer/engine/rewrite/domain_classification.py	Collapses three parallel data structures into a single `DomainMetadata` dataclass with import-time drift detection. Code is clean; one double-space typo in the `BUSINESS_OPERATIONS` classification description appears in the LLM prompt.
src/anonymizer/engine/schemas/rewrite.py	Domain enum undergoes a significant rename (14 values changed) and net reduction (3 removed, 2 added). String values are the serialized form; existing persisted data with old domain strings will fail Pydantic validation after deployment.
tests/engine/test_domain_classification.py	Tests updated to new API; new tests for `_build_domain_index` error paths and `DomainMetadata` defaulting behaviour address both previously noted gaps.
tests/engine/test_rewrite_workflow.py	Mechanical update of hardcoded `"BIOGRAPHY"` → `"BIOGRAPHY_PROFILE"` strings across test fixtures; no logic changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Domain enum value] -->|DomainClassificationSchema.model_validate| B[parsed_domain]
    B --> C{_DOMAIN_BY_ENUM lookup}
    C -->|.quality_supplement| D[COL_DOMAIN_SUPPLEMENT\nmeaning-unit extraction prompt]
    C -->|.privacy_supplement| E[COL_DOMAIN_SUPPLEMENT_PRIVACY\nsensitivity disposition prompt]
    C -->|.classification_description| F[domain classification prompt\nshown to LLM]

    subgraph Import Time
        G[DOMAIN_METADATA tuple] -->|_build_domain_index| H{Validation}
        H -->|duplicate Domain| I[RuntimeError: duplicate]
        H -->|missing Domain| J[RuntimeError: missing]
        H -->|all OK| K[_DOMAIN_BY_ENUM dict]
    end

    K --> C

_{Reviews (2): Last reviewed commit: "address feedback" | Re-trigger Greptile}

greptile-apps · 2026-05-07T00:59:24Z

+    privacy_supplement: str | None = None
+
+    def __post_init__(self) -> None:
+        if self.privacy_supplement is None:
+            object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)


The privacy_supplement field is annotated str | None but __post_init__ ensures it is always a str after construction. Because the declared type is str | None, static type checkers will infer .privacy_supplement as potentially None at every call site — including _enrich_domain_privacy — requiring callers to add unnecessary null-guards or accept type warnings. Consider separating the init-time parameter from the stored invariant by using field(default=None) on an _init_privacy_supplement InitVar and storing the resolved value in a str-typed field, or simply widening the public annotation to match what __post_init__ guarantees.

Suggested change

privacy_supplement: str | None = None

def __post_init__(self) -> None:

if self.privacy_supplement is None:

object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)

privacy_supplement: str = field(default="")

def __post_init__(self) -> None:

# Allow callers to omit privacy_supplement; default to rewrite_supplement.

if not self.privacy_supplement:

object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv · 2026-05-07T22:34:14Z

+    DomainMetadata(
+        domain=Domain.NEWS_PUBLIC_AFFAIRS,
+        classification_description="Journalism, news articles, press releases, public affairs reporting",
+        quality_supplement="Extract every distinct information-bearing unit from the article. Do NOT summarize or compress multiple claims into one unit. A meaning unit is ONE independent idea: a fact, stance, causal link, prediction, stakeholder impact, trend, or description of who is doing what at a role/group/institution level.\n\nAlways capture, if present:\n1) Policy or event: stances, approvals, bans, reversals, regulatory shifts, key decisions.\n2) Actors as roles or groups: parties, governments, business sectors, lobbies, institutions (never names).\n3) Motivations and reasoning: economic, political, ideological justifications on all sides.\n4) Polarity: who supports, who opposes, and on what grounds.\n5) Evidence and analogies used to support arguments (paraphrased, not quoted verbatim).\n6) Stakeholder impacts: effects on farmers, workers, retailers, investors, consumers, or public services.\n7) System-level consequences: investor confidence, monopoly risk, inflation, growth, governance stability.\n8) Broader trends: membership growth, reform momentum, shifts in public sentiment or party credibility.\n9) Policy lineage: how the current stance differs from previous governments or earlier policy.\n10) Temporal framing: before/after elections, reforms, court rulings, crises, or other milestones.\n\n11) Extract historical/political lineage when present (election wins, shifts from prior ruling party).\n\nSegmentation rules:\n• If a sentence contains multiple independent ideas (e.g., a reason AND a separate consequence), split them.\n• If two sentences express one tightly bound idea that cannot stand alone if split, keep them as one unit.\n• When in doubt about splitting, err on the side of creating MORE, smaller units rather than fewer, larger ones.\n\nDo NOT use personal names or specific PII. Refer to actors only by their roles or affiliations (e.g., 'the party', 'a business-sector recruit', 'small-trader lobbies', 'the previous government').\n\nThere is NO fixed number of units required. Continue extracting meaning units until no substantial claim, stance, cause, effect, or stakeholder-impact statement remains unrepresented.",


Since NEWS_PUBLIC_AFFAIRS and INSURANCE are new domains, the quality_supplement for each was generated by the agent. They look valid to me, but I’d appreciate an additional review. cc @asteier2026

memadi-nv · 2026-05-07T22:35:49Z

 </domains>

 <disambiguation_guidelines>
- If the text is an email/chat discussing a contract → "CHAT_EMAIL_CSAT", not "LEGAL".


This had to be removed since CHAT_EMAIL_CSAT was not among the new domains.

unify domain metadata

ebe10a3

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested review from asteier2026 and lipikaramaswamy April 29, 2026 23:19

memadi-nv requested a review from a team as a code owner April 29, 2026 23:19

more clear docstring

0770aec

Signed-off-by: memadi <memadi@nvidia.com>

lipikaramaswamy reviewed May 6, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/domain_classification.py

lipikaramaswamy reviewed May 6, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated

address feedback

3ba16d4

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested a review from lipikaramaswamy May 7, 2026 00:57

greptile-apps Bot reviewed May 7, 2026

View reviewed changes

asteier2026 reviewed May 7, 2026

View reviewed changes

Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated

memadi-nv added 2 commits May 7, 2026 12:55

address feedback-update parameter name

60b45d7

Signed-off-by: memadi <memadi@nvidia.com>

address feedback

a26c8bc

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps Bot reviewed May 7, 2026

View reviewed changes

Comment thread src/anonymizer/engine/schemas/rewrite.py

memadi-nv commented May 7, 2026

View reviewed changes

memadi-nv requested a review from asteier2026 May 7, 2026 22:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: unify rewrite domain metadata into a single source#143

feat: unify rewrite domain metadata into a single source#143
memadi-nv wants to merge 5 commits intomainfrom
memadi/feature/unify-rewrite-domain-metadata

memadi-nv commented Apr 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented May 7, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 7, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

memadi-nv May 7, 2026

Uh oh!

memadi-nv May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

memadi-nv commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Refactor

Taxonomy

Tests

Related Issues

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

memadi-nv May 7, 2026

Choose a reason for hiding this comment

Uh oh!

memadi-nv May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

memadi-nv commented Apr 29, 2026 •

edited

Loading

greptile-apps Bot commented May 7, 2026 •

edited

Loading

memadi-nv May 7, 2026 •

edited

Loading