Skip to content

feat: unify rewrite domain metadata into a single source#143

Open
memadi-nv wants to merge 5 commits intomainfrom
memadi/feature/unify-rewrite-domain-metadata
Open

feat: unify rewrite domain metadata into a single source#143
memadi-nv wants to merge 5 commits intomainfrom
memadi/feature/unify-rewrite-domain-metadata

Conversation

@memadi-nv
Copy link
Copy Markdown
Contributor

@memadi-nv memadi-nv commented Apr 29, 2026

Summary

Refactors rewrite-pipeline domain metadata into a single source of truth and updates the domain taxonomy.

Refactor

  • Collapse three parallel structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into one DomainMetadata dataclass + one DOMAIN_METADATA tuple keyed by Domain.
  • New _build_domain_index raises RuntimeError at import time on duplicate or missing entries — drift fails fast instead of surfacing later as a mid-pipeline KeyError.
  • Rename rewrite_supplementquality_supplement (the field has always been quality guidance, never rewrite-specific).
  • privacy_supplement is str | None and defaults to quality_supplement via __post_init__, so domains without dedicated privacy guidance can omit it. Today only LEGAL has a distinct privacy supplement.

Taxonomy

24 prior Domain values → 21 new ones:

The Domain enum's serialized values have changed. Pipelines or downstream
consumers holding rows with prior values will fail validation. Migration:

Old value New value
SECURITY_INFOSEC SECURITY_INFOSEC
FINANCIAL FINANCIAL
LEGAL LEGAL
META_TEXT META_TEXT
ENTERTAINMENT_MEDIA ENTERTAINMENT_MEDIA
OTHER OTHER
BIOGRAPHY BIOGRAPHY_PROFILE
CLINICAL_EHR_MEDICAL MEDICAL_CLINICAL
HR_PEOPLE_OPS HR_EMPLOYMENT
MANAGEMENT_OPERATIONS BUSINESS_OPERATIONS
NEWS_JOURNALISM NEWS_PUBLIC_AFFAIRS
SCIENTIFIC_ACADEMIC RESEARCH_SCIENTIFIC
TECHNICAL_ENGINEERING_SOFTWARE TECHNICAL_SOFTWARE_ENGINEERING
EDUCATIONAL_PEDAGOGICAL EDUCATION
FICTION_CREATIVE CREATIVE_FICTION
ECONOMIC ECONOMIC_ANALYSIS
POLICY_REGULATORY_COMPLIANCE POLICY_REGULATORY
MARKETING_ADVERTISING, PRODUCT_REVIEW MARKETING_COMMERCIAL
SOCIAL_CULTURAL_OPED, SOCIAL_MEDIA SOCIAL_COMMENTARY
CHAT_EMAIL_CSAT dropped (re-classify)
PROCEDURAL_INSTRUCTIONAL dropped (re-classify)
TRANSCRIPTS_INTERVIEWS dropped (re-classify)
None INSURANCE
None GOVERNMENT_PUBLIC_RECORDS
CHAT_EMAIL_CSAT dropped
PROCEDURAL_INSTRUCTIONAL dropped
TRANSCRIPTS_INTERVIEWS dropped

Tests

  • Refreshed existing tests for the new field name and renamed identifiers.
  • Added two error-path tests for _build_domain_index covering duplicate entries and missing enum coverage.
  • make test — 647/647 pass; no new typecheck diagnostics in edited files.

Related Issues

Closes #55

Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from a team as a code owner April 29, 2026 23:19
Signed-off-by: memadi <memadi@nvidia.com>
Comment thread src/anonymizer/engine/rewrite/domain_classification.py
Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from lipikaramaswamy May 7, 2026 00:57
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

This PR unifies three previously parallel domain data structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into a single DomainMetadata dataclass, and adds an import-time completeness guard via _build_domain_index. The PR goes beyond pure structural refactoring: it also renames 14 Domain enum members, removes three (PROCEDURAL_INSTRUCTIONAL, SOCIAL_MEDIA, TRANSCRIPTS_INTERVIEWS), and adds two new ones (INSURANCE, GOVERNMENT_PUBLIC_RECORDS).

  • DomainMetadata + _build_domain_index: New frozen dataclass centralises classification description, quality supplement, and privacy supplement per domain; _build_domain_index raises RuntimeError at import time on any duplicate or missing enum coverage.
  • Domain enum overhaul: 14 values renamed to cleaner, more semantic names; net count drops from 24 to 21.
  • Tests updated and extended: enrichment tests point to new API; new tests cover _build_domain_index error paths and privacy_supplement defaulting behaviour.

Confidence Score: 4/5

Safe to merge once the domain rename's impact on existing persisted data is confirmed to be handled (migration or clean-slate deployment).

The Domain enum string values are the serialized form stored in data rows. Any existing row carrying an old value (e.g. "BIOGRAPHY") will trigger a Pydantic ValidationError in _enrich_domain/_enrich_domain_privacy after deployment. The PR description labels this "no behavior change/refactoring", but the enum renames and removals are a hard compatibility break for existing data. Whether this is safe depends on whether there is persisted data from the old schema — if the pipeline only processes freshly classified data, the impact is zero; if not, a migration layer is needed.

src/anonymizer/engine/schemas/rewrite.py — the Domain enum renames and removals are the main deployment risk.

Important Files Changed

Filename Overview
src/anonymizer/engine/rewrite/domain_classification.py Collapses three parallel data structures into a single DomainMetadata dataclass with import-time drift detection. Code is clean; one double-space typo in the BUSINESS_OPERATIONS classification description appears in the LLM prompt.
src/anonymizer/engine/schemas/rewrite.py Domain enum undergoes a significant rename (14 values changed) and net reduction (3 removed, 2 added). String values are the serialized form; existing persisted data with old domain strings will fail Pydantic validation after deployment.
tests/engine/test_domain_classification.py Tests updated to new API; new tests for _build_domain_index error paths and DomainMetadata defaulting behaviour address both previously noted gaps.
tests/engine/test_rewrite_workflow.py Mechanical update of hardcoded "BIOGRAPHY""BIOGRAPHY_PROFILE" strings across test fixtures; no logic changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Domain enum value] -->|DomainClassificationSchema.model_validate| B[parsed_domain]
    B --> C{_DOMAIN_BY_ENUM lookup}
    C -->|.quality_supplement| D[COL_DOMAIN_SUPPLEMENT\nmeaning-unit extraction prompt]
    C -->|.privacy_supplement| E[COL_DOMAIN_SUPPLEMENT_PRIVACY\nsensitivity disposition prompt]
    C -->|.classification_description| F[domain classification prompt\nshown to LLM]

    subgraph Import Time
        G[DOMAIN_METADATA tuple] -->|_build_domain_index| H{Validation}
        H -->|duplicate Domain| I[RuntimeError: duplicate]
        H -->|missing Domain| J[RuntimeError: missing]
        H -->|all OK| K[_DOMAIN_BY_ENUM dict]
    end

    K --> C
Loading

Reviews (2): Last reviewed commit: "address feedback" | Re-trigger Greptile

Comment on lines +55 to +59
privacy_supplement: str | None = None

def __post_init__(self) -> None:
if self.privacy_supplement is None:
object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The privacy_supplement field is annotated str | None but __post_init__ ensures it is always a str after construction. Because the declared type is str | None, static type checkers will infer .privacy_supplement as potentially None at every call site — including _enrich_domain_privacy — requiring callers to add unnecessary null-guards or accept type warnings. Consider separating the init-time parameter from the stored invariant by using field(default=None) on an _init_privacy_supplement InitVar and storing the resolved value in a str-typed field, or simply widening the public annotation to match what __post_init__ guarantees.

Suggested change
privacy_supplement: str | None = None
def __post_init__(self) -> None:
if self.privacy_supplement is None:
object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)
privacy_supplement: str = field(default="")
def __post_init__(self) -> None:
# Allow callers to omit privacy_supplement; default to rewrite_supplement.
if not self.privacy_supplement:
object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)

Comment thread tests/engine/test_domain_classification.py
Comment thread src/anonymizer/engine/rewrite/domain_classification.py Outdated
memadi-nv added 2 commits May 7, 2026 12:55
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Comment thread src/anonymizer/engine/schemas/rewrite.py
DomainMetadata(
domain=Domain.NEWS_PUBLIC_AFFAIRS,
classification_description="Journalism, news articles, press releases, public affairs reporting",
quality_supplement="Extract every distinct information-bearing unit from the article. Do NOT summarize or compress multiple claims into one unit. A meaning unit is ONE independent idea: a fact, stance, causal link, prediction, stakeholder impact, trend, or description of who is doing what at a role/group/institution level.\n\nAlways capture, if present:\n1) Policy or event: stances, approvals, bans, reversals, regulatory shifts, key decisions.\n2) Actors as roles or groups: parties, governments, business sectors, lobbies, institutions (never names).\n3) Motivations and reasoning: economic, political, ideological justifications on all sides.\n4) Polarity: who supports, who opposes, and on what grounds.\n5) Evidence and analogies used to support arguments (paraphrased, not quoted verbatim).\n6) Stakeholder impacts: effects on farmers, workers, retailers, investors, consumers, or public services.\n7) System-level consequences: investor confidence, monopoly risk, inflation, growth, governance stability.\n8) Broader trends: membership growth, reform momentum, shifts in public sentiment or party credibility.\n9) Policy lineage: how the current stance differs from previous governments or earlier policy.\n10) Temporal framing: before/after elections, reforms, court rulings, crises, or other milestones.\n\n11) Extract historical/political lineage when present (election wins, shifts from prior ruling party).\n\nSegmentation rules:\n• If a sentence contains multiple independent ideas (e.g., a reason AND a separate consequence), split them.\n• If two sentences express one tightly bound idea that cannot stand alone if split, keep them as one unit.\n• When in doubt about splitting, err on the side of creating MORE, smaller units rather than fewer, larger ones.\n\nDo NOT use personal names or specific PII. Refer to actors only by their roles or affiliations (e.g., 'the party', 'a business-sector recruit', 'small-trader lobbies', 'the previous government').\n\nThere is NO fixed number of units required. Continue extracting meaning units until no substantial claim, stance, cause, effect, or stakeholder-impact statement remains unrepresented.",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since NEWS_PUBLIC_AFFAIRS and INSURANCE are new domains, the quality_supplement for each was generated by the agent. They look valid to me, but I’d appreciate an additional review. cc @asteier2026

</domains>

<disambiguation_guidelines>
- If the text is an email/chat discussing a contract → "CHAT_EMAIL_CSAT", not "LEGAL".
Copy link
Copy Markdown
Contributor Author

@memadi-nv memadi-nv May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This had to be removed since CHAT_EMAIL_CSAT was not among the new domains.

@memadi-nv memadi-nv requested a review from asteier2026 May 7, 2026 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor: unify rewrite domain metadata into a single source of truth

3 participants