feat: unify rewrite domain metadata into a single source #143
Signed-off-by: memadi <memadi@nvidia.com>
Greptile Summary
This PR unifies three previously parallel domain data structures (
Confidence Score: 4/5
Safe to merge once the domain rename's impact on existing persisted data is confirmed to be handled (migration or clean-slate deployment).
The Domain enum string values are the serialized form stored in data rows. Any existing row carrying an old value (e.g.
src/anonymizer/engine/schemas/rewrite.py: the Domain enum renames and removals are the main deployment risk.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Domain enum value] -->|DomainClassificationSchema.model_validate| B[parsed_domain]
B --> C{_DOMAIN_BY_ENUM lookup}
C -->|.quality_supplement| D[COL_DOMAIN_SUPPLEMENT\nmeaning-unit extraction prompt]
C -->|.privacy_supplement| E[COL_DOMAIN_SUPPLEMENT_PRIVACY\nsensitivity disposition prompt]
C -->|.classification_description| F[domain classification prompt\nshown to LLM]
subgraph Import Time
G[DOMAIN_METADATA tuple] -->|_build_domain_index| H{Validation}
H -->|duplicate Domain| I[RuntimeError: duplicate]
H -->|missing Domain| J[RuntimeError: missing]
H -->|all OK| K[_DOMAIN_BY_ENUM dict]
end
K --> C
Reviews (2): Last reviewed commit: "address feedback"
    privacy_supplement: str | None = None

    def __post_init__(self) -> None:
        if self.privacy_supplement is None:
            object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)
The privacy_supplement field is annotated str | None, but __post_init__ ensures it is always a str after construction. Because the declared type is str | None, static type checkers will infer .privacy_supplement as potentially None at every call site, including _enrich_domain_privacy, requiring callers to add unnecessary null-guards or accept type warnings. Consider separating the init-time parameter from the stored invariant by using field(default=None) on an _init_privacy_supplement InitVar and storing the resolved value in a str-typed field, or simply narrowing the public annotation to match what __post_init__ guarantees.
Suggested change:
-    privacy_supplement: str | None = None
-    def __post_init__(self) -> None:
-        if self.privacy_supplement is None:
-            object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)
+    privacy_supplement: str = field(default="")
+    def __post_init__(self) -> None:
+        # Allow callers to omit privacy_supplement; default to rewrite_supplement.
+        if not self.privacy_supplement:
+            object.__setattr__(self, "privacy_supplement", self.rewrite_supplement)
DomainMetadata(
    domain=Domain.NEWS_PUBLIC_AFFAIRS,
    classification_description="Journalism, news articles, press releases, public affairs reporting",
    quality_supplement="Extract every distinct information-bearing unit from the article. Do NOT summarize or compress multiple claims into one unit. A meaning unit is ONE independent idea: a fact, stance, causal link, prediction, stakeholder impact, trend, or description of who is doing what at a role/group/institution level.\n\nAlways capture, if present:\n1) Policy or event: stances, approvals, bans, reversals, regulatory shifts, key decisions.\n2) Actors as roles or groups: parties, governments, business sectors, lobbies, institutions (never names).\n3) Motivations and reasoning: economic, political, ideological justifications on all sides.\n4) Polarity: who supports, who opposes, and on what grounds.\n5) Evidence and analogies used to support arguments (paraphrased, not quoted verbatim).\n6) Stakeholder impacts: effects on farmers, workers, retailers, investors, consumers, or public services.\n7) System-level consequences: investor confidence, monopoly risk, inflation, growth, governance stability.\n8) Broader trends: membership growth, reform momentum, shifts in public sentiment or party credibility.\n9) Policy lineage: how the current stance differs from previous governments or earlier policy.\n10) Temporal framing: before/after elections, reforms, court rulings, crises, or other milestones.\n\n11) Extract historical/political lineage when present (election wins, shifts from prior ruling party).\n\nSegmentation rules:\n• If a sentence contains multiple independent ideas (e.g., a reason AND a separate consequence), split them.\n• If two sentences express one tightly bound idea that cannot stand alone if split, keep them as one unit.\n• When in doubt about splitting, err on the side of creating MORE, smaller units rather than fewer, larger ones.\n\nDo NOT use personal names or specific PII. Refer to actors only by their roles or affiliations (e.g., 'the party', 'a business-sector recruit', 'small-trader lobbies', 'the previous government').\n\nThere is NO fixed number of units required. Continue extracting meaning units until no substantial claim, stance, cause, effect, or stakeholder-impact statement remains unrepresented.",
Since NEWS_PUBLIC_AFFAIRS and INSURANCE are new domains, the quality_supplement for each was generated by the agent. They look valid to me, but I’d appreciate an additional review. cc @asteier2026
</domains>

<disambiguation_guidelines>
- If the text is an email/chat discussing a contract → "CHAT_EMAIL_CSAT", not "LEGAL".
This had to be removed since CHAT_EMAIL_CSAT was not among the new domains.
Summary
Refactors rewrite-pipeline domain metadata into a single source of truth and updates the domain taxonomy.
Refactor
- Unify three parallel structures (_DOMAIN_LIST, DOMAIN_SUPPLEMENT_MAP, DOMAIN_SUPPLEMENT_PRIVACY_MAP) into one DomainMetadata dataclass + one DOMAIN_METADATA tuple keyed by Domain.
- _build_domain_index raises RuntimeError at import time on duplicate or missing entries, so drift fails fast instead of surfacing later as a mid-pipeline KeyError.
- rewrite_supplement → quality_supplement (the field has always been quality guidance, never rewrite-specific).
- privacy_supplement is str | None and defaults to quality_supplement via __post_init__, so domains without dedicated privacy guidance can omit it. Today only LEGAL has a distinct privacy supplement.

Taxonomy
24 prior Domain values → 21 new ones.

The Domain enum's serialized values have changed. Pipelines or downstream consumers holding rows with prior values will fail validation. Migration:

| Prior value | New value |
| --- | --- |
| SECURITY_INFOSEC | SECURITY_INFOSEC |
| FINANCIAL | FINANCIAL |
| LEGAL | LEGAL |
| META_TEXT | META_TEXT |
| ENTERTAINMENT_MEDIA | ENTERTAINMENT_MEDIA |
| OTHER | OTHER |
| BIOGRAPHY | BIOGRAPHY_PROFILE |
| CLINICAL_EHR_MEDICAL | MEDICAL_CLINICAL |
| HR_PEOPLE_OPS | HR_EMPLOYMENT |
| MANAGEMENT_OPERATIONS | BUSINESS_OPERATIONS |
| NEWS_JOURNALISM | NEWS_PUBLIC_AFFAIRS |
| SCIENTIFIC_ACADEMIC | RESEARCH_SCIENTIFIC |
| TECHNICAL_ENGINEERING_SOFTWARE | TECHNICAL_SOFTWARE_ENGINEERING |
| EDUCATIONAL_PEDAGOGICAL | EDUCATION |
| FICTION_CREATIVE | CREATIVE_FICTION |
| ECONOMIC | ECONOMIC_ANALYSIS |
| POLICY_REGULATORY_COMPLIANCE | POLICY_REGULATORY |
| MARKETING_ADVERTISING, PRODUCT_REVIEW | MARKETING_COMMERCIAL |
| SOCIAL_CULTURAL_OPED, SOCIAL_MEDIA | SOCIAL_COMMENTARY |
| CHAT_EMAIL_CSAT, PROCEDURAL_INSTRUCTIONAL, TRANSCRIPTS_INTERVIEWS | (removed) |
| None | INSURANCE |
| None | GOVERNMENT_PUBLIC_RECORDS |

Tests
- Tests for _build_domain_index covering duplicate entries and missing enum coverage.
- make test: 647/647 pass; no new typecheck diagnostics in edited files.

Related Issues
Closes #55
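For rows persisted under the prior taxonomy, the migration table in the Taxonomy section could be applied with a small remapping helper. This is a sketch only: the PR does not ship such a helper, the names `RENAMED`, `REMOVED`, and `migrate_domain` are hypothetical, and returning `None` for removed domains (to flag the row for reclassification) is an assumption rather than stated policy.

```python
from __future__ import annotations

# Prior serialized value -> new serialized value, transcribed from the migration table.
RENAMED: dict[str, str] = {
    "BIOGRAPHY": "BIOGRAPHY_PROFILE",
    "CLINICAL_EHR_MEDICAL": "MEDICAL_CLINICAL",
    "HR_PEOPLE_OPS": "HR_EMPLOYMENT",
    "MANAGEMENT_OPERATIONS": "BUSINESS_OPERATIONS",
    "NEWS_JOURNALISM": "NEWS_PUBLIC_AFFAIRS",
    "SCIENTIFIC_ACADEMIC": "RESEARCH_SCIENTIFIC",
    "TECHNICAL_ENGINEERING_SOFTWARE": "TECHNICAL_SOFTWARE_ENGINEERING",
    "EDUCATIONAL_PEDAGOGICAL": "EDUCATION",
    "FICTION_CREATIVE": "CREATIVE_FICTION",
    "ECONOMIC": "ECONOMIC_ANALYSIS",
    "POLICY_REGULATORY_COMPLIANCE": "POLICY_REGULATORY",
    # Two prior values merge into one new value.
    "MARKETING_ADVERTISING": "MARKETING_COMMERCIAL",
    "PRODUCT_REVIEW": "MARKETING_COMMERCIAL",
    "SOCIAL_CULTURAL_OPED": "SOCIAL_COMMENTARY",
    "SOCIAL_MEDIA": "SOCIAL_COMMENTARY",
}

# Prior values with no counterpart in the new taxonomy.
REMOVED = frozenset(
    {"CHAT_EMAIL_CSAT", "PROCEDURAL_INSTRUCTIONAL", "TRANSCRIPTS_INTERVIEWS"}
)


def migrate_domain(old_value: str) -> str | None:
    """Map a prior Domain value to its new value.

    Returns None for removed domains (assumption: such rows need reclassifying).
    Values absent from both tables pass through unchanged, which covers the six
    unchanged domains and rows that were already migrated.
    """
    if old_value in REMOVED:
        return None
    return RENAMED.get(old_value, old_value)
```

A caller would run `migrate_domain` over the stored domain column before validation and route any `None` results back through classification.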