diff --git a/.gitignore b/.gitignore
index 1f8fcb1..6548d6e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -103,6 +103,9 @@ CLAUDE.local.md
 .claude/settings.local.json
 ai/tmp/
 
+# Claude worktrees
+.claude/worktrees/
+
 # Anonymizer execution artifacts
 .anonymizer-artifacts/
 docs/notebook_source/data/synth_bios_sample10_anonymized.csv
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..9aa4178
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,120 @@
+
+
+
+# AGENTS.md
+
+This file is for agents **developing** NeMo Anonymizer — the codebase you are working in.
+If you are an agent helping a user **anonymize data**, use the [product documentation](https://nvidia-nemo.github.io/Anonymizer/) instead.
+
+**NeMo Anonymizer** detects and protects PII through context-aware entity replacement and LLM-powered rewriting. Users supply a text dataset and a strategy; Anonymizer detects entities and transforms the text.
+
+## Module Map
+
+`nemo-anonymizer` is a single package with three modules:
+
+- **`anonymizer.config`** — user-facing configuration: `AnonymizerConfig`, `AnonymizerInput`, replace strategies (`Substitute`, `Redact`, `Annotate`, `Hash`), and rewrite config (`Rewrite`, `EvaluationCriteria`, `RiskTolerance`). New user-facing knobs go here.
+- **`anonymizer.engine`** — internal pipeline implementation: detection, replacement, and rewrite sub-workflows, the NDD adapter, prompt utilities, and all `COL_*` column constants. Never imported directly by users.
+- **`anonymizer.interface`** — user-facing entry points: the `Anonymizer` class, CLI, `AnonymizerResult`, `PreviewResult`, and canonical error types. A thin layer that wires config → engine and exposes results.
+
+NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. `NddAdapter` is the only place this dependency crosses — engine sub-workflows declare NDD column configs and hand them to the adapter, which manages DataDesigner internally.
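The adapter-boundary pattern described above can be sketched in miniature. Everything here (`ToyAdapter`, `ThirdPartyError`, `AdapterError`) is a hypothetical stand-in, not the real `NddAdapter` API — the point is that exactly one class touches the wrapped library and translates its errors:

```python
from __future__ import annotations


class ThirdPartyError(Exception):
    """Stands in for an exception raised by the wrapped library (hypothetical)."""


class AdapterError(Exception):
    """Canonical error type the rest of the codebase sees (hypothetical)."""


class ToyAdapter:
    """Single crossing point: callers never touch the wrapped library directly."""

    def run(self, records: list[dict]) -> list[dict]:
        try:
            return self._call_library(records)
        except ThirdPartyError as exc:
            # Translate at the boundary, preserving the traceback with `from exc`.
            raise AdapterError(f"workflow failed: {exc}") from exc

    def _call_library(self, records: list[dict]) -> list[dict]:
        # Pretend third-party call: fails on an empty batch, tags records otherwise.
        if not records:
            raise ThirdPartyError("empty batch")
        return [dict(r, processed=True) for r in records]
```

Callers catch `AdapterError` only; the third-party exception survives as `__cause__` for debugging.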
+
+## Core Concepts
+
+- **Entity** — a detected span of text with a label (e.g. `"Alice"` → `first_name`) and character offsets
+- **Latent entity** — an entity detected in rewrite mode that is sensitive but not directly named; used to guide rewriting without explicit replacement
+- **Replacement map** — a per-record dict mapping entity text → substitute value, built by `LlmReplaceWorkflow` and injected into rewrite prompts
+- **Leakage mass** — a weighted score measuring how much sensitive information survives in a rewritten record; drives the repair loop
+- **Utility score** — a 0–1 score measuring how much semantic content the rewritten record preserves
+- **RiskTolerance** — a preset (`minimal` / `low` / `moderate` / `high`) that bundles the leakage threshold, repair behaviour, and human-review flags into a single user-facing knob
+- **Repair loop** — the evaluate → repair → re-evaluate cycle in `RewriteWorkflow`; runs up to `max_repair_iterations` times on failing rows
+- **FailedRecord** — a record that was dropped by an NDD workflow; surfaced explicitly rather than silently lost
+
+## Pipelines
+
+### Replace mode — `AnonymizerConfig(replace=...)`
+
+```
+input_df
+  → EntityDetectionWorkflow.run()    # engine/detection/detection_workflow.py
+      GLiNER detection
+      → parse + tag
+      → LLM augmentation (add entities GLiNER missed)
+      → LLM validation (keep / drop candidates)
+      → merge + finalize → COL_DETECTED_ENTITIES, COL_FINAL_ENTITIES
+  → ReplacementWorkflow.run()        # engine/replace/replace_runner.py
+      Redact / Annotate / Hash → applied locally, no LLM
+      Substitute → LlmReplaceWorkflow → NddAdapter
+  → output: {text_col}_replaced, {text_col}_with_spans, final_entities
+```
+
+### Rewrite mode — `AnonymizerConfig(rewrite=...)`
+
+```
+input_df
+  → EntityDetectionWorkflow.run()    # same as above, plus latent entity tagging
+  → RewriteWorkflow.run()            # engine/rewrite/rewrite_workflow.py
+      LlmReplaceWorkflow.generate_map_only()   # build replacement map for prompt
+      → single NDD adapter call (pipeline_columns):
+          DomainClassificationWorkflow   → _domain, _domain_supplement
+          SensitivityDispositionWorkflow → _sensitivity_disposition
+          QAGenerationWorkflow           → _quality_qa, _privacy_qa
+          RewriteGenerationWorkflow      → _rewritten_text
+      → evaluate-repair loop (up to max_repair_iterations):
+          EvaluateWorkflow → leakage_mass, utility_score, _needs_repair
+          RepairWorkflow   → _rewritten_text (failing rows only)
+      → FinalJudgeWorkflow (non-critical) → _judge_evaluation, needs_human_review
+  → output: {text_col}_rewritten, utility_score, leakage_mass, needs_human_review, …
+```
+
+Records with no detected entities skip all LLM sub-workflows and pass through with default metrics (utility=1.0, leakage=0.0).
+
+## Config Pattern
+
+`AnonymizerConfig.rewrite` is the user-facing `Rewrite` model. The engine never receives `Rewrite` directly — it receives `EvaluationCriteria` via the `Rewrite.evaluation` property.
+
+`Rewrite` and `EvaluationCriteria` both hold `max_repair_iterations`. They must stay in sync:
+
+- `Rewrite.max_repair_iterations` is the user-facing field (default 3)
+- `Rewrite.evaluation` constructs `EvaluationCriteria(risk_tolerance=..., max_repair_iterations=self.max_repair_iterations)`
+- **Never construct `EvaluationCriteria` with hardcoded values** — always go through `Rewrite.evaluation`
+
+Leakage thresholds and repair parameters are derived from `RiskTolerance` via `_RiskToleranceBundle` in `config/rewrite.py`. Don't hardcode them elsewhere.
+
+## NDD Adapter
+
+`NddAdapter.run_workflow()` (`engine/ndd/adapter.py`) wraps a DataFrame slice + NDD column configs into a DataDesigner run and returns `WorkflowRunResult(dataframe, failed_records)`. Records missing from the output surface as `FailedRecord` objects rather than silently disappearing. Never access DataDesigner directly from engine workflows — always go through `NddAdapter`.
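The single-construction-site contract behind `Rewrite.evaluation` can be sketched with plain dataclasses — a simplification (the real models are Pydantic and carry more fields), but the shape of the sync invariant is the same:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationCriteria:
    """What the engine receives; field names follow the section above."""

    risk_tolerance: str
    max_repair_iterations: int


@dataclass
class Rewrite:
    """User-facing model; `evaluation` is the only way to build criteria."""

    risk_tolerance: str = "moderate"
    max_repair_iterations: int = 3

    @property
    def evaluation(self) -> EvaluationCriteria:
        # One construction site keeps the duplicated field in sync: a user
        # override of max_repair_iterations always reaches the engine.
        return EvaluationCriteria(
            risk_tolerance=self.risk_tolerance,
            max_repair_iterations=self.max_repair_iterations,
        )
```

Constructing `EvaluationCriteria(...)` anywhere else would let the two copies of `max_repair_iterations` drift, which is exactly the bug the rule forbids.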
+
+## Prompt Conventions
+
+All column references in NDD prompt templates go through `_jinja()` (`engine/constants.py`) — never format column names directly into strings. Dynamic prompt values use `substitute_placeholders()` (`engine/prompt_utils.py`) with `<>` markers; see its docstring for the substitution contract. Prompts are inline triple-quoted strings in the workflow file that uses them; there is no separate registry.
+
+## Structural Invariants
+
+- `from __future__ import annotations` in every Python file
+- Absolute imports only (enforced by ruff `TID`)
+- Type annotations on all functions, methods, and class attributes
+- SPDX license header on every file
+- All column names defined in `engine/constants.py` — never use string literals for column names
+- `COL_TEXT` is the internal name for the input text column; renamed to the user's original column name in final output
+
+## What NOT To Do
+
+- **Don't bypass `Rewrite.evaluation`** — don't construct `EvaluationCriteria` with hardcoded thresholds
+- **Don't call DataDesigner directly** — always go through `NddAdapter.run_workflow()`
+- **Don't use string literals for column names** — use `COL_*` constants from `engine/constants.py`
+- **Don't add a domain to only one supplement map** — see `engine/rewrite/domain_classification.py` for the sync invariant
+- **Don't hardcode `gliner_threshold`** — it belongs in `Detect` config (default 0.3)
+
+## Development
+
+```bash
+make test          # run all tests
+make bootstrap     # install dev dependencies
+make format        # ruff format + sort imports
+make format-check  # read-only lint check (used in CI)
+make typecheck     # ty type check (advisory)
+make docs-serve    # local MkDocs server at http://127.0.0.1:8000
+```
+
+For contributor workflow and branch naming see [CONTRIBUTING.md](CONTRIBUTING.md).
+For code style and naming conventions see [STYLEGUIDE.md](STYLEGUIDE.md).
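The single-pass substitution contract referenced under Prompt Conventions can be illustrated with a toy. `substitute_once` below is hypothetical — the real contract is in the `substitute_placeholders()` docstring — but it shows why single-pass matters: a replacement value that itself looks like a `<>` marker is emitted verbatim, never expanded again:

```python
from __future__ import annotations

import re


def substitute_once(template: str, replacements: dict[str, str]) -> str:
    """Single-pass <marker> substitution (illustrative sketch).

    re.sub walks the template exactly once, so text inserted by the callback
    is never rescanned for further markers. Unknown markers are left intact.
    """
    return re.sub(
        r"<([a-z_]+)>",
        lambda m: replacements.get(m.group(1), m.group(0)),
        template,
    )
```

With `replacements = {"name": "<city>", "city": "Oslo"}`, the template `"hide <name> in <city>"` yields `"hide <city> in Oslo"` — the injected `<city>` is not re-expanded, which is the collision an f-string or `.format()` loop would not protect against.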
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..595a27e
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,6 @@
+
+
+
+# In ./CLAUDE.md
+
+@AGENTS.md
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 370bff5..244f674 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -208,6 +208,17 @@ The `main` branch has the following protections:
 - All `src` and `tests` files: `@NVIDIA-NeMo/anonymizer-reviewers`
 - All remaining files (`pyproject.toml`, `uv.lock`, `SECURITY.md`, `LICENSE`, `.github/`, etc.): `@NVIDIA-NeMo/anonymizer-maintainers`
 
+### Agent-Assisted Development
+
+If you use Claude Code, Cursor, Codex, or another coding agent, follow the standard [Pull Request Process](#pull-request-process) plus these additions:
+
+1. **For non-trivial changes, draft a plan first.** Non-trivial includes: changes spanning more than one of the `config` / `engine` / `interface` subsystems, introducing a new public API, or modifying an invariant called out in [AGENTS.md](AGENTS.md) or [STYLEGUIDE.md](STYLEGUIDE.md).
+   - Write a markdown file detailing the approach, trade-offs considered, affected subsystems, and delivery strategy — enough for reviewers to evaluate the design before implementation begins. (Have the agent draft it; review and refine before submitting.)
+   - Save it at `plans//.md` and submit it as its own PR for review.
+   - Once the plan is approved, implement it in a follow-up PR.
+
+2. **Implement following [AGENTS.md](AGENTS.md) and [STYLEGUIDE.md](STYLEGUIDE.md).** Both capture pipeline structure, naming conventions, and invariants ruff and ty cannot enforce. The agent should read these before non-trivial changes.
+
 ## Issues and Discussions
 
 ### Issue Templates
diff --git a/STYLEGUIDE.md b/STYLEGUIDE.md
new file mode 100644
index 0000000..1802b76
--- /dev/null
+++ b/STYLEGUIDE.md
@@ -0,0 +1,174 @@
+
+
+
+# Style Guide
+
+Conventions for NeMo Anonymizer that ruff and ty cannot enforce. Read before adding a new module, workflow, or config class.
+
+NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. References to NDD below mean that library.
+
+For architecture and pipeline identity, see [AGENTS.md](AGENTS.md).
+For contribution workflow and branch naming, see [CONTRIBUTING.md](CONTRIBUTING.md).
+
+---
+
+## Pydantic vs Dataclasses
+
+**Pydantic** for config, validation, and serialization. **Dataclasses** for simple typed containers in the engine.
+
+| Need | Use |
+|------|-----|
+| User-facing config, validation, JSON schema | `BaseModel` |
+| Internal result type, frozen value object | `@dataclass(frozen=True)` |
+
+```python
+# Config — Pydantic
+class Detect(BaseModel):
+    gliner_threshold: float = Field(default=0.3, ge=0.0, le=1.0)
+
+# Internal result — dataclass
+@dataclass(frozen=True)
+class WorkflowRunResult:
+    dataframe: pd.DataFrame
+    failed_records: list[FailedRecord]
+```
+
+Use `Field()` only when you need constraints (`ge`, `le`), descriptions, or `default_factory`. Use bare defaults for simple flags and strings.
+
+---
+
+## Error Handling
+
+Wrap exceptions from NDD and other third-party calls at module boundaries into canonical types from `interface/errors.py`. Callers should never see raw NDD exceptions.
+
+Preserve the traceback:
+
+```python
+# Good
+try:
+    run_results = self._data_designer.create(...)
+except Exception as exc:
+    raise AnonymizerWorkflowError(f"Workflow failed: {exc}") from exc
+
+# Bad — swallows the traceback
+except Exception as exc:
+    raise AnonymizerWorkflowError("Workflow failed")
+```
+
+Don't use defensive `try/except` on trusted internal calls that shouldn't fail — only catch at module boundaries. The final judge step is the intentional exception: it's explicitly non-critical and catches broadly, logging with `exc_info=True` and substituting safe defaults.
+
+**Error messages** must identify the actual bad value. Use `!r` to make interpolated values unambiguous:
+
+```python
+# Good
+raise ValueError(f"Unsupported strategy: {strategy!r}")
+
+# Bad
+raise ValueError("Invalid strategy")
+```
+
+**No `assert` for validation** — `assert` statements are stripped when Python runs with `-O`. Use `if/raise` instead:
+
+```python
+# Good
+if not isinstance(config, AnonymizerConfig):
+    raise TypeError(f"Expected AnonymizerConfig, got {type(config)!r}")
+
+# Bad
+assert isinstance(config, AnonymizerConfig)
+```
+
+---
+
+## Column Names
+
+All column names are constants in `engine/constants.py`. Never use string literals for column names.
+
+```python
+# Good
+df[COL_DETECTED_ENTITIES]
+
+# Bad
+df["_detected_entities"]
+```
+
+Internal (intermediate) columns are prefixed with `_`. User-facing output columns use clean names (`final_entities`, `utility_score`). The input text column is always `COL_TEXT` internally and is renamed to the user's original column name in `Anonymizer._rename_output_columns()`.
+
+---
+
+## Prompt Construction
+
+**`_jinja(col, key=None)`** from `engine/constants.py` — use for NDD prompt template column references. Never format column names directly into prompt strings; `_jinja` keeps column references grep-able.
+
+```python
+# Good
+f"The text is: {_jinja(COL_TEXT)}"
+
+# Bad
+f"The text is: {{{{ {COL_TEXT} }}}}"
+```
+
+**`substitute_placeholders(template, replacements)`** from `engine/prompt_utils.py` — use for dynamic prompt values. The `<>` format avoids collisions with Jinja2 syntax. Never use f-strings or `.format()` for prompt templates with dynamic values; single-pass substitution prevents a replacement value from being interpreted as a placeholder.
+
+Prompts live as inline triple-quoted strings in the workflow file that uses them. There is no separate prompt registry.
+
+---
+
+## Type Annotations
+
+Type annotations are required on all functions, methods, and class attributes, including in tests.
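A sketch of what this looks like in test code — `build_entity_id` is a hypothetical helper; the point is the annotations, including the `-> None` on the test function:

```python
from __future__ import annotations


def build_entity_id(label: str, start: int, end: int) -> str:
    """Hypothetical helper, used only to illustrate annotated tests."""
    return f"{label}:{start}-{end}"


def test_build_entity_id() -> None:
    # Test functions carry the same annotations as production code.
    assert build_entity_id("first_name", 0, 5) == "first_name:0-5"
```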
+
+Use `TYPE_CHECKING` blocks for imports needed *only* in type annotations. This prevents circular imports and avoids loading heavy libraries at import time:
+
+```python
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    import pandas as pd
+```
+
+If a module uses `pandas` at runtime — calls `pd.DataFrame`, indexes a DataFrame in a function body, etc. — import it at the top level. A `TYPE_CHECKING` import raises `NameError` if you reference it at runtime. `pandas` is import-time expensive, so keep top-level imports of it limited to modules that genuinely need it.
+
+---
+
+## Code Organization
+
+- Public functions and methods before private (`_`-prefixed) ones within a module or class
+- Define helpers at module or class level — avoid nested functions. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is a closure that genuinely needs to capture local state.
+
+---
+
+## Naming
+
+- Functions and variables: `snake_case`
+- Classes: `PascalCase`
+- Constants: `UPPER_SNAKE_CASE`
+- Function names start with a verb: `run_workflow`, `build_entity_id` — not `entity_id` or `workflow`
+
+---
+
+## Comments
+
+Only add a comment when the WHY is non-obvious — a hidden constraint, a subtle invariant, a workaround for a specific bug. Don't narrate what the code already says:
+
+```python
+# Good — explains a non-obvious invariant
+# uuid5 is deterministic so input/output IDs match for missing-record tracking.
+
+# Bad — narrates what the code does
+# Loop through the records and append to list
+for record in records:
+    results.append(record)
+```
+
+---
+
+## Future Annotations
+
+Every Python file must include `from __future__ import annotations` after the license header. This defers annotation evaluation, enables forward references, and keeps behavior consistent across the codebase.
+
+---
+
+## Docstrings
+
+Google style (`Args:`, `Returns:`, `Raises:`). Public API classes and methods get docstrings; private helpers (`_`-prefixed) need them only when the logic is non-obvious. Don't restate the signature — explain the why and the what, not what the type annotation already says.
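A minimal example of the expected shape — the function and its behaviour are hypothetical, chosen only to show the `Args:` / `Returns:` / `Raises:` sections:

```python
from __future__ import annotations


def leakage_ratio(leaked: int, total: int) -> float:
    """Return the fraction of detected entities that survived anonymization.

    Args:
        leaked: Number of entities still present after rewriting.
        total: Number of entities detected in the input.

    Returns:
        A value in [0, 1]; 0.0 means nothing leaked.

    Raises:
        ValueError: If ``total`` is zero or negative.
    """
    if total <= 0:
        raise ValueError(f"total must be positive, got {total!r}")
    return leaked / total
```

Note the docstring explains meaning (what the ratio represents, what 0.0 means) rather than restating the `int -> float` signature.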