3 changes: 3 additions & 0 deletions .gitignore
@@ -103,6 +103,9 @@ CLAUDE.local.md
.claude/settings.local.json
ai/tmp/

# Claude worktrees
.claude/worktrees/

# Anonymizer execution artifacts
.anonymizer-artifacts/
docs/notebook_source/data/synth_bios_sample10_anonymized.csv
120 changes: 120 additions & 0 deletions AGENTS.md
@@ -0,0 +1,120 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# AGENTS.md

This file is for agents **developing** NeMo Anonymizer — the codebase you are working in.
If you are an agent helping a user **anonymize data**, use the [product documentation](https://nvidia-nemo.github.io/Anonymizer/) instead.

**NeMo Anonymizer** detects and protects PII through context-aware entity replacement and LLM-powered rewriting. Users supply a text dataset and a strategy; Anonymizer detects entities and transforms the text.

## Module Map

`nemo-anonymizer` is a single package with three modules:

- **`anonymizer.config`** — user-facing configuration: `AnonymizerConfig`, `AnonymizerInput`, replace strategies (`Substitute`, `Redact`, `Annotate`, `Hash`), and rewrite config (`Rewrite`, `EvaluationCriteria`, `RiskTolerance`). New user-facing knobs go here.
- **`anonymizer.engine`** — internal pipeline implementation: detection, replacement, and rewrite sub-workflows, the NDD adapter, prompt utilities, and all `COL_*` column constants. Never imported directly by users.
- **`anonymizer.interface`** — user-facing entry points: the `Anonymizer` class, CLI, `AnonymizerResult`, `PreviewResult`, and canonical error types. Thin layer that wires config → engine and exposes results.

NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. `NddAdapter` is the only place this dependency crosses — engine sub-workflows declare NDD column configs and hand them to the adapter, which manages DataDesigner internally.
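To make the layering concrete, here is a minimal sketch of how the three modules meet from a caller's perspective. The class names come from the module map above; the constructor and `run()` call are assumptions for illustration, not the verified API.

```python
import pandas as pd

from anonymizer.config import AnonymizerConfig, Substitute
from anonymizer.interface import Anonymizer

input_df = pd.DataFrame({"text": ["Alice lives in Paris."]})

# Hypothetical wiring: the constructor and run() call are assumptions,
# not the verified API; check anonymizer.interface for the real entry point.
config = AnonymizerConfig(replace=Substitute())  # config layer: declare intent
anonymizer = Anonymizer(config)                  # interface layer: wires config to engine
result = anonymizer.run(input_df)                # engine executes; returns an AnonymizerResult
```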

## Core Concepts

- **Entity** — a detected span of text with a label (e.g. `"Alice"` → `first_name`) and character offsets
- **Latent entity** — an entity detected in rewrite mode that is sensitive but not directly named; used to guide rewriting without explicit replacement
- **Replacement map** — a per-record dict mapping entity text → substitute value, built by `LlmReplaceWorkflow` and injected into rewrite prompts
- **Leakage mass** — a weighted score measuring how much sensitive information survives in a rewritten record; drives the repair loop
- **Utility score** — a 0–1 score measuring how much semantic content the rewritten record preserves
- **RiskTolerance** — a preset (`minimal` / `low` / `moderate` / `high`) that bundles the leakage threshold, repair behavior, and human-review flags into a single user-facing knob
- **Repair loop** — the evaluate → repair → re-evaluate cycle in `RewriteWorkflow`; runs up to `max_repair_iterations` times on failing rows
- **FailedRecord** — a record that was dropped by an NDD workflow; surfaced explicitly rather than silently lost
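The **Entity** concept above maps naturally onto a small frozen container. A hypothetical sketch of the shape (it illustrates the concept; the real engine type may differ):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical shape for illustration, not the actual engine class."""

    text: str   # detected span, e.g. "Alice"
    label: str  # entity label, e.g. "first_name"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
```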

## Pipelines

### Replace mode — `AnonymizerConfig(replace=...)`

```
input_df
→ EntityDetectionWorkflow.run() # engine/detection/detection_workflow.py
GLiNER detection
→ parse + tag
→ LLM augmentation (add entities GLiNER missed)
→ LLM validation (keep / drop candidates)
→ merge + finalize → COL_DETECTED_ENTITIES, COL_FINAL_ENTITIES
→ ReplacementWorkflow.run() # engine/replace/replace_runner.py
Redact / Annotate / Hash → applied locally, no LLM
Substitute → LlmReplaceWorkflow → NddAdapter
→ output: {text_col}_replaced, {text_col}_with_spans, final_entities
```

### Rewrite mode — `AnonymizerConfig(rewrite=...)`

```
input_df
→ EntityDetectionWorkflow.run() # same as above, plus latent entity tagging
→ RewriteWorkflow.run() # engine/rewrite/rewrite_workflow.py
LlmReplaceWorkflow.generate_map_only() # build replacement map for prompt
→ single NDD adapter call (pipeline_columns):
DomainClassificationWorkflow → _domain, _domain_supplement
SensitivityDispositionWorkflow → _sensitivity_disposition
QAGenerationWorkflow → _quality_qa, _privacy_qa
RewriteGenerationWorkflow → _rewritten_text
→ evaluate-repair loop (up to max_repair_iterations):
EvaluateWorkflow → leakage_mass, utility_score, _needs_repair
RepairWorkflow → _rewritten_text (failing rows only)
→ FinalJudgeWorkflow (non-critical) → _judge_evaluation, needs_human_review
→ output: {text_col}_rewritten, utility_score, leakage_mass, needs_human_review, …
```

Records with no detected entities skip all LLM sub-workflows and pass through with default metrics (utility=1.0, leakage=0.0).
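From the user side, turning these knobs is a short config declaration. A hedged sketch (field names beyond `max_repair_iterations` are assumptions; check `config/rewrite.py` for the real types):

```python
from anonymizer.config import AnonymizerConfig, Rewrite

config = AnonymizerConfig(
    rewrite=Rewrite(
        risk_tolerance="low",     # assumed spelling; RiskTolerance may be an enum
        max_repair_iterations=3,  # user-facing knob (default 3)
    )
)
```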

## Config Pattern

`AnonymizerConfig.rewrite` is the user-facing `Rewrite` model. The engine never receives `Rewrite` directly — it receives `EvaluationCriteria` via the `Rewrite.evaluation` property.

`Rewrite` and `EvaluationCriteria` both hold `max_repair_iterations`. They must stay in sync:

- `Rewrite.max_repair_iterations` is the user-facing field (default 3)
- `Rewrite.evaluation` constructs `EvaluationCriteria(risk_tolerance=..., max_repair_iterations=self.max_repair_iterations)`
- **Never construct `EvaluationCriteria` with hardcoded values** — always go through `Rewrite.evaluation`
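A simplified sketch of this pattern (abbreviated from `config/rewrite.py`; imports, defaults, and other fields are omitted or assumed):

```python
class Rewrite(BaseModel):
    risk_tolerance: RiskTolerance  # assumed field; the preset the user picks
    max_repair_iterations: int = 3

    @property
    def evaluation(self) -> EvaluationCriteria:
        # Single construction point keeps both max_repair_iterations fields in sync.
        return EvaluationCriteria(
            risk_tolerance=self.risk_tolerance,
            max_repair_iterations=self.max_repair_iterations,
        )
```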

Leakage thresholds and repair parameters are derived from `RiskTolerance` via `_RiskToleranceBundle` in `config/rewrite.py`. Don't hardcode them elsewhere.

## NDD Adapter

`NddAdapter.run_workflow()` (`engine/ndd/adapter.py`) wraps a DataFrame slice + NDD column configs into a DataDesigner run and returns `WorkflowRunResult(dataframe, failed_records)`. Records missing from the output surface as `FailedRecord` objects rather than silently disappearing. Never access DataDesigner directly from engine workflows — always go through `NddAdapter`.
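A sketch of the boundary contract from a sub-workflow's point of view. The keyword names and surrounding variables are illustrative; only the return shape (`WorkflowRunResult` with `.dataframe` and `.failed_records`) is documented above.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative call: keyword names are assumptions, not the real signature.
result = adapter.run_workflow(dataframe=batch_df, columns=ndd_column_configs)

processed_df = result.dataframe  # rows DataDesigner produced
for failed in result.failed_records:
    logger.warning("Record dropped by NDD workflow: %r", failed)  # never silently lost
```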

## Prompt Conventions

All column references in NDD prompt templates go through `_jinja()` (`engine/constants.py`) — never format column names directly into strings. Dynamic prompt values use `substitute_placeholders()` (`engine/prompt_utils.py`) with `<<PLACEHOLDER>>` markers; see its docstring for the substitution contract. Prompts are inline triple-quoted strings in the workflow file that uses them; there is no separate registry.
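Putting both conventions together, a hypothetical prompt fragment might look like this. The placeholder name, dict-key format, and import paths are assumptions; the `substitute_placeholders` docstring defines the actual contract.

```python
from anonymizer.engine.constants import COL_TEXT, _jinja  # assumed import path
from anonymizer.engine.prompt_utils import substitute_placeholders

TEMPLATE = f"""Rewrite the record below.

Record text: {_jinja(COL_TEXT)}
Replacement map: <<REPLACEMENT_MAP>>
"""

prompt = substitute_placeholders(TEMPLATE, {"REPLACEMENT_MAP": "{'Alice': 'Dana'}"})
```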

## Structural Invariants

- `from __future__ import annotations` in every Python file
- Absolute imports only (enforced by ruff `TID`)
- Type annotations on all functions, methods, and class attributes
- SPDX license header on every file
- All column names defined in `engine/constants.py` — never use string literals for column names
- `COL_TEXT` is the internal name for the input text column; renamed to the user's original column name in final output
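A minimal file skeleton that satisfies all of these invariants at once (the import path is inferred from the layout above, and the function is made up for illustration):

```python
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

from __future__ import annotations

import pandas as pd

from anonymizer.engine.constants import COL_TEXT  # absolute import, no string literal


def count_text_records(df: pd.DataFrame) -> int:
    """Count records that carry input text."""
    return int(df[COL_TEXT].notna().sum())
```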

## What NOT To Do

- **Don't bypass `Rewrite.evaluation`** — don't construct `EvaluationCriteria` with hardcoded thresholds
- **Don't call DataDesigner directly** — always go through `NddAdapter.run_workflow()`
- **Don't use string literals for column names** — use `COL_*` constants from `engine/constants.py`
- **Don't add a domain to only one supplement map** — see `engine/rewrite/domain_classification.py` for the sync invariant
- **Don't hardcode `gliner_threshold`** — it belongs in `Detect` config (default 0.3)

## Development

```bash
make test # run all tests
make bootstrap # install dev dependencies
make format # ruff format + sort imports
make format-check # read-only lint check (used in CI)
make typecheck # ty type check (advisory)
make docs-serve # local MkDocs server at http://127.0.0.1:8000
```

For contributor workflow and branch naming see [CONTRIBUTING.md](CONTRIBUTING.md).
For code style and naming conventions see [STYLEGUIDE.md](STYLEGUIDE.md).
6 changes: 6 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,6 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# In ./CLAUDE.md

@AGENTS.md
11 changes: 11 additions & 0 deletions CONTRIBUTING.md
@@ -208,6 +208,17 @@ The `main` branch has the following protections:
- All `src` and `tests` files: `@NVIDIA-NeMo/anonymizer-reviewers`
- All remaining files (`pyproject.toml`, `uv.lock`, `SECURITY.md`, `LICENSE`, `.github/`, etc.): `@NVIDIA-NeMo/anonymizer-maintainers`

### Agent-Assisted Development

If you use Claude Code, Cursor, Codex, or another coding agent, follow the standard [Pull Request Process](#pull-request-process) plus these additions:

1. **For non-trivial changes, draft a plan first.** Non-trivial includes: changes spanning more than one of the `config` / `engine` / `interface` subsystems, introducing a new public API, or modifying an invariant called out in [AGENTS.md](AGENTS.md) or [STYLEGUIDE.md](STYLEGUIDE.md).
- Write a markdown file detailing the approach, trade-offs considered, affected subsystems, and delivery strategy — enough for reviewers to evaluate the design before implementation begins. (Have the agent draft it; review and refine before submitting.)
- Save it at `plans/<issue-number>/<short-name>.md` and submit it as its own PR for review.
- Once the plan is approved, implement it in a follow-up PR.

2. **Implement following [AGENTS.md](AGENTS.md) and [STYLEGUIDE.md](STYLEGUIDE.md).** Both capture pipeline structure, naming conventions, and invariants ruff and ty cannot enforce. The agent should read these before non-trivial changes.

## Issues and Discussions

### Issue Templates
174 changes: 174 additions & 0 deletions STYLEGUIDE.md
@@ -0,0 +1,174 @@
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
<!-- SPDX-License-Identifier: Apache-2.0 -->

# Style Guide

Conventions for NeMo Anonymizer that ruff and ty cannot enforce. Read before adding a new module, workflow, or config class.

NeMo Anonymizer wraps [DataDesigner](https://github.com/NVIDIA-NeMo/DataDesigner) (NDD) for LLM column generation. References to NDD below mean that library.

For architecture and pipeline identity, see [AGENTS.md](AGENTS.md).
For contribution workflow and branch naming, see [CONTRIBUTING.md](CONTRIBUTING.md).

---

## Pydantic vs Dataclasses

**Pydantic** for config, validation, and serialization. **Dataclasses** for simple typed containers in the engine.

| Need | Use |
|------|-----|
| User-facing config, validation, JSON schema | `BaseModel` |
| Internal result type, frozen value object | `@dataclass(frozen=True)` |

```python
# Config — Pydantic
class Detect(BaseModel):
gliner_threshold: float = Field(default=0.3, ge=0.0, le=1.0)

# Internal result — dataclass
@dataclass(frozen=True)
class WorkflowRunResult:
dataframe: pd.DataFrame
failed_records: list[FailedRecord]
```

Use `Field()` only when you need constraints (`ge`, `le`), descriptions, or `default_factory`. Use bare defaults for simple flags and strings.

---

## Error Handling

Wrap exceptions from NDD and other third-party calls at module boundaries into canonical types from `interface/errors.py`. Callers should never see raw NDD exceptions.

Preserve the traceback:

```python
# Good
try:
run_results = self._data_designer.create(...)
except Exception as exc:
raise AnonymizerWorkflowError(f"Workflow failed: {exc}") from exc

# Bad — swallows the traceback
except Exception as exc:
raise AnonymizerWorkflowError("Workflow failed")
```

Don't use defensive `try/except` on trusted internal calls that shouldn't fail — only catch at module boundaries. The final judge step is the intentional exception: it's explicitly non-critical and catches broadly, logging with `exc_info=True` and substituting safe defaults.
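As a sketch of that one sanctioned pattern (the helper and column names are hypothetical; see the final judge workflow for the real code):

```python
try:
    df = self._run_final_judge(df)  # hypothetical helper name
except Exception:
    logger.error("Final judge failed; substituting safe defaults", exc_info=True)
    df[COL_NEEDS_HUMAN_REVIEW] = True  # conservative fallback: flag rows for review
```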

**Error messages** must identify the actual bad value. Use `!r` to make interpolated values unambiguous:

```python
# Good
raise ValueError(f"Unsupported strategy: {strategy!r}")

# Bad
raise ValueError("Invalid strategy")
```

**No `assert` for validation** — `assert` statements are stripped when Python runs with `-O`. Use `if/raise` instead:

```python
# Good
if not isinstance(config, AnonymizerConfig):
raise TypeError(f"Expected AnonymizerConfig, got {type(config)!r}")

# Bad
assert isinstance(config, AnonymizerConfig)
```

---

## Column Names

All column names are constants in `engine/constants.py`. Never use string literals for column names.

```python
# Good
df[COL_DETECTED_ENTITIES]

# Bad
df["_detected_entities"]
```

Internal (intermediate) columns are prefixed with `_`. User-facing output columns use clean names (`final_entities`, `utility_score`). The input text column is always `COL_TEXT` internally and renamed to the user's original column name in `Anonymizer._rename_output_columns()`.
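So a slice of `engine/constants.py` plausibly looks like this (values are illustrative; read the file for the real ones):

```python
COL_TEXT = "_text"                            # internal: underscore prefix
COL_DETECTED_ENTITIES = "_detected_entities"  # internal intermediate column
COL_UTILITY_SCORE = "utility_score"           # user-facing output: clean name
```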

---

## Prompt Construction

**`_jinja(col, key=None)`** from `engine/constants.py` — use for NDD prompt template column references. Never format column names directly into prompt strings; `_jinja` keeps column references grep-able.

```python
# Good
f"The text is: {_jinja(COL_TEXT)}"

# Bad
f"The text is: {{{{ {COL_TEXT} }}}}"
```

**`substitute_placeholders(template, replacements)`** from `engine/prompt_utils.py` — use for dynamic prompt values. The `<<PLACEHOLDER>>` format avoids collisions with Jinja2 syntax. Never use f-strings or `.format()` for prompt templates with dynamic values; single-pass substitution prevents a replacement value from being interpreted as a placeholder.
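A small usage sketch, assuming bare placeholder names as dict keys and an inferred import path (the docstring defines the actual contract):

```python
from anonymizer.engine.prompt_utils import substitute_placeholders  # assumed path

template = "Domain: <<DOMAIN>>. Known replacements: <<REPLACEMENT_MAP>>."

# Single pass: even if a replacement value itself contains "<<DOMAIN>>",
# it is inserted verbatim and never re-expanded.
prompt = substitute_placeholders(
    template,
    {"DOMAIN": "medical", "REPLACEMENT_MAP": "{'Alice': 'Dana'}"},
)
```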

Prompts live as inline triple-quoted strings in the workflow file that uses them. There is no separate prompt registry.

---

## Type Annotations

Type annotations are required on all functions, methods, and class attributes, including in tests.

Use `TYPE_CHECKING` blocks for imports needed *only* in type annotations. This prevents circular imports and avoids loading heavy libraries at import time:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
import pandas as pd
```

If a module uses `pandas` at runtime — calls `pd.DataFrame`, indexes a DataFrame in a function body, etc. — import it at the top level. A `TYPE_CHECKING` import raises `NameError` if you reference it at runtime. `pandas` is import-time expensive, so keep top-level imports of it limited to modules that genuinely need it.
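A minimal demonstration of the pitfall:

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pandas as pd


def make_frame() -> pd.DataFrame:  # fine: annotations are evaluated lazily
    return pd.DataFrame()          # NameError: pd exists only for the type checker
```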

---

## Code Organization

- Public functions and methods before private (`_`-prefixed) ones within a module or class
- Define helpers at module or class level — avoid nested functions. Nested functions hide logic, make testing harder, and complicate stack traces. The only acceptable use is a closure that genuinely needs to capture local state.

---

## Naming

- Functions and variables: `snake_case`
- Classes: `PascalCase`
- Constants: `UPPER_SNAKE_CASE`
- Function names start with a verb: `run_workflow`, `build_entity_id`, not `entity_id` or `workflow`

---

## Comments

Only add a comment when the WHY is non-obvious — a hidden constraint, a subtle invariant, a workaround for a specific bug. Don't narrate what the code already says:

```python
# Good — explains a non-obvious invariant
# uuid5 is deterministic so input/output IDs match for missing-record tracking.

# Bad — narrates what the code does
# Loop through the records and append to list
for record in records:
results.append(record)
```

---

## Future Annotations

Every Python file must include `from __future__ import annotations` after the license header. This defers annotation evaluation, enables forward references, and keeps behavior consistent across the codebase.

---

## Docstrings

Google style (`Args:`, `Returns:`, `Raises:`). Public API classes and methods get docstrings; private helpers (`_`-prefixed) only when the logic is non-obvious. Don't restate the signature — explain why or what, not what the type annotation already says.
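A sketch of the expected shape, reusing the `build_entity_id` name from the naming section (the signature and body are hypothetical):

```python
import uuid


def build_entity_id(entity_text: str, record_index: int) -> str:
    """Build a stable identifier for a detected entity.

    Args:
        entity_text: The detected span, e.g. "Alice".
        record_index: Position of the owning record in the input frame.

    Returns:
        A deterministic ID, stable across reruns on the same input.

    Raises:
        ValueError: If record_index is negative.
    """
    if record_index < 0:
        raise ValueError(f"record_index must be >= 0, got {record_index!r}")
    return str(uuid.uuid5(uuid.NAMESPACE_OID, f"{record_index}:{entity_text}"))
```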