refactor: replace df.attrs with typed PipelineContext dataclass by memadi-nv · Pull Request #141 · NVIDIA-NeMo/Anonymizer

memadi-nv · 2026-04-28T19:53:30Z

Summary

Replaces the experimental df.attrs mechanism for threading the user-facing
text-column name through the pipeline with a typed ResolvedInput dataclass
owned by the orchestrator. Workflows no longer need to preserve attrs
across pandas operations that silently drop them.

Changes

New ResolvedInput dataclass (engine/resolved_input.py): frozen
dataclass holding the input dataframe, the user's requested_text_column,
and the resolved_text_column (post-collision rename). read_input returns
this directly.
Reader (engine/io/reader.py): populates both requested and resolved
text-column fields; collision rename logic unchanged but now exposed
explicitly via the dataclass.
Orchestrator (interface/anonymizer.py): _run_internal accepts
ResolvedInput and pulls resolved_text_column from it for output-column
restoration — no more df.attrs reads.
Workflows (detection_workflow.py, llm_replace_workflow.py,
rewrite_workflow.py, row_partitioning.py): drop their attrs= plumbing
/ merge_and_reorder no longer takes the attrs parameter.
Public result types (interface/results.py): AnonymizerResult and
PreviewResult gain resolved_text_column (replaces original_text_column).
render_record_html parameter renamed accordingly.

Breaking Changes

AnonymizerResult.original_text_column → AnonymizerResult.resolved_text_column
PreviewResult.original_text_column → PreviewResult.resolved_text_column
render_record_html(..., original_text_column=...) → render_record_html(..., resolved_text_column=...)

Tests

New tests/engine/test_resolved_input.py: covers construction, immutability,
the requested-vs-resolved asymmetry, and the headline invariant that
metadata survives pandas merge/concat/groupby.
Workflow-level assert "..." not in df.attrs regression guards removed
(they enforced an implementation detail, not the orchestrator's contract).
Replaced with a behavioral orchestrator test that injects deliberately
misleading attrs into mocked workflow output and asserts Anonymizer.run()
- display_record still produce correct user-facing columns.

Follow-up

Add renamed_input_columns field to ResolvedInput so users can see all
collision-driven renames, not just the text column (deferred per review).

greptile-apps · 2026-04-28T19:55:54Z

Greptile Summary

This PR eliminates the use of pandas' experimental DataFrame.attrs for threading pipeline-level metadata (original_text_column) through workflow stages, replacing it with a typed ResolvedInput dataclass that carries both the requested_text_column (what the user asked for) and resolved_text_column (the post-collision-rename identifier). The orchestrator now owns this metadata explicitly rather than smuggling it through a fragile, drop-prone attrs slot.

ResolvedInput is introduced as a frozen dataclass in engine/resolved_input.py, combining the role of the old ResolvedInputColumns helper and the old df.attrs[\"original_text_column\"] key. read_input() returns it instead of a bare pd.DataFrame.
AnonymizerResult and PreviewResult gain a resolved_text_column field; _DisplayMixin.display_record reads from it instead of trace_dataframe.attrs. All manual attrs={**a.attrs, **b.attrs} plumbing in merge_and_reorder, LlmReplaceWorkflow, and RewriteWorkflow is removed.
Tests are migrated from result.attrs[...] assertions to result.requested_text_column / result.resolved_text_column / result.dataframe.*, and a new integration-level regression guard verifies the orchestrator never consults df.attrs for text-column resolution.

Confidence Score: 5/5

Safe to merge — the refactor correctly threads metadata through an explicit typed field at the orchestration layer, and all test stubs have been updated to match the new API surface.

The removal of df.attrs plumbing is complete and consistent across detection, replace, rewrite, and row-partitioning. The new ResolvedInput dataclass correctly preserves both the user-requested and post-collision-resolution column names throughout the pipeline. _rename_output_columns and _build_user_dataframe receive resolved_text_column explicitly, and the new integration-level regression guard verifies this end-to-end.

No files require special attention.

Important Files Changed

Filename	Overview
src/anonymizer/engine/resolved_input.py	New frozen dataclass replacing the old ResolvedInputColumns + df.attrs pattern; with_dataframe() uses dataclasses.replace correctly; docstring explicitly documents the eq/hash unsupportability with a pd.DataFrame field.
src/anonymizer/interface/anonymizer.py	_run_internal now takes context: ResolvedInput; resolved_text_column is threaded explicitly to _rename_output_columns, _build_user_dataframe, and AnonymizerResult without touching df.attrs.
src/anonymizer/interface/results.py	AnonymizerResult and PreviewResult gain resolved_text_column: str as a required field; _DisplayMixin reads from it instead of trace_dataframe.attrs.
src/anonymizer/engine/io/reader.py	read_input now returns ResolvedInput; the rename of the text column to COL_TEXT uses resolved.resolved_text_column correctly, and the result is wrapped with with_dataframe() cleanly.
src/anonymizer/engine/row_partitioning.py	attrs parameter dropped from merge_and_reorder; the combined.attrs assignment is removed, simplifying the function signature.
tests/engine/test_resolved_input.py	New test file covering construction, immutability, with_dataframe(), and the headline invariant that metadata survives merge/concat/groupby on the wrapped DataFrame.
tests/interface/test_anonymizer_interface.py	All df.attrs setups removed from stubs; new regression guard verifies orchestrator uses ResolvedInput rather than workflow-output df.attrs.

_{Reviews (4): Last reviewed commit: "address feedback" | Re-trigger Greptile}

Signed-off-by: memadi <memadi@nvidia.com>

lipikaramaswamy · 2026-05-05T22:58:49Z

IMO we should tighten the abstraction before merging. The name PipelineContext indicates something broader than what this object currently is. It is not really passed through the whole pipeline; it is the result of input normalization plus metadata the top-level orchestrator needs when turning internal columns back into user facing columns. A name like ResolvedInput, ResolvedInputFrame, InputContext, or similar would describe the current role better.

Related: now that I think about it, original_text_column is misleading. In collision cases this can become something like final_entities__input, which is not the original requested column name anymore. It is the resolved user facing/output text column. I think the object can capture the full input-resolution result, for example: dataframe, requested_text_column, resolved_text_column, and renamed_input_columns. Ok to punt some of that as follow-up, but if we're introducing the dataclass now, it's probably worth getting the core vocabulary right.

Re tests: assert "original_text_column" not in result.dataframe.attrs enforces an implementation detail that workflows must not preserve attrs. But the real invariant is that the orchestrator no longer depends on attrs. We should instead test that mocked downstream workflow outputs have no attrs and Anonymizer.run() still restores the correct user-facing columns/display behavior.

…_column Address review feedback that the dataclass name was too generic and that ``original_text_column`` was misleading (in collision cases it actually held the renamed identifier, not the user-requested name). - Renames the class and module: ``PipelineContext`` -> ``ResolvedInput`` in ``engine/resolved_input.py``. The new name reflects its actual role: the result of input normalization. - Adds a ``requested_text_column`` field alongside ``resolved_text_column``. The reader now populates both: the former preserves the user's request; the latter is the post-collision identifier the orchestrator routes downstream rendering through. - ``read_input`` and the orchestrator's ``_run_internal`` switch to ``ResolvedInput`` and use ``resolved_text_column`` for output-column restoration. - ``tests/engine/test_pipeline_context.py`` is renamed to ``test_resolved_input.py`` and gains coverage for the requested/resolved asymmetry. ``test_io.py`` asserts both fields. The public ``AnonymizerResult.original_text_column`` field is intentionally left untouched here; that public-API rename lands in the next commit so it shows up isolated in the PR diff. Co-authored-by: Cursor <cursoragent@cursor.com>

Public-API rename in its own commit so reviewers see the breaking change isolated from the upstream vocabulary refactor. - ``AnonymizerResult.original_text_column`` -> ``AnonymizerResult.resolved_text_column`` - ``PreviewResult.original_text_column`` -> ``PreviewResult.resolved_text_column`` - ``render_record_html`` parameter renamed accordingly. - Internal helpers ``_rename_output_columns`` and ``_build_user_dataframe`` switch their kwarg names for consistency with the new vocabulary. The new name is more accurate: in collision cases the value is the post-rename identifier (e.g. ``"final_entities__input"``), not the user's literal request. ``ResolvedInput.requested_text_column`` is the field that holds the user-requested name. BREAKING CHANGE: any external code accessing ``result.original_text_column`` must switch to ``result.resolved_text_column``. Co-authored-by: Cursor <cursoragent@cursor.com>

…ests Address review feedback that the existing ``assert "original_text_column" not in result.dataframe.attrs`` guards enforced an implementation detail (workflows must not preserve attrs) rather than the real invariant (the orchestrator does not depend on attrs for text-column resolution). - Drops the workflow-level guards from ``test_rewrite_workflow.py`` (fast-path and full-pipeline) and ``test_detection_workflow.py``. These never tested orchestrator behavior; they merely asserted that workflow output carried no metadata, which is incidental to the contract. - Replaces ``test_run_threads_original_text_column_via_context_not_df_attrs`` in the interface tests with ``test_run_ignores_workflow_output_attrs_for_text_column_resolution``. The new test mocks workflow outputs whose ``attrs`` carry deliberately misleading values (``{"resolved_text_column": "WRONG_COL", ...}``) and asserts that the orchestrator's user-facing dataframe, trace dataframe, and ``render_record_html`` output all reflect the user's actual ``text_column``, never the bogus ``WRONG_COL`` from attrs. - Also fixes import ordering in ``interface/anonymizer.py`` flagged by ruff after the ``ResolvedInput`` rename. Co-authored-by: Cursor <cursoragent@cursor.com>

Signed-off-by: memadi <memadi@nvidia.com>

refactor: replace df.attrs with typed PipelineContext dataclass (#4)

cac2783

memadi-nv requested a review from a team as a code owner April 28, 2026 19:53

greptile-apps Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread src/anonymizer/engine/pipeline_context.py Outdated

nit

6a6ba61

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv changed the title ~~refactor: replace df.attrs with typed PipelineContext dataclass (#4)~~ refactor: replace df.attrs with typed PipelineContext dataclass Apr 28, 2026

memadi-nv linked an issue Apr 28, 2026 that may be closed by this pull request

refactor: replace df.attrs with typed pipeline context dataclass #4

Open

address greptile comment

1f980be

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested review from asteier2026 and lipikaramaswamy April 28, 2026 21:33

memadi-nv and others added 4 commits May 6, 2026 16:24

address feedback

3c61c7c

Signed-off-by: memadi <memadi@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: replace df.attrs with typed PipelineContext dataclass #141

refactor: replace df.attrs with typed PipelineContext dataclass #141
memadi-nv wants to merge 7 commits intomainfrom
memadi/feature/refactor-df-attribute

memadi-nv commented Apr 28, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Apr 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

lipikaramaswamy commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

memadi-nv commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Breaking Changes

Tests

Follow-up

Uh oh!

greptile-apps Bot commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

lipikaramaswamy commented May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

memadi-nv commented Apr 28, 2026 •

edited

Loading

greptile-apps Bot commented Apr 28, 2026 •

edited

Loading