Skip to content

refactor: replace df.attrs with typed PipelineContext dataclass #141

Open
memadi-nv wants to merge 7 commits intomainfrom
memadi/feature/refactor-df-attribute
Open

refactor: replace df.attrs with typed PipelineContext dataclass #141
memadi-nv wants to merge 7 commits intomainfrom
memadi/feature/refactor-df-attribute

Conversation

@memadi-nv
Copy link
Copy Markdown
Contributor

@memadi-nv memadi-nv commented Apr 28, 2026

Summary

Replaces the experimental df.attrs mechanism for threading the user-facing
text-column name through the pipeline with a typed ResolvedInput dataclass
owned by the orchestrator. Workflows no longer need to preserve attrs
across pandas operations that silently drop them.

Changes

  • New ResolvedInput dataclass (engine/resolved_input.py): frozen
    dataclass holding the input dataframe, the user's requested_text_column,
    and the resolved_text_column (post-collision rename). read_input returns
    this directly.
  • Reader (engine/io/reader.py): populates both requested and resolved
    text-column fields; collision rename logic unchanged but now exposed
    explicitly via the dataclass.
  • Orchestrator (interface/anonymizer.py): _run_internal accepts
    ResolvedInput and pulls resolved_text_column from it for output-column
    restoration — no more df.attrs reads.
  • Workflows (detection_workflow.py, llm_replace_workflow.py,
    rewrite_workflow.py, row_partitioning.py): drop their attrs= plumbing
    / merge_and_reorder no longer takes the attrs parameter.
  • Public result types (interface/results.py): AnonymizerResult and
    PreviewResult gain resolved_text_column (replaces original_text_column).
    render_record_html parameter renamed accordingly.

Breaking Changes

  • AnonymizerResult.original_text_columnAnonymizerResult.resolved_text_column
  • PreviewResult.original_text_columnPreviewResult.resolved_text_column
  • render_record_html(..., original_text_column=...)render_record_html(..., resolved_text_column=...)

Tests

  • New tests/engine/test_resolved_input.py: covers construction, immutability,
    the requested-vs-resolved asymmetry, and the headline invariant that
    metadata survives pandas merge/concat/groupby.
  • Workflow-level assert "..." not in df.attrs regression guards removed
    (they enforced an implementation detail, not the orchestrator's contract).
  • Replaced with a behavioral orchestrator test that injects deliberately
    misleading attrs into mocked workflow output and asserts Anonymizer.run()
    • display_record still produce correct user-facing columns.

Follow-up

  • Add renamed_input_columns field to ResolvedInput so users can see all
    collision-driven renames, not just the text column (deferred per review).

@memadi-nv memadi-nv requested a review from a team as a code owner April 28, 2026 19:53
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR eliminates the use of pandas' experimental DataFrame.attrs for threading pipeline-level metadata (original_text_column) through workflow stages, replacing it with a typed ResolvedInput dataclass that carries both the requested_text_column (what the user asked for) and resolved_text_column (the post-collision-rename identifier). The orchestrator now owns this metadata explicitly rather than smuggling it through a fragile, drop-prone attrs slot.

  • ResolvedInput is introduced as a frozen dataclass in engine/resolved_input.py, combining the role of the old ResolvedInputColumns helper and the old df.attrs[\"original_text_column\"] key. read_input() returns it instead of a bare pd.DataFrame.
  • AnonymizerResult and PreviewResult gain a resolved_text_column field; _DisplayMixin.display_record reads from it instead of trace_dataframe.attrs. All manual attrs={**a.attrs, **b.attrs} plumbing in merge_and_reorder, LlmReplaceWorkflow, and RewriteWorkflow is removed.
  • Tests are migrated from result.attrs[...] assertions to result.requested_text_column / result.resolved_text_column / result.dataframe.*, and a new integration-level regression guard verifies the orchestrator never consults df.attrs for text-column resolution.

Confidence Score: 5/5

Safe to merge — the refactor correctly threads metadata through an explicit typed field at the orchestration layer, and all test stubs have been updated to match the new API surface.

The removal of df.attrs plumbing is complete and consistent across detection, replace, rewrite, and row-partitioning. The new ResolvedInput dataclass correctly preserves both the user-requested and post-collision-resolution column names throughout the pipeline. _rename_output_columns and _build_user_dataframe receive resolved_text_column explicitly, and the new integration-level regression guard verifies this end-to-end.

No files require special attention.

Important Files Changed

Filename Overview
src/anonymizer/engine/resolved_input.py New frozen dataclass replacing the old ResolvedInputColumns + df.attrs pattern; with_dataframe() uses dataclasses.replace correctly; docstring explicitly documents the eq/hash unsupportability with a pd.DataFrame field.
src/anonymizer/interface/anonymizer.py _run_internal now takes context: ResolvedInput; resolved_text_column is threaded explicitly to _rename_output_columns, _build_user_dataframe, and AnonymizerResult without touching df.attrs.
src/anonymizer/interface/results.py AnonymizerResult and PreviewResult gain resolved_text_column: str as a required field; _DisplayMixin reads from it instead of trace_dataframe.attrs.
src/anonymizer/engine/io/reader.py read_input now returns ResolvedInput; the rename of the text column to COL_TEXT uses resolved.resolved_text_column correctly, and the result is wrapped with with_dataframe() cleanly.
src/anonymizer/engine/row_partitioning.py attrs parameter dropped from merge_and_reorder; the combined.attrs assignment is removed, simplifying the function signature.
tests/engine/test_resolved_input.py New test file covering construction, immutability, with_dataframe(), and the headline invariant that metadata survives merge/concat/groupby on the wrapped DataFrame.
tests/interface/test_anonymizer_interface.py All df.attrs setups removed from stubs; new regression guard verifies orchestrator uses ResolvedInput rather than workflow-output df.attrs.

Reviews (4): Last reviewed commit: "address feedback" | Re-trigger Greptile

Comment thread src/anonymizer/engine/pipeline_context.py Outdated
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv changed the title refactor: replace df.attrs with typed PipelineContext dataclass (#4) refactor: replace df.attrs with typed PipelineContext dataclass Apr 28, 2026
@memadi-nv memadi-nv linked an issue Apr 28, 2026 that may be closed by this pull request
Signed-off-by: memadi <memadi@nvidia.com>
@lipikaramaswamy
Copy link
Copy Markdown
Collaborator

IMO we should tighten the abstraction before merging. The name PipelineContext indicates something broader than what this object currently is. It is not really passed through the whole pipeline; it is the result of input normalization plus metadata the top-level orchestrator needs when turning internal columns back into user facing columns. A name like ResolvedInput, ResolvedInputFrame, InputContext, or similar would describe the current role better.

Related: now that I think about it, original_text_column is misleading. In collision cases this can become something like final_entities__input, which is not the original requested column name anymore. It is the resolved user facing/output text column. I think the object can capture the full input-resolution result, for example: dataframe, requested_text_column, resolved_text_column, and renamed_input_columns. Ok to punt some of that as follow-up, but if we're introducing the dataclass now, it's probably worth getting the core vocabulary right.

Re tests: assert "original_text_column" not in result.dataframe.attrs enforces an implementation detail that workflows must not preserve attrs. But the real invariant is that the orchestrator no longer depends on attrs. We should instead test that mocked downstream workflow outputs have no attrs and Anonymizer.run() still restores the correct user-facing columns/display behavior.

memadi-nv and others added 4 commits May 6, 2026 16:24
…_column

Address review feedback that the dataclass name was too generic and that
``original_text_column`` was misleading (in collision cases it actually
held the renamed identifier, not the user-requested name).

- Renames the class and module: ``PipelineContext`` ->
  ``ResolvedInput`` in ``engine/resolved_input.py``. The new name
  reflects its actual role: the result of input normalization.
- Adds a ``requested_text_column`` field alongside
  ``resolved_text_column``. The reader now populates both: the former
  preserves the user's request; the latter is the post-collision
  identifier the orchestrator routes downstream rendering through.
- ``read_input`` and the orchestrator's ``_run_internal`` switch to
  ``ResolvedInput`` and use ``resolved_text_column`` for output-column
  restoration.
- ``tests/engine/test_pipeline_context.py`` is renamed to
  ``test_resolved_input.py`` and gains coverage for the
  requested/resolved asymmetry. ``test_io.py`` asserts both fields.

The public ``AnonymizerResult.original_text_column`` field is intentionally
left untouched here; that public-API rename lands in the next commit so it
shows up isolated in the PR diff.

Co-authored-by: Cursor <cursoragent@cursor.com>
Public-API rename in its own commit so reviewers see the breaking change
isolated from the upstream vocabulary refactor.

- ``AnonymizerResult.original_text_column`` ->
  ``AnonymizerResult.resolved_text_column``
- ``PreviewResult.original_text_column`` ->
  ``PreviewResult.resolved_text_column``
- ``render_record_html`` parameter renamed accordingly.
- Internal helpers ``_rename_output_columns`` and
  ``_build_user_dataframe`` switch their kwarg names for consistency
  with the new vocabulary.

The new name is more accurate: in collision cases the value is the
post-rename identifier (e.g. ``"final_entities__input"``), not the
user's literal request. ``ResolvedInput.requested_text_column`` is the
field that holds the user-requested name.

BREAKING CHANGE: any external code accessing
``result.original_text_column`` must switch to
``result.resolved_text_column``.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ests

Address review feedback that the existing
``assert "original_text_column" not in result.dataframe.attrs`` guards
enforced an implementation detail (workflows must not preserve attrs)
rather than the real invariant (the orchestrator does not depend on
attrs for text-column resolution).

- Drops the workflow-level guards from ``test_rewrite_workflow.py``
  (fast-path and full-pipeline) and ``test_detection_workflow.py``.
  These never tested orchestrator behavior; they merely asserted that
  workflow output carried no metadata, which is incidental to the
  contract.
- Replaces ``test_run_threads_original_text_column_via_context_not_df_attrs``
  in the interface tests with
  ``test_run_ignores_workflow_output_attrs_for_text_column_resolution``.
  The new test mocks workflow outputs whose ``attrs`` carry deliberately
  misleading values (``{"resolved_text_column": "WRONG_COL", ...}``) and
  asserts that the orchestrator's user-facing dataframe, trace
  dataframe, and ``render_record_html`` output all reflect the user's
  actual ``text_column``, never the bogus ``WRONG_COL`` from attrs.
- Also fixes import ordering in ``interface/anonymizer.py`` flagged by
  ruff after the ``ResolvedInput`` rename.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: memadi <memadi@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

refactor: replace df.attrs with typed pipeline context dataclass

2 participants