refactor: replace df.attrs with typed PipelineContext dataclass #141
refactor: replace df.attrs with typed PipelineContext dataclass #141
Conversation
Greptile SummaryThis PR eliminates the use of pandas' experimental
Confidence Score: 5/5Safe to merge — the refactor correctly threads metadata through an explicit typed field at the orchestration layer, and all test stubs have been updated to match the new API surface. The removal of df.attrs plumbing is complete and consistent across detection, replace, rewrite, and row-partitioning. The new ResolvedInput dataclass correctly preserves both the user-requested and post-collision-resolution column names throughout the pipeline. _rename_output_columns and _build_user_dataframe receive resolved_text_column explicitly, and the new integration-level regression guard verifies this end-to-end. No files require special attention. Important Files Changed
Reviews (4): Last reviewed commit: "address feedback" | Re-trigger Greptile |
Signed-off-by: memadi <memadi@nvidia.com>
|
IMO we should tighten the abstraction before merging. The name Related: now that I think about it, Re tests: |
…_column Address review feedback that the dataclass name was too generic and that ``original_text_column`` was misleading (in collision cases it actually held the renamed identifier, not the user-requested name). - Renames the class and module: ``PipelineContext`` -> ``ResolvedInput`` in ``engine/resolved_input.py``. The new name reflects its actual role: the result of input normalization. - Adds a ``requested_text_column`` field alongside ``resolved_text_column``. The reader now populates both: the former preserves the user's request; the latter is the post-collision identifier the orchestrator routes downstream rendering through. - ``read_input`` and the orchestrator's ``_run_internal`` switch to ``ResolvedInput`` and use ``resolved_text_column`` for output-column restoration. - ``tests/engine/test_pipeline_context.py`` is renamed to ``test_resolved_input.py`` and gains coverage for the requested/resolved asymmetry. ``test_io.py`` asserts both fields. The public ``AnonymizerResult.original_text_column`` field is intentionally left untouched here; that public-API rename lands in the next commit so it shows up isolated in the PR diff. Co-authored-by: Cursor <cursoragent@cursor.com>
Public-API rename in its own commit so reviewers see the breaking change isolated from the upstream vocabulary refactor. - ``AnonymizerResult.original_text_column`` -> ``AnonymizerResult.resolved_text_column`` - ``PreviewResult.original_text_column`` -> ``PreviewResult.resolved_text_column`` - ``render_record_html`` parameter renamed accordingly. - Internal helpers ``_rename_output_columns`` and ``_build_user_dataframe`` switch their kwarg names for consistency with the new vocabulary. The new name is more accurate: in collision cases the value is the post-rename identifier (e.g. ``"final_entities__input"``), not the user's literal request. ``ResolvedInput.requested_text_column`` is the field that holds the user-requested name. BREAKING CHANGE: any external code accessing ``result.original_text_column`` must switch to ``result.resolved_text_column``. Co-authored-by: Cursor <cursoragent@cursor.com>
…ests
Address review feedback that the existing
``assert "original_text_column" not in result.dataframe.attrs`` guards
enforced an implementation detail (workflows must not preserve attrs)
rather than the real invariant (the orchestrator does not depend on
attrs for text-column resolution).
- Drops the workflow-level guards from ``test_rewrite_workflow.py``
(fast-path and full-pipeline) and ``test_detection_workflow.py``.
These never tested orchestrator behavior; they merely asserted that
workflow output carried no metadata, which is incidental to the
contract.
- Replaces ``test_run_threads_original_text_column_via_context_not_df_attrs``
in the interface tests with
``test_run_ignores_workflow_output_attrs_for_text_column_resolution``.
The new test mocks workflow outputs whose ``attrs`` carry deliberately
misleading values (``{"resolved_text_column": "WRONG_COL", ...}``) and
asserts that the orchestrator's user-facing dataframe, trace
dataframe, and ``render_record_html`` output all reflect the user's
actual ``text_column``, never the bogus ``WRONG_COL`` from attrs.
- Also fixes import ordering in ``interface/anonymizer.py`` flagged by
ruff after the ``ResolvedInput`` rename.
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: memadi <memadi@nvidia.com>
Summary
Replaces the experimental
df.attrsmechanism for threading the user-facingtext-column name through the pipeline with a typed
ResolvedInputdataclassowned by the orchestrator. Workflows no longer need to preserve
attrsacross pandas operations that silently drop them.
Changes
ResolvedInputdataclass (engine/resolved_input.py): frozendataclass holding the input
dataframe, the user'srequested_text_column,and the
resolved_text_column(post-collision rename).read_inputreturnsthis directly.
engine/io/reader.py): populates both requested and resolvedtext-column fields; collision rename logic unchanged but now exposed
explicitly via the dataclass.
interface/anonymizer.py):_run_internalacceptsResolvedInputand pullsresolved_text_columnfrom it for output-columnrestoration — no more
df.attrsreads.detection_workflow.py,llm_replace_workflow.py,rewrite_workflow.py,row_partitioning.py): drop theirattrs=plumbing/
merge_and_reorderno longer takes theattrsparameter.interface/results.py):AnonymizerResultandPreviewResultgainresolved_text_column(replacesoriginal_text_column).render_record_htmlparameter renamed accordingly.Breaking Changes
AnonymizerResult.original_text_column→AnonymizerResult.resolved_text_columnPreviewResult.original_text_column→PreviewResult.resolved_text_columnrender_record_html(..., original_text_column=...)→render_record_html(..., resolved_text_column=...)Tests
tests/engine/test_resolved_input.py: covers construction, immutability,the requested-vs-resolved asymmetry, and the headline invariant that
metadata survives pandas merge/concat/groupby.
assert "..." not in df.attrsregression guards removed(they enforced an implementation detail, not the orchestrator's contract).
misleading
attrsinto mocked workflow output and assertsAnonymizer.run()display_recordstill produce correct user-facing columns.Follow-up
renamed_input_columnsfield toResolvedInputso users can see allcollision-driven renames, not just the text column (deferred per review).