Updated TF attack by bzamanlooy · Pull Request #142 · VectorInstitute/midst-toolkit

bzamanlooy · 2026-05-27T16:20:59Z

PR Type

[Feature | Fix | Documentation | Other ]
Fix

Short Description

This pull request is doing three things:

Modifying the ClavaDDPM setup to allow pre-fitted label encoders to be passed during training.
Modifying the TF attack to figure out the Gaussian noise dimension from the data.
Modifying the TF attack to not refit transformations when loading tables.

Tests

Updated the TF attack tests to reflect a more realistic setup.

coderabbitai · 2026-05-27T16:28:21Z

📝 Walkthrough

This PR refines the Tartan Federer membership inference attack implementation by introducing label encoder reuse, improving dataset preprocessing, and refining noise dimension inference. The changes add optional support in encode_and_merge_features to load pre-fitted label encoders from cached pickle files, update the attack's dataset preparation to apply pre-fitted numerical transforms directly and disable normalization in the final output, and replace static noise dimension calculation with dynamic probing of the diffusion model's numerical feature count at runtime. Integration tests are updated with new hyperparameters and expected performance metrics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Updated TF attack' is vague and generic, failing to convey what the actual changes accomplish.	Use a more descriptive title such as 'Allow pretrained label encoders and fix noise dimension in TF attack' to clearly communicate the main changes.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description covers the main changes and includes the required PR Type and Short Description sections, though it lacks detail in some areas.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
127-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align categorical dtype with checkpointed LabelEncoder expectations.

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py builds categorical_features with to_numpy() (no dtype enforcement), but src/midst_toolkit/models/clavaddpm/dataset.py stringifies categorical columns with to_numpy(dtype=np.str_) when constructing the training categorical arrays used for LabelEncoder fitting/saving. This dtype mismatch can cause LabelEncoder.transform() to fail for int/bool categories treated as unseen labels at attack time.
🔧 Minimal fix
-    categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy()}
+    categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy(dtype=np.str_)}
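
To make the failure mode concrete, here is a minimal, dependency-free sketch. `TinyLabelEncoder` is a hypothetical stand-in that mimics only the fit/transform contract of a label encoder (it is not scikit-learn's `LabelEncoder`), and the column values are made up. It illustrates why an encoder fitted on stringified categories treats raw integer categories as unseen until they are converted to strings.

```python
class TinyLabelEncoder:
    """Hypothetical stand-in for a label encoder: maps each fitted class to its sorted index."""

    def fit(self, values):
        self.classes_ = sorted(set(values))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        unseen = [v for v in values if v not in self._index]
        if unseen:
            raise ValueError(f"previously unseen labels: {unseen!r}")
        return [self._index[v] for v in values]


# Training side: categories are stringified before fitting (made-up data).
training_column = [0, 1, 1, 2]
encoder = TinyLabelEncoder().fit([str(v) for v in training_column])

# Attack side without dtype enforcement: raw integers do not match the string classes.
attack_column = [2, 0, 1]
try:
    encoder.transform(attack_column)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Attack side with the suggested fix: stringify first, then transform.
encoded = encoder.transform([str(v) for v in attack_column])
print(mismatch_detected, encoded)
```

The same idea applies to the suggested `to_numpy(dtype=np.str_)` change: both sides of the pipeline agree on one representation before any encoder is consulted.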
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py` around
lines 127 - 145, The categorical features are created with
data[categorical_column_names].to_numpy() which can yield non-string dtypes and
mismatch the stringified categories used when fitting/saving LabelEncoder in
clavaddpm.dataset; change the construction of categorical_features (and the
local all_categorical_features used before encoding) to explicitly convert to
strings (e.g., to_numpy(dtype=np.str_) or .astype(str)) so
label_encoders[column_index].transform receives the same dtype it was trained
on; keep the rest of the loop (noise_scale handling and encoding steps)
unchanged and use the existing symbols categorical_features,
all_categorical_features, label_encoders,
get_categorical_and_numerical_column_names, and DataSplit.TRAIN to locate the
change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`:
- Around line 468-475: The current loop sets _relation_order to [] for
non-"tabddpm" so noise_dimension is never assigned and later causes
UnboundLocalError; add an explicit guard that validates model_type before
attempting to probe the checkpoint (e.g., check model_type == "tabddpm" and
raise a clear exception like ValueError("unsupported model_type: ...")
otherwise) so you never open first_model_path/_parent_child ckpt for unsupported
types; update the logic around _relation_order, the checkpoint probe using
CustomUnpickler, and ensure get_score()’s error path remains the single source
of truth for unsupported models by raising the clearer error early.
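
A rough sketch of the early guard described above, under stated assumptions: `SUPPORTED_MODEL_TYPES`, `resolve_relation_order`, and `checkpoint_name` are hypothetical names, and the `("parent", "child")` pair is a placeholder rather than the toolkit's real relation order. Only the `{parent}_{child}_ckpt.pkl` naming pattern is taken from the PR.

```python
SUPPORTED_MODEL_TYPES = ("tabddpm",)  # assumption: the only type probed in this PR


def resolve_relation_order(model_type: str) -> list[tuple[str, str]]:
    """Return the (parent, child) pairs to probe, failing fast for unsupported model types."""
    if model_type not in SUPPORTED_MODEL_TYPES:
        # Raising here avoids reaching code that would use an unassigned noise_dimension.
        raise ValueError(f"unsupported model_type: {model_type!r}")
    # Placeholder relation; the real names come from the toolkit's configuration.
    return [("parent", "child")]


def checkpoint_name(model_type: str) -> str:
    """Build the checkpoint file name only after the model type has been validated."""
    parent, child = resolve_relation_order(model_type)[0]
    return f"{parent}_{child}_ckpt.pkl"


print(checkpoint_name("tabddpm"))
```

With the check placed first, an unsupported type surfaces as a clear `ValueError` instead of a later `UnboundLocalError`.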

In `@src/midst_toolkit/models/clavaddpm/dataset_utils.py`:
- Around line 103-122: When a label_encoders_path is supplied you must fail fast
instead of mixing preloaded and newly-fitted encoders: after loading
preloaded_encoders (from label_encoders_path) validate that
categorical_column_names is not None and that every name in
categorical_column_names exists as a key in preloaded_encoders; if any are
missing, raise a clear error (or return/raise ValueError) rather than falling
back to fitting per-column. Update the loop that currently checks
preloaded_encoders and conditionally fits (the block using preloaded_encoders,
label_encoder, encoded_labels and the fallback LabelEncoder()) to assume
encoders are present when label_encoders_path was provided and only fit new
encoders when no path was provided; include the check up front so you never mix
cached and freshly-fit encoders.
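
One way the fail-fast check could look, as a sketch only: `select_encoders` is a hypothetical helper, and the string values stand in for fitted encoder objects. It returns `None` when nothing was preloaded (so the caller fits every column) and raises when a preloaded dictionary does not cover every categorical column.

```python
def select_encoders(categorical_column_names, preloaded_encoders):
    """Return encoders keyed by column name, refusing to mix cached and freshly fitted ones."""
    if preloaded_encoders is None:
        return None  # no path was supplied: the caller fits a new encoder per column
    if categorical_column_names is None:
        raise ValueError("categorical_column_names is required when encoders are preloaded")
    missing = [name for name in categorical_column_names if name not in preloaded_encoders]
    if missing:
        raise ValueError(f"preloaded label encoders are missing columns: {missing}")
    return {name: preloaded_encoders[name] for name in categorical_column_names}


# Placeholder strings stand in for fitted encoder objects.
cached = {"gender": "encoder_for_gender", "smoker": "encoder_for_smoker"}
selected = select_encoders(["gender", "smoker"], cached)
print(selected)
```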

In `@src/midst_toolkit/models/clavaddpm/dataset.py`:
- Around line 380-390: The Dataset.from_df constructor currently probes the CWD
for attack-specific relative files using the local _le_path loop (import os as
_os and the for _parent ... if _os.path.exists ...), which must be removed;
instead add an explicit optional parameter (e.g., encoder_path=None) to
Dataset.from_df and use that value as the label-encoder path (leave None if not
provided) rather than auto-discovering whitebox_single_table_* files, remove the
os import and the _parent loop, and update callers to pass the encoder_path from
their context so encoder discovery is deterministic and not CWD-dependent.
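
The explicit-parameter approach could be sketched as follows. `resolve_label_encoder_path` is a hypothetical helper, not the signature of `Dataset.from_df`; it only shows the behaviour being asked for, where the path comes from the caller and the working directory is never searched.

```python
from pathlib import Path
from typing import Optional


def resolve_label_encoder_path(encoder_path: Optional[Path] = None) -> Optional[Path]:
    """Use only the caller-supplied path; never probe the current working directory."""
    if encoder_path is None:
        return None  # no cached encoders: the caller fits them on the current data
    encoder_path = Path(encoder_path)
    if not encoder_path.is_file():
        raise FileNotFoundError(f"label encoder file not found: {encoder_path}")
    return encoder_path


print(resolve_label_encoder_path())
```

Because the result depends only on the argument, running the same call from a different directory gives the same outcome.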

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bb57c420-9245-46a5-8738-a40f7dda137e

📥 Commits

Reviewing files that changed from the base of the PR and between f56d16e and 8b9415a.

📒 Files selected for processing (4)

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py
src/midst_toolkit/models/clavaddpm/dataset.py
src/midst_toolkit/models/clavaddpm/dataset_utils.py
tests/integration/attacks/tartan_federer/test_tartan_federer_attack.py

emersodb

Thanks for putting these fixes into a PR! Most of my comments are pretty small. Some of the hoops you're jumping through are a result of inflexible code elsewhere, which we should fix at some other point.

emersodb · 2026-05-28T20:50:47Z

    )

+    # Load pre-fitted label encoders from pkl if provided, otherwise fit on current data
+    preloaded_encoders: dict[str, LabelEncoder] | None = None


I will note that I was very confused here. Your label encoder dictionary here is indexed by strings (column names), but the label_encoders that are to be returned are indexed by column indices. You do end up taking care of this later on. However, I would suggest you add a comment here to explain what's happening, because the label encoder you're preloading is definitely not of the same "kind" as the ones we are constructing here (that is, it must be constructed somewhere else; we're not reusing an artifact formed by this process).

Yeah fair enough I will do that.
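
For readers following the thread, the two "kinds" of dictionary can be reconciled roughly like this. This is an illustrative sketch, not the code in the PR: `reindex_encoders_by_position` is a hypothetical helper, and the string values stand in for fitted encoder objects.

```python
def reindex_encoders_by_position(encoders_by_name, categorical_column_names):
    """Convert {column name: encoder} into {column index: encoder}, following column order."""
    return {
        index: encoders_by_name[name]
        for index, name in enumerate(categorical_column_names)
    }


# Preloaded artifact: keyed by column name (placeholder values instead of real encoders).
preloaded_encoders = {"smoker": "encoder_for_smoker", "gender": "encoder_for_gender"}

# Returned structure: keyed by the position of each categorical column.
label_encoders = reindex_encoders_by_position(preloaded_encoders, ["gender", "smoker"])
print(label_encoders)
```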

emersodb

Changes look good. Just two very small comments.

emersodb · 2026-06-03T19:40:44Z


 # TODO: Unify this with the Dataset.from_df function.
-# TODO: Noise scale is always called with a value of 0 for the attack.
+# TODO: Noise scale is always called with a value of 0 for the attack. So we should remove it from the f


I think the f here is a typo?

emersodb · 2026-06-03T19:42:30Z

+    _parent, _child = _relation_order[0]
+    _ckpt_path = first_model_path / f"{_parent}_{_child}_ckpt.pkl"
+    with open(_ckpt_path, "rb") as _f:
+        _probe_model = CustomUnpickler(_f).load()


Any reason we're using the _ prefixes here? I'd say drop them unless they are serving a purpose that I'm missing 🙂
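
A sketch of the same probe with plain local names, under loud assumptions: the standard `pickle.load` is swapped in for the project's `CustomUnpickler`, the checkpoint is assumed to be a dictionary with a made-up `num_numerical_features` key, and `probe_numerical_feature_count` is a hypothetical function name. The real checkpoint layout may differ.

```python
import pickle
import tempfile
from pathlib import Path


def probe_numerical_feature_count(first_model_path: Path, parent: str, child: str) -> int:
    """Load one checkpoint and read the numerical feature count from it."""
    checkpoint_path = first_model_path / f"{parent}_{child}_ckpt.pkl"
    with open(checkpoint_path, "rb") as checkpoint_file:
        # Assumption: plain pickle instead of the project's CustomUnpickler.
        probe_model = pickle.load(checkpoint_file)
    # Assumption: the checkpoint is a dict exposing this (made-up) key.
    return probe_model["num_numerical_features"]


# Write a toy checkpoint to a temporary directory, then probe it.
with tempfile.TemporaryDirectory() as directory:
    model_directory = Path(directory)
    with open(model_directory / "parent_child_ckpt.pkl", "wb") as checkpoint_file:
        pickle.dump({"num_numerical_features": 8}, checkpoint_file)
    noise_dimension = probe_numerical_feature_count(model_directory, "parent", "child")

print(noise_dimension)
```

Unpickling executes code embedded in the file, so a probe like this should only be pointed at checkpoints from a trusted source.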

bzamanlooy added 3 commits May 26, 2026 18:06

diabetes adapeted TF attack

dda5dc8

Adapt Tartan Federer attack for diabetes

a5b00ed

Updated test

8b9415a

bzamanlooy requested review from emersodb, fatemetkl and lotif and removed request for lotif May 27, 2026 16:21

coderabbitai Bot reviewed May 27, 2026

Comment thread src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py Outdated

Comment thread src/midst_toolkit/models/clavaddpm/dataset_utils.py

Comment thread src/midst_toolkit/models/clavaddpm/dataset.py Outdated

bzamanlooy added 6 commits May 27, 2026 12:30

cleaning up and ruff check comment

616c96d

ruff

0b698b1

changed atack numbers with a cpu run to make more stable

200eb88

addressed coderabbit comments

c01f3a9

fix mypy issues

a66d5e9

fix mypy error

fee5134

bzamanlooy commented May 28, 2026

Comment thread src/midst_toolkit/models/clavaddpm/dataset_utils.py

emersodb reviewed May 28, 2026

bzamanlooy added 2 commits June 3, 2026 12:30

addressed David's comments

53f0535

minor update

3ac0620

emersodb approved these changes Jun 3, 2026

minor comments

9885fd7

bzamanlooy merged commit 638558e into main Jun 3, 2026
6 of 8 checks passed

bzamanlooy deleted the diabetes-tf-attack branch June 3, 2026 21:11
