Updated TF attack#142
Conversation
📝 WalkthroughWalkthroughThis PR refines the Tartan Federer membership inference attack implementation by introducing label encoder reuse, improving dataset preprocessing, and refining noise dimension inference. The changes add optional support in Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
127-145:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlign categorical dtype with checkpointed
LabelEncoderexpectations.
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.pybuildscategorical_featureswithto_numpy()(no dtype enforcement), butsrc/midst_toolkit/models/clavaddpm/dataset.pystringifies categorical columns withto_numpy(dtype=np.str_)when constructing the training categorical arrays used forLabelEncoderfitting/saving. This dtype mismatch can causeLabelEncoder.transform()to fail for int/bool categories treated as unseen labels at attack time.🔧 Minimal fix
- categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy()} + categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy(dtype=np.str_)}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py` around lines 127 - 145, The categorical features are created with data[categorical_column_names].to_numpy() which can yield non-string dtypes and mismatch the stringified categories used when fitting/saving LabelEncoder in clavaddpm.dataset; change the construction of categorical_features (and the local all_categorical_features used before encoding) to explicitly convert to strings (e.g., to_numpy(dtype=np.str_) or .astype(str)) so label_encoders[column_index].transform receives the same dtype it was trained on; keep the rest of the loop (noise_scale handling and encoding steps) unchanged and use the existing symbols categorical_features, all_categorical_features, label_encoders, get_categorical_and_numerical_column_names, and DataSplit.TRAIN to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`:
- Around line 468-475: The current loop sets _relation_order to [] for
non-"tabddpm" so noise_dimension is never assigned and later causes
UnboundLocalError; add an explicit guard that validates model_type before
attempting to probe the checkpoint (e.g., check model_type == "tabddpm" and
raise a clear exception like ValueError("unsupported model_type: ...")
otherwise) so you never open first_model_path/_parent_child ckpt for unsupported
types; update the logic around _relation_order, the checkpoint probe using
CustomUnpickler, and ensure get_score()’s error path remains the single source
of truth for unsupported models by raising the clearer error early.
In `@src/midst_toolkit/models/clavaddpm/dataset_utils.py`:
- Around line 103-122: When a label_encoders_path is supplied you must fail fast
instead of mixing preloaded and newly-fitted encoders: after loading
preloaded_encoders (from label_encoders_path) validate that
categorical_column_names is not None and that every name in
categorical_column_names exists as a key in preloaded_encoders; if any are
missing, raise a clear error (or return/raise ValueError) rather than falling
back to fitting per-column. Update the loop that currently checks
preloaded_encoders and conditionally fits (the block using preloaded_encoders,
label_encoder, encoded_labels and the fallback LabelEncoder()) to assume
encoders are present when label_encoders_path was provided and only fit new
encoders when no path was provided; include the check up front so you never mix
cached and freshly-fit encoders.
In `@src/midst_toolkit/models/clavaddpm/dataset.py`:
- Around line 380-390: The Dataset.from_df constructor currently probes the CWD
for attack-specific relative files using the local _le_path loop (import os as
_os and the for _parent ... if _os.path.exists ...), which must be removed;
instead add an explicit optional parameter (e.g., encoder_path=None) to
Dataset.from_df and use that value as the label-encoder path (leave None if not
provided) rather than auto-discovering whitebox_single_table_* files, remove the
os import and the _parent loop, and update callers to pass the encoder_path from
their context so encoder discovery is deterministic and not CWD-dependent.
---
Outside diff comments:
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`:
- Around line 127-145: The categorical features are created with
data[categorical_column_names].to_numpy() which can yield non-string dtypes and
mismatch the stringified categories used when fitting/saving LabelEncoder in
clavaddpm.dataset; change the construction of categorical_features (and the
local all_categorical_features used before encoding) to explicitly convert to
strings (e.g., to_numpy(dtype=np.str_) or .astype(str)) so
label_encoders[column_index].transform receives the same dtype it was trained
on; keep the rest of the loop (noise_scale handling and encoding steps)
unchanged and use the existing symbols categorical_features,
all_categorical_features, label_encoders,
get_categorical_and_numerical_column_names, and DataSplit.TRAIN to locate the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bb57c420-9245-46a5-8738-a40f7dda137e
📒 Files selected for processing (4)
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.pysrc/midst_toolkit/models/clavaddpm/dataset.pysrc/midst_toolkit/models/clavaddpm/dataset_utils.pytests/integration/attacks/tartan_federer/test_tartan_federer_attack.py
emersodb
left a comment
There was a problem hiding this comment.
Thanks for putting these fixes into a PR! Most of my comments are pretty small. Some of the hoops you're jumping through is a result of inflexible code elsewhere, which we should fix at some other point.
| ) | ||
|
|
||
| # Load pre-fitted label encoders from pkl if provided, otherwise fit on current data | ||
| preloaded_encoders: dict[str, LabelEncoder] | None = None |
There was a problem hiding this comment.
I will note that I was very confused here. Your label encoder dictionary here is index by strings (column names) by the label_encoders that are to be returned are index by column indices. You do end up taking care of this later on. However, I would suggest you add a comment here to explain what's happening because the label encoder you're preloading is definitely not of the same "kind" as the ones we are constructing here (that is, it must be constructed somewhere else, we're not reusing an artifact formed by the process)
There was a problem hiding this comment.
Yeah fair enough I will do that.
emersodb
left a comment
There was a problem hiding this comment.
Changes look good. Just two very small comments.
|
|
||
| # TODO: Unify this with the Dataset.from_df function. | ||
| # TODO: Noise scale is always called with a value of 0 for the attack. | ||
| # TODO: Noise scale is always called with a value of 0 for the attack. So we should remove it from the f |
There was a problem hiding this comment.
I think the f here is a typo?
| _parent, _child = _relation_order[0] | ||
| _ckpt_path = first_model_path / f"{_parent}_{_child}_ckpt.pkl" | ||
| with open(_ckpt_path, "rb") as _f: | ||
| _probe_model = CustomUnpickler(_f).load() |
There was a problem hiding this comment.
Any reason we're using the _ prefixes here? I'd say drop them unless they are serving a purpose that I'm missing 🙂
PR Type
[Feature | Fix | Documentation | Other ]
Fix
Short Description
This pull request is doing three things:
Tests