
fix(tests): clean up sys.modules pollution in training fixtures#2168

Open
rob-luke wants to merge 1 commit into NVIDIA-NeMo:main from rob-luke:rob-luke/fix/test_timers_sys_modules_leak

Conversation

rob-luke commented May 7, 2026

What does this PR do?

Fix two cooperating fixture bugs in tests/unit_tests/training/ that cause order-dependent failures when the affected files are run as a subset (e.g. focused local iteration), without affecting the full-suite CI run.

Changelog

  • tests/unit_tests/training/test_timers.py
    • Replace the direct sys.modules["torch.distributed"] = dist_stub assignment with monkeypatch.setitem(sys.modules, "torch.distributed", dist_stub) so the stub is reverted at fixture teardown (see the first sketch after this list). The previous bare assignment bypassed monkeypatch's undo log and leaked the stub for the rest of the pytest session, so any later test that triggered a fresh torchao import failed with ModuleNotFoundError: No module named 'torch.distributed._tensor'; 'torch.distributed' is not a package.
  • tests/unit_tests/training/test_train_ft_mlflow_logging.py
    • Set wandb.__spec__ on the fake wandb stub via importlib.util.spec_from_loader("wandb", loader=None) (see the second sketch after this list). types.ModuleType("wandb") leaves __spec__ as None, which made accelerate's importlib.util.find_spec("wandb") raise ValueError: wandb.__spec__ is None whenever the file was run in isolation.
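
A minimal before/after sketch of the test_timers.py change (the fixture shape here is illustrative, not the file's exact code):

    import sys
    import types

    import pytest


    @pytest.fixture
    def dist_stub(monkeypatch):
        stub = types.ModuleType("torch.distributed")

        # Before: a bare assignment that monkeypatch never records, so the
        # stub survives fixture teardown and leaks into later tests:
        #   sys.modules["torch.distributed"] = stub

        # After: recorded in monkeypatch's undo log and restored (or
        # removed) automatically at teardown.
        monkeypatch.setitem(sys.modules, "torch.distributed", stub)
        return stub

And a matching sketch of the test_train_ft_mlflow_logging.py change (the helper body around the spec_from_loader call is illustrative):

    import importlib.util
    import sys
    import types


    def _install_fake_wandb(monkeypatch):
        fake = types.ModuleType("wandb")
        # types.ModuleType leaves __spec__ as None; once the stub is in
        # sys.modules, importlib.util.find_spec("wandb") raises
        # "ValueError: wandb.__spec__ is None". A real ModuleSpec fixes it:
        fake.__spec__ = importlib.util.spec_from_loader("wandb", loader=None)
        monkeypatch.setitem(sys.modules, "wandb", fake)
        return fake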

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests? No new tests — the fix is to test fixtures themselves; correctness is observable via the existing tests, which now pass when run as a subset.
  • Did you add or update any necessary documentation? Not applicable.

Additional Information

To reproduce the fixed errors, run (against main before this PR merges):

uv run --no-sync pytest \
    tests/unit_tests/training/test_timers.py \
    tests/unit_tests/training/test_train_ft_mlflow_logging.py \
    -v --tb=short

Pre-fix, both tests in test_train_ft_mlflow_logging.py fail with ModuleNotFoundError: No module named 'torch.distributed._tensor'; 'torch.distributed' is not a package. After this PR, all 17 tests across the two files pass.

Why CI did not catch this

Full-suite runs collect tests/unit_tests/ alphabetically, so directories like _cli/ (e.g. test_app.py, test_pretrain_cli.py) run long before training/ and import nemo_automodel.recipes.llm.train_ft, which transitively caches torchao.float8 in sys.modules. Once cached, the leaked torch.distributed stub from test_timers.py no longer triggers a re-import of torchao, so the failure is masked. Verified locally:

Scenario (pre-fix)                                                                        | Result
pytest training/test_timers.py training/test_train_ft_mlflow_logging.py                  | 2 failed, 15 passed
pytest _cli/test_app.py training/test_timers.py training/test_train_ft_mlflow_logging.py | 33 passed
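
A minimal, self-contained sketch of the masking mechanism, with hypothetical stand-in names (fake_pkg plays the role of torchao, and the bare ModuleType plays the leaked non-package stub):

    import importlib
    import sys
    import types

    # A plain module, NOT a package: it has no __path__, just like the
    # leaked torch.distributed stub.
    parent = types.ModuleType("fake_pkg")
    cached = types.ModuleType("fake_pkg.sub")
    sys.modules["fake_pkg"] = parent
    sys.modules["fake_pkg.sub"] = cached

    # While the submodule is cached, an import is just a dict lookup and
    # the broken parent is never consulted -- the full-suite CI case:
    assert importlib.import_module("fake_pkg.sub") is cached

    # Drop the cache entry and the import machinery must resolve the
    # submodule through its parent -- the focused-subset case, where the
    # non-package stub produces the observed error:
    del sys.modules["fake_pkg.sub"]
    try:
        importlib.import_module("fake_pkg.sub")
    except ModuleNotFoundError as err:
        print(err)  # No module named 'fake_pkg.sub'; 'fake_pkg' is not a package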

Full tests/unit_tests/training/ regression sweep after the fix: 55 / 55 passed. ruff format and ruff check are clean on the two changed files.

Two cooperating fixture bugs caused order-dependent failures in
tests/unit_tests/training/:

1. test_timers.py:67 — `sys.modules["torch.distributed"] = dist_stub`
   bypassed pytest's monkeypatch finalizer, leaking the stub for the
   rest of the session. Any later test that imported torchao failed
   with `ModuleNotFoundError: No module named 'torch.distributed.<X>';
   'torch.distributed' is not a package`. Fixed by switching to
   `monkeypatch.setitem(sys.modules, ...)` which is auto-reverted.

2. test_train_ft_mlflow_logging.py:_install_fake_wandb — the fake
   wandb stub had `__spec__ = None`, so accelerate's
   `importlib.util.find_spec("wandb")` raised
   `ValueError: wandb.__spec__ is None`, breaking the test in
   isolation. Fixed by setting a valid ModuleSpec via
   `importlib.util.spec_from_loader`.

Verified: `pytest tests/unit_tests/training/` reports 55 passed
(previously 2 failed, with both fixture bugs visible).

Signed-off-by: Robert Luke <code@robertluke.net>

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

rob-luke commented May 7, 2026

@akoumpa & @ko3n1g thank you for the great package. I opened this PR during a GitHub PR outage and it does not appear in GitHub's Pull Requests tab for me; please let me know if I should reopen the PR so it is visible for review. Please disregard this message: the outage appears to be over and the PR is now visible. Sorry!

akoumpa commented May 7, 2026

/ok to test 11ccaf1
