fix(tests): clean up sys.modules pollution in training fixtures #2168
Open
rob-luke wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
Two cooperating fixture bugs caused order-dependent failures in
tests/unit_tests/training/:
1. test_timers.py:67 — `sys.modules["torch.distributed"] = dist_stub`
bypassed pytest's monkeypatch finalizer, leaking the stub for the
rest of the session. Any later test that imported torchao failed
with `ModuleNotFoundError: No module named 'torch.distributed.<X>';
'torch.distributed' is not a package`. Fixed by switching to
`monkeypatch.setitem(sys.modules, ...)` which is auto-reverted.
2. test_train_ft_mlflow_logging.py:_install_fake_wandb — the fake
wandb stub had `__spec__ = None`, so accelerate's
`importlib.util.find_spec("wandb")` raised
`ValueError: wandb.__spec__ is None`, breaking the test in
isolation. Fixed by setting a valid ModuleSpec via
`importlib.util.spec_from_loader`.
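For reference, a minimal sketch of both fixes (fixture and stub contents here are illustrative reconstructions, not the exact test code). Fix 1 swaps the raw `sys.modules` assignment for `monkeypatch.setitem`, which is recorded in monkeypatch's undo log and reverted at teardown:

```python
import sys
import types

import pytest


@pytest.fixture
def dist_stub(monkeypatch):
    # Hypothetical stub standing in for torch.distributed in test_timers.py.
    stub = types.ModuleType("torch.distributed")
    stub.is_initialized = lambda: False

    # Before (leaks for the rest of the session; monkeypatch never sees it):
    #   sys.modules["torch.distributed"] = stub
    # After (auto-reverted when the fixture tears down):
    monkeypatch.setitem(sys.modules, "torch.distributed", stub)
    return stub
```

Fix 2 gives the fake wandb module a real `ModuleSpec`, since `types.ModuleType` leaves `__spec__` as `None` and `importlib.util.find_spec` refuses to proceed on a cached module whose `__spec__` is `None`:

```python
import importlib.util
import sys
import types


def _install_fake_wandb(monkeypatch):
    # Illustrative version of the helper in test_train_ft_mlflow_logging.py.
    fake = types.ModuleType("wandb")

    # A bare ModuleType has __spec__ = None, so find_spec("wandb") raises
    # "ValueError: wandb.__spec__ is None". A loader-less spec is enough:
    fake.__spec__ = importlib.util.spec_from_loader("wandb", loader=None)

    monkeypatch.setitem(sys.modules, "wandb", fake)
    return fake
```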
Verified: `pytest tests/unit_tests/training/` now reports 55 passed (previously
2 failed, with both fixture bugs visible).
Signed-off-by: Robert Luke <code@robertluke.net>
Contributor:
/ok to test 11ccaf1
What does this PR do?

Fix two cooperating fixture bugs in `tests/unit_tests/training/` that cause order-dependent failures when the affected files are run as a subset (e.g. focused local iteration), without affecting the full-suite CI run.

Changelog

- `tests/unit_tests/training/test_timers.py`: replaced `sys.modules["torch.distributed"] = dist_stub` with `monkeypatch.setitem(sys.modules, "torch.distributed", dist_stub)` so the stub is reverted at fixture teardown. The previous assignment bypassed `monkeypatch`'s undo log and leaked the stub for the rest of the pytest session, causing later tests that triggered a fresh `torchao` import to fail with `ModuleNotFoundError: No module named 'torch.distributed._tensor'; 'torch.distributed' is not a package`.
- `tests/unit_tests/training/test_train_ft_mlflow_logging.py`: set `wandb.__spec__` on the fake wandb stub via `importlib.util.spec_from_loader("wandb", loader=None)`. `types.ModuleType("wandb")` defaults `__spec__` to `None`, which made accelerate's `importlib.util.find_spec("wandb")` raise `ValueError: wandb.__spec__ is None` whenever the file was run in isolation.

Before your PR is "Ready for review"
Pre checks:
Additional Information
To reproduce the fixed errors, run (against `main` before this PR merges):

```
uv run --no-sync pytest \
  tests/unit_tests/training/test_timers.py \
  tests/unit_tests/training/test_train_ft_mlflow_logging.py \
  -v --tb=short
```

Pre-fix: both tests in `test_train_ft_mlflow_logging.py` fail with `ModuleNotFoundError: No module named 'torch.distributed._tensor'; 'torch.distributed' is not a package`. After this PR: 17 / 17 pass.

Why CI did not catch this
Full-suite runs collect `tests/unit_tests/` alphabetically, so directories like `_cli/` (e.g. `test_app.py`, `test_pretrain_cli.py`) run long before `training/` and import `nemo_automodel.recipes.llm.train_ft`, which transitively caches `torchao.float8` in `sys.modules`. Once cached, the leaked `torch.distributed` stub from `test_timers.py` no longer triggers a re-import of torchao, so the failure is masked. Verified locally with both orderings (a minimal sketch of the caching effect follows below):

- `pytest training/test_timers.py training/test_train_ft_mlflow_logging.py` (fails pre-fix)
- `pytest _cli/test_app.py training/test_timers.py training/test_train_ft_mlflow_logging.py` (masked: passes even pre-fix)

Full `tests/unit_tests/training/` regression sweep after the fix: 55 / 55. `ruff format` and `ruff check` are clean on the two changed files.
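To make the masking mechanism concrete, here is a self-contained sketch of the `sys.modules` caching behavior (using hypothetical `pkg` / `pkg.sub` in place of `torch.distributed` / `torch.distributed._tensor`, so it runs without torch installed):

```python
import importlib
import sys
import types

# Simulate the leaked stub: a bare module (not a package) shadowing a real
# package in sys.modules, as the leaked torch.distributed stub did.
sys.modules["pkg"] = types.ModuleType("pkg")

try:
    # A fresh submodule import goes through the stub, which has no __path__,
    # so the import machinery rejects it as "not a package".
    importlib.import_module("pkg.sub")
except ModuleNotFoundError as err:
    print(err)  # No module named 'pkg.sub'; 'pkg' is not a package

# But if the submodule was cached before the stub leaked, the import system
# returns it straight from sys.modules and never consults the broken parent.
# This is how the earlier _cli/ imports masked the leak in full-suite runs.
sys.modules["pkg.sub"] = types.ModuleType("pkg.sub")
print(importlib.import_module("pkg.sub"))  # <module 'pkg.sub'> from cache
```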