fix(tests): clean up sys.modules pollution in training fixtures #2168
Open
rob-luke wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
Two cooperating fixture bugs caused order-dependent failures in
tests/unit_tests/training/:
1. test_timers.py:67 — `sys.modules["torch.distributed"] = dist_stub`
bypassed pytest's monkeypatch finalizer, leaking the stub for the
rest of the session. Any later test that imported torchao failed
with `ModuleNotFoundError: No module named 'torch.distributed.<X>';
'torch.distributed' is not a package`. Fixed by switching to
`monkeypatch.setitem(sys.modules, ...)` which is auto-reverted.
2. test_train_ft_mlflow_logging.py:_install_fake_wandb — the fake
wandb stub had `__spec__ = None`, so accelerate's
`importlib.util.find_spec("wandb")` raised
`ValueError: wandb.__spec__ is None`, breaking the test in
isolation. Fixed by setting a valid ModuleSpec via
`importlib.util.spec_from_loader`.
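For reference, a minimal sketch of both fixes (fixture and stub contents here are illustrative reconstructions, not the exact test code). Fix 1 swaps the raw `sys.modules` assignment for `monkeypatch.setitem`, which is recorded in monkeypatch's undo log and reverted at teardown:

```python
import sys
import types

import pytest


@pytest.fixture
def dist_stub(monkeypatch):
    # Hypothetical stub standing in for torch.distributed in test_timers.py.
    stub = types.ModuleType("torch.distributed")
    stub.is_initialized = lambda: False

    # Before (leaks for the rest of the session; monkeypatch never sees it):
    #   sys.modules["torch.distributed"] = stub
    # After (auto-reverted when the fixture tears down):
    monkeypatch.setitem(sys.modules, "torch.distributed", stub)
    return stub
```

Fix 2 gives the fake wandb module a real `ModuleSpec`, since `types.ModuleType` leaves `__spec__` as `None` and `importlib.util.find_spec` refuses to proceed on a cached module whose `__spec__` is `None`:

```python
import importlib.util
import sys
import types


def _install_fake_wandb(monkeypatch):
    # Illustrative version of the helper in test_train_ft_mlflow_logging.py.
    fake = types.ModuleType("wandb")

    # A bare ModuleType has __spec__ = None, so find_spec("wandb") raises
    # "ValueError: wandb.__spec__ is None". A loader-less spec is enough:
    fake.__spec__ = importlib.util.spec_from_loader("wandb", loader=None)

    monkeypatch.setitem(sys.modules, "wandb", fake)
    return fake
```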
Verified: `pytest tests/unit_tests/training/` now reports 55 passed (previously
2 failed, with both fixture bugs visible).
Signed-off-by: Robert Luke <code@robertluke.net>
Contributor:
/ok to test 11ccaf1
What does this PR do?

Fix two cooperating fixture bugs in `tests/unit_tests/training/` that cause order-dependent failures when the affected files are run as a subset (e.g. focused local iteration), without affecting the full-suite CI run.

Changelog

- `tests/unit_tests/training/test_timers.py`: replaced `sys.modules["torch.distributed"] = dist_stub` with `monkeypatch.setitem(sys.modules, "torch.distributed", dist_stub)` so the stub is reverted at fixture teardown. The previous assignment bypassed `monkeypatch`'s undo log and leaked the stub for the rest of the pytest session, causing later tests that triggered a fresh `torchao` import to fail with `ModuleNotFoundError: No module named 'torch.distributed._tensor'; 'torch.distributed' is not a package`.
- `tests/unit_tests/training/test_train_ft_mlflow_logging.py`: set `wandb.__spec__` on the fake wandb stub via `importlib.util.spec_from_loader("wandb", loader=None)`. `types.ModuleType("wandb")` defaults `__spec__` to `None`, which made accelerate's `importlib.util.find_spec("wandb")` raise `ValueError: wandb.__spec__ is None` whenever the file was run in isolation.

Before your PR is "Ready for review"
Pre checks:
Additional Information
To reproduce the fixed errors, run (against `main` before this PR merges):

```
uv run --no-sync pytest \
  tests/unit_tests/training/test_timers.py \
  tests/unit_tests/training/test_train_ft_mlflow_logging.py \
  -v --tb=short
```

Pre-fix: both tests in `test_train_ft_mlflow_logging.py` fail with `ModuleNotFoundError: No module named 'torch.distributed._tensor'; 'torch.distributed' is not a package`. After this PR: 17 / 17 pass.

Why CI did not catch this
Full-suite runs collect `tests/unit_tests/` alphabetically, so directories like `_cli/` (e.g. `test_app.py`, `test_pretrain_cli.py`) run long before `training/` and import `nemo_automodel.recipes.llm.train_ft`, which transitively caches `torchao.float8` in `sys.modules`. Once cached, the leaked `torch.distributed` stub from `test_timers.py` no longer triggers a re-import of torchao, so the failure is masked. Verified locally with both orderings (a minimal sketch of the caching effect follows below):

- `pytest training/test_timers.py training/test_train_ft_mlflow_logging.py` (fails pre-fix)
- `pytest _cli/test_app.py training/test_timers.py training/test_train_ft_mlflow_logging.py` (masked: passes even pre-fix)

Full `tests/unit_tests/training/` regression sweep after the fix: 55 / 55. `ruff format` and `ruff check` are clean on the two changed files.
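To make the masking mechanism concrete, here is a self-contained sketch of the `sys.modules` caching behavior (using hypothetical `pkg` / `pkg.sub` in place of `torch.distributed` / `torch.distributed._tensor`, so it runs without torch installed):

```python
import importlib
import sys
import types

# Simulate the leaked stub: a bare module (not a package) shadowing a real
# package in sys.modules, as the leaked torch.distributed stub did.
sys.modules["pkg"] = types.ModuleType("pkg")

try:
    # A fresh submodule import goes through the stub, which has no __path__,
    # so the import machinery rejects it as "not a package".
    importlib.import_module("pkg.sub")
except ModuleNotFoundError as err:
    print(err)  # No module named 'pkg.sub'; 'pkg' is not a package

# But if the submodule was cached before the stub leaked, the import system
# returns it straight from sys.modules and never consults the broken parent.
# This is how the earlier _cli/ imports masked the leak in full-suite runs.
sys.modules["pkg.sub"] = types.ModuleType("pkg.sub")
print(importlib.import_module("pkg.sub"))  # <module 'pkg.sub'> from cache
```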