test: fix canaries-v2#5932
Open
lucasjia-aws wants to merge 4 commits into
Under pytest-xdist (-n 120) each worker created its own private hub, exhausting the per-account hub limit (100) and triggering destructive cross-worker cleanup that deleted hubs other workers were actively using, causing "Hub ... does not exist" failures. The add_model_references fixture also swallowed all errors and did not wait for async reference propagation, causing "Hub content ... does not exist" failures.

- Share a single hub across all xdist workers via filelock + a JSON state file with reference counting; only the last worker tears it down.
- Make _cleanup_old_hubs non-destructive: only delete hubs older than STALE_HUB_AGE_HOURS and never the active run's hub.
- Add add_model_references_to_hub helper that creates references idempotently (keyed by hub + model set) and polls until each reference is resolvable before tests run.
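The age-based cleanup and the polling step described above might look roughly like the sketch below. It is not the code from this PR: `select_stale_hubs` and `wait_until_resolvable` are hypothetical names, the value 24 for STALE_HUB_AGE_HOURS is an assumption (the description names the constant but not its value), and hubs are represented as plain `(name, creation_time)` tuples rather than SageMaker API responses.

```python
import time
from datetime import datetime, timedelta, timezone

# Assumed value for illustration; the PR names the constant but not its value.
STALE_HUB_AGE_HOURS = 24


def select_stale_hubs(hubs, active_hub_name, now=None,
                      max_age_hours=STALE_HUB_AGE_HOURS):
    """Return names of hubs that are safe to delete.

    hubs is an iterable of (name, creation_time) pairs with timezone-aware
    datetimes. A hub qualifies only if it is older than the cutoff and is
    not the hub used by the current run.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    return [
        name
        for name, created in hubs
        if name != active_hub_name and created < cutoff
    ]


def wait_until_resolvable(check, timeout_s=300.0, interval_s=5.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True, or raise TimeoutError.

    check is a zero-argument callable, for example one that tries to
    describe a hub content reference and returns False while it is missing.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_s} seconds")
        sleep(interval_s)
```

Keeping the selection logic a pure function makes the "never delete the active hub" rule easy to unit test without touching any account resources.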
…ngs pollution

ModelBuilder mutates session.settings._local_download_dir to a temporary /tmp/sagemaker/model-builder/<uuid> path. The serve integ tests passed the repo-wide session-scoped sagemaker_session fixture into ModelBuilder, so that mutation leaked across test modules. After the temp dir was cleaned up, the lingering setting broke unrelated tests sharing the same session, notably tests/integ/sagemaker/workflow/test_tuning_steps.py::test_tuning_multi_algos with "ValueError: Inputted directory ... does not exist".

Override sagemaker_session in tests/integ/sagemaker/serve/conftest.py with a dedicated session (constructed identically to the parent fixture) so the ModelBuilder mutation stays contained within the serve package.
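The leak described above can be illustrated with a small stand-in model. This is only a sketch of the mechanism: `Settings`, `Session`, `build_model`, and `make_session` are hypothetical stand-ins, not the SageMaker SDK classes or the fixtures in this repository.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Settings:
    # Stand-in for the session setting that the builder rewrites.
    local_download_dir: Optional[str] = None


@dataclass
class Session:
    settings: Settings = field(default_factory=Settings)


def build_model(session: Session, tmp_dir: str) -> None:
    """Mimic the side effect: the builder points the session at a temp dir."""
    session.settings.local_download_dir = tmp_dir


def make_session() -> Session:
    """Stand-in for whatever the parent fixture does to construct a session."""
    return Session()


# Before: one session shared by every package, so the mutation leaks.
shared = make_session()
build_model(shared, "/tmp/example-model-builder-dir")
leaked = shared.settings.local_download_dir  # later tests see this stale path

# After: the serve package gets its own session, built the same way.
repo_wide = make_session()
serve_only = make_session()
build_model(serve_only, "/tmp/example-model-builder-dir")
contained = repo_wide.settings.local_download_dir  # still None
```

In pytest terms, a fixture defined in a package-level conftest.py with the same name as a parent fixture takes precedence for tests in that package, which is what makes the override local to the serve tests.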
The previous reference-counted teardown in the session fixture finalizer was unsafe: pytest-xdist distributes tests dynamically, so a worker could finish its session (running finalizers) while other workers still had hub tests pending. Decrementing to zero there deleted the shared hub mid-run, causing "Hub ... does not exist" / "Hub content ... does not exist" failures in gated hub tests.

Workers now only create-or-reuse the shared hub (never delete it). Teardown runs exactly once in pytest_sessionfinish on the controller process (no workerinput), which is guaranteed to run after all workers finish. Stale hub reclamation continues to be handled by the age-based _cleanup_old_hubs.
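A controller-only teardown along these lines could be sketched as follows. The `workerinput` check is the usual way to tell an xdist worker from the controller; everything else here is an assumption for illustration: `teardown_shared_hub`, `_delete_hub`, `STATE_FILE`, and the `{"hub_name": ...}` state layout are hypothetical and not taken from the PR's code.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical location of the JSON state file shared by all workers.
STATE_FILE = Path(tempfile.gettempdir()) / "shared_hub_state.json"


def is_controller(config) -> bool:
    """pytest-xdist sets config.workerinput on worker processes only."""
    return not hasattr(config, "workerinput")


def teardown_shared_hub(config, state_file: Path, delete_hub):
    """Delete the shared hub once, from the controller process only.

    Returns the deleted hub name, or None when nothing was done (either
    this is a worker process or no state file was written).
    """
    if not is_controller(config):
        return None  # workers create or reuse the hub but never delete it
    if not state_file.exists():
        return None  # no hub was created during this run
    hub_name = json.loads(state_file.read_text())["hub_name"]
    delete_hub(hub_name)
    state_file.unlink()
    return hub_name


def _delete_hub(hub_name: str) -> None:
    """Placeholder for the real delete call, which is not shown here."""
    print(f"would delete hub {hub_name}")


def pytest_sessionfinish(session, exitstatus):
    # In a conftest.py this hook runs in every process; the controller
    # check inside teardown_shared_hub makes it a no-op on workers.
    teardown_shared_hub(session.config, STATE_FILE, _delete_hub)
```

Passing the state path and the delete callable as parameters keeps the decision logic testable with a fake config object and no AWS calls.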
…ut in integ tests

Two unrelated v2 integ-test failures, fixed together:

- test_spark_processing.py::test_sagemaker_pyspark_v3 (Spark 3.x): build_jar ran javac/jar without checking exit codes, so a failed jar rebuild (which truncates the committed hello-spark-java.jar) was swallowed and surfaced later as a misleading "code ... wasn't found" error, especially under xdist where the fixture runs per worker. Run the build commands with explicit return-code checks and assert the jar exists afterward.
- test_serve_model_builder_inference_component_happy.py::test_model_builder_ic_sagemaker_endpoint: deploying a 7B JumpStart model as an inference component on ml.g5.24xlarge regularly needs more than the 15-minute standard endpoint timeout to reach InService (the failure was a deploy timeout, not a quota cap). Add a dedicated 30-minute timeout (SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT) for this flow without changing the standard serve endpoint timeout.
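The return-code checking described in the first item can be sketched with the standard library. `build_artifact` is a hypothetical helper, not the repository's build_jar; the non-empty size check is an extra assumption added here because the description mentions truncation, whereas the PR text only says the jar's existence is asserted.

```python
import subprocess
from pathlib import Path
from typing import Optional, Sequence


def build_artifact(commands: Sequence[Sequence[str]], artifact: Path,
                   cwd: Optional[Path] = None) -> Path:
    """Run build commands in order, failing fast on a non-zero exit code.

    check=True makes subprocess.run raise CalledProcessError instead of
    silently continuing, so a failed compile cannot be mistaken for success.
    """
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True,
                       capture_output=True, text=True)
    # Existence check per the description; the size check is an assumption.
    assert artifact.is_file(), f"build did not produce {artifact}"
    assert artifact.stat().st_size > 0, f"{artifact} is empty"
    return artifact
```

For the jar case the command list would hold the javac and jar invocations, and the raised CalledProcessError carries the captured stderr for diagnosis.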
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.