
test: fix canaries-v2 #5932

Open
lucasjia-aws wants to merge 4 commits into aws:master-v2 from lucasjia-aws:fix/canary-v2

Conversation

@lucasjia-aws
Collaborator

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Under pytest-xdist (-n 120) each worker created its own private hub,
exhausting the per-account hub limit (100) and triggering destructive
cross-worker cleanup that deleted hubs other workers were actively
using, causing "Hub ... does not exist" failures. The add_model_references
fixture also swallowed all errors and did not wait for async reference
propagation, causing "Hub content ... does not exist" failures.

- Share a single hub across all xdist workers via filelock + a JSON
  state file with reference counting; only the last worker tears it down.
- Make _cleanup_old_hubs non-destructive: only delete hubs older than
  STALE_HUB_AGE_HOURS and never the active run's hub.
- Add add_model_references_to_hub helper that creates references
  idempotently (keyed by hub + model set) and polls until each
  reference is resolvable before tests run.
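
The age-based cleanup and the wait-for-propagation step above could look roughly like the following. This is a minimal sketch, not the code in this PR: `select_stale_hubs`, `wait_until_resolvable`, the `is_resolvable` callback, and the 24-hour value for `STALE_HUB_AGE_HOURS` are all assumptions made for illustration, and the real helpers talk to the SageMaker hub APIs rather than taking plain callables.

```python
import time
from datetime import datetime, timedelta, timezone

STALE_HUB_AGE_HOURS = 24  # assumed value; the PR does not state the threshold


def select_stale_hubs(hubs, active_hub_name, now=None):
    """Return names of hubs older than the threshold, never the active hub.

    `hubs` is an iterable of (name, creation_time) pairs, where each
    creation_time is a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=STALE_HUB_AGE_HOURS)
    return [
        name
        for name, created in hubs
        if name != active_hub_name and created < cutoff
    ]


def wait_until_resolvable(is_resolvable, names, timeout=300.0, interval=5.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll `is_resolvable(name)` until every name resolves or `timeout` elapses.

    `is_resolvable` is a hypothetical callback that returns True once a
    reference can be described; errors are not swallowed here.
    """
    deadline = clock() + timeout
    pending = list(names)
    while pending:
        # Keep only the references that still fail to resolve.
        pending = [name for name in pending if not is_resolvable(name)]
        if not pending:
            return
        if clock() >= deadline:
            raise TimeoutError(f"references not resolvable: {pending}")
        sleep(interval)
```

Raising on timeout, instead of returning quietly, keeps a propagation failure visible at fixture setup rather than letting it surface later as "Hub content ... does not exist".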
…ngs pollution

ModelBuilder mutates session.settings._local_download_dir to a temporary
/tmp/sagemaker/model-builder/<uuid> path. The serve integ tests passed the
repo-wide session-scoped sagemaker_session fixture into ModelBuilder, so that
mutation leaked across test modules. After the temp dir was cleaned up, the
lingering setting broke unrelated tests sharing the same session, notably
tests/integ/sagemaker/workflow/test_tuning_steps.py::test_tuning_multi_algos
with "ValueError: Inputted directory ... does not exist".

Override sagemaker_session in tests/integ/sagemaker/serve/conftest.py with a
dedicated session (constructed identically to the parent fixture) so the
ModelBuilder mutation stays contained within the serve package.
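
A toy model of the isolation described above, using stand-in objects rather than the real SageMaker session or ModelBuilder. `make_session` and `build_model` are hypothetical names; in the actual change the dedicated session is a pytest fixture named `sagemaker_session` defined in the serve package's conftest.py.

```python
from types import SimpleNamespace


def make_session():
    """Stand-in for building a session; the real fixture constructs a SageMaker session."""
    return SimpleNamespace(settings=SimpleNamespace(_local_download_dir=None))


def build_model(session):
    """Mimics the side effect described above: the builder rewrites a session setting."""
    session.settings._local_download_dir = "/tmp/sagemaker/model-builder/<uuid>"


repo_wide_session = make_session()  # shared by every other test package
serve_session = make_session()      # dedicated session for the serve package

build_model(serve_session)          # serve tests only mutate their own session
```

Because the two sessions are separate objects, the temporary download directory never appears in the repo-wide session's settings, which is the containment the conftest override is meant to provide.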
@lucasjia-aws lucasjia-aws requested a review from a team as a code owner June 5, 2026 22:51
@lucasjia-aws lucasjia-aws requested a review from zhaoqizqwang June 5, 2026 22:51
@lucasjia-aws lucasjia-aws changed the title from "fix: fix canaries-v2" to "test: fix canaries-v2" Jun 5, 2026
The previous reference-counted teardown in the session fixture finalizer
was unsafe: pytest-xdist distributes tests dynamically, so a worker could
finish its session (running finalizers) while other workers still had hub
tests pending. Decrementing to zero there deleted the shared hub mid-run,
causing "Hub ... does not exist" / "Hub content ... does not exist"
failures in gated hub tests.

Workers now only create-or-reuse the shared hub (never delete it). Teardown
runs exactly once in pytest_sessionfinish on the controller process (no
workerinput), which is guaranteed to run after all workers finish. Stale
hub reclamation continues to be handled by the age-based _cleanup_old_hubs.
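
The controller-only teardown above could be sketched as follows. It relies on the pytest-xdist convention that worker processes carry a `workerinput` attribute on their config object while the controller does not. The state-file layout (a JSON object with a `hub_name` key), `teardown_shared_hub`, and the `delete_hub` callback are assumptions for illustration, not the PR's actual code.

```python
import json
import os


def is_xdist_worker(config):
    """pytest-xdist sets `workerinput` on the config of worker processes only."""
    return hasattr(config, "workerinput")


def teardown_shared_hub(config, state_path, delete_hub):
    """Delete the shared hub once, from the controller (or a non-xdist run).

    Returns True if a teardown happened, False otherwise.
    """
    if is_xdist_worker(config) or not os.path.exists(state_path):
        return False
    with open(state_path) as fh:
        hub_name = json.load(fh)["hub_name"]  # assumed state-file layout
    delete_hub(hub_name)
    os.remove(state_path)
    return True


# Hypothetical wiring in a conftest.py (STATE_PATH and delete_hub are placeholders):
# def pytest_sessionfinish(session, exitstatus):
#     teardown_shared_hub(session.config, STATE_PATH, delete_hub)
```

Workers return early, so a worker that finishes its session first cannot delete a hub that other workers still need.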
…ut in integ tests

Two unrelated v2 integ-test failures, fixed together:

- test_spark_processing.py::test_sagemaker_pyspark_v3 (Spark 3.x): build_jar
  ran javac/jar without checking exit codes, so a failed jar rebuild (which
  truncates the committed hello-spark-java.jar) was swallowed and surfaced
  later as a misleading "code ... wasn't found" error, especially under xdist
  where the fixture runs per worker. Run the build commands with explicit
  return-code checks and assert the jar exists afterward.

- test_serve_model_builder_inference_component_happy.py::
  test_model_builder_ic_sagemaker_endpoint: deploying a 7B JumpStart model as
  an inference component on ml.g5.24xlarge regularly needs more than the
  15-minute standard endpoint timeout to reach InService (the failure was a
  deploy timeout, not a quota cap). Add a dedicated 30-minute timeout
  (SERVE_SAGEMAKER_IC_ENDPOINT_TIMEOUT) for this flow without changing the
  standard serve endpoint timeout.
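
For the first fix above, a build step with explicit return-code checks might look like this. It is a minimal sketch under stated assumptions: `run_checked` and this `build_jar` signature are hypothetical, the command lists are supplied by the caller rather than being the javac/jar invocations used in the test suite, and the non-empty size check is an addition beyond the existence check the commit describes.

```python
import os
import subprocess


def run_checked(cmd, cwd=None):
    """Run a build command; raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True)


def build_jar(commands, jar_path):
    """Run each build command in order, then verify the jar was produced."""
    for cmd in commands:
        run_checked(cmd)
    # Extra guard (not from the PR text): a truncated, zero-byte jar also fails.
    if not os.path.isfile(jar_path) or os.path.getsize(jar_path) == 0:
        raise RuntimeError(f"jar build did not produce a usable file: {jar_path}")
    return jar_path
```

With `check=True`, a failed compile stops the fixture immediately with the failing command and exit code, instead of surfacing later as a "code ... wasn't found" error from the processing job.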
@lucasjia-aws lucasjia-aws deployed to auto-approve with GitHub Actions June 6, 2026 06:47
