Skip container startup for empty scenarios #6752

Draft
nccatoni wants to merge 36 commits into main from nccatoni/collection-rework

nccatoni (Collaborator) commented Apr 15, 2026
Context

In CI, ~38% of scenario invocations are "empty" — all collected tests are deselected for the given library/weblog combination. Docker infrastructure was still started and torn down for every empty scenario, wasting ~17h of compute per full CI run.

Root cause: containers started in pytest_sessionstart, before pytest knew whether any tests would run.

What this does

  1. Writes library version at build time — each weblog image gets a /system-tests-library-version file and a system-tests-library-version Docker label via install_ddtrace.sh. Agent version is read from the agent image label.

  2. Defers container startup to post-collection — new post_collection_warmups hook (in pytest_collection_finish). When both versions are known from labels (common case), containers are never created if no tests are selected.

  3. Preserves log output: the Agent:, Library:, and Weblog variant: lines are still printed before the test session starts, using label data.

  4. Graceful fallback — older images without the label fall through to the legacy path (containers start in pytest_sessionstart as before).
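
The deferral in steps 2 and 4 can be sketched as follows. This is a minimal model, not the code in this PR: `SketchScenario`, `session_start`, `collection_finish`, and `run` are hypothetical names, and the two methods only stand in for the points where `pytest_sessionstart` and `pytest_collection_finish` would fire.

```python
from typing import Callable, List


class SketchScenario:
    """Hypothetical stand-in for an end-to-end scenario; not the real class."""

    def __init__(self, versions_known_from_labels: bool) -> None:
        self.started: List[str] = []
        self.warmups: List[Callable[[], None]] = []
        self.post_collection_warmups: List[Callable[[], None]] = []
        startup = [self._create_network, self._start_containers]
        if versions_known_from_labels:
            # Deferred path: wait until pytest has finished collection.
            self.post_collection_warmups.extend(startup)
        else:
            # Legacy path: older images without the label start eagerly.
            self.warmups.extend(startup)

    def _create_network(self) -> None:
        self.started.append("network")

    def _start_containers(self) -> None:
        self.started.append("containers")

    def session_start(self) -> None:
        """Runs where pytest_sessionstart would run."""
        for warmup in self.warmups:
            warmup()

    def collection_finish(self, selected_items: list) -> None:
        """Runs where pytest_collection_finish would run."""
        if not selected_items:
            return  # empty scenario: deferred startup never happens
        for warmup in self.post_collection_warmups:
            warmup()


def run(versions_known: bool, selected_items: list) -> List[str]:
    """Drive one simulated session and report what was started."""
    scenario = SketchScenario(versions_known)
    scenario.session_start()
    scenario.collection_finish(selected_items)
    return scenario.started
```

With labels present and no selected tests, `run(True, [])` starts nothing, while `run(False, [])` still starts the network and containers as the legacy path does.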

Impact

| Library | Empty runs | Time saved |
|---|---|---|
| PHP | ~879 / 1561 (56%) | ~8.8h |
| Ruby | ~612 / 1346 (45%) | ~4.5h |
| Golang | ~205 / 396 (52%) | ~1.4h |
| Python | ~141 / 454 (31%) | ~1.3h |
| Node.js | ~138 / 330 (42%) | ~1.0h |
| Total | ~1995 / 5194 (38%) | ~17h |

Empty scenarios now complete in ~2.5s instead of 20–40s.
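
As a rough consistency check on these figures, the saving can be bounded from the numbers quoted above. The sketch assumes (this is not stated in the PR) that every empty run used to cost the same fixed duration and that the saving per run is simply the old duration minus the new one.

```python
EMPTY_RUNS = 1995             # "Total" row of the table above
NEW_SECONDS = 2.5             # empty scenario after this change
OLD_SECONDS_RANGE = (20, 40)  # empty scenario before this change


def hours_saved(old_seconds: float) -> float:
    """Total hours saved if every empty run previously took old_seconds."""
    return EMPTY_RUNS * (old_seconds - NEW_SECONDS) / 3600


low, high = (hours_saved(seconds) for seconds in OLD_SECONDS_RANGE)
print(f"{low:.1f}h to {high:.1f}h")  # prints "9.7h to 20.8h"
```

The quoted ~17h falls inside this range, which under the same assumption corresponds to an average of roughly 33s per empty run before the change.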

Notes

  • See docs/adr/002-skip-empty-scenario-containers.md for the full decision record.
  • Edge cases (replay mode, buddy containers, OTel, include_agent=False) fall through to the fallback/legacy path and are unaffected.

github-actions bot (Contributor) commented Apr 15, 2026

CODEOWNERS have been resolved as:

tests/test_the_test/test_collection_warmups.py                          @DataDog/system-tests-core
utils/build/docker/dotnet/version-tool.Dockerfile                       @DataDog/apm-dotnet @DataDog/asm-dotnet @DataDog/system-tests-core
conftest.py                                                             @DataDog/system-tests-core
pyproject.toml                                                          @DataDog/system-tests-core
tests/test_the_test/test_decorators.py                                  @DataDog/system-tests-core
tests/test_the_test/test_docker_scenario.py                             @DataDog/system-tests-core
utils/_context/_scenarios/core.py                                       @DataDog/system-tests-core
utils/_context/_scenarios/debugger.py                                   @DataDog/system-tests-core
utils/_context/_scenarios/endtoend.py                                   @DataDog/system-tests-core
utils/_context/_scenarios/go_proxies.py                                 @DataDog/system-tests-core
utils/_context/containers.py                                            @DataDog/system-tests-core
utils/build/build.sh                                                    @DataDog/system-tests-core
utils/build/docker/cpp_httpd/install_ddtrace.sh                         @DataDog/system-tests-core
utils/build/docker/cpp_kong/install_ddtrace.sh                          @DataDog/system-tests-core
utils/build/docker/cpp_nginx/install_ddtrace.sh                         @DataDog/system-tests-core
utils/build/docker/dotnet/install_ddtrace.sh                            @DataDog/apm-dotnet @DataDog/asm-dotnet @DataDog/system-tests-core
utils/build/docker/dotnet/poc.Dockerfile                                @DataDog/apm-dotnet @DataDog/asm-dotnet @DataDog/system-tests-core
utils/build/docker/dotnet/uds.Dockerfile                                @DataDog/apm-dotnet @DataDog/asm-dotnet @DataDog/system-tests-core
utils/build/docker/golang/install_ddtrace.sh                            @DataDog/dd-trace-go-guild @DataDog/system-tests-core
utils/build/docker/java/install_ddtrace.sh                              @DataDog/apm-java @DataDog/asm-java @DataDog/system-tests-core
utils/build/docker/nodejs/install_ddtrace.sh                            @DataDog/dd-trace-js @DataDog/system-tests-core
utils/build/docker/php/common/install_ddtrace.sh                        @DataDog/apm-php @DataDog/system-tests-core
utils/build/docker/python/install_ddtrace.sh                            @DataDog/apm-python @DataDog/asm-python @DataDog/system-tests-core
utils/build/docker/ruby/install_ddtrace.sh                              @DataDog/ruby-guild @DataDog/asm-ruby @DataDog/system-tests-core

nccatoni changed the title from "Nccatoni/collection rework" to "Skip container startup for empty scenarios (~17h CI compute savings)" on Apr 15, 2026

datadog-prod-us1-3 bot (Contributor) commented Apr 16, 2026

⚠️ Warnings

🚦 194 Pipeline jobs failed

Testing the test | System Tests (java, dev) / End-to-end #2 / akka-http 2

🧪 1 Test failed

tests.appsec.test_blocking_addresses.Test_Blocking_request_body_filenames.test_blocking[akka-http] from system_tests_suite
ValueError: No appsec event validate this condition

self = <tests.appsec.test_blocking_addresses.Test_Blocking_request_body_filenames object at 0x7f35ec42ff50>

    def test_blocking(self):
        """Can block on server.request.body.filenames"""
>       interfaces.library.assert_waf_attack(self.rbf_req, rule="tst-037-014")

tests/appsec/test_blocking_addresses.py:606: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
...

Testing the test | System Tests (php, dev) / End-to-end #1 / apache-mod-7.0-zts 1

🧪 1 Test failed

tests.ffe.test_exposures.Test_FFE_Exposure_Events.test_ffe_multiple_remote_config_files[apache-mod-7.0-zts] from system_tests_suite
AssertionError: Timed out waiting for exposure event for flags ['test-flag-1', 'test-flag-2'] and subject 'test-user-multi'
assert False
 +  where False = <bound method ProxyBasedInterfaceValidator.wait_for of AgentInterfaceValidator('agent')>(<function wait_for_exposure_event.<locals>.<lambda> at 0x7fd5f464afc0>, timeout=30)
 +    where <bound method ProxyBasedInterfaceValidator.wait_for of AgentInterfaceValidator('agent')> = AgentInterfaceValidator('agent').wait_for
 +      where AgentInterfaceValidator('agent') = interfaces.agent

self = <tests.ffe.test_exposures.Test_FFE_Exposure_Events object at 0x7fd60a6ba7e0>

    def test_ffe_multiple_remote_config_files(self):
        """Test that FFE correctly handles multiple remote config files with different flags."""
...

Testing the test | System Tests (python, prod) / End-to-end #2 / fastapi 2

🧪 1 Test failed

tests.test_config_consistency.Test_Config_RuntimeMetrics_Enabled.test_main[fastapi] from system_tests_suite
assert (0 > 0 or 0 > 0)
 +  where 0 = len([])
 +  and   0 = len([])

self = <tests.test_config_consistency.Test_Config_RuntimeMetrics_Enabled object at 0x7fe0355e8da0>

    def test_main(self):
        assert self.req.status_code == 200
    
        runtime_metrics_gauges, runtime_metrics_sketches = get_runtime_metrics()
...

ℹ️ Info

No other issues found

❄️ No new flaky tests detected

Commit SHA: 4e1abe9

nccatoni and others added 4 commits April 16, 2026 18:40
For the non-deferred path (old images without labels), the watchdog was
moved to post_collection_warmups alongside _wait_for_app_readiness. This
means the watchdog starts only after collection, by which time the proxy
has already written files that end up in the observer's initial snapshot
and are never ingested.

Restore the original behaviour: insert _start_interfaces_watchdog at
position 1 in warmups (before _create_network) for the elif/else paths,
matching what the original code did with warmups.insert(1, ...).

Also move _log_starting_containers into _defer_container_startup so the
"Starting containers..." message is printed when containers actually start
rather than during the pre-collection warmup phase.
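
The snapshot problem described in this commit message can be reproduced with a toy model. This is only a sketch of the behaviour as described, not the real watchdog or proxy code: `SnapshotObserver`, `simulate`, and the file names are hypothetical.

```python
from typing import List, Set


class SnapshotObserver:
    """Toy observer: reports only files that appear after start()."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.ingested: List[str] = []

    def start(self, existing_files: Set[str]) -> None:
        # Files already on disk go into the initial snapshot and are never reported.
        self._seen = set(existing_files)

    def poll(self, current_files: Set[str]) -> None:
        for path in sorted(current_files - self._seen):
            self.ingested.append(path)
        self._seen |= current_files


def simulate(observer_starts_first: bool) -> List[str]:
    """Return the files ingested, depending on when the observer starts."""
    disk: Set[str] = set()
    observer = SnapshotObserver()
    if observer_starts_first:
        observer.start(disk)
    disk.add("001_proxy_message.json")  # written by the proxy during collection
    if not observer_starts_first:
        observer.start(disk)
    disk.add("002_proxy_message.json")
    observer.poll(disk)
    return observer.ingested
```

Starting the observer late, as in `simulate(False)`, loses the first file; starting it before the proxy writes anything, as in `simulate(True)`, ingests both.
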

In the deferred path, _set_agent_component was called from
post_collection_warmups, after pytest_collection_modifyitems had already
run. That hook builds the Manifest from context.scenario.components, and
match_condition returns False for any rule whose component is absent —
so all agent-version-gated skip/xfail markers were silently dropped,
causing tests that should be skipped to run and fail.

Since agent_version is already known from the image label at configure
time (that's the condition for taking the deferred path), call
_set_agent_component() directly during configure alongside
_set_library_component(), and remove the now-redundant call from
_defer_container_startup.
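
The ordering bug can be illustrated with a small model of a version gate. The names `rule_matches` and `collect`, and the version values, are hypothetical; the only behaviour taken from the commit message is that a rule whose component is absent does not match.

```python
from typing import Dict, Tuple

Version = Tuple[int, ...]


def rule_matches(components: Dict[str, Version], component: str, below: Version) -> bool:
    """Toy version gate: matches when the component is older than `below`."""
    if component not in components:
        return False  # absent component: the rule is silently ignored
    return components[component] < below


def collect(set_agent_before_collection: bool) -> bool:
    """Return True when the agent-gated test ends up skipped."""
    components: Dict[str, Version] = {"library": (2, 0, 0)}
    agent_from_label: Version = (7, 40, 0)  # hypothetical label value
    if set_agent_before_collection:
        components["agent"] = agent_from_label
    # Markers are decided here, where pytest_collection_modifyitems would run.
    skipped = rule_matches(components, "agent", below=(7, 50, 0))
    if not set_agent_before_collection:
        components["agent"] = agent_from_label  # too late to affect the markers
    return skipped
```

`collect(True)` skips the test as intended, while `collect(False)` lets it run, which matches the failures described above.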

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

nccatoni changed the title from "Skip container startup for empty scenarios (~17h CI compute savings)" to "Skip container startup for empty scenarios" on Apr 21, 2026

nccatoni added 3 commits May 6, 2026 15:09
_log_starting_containers was added at position 0, shifting _create_network
to position 1. The insert(1, watchdog) now placed the watchdog before
network creation. Use insert(2, ...) to restore the correct order:
  log → network → watchdog → containers
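
The index shift is plain `list.insert` behaviour and can be checked in isolation. The strings below are placeholders for the real warmup callables, and `build_warmups` is a hypothetical helper.

```python
def build_warmups(watchdog_index: int) -> list:
    """Insert the watchdog step into a warmup list that now starts with a log step."""
    warmups = ["log", "network", "containers"]  # "log" was newly added at position 0
    warmups.insert(watchdog_index, "watchdog")
    return warmups


print(build_warmups(1))  # ['log', 'watchdog', 'network', 'containers']
print(build_warmups(2))  # ['log', 'network', 'watchdog', 'containers']
```
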

Use GetAssemblyVersion (already used in parametric) in poc/uds Dockerfiles
to read the version from Datadog.Trace.dll and write /system-tests-library-version
when the install script could not determine the version (i.e. .so install path).

nccatoni added 10 commits May 13, 2026 15:02
8 tests covering:
- defer path: container startup absent from warmups, present in
  post_collection_warmups in the right order (network→watchdog→containers→readiness)
- fallback/legacy paths: watchdog at index 2 (after _create_network)
- execute_post_collection_warmups: invokes all callables, calls
  close_targets() and re-raises on error
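
The cleanup contract tested here might look like the sketch below. The function name `run_post_collection_warmups` and its signature are assumptions made for illustration; the sketch also assumes `close_targets()` is called only on the error path, which is one reading of the description above.

```python
from typing import Callable, Iterable


def run_post_collection_warmups(
    warmups: Iterable[Callable[[], None]],
    close_targets: Callable[[], None],
) -> None:
    """Invoke each warmup in order; on failure, clean up and re-raise."""
    try:
        for warmup in warmups:
            warmup()
    except Exception:
        close_targets()  # release anything already started
        raise            # let pytest report the original error
```

Re-raising with a bare `raise` keeps the original traceback, so the failing warmup remains visible in the test report.
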
…scenario

- EndToEndScenario.configure: replace 3-branch if/elif/else with two flat
  blocks for library_known / agent_known and a single defer-or-watchdog tail.
- Container post_start methods stop emitting Library/Agent/Backend/UDS/variant
  log lines; the scenario warmup is now the sole owner of those logs (no more
  divergent ordering between label and healthcheck paths).
- Track container-startup warmups on the scenario so the defer path can move
  them to post_collection_warmups by identity instead of rebuilding lambdas.
- DebuggerScenario: pick warmup target list with a ternary instead of branching.
- GoProxiesScenario._set_components: drop defensive None guard; agent_version
  is always set by configure() (label) or post_start() (healthcheck) before
  this warmup runs.
- conftest: use truthy check on session.items.
- Drop duplicate ProxyContainer stub (identical to TestedContainer one).
- Yield-with-cleanup fixture pops the test scenario from the global group
  registry to avoid polluting subsequent tests.
- Drop unused config attrs and ad-hoc replay parameter from helpers.
- Replace exact-index assertions on warmups[0..3] (which broke when the
  'Starting containers' log entry became an anonymous lambda) with
  ordering invariants via .index().
- Whitelist SLF001/ANN001 for tests/test_the_test/* (warmup tests need
  to inspect privates and stub internal interfaces); drop the now-unused
  per-line ANN001 noqa directives in two existing files.
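
An ordering invariant via `.index()` can be sketched as follows. `is_ordered` and the three stub functions are hypothetical; the point is that the check survives an extra anonymous entry at the front of the list, which an exact-index assertion would not.

```python
from typing import Callable, List


def is_ordered(warmups: List[Callable], *expected_in_order: Callable) -> bool:
    """True when the given steps appear in the list in the given relative order."""
    positions = [warmups.index(step) for step in expected_in_order]
    return positions == sorted(positions)


def create_network() -> None: ...
def start_watchdog() -> None: ...
def start_containers() -> None: ...


# The leading lambda stands in for the anonymous "Starting containers" log entry.
warmups = [lambda: None, create_network, start_watchdog, start_containers]
print(is_ordered(warmups, create_network, start_watchdog, start_containers))  # True
```

`list.index` raises `ValueError` when a step is missing, so an absent warmup fails loudly instead of passing by accident.
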
…tool stage

- build.sh: drop the multi-path lookup loop. Every install_ddtrace.sh on
  this branch writes /system-tests-library-version, so reading the
  canonical path is sufficient.
- Pre-build the .NET assembly-version helper image once (system_tests/dotnet-version-tool)
  and have both poc.Dockerfile and uds.Dockerfile COPY --from=that tag,
  removing the duplicated build-version-tool stage from each Dockerfile.