Skip to content

test(pipelines): add CI validation for official oneshot samples#542

Merged
streamer45 merged 5 commits into
mainfrom
devin/1780150771-oneshot-sample-validation
May 30, 2026
Merged

test(pipelines): add CI validation for official oneshot samples#542
streamer45 merged 5 commits into
mainfrom
devin/1780150771-oneshot-sample-validation

Conversation

@staging-devin-ai-integration
Copy link
Copy Markdown
Contributor

@staging-devin-ai-integration staging-devin-ai-integration Bot commented May 30, 2026

Summary

  • The official samples/pipelines/oneshot/*.yml had zero automated coverage — only the hand-written samples/pipelines/test/ fixtures were validated. This adds a manifest-driven harness that runs every oneshot sample against a live skit server and validates the output, wired into both pipeline-validation e2e jobs (SW + GPU).
  • Coverage is gated so it stays cheap and correct: fast Tier-A samples run in the default job; software plugins / VA-API are skipped via optional_node (so the GPU job's PIPELINE_REQUIRE_NODES=1 doesn't fail on them); NVENC/Vulkan run in the GPU job; S3 samples gate on requires_env; and heavyweight showcase pipelines (realtime pacers, software AV1 — up to ~200s) are behind slow / PIPELINE_INCLUDE_SLOW=1.
  • Every oneshot YAML must have a manifest entry — a new sample without one fails the harness, so future samples get triaged instead of silently uncovered.
  • Generalized run_pipelinerun_pipeline_parts for multi-field uploads (dual-input mixers) and text inputs (TTS), added JSON-body validation (transcription / VAD), and made the request timeout configurable (PIPELINE_TIMEOUT_SECS).
  • Added tests/pipeline-validation/README.md documenting how to maintain the harness (manifest rule, skip controls, when to mark a sample slow).

This is the test-harness half of the S2 oneshot review. No oneshot sample was broken — all 38 validated locally (Tier-A/B + S3/Servo pass; HW skipped without a GPU), so there are no sample fixes or removals here.

Review-feedback follow-ups

  • De-duplicated the media expectations. The 8 ffprobe-checked fields are now a single MediaExpectations struct, flattened (#[serde(flatten)]) into both Expected and OneshotEntry. This removes the hand-written media_expected() copy (which also dead-copied output_extension), so a new ffprobe field is declared once instead of in three places. A unit test pins the flattened TOML deserialization (toml 0.8 + integer fields).
  • Marked video_nv_av1_colorbars slow. Its core::pacer paces 900 frames at real time (~30s wall-clock), and NVENC AV1 is already covered cheaply by the test/nv_av1_colorbars fixture (30 frames), so the GPU job no longer pays ~30s/run for redundant coverage. (The sibling Vulkan sample has no pacer, so it stays in the default GPU run.)
  • Added a nightly workflow (pipeline-validation-nightly.yml, schedule + manual dispatch) that runs the slow samples with PIPELINE_INCLUDE_SLOW=1 on SW codecs.

Two registry/model-packaging issues surfaced during testing (not repo bugs, not in this PR) and are now tracked: parakeet ships flat model files instead of the expected named subdir (#548), and the nllb model is XZ-compressed but labeled .tar.bz2 so the installer picks the wrong decompressor (#549).

Review & Validation

  • Sanity-check the skip semantics in oneshot-samples.toml: optional_node = true for marketplace plugins + VA-API; requires_node only for HW codecs the GPU job builds (nvcodec/vulkan); slow = true now also on video_nv_av1_colorbars.
  • Confirm the GPU runner is happy running video_vulkan_video_h264_colorbars at 1280×720 (the remaining non-slow HW sample) — validated here only via requires_node skip (no local GPU).
  • just test-oneshot-samples http://localhost:4545 against a server with plugins reproduces the full pass; add PIPELINE_INCLUDE_SLOW=1 --test-threads=1 for the showcase samples.

Notes

  • Slow samples intentionally excluded from PR CI: video_av1_compositor_demo (rav1e SW AV1) alone is ~200s, the four realtime-paced compositor demos are ~30s each, and video_nv_av1_colorbars ~30s. All runnable on demand via PIPELINE_INCLUDE_SLOW=1 and nightly.
  • tests/ is excluded from the workspace, so this crate isn't covered by just lint/just test; correctness is enforced by the e2e job. Re-ran the harness locally after the refactor (38/38, incl. Tier-B model samples and a slow compositor sample) and cargo clippy clean.

Link to Devin session: https://staging.itsdev.in/sessions/66d442e49aca4390adb6749e19cb5dec
Requested by: @streamer45


Devin Review

Status Commit
🕐 Outdated 00b41e8 (HEAD is 3461138)

Run Devin Review

Open in Devin Review (Staging)

The official samples/pipelines/oneshot/*.yml had zero automated coverage — only the small samples/pipelines/test/ fixtures were validated. This adds a manifest-driven harness that runs every oneshot sample against a live skit server and validates its output (ffprobe for media, JSON parse for transcription/VAD), wired into both e2e pipeline-validation jobs.

- New oneshot test target + oneshot-samples.toml manifest covering all 38 samples; a sample without a manifest entry fails the harness so new ones get triaged.
- Generalise run_pipeline into run_pipeline_parts to support multi-field uploads (dual-input mixers) and text inputs (TTS); add JSON output validation; make the HTTP timeout configurable via PIPELINE_TIMEOUT_SECS.
- Skip controls: requires_node, optional_node (plugins / VA-API — skip even under PIPELINE_REQUIRE_NODES=1), requires_env (S3 creds), and slow (opt in via PIPELINE_INCLUDE_SLOW=1 for realtime-paced / software-AV1 showcases).
- Add a just test-oneshot-samples recipe.

All 38 samples validated locally (Tier-A/B/EXT pass; HW skipped without a GPU). No sample was broken.

Signed-off-by: streamkit-devin <devin@streamkit.dev>
@staging-devin-ai-integration
Copy link
Copy Markdown
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment, CI, and merge conflict monitoring

Copy link
Copy Markdown
Contributor Author

@staging-devin-ai-integration staging-devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 11 potential issues.

Open in Devin Review (Staging)
Debug

Playground

Comment on lines +245 to +246
requires_node = "plugin::native::nllb"
optional_node = true
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Single-node gating lets multi-plugin samples run with missing dependencies

Several manifest entries gate an optional sample on only one node even though the YAML uses additional optional nodes, so a partially installed server can run the test instead of skipping it and then fail during pipeline construction/execution. For example, this entry only requires plugin::native::nllb, but samples/pipelines/oneshot/speech_to_text_translate.yml:29-47 also uses plugin::native::whisper and plugin::native::piper; the same pattern affects vad-filtered-stt (samples/pipelines/oneshot/vad-filtered-stt.yml:29-36), speech_to_text_translate_helsinki (samples/pipelines/oneshot/speech_to_text_translate_helsinki.yml:39-56), and video_dav1d_compositor_demo (samples/pipelines/oneshot/video_dav1d_compositor_demo.yml:103-116).

Prompt for agents
The oneshot sample manifest and harness currently model only a single requires_node per sample, but several samples need multiple optional nodes to be present. Update the manifest schema in tests/pipeline-validation/src/lib.rs and the check in tests/pipeline-validation/tests/oneshot.rs so an entry can require multiple node kinds, then update tests/pipeline-validation/oneshot-samples.toml for multi-plugin/multi-codec samples such as speech_to_text_translate, speech_to_text_translate_helsinki, vad-filtered-stt, and video_dav1d_compositor_demo. The intended behavior is that if any required node is missing, optional samples skip, while non-optional samples still fail under PIPELINE_REQUIRE_NODES=1.
Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in a1dd8f0. Added a requires_nodes list to OneshotEntry so a sample can require multiple node kinds; the harness now skips (or fails under PIPELINE_REQUIRE_NODES=1) if any required node is missing. Populated the secondary nodes for speech_to_text_translate (whisper, piper), speech_to_text_translate_helsinki (whisper, piper), vad-filtered-stt (vad), and video_dav1d_compositor_demo (svt_av1::encoder). Verified all 38 samples still pass locally.

Comment on lines +25 to +27
# ---------------------------------------------------------------------------
# Tier A — pure software, fast enough for the default CI job.
# ---------------------------------------------------------------------------
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Section-divider comment violates repository comment rules

AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a1dd8f0 — dropped the dashed divider rules; the tier sections now use a single plain comment.

Comment on lines +72 to +75
# ---------------------------------------------------------------------------
# Tier A — heavyweight showcase pipelines (realtime pacers / slow SW AV1).
# Opt in with PIPELINE_INCLUDE_SLOW=1.
# ---------------------------------------------------------------------------
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Section-divider comment violates repository comment rules

AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a1dd8f0 — dropped the dashed divider rules here too.

Comment on lines +127 to +131
# ---------------------------------------------------------------------------
# Tier B — marketplace plugins + models. optional_node = true so the GPU job
# (PIPELINE_REQUIRE_NODES=1, no plugins) skips rather than fails. Run locally
# after installing the plugin/model.
# ---------------------------------------------------------------------------
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Section-divider comment violates repository comment rules

AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a1dd8f0 — dropped the dashed divider rules here too.

Comment on lines +275 to +280
# ---------------------------------------------------------------------------
# Tier C — hardware codecs. nvcodec / vulkan_video are compiled by the
# pipeline-validation-gpu CI job, so they are NOT optional_node (a missing node
# there is a real regression). VA-API is compiled by no CI job, so it is
# optional_node = true.
# ---------------------------------------------------------------------------
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Section-divider comment violates repository comment rules

AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added dashed TOML section header is a direct violation of that mandatory repository rule.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a1dd8f0 — dropped the dashed divider rules here too.

Comment thread tests/pipeline-validation/src/lib.rs Outdated
#[serde(default)]
pub output_kind: OutputKind,

// --- Media expectations (used when `output_kind = media`). ---
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Section-divider comment violates repository comment rules

AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added implementation section marker is a direct violation of that mandatory repository rule.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a1dd8f0 — removed this // --- divider (per AGENTS.md comment guidelines).

Comment thread tests/pipeline-validation/src/lib.rs Outdated
#[serde(default)]
pub json_contains: Vec<String>,

// --- Skip controls. ---
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Section-divider comment violates repository comment rules

AGENTS.md explicitly prohibits section-divider comments and asks contributors to use blank lines or code structure instead. This newly added implementation section marker is a direct violation of that mandatory repository rule.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in a1dd8f0 — removed this // --- Skip controls. --- divider.

Comment thread tests/pipeline-validation/src/lib.rs Outdated
Comment on lines +516 to +522
let first_line = body
.lines()
.find(|line| !line.trim().is_empty())
.ok_or("JSON output is empty")?;

serde_json::from_str::<serde_json::Value>(first_line)
.map_err(|e| format!("Output is not valid JSON: {e}\n first line: {first_line}"))?;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 JSON validation only checks the first NDJSON record

The new JSON path validates that the first non-empty line parses as JSON and then optionally checks raw substrings. That matches the current NDJSON-style transcription/VAD samples, but it means later malformed records would not be caught if a node emits valid JSON first and corrupt output later. If the intent is to smoke-test JSON framing rather than payload completeness this is acceptable; otherwise consider validating every non-empty line for output_kind = "json".

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strengthened in a1dd8f0. validate_json_output now first tries to parse the whole body as a single JSON document; if that fails it treats the body as NDJSON and requires every non-empty line to parse (previously only the first line was checked). Substring checks still scan the full body.

Comment thread .github/workflows/e2e.yml
Comment on lines +105 to +108
- name: Validate official oneshot samples (SW codecs only)
run: |
cd tests/pipeline-validation
PIPELINE_TEST_URL=http://127.0.0.1:4545 cargo test --test oneshot
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: Default CI intentionally skips slow showcase samples

The workflow invokes the oneshot harness without PIPELINE_INCLUDE_SLOW, so entries marked slow = true will not run in either the software or GPU CI jobs. This appears intentional from the justfile guidance and manifest comments, but reviewers should be aware that several long-running compositor/pacer samples remain manifest-covered rather than CI-executed by default.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, and intentional: the slow showcase samples (realtime pacers, rav1e SW AV1 ~200s) are gated behind PIPELINE_INCLUDE_SLOW=1 to keep CI cheap. They remain manifest-covered and runnable on demand via just test-oneshot-samples. Happy to add a periodic (nightly) job that runs them with the flag set if you'd like that coverage in CI too.

Comment on lines +92 to +108
pub fn available_nodes() -> &'static HashSet<String> {
static NODES: OnceLock<HashSet<String>> = OnceLock::new();
NODES.get_or_init(|| {
get_available_nodes(base_url()).unwrap_or_else(|err| {
if std::env::var("PIPELINE_REQUIRE_NODES").as_deref() == Ok("1") {
panic!(
"FATAL: Could not query available nodes: {err}\n \
PIPELINE_REQUIRE_NODES=1 requires a reachable server at {}",
base_url()
);
}
eprintln!("WARNING: Could not query available nodes: {err}");
eprintln!(" Node-gated tests will be skipped.");
eprintln!(" Is skit running at {}?", base_url());
HashSet::new()
})
})
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 Info: Node availability is cached for the process lifetime

Moving available_nodes() into the shared library means both harnesses now use a once-per-process snapshot of /api/v1/schema/nodes. This is fine for the CI pattern where skit is built and started before tests, but a local run against a server that loads/unloads plugins during the same test process will not see those changes. That is a behavioral constraint worth keeping in mind for plugin-focused local validation.

Open in Devin Review (Staging)

Was this helpful? React with 👍 or 👎 to provide feedback.

Debug

Playground

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — node availability is queried once and cached for the process lifetime via OnceLock. That's fine for these batch test runs: the server's registered node set is static for the duration of a cargo test invocation, and caching avoids re-querying /api/v1/... for every sample. No change needed.

- Support multiple required node kinds per sample via requires_nodes, so
  multi-plugin/codec samples (translate chains, vad-filtered-stt,
  dav1d compositor) skip when any dependency is missing instead of
  running and failing during pipeline construction.
- Validate every NDJSON record in JSON outputs, not just the first line.
- Drop section-divider comments per AGENTS.md comment guidelines.

Signed-off-by: streamkit-devin <devin@streamkit.dev>
…htly slow job

Extract the shared ffprobe checks into MediaExpectations, flattened into both
Expected and OneshotEntry, so a new media field is declared once instead of
copied by hand in media_expected() (which also dead-copied output_extension).

Mark video_nv_av1_colorbars slow: its core::pacer paces 900 frames at real
time (~30s), and NVENC AV1 is already covered cheaply by the
test/nv_av1_colorbars fixture, so the GPU job no longer pays ~30s per run.

Add a nightly workflow that runs the slow samples with PIPELINE_INCLUDE_SLOW=1.

Signed-off-by: streamkit-devin <devin@streamkit.dev>
Document the two suites, local run recipes, the per-sample manifest rule,
skip-control semantics, when to mark a sample slow, and that MediaExpectations
is the single source of truth for ffprobe checks.

Signed-off-by: streamkit-devin <devin@streamkit.dev>
slow is checked before the node check, so the oneshot suite skips nv_av1 and
only guards vulkan_video; NVENC-AV1 registration is guarded by the validate.rs
nv_av1_colorbars fixture.

Signed-off-by: streamkit-devin <devin@streamkit.dev>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.98%. Comparing base (ca6c476) to head (3461138).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #542      +/-   ##
==========================================
+ Coverage   79.96%   79.98%   +0.01%     
==========================================
  Files         234      234              
  Lines       68061    68061              
  Branches     1933     1933              
==========================================
+ Hits        54426    54438      +12     
+ Misses      13629    13617      -12     
  Partials        6        6              
Flag Coverage Δ
backend 79.75% <ø> (+0.01%) ⬆️
ui 82.24% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
core 85.37% <ø> (ø)
engine 82.94% <ø> (ø)
api 89.96% <ø> (ø)
nodes 76.89% <ø> (+0.04%) ⬆️
server 80.33% <ø> (ø)
plugin-native 83.70% <ø> (ø)
plugin-wasm 92.20% <ø> (ø)
ui-services 84.69% <ø> (ø)
ui-components 60.49% <ø> (ø)
see 1 file with indirect coverage changes
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@streamer45 streamer45 merged commit 78f8332 into main May 30, 2026
29 checks passed
@streamer45 streamer45 deleted the devin/1780150771-oneshot-sample-validation branch May 30, 2026 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants