fix(vlm): chunk video inputs for pipeline parallelism #2177

Open

khazic wants to merge 7 commits into NVIDIA-NeMo:main
Conversation
…kimi_k25

In `kimi_k25_vl_collate_fn`, `n_images_per_sample` was derived from `all_grid_thws` only, which is conditionally appended:

- text-only samples have no `grid_thws`, so nothing is appended;
- samples whose image region got orphaned by truncation (when `drop_overlong=False`, the default) also skip the append, but the sample itself stays in `all_expanded` with its image tokens replaced by pad.

Result: `len(n_images_per_sample) < batch_size` on mixed batches, while `input_ids` has shape `[batch_size, max_len]`. Downstream, PP `_chunk_vlm_media` indexes `cumsum_images` by sample index up to `batch_size` and raises `IndexError` when the cumsum is shorter.

Track the per-sample image count in lockstep with `all_expanded` (zeros for text-only and orphaned samples) so the resulting tensor has length `batch_size` in all cases. No behavior change for batches where every sample has an intact image, since the per-sample count then equals what the old derivation produced.

Adds two regression tests covering (1) a text-only + image mixed batch and (2) an intact-image + truncation-orphaned mixed batch.

Signed-off-by: khazic <khazzz1c@gmail.com>
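As a hedged sketch of the lockstep counting described above: `collate_image_counts`, the `grid_thw` key, and the `image_orphaned` flag are illustrative stand-ins, not the real collate-fn internals.

```python
import torch

def collate_image_counts(samples):
    """Track per-sample image counts in lockstep with the expanded batch.

    Each sample is a dict with an optional "grid_thw" tensor (absent for
    text-only samples) and an "image_orphaned" flag (True when truncation
    removed the image region but the padded sample itself is kept).
    """
    all_expanded = []
    all_grid_thws = []
    n_images_per_sample = []  # appended once per sample, never skipped
    for sample in samples:
        all_expanded.append(sample["input_ids"])
        if sample.get("grid_thw") is not None and not sample.get("image_orphaned"):
            all_grid_thws.append(sample["grid_thw"])
            n_images_per_sample.append(sample["grid_thw"].shape[0])
        else:
            # text-only or truncation-orphaned: count 0 so that
            # len(n_images_per_sample) == batch_size in all cases
            n_images_per_sample.append(0)
    return all_expanded, all_grid_thws, torch.tensor(n_images_per_sample)
```

The key point is that the count list is appended inside the per-sample loop unconditionally, so its length can never drop below the batch size.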
Cover the path where a single batch contains both pixel_values and pixel_values_videos under PP. Verifies that image and video are popped from schedule kwargs, both chunk arrays are sized by their per-sample counts, the shared _vlm_chunk_idx is initialized once at 0, and both streams plus the cursor are cleared after the step. Signed-off-by: khazic <khazzz1c@gmail.com>
Apply ruff format to clean up pre-existing whitespace drift in the file. No behavioral changes. Signed-off-by: khazic <khazzz1c@gmail.com>
Force-pushed from 8821cc6 to dca45ea
Contributor

/ok to test dca45ea

Contributor

/claude review
    batch["video_grid_thw"] = video_grid_thw

    if self.pp.info.has_first_stage:
        self.pp.info.schedule.step(input_ids, target=targets, losses=losses, **batch)
Contributor
LGTM — clean fix. The video chunking mirrors the existing image path correctly, the shared _vlm_chunk_idx cursor works because both modalities are chunked by the same microbatch boundaries, and the cleanup is properly guarded. Test coverage for video-only, mixed image+video, and the n_images_per_sample batch-size regression is solid.
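The shared-cursor behavior the review describes can be illustrated with an invented `StageModel`; only the attribute names mirror the PR, and the consumption logic is a simplified assumption.

```python
class StageModel:
    """Minimal sketch of one shared cursor over two chunked modalities."""

    def __init__(self, image_chunks, video_chunks):
        self._vlm_pixel_values_chunks = image_chunks          # per-microbatch images
        self._vlm_pixel_values_videos_chunks = video_chunks   # per-microbatch videos
        self._vlm_chunk_idx = 0                               # shared cursor

    def forward(self, has_image_tokens, has_video_tokens):
        idx = self._vlm_chunk_idx
        consumed_vlm_chunk = False
        if has_image_tokens:
            _ = self._vlm_pixel_values_chunks[idx]
            consumed_vlm_chunk = True
        if has_video_tokens:
            _ = self._vlm_pixel_values_videos_chunks[idx]
            consumed_vlm_chunk = True
        if consumed_vlm_chunk:
            # advance exactly once per call, even when both modalities
            # consumed their chunk for this microbatch
            self._vlm_chunk_idx += 1
```

Because both modalities are pre-chunked at the same microbatch boundaries, one increment per forward call keeps image and video streams aligned.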
Signed-off-by: khazic <khazzz1c@gmail.com>
HuiyingLi previously approved these changes on May 8, 2026
The PP video chunking refactor in this PR routes image and video chunks through separate model attributes (``_vlm_pixel_values_chunks`` / ``_vlm_image_grid_hws_chunks`` for images vs. ``_vlm_pixel_values_videos_chunks`` / ``_vlm_video_grid_thw_chunks`` for videos). The forward path is also strict: a video token in input_ids only consumes the video chunk arrays.

The GPU PP-guard test still set up image chunks (legacy attribute names) while feeding only the video token 151656, so the new strict consumption path took neither branch and ``_vlm_chunk_idx`` stayed at 0; the assertion ``assert model._vlm_chunk_idx == 1`` therefore failed in L0_Unit_Tests_GPU.

Update the fixture to set ``_vlm_pixel_values_videos_chunks`` / ``_vlm_video_grid_thw_chunks`` so the video-token branch fires. The shapes mirror the existing CPU equivalent test ``TestQwen3VLMoeForConditionalGenerationPpGuardCpu::test_chunked_pixel_values_videos_consumed_for_video_token``.

Signed-off-by: khazic <khazzz1c@gmail.com>
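A hedged sketch of the fixture mismatch: `StrictStage` and its forward are invented, with only the attribute names and the token id 151656 taken from the commit message above.

```python
import torch

VIDEO_TOKEN_ID = 151656  # video placeholder token id quoted in the commit message

class StrictStage:
    """Illustrates the strict consumption path: a video token consumes
    ONLY the video chunk attributes, never the legacy image ones."""

    def __init__(self):
        self._vlm_pixel_values_chunks = None          # legacy image attributes
        self._vlm_pixel_values_videos_chunks = None   # new video attributes
        self._vlm_chunk_idx = 0

    def forward(self, input_ids):
        if (input_ids == VIDEO_TOKEN_ID).any():
            if self._vlm_pixel_values_videos_chunks is not None:
                self._vlm_chunk_idx += 1
```

With only the legacy image attributes populated, feeding a video token leaves the cursor at 0, which is exactly the failure mode the fixture update removes.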
Contributor

/ok to test d2e13ae

Contributor

/ok to test 62f7695
Summary

VLM pipeline parallelism previously only chunked image tensors per microbatch; video tensors (`pixel_values_videos`, `video_grid_thw`, `n_videos_per_sample`) were left in the batch and got sliced along dim 0 by `schedule.step`, which is wrong because their dim 0 is total patches / video count, not batch size. Result: VLM + PP + video either crashes on a shape mismatch or silently miscorrelates patches with text tokens. This PR mirrors the existing image chunking path for video.
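To illustrate why slicing every tensor along dim 0 is wrong, a minimal shape sketch; all sizes here (90 patches, feature width 1176, batch of 2) are invented for illustration.

```python
import torch

batch_size, n_microbatches = 2, 2
# dim 0 is TOTAL PATCHES across all videos in the batch, not batch size
pixel_values_videos = torch.randn(90, 1176)
# dim 0 of input_ids really is the batch size
input_ids = torch.randint(0, 100, (batch_size, 16))

# Naively splitting along dim 0 (what a generic schedule.step slicing does)
# hands each microbatch 45 patch rows, uncorrelated with which sample's
# video those patches belong to.
chunks = torch.chunk(pixel_values_videos, n_microbatches, dim=0)
```

Patch rows must instead be split at per-sample video boundaries, which is what the dedicated chunking path provides.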
Changelog

- Pop `pixel_values_videos` / `video_grid_thw` / `n_videos_per_sample` in `_forward_backward_step` (PP branch) and pre-chunk them onto stage 0 via `_chunk_vlm_media`, then consume them per microbatch inside `forward()` of `qwen3_5_moe`, `qwen3_omni_moe`, `qwen3_vl_moe`. A shared `_vlm_chunk_idx` is incremented once per call via a `consumed_vlm_chunk` flag so image + video coexisting on stage 0 advance the cursor correctly.
- Add tests covering video chunking (unit `_chunk_vlm_media` + end-to-end `_forward_backward_step`) and image + video mixed chunking, asserting popped kwargs, per-microbatch shapes, a single shared cursor initialized at 0, and post-step cleanup.
- `ruff format` on `tests/unit_tests/recipes/test_finetune_vlm_helpers.py` to clean up pre-existing whitespace/quote drift. No behavioral changes.

Test plan
- `uv run pytest tests/unit_tests/recipes/test_finetune_vlm_helpers.py`: 68 passed, 3 skipped (skips are pre-existing `fused_linear_ce` GPU cases)
- `uv run ruff format --check` on the touched files: clean
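The per-sample chunking the changelog describes can be sketched as follows; `chunk_video_media` is a hypothetical stand-in for `_chunk_vlm_media`, assuming it slices both video tensors at per-sample video and patch boundaries.

```python
import torch

def chunk_video_media(pixel_values_videos, video_grid_thw, n_videos_per_sample,
                      microbatch_size):
    """Split video tensors at sample boundaries rather than raw dim 0.

    A video with grid (t, h, w) occupies t*h*w patch rows, so both the
    patch tensor and the grid tensor are sliced per microbatch of samples.
    """
    patches_per_video = video_grid_thw.prod(dim=-1)        # rows each video uses
    cum_videos = torch.cumsum(n_videos_per_sample, dim=0)  # videos up to sample i
    chunks = []
    video_start = patch_start = 0
    for sample_end in range(microbatch_size, len(n_videos_per_sample) + 1,
                            microbatch_size):
        video_end = int(cum_videos[sample_end - 1])
        patch_end = patch_start + int(patches_per_video[video_start:video_end].sum())
        chunks.append((pixel_values_videos[patch_start:patch_end],
                       video_grid_thw[video_start:video_end]))
        video_start, patch_start = video_end, patch_end
    return chunks
```

Each chunk then travels with its microbatch through the pipeline schedule instead of being sliced by batch index.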