fix(vlm): ceil-divide PP chunker so trailing samples are not dropped #2180
Open
khazic wants to merge 2 commits into NVIDIA-NeMo:main from
Conversation
`_chunk_vlm_media` split samples per microbatch via `batch_size // n_microbatches`, which drops the last `batch_size % n_microbatches` samples when the division is uneven. Their text still flows through `schedule.step` (which uses `tensor.chunk` and covers all samples), but their image tensors are silently lost, leaving trailing microbatches with media tokens but no pixel data.

This change switches all three internal branches (Gemma4 multi-image, general flat patches, legacy 1-image-per-sample) to ceil division so the chunker mirrors `torch.tensor.chunk` semantics, and adds three regression tests covering an uneven batch_size across each branch.

Signed-off-by: khazic <khazzz1c@gmail.com>
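A minimal pure-Python sketch of the off-by-remainder behavior described above. The function names `chunk_floor`/`chunk_ceil` are hypothetical stand-ins for the chunker's partitioning logic; the real code operates on image tensors, not lists:

```python
import math

def chunk_floor(samples, n_microbatches):
    # Buggy variant: floor division sizes each chunk as batch_size // n_microbatches,
    # so the trailing batch_size % n_microbatches samples never land in any chunk.
    per_mb = len(samples) // n_microbatches
    return [samples[i * per_mb:(i + 1) * per_mb] for i in range(n_microbatches)]

def chunk_ceil(samples, n_microbatches):
    # Fixed variant: ceil division, mirroring torch.tensor.chunk -- every sample
    # is assigned to a chunk; only the last chunk may be smaller.
    per_mb = math.ceil(len(samples) / n_microbatches)
    return [samples[i * per_mb:(i + 1) * per_mb] for i in range(n_microbatches)]

samples = list(range(10))          # batch_size = 10, n_microbatches = 4 (uneven)
print(chunk_floor(samples, 4))     # [[0, 1], [2, 3], [4, 5], [6, 7]] -- samples 8, 9 dropped
print(chunk_ceil(samples, 4))      # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]] -- all covered
```

With an even split the two variants agree, which is why the bug only surfaces when `batch_size % n_microbatches != 0`.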
Contributor
/ok to test 080a77d
Contributor
/claude review
Contributor
/ok to test c16dcc8
Summary
`_chunk_vlm_media` partitioned samples per microbatch with `batch_size // n_microbatches`. When that division is uneven the trailing `batch_size % n_microbatches` samples never get assigned to a chunk — their image tensors are silently dropped. Meanwhile `schedule.step` slices `input_ids`/`labels` via `torch.tensor.chunk(n_microbatches)` which uses ceil-sized chunks and covers every sample, so the affected microbatches end up with media tokens in text but no pixel data, breaking vision scatter or producing wrong outputs.
This patch switches all three internal branches to ceil division so the chunker mirrors `tensor.chunk` semantics.
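To make the mismatch concrete, the sketch below (hypothetical helper names, assumed per-microbatch sizing only) compares the chunk sizes `torch.tensor.chunk` produces for text against what the floor-dividing media chunker produced:

```python
import math

def text_chunk_sizes(batch_size, n_mb):
    # torch.tensor.chunk semantics: chunks of size ceil(batch_size / n_mb);
    # the last chunk may be smaller, and fewer than n_mb chunks may be returned.
    per = math.ceil(batch_size / n_mb)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(per, remaining))
        remaining -= per
    return sizes

def media_chunk_sizes_floor(batch_size, n_mb):
    # Old chunker: every microbatch gets exactly batch_size // n_mb samples.
    return [batch_size // n_mb] * n_mb

batch_size, n_mb = 7, 3
print(text_chunk_sizes(batch_size, n_mb))        # [3, 3, 1]: all 7 samples covered
print(media_chunk_sizes_floor(batch_size, n_mb)) # [2, 2, 2]: only 6 samples get media
```

Every microbatch after the first is misaligned (text sizes 3/3/1 vs. media sizes 2/2/2), which is why the failure mode is media tokens in the text stream with no matching pixel data, not just a short final microbatch.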
Changelog
Test plan