fix(vlm): ceil-divide PP chunker so trailing samples are not dropped#2180

Open
khazic wants to merge 2 commits into NVIDIA-NeMo:main from khazic:khazic/fix/vlm-pp-chunk-uneven-batch

Conversation

@khazic
Contributor

@khazic khazic commented May 7, 2026

Summary

`_chunk_vlm_media` partitioned samples per microbatch with `batch_size // n_microbatches`. When that division is uneven, the trailing `batch_size % n_microbatches` samples are never assigned to a chunk, so their image tensors are silently dropped. Meanwhile, `schedule.step` slices `input_ids`/`labels` via `torch.Tensor.chunk(n_microbatches)`, which uses ceil-sized chunks and covers every sample. The affected microbatches therefore end up with media tokens in the text but no pixel data, breaking the vision scatter or producing wrong outputs.

This patch switches all three internal branches to ceil division so the chunker mirrors `tensor.chunk` semantics.
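The mismatch can be illustrated with a minimal pure-Python sketch (illustrative only; the real chunker in `recipes/vlm/finetune.py` partitions image tensors, not lists):

```python
import math

def chunk_floor(samples, n_microbatches):
    # Buggy behavior: floor division leaves the trailing
    # batch_size % n_microbatches samples unassigned to any chunk.
    size = len(samples) // n_microbatches
    return [samples[i * size:(i + 1) * size] for i in range(n_microbatches)]

def chunk_ceil(samples, n_microbatches):
    # Fixed behavior: ceil division mirrors torch.Tensor.chunk,
    # so every sample lands in some microbatch.
    size = math.ceil(len(samples) / n_microbatches)
    return [samples[i * size:(i + 1) * size] for i in range(n_microbatches)]

batch = list(range(7))        # batch_size=7, n_microbatches=3
print(chunk_floor(batch, 3))  # [[0, 1], [2, 3], [4, 5]] -- sample 6 lost
print(chunk_ceil(batch, 3))   # [[0, 1, 2], [3, 4, 5], [6]] -- all covered
```

With floor division the text side (chunked by `Tensor.chunk`) still sees sample 6, but its pixels never arrive; ceil division keeps the two partitions aligned.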

Changelog

  • fix(vlm): replace `batch_size // n_microbatches` with `-(-batch_size // n_microbatches)` (ceil) in `_chunk_vlm_media`'s Gemma4 multi-image, general flat-patches, and legacy 1-image-per-sample branches.
  • test(vlm): add three regression tests covering uneven `batch_size` across each branch (`batch_size=7, n_microbatches=3` for general/Gemma4; `batch_size=5, n_microbatches=3` for legacy). Each asserts no sample is dropped and chunk sizes match `tensor.chunk`.
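The invariant those tests assert can be sketched as follows (hypothetical helper names; the actual tests compare chunk boundaries against `tensor.chunk`):

```python
def ceil_div(a, b):
    # The -(-a // b) idiom from the patch: ceil division built from
    # Python's floor division on a negated numerator.
    return -(-a // b)

def chunk_sizes(batch_size, n_microbatches):
    # Chunk sizes produced by slicing with ceil-sized chunks; the
    # final chunk absorbs the remainder, matching torch.Tensor.chunk.
    size = ceil_div(batch_size, n_microbatches)
    sizes, remaining = [], batch_size
    for _ in range(n_microbatches):
        take = min(size, remaining)
        if take > 0:
            sizes.append(take)
        remaining -= take
    return sizes

# The regression cases from the changelog:
assert chunk_sizes(7, 3) == [3, 3, 1]  # general / Gemma4 branches
assert chunk_sizes(5, 3) == [2, 2, 1]  # legacy branch
assert sum(chunk_sizes(7, 3)) == 7     # no sample dropped
```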

Test plan

  • `uv run pytest tests/unit_tests/recipes/test_finetune_vlm_helpers.py` — 68 passed, 3 skipped (skips are pre-existing `fused_linear_ce` GPU cases)
  • `uv run ruff format --check` and `ruff check` on `recipes/vlm/finetune.py` — clean
  • PP=3 functional verification on a VLM recipe with `local_batch_size: 7` (requires multi-GPU host)

… dropped

_chunk_vlm_media split samples per microbatch via batch_size // n_microbatches,
which drops the last (batch_size % n_microbatches) samples when the division is
uneven. Their text still flows through schedule.step (which uses tensor.chunk
and covers all samples) but their image tensors are silently lost, leaving
trailing microbatches with media tokens but no pixel data.

Switch all three internal branches (Gemma4 multi-image, general flat patches,
legacy 1-image-per-sample) to ceil division so the chunker mirrors
torch.tensor.chunk semantics. Add three regression tests covering uneven
batch_size across each branch.

Signed-off-by: khazic <khazzz1c@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi
Contributor

/ok to test 080a77d

@HuiyingLi
Contributor

/claude review

Contributor

@claude claude Bot left a comment


LGTM. Clean bug fix — the three manually-chunked branches now use ceil division to match torch.Tensor.chunk() semantics, preventing silent sample drops on uneven splits. Tests cover all three branches. No issues found.

@HuiyingLi
Contributor

/ok to test c16dcc8
