fix(vlm): ceil-divide PP chunker so trailing samples are not dropped#2180

Open
khazic wants to merge 2 commits into NVIDIA-NeMo:main from khazic:khazic/fix/vlm-pp-chunk-uneven-batch

Conversation

@khazic
Contributor

@khazic khazic commented May 7, 2026

Summary

`_chunk_vlm_media` partitioned samples per microbatch with `batch_size // n_microbatches`. When that division is uneven, the trailing `batch_size % n_microbatches` samples are never assigned to a chunk, so their image tensors are silently dropped. Meanwhile, `schedule.step` slices `input_ids`/`labels` via `torch.Tensor.chunk(n_microbatches)`, which uses ceil-sized chunks and covers every sample. The affected microbatches therefore end up with media tokens in the text but no pixel data, breaking the vision scatter or producing wrong outputs.

This patch switches all three internal branches to ceil division so the chunker mirrors `tensor.chunk` semantics.
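The mismatch can be illustrated with a minimal pure-Python sketch (illustrative only; the real chunker in `recipes/vlm/finetune.py` partitions image tensors, not lists):

```python
import math

def chunk_floor(samples, n_microbatches):
    # Buggy behavior: floor division leaves the trailing
    # batch_size % n_microbatches samples unassigned to any chunk.
    size = len(samples) // n_microbatches
    return [samples[i * size:(i + 1) * size] for i in range(n_microbatches)]

def chunk_ceil(samples, n_microbatches):
    # Fixed behavior: ceil division mirrors torch.Tensor.chunk,
    # so every sample lands in some microbatch.
    size = math.ceil(len(samples) / n_microbatches)
    return [samples[i * size:(i + 1) * size] for i in range(n_microbatches)]

batch = list(range(7))        # batch_size=7, n_microbatches=3
print(chunk_floor(batch, 3))  # [[0, 1], [2, 3], [4, 5]] -- sample 6 lost
print(chunk_ceil(batch, 3))   # [[0, 1, 2], [3, 4, 5], [6]] -- all covered
```

With floor division the text side (chunked by `Tensor.chunk`) still sees sample 6, but its pixels never arrive; ceil division keeps the two partitions aligned.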

Changelog

  • fix(vlm): replace `batch_size // n_microbatches` with `-(-batch_size // n_microbatches)` (ceil) in `_chunk_vlm_media`'s Gemma4 multi-image, general flat-patches, and legacy 1-image-per-sample branches.
  • test(vlm): add three regression tests covering uneven `batch_size` across each branch (`batch_size=7, n_microbatches=3` for general/Gemma4; `batch_size=5, n_microbatches=3` for legacy). Each asserts no sample is dropped and chunk sizes match `tensor.chunk`.
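The invariant those tests assert can be sketched as follows (hypothetical helper names; the actual tests compare chunk boundaries against `tensor.chunk`):

```python
def ceil_div(a, b):
    # The -(-a // b) idiom from the patch: ceil division built from
    # Python's floor division on a negated numerator.
    return -(-a // b)

def chunk_sizes(batch_size, n_microbatches):
    # Chunk sizes produced by slicing with ceil-sized chunks; the
    # final chunk absorbs the remainder, matching torch.Tensor.chunk.
    size = ceil_div(batch_size, n_microbatches)
    sizes, remaining = [], batch_size
    for _ in range(n_microbatches):
        take = min(size, remaining)
        if take > 0:
            sizes.append(take)
        remaining -= take
    return sizes

# The regression cases from the changelog:
assert chunk_sizes(7, 3) == [3, 3, 1]  # general / Gemma4 branches
assert chunk_sizes(5, 3) == [2, 2, 1]  # legacy branch
assert sum(chunk_sizes(7, 3)) == 7     # no sample dropped
```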

Test plan

  • `uv run pytest tests/unit_tests/recipes/test_finetune_vlm_helpers.py` — 68 passed, 3 skipped (skips are pre-existing `fused_linear_ce` GPU cases)
  • `uv run ruff format --check` and `ruff check` on `recipes/vlm/finetune.py` — clean
  • PP=3 functional verification on a VLM recipe with `local_batch_size: 7` (requires multi-GPU host)

… dropped

_chunk_vlm_media split samples per microbatch via batch_size // n_microbatches,
which drops the last (batch_size % n_microbatches) samples when the division is
uneven. Their text still flows through schedule.step (which uses tensor.chunk
and covers all samples) but their image tensors are silently lost, leaving
trailing microbatches with media tokens but no pixel data.

Switch all three internal branches (Gemma4 multi-image, general flat patches,
legacy 1-image-per-sample) to ceil division so the chunker mirrors
torch.tensor.chunk semantics. Add three regression tests covering uneven
batch_size across each branch.

Signed-off-by: khazic <khazzz1c@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 7, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi
Contributor

/ok to test 080a77d

@HuiyingLi
Contributor

/claude review

Contributor

@claude claude Bot left a comment


LGTM. Clean bug fix — the three manually-chunked branches now use ceil division to match torch.Tensor.chunk() semantics, preventing silent sample drops on uneven splits. Tests cover all three branches. No issues found.

@HuiyingLi
Contributor

/ok to test c16dcc8
