fix(vlm): fail loudly in PP chunker when pixel_values cannot be aligned by khazic · Pull Request #2181 · NVIDIA-NeMo/Automodel

khazic · 2026-05-07T11:51:32Z

Summary

`_chunk_vlm_media`'s legacy fallback (`n_images != batch_size` AND `n_images_per_sample is None` AND pixel_values is not shaped `[batch_size, ...]`) used to dump every image into mb0, emit empty pixel_values for mb1..N, and log a warning. Subsequent microbatches whose text still carried media tokens would then scatter into empty tensors — silent corruption rather than a crash.

This patch replaces the warn-and-continue with `raise ValueError`, forcing the calling collate to align with the chunker's contract instead of training on broken data.

Changelog

fix(vlm): `_chunk_vlm_media` now raises `ValueError` when no layout matches, with a message listing the actual shapes / counts and pointing at the two valid options (`pixel_values [batch_size, ...]` or `n_images_per_sample` set).
test(vlm): rewrite `test_fallback_mismatched_images` and `test_pp_vlm_chunking_mismatched_images_and_batch` to assert the new `pytest.raises` contract instead of the old "warn and continue" behavior.

Behavioral note

This is a breaking change for any caller relying on the previous warn-and-continue behavior. Such callers were already producing incorrect training data — empty pixel_values scattered into media tokens cannot yield meaningful gradients — so the breakage is intentional. All in-tree collates already set `n_images_per_sample`, so no in-tree recipe is affected.

Test plan

`uv run pytest tests/unit_tests/recipes/test_finetune_vlm_helpers.py` — 65 passed, 3 skipped (skips are pre-existing `fused_linear_ce` GPU cases)
`uv run ruff check` on `recipes/vlm/finetune.py` — clean

…ed to batch The legacy fallback in _chunk_vlm_media (when n_images != batch_size and no n_images_per_sample is provided) used to log a warning, dump every image into mb0, and emit empty pixel_values for mb1..N. Subsequent microbatches whose text still contained media tokens would then scatter into empty tensors — silent corruption rather than a crash. Replace the warn-and-continue with raise ValueError. Callers must either pass pixel_values shaped [batch_size, ...] (one media tensor per sample) or include n_images_per_sample so the chunker can map images to samples. Update the two existing tests that codified the warn-and-continue behavior to assert the new contract via pytest.raises. Signed-off-by: khazic <khazzz1c@gmail.com>

copy-pr-bot · 2026-05-07T11:51:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

HuiyingLi · 2026-05-07T11:55:03Z

/ok to test 0e2d985

HuiyingLi · 2026-05-07T21:45:24Z

/claude review

claude

LGTM

HuiyingLi · 2026-05-08T18:59:56Z

/ok to test c7b7b13

khazic requested review from HuiyingLi, ZhiyuLi-Nvidia, adil-a, akoumpa, athitten, hemildesai, pthombre and zyzhou5 as code owners May 7, 2026 11:51

github-actions Bot added the community-request label May 7, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci May 7, 2026 11:55 Inactive

copy-pr-bot Bot temporarily deployed to test May 7, 2026 11:55 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 7, 2026 12:00 Inactive

claude Bot approved these changes May 7, 2026

View reviewed changes

HuiyingLi approved these changes May 8, 2026

View reviewed changes

HuiyingLi enabled auto-merge (squash) May 8, 2026 00:54

Merge branch 'main' into khazic/fix/vlm-pp-chunk-fallback-raise

c7b7b13

copy-pr-bot Bot temporarily deployed to nemo-ci May 8, 2026 19:00 Inactive

copy-pr-bot Bot temporarily deployed to test May 8, 2026 19:00 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 8, 2026 19:05 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(vlm): fail loudly in PP chunker when pixel_values cannot be aligned#2181

fix(vlm): fail loudly in PP chunker when pixel_values cannot be aligned#2181
khazic wants to merge 2 commits intoNVIDIA-NeMo:mainfrom
khazic:khazic/fix/vlm-pp-chunk-fallback-raise

khazic commented May 7, 2026

Uh oh!

copy-pr-bot Bot commented May 7, 2026

Uh oh!

HuiyingLi commented May 7, 2026

Uh oh!

HuiyingLi commented May 7, 2026

Uh oh!

claude Bot left a comment

Uh oh!

HuiyingLi commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

khazic commented May 7, 2026

Summary

Changelog

Behavioral note

Test plan

Uh oh!

copy-pr-bot Bot commented May 7, 2026

Uh oh!

HuiyingLi commented May 7, 2026

Uh oh!

HuiyingLi commented May 7, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

HuiyingLi commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants