Release Notes
Highlights
- Expanded VLM line-up: Gemma 4, Mistral 4, Qwen3.5 VL
- Diffusion and discrete-diffusion LLM (new tracks)
- NeMo Retriever – bi-encoder + cross-encoder / reranker
- Knowledge Distillation scaled to TP > 1 and PP (Sepehr Sameni)
- MoE infrastructure deepening – UCCL-EP, HybridEP, grouped_mm
- SkyPilot launcher backend (Aditya Saxena, community)
- End-to-end checkpoint + convergence robustness framework
Model Support – newly supported families in r0.4.0
- LLM
- Qwen3.5 dense (#1373, #1427), GLM 5 (#1372), Nemotron Super V3 (#1522); see "Changelog Details" for the full list
- VLM / OMNI
- Gemma 4 (#1660), Mistral 4 (#1556), Qwen3.5 VL (see "Highlights")
- Diffusion
- Flux T2I, Hunyuan T2V, Wan 2.1 T2V (see "Diffusion" section)
- Discrete diffusion LLM
- LLaDA (see "Discrete Diffusion LLM" section)
Diffusion – new track in r0.4.0
- HuggingFace Diffusers integration
- r0.4.0 ships full pretrain / finetune / generate pipelines with LoRA support for diffusion models (T2V, T2I); a minimal generation sketch follows this section
- Wan integrated with multi-resolution DataLoader (#1475)
- Inference utility for diffusion (#1491)
- LoRA for diffusion (#1653, Linnan Wang)
- Diffusion processor registry (#1379)
- Models / recipes shipped
- Flux T2I – pretrain, SFT, LoRA, generate
- Hunyuan T2V – SFT, LoRA, generate
- Wan 2.1 T2V – pretrain, SFT, LoRA, generate
- Documentation guides for dataset preprocessing and finetuning.
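To make the new track concrete, here is a minimal generation sketch against the plain Hugging Face Diffusers API that this track integrates with. The checkpoint ID, LoRA path, and prompt are illustrative assumptions; Automodel's own training/generation entry points are not shown.

```python
# Minimal sketch, assuming only the public Hugging Face Diffusers API.
# The checkpoint ID and LoRA output path are illustrative, not Automodel defaults.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed Flux T2I checkpoint
    torch_dtype=torch.bfloat16,
).to("cuda")

# Optionally apply LoRA weights produced by a finetuning run (hypothetical path).
pipe.load_lora_weights("outputs/flux_lora")

image = pipe("a watercolor lighthouse at dusk", num_inference_steps=28).images[0]
image.save("lighthouse.png")
```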
Discrete Diffusion LLM (dLLM) – new track in r0.4.0
- dLLM supervised fine-tuning support (#1665)
- LLaDA SFT support (#1672)
- dLLM generation support (#1692); a conceptual decoding sketch follows this section
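For readers new to discrete diffusion, the sketch below illustrates the general LLaDA-style decoding loop: start from fully masked positions and iteratively commit the most confident predictions. It is a conceptual illustration under assumed names (`model`, `mask_id`), not the shipped #1692 implementation.

```python
# Conceptual discrete-diffusion decoding sketch (not the #1692 code).
# `model` maps token IDs [1, L] -> logits [1, L, vocab]; `mask_id` is assumed.
import torch

@torch.no_grad()
def dllm_generate(model, mask_id: int, length: int) -> torch.Tensor:
    x = torch.full((1, length), mask_id, dtype=torch.long)  # start fully masked
    while (x == mask_id).any():
        logits = model(x)
        conf, preds = logits.softmax(-1).max(-1)         # per-position confidence
        conf = conf.masked_fill(x != mask_id, -1.0)      # only masked slots compete
        pos = conf.argmax(dim=-1)                        # most confident masked slot
        x[0, pos] = preds[0, pos]                        # commit one token per step
    return x
```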
NeMo Retriever (bi-encoder + cross-encoder)
- Refactored cross-encoder / reranker training loop (new in r0.4.0) (#1449); a conceptual two-stage retrieval sketch follows this list
- Bi-encoder datasets can be loaded directly from the HuggingFace Hub (#1380)
- Bi-encoder masking + consistent attn_implementation default (#1349)
- Resolve retrieval dataset corpus paths relative to training file (#1367)
- Docs: docs/guides/retrieval/finetune.md
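For orientation, the sketch below shows the standard two-stage pattern these components implement (dense bi-encoder retrieval, then cross-encoder reranking), using the public sentence-transformers API as a stand-in. The model IDs are illustrative; this is not NeMo Retriever's own interface.

```python
# Conceptual two-stage retrieval sketch using the public sentence-transformers
# API as a stand-in; model IDs are illustrative, not NeMo Retriever defaults.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "Bi-encoders embed queries and passages independently for fast ANN search.",
    "Cross-encoders score each (query, passage) pair jointly, trading speed for accuracy.",
]
query = "why rerank with a cross-encoder?"

# Stage 1: bi-encoder retrieval over precomputable corpus embeddings.
hits = util.semantic_search(
    bi_encoder.encode(query), bi_encoder.encode(corpus), top_k=2
)[0]

# Stage 2: cross-encoder rescoring of the retrieved candidates.
scores = reranker.predict([(query, corpus[h["corpus_id"]]) for h in hits])
print(max(zip(scores, hits), key=lambda s: s[0]))
```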
Knowledge Distillation – Sepehr Sameni
- TP-aware KDLoss with distributed softmax and T² scaling (#1499)
- Pipeline parallelism support for knowledge distillation (#1500)
- KD inference-mode fix (#1567)
- A minimal KD-loss sketch follows this section
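As a reference point for the loss itself, here is a minimal single-GPU sketch of temperature-scaled distillation with T² gradient scaling. The TP-aware distributed softmax from #1499 is intentionally not reproduced.

```python
# Minimal single-GPU sketch of temperature-scaled KD loss with T^2 scaling;
# the TP-aware distributed softmax from #1499 is not shown here.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor, T: float = 2.0):
    # Soften both distributions with temperature T.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL(teacher || student), scaled by T^2 so gradient magnitudes stay
    # comparable to the unsoftened cross-entropy loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
```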
Parallelism / Performance / Train-loop
- FSDP2
- FSDP2 weight prefetching + async TP optimization (#1711)
- Context Parallel
- CP for Nemotron v3 (#1441) and Qwen3.5 MoE (#1560); CP attention-mask hooks for dense (non-TE) models (#1470)
- Correct 3D mRoPE position_ids sharding under CP (#1482)
- Pipeline Parallel
- pp_seq_len field in PipelineConfig (#1390); PP microbatch pixel-split handling for VLMs (#1513)
- Activation checkpointing
- gradient_checkpointing overhead reduction in transformers 5.3 (#1621 – Yuki Huang)
- MoE infrastructure
- UCCL-EP alternative dispatcher (#1635 – Zhiqi Li & Hemil Desai)
- HybridEP (#1333, #1666)
- DeepEP-on-H100 RDMA fallback detection (#1275 – Piotr Żelasko)
- torch._grouped_mm expert backend (#1228)
- TE FusedAdam QuantizedTensor compatibility patch (#1417)
- MoE LoRA rank scaling + torch_mm path (#1300, #1392)
- Expert / diversity metrics (#1232, #1506), top-k utilization (#1418); see the utilization sketch after this section
- Packed sequences for MoE with EP+PP (#1685)
- FlashOptim integration (#1492)
- Scheduler-driven python GC (#1391)
- fp32 RMSNorm backend + cast_model_to_dtype for improved stability (#1493)
- Native Comet ML experiment tracking (#1411, Logan Vegna, community)
- Added .generate() with KV-cache for Nemotron v3 (#1332, Piotr Żelasko)
- Added output_hidden_states for NemotronHForCausalLM (#1386, Desh Raj)
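To make the utilization metric concrete, the sketch below computes the per-expert share of routed tokens from raw router logits. Names and shapes are illustrative rather than Automodel's internals.

```python
# Hedged sketch of a top-k expert-utilization metric like the one logged
# above (#1418): the fraction of routed tokens each expert receives.
import torch

def expert_utilization(router_logits: torch.Tensor, num_experts: int, top_k: int):
    # router_logits: [num_tokens, num_experts]
    topk_idx = router_logits.topk(top_k, dim=-1).indices        # [tokens, k]
    counts = torch.bincount(topk_idx.flatten(), minlength=num_experts)
    return counts.float() / topk_idx.numel()  # sums to 1.0 across experts

util = expert_utilization(torch.randn(4096, 8), num_experts=8, top_k=2)
print((util > 0).sum().item(), "of 8 experts received tokens")
```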
Launcher & CLI
- Refactored CLI app and launching (#1406)
- SkyPilot cloud execution backend (#1590 – Aditya Saxena, community)
- NeMo-Run launcher integration (#1668)
Checkpoint and convergence robustness
- Checkpointing: End-to-end finetune → vLLM-deploy testing (#1606)
- Models covered:
- Gemma 3
- Nemotron (Flash 1B, Super v3, Nano 9B, Nano v3)
- Phi 4, Llama 3.2, Qwen 2.5
- Qwen 3 MoE, GPT-OSS
- What this catches: prediction divergence, packaging gaps, vLLM loading issues (see the divergence-check sketch after this section)
- Convergence harness (#1554, #1577, #1602)
- Pipeline: Tulu-3 data prep → model verification → training → eval
- Models covered:
- GPT-OSS 20B (FlashAdamW + TE FusedAdam)
- Moonlight 16B (3 configs incl. EP8+CP2)
- Qwen3 4B (3 configs incl. CP1/CP2 variants)
- Qwen3 MoE 30B (2 configs + experiments/)
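A hedged miniature of the prediction-divergence check: round-trip a model through save/reload and bound the KL divergence between its token distributions before and after. The stand-in model ID and threshold are assumptions, not the #1606 harness.

```python
# Hedged sketch of a checkpoint prediction-divergence check; the model ID
# and KL threshold are assumptions, not the #1606 test harness.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-0.5B"  # small stand-in model
tok = AutoTokenizer.from_pretrained(model_id)
ref = AutoModelForCausalLM.from_pretrained(model_id).eval()

ref.save_pretrained("roundtrip")  # simulate checkpoint save
reloaded = AutoModelForCausalLM.from_pretrained("roundtrip").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    p = F.log_softmax(ref(**inputs).logits, dim=-1)        # reference dist
    q = F.log_softmax(reloaded(**inputs).logits, dim=-1)   # round-tripped dist
kl = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(p || q)
assert kl.item() < 1e-3, f"prediction divergence: KL={kl.item():.2e}"
```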
Datasets
- Neat packing (greedy knapsack) for LLM and VLM (#1485 – Zhiqi Li); see the packing sketch after this list
- Pretokenization support for VLM (Zhiqi Li)
- MultiImage dataset support for Qwen family (Zhiqi Li)
- Qwen family video training support (Zhiqi Li)
- LengthGroupedSampler (#1618 – Zhiqi Li)
- Chat datasets THD/BSHD + CP, padding fixes (#1416)
- reasoning_content + tool-calling support in ChatDataset (#1644, Zeel Desai, community)
- Custom chat_template override for VLM finetuning (#1525, Bambuuai, community)
- NEFTune noisy embeddings (#1686, stanley1208, community)
- JSONL malformed-line skip (#1694, Somshubra Majumdar)
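As a picture of what neat packing does, here is a hedged greedy-knapsack sketch: sort sequences by length descending and drop each into the first bin with room under max_len, so batches carry fewer pad tokens. This mirrors the idea behind #1485, not the shipped implementation.

```python
# Hedged greedy-knapsack packing sketch (first-fit on length-sorted sequences);
# mirrors the idea behind #1485, not the Automodel implementation.
def neat_pack(lengths: list[int], max_len: int) -> list[list[int]]:
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins: list[list[int]] = []   # packed sequence indices per bin
    used: list[int] = []         # tokens consumed per bin
    for idx in order:
        for b in range(len(bins)):
            if used[b] + lengths[idx] <= max_len:  # first bin with room wins
                bins[b].append(idx)
                used[b] += lengths[idx]
                break
        else:
            bins.append([idx])   # no bin fits: open a new one
            used.append(lengths[idx])
    return bins

# Five sequences pack into three rows of <= 1024 tokens instead of five.
print(neat_pack([900, 700, 500, 300, 200], max_len=1024))  # [[0], [1, 3], [2, 4]]
```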
Documentation
- Diffusion support guide (#1495)
- Large MoE LLM guide (#1541)
- Gemma 4 tutorial (#1657)
- Per-model documentation pages (#1683)
- Installation guidance (#1371); finetune guide updates (#1548, #1678)
Contributions – we are grateful for all contributions!
- Khazzz1c
- Stopped resolve_yaml_env_vars from scanning runtime data in instantiate() (#1827).
- Additional contributions coming in r0.5.0.
- Logan Vegna: added native Comet ML experiment tracking support (#1411).
- Harsha Pasham: fixed error with aten::equal operator on meta tensors (#1769).
- Aditya Saxena: added SkyPilot support (#1590).
- SwekeR-463: added MFU logging in train recipes (#1413).
- stanley1208
- Implemented NEFTune noisy embeddings for instruction fine-tuning (#1686).
- Added best_metric_key field to CheckpointingConfig (#1641).
- Removed redundant _keep_in_fp32_modules for GptOssForCausalLM layer norms (#1633).
- Zeel Desai
- Added reasoning_content and tool-calling support to ChatDataset (#1644).
- Additional contributions in the next release.
- Bambuuai: enabled custom chat_template override for VLM fine-tuning (#1525).
- Zakir Jiwani: fixed a YAML-parsing instantiation issue blocking DeepSeek V3 finetuning (issue #1496) (#1654).
Known Issues
- Minor memory regression in cohere_command_r_7b_hellaswag_fp8 and glm_4_9b_chat_hf_hellaswag_fp8
- Qwen3_5_4b_neat_packing hangs during checkpoint saving
- MegatronFSDP support postponed to the 26.06 release
- ~2% of checkpoint loads currently exercise a less-optimized path; this is being addressed in follow-up work.
Changelog Details
- refactor: extract initialize_model_weights from load_base_model by @hemildesai :: PR: #1356
- fix: prefer moe_config for num_experts in apply_ac by @hemildesai :: PR: #1361
- fix: FSDP pre-shard combined projections on dim 1 for Qwen2.5-7B support by @ZhiyuLi-Nvidia :: PR: #1357
- ci: Update release workflow to include changelog and docs by @chtruong814 :: PR: #1320
- feat: Add `.generate()` function with KV cache support for Nemotron v3 by @pzelasko :: PR: #1332
- fix: loss masking with pad eos collision by @akoumpa :: PR: #1338
- feat: add Qwen3.5 35b by @HuiyingLi :: PR: #1373
- feat: refactor retriever code by @adil-a :: PR: #1166
- fix: resolve retrieval dataset corpus paths relative to training file by @oliverholworthy :: PR: #1367
- docs: Replace latest docs with nightly by @chtruong814 :: PR: #1358
- fix: EP collective deadlock with variable-length token counts by @ShiftyBlock :: PR: #1365
- fix: guard AutoConfig.from_pretrained in PP mask precomputation by @hemildesai :: PR: #1378
- docs: fix broken links across documentation guides by @chenopis :: PR: #1374
- fix: Handle check_model_inputs removal in transformers 5.2.0 by @oliverholworthy :: PR: #1369
- fix: coverage for customizer_retrieval tests by @akoumpa :: PR: #1382
- docs: add nano-v3 full sft benchmarks by @adil-a :: PR: #1387
- docs: Added installation guidance by @onel :: PR: #1371
- docs: update readme and docs by @akoumpa :: PR: #1370
- feat: make MoE parallelizer mixed precision policy configurable via recipes by @hemildesai :: PR: #1392
- ci: Add-credentials-for-docs by @ko3n1g :: PR: #1389
- feat: add pp_seq_len field to PipelineConfig by @hemildesai :: PR: #1390
- feat: add onnx export for biencoder by @akoumpa :: PR: #1276
- feat: add scheduler-driven manual garbage collection across recipes by @hemildesai :: PR: #1391
- fix: skip instantiation of nested configs overridden by kwargs in ConfigNode by @oliverholworthy :: PR: #1397
- fix: MoE lora adapter layout by @akoumpa :: PR: #1395
- fix: update GLM 4.7 Flash TE DeepEP finetuning config by @hemildesai :: PR: #1401
- fix: read rope config from rope_parameters across all models by @hemildesai :: PR: #1400
- docs: Ensure all docs updates from main are nightly by @chtruong814 :: PR: #1402
- feat: add output_hidden_states support to NemotronHForCausalLM by @desh2608 :: PR: #1386
- refactor: use auto_map for faster init by @akoumpa :: PR: #1405
- feat: allow disabling top-k expert utilization logging in MoE metrics by @hemildesai :: PR: #1418
- feat: add TE FusedAdam QuantizedTensor compatibility patch by @hemildesai :: PR: #1417
- feat: add MoE LoRA rank scaling and torch_mm to MoE LoRA by @hemildesai :: PR: #1300
- fix: add missing vocab_size to benchmark configs using MockIterableData by @krishnakalyan3 :: PR: #1404
- fix: correct MoE auxiliary loss gradient scaling by @hemildesai :: PR: #1412
- feat: add qwen 3.5 small dense models by @HuiyingLi :: PR: #1427
- fix: add mistral common + `_remap_system_role` by @akoumpa :: PR: #1423
- feat: support loading biencoder datasets directly from HuggingFace Hub by @oliverholworthy :: PR: #1380
- feat: add merge lora tool by @akoumpa :: PR: #1424
- feat: Migrating code from DFM to Automodel by @pthombre :: PR: #1379
- fix: misc doc updates by @akoumpa :: PR: #1153
- fix: switch to bf16 + sdpa for TP parity tests by @akoumpa :: PR: #1437
- fix: default value set by @akoumpa :: PR: #1443
- fix: vlm collate leading space fix by @HuiyingLi :: PR: #1428
- fix: disable rope_fusion when context parallelism (cp > 1) is enabled by @hemildesai :: PR: #1440
- fix: TP fix for nano-v2 by @adil-a :: PR: #1448
- fix: support multiple model types in merge lora + test update by @akoumpa :: PR: #1446
- fix: biencoder bidirectional masking and consistent attn_implementation default by @oliverholworthy :: PR: #1349
- docs: add contrib button to readme by @akoumpa :: PR: #1454
- feat: improved error messages by @akoumpa :: PR: #1452
- feat: add ty for attention/config/launcher/loggers/optim by @akoumpa :: PR: #1445
- fix: native fp8 checkpoint + peft by @adil-a :: PR: #1459
- fix: move print_trainable_parameters on device by @akoumpa :: PR: #1463
- feat: parameterize onnx export test on dtype by @akoumpa :: PR: #1457
- fix: handle missing reset_parameters in Qwen3_5MoeBlock.init_weights() by @zpqiu :: PR: #1461
- fix: combined projection bias loading and rms_norm numerical instability by @ZhiyuLi-Nvidia :: PR: #1410
- fix: qwen3_8b_hellaswag_pp_peft recipe by @ZhiyuLi-Nvidia :: PR: #1335
- ci: Update pyt base container to 26.02 by @thomasdhc :: PR: #1436
- ci: Create uv sync arg docker arg by @thomasdhc :: PR: #1474
- ci: Switch to merge-commit CI by @ko3n1g :: PR: #1472
- feat: keep tokenizer assets v4-compatible by @akoumpa :: PR: #1465
- fix: CombinedProjectionStateDictAdapter._gather_1d_bias by @akoumpa :: PR: #1477
- fix: MoE parallelizer config lookup for VLM models with nested text_config by @HuiyingLi :: PR: #1466
- fix: meta device init condition by @adil-a :: PR: #1480
- cp: fix: DTensor materialization in MoE state_dict adapter for ep_shard > 1 by @HuiyingLi :: PR: #1483
- fix: biencoder PEFT adapter key remapping for merge_lora by @adil-a :: PR: #1479
- docs: Parse Finetuning Tutorial by @aasthajh :: PR: #1471
- fix: correct 3D mRoPE position_ids sharding in context parallelism by @HuiyingLi :: PR: #1482
- fix: tp plan for nemotron super by @akoumpa :: PR: #1487
- feat: fp32 RMSNorm backend and cast_model_to_dtype by @hemildesai :: PR: #1493
- feat: support chat datasets with THD, BSHD + CP and padding fixes by @hemildesai :: PR: #1416
- fix: skip initialize_weights for Gemma3ForCausalLM (DTensor TP assertion) by @terrykong :: PR: #1488
- docs: fine-tuning process and container usage by @krishnakalyan3 :: PR: #1484
- feat: TP-aware KDLoss with distributed softmax and T² scaling by @Separius :: PR: #1499
- fix: make MistralCommonBackend inherit from PreTrainedTokenizerBase by @akoumpa :: PR: #1505
- fix: forward-compatible _patched_get_init_context for transformers v5.3.0 by @HuiyingLi :: PR: #1504
- fix: Log exception and error in FirstRankPerNode before exiting by @athitten :: PR: #1468
- feat: add FlashOptim optimizer integration by @hemildesai :: PR: #1492
- fix: attach CP attention-mask hooks for dense (non-TE) context parallelism by @hemildesai :: PR: #1470
- feat: add new score func and pp microbatch pixel split handling by @HuiyingLi :: PR: #1513
- feat: add MoE expert diversity metrics by @hemildesai :: PR: #1506
- fix: gpt-oss ckpt saving by @akoumpa :: PR: #1501
- fix: TP parallelizer with replicated qkvs by @akoumpa :: PR: #1519
- fix: construct rope_parameters fallback for MiniMaxM2 by @hemildesai :: PR: #1518
- feat: Super V3 by @adil-a :: PR: #1522
- feat: Add GLM 5 implementation by @hemildesai :: PR: #1372
- feat: update readme by @akoumpa :: PR: #1531
- ci: improve functional test msg by @akoumpa :: PR: #1524
- fix: de-pickle by @akoumpa :: PR: #1517
- ci: add default env vars ala .github/actions/test-template/action.yml L120 by @akoumpa :: PR: #1523
- feat: Enable custom chat_template override for VLM fine-tuning by @Bambuuai :: PR: #1525
- cp: feat: add neat packing (greedy knapsack) for LLM and VLM datasets by @HuiyingLi :: PR: #1485
- fix: Revert uv.lock to fix install test with NGC Cuda by @chtruong814 :: PR: #1534
- feat: add v4_compatible ckpt by @akoumpa :: PR: #1532
- fix: baichuan .bin ckpt loading by @akoumpa :: PR: #1515
- ci: Update uv lock codeowner and commit block by @thomasdhc :: PR: #1539
- docs: add large moe llm doc by @HuiyingLi :: PR: #1541
- feat: Integrate Wan with multi-resolution DL by @pthombre :: PR: #1475
- feat: Add native Comet ML experiment tracking by @LoganVegnaSHOP :: PR: #1411
- fix: replace pickle with torch.load(..., weights_only=True) by @akoumpa :: PR: #1546
- fix: optimized TP plan lookup in NeMo-RL by qualname by @ZhiyuLi-Nvidia :: PR: #1547
- feat: model addition by @HuiyingLi :: PR: #1550
- feat: add more example configs by @akoumpa :: PR: #1553
- fix: handle Nemotron V3 with force_hf=True in weight initialization skip logic by @RayenTian :: PR: #1551
- feat: add mistral4 recipe by @HuiyingLi :: PR: #1556
- fix: add dynamic=True to Float32RMSNorm by @akoumpa :: PR: #1555
- ci: Updating testing path to /opt/Automodel, update codecov settings by @thomasdhc :: PR: #1544
- feat: MFU logging in train recipes by @SwekeR-463 :: PR: #1413
- fix: GPT-OSS MoE aux_loss softmax and remove torch.compile from _apply_bias by @hemildesai :: PR: #1559
- ci: Add claude code review by @thomasdhc :: PR: #1545
- fix: lora test by @akoumpa :: PR: #1561
- ci: Update permissions for claude review workflow by @thomasdhc :: PR: #1562
- fix: fall back to HF for Mistral3 VLMs with non-Mistral4 text backbone by @HuiyingLi :: PR: #1557
- feat: input validation & model capability by @akoumpa :: PR: #1542
- fix: kd inference mode by @akoumpa :: PR: #1567
- fix: seq cls trainer by @akoumpa :: PR: #1564
- fix: enable Phi-4-multimodal-instruct VLM finetuning by @HuiyingLi :: PR: #1552
- docs: add navigation table by @akoumpa :: PR: #1573
- fix: patch missing mock in meta-tensor retry test by @HuiyingLi :: PR: #1575
- feat: VDR feedback: Common inference utility by @pthombre :: PR: #1491
- ci: Fix sso user check by @chtruong814 :: PR: #1578
- feat: Add context parallel support for Qwen3.5 MoE by @zpqiu :: PR: #1560
- docs: update finetune guide by @akoumpa :: PR: #1548
- ci: Update coverage path and fix coverage upload by @thomasdhc :: PR: #1582
- fix: Nemotron v3 inputs_embeds generation by @pzelasko :: PR: #1583
- fix: checkpointing for PEFT. by @akoumpa :: PR: #1576
- ci: Move source install fla to dev group by @thomasdhc :: PR: #1580
- fix: register kimi_k25 and kimi_vl configs eagerly in lazy registry by @HuiyingLi :: PR: #1579
- feat: add pipeline parallelism support for knowledge distillation by @Separius :: PR: #1500
- perf: simplify Qwen3-VL-MoE state_dict_adapter + use torch hf reader by @hemildesai :: PR: #1570
- docs: Add docs about diffusion support in AM by @pthombre :: PR: #1495
- docs: merge tables by @akoumpa :: PR: #1587
- fix: remove in-place change model config by @yuki-97 :: PR: #1595
- ci: add @pthombre to codeowners by @akoumpa :: PR: #1588
- perf: simplify Qwen3.5-MoE state_dict_adapter + DTensor passthrough by @HuiyingLi :: PR: #1589
- fix: convert DTensor biases to local in MoE _forward_loop by @hemildesai :: PR: #1565
- fix: narrow model.to(device) skip to checkpoint-loaded path only by @hemildesai :: PR: #1597
- fix: fix tp plan lookup by @yuki-97 :: PR: #1600
- ci: upgrade GitHub Actions for Node.js 24 compatibility by @ko3n1g :: PR: #1593
- ci: Add ci_tests to tests folder by @thomasdhc :: PR: #1596
- fix: resolve deadlock saving diffusion checkpoints in safetensors format by @adil-a :: PR: #1601
- fix: Remove duplicate keys in recipes by @thomasdhc :: PR: #1605
- ci: Update llm_finetune recipes for ci by @thomasdhc :: PR: #1608
- fix: resolve TP+PP for nemotron super 49B by @HuiyingLi :: PR: #1607
- ci: Update vllm_finetune ci config by @thomasdhc :: PR: #1611
- feat: add Tulu-3 E2E convergence pipeline by @hemildesai :: PR: #1554
- feat: add SkyPilot as a cloud execution backend for AutoModel by @Anakintano :: PR: #1590
- ci: Set upperbound for transformers by @thomasdhc :: PR: #1615
- ci: Enable CI variables for changing lint runner and container by @chtruong814 :: PR: #1619
- docs: update coverage doc by @HuiyingLi :: PR: #1609
- feat: add reranker training by @adil-a :: PR: #1449
- fix: fix NemotronHForCausalLM force_hf=True by @yuki-97 :: PR: #1625
- fix: fix gradient_checkpointing overhead in transformers 5.3 by @yuki-97 :: PR: #1621
- feat: Migrate diffusion recipe to use Stateful Dataloader by @pthombre :: PR: #1630
- feat: add Nemotron Nano 4B SQuAD finetune recipe by @davidoneilai :: PR: #1624
- feat: Ensure that diffusion training jobs use the safetensors checkpoint format by @pthombre :: PR: #1627
- ci: Pass argument automodel dir for transformer version check by @thomasdhc :: PR: #1617
- fix: from_pretrained with nested kwargs (e.g. text_config) crashes on VLM models by @zpqiu :: PR: #1623
- feat: add hybridep by @hemildesai :: PR: #1333
- fix: tied embedding v4 to v5 by @akoumpa :: PR: #1631
- ci: Add deleted files explicitly in coverage omit by @thomasdhc :: PR: #1637
- feat: add AGENTS.md by @akoumpa :: PR: #1638
- fix: remove redundant _keep_in_fp32_modules for layer norms in GptOssForCausalLM by @stanley1208 :: PR: #1633
- cp: feat: VLM pretokenized data pipeline with neat packing by @HuiyingLi :: PR: #1618
- refactor: CLI app and launching by @akoumpa :: PR: #1406
- feat: add missing recipe in yaml by @akoumpa :: PR: #1642
- ci: Update run time for nemotron super ci by @thomasdhc :: PR: #1614
- ci: Update mistral4 medpix ci run time by @thomasdhc :: PR: #1646
- fix: skip initialize_weights for Phi3ForCausalLM with TP sharding by @adil-a :: PR: #1648
- feat: add UCCL-EP as alternative dispatcher for expert parallelism by @hemildesai :: PR: #1635
- feat: enable TE Linear layers for PEFT/LoRA by @adil-a :: PR: #1626
- fix: Float32RMSNorm torch.compile crash on PyTorch 2.11+ by @hemildesai :: PR: #1650
- feat: GPT-OSS 20B and Moonlight 16B convergence results by @hemildesai :: PR: #1577
- docs: add gemma4 tutorial by @HuiyingLi :: PR: #1657
- feat: add gemma4 configs by @HuiyingLi :: PR: #1658
- fix: move skills to .claude/skills by @akoumpa :: PR: #1662
- feat: add gemma 4 by @HuiyingLi :: PR: #1660
- ci: Add recipe golden values by @thomasdhc :: PR: #1647
- fix: link in readme. by @akoumpa :: PR: #1664
- feat: add HybridEP example config for Qwen3-30B-A3B by @hemildesai :: PR: #1666
- fix: Mistral4 FP8 dequant on multi-dim mesh by @HuiyingLi :: PR: #1594
- fix: Finetune DeepSeek V3 (issue #1496) by @JiwaniZakir :: PR: #1654
- fix: update gemma4 configs and doc with correct model IDs by @HuiyingLi :: PR: #1670
- feat: Add discrete diffusion LLM (dLLM) supervised fine-tuning support by @zyzhou5 :: PR: #1665
- feat: context-parallel with nemotron v3 by @adil-a :: PR: #1441
- chore(beep boop π€): bump FW-CI-templates workflow pins to v0.88.0 by @svcnvidia-nemo-ci :: PR: #1669
- feat: add cp2 convergence configs and eval fixes by @hemildesai :: PR: #1602
- fix: add tp plan for phi2 by @akoumpa :: PR: #1674
- fix: move .claude/skills to skills by @akoumpa :: PR: #1673
- feat: add reasoning_content and tool-calling support to ChatDataset by @zeel2104 :: PR: #1644
- feat: Add Llada SFT support by @pthombre :: PR: #1672
- docs: update the finetune guide by @akoumpa :: PR: #1678
- fix: add best_metric_key field to CheckpointingConfig dataclass by @stanley1208 :: PR: #1641
- ci: Set target version for ruff by @thomasdhc :: PR: #1636
- docs: add per-model pages by @akoumpa :: PR: #1683
- ci: Add code freeze workflow by @thomasdhc :: PR: #1688
- feat: Allow to conditionally skip malformed jsonl lines when loading dataset by @titu1994 :: PR: #1694
- feat: Add dllm generation support by @pthombre :: PR: #1692
- feat: implement NEFTune noisy embeddings for instruction fine-tuning by @stanley1208 :: PR: #1686
- feat: enable packed sequences for Qwen3.5-MoE with EP+PP by @HuiyingLi :: PR: #1685
- fix: swap DTensor shard placements after transpose in Step3p5 state dict adapter by @adil-a :: PR: #1691
- ci: Associate recipe owners by @thomasdhc :: PR: #1690
- test: add checkpoint robustness functional tests by @adil-a :: PR: #1606
- ci: Remove duplicate ci config by @thomasdhc :: PR: #1702
- fix: freeze dead KV-sharing params to fix checkpoint resume by @HuiyingLi :: PR: #1698
- ci: Update version to 0.4.0 by @thomasdhc :: PR: #1703
- cp: feat: integrate NeMo-Run launcher (#1668) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1706
- cp: fix: mute warning spam (#1721) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1724
- cp: (#1696) into r0.4.0 by @adil-a :: PR: #1732
- cp: fix: handle dict-typed chat_template (#1696) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1716
- cp: feat: adding lora to diffusion (#1653) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1708
- cp: feat: MoE model benchmarks, LoRA configs (#1676) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1707
- cp: fix: fixing the pooling error (#1645) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1736
- cp: ci: Address timeout in ci tests (#1733) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1738
- cp: test: Checkpoint robustness skips atexit-registered destroy_process_group() (#1730) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1739
- cp: fix: Qwen3.5 dense CP support (#1710) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1737
- cp: feat: Add lora recipes for gemma4 (#1731) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1741
- cp: test: add vLLM deployment tests into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1745
- cp: build: drop rc0 pre-release tag and add dynamic git versioning (#1729) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1756
- cp: fix: Baichuan2 checkpoint robustness test CI failures (#1727) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1754
- cp: ci: Address container and source code cve (#1753) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1758
- cp: ci: Update test timeout and add ci_tests readme (#1752) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1759
- cp: fix: Update lora configs for gemma4 (#1748) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1749
- cp: fix: launcher option from being a config override (#1766) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1772
- cp: fix: skip embedding[padding_idx] = 0 with TP (#1675) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1771
- cp: ci: add missing recipe owners (#1775) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1776
- cp: ci: Resolve cve and remove uv cache (#1774) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1778
- cp: FSDP2 w weight prefetching and async TP optimization (#1711) by @ZhiyuLi-Nvidia :: PR: #1779
- cp: fix: update yamls for vllm_deploy (#1780) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1781
- cp: docs: Add nightly CI test summary for LLM and VLM finetune configs (#1791) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1792
- cp: fix: Add per-tensor conv. in gemma4 sd adapter (#1764) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1795
- cp: feat: Enable CI benchmark with {llm,vlm}_benchmark (#1793) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1794
- cp: fix: NotImplementedError: aten::equal on meta tensors during multi-GPU init (#1769) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1797
- cp: fix: Restrict auto-discovery scopes in generate_ci_tests.py (#1805) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1812
- cp: ci: RC6 timeout fixes for release test recipes (#1801) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1816
- cp: ci: Increase benchmark timeout for GLM and Qwen3.5 MoE LoRA recipes (#1818) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1819
- cp: fix: meta init with force_hf=True (#1810) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1822
- cp: fix: enable dequantization for ministral3 and dataset limit (#1807) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1820
- cp: chore: Update GPT-OSS and Qwen3 recipe configs (#1811) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1815
- cp: fix: tie weights outside _init_model (#1817) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1829
- cp: fix: Align benchmark TEST_LEVEL check with generate_ci_tests scope (#1831) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1832
- cp: fix: gpt_oss_20b_single_gpu_peft CI crash with nproc_per_node override (#1835) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1844
- cp: fix: stop resolve_yaml_env_vars from scanning runtime data in instantiate() (#1827) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1836
- cp: fix: Re-apply PyTorch dependency overrides after full COPY in Dockerfile (#1847) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1848
- cp: fix: install ffmpeg and rebuild torchcodec for phi4mm audio decoding (#1826) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1849
- cp: fix: rotary embeddings for v4 (#1821) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1851
- fix: FSDP2 meta-device crash for Qwen3.5 GatedDeltaNet fp32 params by @HuiyingLi :: PR: #1813
- cp: fix: relax checkpoint robustness HF KL threshold for nemotron_nano_8b_v1 (#1839) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1855
- cp: fix: trust_remote_code guard in robustness test (#1845) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1857
- cp: ci: add NMP customizer contract test configs (#1712) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1858
- cp: fix: pre-cache HF dynamic modules to prevent filesystem race in robustness test (#1840) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1856
- cp: ci: Update to transformers v5.5 (#1734) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1854
- cp: ci: Reduce default finetune step count from 100 to 50 (#1874) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1876
- cp: fix: gpt oss ci (#1877) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1878
- cp: fix: Setup vllm testing with uv --no-config (#1875) by @thomasdhc :: PR: #1881
- cp: fix: Skip snapshot_download when HF_HUB_OFFLINE=1 (#1834) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1880
- cp: fix: Allow use_cache w/ activation_checkpointing (#1726) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1760
- cp: fix: handle transformers.FineGrainedFP8Config quantization config (#1864) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1888
- cp: fix: Create diffusion_kernels group to fix HF_HUB_OFFLINE compatibility (#1842) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1886
- cp: fix: relax KL thresholds and remove invalid kwargs in Qwen3Next linear attn (#1867) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1899
- cp: ci: Add Dockerfile.deploy for deploy test environment (#1804) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1902
- cp: fix: Fix bug in diffusion generation (#1850) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1900
- cp: fix: pass unnormalized residual to MoE gate in Gemma4 decoder layer (#1895) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1903
- cp: fix(gemma4_moe): vision-aware mask when use_bidirectional_attention==vision (#1905) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1907
- cp: fix: gradient checkpointing broken for MoE models on single GPU (ep_size=1) (#1873) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1879
- cp: fix: baichuan dynamic cache (#1865) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1909
- cp: fix Qwen3.5+Phi4MM CI after transformers v5.5 update (#1906) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1908
- cp: fix: Coerce plain-dict backend to BackendConfig in model init (#1784) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1803
- cp: chore: move recipes to have perf CI/CD coverage (#1885) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1897
- cp: resolve VLM CI failures for PP recipes and collate_fn (#1799) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1889
- cp: fix: Update recipe_owner for gemma4 (#1925) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1926
- cp: fix: update defer_fsdp_grad_sync in recipes (#1919) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1930
- cp: fix: chat dataset (#1921) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1931
- cp: fix: make _get_logits pp aware in ckpt robustness (#1923) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1934
- cp: fix: disable packed sequences for nemotron_nano_4b_squad (#1929) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1935
- cp: ci: Add test_recipes for custom test scope (#1915) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1954
- cp: fix: Update recipe test time based on release test run (#1955) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1956
- cp: fix: AC silently skipped on all registered VLMs – flatten ModuleList (#1941) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1958
- cp: docs: Add container version to docs version picker (#1965) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1966
- fix: Step-3.5-Flash layer_types mismatch and related recipe fixes (#1… by @akoumpa :: PR: #1936
- cp: fix: Patch wandb-core Go CVEs: bump otel SDK, add go-jose (#1957) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1964
- cp: chore: add @zyzhou5 and @athitten to codeowners (#1968) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1969
- cp: fix: Update gemma4 26b ci timeout (#1962) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1967
- cp: fix: qlora ckpt loading (#1549) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1920
- fix: ministral tp plan (#1963) by @akoumpa :: PR: #1974
- cp: fix(vlm): qwen3_5_4b_neat_packing OOM - reduce seqlen to 4096 (#1975) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1977
- cp: fix: batch ckpt-robustness fixes for pipeline 48953745 (supersedes 9 PRs) (#1971) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1979
- cp: fix: nemotron flash (#1973) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1978
- cp: fix(devstral): point 24B Squad recipes at official FP8 model (#1980) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1982
- cp: fix: vllm deploy test should fail if vllm is not present (#1987) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1988
- cp: fix: Move benchmark recipe out of llm_finetune nightly (#1989) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1990
- cp: fix: Address ci timeout test from rc8 (#1991) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1993
- cp: ci: Support per-recipe env_vars in CI config (#1999) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2000
- cp: fix: Address pillow CVE (#1994) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #1998
- cp: ci: Update test recipe list (#2001) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2002
- cp: fix: batch Flash 1B + Super-49B PEFT + qwen2.5-7B ckpt-robustness (#1984) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2008
- cp: fix: change drop_long_samples to True by default (#2009) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2014
- cp: fix: transformers v5.5.0 validation (#2010) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2013
- cp: ci: add --tb=short to pytest invocations in CI test scripts (#2018) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2019
- cp: fix: switch from match_all_linear to target_modules (#2022) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2023
- cp: fix: add discover pp seq len (#2024) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2026
- cp: fix: regression in tokenizer+auto_map with transformers 5.5.0 (#2025) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2029
- cp: ci: add known_issue_id / allow_failure keys + triage (#2028) to r0.4.0 by @thomasdhc :: PR: #2033
- cp: fix: gradient clip with torch_mm + EP (gpt-oss 120b recipe) (#2012) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2035
- Cherry-pick #1728 to r0.4.0 (Qwen refs removed) by @pthombre :: PR: #2031
- cp: ci: triage pipeline benchmark failures (#2040) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2044
- cp: fix: lora checkpointing (#2037) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2046
- cp: ci: triage rc9 finetune failures (#2043) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2049
- cp: ci: triage vllm_deploy rc9 failures (#2047) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2050
- ci: cherry-pick #2048 (LoRA nightly tests) to r0.4.0 by @pthombre :: PR: #2051
- cp: ci: Update base container pillow version for cve (#2065) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2066
- cp: docs: Bump docs version (#2073) into r0.4.0 by @svcnvidia-nemo-ci :: PR: #2074
- docs: Update docs version to 0.4.0 by @chtruong814 :: PR: #2075