
NVIDIA NeMo-Automodel 0.4.0

@svcnvidia-nemo-ci released this 28 Apr 20:04
b651aa8

Release Notes

  • Highlights

    • Expanded VLM line-up: Gemma 4, Mistral Small 4, Qwen3.5 VL
    • Diffusion and discrete-diffusion LLM (new tracks)
    • NeMo Retriever – bi-encoder + cross-encoder / reranker
    • Knowledge Distillation scaled to TP > 1 and PP (Sepehr Sameni)
    • MoE infrastructure deepening – UCCL-EP, HybridEP, grouped_mm
    • SkyPilot launcher backend (Aditya Saxena, community)
    • End-to-end checkpoint + convergence robustness framework
  • Model Support – newly supported families in r0.4.0

    • LLM
    • VLM / OMNI
      • Gemma 4 family – 2B, 4B, 31B, 26B-A4B MoE (#1658, #1660, #1731)
      • Mistral Small 4 (#1556)
      • Qwen3.5 VL dense – 4B, 9B (#1427)
      • Qwen3.5 VL MoE – 35B (#1373)
    • Diffusion
      • Flux T2I, Hunyuan T2V, Wan 2.1 T2V (see "Diffusion" section)
    • Discrete diffusion LLM
      • LLaDA (see "Discrete Diffusion LLM" section)
  • Diffusion – new track in r0.4.0

    • HuggingFace Diffuser integration
    • r0.4.0 ships full pretrain / finetune / generate pipelines with LoRA support for diffusion models (T2V, T2I)
    • Wan integrated with multi-resolution DataLoader (#1475)
    • Inference utility for diffusion (#1491)
    • LoRA for diffusion (#1653, Linnan Wang; see the sketch after this list)
    • Diffusion processor registry (#1379)
    • Models / recipes shipped
      • Flux T2I – pretrain, SFT, LoRA, generate
      • Hunyuan T2V – SFT, LoRA, generate
      • Wan 2.1 T2V – pretrain, SFT, LoRA, generate
    • Documentation guides for dataset preprocessing and finetuning.
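
The LoRA support above applies the same low-rank-update idea used for LLM adapters to diffusion transformers. A minimal PyTorch sketch of that idea; the class and hyperparameter names here are illustrative, not NeMo-Automodel's actual API:

```python
# Illustrative only: how a LoRA adapter augments a frozen linear layer
# inside a diffusion transformer. Class and argument names are hypothetical.
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weight
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```
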
  • Discrete Diffusion LLM (dLLM) – new track in r0.4.0

    • Discrete diffusion LLM SFT support added (#1665)
    • LLaDA SFT recipe (#1672)
    • dLLM generation pipeline (#1692; sketched below)
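
A dLLM such as LLaDA generates by starting from a fully masked sequence and unmasking the most confident predictions over several rounds, rather than decoding left to right. A conceptual sketch, assuming a hypothetical model(ids) that returns per-token logits (this is not the actual #1692 pipeline):

```python
import torch

def dllm_generate(model, mask_id: int, seq_len: int, steps: int = 8):
    """Mask-predict decoding sketch: start fully masked, unmask in rounds."""
    ids = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(ids)                          # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        still_masked = ids.eq(mask_id)
        conf = conf.masked_fill(~still_masked, -1.0) # rank only masked slots
        # Unmask the most confident positions this round.
        k = max(1, int(still_masked.sum()) // (steps - step))
        top = conf.topk(k, dim=-1).indices
        ids.scatter_(1, top, pred.gather(1, top))
    return ids
```
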
  • NeMo Retriever (bi-encoder + cross-encoder)

    • Refactored cross-encoder / reranker training loop, new in r0.4.0 (#1449)
    • Bi-encoder datasets can be loaded directly from the HuggingFace Hub (#1380)
    • Bi-encoder masking + consistent attn_implementation default (#1349)
    • Resolve retrieval dataset corpus paths relative to training file (#1367)
    • Docs: docs/guides/retrieval/finetune.md (a sketch of the bi-encoder objective follows this list)
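
For context, bi-encoder retrievers are typically trained with an in-batch contrastive (InfoNCE) objective, where each query's positive document sits at its own batch index and all other documents serve as negatives. A generic sketch of that objective (not the refactored training loop itself):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, temperature: float = 0.05):
    """InfoNCE: each query's positive is the doc at the same batch index."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    scores = q @ d.T / temperature           # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)   # off-diagonal docs = negatives
```
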
  • Knowledge Distillation – Sepehr Sameni

    • Enable TP > 1 in KD (#1297)
    • TP-aware KDLoss with distributed softmax + T² scaling (#1499; see the sketch after this list)
    • Pipeline-parallelism support for KD (#1500)
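
The T² factor in the KDLoss above is the standard Hinton-style correction: softening logits by temperature T shrinks gradients by roughly 1/T², so the loss is scaled back up to keep magnitudes comparable. A single-GPU sketch of the math that the TP-aware version computes with a distributed softmax:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    # T**2 undoes the ~1/T**2 gradient shrinkage from softening the logits.
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```
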
  • Parallelism / Performance / Train-loop

    • FSDP2
      • FSDP2 weight prefetching + async TP optimization (#1711)
    • Context Parallel
      • Qwen3.5 dense & MoE CP (#1710, #1560 – alexchiu / Zhaopeng Qiu)
      • Mamba CP for hybrid Nemotron v3 (#1441)
      • 3D mRoPE position_ids sharding under CP (#1482)
      • CP attention-mask hooks for dense / non-TE (#1470)
    • Pipeline Parallel
      • PP shape-inference optimization + pp_seq_len field in PipelineConfig (#1195, #1390)
      • Variable-length sequence support for PP (#1689 – Zhiqi Li & Hemil Desai)
    • Activation checkpointing
      • gradient_checkpointing overhead reduction in transformers 5.3 (#1621 – Yuki Huang)
    • MoE infrastructure
      • UCCL-EP alternative dispatcher (#1635 – Zhiqi Li & Hemil Desai)
      • HybridEP (#1333, #1666)
      • DeepEP-on-H100 RDMA fallback detection (#1275 – Piotr Żelasko)
      • torch._grouped_mm expert backend (#1228)
      • TE FusedAdam QuantizedTensor compatibility patch (#1417)
      • MoE LoRA rank scaling + torch_mm path (#1300, #1392)
      • Expert / diversity metrics (#1232, #1506), top-k utilization (#1418)
      • Packed sequences for MoE with EP+PP (#1685)
    • FlashOptim integration (#1492)
    • Scheduler-driven Python GC (#1391; see the sketch after this list)
    • fp32 RMSNorm backend + cast_model_to_dtype for improved stability (#1493)
    • Native Comet ML experiment tracking (#1411, Logan Vegna, community)
    • Added .generate() with KV-cache for Nemotron v3 (#1332, Piotr Żelasko)
    • Added output_hidden_states for NemotronHForCausalLM (#1386, Desh Raj)
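
On the scheduler-driven GC item: letting Python's collector fire mid-step at different times on different ranks causes jitter, so collection is instead triggered on a fixed step cadence. A minimal sketch of the idea (the class and interval below are hypothetical, not the #1391 implementation):

```python
import gc

class GCScheduler:
    """Hypothetical sketch: disable automatic collection and collect on a
    fixed step cadence so all ranks pause for GC at the same point."""
    def __init__(self, every_n_steps: int = 100):
        self.every = every_n_steps
        gc.disable()                 # stop nondeterministic automatic GC

    def step(self, global_step: int):
        if global_step > 0 and global_step % self.every == 0:
            gc.collect()
```
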
  • Launcher & CLI

    • SkyPilot backend (#1590 – Aditya Saxena, community contributor)
    • CLI app + launching refactor (#1406)
      • Shim scripts under examples/ will be deprecated post 26.04.
    • Launcher CLI flags no longer leak into recipe YAML overrides (#1766)
    • MFU logging in train recipes (#1413 – SwekeR, community; see the sketch below)
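
MFU (model FLOPs utilization) compares achieved training throughput against the hardware's peak. A common approximation for dense transformers, sketched below; the 6·N FLOPs-per-token estimate and the example numbers are illustrative:

```python
def mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Approximate MFU for a dense transformer: ~6 FLOPs per parameter per
    token covers forward (2) plus backward (4) passes."""
    return (6.0 * n_params * tokens_per_sec) / peak_flops

# e.g. an 8B model at 12k tokens/s on hardware with 989e12 peak BF16 FLOP/s:
print(f"{mfu(8e9, 12_000, 989e12):.1%}")   # ~58.2%
```
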
  • Checkpoint and convergence robustness

    • Checkpointing: End-to-end finetune → vLLM-deploy testing (#1606)
      • Models covered:
        • Gemma 3
        • Nemotron (Flash 1B, Super v3, Nano 9B, Nano v3)
        • Phi 4, Llama 3.2, Qwen 2.5
        • Qwen 3 MoE, GPT-OSS
      • What this catches: prediction divergence, packaging gaps, vLLM loading issues.
    • Convergence harness (#1554, #1577, #1602)
      • Pipeline: Tulu-3 data prep → model verification → training → eval
      • Models covered:
        • GPT-OSS 20B (FlashAdamW + TE FusedAdam).
        • Moonlight 16B (3 configs incl. EP8+CP2).
        • Qwen3 4B (3 configs incl. CP1/CP2 variants).
        • Qwen3 MoE 30B (2 configs + experiments/).
  • Datasets

    • Neat packing (greedy knapsack) for LLM and VLM (#1485 – Zhiqi Li; see the sketch after this list)
    • Pretokenization support for VLM (Zhiqi Li)
    • MultiImage dataset support for Qwen family (Zhiqi Li)
    • Qwen family video training support (Zhiqi Li)
    • LengthGroupedSampler (#1618 – Zhiqi Li)
    • Chat datasets THD/BSHD + CP, padding fixes (#1416).
    • reasoning_content + tool-calling support in ChatDataset (#1644, Zeel Desai, community).
    • Custom chat_template override for VLM finetuning (#1525, Bambuuai, community).
    • NEFTune noisy embeddings (#1686, stanley1208, community).
    • JSONL malformed-line skip (#1694, Somshubra Majumdar).
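
Neat packing reduces padding waste by packing variable-length samples into fixed-capacity bins. A minimal first-fit-decreasing sketch of the greedy-knapsack idea (illustrative, not the #1485 implementation):

```python
def greedy_knapsack_pack(lengths, max_len):
    """First-fit-decreasing: sort samples by length, place each into the
    first bin with remaining capacity. Returns sample indices per bin."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        for b, load in enumerate(loads):
            if load + lengths[i] <= max_len:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:                        # no existing bin fits: open a new one
            bins.append([i])
            loads.append(lengths[i])
    return bins

# e.g. greedy_knapsack_pack([900, 300, 700, 100], 1024) -> [[0, 3], [2, 1]]
```
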
  • Documentation

    • Per-model coverage pages (#1683).
    • Diffusion docs (#1495).
    • Gemma 4 tutorial (#1657).
    • Nemotron Parse fine-tuning notebook + assets (#1655, Krishna Kalyan).
    • Finetune-process + container-usage docs (#1484, Krishna Kalyan).
    • MLflow/Databricks docs (#1170, Andrei Onel).
  • Contributions – we are grateful for all contributions 🙇

    • Khazzz1c
      • Optimized resolve_yaml_env_vars to avoid scanning runtime data in instantiate() (#1827)
      • Additional contributions in r0.5.0.
    • Logan Vegna: added native Comet ML experiment tracking support (#1411).
    • Harsha Pasham: fixed error with aten::equal operator on meta tensors (#1769).
    • Aditya Saxena: added SkyPilot support (#1590).
    • SwekeR-463:
      • Added MFU logging in train recipes (#1413).
      • Added embeddings utility functions for 15 models (#1288).
    • stanley1208
      • Implemented NEFTune noisy embeddings for fine-tuning (#1686).
      • Added best_metric_key field in CheckpointingConfig (#1641).
    • Zeel Desai
      • Added reasoning_content and tool-calling support to ChatDataset (#1644).
      • Additional contributions in the next release.
    • Bambuuai: enabled custom chat_template override for VLM fine-tuning (#1525).
    • Zakir Jiwani: fixed an instantiation issue in YAML parsing (issue #1496) (#1654).
  • Known Issues

    • Minor memory regression in cohere_command_r_7b_hellaswag_fp8 and glm_4_9b_chat_hf_hellaswag_fp8
    • Qwen3_5_4b_neat_packing hangs during checkpoint saving
    • MegatronFSDP support postponed to 26.06
    • ~2% of checkpoint loads currently exercise a less-optimized path; this is being addressed in follow-up work.
Changelog Details