
NVIDIA NeMo-Automodel 0.4.0

@svcnvidia-nemo-ci released this 28 Apr 20:04
b651aa8

Release Notes

  • Highlights

    • Expanded VLM line-up: Gemma 4, Mistral Small 4, Qwen3.5 VL
    • Diffusion and discrete-diffusion LLM (new tracks)
    • NeMo Retriever – bi-encoder + cross-encoder / reranker
    • Knowledge Distillation scaled to TP > 1 and PP (Sepehr Sameni)
    • MoE infrastructure deepening – UCCL-EP, HybridEP, grouped_mm
    • SkyPilot launcher backend (Aditya Saxena, community)
    • End-to-end checkpoint + convergence robustness framework
  • Model Support – newly supported families in r0.4.0

    • LLM
    • VLM / OMNI
      • Gemma 4 family – 2B, 4B, 31B, 26B-A4B MoE (#1658, #1660, #1731)
      • Mistral Small 4 (#1556)
      • Qwen3.5 VL dense – 4B, 9B (#1427)
      • Qwen3.5 VL MoE – 35B (#1373)
    • Diffusion
      • Flux T2I, Hunyuan T2V, Wan 2.1 T2V (see "Diffusion" section)
    • Discrete diffusion LLM
      • LLaDA (see "Discrete Diffusion LLM" section)
  • Diffusion – new track in r0.4.0

    • HuggingFace Diffuser integration
    • r0.4.0 ships full pretrain / finetune / generate pipelines with LoRA support for diffusion models (T2V, T2I)
    • Wan integrated with multi-resolution DataLoader (#1475)
    • Inference utility for diffusion (#1491)
    • LoRA for diffusion (#1653, Linnan Wang; see the sketch after this list)
    • Diffusion processor registry (#1379)
    • Models / recipes shipped
      • Flux T2I – pretrain, SFT, LoRA, generate
      • Hunyuan T2V – SFT, LoRA, generate
      • Wan 2.1 T2V – pretrain, SFT, LoRA, generate
    • Documentation guides for dataset preprocessing and finetuning.
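
The LoRA support above applies the same low-rank-update idea used for LLM adapters to diffusion transformers. A minimal PyTorch sketch of that idea; the class and hyperparameter names here are illustrative, not NeMo-Automodel's actual API:

```python
# Illustrative only: how a LoRA adapter augments a frozen linear layer
# inside a diffusion transformer. Class and argument names are hypothetical.
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pretrained weight
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```
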
  • Discrete Diffusion LLM (dLLM) – new track in r0.4.0

    • Discrete diffusion LLM SFT support added (#1665)
    • LLaDA SFT recipe (#1672)
    • dLLM generation pipeline (#1692; sketched below)
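
A dLLM such as LLaDA generates by starting from a fully masked sequence and unmasking the most confident predictions over several rounds, rather than decoding left to right. A conceptual sketch, assuming a hypothetical model(ids) that returns per-token logits (this is not the actual #1692 pipeline):

```python
import torch

def dllm_generate(model, mask_id: int, seq_len: int, steps: int = 8):
    """Mask-predict decoding sketch: start fully masked, unmask in rounds."""
    ids = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = model(ids)                          # (1, seq_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        still_masked = ids.eq(mask_id)
        conf = conf.masked_fill(~still_masked, -1.0) # rank only masked slots
        # Unmask the most confident positions this round.
        k = max(1, int(still_masked.sum()) // (steps - step))
        top = conf.topk(k, dim=-1).indices
        ids.scatter_(1, top, pred.gather(1, top))
    return ids
```
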
  • NeMo Retriever (bi-encoder + cross-encoder)

    • Refactored cross-encoder / reranker training loop, new in r0.4.0 (#1449)
    • Bi-encoder datasets can be loaded directly from the HuggingFace Hub (#1380)
    • Bi-encoder masking + consistent attn_implementation default (#1349)
    • Resolve retrieval dataset corpus paths relative to training file (#1367)
    • Docs: docs/guides/retrieval/finetune.md (a sketch of the bi-encoder objective follows this list)
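
For context, bi-encoder retrievers are typically trained with an in-batch contrastive (InfoNCE) objective, where each query's positive document sits at its own batch index and all other documents serve as negatives. A generic sketch of that objective (not the refactored training loop itself):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, temperature: float = 0.05):
    """InfoNCE: each query's positive is the doc at the same batch index."""
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    scores = q @ d.T / temperature           # (batch, batch) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(scores, labels)   # off-diagonal docs = negatives
```
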
  • Knowledge Distillation – Sepehr Sameni

    • Enable TP > 1 in KD (#1297)
    • TP-aware KDLoss with distributed softmax + T² scaling (#1499; see the sketch after this list)
    • Pipeline-parallelism support for KD (#1500)
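
The T² factor in the KDLoss above is the standard Hinton-style correction: softening logits by temperature T shrinks gradients by roughly 1/T², so the loss is scaled back up to keep magnitudes comparable. A single-GPU sketch of the math that the TP-aware version computes with a distributed softmax:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    # T**2 undoes the ~1/T**2 gradient shrinkage from softening the logits.
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```
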
  • Parallelism / Performance / Train-loop

    • FSDP2
      • FSDP2 weight prefetching + async TP optimization (#1711)
    • Context Parallel
      • Qwen3.5 dense & MoE CP (#1710, #1560 – alexchiu / Zhaopeng Qiu)
      • Mamba CP for hybrid Nemotron v3 (#1441)
      • 3D mRoPE position_ids sharding under CP (#1482)
      • CP attention-mask hooks for dense / non-TE (#1470)
    • Pipeline Parallel
      • PP shape-inference optimization + pp_seq_len field in PipelineConfig (#1195, #1390)
      • Variable-length sequence support for PP (#1689 – Zhiqi Li & Hemil Desai)
    • Activation checkpointing
      • gradient_checkpointing overhead reduction in transformers 5.3 (#1621 – Yuki Huang)
    • MoE infrastructure
      • UCCL-EP alternative dispatcher (#1635 – Zhiqi Li & Hemil Desai)
      • HybridEP (#1333, #1666)
      • DeepEP-on-H100 RDMA fallback detection (#1275 – Piotr Żelasko)
      • torch._grouped_mm expert backend (#1228)
      • TE FusedAdam QuantizedTensor compatibility patch (#1417)
      • MoE LoRA rank scaling + torch_mm path (#1300, #1392)
      • Expert / diversity metrics (#1232, #1506), top-k utilization (#1418)
      • Packed sequences for MoE with EP+PP (#1685)
    • FlashOptim integration (#1492)
    • Scheduler-driven Python GC (#1391; see the sketch after this list)
    • fp32 RMSNorm backend + cast_model_to_dtype for improved stability (#1493)
    • Native Comet ML experiment tracking (#1411, Logan Vegna, community)
    • Added .generate() with KV-cache for Nemotron v3 (#1332, Piotr Żelasko)
    • Added output_hidden_states for NemotronHForCausalLM (#1386, Desh Raj)
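
On the scheduler-driven GC item: letting Python's collector fire mid-step at different times on different ranks causes jitter, so collection is instead triggered on a fixed step cadence. A minimal sketch of the idea (the class and interval below are hypothetical, not the #1391 implementation):

```python
import gc

class GCScheduler:
    """Hypothetical sketch: disable automatic collection and collect on a
    fixed step cadence so all ranks pause for GC at the same point."""
    def __init__(self, every_n_steps: int = 100):
        self.every = every_n_steps
        gc.disable()                 # stop nondeterministic automatic GC

    def step(self, global_step: int):
        if global_step > 0 and global_step % self.every == 0:
            gc.collect()
```
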
  • Launcher & CLI

    • SkyPilot backend (#1590 – Aditya Saxena, community contributor)
    • CLI app + launching refactor (#1406)
      • Shim scripts under examples/ will be deprecated post 26.04.
    • Launcher CLI flags no longer leak into recipe YAML overrides (#1766)
    • MFU logging in train recipes (#1413 – SwekeR, community; see the sketch below)
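
MFU (model FLOPs utilization) compares achieved training throughput against the hardware's peak. A common approximation for dense transformers, sketched below; the 6·N FLOPs-per-token estimate and the example numbers are illustrative:

```python
def mfu(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Approximate MFU for a dense transformer: ~6 FLOPs per parameter per
    token covers forward (2) plus backward (4) passes."""
    return (6.0 * n_params * tokens_per_sec) / peak_flops

# e.g. an 8B model at 12k tokens/s on hardware with 989e12 peak BF16 FLOP/s:
print(f"{mfu(8e9, 12_000, 989e12):.1%}")   # ~58.2%
```
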
  • Checkpoint and convergence robustness

    • Checkpointing: End-to-end finetune → vLLM-deploy testing (#1606)
      • Models covered:
        • Gemma 3
        • Nemotron (Flash 1B, Super v3, Nano 9B, Nano v3)
        • Phi 4, Llama 3.2, Qwen 2.5
        • Qwen 3 MoE, GPT-OSS
      • What this catches: prediction divergence, packaging gaps, vLLM loading issues.
    • Convergence harness (#1554, #1577, #1602)
      • Pipeline: Tulu-3 data prep → model verification → training → eval
      • Models covered:
        • GPT-OSS 20B (FlashAdamW + TE FusedAdam).
        • Moonlight 16B (3 configs incl. EP8+CP2).
        • Qwen3 4B (3 configs incl. CP1/CP2 variants).
        • Qwen3 MoE 30B (2 configs + experiments/).
  • Datasets

    • Neat packing (greedy knapsack) for LLM and VLM (#1485 – Zhiqi Li; see the sketch after this list)
    • Pretokenization support for VLM (Zhiqi Li)
    • MultiImage dataset support for Qwen family (Zhiqi Li)
    • Qwen family video training support (Zhiqi Li)
    • LengthGroupedSampler (#1618 – Zhiqi Li)
    • Chat datasets THD/BSHD + CP, padding fixes (#1416).
    • reasoning_content + tool-calling support in ChatDataset (#1644, Zeel Desai, community).
    • Custom chat_template override for VLM finetuning (#1525, Bambuuai, community).
    • NEFTune noisy embeddings (#1686, stanley1208, community).
    • JSONL malformed-line skip (#1694, Somshubra Majumdar).
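
Neat packing reduces padding waste by packing variable-length samples into fixed-capacity bins. A minimal first-fit-decreasing sketch of the greedy-knapsack idea (illustrative, not the #1485 implementation):

```python
def greedy_knapsack_pack(lengths, max_len):
    """First-fit-decreasing: sort samples by length, place each into the
    first bin with remaining capacity. Returns sample indices per bin."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, loads = [], []
    for i in order:
        for b, load in enumerate(loads):
            if load + lengths[i] <= max_len:
                bins[b].append(i)
                loads[b] += lengths[i]
                break
        else:                        # no existing bin fits: open a new one
            bins.append([i])
            loads.append(lengths[i])
    return bins

# e.g. greedy_knapsack_pack([900, 300, 700, 100], 1024) -> [[0, 3], [2, 1]]
```
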
  • Documentation

    • Per-model coverage pages (#1683).
    • Diffusion docs (#1495).
    • Gemma 4 tutorial (#1657).
    • Nemotron Parse fine-tuning notebook + assets (#1655, Krishna Kalyan).
    • Finetune-process + container-usage docs (#1484, Krishna Kalyan).
    • MLflow/Databricks docs (#1170, Andrei Onel).
  • Contributions – we are grateful for all contributions 🙇

    • Khazzz1c
      • Optimized resolve_yaml_env_vars to avoid scanning runtime data in instantiate() (#1827)
      • Additional contributions in r0.5.0.
    • Logan Vegna: added native Comet ML experiment tracking support (#1411).
    • Harsha Pasham: fixed error with aten::equal operator on meta tensors (#1769).
    • Aditya Saxena: added SkyPilot support (#1590).
    • SwekeR-463:
      • Added MFU logging in train recipes (#1413).
      • Added embeddings utility functions for 15 models (#1288).
    • stanley1208
      • Implemented NEFTune noisy embeddings for fine-tuning (#1686).
      • Added best_metric_key field in CheckpointingConfig (#1641).
    • Zeel Desai
      • Added reasoning_content and tool-calling support to ChatDataset (#1644).
      • Additional contributions in the next release.
    • Bambuuai: enabled custom chat_template override for VLM fine-tuning (#1525).
    • Zakir Jiwani: fixed an instantiation issue in YAML parsing (issue #1496) (#1654).
  • Known Issues

    • Minor memory regression in cohere_command_r_7b_hellaswag_fp8 and glm_4_9b_chat_hf_hellaswag_fp8
    • Qwen3_5_4b_neat_packing hangs during checkpoint saving
    • MegatronFSDP support postponed to 26.06
    • ~2% of checkpoint loads currently exercise a less-optimized path; this is being addressed in follow-up work.
Changelog Details