
Add main-branch SWE async RL benchmark with Qwen3.5 #2378

@anwithk

Description


Background

Bin investigated the async SWE RL benchmark path from #2049. The initial target was clarified as Qwen3-30B with r2r-gym training data and SWE-verified eval.

A scaled-down Super stage2 SWE2 setup was reproduced on 16 nodes, but the result was not suitable for v0.6. The experiment used Qwen3-30B-A3B-Instruct-2507 with reasoning disabled and switched to the hermes tool parser. SWE-verified eval showed step_25 performing worse than the base model on the 333 instances that completed; the remaining 167 instances failed due to Docker/image-pull/network issues rather than model errors.

The main issue appears to be a parser/template mismatch (a quick template check is sketched after this list):

  • Super SWE data/config expects qwen3_coder-style tool parsing.
  • The tested Qwen3 instruct model uses hermes and has no thinking path enabled.
  • Qwen3 thinking + tool calling currently has NeMo Gym parser/template compatibility issues.
  • Qwen3.5-35B is a better target, but it requires the newer vLLM and transformers support available on main.
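
One cheap pre-flight check for this class of mismatch is to render the chat template with a tool definition attached and inspect the markup it emits (hermes-style `<tool_call>` JSON vs. qwen3_coder-style tags) before launching any training. A minimal sketch using transformers; the model ID is the one tested above, and the `run_tests` tool is purely illustrative:

```python
from transformers import AutoTokenizer

# Model from the scaled-down run above; swap in the final target model.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

# Hypothetical tool definition, only used to exercise the template's tool markup.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the repository test suite and return failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "Fix the failing test in utils.py."}]

# Render without tokenizing and inspect how the template formats tool calls;
# this markup must match the tool parser configured on the inference side.
prompt = tok.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(prompt)
```

If the rendered markup and the configured parser disagree, tool calls will silently fail to parse during rollouts, which looks like model regression in eval.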

Goal

Create a clean main-branch SWE-style agentic RL benchmark that validates NeMo RL async training on long-horizon, multi-turn, tool-calling workloads, using Qwen3.5 or another compatible model.

The benchmark should produce a clear story:

{model} trained for {time} on {cluster config} improves from {baseline} to {result} on {benchmark}, with key training health metrics verified.

Scope

  • Add/clean up YAML configs and launch scripts for SWE-style async RL training on main.
  • Use a model/parser/template combination that supports both tool calling and reasoning where needed.
  • Validate compatibility across NeMo RL, NeMo Gym, tokenizer/chat template, tool parser, and reasoning parser.
  • Run downstream SWE-verified eval before and after training.
  • Track training health metrics:
    • reward curve
    • KL / logprob error (see the sketch after this list)
    • max turns
    • max sequence length
    • tool-call success/failure behavior
    • rollout throughput / training speed
  • Compare async vs sync performance if feasible.
  • Document required dependency versions, especially vLLM and transformers.
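
For the KL / logprob-error item above, a standard async-RL health check is to compare the rollout engine's per-token logprobs against the trainer's recomputation on the same tokens; a growing gap indicates policy staleness or a numerical mismatch between the inference and training stacks. A minimal sketch, assuming the per-token logprob tensors are already collected (function and tensor names are illustrative, not NeMo RL APIs):

```python
import torch

def logprob_health(rollout_logprobs: torch.Tensor,
                   trainer_logprobs: torch.Tensor,
                   response_mask: torch.Tensor) -> dict:
    """Compare per-token logprobs from the rollout engine (e.g. vLLM)
    with the trainer's recomputation on the same token ids.
    All tensors are [batch, seq]; response_mask is 1.0 on response tokens."""
    diff = (rollout_logprobs - trainer_logprobs) * response_mask
    n_tokens = response_mask.sum().clamp(min=1.0)
    return {
        # Should hover near zero in a healthy run; drift or spikes are red flags.
        "logprob_abs_err": (diff.abs().sum() / n_tokens).item(),
        # Sample-based estimate of KL(rollout || trainer) on the rollout's tokens.
        "approx_kl": (diff.sum() / n_tokens).item(),
        "logprob_max_err": diff.abs().max().item(),
    }
```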

Acceptance Criteria

  • A main-branch runnable SWE async RL recipe is added.
  • Training runs successfully with Qwen3.5 or a clearly justified compatible model.
  • SWE-verified eval reports baseline and trained-model results.
  • Training health metrics are captured and summarized.
  • The issue includes a recommendation on whether the benchmark is ready for docs/release mention.
  • Any parser/template/config limitations are documented.
  • Follow-up issues are filed for larger parser or config validation refactors if needed.

Open Questions

  • Should Qwen3.5-35B be the default target, despite the high baseline making reward improvement harder to show?
  • What is the minimum eval size needed to avoid high-variance conclusions? (A back-of-the-envelope sketch follows this list.)
  • Do we require async-vs-sync comparison for this milestone, or is a stable async baseline enough?
  • Should NeMo RL add joint validation for RL + Gym config compatibility?
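
On the eval-size question, a rough lower bound falls out of the binomial standard error of a resolve rate: over n independent instances at rate p it is sqrt(p(1-p)/n), and a baseline-to-trained delta within roughly twice that band should be read as noise. A back-of-the-envelope sketch, assuming independent instances:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """One-sigma binomial standard error of a resolve rate p over n instances."""
    return math.sqrt(p * (1.0 - p) / n)

# The 333 instances that completed in the run above give a ~2.7% one-sigma
# band at a 50% resolve rate, so step-over-step deltas of a few points
# are ambiguous at that eval size.
print(pass_rate_stderr(0.5, 333))  # ~0.027
print(pass_rate_stderr(0.5, 100))  # ~0.05
```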
