Add main-branch SWE async RL benchmark with Qwen3.5
Background
Bin investigated the async SWE RL benchmark path from #2049. The initial target was clarified to use Qwen3-30B with r2r-gym training data and SWE-verified eval.
A scaled-down Super stage2 SWE2 setup was reproduced on 16 nodes, but the result was not suitable for v0.6. The experiment used Qwen3-30B-A3B-Instruct-2507 with reasoning disabled and the hermes tool parser. SWE-verified eval showed the step_25 checkpoint performing worse than the base model on the 333 instances that completed; the remaining 167 instances failed due to Docker/image-pull/network issues rather than model errors.
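When reporting runs like this, it helps to separate infrastructure failures from model failures explicitly so the two are never conflated. A minimal sketch, assuming a hypothetical per-instance record format (the real SWE-verified report layout may differ):

```python
# Hypothetical per-instance eval records; the real SWE-verified report
# format may differ.
records = [
    {"instance_id": "example__repo-1234", "status": "resolved"},
    {"instance_id": "example__repo-5678", "status": "infra_error"},
]

completed = [r for r in records if r["status"] != "infra_error"]
resolved = [r for r in completed if r["status"] == "resolved"]

# Report the resolve rate over completed instances only, and surface infra
# failures separately so Docker/image-pull/network issues are not misread
# as model regressions.
print(f"completed: {len(completed)}, infra failures: {len(records) - len(completed)}")
if completed:
    print(f"resolve rate: {len(resolved) / len(completed):.1%}")
```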
The main issue appears to be a parser/template mismatch (a compatibility-check sketch follows this list):
- Super SWE data/config expects qwen3_coder-style tool parsing.
- The tested Qwen3 instruct model uses hermes and has no thinking path enabled.
- Qwen3 thinking + tool calling currently has NeMo Gym parser/template compatibility issues.
- Qwen3.5-35B is a better target, but it requires newer vLLM and transformers support, which is available on main.
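One cheap compatibility check is to render a tool-calling prompt through the model's chat template and inspect the markup it emits, since the vLLM tool parser setting (e.g. `--tool-call-parser hermes`) must match what the template actually produces. A minimal sketch, assuming the model repo ships a tool-aware chat template; the model name and tool definition are illustrative:

```python
from transformers import AutoTokenizer

# Assumption: the model repo ships a tool-aware chat template; the model
# name and tool definition here are illustrative only.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the repository's test suite.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}]
messages = [{"role": "user", "content": "Run the tests."}]

# Render the prompt and inspect the tool-call markup the template emits;
# the configured tool parser must match this format (e.g. hermes-style
# templates emit a JSON object inside <tool_call> tags, while
# qwen3_coder-style templates use a different markup).
print(tok.apply_chat_template(
    messages, tools=tools, tokenize=False, add_generation_prompt=True))
```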
Goal
Create a clean main-branch SWE-style agentic RL benchmark that validates NeMo RL async training on long-horizon, multi-turn, tool-calling workloads, using Qwen3.5 or another compatible model.
The benchmark should produce a clear story:
{model} trained for {time} on {cluster config} improves from {baseline} to {result} on {benchmark}, with key training health metrics verified.
Scope
- Add/clean up YAML configs and launch scripts for SWE-style async RL training on main.
- Use a model/parser/template combination that supports both tool calling and reasoning where needed.
- Validate compatibility across NeMo RL, NeMo Gym, tokenizer/chat template, tool parser, and reasoning parser.
- Run downstream SWE-verified eval before and after training.
- Track training health metrics (a logging sketch follows this list):
  - reward curve
  - KL / logprob error
  - max turns
  - max sequence length
  - tool-call success/failure behavior
  - rollout throughput / training speed
- Compare async vs sync performance if feasible.
- Document required dependency versions, especially vLLM and transformers.
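For the health metrics above, a minimal per-step record sketch; the field names and thresholds are assumptions for illustration, not NeMo RL's actual logging schema:

```python
from dataclasses import dataclass

# Illustrative per-step health record; field names are assumptions, not
# NeMo RL's actual logging schema.
@dataclass
class StepHealth:
    step: int
    mean_reward: float             # reward curve
    kl_to_ref: float               # KL / logprob error vs. reference policy
    max_turns: int                 # longest multi-turn trajectory this step
    max_seq_len: int               # longest token sequence this step
    tool_call_success_rate: float  # fraction of tool calls that parsed and ran
    rollout_tokens_per_sec: float  # rollout throughput

def flag_anomalies(h: StepHealth) -> list[str]:
    """Cheap guardrails; thresholds are placeholders to tune per run."""
    issues = []
    if h.kl_to_ref > 0.5:
        issues.append(f"step {h.step}: KL to reference unusually high")
    if h.tool_call_success_rate < 0.8:
        issues.append(f"step {h.step}: elevated tool-call failure rate")
    return issues
```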
Acceptance Criteria
- A main-branch runnable SWE async RL recipe is added.
- Training runs successfully with Qwen3.5 or a clearly justified compatible model.
- SWE-verified eval reports baseline and trained-model results.
- Training health metrics are captured and summarized.
- The issue includes a recommendation on whether the benchmark is ready for docs/release mention.
- Any parser/template/config limitations are documented.
- Follow-up issues are filed for larger parser or config validation refactors if needed.
Open Questions
- Should Qwen3.5-35B be the default target, despite the high baseline making reward improvement harder to show?
- What is the minimum eval size needed to avoid high-variance conclusions? (See the noise estimate after this list.)
- Do we require async-vs-sync comparison for this milestone, or is a stable async baseline enough?
- Should NeMo RL add joint validation for RL + Gym config compatibility?
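On the eval-size question, a rough binomial noise estimate; the numbers below are illustrative, not results:

```python
import math

# Binomial standard error for a pass/fail eval: a resolve rate p measured
# on n instances has SE = sqrt(p * (1 - p) / n).
def ci_half_width(p: float, n: int) -> float:
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (100, 333, 500):
    print(f"n={n}: ±{ci_half_width(0.5, n):.1%} at p=0.5 (95% CI half-width)")
# At n=100 the half-width is ~±9.8 points, so a few-point improvement is
# indistinguishable from noise; a few hundred instances narrows it to
# roughly ±4-5 points.
```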
References
- #2049