
Add main-branch SWE async RL benchmark with Qwen3.5 #2378

@anwithk

Description


Background

Bin investigated the async SWE RL benchmark path from #2049. The initial target was clarified as Qwen3-30B with r2r-gym training data and SWE-verified eval.

A scaled-down Super stage2 SWE2 setup was reproduced on 16 nodes, but the result was not suitable for v0.6. The experiment used Qwen3-30B-A3B-Instruct-2507 with reasoning disabled and switched to the hermes tool parser. SWE-verified eval showed step_25 performing worse than the base model on the 333 instances that completed; the remaining 167 instances failed due to Docker/image-pull/network issues rather than model errors.

The main issue appears to be a parser/template mismatch (a quick template check is sketched after this list):

  • Super SWE data/config expects qwen3_coder-style tool parsing.
  • The tested Qwen3 instruct model uses hermes and has no thinking path enabled.
  • Qwen3 thinking + tool calling currently has NeMo Gym parser/template compatibility issues.
  • Qwen3.5-35B is a better target, but it requires the newer vLLM and transformers support available on main.
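
One cheap pre-flight check for this class of mismatch is to render the chat template with a tool definition attached and inspect the markup it emits (hermes-style `<tool_call>` JSON vs. qwen3_coder-style tags) before launching any training. A minimal sketch using transformers; the model ID is the one tested above, and the `run_tests` tool is purely illustrative:

```python
from transformers import AutoTokenizer

# Model from the scaled-down run above; swap in the final target model.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

# Hypothetical tool definition, only used to exercise the template's tool markup.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the repository test suite and return failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "Fix the failing test in utils.py."}]

# Render without tokenizing and inspect how the template formats tool calls;
# this markup must match the tool parser configured on the inference side.
prompt = tok.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(prompt)
```

If the rendered markup and the configured parser disagree, tool calls will silently fail to parse during rollouts, which looks like model regression in eval.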

Goal

Create a clean main-branch SWE-style agentic RL benchmark that validates NeMo RL async training on long-horizon, multi-turn, tool-calling workloads, using Qwen3.5 or another compatible model.

The benchmark should produce a clear story:

{model} trained for {time} on {cluster config} improves from {baseline} to {result} on {benchmark}, with key training health metrics verified.

Scope

  • Add/clean up YAML configs and launch scripts for SWE-style async RL training on main.
  • Use a model/parser/template combination that supports both tool calling and reasoning where needed.
  • Validate compatibility across NeMo RL, NeMo Gym, tokenizer/chat template, tool parser, and reasoning parser.
  • Run downstream SWE-verified eval before and after training.
  • Track training health metrics:
    • reward curve
    • KL / logprob error (see the sketch after this list)
    • max turns
    • max sequence length
    • tool-call success/failure behavior
    • rollout throughput / training speed
  • Compare async vs sync performance if feasible.
  • Document required dependency versions, especially vLLM and transformers.
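
For the KL / logprob-error item above, a standard async-RL health check is to compare the rollout engine's per-token logprobs against the trainer's recomputation on the same tokens; a growing gap indicates policy staleness or a numerical mismatch between the inference and training stacks. A minimal sketch, assuming the per-token logprob tensors are already collected (function and tensor names are illustrative, not NeMo RL APIs):

```python
import torch

def logprob_health(rollout_logprobs: torch.Tensor,
                   trainer_logprobs: torch.Tensor,
                   response_mask: torch.Tensor) -> dict:
    """Compare per-token logprobs from the rollout engine (e.g. vLLM)
    with the trainer's recomputation on the same token ids.
    All tensors are [batch, seq]; response_mask is 1.0 on response tokens."""
    diff = (rollout_logprobs - trainer_logprobs) * response_mask
    n_tokens = response_mask.sum().clamp(min=1.0)
    return {
        # Should hover near zero in a healthy run; drift or spikes are red flags.
        "logprob_abs_err": (diff.abs().sum() / n_tokens).item(),
        # Sample-based estimate of KL(rollout || trainer) on the rollout's tokens.
        "approx_kl": (diff.sum() / n_tokens).item(),
        "logprob_max_err": diff.abs().max().item(),
    }
```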

Acceptance Criteria

  • A main-branch runnable SWE async RL recipe is added.
  • Training runs successfully with Qwen3.5 or a clearly justified compatible model.
  • SWE-verified eval reports baseline and trained-model results.
  • Training health metrics are captured and summarized.
  • The issue includes a recommendation on whether the benchmark is ready for docs/release mention.
  • Any parser/template/config limitations are documented.
  • Follow-up issues are filed for larger parser or config validation refactors if needed.

Open Questions

  • Should Qwen3.5-35B be the default target, despite the high baseline making reward improvement harder to show?
  • What is the minimum eval size needed to avoid high-variance conclusions? (A back-of-the-envelope sketch follows this list.)
  • Do we require async-vs-sync comparison for this milestone, or is a stable async baseline enough?
  • Should NeMo RL add joint validation for RL + Gym config compatibility?
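
On the eval-size question, a rough lower bound falls out of the binomial standard error of a resolve rate: over n independent instances at rate p it is sqrt(p(1-p)/n), and a baseline-to-trained delta within roughly twice that band should be read as noise. A back-of-the-envelope sketch, assuming independent instances:

```python
import math

def pass_rate_stderr(p: float, n: int) -> float:
    """One-sigma binomial standard error of a resolve rate p over n instances."""
    return math.sqrt(p * (1.0 - p) / n)

# The 333 instances that completed in the run above give a ~2.7% one-sigma
# band at a 50% resolve rate, so step-over-step deltas of a few points
# are ambiguous at that eval size.
print(pass_rate_stderr(0.5, 333))  # ~0.027
print(pass_rate_stderr(0.5, 100))  # ~0.05
```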
