DAPO dynamic sampling filters on shaped reward instead of raw task metric #2431

**Describe the bug**

NeMo-RL DAPO dynamic sampling filters prompt groups using the reward after DAPO overlong reward shaping has been applied. This differs from the original verl DAPO logic: verl keeps raw task metrics such as `acc` / `score` separate from the final reward tensor, and standard DAPO group filtering uses the raw metric `acc` while the overlong-shaped reward is used for optimization.

As a result, a prompt group in which every response is incorrect under the raw task metric can still pass filtering if response lengths differ. Short or empty wrong responses keep the raw wrong reward, while overlong wrong responses receive an additional overlong penalty. The shaped reward therefore has non-zero standard deviation within the group even though the standard deviation of raw task accuracy is zero. Short wrong responses, including empty EOS-only responses, can then receive a positive relative advantage over overlong wrong responses.
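The effect can be illustrated with a small standalone sketch. It does not call NeMo-RL or verl code: the reward values, the constant penalty (the actual overlong shaping may be graded by length), and the helper `shaped_reward` are assumptions made up for illustration, and the advantage is only mean-centered rather than fully normalized.

```python
from statistics import pstdev

# Hypothetical values for illustration; not taken from NeMo-RL or verl.
RAW_WRONG_REWARD = 0.0
OVERLONG_PENALTY = -0.5  # assumed constant penalty for overlong responses


def shaped_reward(raw_reward: float, is_overlong: bool) -> float:
    """Add the assumed overlong penalty on top of the raw task reward."""
    return raw_reward + (OVERLONG_PENALTY if is_overlong else 0.0)


# One prompt group of four responses, all incorrect; two are overlong.
raw = [RAW_WRONG_REWARD] * 4
overlong = [False, False, True, True]
shaped = [shaped_reward(r, o) for r, o in zip(raw, overlong)]

raw_std = pstdev(raw)        # zero: the task metric carries no signal
shaped_std = pstdev(shaped)  # non-zero purely because of length differences

# Mean-centered group-relative advantage computed on the shaped reward.
mean_shaped = sum(shaped) / len(shaped)
advantages = [s - mean_shaped for s in shaped]

print(raw_std, shaped_std, advantages)
```

Under these assumed numbers, a filter that keeps groups with non-zero shaped-reward standard deviation would keep this group, and the two short wrong responses end up with positive advantage.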

**Steps/Code to reproduce bug**


**Expected behavior**

Group filtering should use the raw task metric, such as raw `acc` or raw `score`, rather than the reward after DAPO overlong shaping. Prompt groups where all responses have identical raw task outcomes should be filtered out, regardless of length-based overlong penalty differences.

This matches the original verl DAPO behavior at the implementation level: verl stores raw task metrics separately and standard DAPO filtering uses `acc`, while the shaped reward can still be used for optimization.

**Additional context**

This surfaced in a DeepSeek V4 Base DAPO run. A training step selected empty assistant / EOS samples with positive advantage because they were less penalized than overlong wrong responses. After the policy update, the following step collapsed into effectively empty outputs across different dynamic-sampling retry batches, which suggests the issue is not tied to a particular data batch.

A possible fix is to preserve the raw task metric before reward shaping and use that metric for dynamic sampling group filtering, while keeping the shaped reward for advantage computation and policy optimization if desired.
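A minimal sketch of that idea follows. The names `Sample`, `keep_group`, and `filter_groups` are hypothetical and do not correspond to existing NeMo-RL or verl APIs; the sketch only shows the raw metric being carried alongside the shaped reward and used for the filtering decision.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import List


@dataclass
class Sample:
    raw_metric: float     # e.g. raw `acc`, stored before any reward shaping
    shaped_reward: float  # reward after overlong shaping, used for optimization


def keep_group(group: List[Sample], eps: float = 1e-8) -> bool:
    """Keep a prompt group only if the raw task metric varies within it."""
    return pstdev([s.raw_metric for s in group]) > eps


def filter_groups(groups: List[List[Sample]]) -> List[List[Sample]]:
    """Drop groups whose responses all share the same raw task outcome."""
    return [g for g in groups if keep_group(g)]


# All-wrong group with different lengths: shaped rewards differ, raw metric does not.
all_wrong = [Sample(0.0, 0.0), Sample(0.0, -0.5)]
# Mixed group: one correct response and one overlong wrong response.
mixed = [Sample(1.0, 1.0), Sample(0.0, -0.5)]

kept = filter_groups([all_wrong, mixed])
print(len(kept))
```

Here the all-wrong group is filtered out despite its differing shaped rewards, while the mixed group is kept and its `shaped_reward` values remain available for advantage computation.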
