DAPO dynamic sampling filters on shaped reward instead of raw task metric #2431

@zpqiu

Describe the bug

NeMo-RL DAPO dynamic sampling filters prompt groups using the reward after DAPO overlong reward shaping has been applied. This differs from the original verl DAPO logic: verl keeps raw task metrics such as acc / score separate from the final reward tensor, and standard DAPO group filtering uses the raw metric acc while the overlong-shaped reward is used for optimization.

As a result, a prompt group in which every response is incorrect on the raw task metric can still pass filtering if response lengths differ. Short or empty wrong responses keep the raw wrong reward, while overlong wrong responses receive an additional overlong penalty, so the shaped reward has non-zero std even though the raw task accuracy has zero std. Short wrong responses, including empty EOS-only responses, can then receive a positive relative advantage over overlong wrong responses.
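The effect can be illustrated with a small standalone sketch. It does not call NeMo-RL or verl. The raw wrong reward of -1.0, the length limits, and the linear soft overlong penalty are assumptions chosen for illustration, and `overlong_penalty` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def overlong_penalty(length, max_len, buffer_len, penalty_factor=1.0):
    """Assumed soft overlong penalty: 0 up to (max_len - buffer_len),
    then decreasing linearly to -penalty_factor at max_len."""
    exceed = length - (max_len - buffer_len)
    return min(-exceed / buffer_len * penalty_factor, 0.0)


# One prompt group in which every response is wrong (assumed raw reward: -1.0).
raw_rewards = [-1.0, -1.0, -1.0, -1.0]
# Response lengths in tokens: empty, short, inside the overlong buffer, at the limit.
lengths = [0, 10, 4000, 4096]

shaped_rewards = [
    r + overlong_penalty(n, max_len=4096, buffer_len=1024)
    for r, n in zip(raw_rewards, lengths)
]

# Group-relative advantage before std normalization: reward minus group mean.
group_mean = mean(shaped_rewards)
advantages = [r - group_mean for r in shaped_rewards]

print("raw std:", pstdev(raw_rewards))
print("shaped std:", pstdev(shaped_rewards))
print("advantages:", advantages)
```

Under these assumptions the raw rewards have zero std, so a filter on the raw metric would drop the group. The shaped rewards have non-zero std, so a filter on the shaped reward keeps it, and the empty response ends up with a positive advantage.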

Steps/Code to reproduce bug

Expected behavior

Group filtering should use the raw task metric, such as raw acc or raw score, rather than the reward after DAPO overlong shaping. Prompt groups where all responses have identical raw task outcomes should be filtered out, regardless of length-based overlong penalty differences.

This matches the original verl DAPO behavior at the implementation level: verl stores raw task metrics separately and standard DAPO filtering uses acc, while the shaped reward can still be used for optimization.

Additional context

This surfaced in a DeepSeek V4 Base DAPO run. A training step selected empty assistant / EOS samples with positive advantage because they were less penalized than overlong wrong responses. After the policy update, the following step collapsed into effectively empty outputs across different dynamic-sampling retry batches, which suggests the issue is not tied to a particular data batch.

A possible fix is to preserve the raw task metric before reward shaping and use that metric for dynamic sampling group filtering, while keeping the shaped reward for advantage computation and policy optimization if desired.
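A minimal sketch of that idea, assuming each sample carries both values under the hypothetical keys `raw_acc` and `shaped_reward`. The function `filter_groups` is illustrative only and is not the NeMo-RL or verl API.

```python
from statistics import pstdev


def filter_groups(groups, metric_key):
    """Keep only prompt groups whose values for metric_key are not all identical."""
    return [
        group
        for group in groups
        if pstdev([sample[metric_key] for sample in group]) > 0.0
    ]


# Every response is wrong; only the overlong penalty differs between samples.
all_wrong = [
    {"raw_acc": 0.0, "shaped_reward": -1.0},
    {"raw_acc": 0.0, "shaped_reward": -1.0},
    {"raw_acc": 0.0, "shaped_reward": -2.0},
]
# A group with real signal: one correct response, two wrong ones.
mixed = [
    {"raw_acc": 1.0, "shaped_reward": 1.0},
    {"raw_acc": 0.0, "shaped_reward": -1.0},
    {"raw_acc": 0.0, "shaped_reward": -2.0},
]
groups = [all_wrong, mixed]

# Filtering on the shaped reward keeps the all-wrong group (the reported behavior).
kept_by_shaped = filter_groups(groups, "shaped_reward")
# Filtering on the raw metric drops it; the shaped reward stays on each sample
# and remains available for advantage computation.
kept_by_raw = filter_groups(groups, "raw_acc")

print(len(kept_by_shaped), len(kept_by_raw))
```

Keeping the raw metric alongside the shaped reward on each sample means the filter and the optimizer can read different fields without recomputing either one.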

Labels: accuracy, bug
