Describe the bug
NeMo-RL DAPO dynamic sampling filters prompt groups using the reward after DAPO overlong reward shaping has been applied. This differs from the original verl DAPO logic: verl keeps raw task metrics such as acc / score separate from the final reward tensor, and standard DAPO group filtering uses the raw metric acc while the overlong-shaped reward is used for optimization.
As a result, a prompt group in which every response is incorrect by the raw metric can still pass filtering if the response lengths differ. Short or empty wrong responses keep the raw wrong reward, while overlong wrong responses receive an additional overlong penalty, so the shaped reward has non-zero std even though the raw task accuracy std is zero. Short wrong responses, including empty EOS-only ones, can then receive a positive relative advantage over overlong wrong responses.
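The effect can be illustrated with a small standalone sketch. This is not NeMo-RL or verl code: the length limit, buffer size, the raw wrong reward of -1.0, the linear penalty form, and all names (`MAX_LEN`, `BUFFER`, `overlong_penalty`) are assumptions chosen only to show how a group with zero raw accuracy std can end up with non-zero shaped reward std.

```python
from statistics import mean, pstdev

# Hypothetical settings, chosen for illustration only.
MAX_LEN = 100   # assumed maximum response length
BUFFER = 20     # assumed overlong buffer before MAX_LEN
RAW_WRONG = -1.0  # assumed raw reward for an incorrect response


def overlong_penalty(length: int) -> float:
    """Assumed linear soft penalty: 0 below (MAX_LEN - BUFFER), down to -1 at MAX_LEN."""
    exceed = length - (MAX_LEN - BUFFER)
    return min(-exceed / BUFFER, 0.0)


# One prompt group: every response is wrong, only the lengths differ.
lengths = [0, 10, 90, 100]           # first entry models an empty / EOS-only response
raw_acc = [0, 0, 0, 0]               # raw task metric: all incorrect
shaped = [RAW_WRONG + overlong_penalty(n) for n in lengths]

raw_std = pstdev(raw_acc)            # zero: the group carries no task signal
shaped_std = pstdev(shaped)          # non-zero: caused only by the length penalty

# Mean-centered advantages computed from the shaped reward.
baseline = mean(shaped)
advantages = [r - baseline for r in shaped]

print("raw acc std:", raw_std)
print("shaped rewards:", shaped)
print("shaped std:", shaped_std)
print("advantages:", advantages)
```

Under these assumed numbers, a filter that checks the std of `shaped` keeps the group, and the empty response ends up with a positive advantage even though it is wrong.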
Steps/Code to reproduce bug
Expected behavior
Group filtering should use the raw task metric, such as raw acc or raw score, rather than the reward after DAPO overlong shaping. Prompt groups where all responses have identical raw task outcomes should be filtered out, regardless of length-based overlong penalty differences.
This matches the original verl DAPO behavior at the implementation level: verl stores raw task metrics separately and standard DAPO filtering uses acc, while the shaped reward can still be used for optimization.
Additional context
This surfaced in a DeepSeek V4 Base DAPO run. A training step selected empty assistant / EOS samples with positive advantage because they were less penalized than overlong wrong responses. After the policy update, the following step collapsed into effectively empty outputs across different dynamic-sampling retry batches, which suggests the issue is not tied to a particular data batch.
A possible fix is to preserve the raw task metric before reward shaping and use that metric for dynamic sampling group filtering, while keeping the shaped reward for advantage computation and policy optimization if desired.
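A minimal sketch of that idea, under stated assumptions: the helper name `group_has_variance`, the flat list layout (responses for one prompt stored contiguously), and the sample values are all hypothetical and do not reflect the actual NeMo-RL data structures. It only shows the filter decision being driven by the raw metric while the shaped reward stays available for optimization.

```python
from statistics import pstdev


def group_has_variance(values, group_size):
    """Return one bool per prompt group: True if the values differ within the group.

    Assumes responses for the same prompt are stored contiguously.
    """
    if len(values) % group_size != 0:
        raise ValueError("number of responses must be a multiple of group_size")
    return [
        pstdev(values[start:start + group_size]) > 0
        for start in range(0, len(values), group_size)
    ]


# Two hypothetical prompt groups with four responses each.
raw_acc = [0, 0, 0, 0,   1, 0, 1, 0]                        # raw metric, kept before shaping
shaped = [-1.0, -1.0, -1.5, -2.0,   1.0, -1.0, 1.0, -2.0]   # reward after overlong shaping

keep_by_shaped = group_has_variance(shaped, group_size=4)   # filtering on shaped reward
keep_by_raw = group_has_variance(raw_acc, group_size=4)     # proposed: filter on raw metric

print("keep (shaped reward):", keep_by_shaped)
print("keep (raw metric):   ", keep_by_raw)

# Shaped rewards of the groups that survive raw-metric filtering,
# still usable for advantage computation.
kept_shaped = [
    shaped[i * 4:(i + 1) * 4] for i, keep in enumerate(keep_by_raw) if keep
]
```

With these sample values the all-wrong first group passes the shaped-reward filter but is dropped by the raw-metric filter, while the mixed second group is kept in both cases.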