Improve RL parallelism ValueError message and UXR documentation #3960

Merged
copybara-service[bot] merged 1 commit into main from fix-rl-error-msg
May 21, 2026

@darisoy darisoy commented May 20, 2026

Description

This PR improves the usability of MaxText Reinforcement Learning (RL) training by enhancing the error message for invalid rollout parallelism configurations and correcting the multi-host RL UXR tutorial documentation.

Why this change is being made

During MaxText 2.0 RL UXR testing (reported in b/497920004), users encountered immediate job failures with a cryptic ValueError when submitting training workloads. The root cause was twofold:

  1. The default configuration (rl.yml) sets both rollout_tensor_parallelism and rollout_data_parallelism to -1 (auto-derived). This state is invalid because the system can auto-derive at most one of these parameters.
  2. The example commands in the rl_on_multi_host.md tutorial did not override these defaults, leading to out-of-the-box failures for users copying the commands.
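To illustrate why two auto-derived values cannot be resolved, here is a minimal sketch. It assumes the parallelism values must multiply to the number of rollout devices; that constraint, the `resolve_axes` helper, and the `num_devices` argument are hypothetical and are not taken from train_rl.py.

```python
import math

AUTO = -1  # sentinel for "derive this value", mirroring the -1 used in rl.yml


def resolve_axes(num_devices: int, axes: dict[str, int]) -> dict[str, int]:
    """Fill in at most one AUTO axis so the product of all axes equals num_devices.

    Hypothetical helper: assumes the axes must multiply to the device count.
    """
    auto_names = [name for name, value in axes.items() if value == AUTO]
    if len(auto_names) > 1:
        # Two unknowns and one equation: the split is ambiguous.
        raise ValueError(f"Cannot auto-derive more than one of: {auto_names}")

    known = math.prod(value for value in axes.values() if value != AUTO)
    if known <= 0 or num_devices % known != 0:
        raise ValueError(f"Explicit values (product {known}) do not divide {num_devices} devices")

    resolved = dict(axes)
    if auto_names:
        resolved[auto_names[0]] = num_devices // known
    elif known != num_devices:
        raise ValueError(f"Axes multiply to {known}, expected {num_devices}")
    return resolved
```

With one value set explicitly, for example `resolve_axes(8, {"rollout_tensor_parallelism": 4, "rollout_data_parallelism": -1})`, the remaining value can be derived; with both left at -1 the helper raises.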

Solution

  • Actionable Error Message: Updated the ValueError in train_rl.py to print the resolved values of rollout_tensor_parallelism, rollout_data_parallelism, and rollout_expert_parallelism, and added a concrete suggestion on how to fix it (e.g., adding rollout_tensor_parallelism=4).
  • Corrected Documentation: Updated the example xpk workload creation commands in rl_on_multi_host.md to explicitly include rollout_tensor_parallelism=8 (optimal for Llama 3.1 70B on v5p-128 GKE clusters).
  • Troubleshooting Entry: Added a dedicated entry in the troubleshooting section of the tutorial explaining the error and its resolution.
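As a rough illustration of the actionable error message described above, the sketch below builds a message that reports all three resolved values and ends with a concrete suggestion. The function name `format_parallelism_error` and the exact wording are assumptions; the actual text in train_rl.py is not reproduced here.

```python
def format_parallelism_error(tp: int, dp: int, ep: int) -> str:
    """Build an error message that shows the resolved values and a suggested fix.

    Hypothetical helper; the wording is illustrative, not the text used in train_rl.py.
    """
    return (
        "Invalid rollout parallelism configuration: "
        f"rollout_tensor_parallelism={tp}, "
        f"rollout_data_parallelism={dp}, "
        f"rollout_expert_parallelism={ep}. "
        "At most one value can be auto-derived (-1). "
        "Set one explicitly, e.g. add rollout_tensor_parallelism=4 to the command."
    )


if __name__ == "__main__":
    # Both values left at the rl.yml default of -1.
    print(format_parallelism_error(tp=-1, dp=-1, ep=1))
```

Printing the resolved values alongside the suggestion lets a user see which defaults were never overridden without opening rl.yml.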

Future Improvements

While this runtime check prevents confusing failures, a future improvement would be pre-flight validation in the launcher or configuration loader, so that an invalid configuration fails fast locally before the job is submitted to the GKE queue.
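One possible shape for such a pre-flight check is sketched below. It assumes overrides are passed as `key=value` arguments and that the defaults match the rl.yml values described in this PR; the `preflight_problems` function and the `DEFAULTS` table are hypothetical and do not exist in MaxText.

```python
AUTO = -1

# Assumed defaults, copied from this PR's description of rl.yml.
DEFAULTS = {
    "rollout_tensor_parallelism": AUTO,
    "rollout_data_parallelism": AUTO,
}


def preflight_problems(cli_args: list[str]) -> list[str]:
    """Return human-readable problems found before a job is submitted.

    Hypothetical local check: merges key=value overrides into the defaults
    and reports when more than one value would still need auto-derivation.
    """
    config = dict(DEFAULTS)
    for arg in cli_args:
        key, sep, value = arg.partition("=")
        if not sep or key not in config:
            continue  # unrelated argument; not validated by this sketch
        try:
            config[key] = int(value)
        except ValueError:
            return [f"{key} must be an integer, got {value!r}"]

    auto = sorted(key for key, value in config.items() if value == AUTO)
    if len(auto) > 1:
        return [
            f"{' and '.join(auto)} are both -1; set one explicitly, "
            "e.g. rollout_tensor_parallelism=8"
        ]
    return []
```

A launcher could call `preflight_problems(sys.argv[1:])` and exit with a non-zero status if the returned list is non-empty, so the failure surfaces before the workload is queued.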

FIXES: b/497920004

Tests

Manual Verification

The changes were implemented and verified on a TPU v6e-4 VM (darisoy-gvnic-test):

  1. Verified the syntax of the updated train_rl.py Python file.
  2. Verified the rendering of the updated rl_on_multi_host.md Markdown file.
  3. Ran git diff to ensure only the intended lines were modified.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@darisoy darisoy force-pushed the fix-rl-error-msg branch from 9b11ec0 to b13252b Compare May 20, 2026 22:36
codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Comment thread docs/tutorials/posttraining/rl_on_multi_host.md Outdated
@darisoy darisoy force-pushed the fix-rl-error-msg branch from b13252b to 014c45c Compare May 21, 2026 01:00
@darisoy darisoy force-pushed the fix-rl-error-msg branch from 014c45c to a101308 Compare May 21, 2026 15:44
@copybara-service copybara-service Bot merged commit ee8f49c into main May 21, 2026
50 checks passed
@copybara-service copybara-service Bot deleted the fix-rl-error-msg branch May 21, 2026 16:47