Improve RL parallelism ValueError message and UXR documentation by darisoy · Pull Request #3960 · AI-Hypercomputer/maxtext

darisoy · 2026-05-20T22:32:06Z

Description

This PR improves the usability of MaxText Reinforcement Learning (RL) training by enhancing the error message for invalid rollout parallelism configurations and correcting the multi-host RL UXR tutorial documentation.

Why this change is being made

During MaxText 2.0 RL UXR testing (reported in b/497920004), users encountered immediate job failures with a cryptic ValueError when submitting training workloads. The root cause was twofold:

The default configuration (rl.yml) sets both rollout_tensor_parallelism and rollout_data_parallelism to -1 (auto-derived). This combination is invalid because the system can auto-derive at most one parameter.
The example commands in the rl_on_multi_host.md tutorial did not override these defaults, leading to out-of-the-box failures for users copying the commands.
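
The constraint in the first point can be illustrated with a minimal sketch. It assumes, purely for illustration, that the tensor and data parallelism degrees must multiply to the number of rollout devices. The names `AUTO`, `derive_missing`, and `num_devices` are hypothetical and are not taken from train_rl.py.

```python
AUTO = -1  # sentinel for "derive this value", matching the -1 used in rl.yml


def derive_missing(num_devices: int, tensor: int, data: int) -> tuple[int, int]:
    """Resolve at most one AUTO degree so that tensor * data == num_devices.

    Hypothetical helper: the product rule is an assumption made for this sketch.
    """
    if tensor == AUTO and data == AUTO:
        # One constraint, two unknowns: there is nothing to derive from.
        raise ValueError("tensor and data parallelism cannot both be -1")
    for name, value in (("tensor", tensor), ("data", data)):
        if value != AUTO and value < 1:
            raise ValueError(f"{name} parallelism must be -1 or a positive integer")
    if tensor == AUTO:
        tensor = num_devices // data
    elif data == AUTO:
        data = num_devices // tensor
    if tensor * data != num_devices:
        raise ValueError(
            f"tensor={tensor} * data={data} does not equal num_devices={num_devices}"
        )
    return tensor, data
```

Under these assumptions, `derive_missing(64, 8, AUTO)` returns `(8, 8)`, while `derive_missing(64, AUTO, AUTO)` raises a ValueError because neither value is pinned.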

Solution

Actionable Error Message: Updated the ValueError in train_rl.py to print the resolved values of rollout_tensor_parallelism, rollout_data_parallelism, and rollout_expert_parallelism, and added a concrete suggestion on how to fix it (e.g., adding rollout_tensor_parallelism=4).
Corrected Documentation: Updated the example xpk workload creation commands in rl_on_multi_host.md to explicitly include rollout_tensor_parallelism=8 (optimal for Llama 3.1 70B on v5p-128 GKE clusters).
Troubleshooting Entry: Added a dedicated entry in the troubleshooting section of the tutorial explaining the error and its resolution.
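
As a rough sketch of what an actionable error of this kind might look like, the helper below reports the three resolved values and suggests a fix. The function name and the exact wording are hypothetical; the actual message in train_rl.py may differ.

```python
def rollout_parallelism_error(tensor: int, data: int, expert: int) -> ValueError:
    """Build an error that reports the resolved values and suggests a fix.

    Illustrative only: the wording is an assumption, not the text from train_rl.py.
    """
    return ValueError(
        "Invalid rollout parallelism configuration: "
        f"rollout_tensor_parallelism={tensor}, "
        f"rollout_data_parallelism={data}, "
        f"rollout_expert_parallelism={expert}. "
        "rollout_tensor_parallelism and rollout_data_parallelism cannot both be "
        "-1 (auto-derived). Set one explicitly, for example by adding "
        "rollout_tensor_parallelism=4 to the command."
    )
```

Printing the resolved values alongside a concrete override lets a user see which settings were left at their defaults and what to add to the command.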

Future Improvements

While this runtime check prevents confusing failures, a future improvement could add pre-flight validation to the launcher or configuration loader so that invalid settings fail fast locally, before the job is submitted to the GKE queue.
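
One possible shape for such a pre-flight check is sketched below. It merges `key=value` command-line overrides into the defaults and rejects the invalid combination before anything is submitted. `preflight_check` and its arguments are hypothetical and do not correspond to an existing MaxText launcher API.

```python
def preflight_check(defaults: dict[str, int], overrides: list[str]) -> dict[str, int]:
    """Merge key=value overrides into defaults and reject invalid combinations.

    Hypothetical sketch: only keys already present in `defaults` are parsed.
    """
    resolved = dict(defaults)
    for item in overrides:
        key, sep, value = item.partition("=")
        if sep and key in resolved:
            resolved[key] = int(value)
    tensor = resolved["rollout_tensor_parallelism"]
    data = resolved["rollout_data_parallelism"]
    if tensor == -1 and data == -1:
        # Fail locally instead of after the job reaches the cluster queue.
        raise ValueError(
            "rollout_tensor_parallelism and rollout_data_parallelism are both -1; "
            "set one explicitly before submitting the workload"
        )
    return resolved
```

With both defaults at -1, calling `preflight_check(defaults, [])` raises immediately, whereas passing `["rollout_tensor_parallelism=8"]` resolves the tensor degree to 8 and passes.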

FIXES: b/497920004

Tests

Manual Verification

The changes were implemented and verified on a TPU v6e-4 VM (darisoy-gvnic-test):

Verified the syntax of the updated train_rl.py Python file.
Verified the rendering of the updated rl_on_multi_host.md Markdown file.
Ran git diff to ensure only the intended lines were modified.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-05-20T22:42:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

darisoy force-pushed the fix-rl-error-msg branch from 9b11ec0 to b13252b May 20, 2026 22:36

darisoy marked this pull request as ready for review May 20, 2026 22:57

bvandermoon reviewed May 20, 2026

Comment thread docs/tutorials/posttraining/rl_on_multi_host.md Outdated

darisoy force-pushed the fix-rl-error-msg branch from b13252b to 014c45c May 21, 2026 01:00

khatwanimohit approved these changes May 21, 2026

Improve RL parallelism ValueError message and UXR documentation

a101308

darisoy force-pushed the fix-rl-error-msg branch from 014c45c to a101308 May 21, 2026 15:44

SurbhiJainUSC approved these changes May 21, 2026

github-actions Bot added the pull ready label May 21, 2026

copybara-service Bot merged commit ee8f49c into main May 21, 2026
50 checks passed

copybara-service Bot deleted the fix-rl-error-msg branch May 21, 2026 16:47

github-actions Bot mentioned this pull request May 21, 2026

Failed build: MaxText Package Tests #3953

Open
