Generation Trajectory Checkpointing #2415

@terrykong

Description

Problem

During a multi-turn or multi-step trajectory (in Gym), any component invoked during the rollout can fail, for example a tool call or the generation backend. If the failure is not handled gracefully, the entire trajectory has to be recomputed. For long-context rollouts (e.g., 40 min per rollout), restarting from scratch is very inefficient.

Ideal solution

When a step or turn fails during trajectory collection, the collection should be able to resume from its last saved partial state instead of starting over.

Proposal

Related to #2414

Introduce a data plane to store all the partial trajectories. In the event of a failure during trajectory collection, the collection can be restarted by pulling the last partial state from the data plane.
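To make the resume behavior concrete, here is a minimal sketch, assuming the data plane can be modeled as a key-value store keyed by rollout ID. Every name in it (`PartialTrajectory`, `InMemoryDataPlane`, `collect`, `flaky_step`) is hypothetical and does not come from Gym or any existing API; a real data plane would be durable and live outside the collecting process.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PartialTrajectory:
    """Steps completed so far for one rollout (hypothetical structure)."""
    rollout_id: str
    steps: List[str] = field(default_factory=list)


class InMemoryDataPlane:
    """Stand-in for the proposed data plane (assumption: key-value semantics)."""

    def __init__(self) -> None:
        self._store: Dict[str, List[str]] = {}

    def save(self, traj: PartialTrajectory) -> None:
        # Store a copy so later mutation of `traj` cannot alter the checkpoint.
        self._store[traj.rollout_id] = list(traj.steps)

    def load(self, rollout_id: str) -> PartialTrajectory:
        # Return an empty trajectory when no checkpoint exists yet.
        return PartialTrajectory(rollout_id, list(self._store.get(rollout_id, [])))


def collect(
    rollout_id: str,
    num_steps: int,
    step_fn: Callable[[int], str],
    plane: InMemoryDataPlane,
) -> PartialTrajectory:
    """Run the rollout, resuming from the last checkpointed step if one exists."""
    traj = plane.load(rollout_id)
    for i in range(len(traj.steps), num_steps):
        traj.steps.append(step_fn(i))  # may raise (tool call, generation backend)
        plane.save(traj)               # checkpoint after every completed step
    return traj


# Demo: step 2 fails once, then the collection is restarted.
calls: List[int] = []
already_failed = {"value": False}


def flaky_step(i: int) -> str:
    calls.append(i)
    if i == 2 and not already_failed["value"]:
        already_failed["value"] = True
        raise RuntimeError("simulated tool-call failure")
    return f"step-{i}"


plane = InMemoryDataPlane()
try:
    collect("rollout-0", 4, flaky_step, plane)
except RuntimeError:
    pass  # the first attempt dies at step 2; steps 0 and 1 are checkpointed

result = collect("rollout-0", 4, flaky_step, plane)
```

In this sketch the second call to `collect` re-runs only step 2 and onward, so steps 0 and 1 are never recomputed. Checkpointing after every step is just one possible granularity; a coarser interval would trade recomputation for fewer writes.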

Alternatives

Opt 1

In the current implementation of Gym, we could also potentially enable trajectory checkpointing by storing the partial states in the Ray object store. If the Gym agent has to be restarted, it can then be passed a handle indicating where to fetch the data. @ananthsub
