Multi-step RL with CUBE #143
cube-harness changes
rl/pipelinerl-cube adds lightweight LLM metadata plumbing needed by the PipelineRL cube rollout path. Key changes:
Design Notes
The PipelineRL cube actor path needs to trace each generated training sample back to the vLLM route and LLM call that produced it. This PR keeps that plumbing minimal:

- The router still owns infrastructure metadata such as route ID, vLLM server ID, lease wait time, and LLM latency.
- cube-harness only preserves that metadata on the response/call objects so downstream rollout builders can attach it to training samples.
- No agent behavior changes are intended.

The metadata field is optional and defaults to an empty dict, so existing non-routed cube-harness usage continues to work.

TIR Agent and math-tool-use Cube
TIR Agent
The cube-harness side includes
math-tool-use Cube
The
In the PipelineRL config,
Summary
This PR adds and extends the cube-specific PipelineRL actor path under `pipelinerl/cube_rl`, wiring cube-harness rollouts into the Ray actor entrypoint flow with per-generation vLLM routing, elastic multi-cube scheduling, richer rollout metadata, and results-viewer support.

Key changes:
- `VLLMRouterActor` plus `RayVLLMRouter` adapter for per-generation vLLM routing and admission control.
- `actor.llm_max_rollouts` as per-vLLM generation capacity.
- `cube_params.cubes`, allowing multiple train/test cubes in one run.
- `CubeTaskRef(cube_id, task_id)`.
- `TrainingText.metadata`.

Design Notes
The cube actor loop handles rollout-level scheduling, retries, metrics, eval boundaries, and bounded pending work over a global Ray worker pool. The vLLM router owns generation-level admission control.
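As a rough illustration of generation-level admission control, the sketch below caps the number of in-flight generations per vLLM server, in the spirit of `actor.llm_max_rollouts` acting as a per-server capacity. It is plain Python with threads rather than a Ray actor, and `GenerationRouter`, `lease`, and the least-loaded selection rule are hypothetical names and choices, not the `VLLMRouterActor` API.

```python
import threading
from contextlib import contextmanager


class GenerationRouter:
    """Toy admission control: at most `max_in_flight` generations per server."""

    def __init__(self, server_ids, max_in_flight):
        # max_in_flight stands in for a per-server generation capacity (assumption).
        self._capacity = max_in_flight
        self._in_flight = {sid: 0 for sid in server_ids}
        self._cond = threading.Condition()

    def acquire(self):
        """Block until some server has a free slot, then reserve it."""
        with self._cond:
            while True:
                # Pick the least-loaded server; ties go to the first one listed.
                sid = min(self._in_flight, key=self._in_flight.get)
                if self._in_flight[sid] < self._capacity:
                    self._in_flight[sid] += 1
                    return sid
                self._cond.wait()

    def release(self, sid):
        """Free one slot on `sid` and wake one waiting caller."""
        with self._cond:
            self._in_flight[sid] -= 1
            self._cond.notify()

    @contextmanager
    def lease(self):
        """Hold a slot for exactly one generation, releasing it even on error."""
        sid = self.acquire()
        try:
            yield sid
        finally:
            self.release(sid)

    def in_flight(self):
        """Return a snapshot of the per-server in-flight counts."""
        with self._cond:
            return dict(self._in_flight)
```

A caller would wrap each model call in `with router.lease() as server_id: ...`, so capacity is consumed only while a generation is actually running.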
Each cube-harness LLM generation acquires a short-lived route lease from `VLLMRouterActor` and releases it after completion or error. This keeps routing tied to actual model calls rather than whole rollouts, which matters for multi-step agent trajectories.

The newer multi-cube path treats cubes more like datasets: each configured cube owns its benchmark and agent config, while Ray workers are generic execution slots. Workers lazily install/setup the cube they are assigned and the scheduler prefers workers already warm for the requested cube.
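The warm-worker preference described above can be sketched as follows. This is a minimal, single-process illustration; `Worker`, `pick_worker`, and `assign` are hypothetical names, and the tie-breaking rule (lowest worker id) is an assumption rather than the scheduler's actual policy.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    worker_id: int
    warm_cubes: set = field(default_factory=set)  # cubes already installed/set up
    busy: bool = False


def pick_worker(workers, cube_id):
    """Return an idle worker, preferring one already warm for `cube_id`."""
    idle = [w for w in workers if not w.busy]
    if not idle:
        return None
    # False sorts before True, so warm workers win; ties go to the lowest id.
    return min(idle, key=lambda w: (cube_id not in w.warm_cubes, w.worker_id))


def assign(workers, cube_id):
    """Reserve a worker for one rollout of `cube_id`, or return None if all are busy."""
    worker = pick_worker(workers, cube_id)
    if worker is None:
        return None
    worker.busy = True
    worker.warm_cubes.add(cube_id)  # stands in for the lazy install/setup step
    return worker
```

Because any idle worker can take any cube, a cold worker is still used when no warm one is free; it simply pays the setup cost once.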
Rollout construction remains inside Ray workers for now. Workers run the cube-harness `Episode`, inspect the resulting trajectory, assign rewards, and emit `TrainingText` samples. The new metadata path makes those samples auditable without moving trajectory construction into the central actor loop.

Multi-node / Ray Cluster Support
- `actor.ray_address`, with `ray.init(address="auto")` fallback before starting a local Ray runtime.
- `actor.cube_workers * actor.cube_workers_num_cpus`.
- `VLLMRouterActor`, so distributed Ray workers share one per-generation admission-control state.

Optional rollout artifact writing remains opt-in and file-based. In multi-node clusters it should only be enabled with a shared filesystem path or a cluster-aware storage backend; normal actor-stream training data does not depend on those artifacts.
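The connection order mentioned in the list above (explicit address, then `address="auto"`, then a local runtime) might look roughly like the sketch below. `connect_ray` is a hypothetical helper, not code from this PR. The `init` callable is expected to behave like `ray.init` and is passed in so the fallback order can be exercised without Ray installed. The sketch also assumes that `ray.init(address="auto")` raises `ConnectionError` when no running cluster is found.

```python
def connect_ray(init, ray_address=None):
    """Connect to Ray, returning which path was taken.

    Order (an assumption based on the PR description):
      1. the explicitly configured address, if any;
      2. address="auto", to join an already running cluster;
      3. a fresh local runtime.
    """
    if ray_address:
        init(address=ray_address)
        return ray_address
    try:
        init(address="auto")
        return "auto"
    except ConnectionError:
        # No existing cluster was found, so start a local runtime instead.
        init()
        return "local"
```

In a real entrypoint this would be called as `connect_ray(ray.init, ray_address)`, with `ray_address` read from `actor.ray_address` in the config.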
Testing
I trained `Qwen3-4B-Instruct-2507` on TIR using both the default PipelineRL implementation and the Cube-based version of TIR, on a single node.

Shared hyperparameters
Cube-RL-specific hyperparameters
Cube-RL Ray cluster configs
Results
with
Effect of scaling up `llm_max_rollouts` from 32 to 64
Eval (Comparison between running with and without evaluation for both TIR and CUBE)
Multi-node run (2 nodes)
Setup and Cube-compatible repositories
This setup has been tested with the `uv` package manager and Python `3.12.13`.

Required repositories:
Clone `cube-harness` and `cube-standard` alongside the `PipelineRL` repository before running the project. For example: