fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic by RaymondLi0 · Pull Request #142 · ServiceNow/PipelineRL

RaymondLi0 · 2026-05-12T19:31:28Z

Summary

pipelinerl.actor.is_trainer_finished() decides when the rollout actor should stop feeding the redis data stream. On the legacy HF/DeepSpeed trainer path the implementation is correct, but on the Fast-LLM path it fires several optimizer steps early, the trainer then times out waiting for documents, and the job dies.

Symptom (single 8-GPU node, math gspo recipe, max_train_steps=400, docs_per_step=1024, train_iters=400, gradient_accumulation_passes=1024 auto-adjusted to 1026 to divide 6 trainer ranks):

[finetune rank 0] training @ step 393/400 | consumed tokens: 452,736,000 | ...
[actor]: ... is_trainer_finished() returns True at samples_processed=410,400
[actor]: train scheduler for llms 0,1: rollout task encountered ServerDisconnectedError
         after trainer completion; ignoring
[actor]: Rollout maker loop closed
... (10 minutes later, no new docs in redis) ...
fast_llm/data/dataset/streaming.py:175: TimeoutError: No document received after 600 seconds
vLLM stderr: [FastLLM] training_finished was not received; forcing stop
              EngineDeadError

Root cause

The current check is the same for both trainer paths (pipelinerl/actor.py:158, 170-174):

samples_target = final_steps * cfg.finetune.train_batch_size * cfg.finetune.gradient_accumulation_passes

def is_trainer_finished() -> bool:
    return (
        trainer_state.samples_processed is not None
        and trainer_state.samples_processed >= samples_target
    )

This formula is correct for the HF/DeepSpeed path: gradient_accumulation_passes counts microbatches per optimizer step, and train_batch_size × gradient_accumulation_passes = samples per step.

Under Fast-LLM (use_fast_llm=True), gradient_accumulation_passes is vestigial — it is not propagated to the Fast-LLM trainer, which has its own schedule.docs_per_step configuration. Fast-LLM's _prefetch_to_doc_target (fast_llm/engine/training/trainer.py) also always overshoots docs_per_step by a few documents per step:

target = self._config.schedule.docs_per_step
total_docs = 0
while total_docs < target:
    mb = next(data_iterator)
    total_docs += mb[0].num_documents_in_batch  # variable per microbatch

The compound effect: with docs_per_step=1024, real per-step consumption is ~1043–1044 docs, so 400 trainer steps consume ~417k docs total. But the actor's samples_target is computed as 400 × 1 × 1026 = 410,400. The actor declares completion when the trainer is around step ~393/400, stops feeding redis, and the trainer eventually hits its 600s RedisStreamingDataset timeout.

The colleague's multinode submit script (submit_eai_math_7b_multinode.sh) hits the same bug latent — gradient_accumulation_passes=1024 (no divisibility adjustment because finetune_fraction=4 × 4 nodes = 16 ranks divides 1024 evenly), so the actor stops at trainer step ~392 instead of 400. The truncation is small enough it has gone unnoticed.

Fix

Fast-LLM already publishes an explicit {"type": "training_finished"} event to the fast_llm_events redis stream at the natural end of training (fast_llm/engine/training/streaming.py:train_end), and pipelinerl/state.py:112-115 already listens for it and sets TrainerState.training_done = True. The vLLM workers and TrainerState consume the same signal (pipelinerl/vllm1.py:534-547, pipelinerl/state.py:112-115).

This PR makes the actor's completion check route to that signal under Fast-LLM, falling back to the sample-counting heuristic for the HF/DeepSpeed path. Two call sites are updated symmetrically:

pipelinerl/actor.py:170-174 (is_trainer_finished function used by schedule_rollouts).
pipelinerl/actor.py:612-616 (inline mirror inside ActorLoop.run).

The HF/DeepSpeed path is unchanged. No new config keys, no race, no overshoot accounting, no implicit dependence on gradient_accumulation_passes.

Test plan

Reproduce the bug locally: math gspo run, max_train_steps=400, single 8-GPU node — observe is_trainer_finished() returning True at samples_processed=410,400 with the trainer at step 393.
After this PR: rerun same config, verify the trainer reaches step 400, emits training_finished, the actor sees trainer_state.training_done=True, and shuts down cleanly without EngineDeadError.
Spot-check the HF/DeepSpeed path with use_fast_llm=false is unchanged: samples_target calculation and check are unmodified for that branch.

🤖 Generated with Claude Code

…ng heuristic The actor decides when training is done via `is_trainer_finished()`. On the legacy HF/DeepSpeed trainer path, the check is: samples_target = max_train_steps * train_batch_size * gradient_accumulation_passes samples_processed >= samples_target This is correct for that path because `gradient_accumulation_passes` counts the microbatches per optimizer step, so the product is samples per step. Under Fast-LLM (`use_fast_llm=True`), the trainer uses its own `schedule.docs_per_step` knob instead, and `gradient_accumulation_passes` becomes vestigial — it is not propagated to the Fast-LLM trainer. Worse, Fast-LLM's `_prefetch_to_doc_target` always overshoots `docs_per_step` by a few documents per step (the loop runs `while total_docs < target` and stops just after crossing it). The accumulated overshoot, combined with the unrelated `gradient_accumulation_passes` value (which auto-adjusts to be divisible by the trainer-rank count, e.g. 1024 -> 1026 for finetune_fraction=6), makes `samples_target` finish hundreds of documents short of the trainer's true 400-step consumption. Concrete reproduction on a single 8-GPU node with the math gspo recipe (`max_train_steps=400`, `gradient_accumulation_passes=1026`, `docs_per_step=1024`, `train_iters=400`): actor declares completion at samples_processed=410,400 while the Fast-LLM trainer is only at step ~393/400, stops feeding redis, the trainer's `RedisStreamingDataset` times out after 600s with `No document received`, vLLM logs `[FastLLM] training_finished was not received; forcing stop`, and the job dies with `EngineDeadError`. Fast-LLM already publishes an explicit `{"type": "training_finished"}` event to the `fast_llm_events` redis stream at the natural end of training (`fast_llm/engine/training/streaming.py:train_end`), and PipelineRL already listens for it and sets `TrainerState.training_done = True` (`pipelinerl/state.py:112-115`). This commit makes `is_trainer_finished()` — and its inline mirror in `ActorLoop.run()` at line ~614 — consult that flag when `use_fast_llm=True`, while preserving the sample-counting heuristic for the HF/DeepSpeed path. No race, no overshoot accounting, no implicit dependence on `gradient_accumulation_passes`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 14-line block duplicated the PR description. Keep the non-obvious why (Fast-LLM ignores gradient_accumulation_passes and overshoots docs_per_step); the symptom numbers live in the PR description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jlamypoirier

Codex review:

Thanks, this fixes one of the two producer-side early-stop paths, but I think the PR is still incomplete for the Fast-LLM starvation bug.

pipelinerl/preprocess.py still computes samples_target = final_train_steps * train_batch_size * gradient_accumulation_passes and exits the preprocessing loop when trainer_state.samples_processed >= samples_target. Under Fast-LLM, samples_processed is derived from Redis consumer-group entries read, not natural optimizer-step completion. That is the same unit mismatch described in the PR body.

If the preprocessor exits early, actors may continue writing raw rollouts to the actor stream, but nothing converts those rollouts into Fast-LLM fast_llm_streaming documents. The trainer can still drain the Fast-LLM stream and eventually hit No document received even though this PR keeps the actor loop alive.

I would mirror the new actor predicate in the preprocessor: when cfg.use_fast_llm, stop only on trainer_state.training_done; keep the sample/entry count for lag and backpressure only. The legacy HF/DeepSpeed path can keep the current sample-counting condition.

Separately, the launcher cleanup path still appears to only stop remaining processes when the only alive processes are inference processes. That can leave redis/actor/preprocessor alive after a clean Fast-LLM training_finished event. That may be a separate PR, but the test plan here should not claim clean shutdown unless that path is covered too.

RaymondLi0 and others added 2 commits May 12, 2026 15:30

jlamypoirier requested changes May 29, 2026

View reviewed changes

jlamypoirier mentioned this pull request May 29, 2026

fix(fast-llm): stop producers and helpers on training completion #144

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic#142

fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic#142
RaymondLi0 wants to merge 2 commits into
fast-llmfrom
fix/actor-finished-uses-training-done-for-fast-llm

RaymondLi0 commented May 12, 2026

Uh oh!

jlamypoirier left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

RaymondLi0 commented May 12, 2026

Summary

Root cause

Fix

Test plan

Uh oh!

jlamypoirier left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants