fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic#142
fix(actor): use training_done event under Fast-LLM, not sample-counting heuristic#142RaymondLi0 wants to merge 2 commits into
Conversation
…ng heuristic
The actor decides when training is done via `is_trainer_finished()`. On the
legacy HF/DeepSpeed trainer path, the check is:
samples_target = max_train_steps * train_batch_size * gradient_accumulation_passes
samples_processed >= samples_target
This is correct for that path because `gradient_accumulation_passes` counts
the microbatches per optimizer step, so the product is samples per step.
Under Fast-LLM (`use_fast_llm=True`), the trainer uses its own
`schedule.docs_per_step` knob instead, and `gradient_accumulation_passes`
becomes vestigial — it is not propagated to the Fast-LLM trainer. Worse,
Fast-LLM's `_prefetch_to_doc_target` always overshoots `docs_per_step` by a
few documents per step (the loop runs `while total_docs < target` and stops
just after crossing it). The accumulated overshoot, combined with the
unrelated `gradient_accumulation_passes` value (which auto-adjusts to be
divisible by the trainer-rank count, e.g. 1024 -> 1026 for finetune_fraction=6),
makes `samples_target` finish hundreds of documents short of the trainer's
true 400-step consumption.
Concrete reproduction on a single 8-GPU node with the math gspo recipe
(`max_train_steps=400`, `gradient_accumulation_passes=1026`, `docs_per_step=1024`,
`train_iters=400`): actor declares completion at samples_processed=410,400 while
the Fast-LLM trainer is only at step ~393/400, stops feeding redis, the
trainer's `RedisStreamingDataset` times out after 600s with
`No document received`, vLLM logs `[FastLLM] training_finished was not received;
forcing stop`, and the job dies with `EngineDeadError`.
Fast-LLM already publishes an explicit `{"type": "training_finished"}` event to
the `fast_llm_events` redis stream at the natural end of training
(`fast_llm/engine/training/streaming.py:train_end`), and PipelineRL already
listens for it and sets `TrainerState.training_done = True`
(`pipelinerl/state.py:112-115`). This commit makes `is_trainer_finished()` —
and its inline mirror in `ActorLoop.run()` at line ~614 — consult that flag
when `use_fast_llm=True`, while preserving the sample-counting heuristic for
the HF/DeepSpeed path. No race, no overshoot accounting, no implicit dependence
on `gradient_accumulation_passes`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 14-line block duplicated the PR description. Keep the non-obvious why (Fast-LLM ignores gradient_accumulation_passes and overshoots docs_per_step); the symptom numbers live in the PR description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jlamypoirier
left a comment
There was a problem hiding this comment.
Codex review:
Thanks, this fixes one of the two producer-side early-stop paths, but I think the PR is still incomplete for the Fast-LLM starvation bug.
pipelinerl/preprocess.py still computes samples_target = final_train_steps * train_batch_size * gradient_accumulation_passes and exits the preprocessing loop when trainer_state.samples_processed >= samples_target. Under Fast-LLM, samples_processed is derived from Redis consumer-group entries read, not natural optimizer-step completion. That is the same unit mismatch described in the PR body.
If the preprocessor exits early, actors may continue writing raw rollouts to the actor stream, but nothing converts those rollouts into Fast-LLM fast_llm_streaming documents. The trainer can still drain the Fast-LLM stream and eventually hit No document received even though this PR keeps the actor loop alive.
I would mirror the new actor predicate in the preprocessor: when cfg.use_fast_llm, stop only on trainer_state.training_done; keep the sample/entry count for lag and backpressure only. The legacy HF/DeepSpeed path can keep the current sample-counting condition.
Separately, the launcher cleanup path still appears to only stop remaining processes when the only alive processes are inference processes. That can leave redis/actor/preprocessor alive after a clean Fast-LLM training_finished event. That may be a separate PR, but the test plan here should not claim clean shutdown unless that path is covered too.
Summary
pipelinerl.actor.is_trainer_finished()decides when the rollout actor should stop feeding the redis data stream. On the legacy HF/DeepSpeed trainer path the implementation is correct, but on the Fast-LLM path it fires several optimizer steps early, the trainer then times out waiting for documents, and the job dies.Symptom (single 8-GPU node, math gspo recipe,
max_train_steps=400,docs_per_step=1024,train_iters=400,gradient_accumulation_passes=1024auto-adjusted to 1026 to divide 6 trainer ranks):Root cause
The current check is the same for both trainer paths (
pipelinerl/actor.py:158, 170-174):This formula is correct for the HF/DeepSpeed path:
gradient_accumulation_passescounts microbatches per optimizer step, andtrain_batch_size × gradient_accumulation_passes= samples per step.Under Fast-LLM (
use_fast_llm=True),gradient_accumulation_passesis vestigial — it is not propagated to the Fast-LLM trainer, which has its ownschedule.docs_per_stepconfiguration. Fast-LLM's_prefetch_to_doc_target(fast_llm/engine/training/trainer.py) also always overshootsdocs_per_stepby a few documents per step:The compound effect: with
docs_per_step=1024, real per-step consumption is ~1043–1044 docs, so 400 trainer steps consume ~417k docs total. But the actor'ssamples_targetis computed as400 × 1 × 1026 = 410,400. The actor declares completion when the trainer is around step ~393/400, stops feeding redis, and the trainer eventually hits its 600sRedisStreamingDatasettimeout.The colleague's multinode submit script (
submit_eai_math_7b_multinode.sh) hits the same bug latent —gradient_accumulation_passes=1024(no divisibility adjustment because finetune_fraction=4 × 4 nodes = 16 ranks divides 1024 evenly), so the actor stops at trainer step ~392 instead of 400. The truncation is small enough it has gone unnoticed.Fix
Fast-LLM already publishes an explicit
{"type": "training_finished"}event to thefast_llm_eventsredis stream at the natural end of training (fast_llm/engine/training/streaming.py:train_end), andpipelinerl/state.py:112-115already listens for it and setsTrainerState.training_done = True. The vLLM workers andTrainerStateconsume the same signal (pipelinerl/vllm1.py:534-547,pipelinerl/state.py:112-115).This PR makes the actor's completion check route to that signal under Fast-LLM, falling back to the sample-counting heuristic for the HF/DeepSpeed path. Two call sites are updated symmetrically:
pipelinerl/actor.py:170-174(is_trainer_finishedfunction used byschedule_rollouts).pipelinerl/actor.py:612-616(inline mirror insideActorLoop.run).The HF/DeepSpeed path is unchanged. No new config keys, no race, no overshoot accounting, no implicit dependence on
gradient_accumulation_passes.Test plan
max_train_steps=400, single 8-GPU node — observeis_trainer_finished()returning True at samples_processed=410,400 with the trainer at step 393.training_finished, the actor seestrainer_state.training_done=True, and shuts down cleanly withoutEngineDeadError.use_fast_llm=falseis unchanged:samples_targetcalculation and check are unmodified for that branch.🤖 Generated with Claude Code