Batch-size scaling experiment for Adam (square-root rule): configs + analysis by jlamypoirier · Pull Request #539 · ServiceNow/Fast-LLM

jlamypoirier · 2026-06-11T22:37:48Z

Claude Opus 4.8 note (drafted via Claude Code): opening as a draft — the training runs are still going, so the Results section is marked preliminary.

Adds a self-contained example under examples/batch_size_scaling/ that tests whether small-batch Adam training reproduces large-batch training when the hyperparameters are scaled by the square-root (SDE) rule (Malladi et al., arXiv:2205.10287), and how that compares to the "keep lr, scale β2" rule of Marek et al. (arXiv:2507.07101).
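For reference, the two rules being compared can be sketched as below. This is a minimal illustration, not the code or configs in this PR. It assumes the usual statement of the square-root rule (for a batch-size ratio κ = B'/B: lr scaled by √κ, 1 − β scaled by κ for both betas, ε divided by √κ) and of the β2-only rule (lr unchanged, β2 raised to the power κ so the second-moment half-life in tokens is preserved). The `AdamHyperparams` class, the function names and the numbers are hypothetical; the actual per-arm values are the ones listed in the README.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AdamHyperparams:
    """Hypothetical container for the knobs both rules touch."""
    batch_size: int
    lr: float
    beta1: float
    beta2: float
    eps: float


def sqrt_rule(base: AdamHyperparams, new_batch_size: int) -> AdamHyperparams:
    """Square-root (SDE) rule, as assumed above: scales lr, both betas and eps."""
    kappa = new_batch_size / base.batch_size
    beta1 = 1.0 - kappa * (1.0 - base.beta1)
    beta2 = 1.0 - kappa * (1.0 - base.beta2)
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        # Scaling up too far pushes a beta out of [0, 1); the rule no longer applies.
        raise ValueError(f"kappa={kappa} gives invalid betas ({beta1}, {beta2})")
    return AdamHyperparams(
        batch_size=new_batch_size,
        lr=base.lr * math.sqrt(kappa),
        beta1=beta1,
        beta2=beta2,
        eps=base.eps / math.sqrt(kappa),
    )


def beta2_only_rule(base: AdamHyperparams, new_batch_size: int) -> AdamHyperparams:
    """Keep-lr rule, as assumed above: only beta2 changes, as beta2 ** kappa."""
    kappa = new_batch_size / base.batch_size
    return replace(base, batch_size=new_batch_size, beta2=base.beta2 ** kappa)


# Illustrative values only (not the arm matrix from the README).
large = AdamHyperparams(batch_size=256, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8)
small_sqrt = sqrt_rule(large, 64)
small_beta2 = beta2_only_rule(large, 64)
print(small_sqrt)
print(small_beta2)
```

For small κ the two β2 values nearly coincide (1 − κ(1 − β2) is the first-order expansion of β2^κ), so under these assumptions the rules differ mainly in whether lr, β1 and ε are also scaled.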

Separate concern from #525 (the layer-wise numerical-error tool): this PR covers full training runs on Qwen2.5-0.5B / FineWeb-Edu, not a per-step precision probe.

prepare.yaml / warmup.yaml / arm_base.yaml — tokenization, throwaway from-scratch warmup, and the shared arm base (per-arm overrides in the README).
README.md — reproduction steps + arm matrix (the two √-rule pairs A↔H and B↔J).
ANALYSIS.md — the theory (why the SGD linear rule fails for Adam, the √/SDE rule and its equivalence guarantee, equivalence-vs-optimality), predictions, and a preliminary Results section.

Headline result (preliminary)

In the noise-dominated regime (deep in training, the regime the √-rule is derived for), the √-scaled small-batch arms overlay the large-batch trajectory: the pairs A↔H and B↔J match to within ~0.0002–0.0006 nats, about 10× below the spread between operating points. Early in training (the signal-dominated regime) the rule's knobs wash out, so the rule is not testable there; this reframes batch-size effects in that regime as an update-count/drift phenomenon rather than the noise averaging the rule addresses. Two small secondary signals: scaling β1 helps slightly (favoring the full SDE rule over β2-only scaling), and fp16 narrowly beats bf16. Full writeup and caveats are in ANALYSIS.md.
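One way to put a number on "overlay" for a pair such as A↔H is sketched below. It is an assumption about how such a gap could be measured, not necessarily what ANALYSIS.md does: each run is taken to be logged as (tokens seen, training loss) points, both curves are interpolated onto a shared token grid (the small-batch arm takes more steps per token), and the mean absolute difference is taken over the late part of training. The function names, the `start_fraction` parameter and the data are hypothetical and synthetic.

```python
from bisect import bisect_left


def interp(x, xs, ys):
    """Linear interpolation of (xs, ys) at x; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x, so 1 <= i <= len(xs) - 1
    weight = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + weight * (ys[i] - ys[i - 1])


def late_window_gap(tokens_a, loss_a, tokens_b, loss_b,
                    start_fraction=0.5, num_points=200):
    """Mean |loss_a - loss_b| over the last part of the shared token range."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    lo = max(tokens_a[0], tokens_b[0])
    hi = min(tokens_a[-1], tokens_b[-1])
    start = lo + start_fraction * (hi - lo)
    grid = [start + (hi - start) * k / (num_points - 1) for k in range(num_points)]
    gaps = [abs(interp(t, tokens_a, loss_a) - interp(t, tokens_b, loss_b))
            for t in grid]
    return sum(gaps) / len(gaps)


# Synthetic example: the large-batch arm logs every 4 tokens, the small-batch arm
# every token, and the small-batch loss sits a constant 0.001 above the other.
tokens_large = [4.0 * s for s in range(1, 101)]
loss_large = [5.0 - 0.001 * t for t in tokens_large]
tokens_small = [1.0 * s for s in range(1, 401)]
loss_small = [5.0 - 0.001 * t + 0.001 for t in tokens_small]

gap = late_window_gap(tokens_large, loss_large, tokens_small, loss_small)
print(f"{gap:.4f}")  # prints 0.0010
```

Comparing on tokens seen rather than on step index matters here, since arms with different batch sizes reach the same point in the data after different numbers of updates.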

Caveats

Runs ongoing / not converged; comparisons use training loss because validation-loss logging is currently broken (#538); single model + dataset. (W&B loss curves can be attached.)

🤖 Generated with Claude Code

Self-contained example under examples/batch_size_scaling/ testing whether small-batch Adam training reproduces large-batch training under the square-root (SDE) scaling rule, vs. the keep-lr/scale-β2 rule. Includes the prepare/warmup/arm configs, a README arm matrix, and ANALYSIS.md (theory + predictions + preliminary results).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jlamypoirier wants to merge 1 commit into main from jlp_batch_size_scaling
