
Batch-size scaling experiment for Adam (square-root rule): configs + analysis #539

Draft
jlamypoirier wants to merge 1 commit into main from jlp_batch_size_scaling

@jlamypoirier

Collaborator

Claude Opus 4.8 note (drafted via Claude Code): opening as a draft — the training runs are still going, so the Results section is marked preliminary.

Adds a self-contained example under examples/batch_size_scaling/ testing whether small-batch Adam training reproduces large-batch training when the hyperparameters are scaled by the square-root (SDE) rule (Malladi et al., 2205.10287), and how that compares to the "keep lr, scale β2" paper rule (Marek et al., 2507.07101).
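For orientation, the two rules being compared can be sketched as below. This is a minimal sketch that assumes the rules as they are commonly stated from the two cited papers: the SDE rule scales lr by √κ, maps each β to 1 − κ(1 − β), and scales ε by 1/√κ when the batch size is multiplied by κ, while the β2-only rule keeps everything else fixed and sets β2 to β2^κ so that its half-life in tokens is unchanged. The names `AdamHParams`, `sqrt_sde_rule`, and `beta2_half_life_rule` are hypothetical and do not come from the configs in this PR.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AdamHParams:
    """Hypothetical container for the Adam knobs the two rules touch."""
    lr: float
    beta1: float
    beta2: float
    eps: float


def sqrt_sde_rule(base: AdamHParams, kappa: float) -> AdamHParams:
    """Square-root (SDE) rule, assuming batch size is multiplied by kappa.

    lr -> lr * sqrt(kappa), beta -> 1 - kappa * (1 - beta), eps -> eps / sqrt(kappa).
    kappa < 1 means moving to a smaller batch.
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    beta1 = 1.0 - kappa * (1.0 - base.beta1)
    beta2 = 1.0 - kappa * (1.0 - base.beta2)
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        raise ValueError("kappa is too large for the given betas")
    root = math.sqrt(kappa)
    return AdamHParams(lr=base.lr * root, beta1=beta1, beta2=beta2, eps=base.eps / root)


def beta2_half_life_rule(base: AdamHParams, kappa: float) -> AdamHParams:
    """'Keep lr, scale beta2' rule as assumed here: beta2 -> beta2 ** kappa.

    This keeps the half-life of the second-moment average fixed in tokens;
    lr, beta1, and eps are left unchanged.
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    return replace(base, beta2=base.beta2 ** kappa)
```

With illustrative values (not the arm values from the README), `sqrt_sde_rule(AdamHParams(1e-3, 0.9, 0.999, 1e-8), kappa=0.25)` halves the learning rate and moves β1 to 0.975 and β2 to 0.99975, whereas `beta2_half_life_rule` with the same inputs changes only β2.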

Separate concern from #525 (the layer-wise numerical-error tool) — this is full training runs on Qwen2.5-0.5B / FineWeb-Edu, not a per-step precision probe.

Contents

  • prepare.yaml / warmup.yaml / arm_base.yaml — tokenization, throwaway from-scratch warmup, and the shared arm base (per-arm overrides in the README).
  • README.md — reproduction steps + arm matrix (the two √-rule pairs A↔H and B↔J).
  • ANALYSIS.md — the theory (why the SGD linear rule fails for Adam, the √/SDE rule and its equivalence guarantee, equivalence-vs-optimality), predictions, and a preliminary Results section.

Headline result (preliminary)

In the noise-dominated regime (deep in training, the regime the √-rule is derived for), the √-scaled small-batch arms overlay the large-batch trajectory: the pairs A↔H and B↔J match to within ~0.0002–0.0006 nats, roughly 10× below the spread between operating points. Early in training (the signal-dominated regime), the effect of the rule's hyperparameter changes washes out, so the rule is not even testable there; this reframes batch-size effects in that regime as an update-count/drift phenomenon rather than the noise averaging the rule addresses. Two small secondary signals: scaling β1 helps slightly (favoring the full SDE rule over β2-only scaling), and fp16 slightly outperforms bf16. The full writeup and caveats are in ANALYSIS.md.
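One way a pairwise gap like the one quoted above could be measured is sketched below. This is an assumption about the method, not the analysis code behind the numbers in this PR: it aligns two arms by tokens seen (small-batch arms take more steps per token), linearly interpolates one curve onto the other's logging points, and averages the absolute loss difference over the last part of the shared token range. The helper names and the `tail_fraction` parameter are hypothetical, and real training-loss curves would likely need smoothing first.

```python
from bisect import bisect_left
from typing import Sequence


def interp(x: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Linearly interpolate ys over strictly increasing xs at x (clamped at the ends)."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x, so 1 <= i <= len(xs) - 1
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def late_mean_abs_gap(
    tokens_a: Sequence[float],
    loss_a: Sequence[float],
    tokens_b: Sequence[float],
    loss_b: Sequence[float],
    tail_fraction: float = 0.2,
) -> float:
    """Mean |loss_a - loss_b| over the last tail_fraction of the shared token range."""
    lo = max(tokens_a[0], tokens_b[0])
    hi = min(tokens_a[-1], tokens_b[-1])
    start = hi - tail_fraction * (hi - lo)
    gaps = [
        abs(la - interp(t, tokens_b, loss_b))
        for t, la in zip(tokens_a, loss_a)
        if start <= t <= hi
    ]
    if not gaps:
        raise ValueError("no logged points of arm A fall in the tail window")
    return sum(gaps) / len(gaps)
```

Comparing this per-pair gap against the same quantity computed between different operating points gives a ratio of the kind described above; choosing the tail window is a judgment call about where the noise-dominated regime begins.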

Caveats

The runs are still in progress and not converged; comparisons use training loss because validation-loss logging is currently broken (#538); and the experiment covers a single model and dataset. (W&B loss curves can be attached.)

🤖 Generated with Claude Code

Self-contained example under examples/batch_size_scaling/ testing whether
small-batch Adam training reproduces large-batch under the square-root (SDE)
scaling rule, vs the keep-lr/scale-beta2 paper rule. Includes prepare/warmup/
arm configs, a README arm matrix, and ANALYSIS.md (theory + predictions +
preliminary results).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>