Batch-size scaling experiment for Adam (square-root rule): configs + analysis #539
Draft
jlamypoirier wants to merge 1 commit into
Self-contained example under examples/batch_size_scaling/ testing whether small-batch Adam training reproduces large-batch training under the square-root (SDE) scaling rule, vs the keep-lr/scale-beta2 paper rule. Includes prepare/warmup/arm configs, a README arm matrix, and ANALYSIS.md (theory + predictions + preliminary results).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude Opus 4.8 note (drafted via Claude Code): opening as a draft — the training runs are still going, so the Results section is marked preliminary.
Adds a self-contained example under examples/batch_size_scaling/ testing whether small-batch Adam training reproduces large-batch training when the hyperparameters are scaled by the square-root (SDE) rule (Malladi et al., 2205.10287), and how that compares to the "keep lr, scale β2" paper rule (Marek et al., 2507.07101).
Separate concern from #525 (the layer-wise numerical-error tool): this is full training runs on Qwen2.5-0.5B / FineWeb-Edu, not a per-step precision probe.
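For readers who want the two rules in concrete form, here is a minimal sketch. It assumes the square-root rule as stated by Malladi et al. for Adam (for batch size B' = κB: lr scaled by √κ, 1 − β1 and 1 − β2 scaled by κ, ε scaled by 1/√κ) and reads the "keep lr, scale β2" rule as β2' = β2^κ, which holds the second-moment half-life fixed in tokens. That second form is an interpretation, not something this PR confirms. `AdamHParams`, `sqrt_rule` and `keep_lr_scale_beta2` are hypothetical names, and the numbers are illustrative, not the arm settings from the README.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AdamHParams:
    """Hypothetical container for the Adam knobs the rules touch."""
    lr: float
    beta1: float
    beta2: float
    eps: float


def sqrt_rule(hp: AdamHParams, kappa: float) -> AdamHParams:
    """Square-root (SDE) rule for a batch size B' = kappa * B (assumed form)."""
    beta1 = 1.0 - kappa * (1.0 - hp.beta1)
    beta2 = 1.0 - kappa * (1.0 - hp.beta2)
    # For large kappa the scaled betas can leave [0, 1); refuse rather than clamp.
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        raise ValueError(f"kappa={kappa} pushes beta1/beta2 outside [0, 1)")
    root = math.sqrt(kappa)
    return AdamHParams(lr=hp.lr * root, beta1=beta1, beta2=beta2, eps=hp.eps / root)


def keep_lr_scale_beta2(hp: AdamHParams, kappa: float) -> AdamHParams:
    """'Keep lr, scale beta2' rule, assumed here to be beta2' = beta2 ** kappa."""
    return replace(hp, beta2=hp.beta2 ** kappa)


# Illustrative large-batch settings, then a 16x smaller batch (kappa = 1/16).
large = AdamHParams(lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8)
small_sde = sqrt_rule(large, kappa=1 / 16)
small_beta2_only = keep_lr_scale_beta2(large, kappa=1 / 16)
print(small_sde)
print(small_beta2_only)
```

For β2 close to 1 the two β2 prescriptions agree to first order in 1 − β2, so under these assumptions the rules differ mainly in whether lr, β1 and ε are also rescaled.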
Contents
- prepare.yaml / warmup.yaml / arm_base.yaml: tokenization, throwaway from-scratch warmup, and the shared arm base (per-arm overrides in the README).
- README.md: reproduction steps + arm matrix (the two √-rule pairs A↔H and B↔J).
- ANALYSIS.md: the theory (why the SGD linear rule fails for Adam, the √/SDE rule and its equivalence guarantee, equivalence vs. optimality), predictions, and a preliminary Results section.
Headline result (preliminary)
In the noise-dominated regime (deep in training, the regime the √-rule is derived for), the √-scaled small-batch arms overlay the large-batch trajectory: the pairs A↔H and B↔J match to ~0.0002–0.0006 nats, ~10× below the spread between operating points. Early on (signal-dominated) the rule's knobs wash out and it isn't even testable, which reframes batch-size effects there as an update-count/drift phenomenon, not the noise-averaging the rule addresses. Small secondary signals: β1-scaling helps slightly (favoring the full SDE rule over β2-only), and fp16 edges bf16. Full writeup and caveats in ANALYSIS.md.
Caveats
Runs ongoing / not converged; comparisons use training loss because validation-loss logging is currently broken (#538); single model + dataset. (W&B loss curves can be attached.)
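A pairwise training-loss comparison of the kind reported above could be computed along the following lines. This is a sketch, not the procedure used for the numbers in this PR: it assumes the two arms are aligned on tokens consumed (a small-batch arm takes more optimizer steps per token, so step index is not comparable) and that the gap is averaged over the last part of the shared token range. `interp`, `tail_gap` and `tail_frac` are hypothetical names, and the curves in the usage lines are synthetic.

```python
from bisect import bisect_left


def interp(x, xs, ys):
    """Linearly interpolate the curve (xs, ys) at x; xs must be strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # first index with xs[i] >= x, so 1 <= i <= len(xs) - 1
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def tail_gap(tokens_a, loss_a, tokens_b, loss_b, tail_frac=0.25):
    """Mean absolute loss gap (nats) over the last tail_frac of the shared token range.

    Arm B is interpolated onto arm A's logging points, so the two arms may log
    at different token intervals.
    """
    lo = max(tokens_a[0], tokens_b[0])
    hi = min(tokens_a[-1], tokens_b[-1])
    start = hi - tail_frac * (hi - lo)
    points = [(t, l) for t, l in zip(tokens_a, loss_a) if start <= t <= hi]
    if not points:
        raise ValueError("no logging points of arm A fall inside the tail window")
    return sum(abs(l - interp(t, tokens_b, loss_b)) for t, l in points) / len(points)


# Synthetic curves only: a coarse "large-batch" log and a dense "small-batch" log.
tokens_large = [0, 4, 8, 12, 16]
loss_large = [3.0 - 0.05 * t for t in tokens_large]
tokens_small = list(range(17))
loss_small = [3.0 - 0.05 * t + 0.001 for t in tokens_small]
print(tail_gap(tokens_large, loss_large, tokens_small, loss_small, tail_frac=0.5))
```

Comparing the same statistic across unpaired arms would give the "spread between operating points" against which a pair's gap can be judged.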
🤖 Generated with Claude Code