NVIDIA / nvalchemi-toolkit Public

Fork 24
Star 102

Adding training functionalities to Toolkit#108

Open

laserkelvin wants to merge 378 commits into NVIDIA:main from laserkelvin:training-epic

laserkelvin commented Jun 9, 2026 (edited)

Collaborator

ALCHEMI Toolkit Pull Request

Description

This PR introduces the core functionalities required to support training and fine-tuning of models in nvalchemi-toolkit.

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Performance improvement
Documentation update
Refactoring (no functional changes)
CI/CD or infrastructure change

Related Issues

Changes Made

Adds create_model_spec methods and dynamic pydantic model creation for pickle-less serialization of configuration.
Adds a few base loss functions and the general loss abstraction, covering individual losses and a composed loss function. The composed loss supports weight scheduling, so the relative weighting of different losses can be adjusted over the course of training.
Adds a TrainingStrategy pydantic model that validates the training recipe and executes the loop. The execution is highly modular and extensible, allowing (hopefully) arbitrarily complex training workflows to be built, not limited to MLIPs.
Adds a FineTuningStrategy that specializes TrainingStrategy for...fine-tuning workflows by making pre-existing checkpoints and layer addition/modification integral to the workflow.
Adds data loading optimizations; the main change is the addition of "batched" pre-fetching, which amortizes I/O for non-contiguous data samples. This is crucial for Zarr performance when shuffling data.
Adds multidataset support, with a "meta" sampler that lets users implement different cross-dataset sampling strategies (e.g. to account for dataset size imbalances).
Adds several training-related hooks, such as model averaging, mixed precision, and checkpointing.
Adds a CLI for training and fine-tuning. The intended use of this CLI is to provide a relatively straightforward on-ramp for users who want to fine-tune (or train a model from scratch) quickly without needing to know the full training API.
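
To make the weight-scheduling idea in the loss item above concrete, here is a minimal, dependency-free sketch. It is not the toolkit's implementation: `ComposedLoss`, `linear_ramp`, `mse`, and `mae` are hypothetical names chosen for illustration, and plain Python lists stand in for tensors. Each term pairs a loss function with a schedule that maps the training step to a weight, and the composed loss is the weighted sum.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical names throughout; none of these are the toolkit's API.
LossFn = Callable[[Sequence[float], Sequence[float]], float]
Schedule = Callable[[int], float]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over paired values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error over paired values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def constant(value: float) -> Schedule:
    """Schedule that returns the same weight at every step."""
    return lambda step: value


def linear_ramp(start: float, end: float, n_steps: int) -> Schedule:
    """Schedule that moves linearly from start to end, then holds at end."""
    def schedule(step: int) -> float:
        frac = min(max(step / n_steps, 0.0), 1.0)
        return start + frac * (end - start)
    return schedule


@dataclass
class ComposedLoss:
    """Weighted sum of named loss terms; weights depend on the step."""
    terms: Dict[str, Tuple[LossFn, Schedule]]

    def __call__(self, preds, targets, step: int) -> float:
        total = 0.0
        for key, (loss_fn, schedule) in self.terms.items():
            total += schedule(step) * loss_fn(preds[key], targets[key])
        return total


loss = ComposedLoss({
    "energy": (mse, constant(1.0)),
    "forces": (mae, linear_ramp(0.0, 1.0, n_steps=10)),
})
preds = {"energy": [1.0, 2.0], "forces": [0.0, 0.0]}
targets = {"energy": [1.0, 4.0], "forces": [1.0, 3.0]}
print(loss(preds, targets, step=0), loss(preds, targets, step=10))
```

In this sketch the force term contributes nothing at step 0 and reaches full weight at step 10, which is one way the relative weighting could shift during training.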

Testing

Unit tests pass locally (make pytest)
Linting passes (make lint)
New tests added for new functionality meet coverage expectations?

Checklist

I have read and understand the Contributing Guidelines
I have updated the CHANGELOG.md
I have performed a self-review of my code
I have added docstrings to new functions/classes
I have updated the documentation (if applicable)

Additional Notes

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.


laserkelvin added 30 commits

May 26, 2026 08:16


          docs(training): document mixed precision hooks

627b16a

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          fix(training): narrow AMP autocast scope

ee9c3e0

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          refactor(training): dispatch mixed precision hook stages

75ea950

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          test(training): align mixed precision tests with train batch helper

c63e3c1

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          fix(training): prevent duplicate mixed precision hooks

8b75aad

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          test(training): align update hook API expectations

2396c0c

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          Merge pull request #9 from laserkelvin/feat-training-update-orchestrator

14580ac

Add TrainingUpdateHook framework and orchestrator


          Merge remote-tracking branch 'fork/training-epic' into feat-mixed-precision-hook

b9aa80b


          test: consolidating and using existing device fixture

ad7ba4c

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          Merge pull request #7 from laserkelvin/feat-mixed-precision-hook

92eaa19

Add `MixedPrecisionHook`


          Merge branch 'feat-training-update-orchestrator' into feat-ema-hook

c85c44f

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

# Conflicts:
#	nvalchemi/training/hooks/update.py
#	test/training/test_strategy.py


          Merge remote-tracking branch 'fork/training-epic' into feat-ema-hook

03b307b

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

# Conflicts:
#	docs/modules/training/hooks.rst
#	test/training/test_strategy.py


          Merge remote-tracking branch 'fork/training-epic' into feat-ema-hook

cdbe62e

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

# Conflicts:
#	nvalchemi/training/hooks/update.py
#	test/training/test_strategy.py


          feat(training): add strategy checkpoint restart loading

6b810c1

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          fix(training): restore checkpoint restart consistency

e441123

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          docs(training): note checkpoint restart workflow

690b5d3

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          Merge pull request #8 from laserkelvin/feat-ema-hook

8acb8b2

Add EMAHook for exponential moving average of model weights


          fix(data): generate edge rows in io benchmark

9720c5b


          refactor(data): profile io benchmark readback

af3095d


          refactor(data): batch zarr dataloader reads

a424720


          feat(data): compare zarr readback modes

01bc4f3


          docs(data): document zarr readback modes

e1a23e8


          docs(data): refresh zarr benchmark examples

849adf0


          Merge remote-tracking branch 'origin/main' into training-epic

eb51e24

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          Merge remote-tracking branch 'fork/training-epic' into feat-checkpoint-loading

e774991

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          feat(training): add periodic checkpoint hook

35f76ee

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          fix(training): respect checkpoint hook lifecycle

def6893

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          fix(training): make checkpoint hook cadence explicit

2b64eac

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          refactor: simplifying mutual exclusion

03b5b8e

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          test(training): cover checkpoint hook restart cycles

5263c85

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

laserkelvin added 6 commits

June 22, 2026 20:49


          feat(training): add validation batch preparation callback

22ecded

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          Revert "feat(training): add validation batch preparation callback"

e542169

This reverts commit 22ecded.

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          docs: clarifying validation

8c01fea

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          feat(training): add CLI runtime hook stage overrides

f5dc0ef

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          docs(training): align fine-tuning CLI guidance

0c27326

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          docs: noting equivariant models need specialized readout

8073ecf

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

laserkelvin commented Jun 23, 2026


Collaborator Author

/ok to test 8073ecf

copy-pr-bot[bot] reacted with thumbs up emoji



          Merge branch 'main' into training-epic

bda67ad

laserkelvin commented Jun 23, 2026


Collaborator Author

/ok to test bda67ad

copy-pr-bot[bot] reacted with thumbs up emoji


laserkelvin added 10 commits

June 23, 2026 11:41


          fix(training): wire CLI validation data

04b0ebe

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          fix(training): require checkpoint cadence

09ae1a0

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          perf(data): streamline multidataset samplers

5caa269

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          refactor(distributed): share rank and collective helpers

1aabcac

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          feat(training): add loss dtype alignment policy

485b81f

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          docs(training): document loss dtype alignment

691d06a

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          feat(training): expose loss dtype policy in CLI

05ae3e0

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          docs(training): document CLI dtype alignment

44dc143

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          fix(training): restore filtered optimizer checkpoints

6c18cd6

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          feat(training): add MACE source E0 options

4d4e024

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

laserkelvin commented Jun 23, 2026


Collaborator Author

/ok to test 8aa39b4

copy-pr-bot[bot] reacted with thumbs up emoji


laserkelvin force-pushed the training-epic branch from 8aa39b4 to 4d4e024

June 23, 2026 22:44

laserkelvin added 2 commits

June 23, 2026 16:49


          fix(training): preserve optimizer state on checkpoint resume

2b35272

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>


          feat(training): add CLI resume and validation options

4d49093

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

laserkelvin commented Jun 24, 2026


Collaborator Author

/ok to test 4d49093

copy-pr-bot[bot] reacted with thumbs up emoji



          docs: adding agent skill for reporting abstraction

9d99fc3

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

laserkelvin commented Jun 24, 2026


Collaborator Author

/ok to test 9d99fc3

copy-pr-bot[bot] reacted with thumbs up emoji


laserkelvin added 2 commits

June 23, 2026 21:49


          fix(training): seek dataloader on resume

9d38d2a


          fix(hooks): restore reporting collective device helper

b66aeaf

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>

laserkelvin commented Jun 24, 2026


Collaborator Author

/ok to test b66aeaf

copy-pr-bot[bot] reacted with thumbs up emoji


dallasfoster reviewed


docs/modules/dynamics/hooks.rst

Comment on lines +260 to +269

    
              Use :class:`~nvalchemi.dynamics.hooks.StageTimingHook` for lightweight stage

              timing and optional NVTX ranges.

              .. code-block:: python

                 from nvalchemi.dynamics.hooks import ProfilerHook

                 from nvalchemi.dynamics.hooks import StageTimingHook

                 hook = ProfilerHook(enable_nvtx=True, enable_timer=True, frequency=10)

                 hook = StageTimingHook("step", frequency=10, log_path="stage_timing.csv")

                 dynamics = DemoDynamics(model=model, n_steps=1_000, dt=0.5, hooks=[hook])

                 dynamics.run(batch)

dallasfoster Jun 24, 2026


Collaborator


This example doesn't really tell me what the heck this hook is doing, what "step" refers to, what "frequency" means. We don't need full api doc here but I would expect just another sentence with sufficient exposition explaining what is going on here.
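
As a point of reference for the question above, the sketch below shows one plausible reading of a stage-scoped, frequency-gated timing hook. It is an assumption, not the behaviour of `StageTimingHook`: the `StageTimer` class, its `time_stage` method, and the interpretation of `frequency` as "record every N-th step" are all hypothetical. Here "stage" is the name of the point in the loop being timed, and timings for other stages or off-cadence steps are skipped.

```python
import time
from typing import Callable, List, Tuple


class StageTimer:
    """Hypothetical stand-in; not the toolkit's StageTimingHook."""

    def __init__(
        self,
        stage: str,
        frequency: int = 1,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        if frequency < 1:
            raise ValueError("frequency must be a positive integer")
        self.stage = stage          # name of the loop stage to time
        self.frequency = frequency  # assumed meaning: record every N-th step
        self.clock = clock
        self.records: List[Tuple[int, float]] = []  # (step, elapsed seconds)

    def time_stage(self, stage: str, step: int, fn: Callable[[], object]):
        """Run fn, recording elapsed time only for the watched stage and cadence."""
        if stage != self.stage or step % self.frequency != 0:
            return fn()
        start = self.clock()
        result = fn()
        self.records.append((step, self.clock() - start))
        return result


timer = StageTimer("step", frequency=10)
for step in range(25):
    timer.time_stage("step", step, lambda: sum(range(1000)))
    timer.time_stage("other", step, lambda: None)  # ignored: different stage
print([recorded_step for recorded_step, _ in timer.records])
```

Under these assumptions only steps 0, 10, and 20 of the "step" stage are recorded, which keeps the timing overhead low on long runs.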


Reviewers

ys-teh left review comments

dallasfoster requested changes

zubatyuk

Awaiting requested review from zubatyuk

atulcthakur

Awaiting requested review from atulcthakur

Requested changes must be addressed to merge this pull request.

Assignees

No one assigned

Labels

New feature or request

Projects

None yet

Milestone

No milestone


4 participants
