
Adding training functionalities to Toolkit#108

Open
laserkelvin wants to merge 378 commits into
NVIDIA:mainfrom
laserkelvin:training-epic

@laserkelvin laserkelvin commented Jun 9, 2026

Collaborator

ALCHEMI Toolkit Pull Request

Description

This PR introduces the core functionalities required to support training and fine-tuning of models in nvalchemi-toolkit.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

Changes Made

  • Adds create_model_spec methods and dynamic pydantic model creation for pickle-less serialization of configuration
  • Adds a few base loss functions and the general loss abstraction, covering individual losses and a composed loss function. The composed loss supports weight scheduling, so the relative weighting of different losses can change over the course of training
  • Adds a TrainingStrategy pydantic model that validates the training recipe and executes the loop. Execution is highly modular and extensible, allowing (hopefully) arbitrarily complex training workflows to be built, not limited to MLIPs
  • Adds a FineTuningStrategy that specializes TrainingStrategy for...fine-tuning workflows by making pre-existing checkpoints and layer addition/modification integral to the workflow
  • Adds data loading optimizations; the main change is the addition of "batched" pre-fetching, which amortizes I/O for non-contiguous data samples. This is crucial for Zarr performance when shuffling data
  • Adds multi-dataset support, with a "meta" sampler that allows users to implement different cross-dataset sampling strategies (e.g. to account for dataset size imbalances)
  • Adds several training-related hooks, such as model averaging, mixed precision, and checkpointing
  • Adds a CLI for training and fine-tuning: the intended use of this CLI is to provide a relatively straightforward on-ramp for users looking to fine-tune (or train a model from scratch) quickly without needing to know the full training API
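
To make the weight-scheduling idea in the list above concrete, here is a minimal sketch of a composed loss whose per-term weights depend on the training step. Every name in it (`ComposedLoss`, `linear_ramp`, `mse`, `mae`) is hypothetical and chosen for illustration; it is not the API added by this PR, and it uses plain Python floats rather than tensors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

# Hypothetical names throughout; this is NOT the nvalchemi-toolkit API.
LossFn = Callable[[Sequence[float], Sequence[float]], float]
Schedule = Callable[[int], float]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def linear_ramp(start: float, end: float, n_steps: int) -> Schedule:
    """Weight that moves linearly from `start` to `end` over `n_steps`."""
    def schedule(step: int) -> float:
        frac = min(max(step / n_steps, 0.0), 1.0)  # clamp to [0, 1]
        return start + frac * (end - start)
    return schedule


@dataclass
class ComposedLoss:
    """Weighted sum of named loss terms, with step-dependent weights."""
    terms: Dict[str, LossFn]
    schedules: Dict[str, Schedule]

    def __call__(self, preds, targets, step: int) -> float:
        total = 0.0
        for name, fn in self.terms.items():
            weight = self.schedules[name](step)
            total += weight * fn(preds[name], targets[name])
        return total


loss = ComposedLoss(
    terms={"energy": mse, "forces": mae},
    schedules={
        "energy": linear_ramp(1.0, 0.1, 100),  # de-emphasized over time
        "forces": linear_ramp(0.0, 1.0, 100),  # phased in over time
    },
)
preds = {"energy": [1.0, 2.0], "forces": [0.5, 0.5]}
targets = {"energy": [0.0, 2.0], "forces": [0.0, 1.0]}
early = loss(preds, targets, step=0)
late = loss(preds, targets, step=100)
```

With these inputs both terms evaluate to 0.5, so the total is driven purely by the schedules: the energy term dominates at step 0 and the forces term dominates at step 100.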

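The "batched" pre-fetching mentioned in the list above can be pictured roughly as follows. This is only a sketch of one way to amortize reads for shuffled indices, assuming a store that is cheaper to read in ascending index order (as a chunked array store often is); `batched_prefetch` and `prefetch_size` are hypothetical names, not part of this PR.

```python
import random
from typing import Iterator, Sequence


def batched_prefetch(
    store: Sequence, indices: Sequence[int], prefetch_size: int
) -> Iterator:
    """Yield store[i] for each i in `indices`, reading `prefetch_size` at a time."""
    for start in range(0, len(indices), prefetch_size):
        window = list(indices[start:start + prefetch_size])
        # Read each distinct index once, in ascending order, so samples that
        # live close together on disk are fetched in the same pass.
        fetched = {i: store[i] for i in sorted(set(window))}
        # Hand the samples back in the original (shuffled) order.
        for i in window:
            yield fetched[i]


store = [f"sample-{i}" for i in range(10)]  # stand-in for an on-disk array
order = list(range(10))
random.Random(0).shuffle(order)  # fixed seed for reproducibility
samples = list(batched_prefetch(store, order, prefetch_size=4))
```

The consumer still sees samples in shuffled order; only the read pattern inside each window changes.
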
Testing

  • Unit tests pass locally (make pytest)
  • Linting passes (make lint)
  • New tests added for new functionality meet coverage expectations?

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
Add TrainingUpdateHook framework and orchestrator

# Conflicts:
#	nvalchemi/training/hooks/update.py
#	test/training/test_strategy.py

# Conflicts:
#	docs/modules/training/hooks.rst
#	test/training/test_strategy.py

# Conflicts:
#	nvalchemi/training/hooks/update.py
#	test/training/test_strategy.py
Add EMAHook for exponential moving average of model weights
…t-loading

This reverts commit 22ecded.

@laserkelvin

Collaborator Author

/ok to test 8073ecf

@laserkelvin

Collaborator Author

/ok to test bda67ad

@laserkelvin

Collaborator Author

/ok to test 8aa39b4

@laserkelvin

Collaborator Author

/ok to test 4d49093

@laserkelvin

Collaborator Author

/ok to test 9d99fc3

@laserkelvin

Collaborator Author

/ok to test b66aeaf

Comment on lines +260 to +269
  Use :class:`~nvalchemi.dynamics.hooks.StageTimingHook` for lightweight stage
  timing and optional NVTX ranges.

  .. code-block:: python

-     from nvalchemi.dynamics.hooks import ProfilerHook
+     from nvalchemi.dynamics.hooks import StageTimingHook

-     hook = ProfilerHook(enable_nvtx=True, enable_timer=True, frequency=10)
+     hook = StageTimingHook("step", frequency=10, log_path="stage_timing.csv")
      dynamics = DemoDynamics(model=model, n_steps=1_000, dt=0.5, hooks=[hook])
      dynamics.run(batch)

Collaborator


This example doesn't really tell me what the heck this hook is doing, what "step" refers to, or what "frequency" means. We don't need full API docs here, but I would expect at least one more sentence of exposition explaining what is going on.


Labels

enhancement New feature or request


4 participants