Model Training Example with MACE #109

Open: ys-teh wants to merge 14 commits into NVIDIA:main from ys-teh:feature/mace-training-ex

ys-teh (Collaborator) commented Jun 9, 2026

ALCHEMI Toolkit Pull Request

Description

This PR adds an advanced training example for a charged MACE model and the supporting code modifications needed to train it with available ALCHEMI tools.

Note: This can only be merged after #108 is merged.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

Changes Made

  • Adds a MACE training example, examples/advanced/10_mace_training.py (with its config, examples/advanced/10_vanilla_mace.yaml), that uses the nvalchemi model training pipeline.
  • Adds examples/advanced/_mace_training_helpers.py with additional training utilities: stress unit conversion, training loss logging, validation, parameter counting, and a gradient clipping hook.
  • Adds examples/advanced/_mace_models.py with builders for the vanilla MACE model, including cuEquivariance config support.
  • Adds a MACE training user guide, docs/userguide/mace_training_example.md.
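For readers unfamiliar with two of the utilities listed above, here is a minimal, standard-library-only sketch of stress unit conversion and global-norm gradient clipping. It is not the code in _mace_training_helpers.py: the function names, the eV/Å^3 to GPa direction, and the flat-list gradient representation are all assumptions made for illustration.

```python
import math

# 1 eV/Å^3 expressed in GPa; follows from e = 1.602176634e-19 C (exact).
EV_PER_A3_TO_GPA = 160.2176634


def stress_ev_per_a3_to_gpa(stress):
    """Convert a 3x3 stress tensor (nested lists) from eV/Å^3 to GPa.

    Hypothetical helper: the PR's converter may use other units or a
    different direction.
    """
    return [[component * EV_PER_A3_TO_GPA for component in row] for row in stress]


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is <= max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within bounds: return a copy, unchanged.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm
```

In a PyTorch training loop the clipping step would more likely go through `torch.nn.utils.clip_grad_norm_`, applied after the backward pass and before the optimizer step.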

Testing

  • Unit tests pass locally (make pytest)
  • Linting passes (make lint)
  • New tests added for new functionality meet coverage expectations

Run training

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

Below are NVT and NVE stability results for the trained MACE model on a 324-atom MgVF4 3x3x3 MatPES-r2SCAN test structure. These runs completed all 20,000 steps without numerical instability or force/temperature warnings.

[Images: NVT and NVE stability results]

Tip

This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.

copy-pr-bot Bot commented Jun 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ys-teh force-pushed the feature/mace-training-ex branch 2 times, most recently from 83363c2 to 4addb20, June 25, 2026 16:27
ys-teh marked this pull request as ready for review June 25, 2026 23:37
ys-teh added 8 commits June 26, 2026 01:34
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
ys-teh force-pushed the feature/mace-training-ex branch from 837a527 to bb9efcf, June 26, 2026 01:51
ys-teh requested a review from laserkelvin June 26, 2026 01:53
greptile-apps Bot (Contributor) commented Jun 26, 2026

Greptile Summary

This PR adds an end-to-end MACE model training example (10_mace_training.py) and its supporting helpers (_mace_models.py, _mace_training_helpers.py, 10_vanilla_mace.yaml) using the ALCHEMI training pipeline, targeting the MatPES r2SCAN dataset with a two-stage cosine-then-constant learning-rate schedule and step-scheduled Huber losses.

  • 10_mace_training.py: Hydra entrypoint that composes data loaders, model, loss, optimizer, and hook list; the finally block correctly delegates to close_zarr_loaders before calling DistributedManager.cleanup().
  • _mace_training_helpers.py: Provides TwoStageCosineConstantLR, GradientClipHook, TrainingMetricsLogger, and shared data utilities; the stage-boundary LR continuity constraint between eta_min and second_stage_lr is documented in the YAML but not enforced in code.
  • _mace_models.py: Contains build_vanilla_mace_model / build_training_mace_model; distance_transform="Agnesi" is hardcoded and not overridable from config.
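The stage-boundary constraint noted above can be made concrete with a small sketch. This is a plain-Python illustration of a cosine-then-constant schedule with an explicit continuity check, not the PR's TwoStageCosineConstantLR class; `eta_min` and `second_stage_lr` are named in the review, while `base_lr`, `stage_one_steps`, and `tol` are hypothetical parameters.

```python
import math


def two_stage_lr(step, base_lr, eta_min, stage_one_steps, second_stage_lr, tol=1e-12):
    """Cosine decay from base_lr to eta_min, then a constant second_stage_lr."""
    if stage_one_steps <= 0:
        raise ValueError("stage_one_steps must be positive")
    # Guard: the cosine stage ends at eta_min, so a different constant
    # value would make the learning rate jump at the stage boundary.
    if abs(eta_min - second_stage_lr) > tol:
        raise ValueError(
            f"eta_min ({eta_min}) must equal second_stage_lr ({second_stage_lr}) "
            "to keep the schedule continuous"
        )
    if step >= stage_one_steps:
        return second_stage_lr
    # Fraction of the cosine stage completed, in [0, 1).
    progress = step / stage_one_steps
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))
```

Raising on a mismatch is one possible design; a schedule could instead warn, or derive `second_stage_lr` from `eta_min` so the two cannot diverge.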

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/advanced/10_mace_training.py | New Hydra entrypoint for MACE training; well-structured with clear separation of concerns; finally-block resource handling correctly delegates to close_zarr_loaders. |
| examples/advanced/_mace_training_helpers.py | LR schedule, metrics logging, and data transform utilities; TwoStageCosineConstantLR has no guard against LR discontinuity at the stage boundary when eta_min differs from second_stage_lr. |
| examples/advanced/_mace_models.py | MACE model builders; distance_transform is hardcoded to Agnesi and not configurable from the Hydra config. |
| examples/advanced/10_vanilla_mace.yaml | Hydra config for the training example; eta_min and stage_two_lr are manually kept equal via a comment but not enforced in code. |
| CHANGELOG.md | Added one-line entry for the MACE training example under Added. |
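The summary mentions step-scheduled Huber losses without showing them. As a rough sketch only: the Huber loss below is the standard definition, while the piecewise-constant lookup is an assumption about what "step-scheduled" could mean (for example, a loss weight or delta that changes at given training steps). Both function names are hypothetical and do not come from the PR.

```python
def huber(residual, delta):
    """Huber loss for one residual: quadratic within delta, linear outside."""
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * residual * residual
    return delta * (abs_r - 0.5 * delta)


def scheduled_value(step, schedule):
    """Piecewise-constant lookup of a value by training step.

    schedule is a list of (start_step, value) pairs sorted by start_step,
    with the first pair starting at step 0.
    """
    current = schedule[0][1]
    for start_step, value in schedule:
        if step >= start_step:
            current = value
        else:
            break
    return current
```

A weighted loss term at a given step could then be formed as `scheduled_value(step, weights) * huber(prediction - target, delta)`, where `weights` is an illustrative schedule.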

Reviews (7): Last reviewed commit: "address potential reader-close failure"

ys-teh added 2 commits June 26, 2026 08:47
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Comment thread examples/advanced/10_mace_training.py Outdated
ys-teh added 3 commits June 26, 2026 21:30
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Comment thread examples/advanced/_mace_training_helpers.py
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
ys-teh requested a review from dallasfoster June 27, 2026 02:02